Applied Data Mining
for Business and Industry
Second Edition
PAOLO GIUDICI
Department of Economics, University of Pavia, Italy
SILVIA FIGINI
Faculty of Economics, University of Pavia, Italy
A John Wiley and Sons, Ltd., Publication
This edition first published 2009
© 2009 John Wiley & Sons Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for
permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the
Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK
Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and
product names used in this book are trade names, service marks, trademarks or registered trademarks of their
respective owners. The publisher is not associated with any product or vendor mentioned in this book. This
publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is
sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice
or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Giudici, Paolo.
Applied data mining for business and industry / Paolo Giudici, Silvia Figini. - 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-05886-2 (cloth) - ISBN 978-0-470-05887-9 (pbk.)
1. Data mining. 2. Business--Data processing. 3. Commercial statistics. I. Figini, Silvia. II. Title.
QA76.9.D343G75 2009
005.74068--dc22
2009008334
A catalogue record for this book is available from the British Library
ISBN: 978-0-470-05886-2 (Hbk)
ISBN: 978-0-470-05887-9 (Pbk)
Typeset in 10/12 Times-Roman by Laserwords Private Limited, Chennai, India
Printed and bound in Great Britain by TJ International, Padstow, Cornwall, UK
Contents
- 1 Introduction
- Part I Methodology
- 2 Organisation of the data
- 2.1 Statistical units and statistical variables
- 2.2 Data matrices and their transformations
- 2.3 Complex data structures
- 2.4 Summary
- 3 Summary statistics
- 3.1 Univariate exploratory analysis
- 3.1.1 Measures of location
- 3.1.2 Measures of variability
- 3.1.3 Measures of heterogeneity
- 3.1.4 Measures of concentration
- 3.1.5 Measures of asymmetry
- 3.1.6 Measures of kurtosis
- 3.2 Bivariate exploratory analysis of quantitative data
- 3.3 Multivariate exploratory analysis of quantitative data
- 3.4 Multivariate exploratory analysis of qualitative data
- 3.4.1 Independence and association
- 3.4.2 Distance measures
- 3.4.3 Dependency measures
- 3.4.4 Model-based measures
- 3.5 Reduction of dimensionality
- 3.5.1 Interpretation of the principal components
- 3.6 Further reading
- 4 Model specification
- 4.1 Measures of distance
- 4.1.1 Euclidean distance
- 4.1.2 Similarity measures
- 4.1.3 Multidimensional scaling
- 4.2 Cluster analysis
- 4.2.1 Hierarchical methods
- 4.2.2 Evaluation of hierarchical methods
- 4.2.3 Non-hierarchical methods
- 4.3 Linear regression
- 4.3.1 Bivariate linear regression
- 4.3.2 Properties of the residuals
- 4.3.3 Goodness of fit
- 4.3.4 Multiple linear regression
- 4.4 Logistic regression
- 4.4.1 Interpretation of logistic regression
- 4.4.2 Discriminant analysis
- 4.5 Tree models
- 4.5.1 Division criteria
- 4.5.2 Pruning
- 4.6 Neural networks
- 4.6.1 Architecture of a neural network
- 4.6.2 The multilayer perceptron
- 4.6.3 Kohonen networks
- 4.7 Nearest-neighbour models
- 4.8 Local models
- 4.8.1 Association rules
- 4.8.2 Retrieval by content
- 4.9 Uncertainty measures and inference
- 4.9.1 Probability
- 4.9.2 Statistical models
- 4.9.3 Statistical inference
- 4.10 Non-parametric modelling
- 4.11 The normal linear model
- 4.11.1 Main inferential results
- 4.12 Generalised linear models
- 4.12.1 The exponential family
- 4.12.2 Definition of generalised linear models
- 4.12.3 The logistic regression model
- 4.13 Log-linear models
- 4.13.1 Construction of a log-linear model
- 4.13.2 Interpretation of a log-linear model
- 4.13.3 Graphical log-linear models
- 4.13.4 Log-linear model comparison
- 4.14 Graphical models
- 4.14.1 Symmetric graphical models
- 4.14.2 Recursive graphical models
- 4.14.3 Graphical models and neural networks
- 4.15 Survival analysis models
- 4.16 Further reading
- 5 Model evaluation
- 5.1 Criteria based on statistical tests
- 5.1.1 Distance between statistical models
- 5.1.2 Discrepancy of a statistical model
- 5.1.3 Kullback–Leibler discrepancy
- 5.2 Criteria based on scoring functions
- 5.3 Bayesian criteria
- 5.4 Computational criteria
- 5.5 Criteria based on loss functions
- 5.6 Further reading
- Part II Business case studies
- 6 Describing website visitors
- 6.1 Objectives of the analysis
- 6.2 Description of the data
- 6.3 Exploratory analysis
- 6.4 Model building
- 6.4.1 Cluster analysis
- 6.4.2 Kohonen networks
- 6.5 Model comparison
- 6.6 Summary report
- 7 Market basket analysis
- 7.1 Objectives of the analysis
- 7.2 Description of the data
- 7.3 Exploratory data analysis
- 7.4 Model building
- 7.4.1 Log-linear models
- 7.4.2 Association rules
- 7.5 Model comparison
- 7.6 Summary report
- 8 Describing customer satisfaction
- 8.1 Objectives of the analysis
- 8.2 Description of the data
- 8.3 Exploratory data analysis
- 8.4 Model building
- 8.5 Summary
- 9 Predicting credit risk of small businesses
- 9.1 Objectives of the analysis
- 9.2 Description of the data
- 9.3 Exploratory data analysis
- 9.4 Model building
- 9.5 Model comparison
- 9.6 Summary report
- 10 Predicting e-learning student performance
- 10.1 Objectives of the analysis
- 10.2 Description of the data
- 10.3 Exploratory data analysis
- 10.4 Model specification
- 10.5 Model comparison
- 10.6 Summary report
- 11 Predicting customer lifetime value
- 11.1 Objectives of the analysis
- 11.2 Description of the data
- 11.3 Exploratory data analysis
- 11.4 Model specification
- 11.5 Model comparison
- 11.6 Summary report
- 12 Operational risk management
- 12.1 Context and objectives of the analysis
- 12.2 Exploratory data analysis
- 12.3 Model building
- 12.4 Model comparison
- 12.5 Summary conclusions
- References
- Index