About the Book
Learn how to turn mountains of raw data into
useful information with this guide.
The amount of information stored in corporate
databases is exploding exponentially. Data mining—finding meaningful patterns
in all that data—can give any organization a competitive advantage. This
book is the in-depth reference from Microsoft® for anyone who wants
to take full advantage of the powerful data-mining features in SQL Server™
2000. It examines the SQL Server 2000 Analysis Services architecture and
shows how data mining fits into its complete suite of information-extraction
technologies. Then it demonstrates how to structure and mine large databases
with the algorithms included with SQL Server 2000 to find nuggets of useful
information. It even shows how to create a practice data-mining model using
data downloaded from a database. Coverage includes:
• INTRODUCTION TO DATA MINING: What data
mining is and isn’t, plus important principles and definitions behind data-mining
methodologies, including the role of data-mining models, statistics, and
algorithms
• SQL SERVER 2000 ARCHITECTURE: How data
mining fits into the SQL Server 2000 Analysis Services architecture and
how it builds on the SQL Server 2000 relational database and its embedded
online analytical processing (OLAP) engine
• DATA-MINING METHODS: How to choose the
best data-mining method for the job—decision trees or clustering
• EASE OF USE FEATURES: How to use the
Mining Model Wizard and the OLAP Mining Model Editor to simplify creating,
training, and processing a model
• PROGRAMMING THE DATA-MINING SERVICES:
How to use data-mining models and Data Transformation Services, PivotTable®
Services, decision-support objects (DSO), PERL, Visual Basic®, Scripting
Edition, XML, and other tools and languages to work with the data-mining
engine
Related Books
Microsoft® SQL Server™ 2000 Resource Kit
Microsoft® SQL Server™ 2000 Administrator's
Pocket Consultant
Microsoft® SQL Server™ 2000 Administrator's
Companion
Table of Contents
Acknowledgments
xi xi
Introduction xiii
xiii
PART I INTRODUCING DATA MINING
1 Understanding Data Mining
3
What
Is Data Mining? 3
Why Use
Data Mining? 4
How Data
Mining Is Currently Used 6
Defining
the Terms 7
Data
Mining Methodology 9
Analyzing the Problem 10
Extracting and Cleansing the Data 10
Validating the Data 10
Creating and Training the Model 10
Querying the Data Mining Model Data 10
Maintaining the Validity of the Data-Mining Model 10
Overview
of Microsoft Data Mining 11
Data Mining vs. OLAP 11
Data-Mining Models 11
Data-Mining Algorithms 12
Using SQL Server Syntax to Data Mine 14
Summary
14
2 Microsoft SQL Server Analysis Services
Architecture 15
Introduction
to OLAP 16
MOLAP 18
ROLAP 18
HOLAP 19
Server
Architecture 20
Data Mining Services Within Analysis Services 20
Client
Architecture 21
PivotTable Service 22
OLE DB 23
Decision Support Objects (DSO) 24
Multidimensional Expressions (MDX) 25
Prediction Joins 25
Summary
26
3 Data Storage Models
27
Why Data
Mining Needs a Data Warehouse 27
Maintaining Data Integrity 28
Reporting
Against OLTP Data Can Be Hazardous to Your Performance 31
Data
Warehousing Architecture for Data Mining 33
Creating the Warehouse from OLTP Data 33
Optimizing Data for Mining 36
Physical Data Mining Structure 42
Three-Tier Architecture 43
Relational
Data Warehouse 43
Advantages of Relational Data Storage 44
Building Supporting Tables for Data Mining 45
OLAP
cubes 46
How Data Mining Uses OLAP Structures 46
Advantages of OLAP Storage 47
When OLAP Is Not Appropriate for Data Mining 49
Summary
49
4 Approaches to Data Mining
51
Directed
Data Mining 51
Undirected
Data Mining 52
Data Mining vs. Statistics 52
Learning from Historical Data 57
Predicting the Future 59
Training
Data-Mining Models 61
Evaluating the Models and Avoiding Errors 62
Summary
65
PART II DATA-MINING METHODS
5 Microsoft Decision Trees
69
Creating
the Model 69
Analysis Manager 70
Visualizing
the Model 87
Dependency Network Browser 94
Inside the Decision Tree Algorithm 97
How Predictions Are Derived 109
Navigating the Tree 109
Navigation vs. Rules 112
When to Use Decision Trees 113
Summary
114
6 Creating Decision Trees with OLAP
115
Creating
the Model 115
Select Source Type 116
Select Source Cube and Data-Mining Technique 116
Select Case 118
Select Predicted Entity 119
Select Training Data 121
Select Dimension and Virtual Cube 121
Completing the Data-Mining Model 123
OLAP
Mining Model Editor 125
Content Detail Pane 126
Structure Panel 126
Prediction Tree List 126
Analyzing
Data with the OLAP Data-Mining Model 126
Using the Generated Virtual Cube 128
Using the Generated Dimension 129
Summary
133
7 Microsoft Clustering
135
The Search
for Order 136
Looking
for Ways to Understand Data 136
Clustering
as an Undirected Data-Mining Technique 137
How Clustering
Works 138
Overview of the Algorithm 138
The K-Means Method Clustering Algorithm 138
What Is Being Measured Exactly? 142
Clustering Factors 142
Measuring "Closeness" 143
When
to Use Clustering 146
Visualize Relationships 146
Highlight Anomalies 146
Create Samples for Other Data-Mining Efforts 148
Weaknesses of Clustering 148
Creating
a Data-Mining Model Using Clustering 149
Select Source Type 150
Select the Table or Tables for Your Mining Model 150
Select the Data-Mining Technique 151
Edit Joins 152
Select the Case Key Column for Your Mining Model 152
Select the Input and Predictable Columns 152
Viewing
the Model 154
Organization of the Cluster Nodes 154
Order of the Cluster Nodes 156
Analyzing
the Data 156
Summary
158
PART III CREATING DATA–MINING APPLICATIONS
WITH CODE
8 Using Microsoft Data Transformation
Services (DTS) 161
What
Is DTS? 162
DTS Tasks
162
Transform 162
Bulk Insert 163
Data Driven Query 163
Execute Package 164
Connections
167
Sources 167
Configuring a Connection 168
DTS Package
Workflow 169
DTS Package Steps 169
Precedence Constraints 170
DTS Designer
171
Opening the DTS Designer 171
Saving a DTS Package 172
dtsrun
Utility 174
Using
DTS to Create a Data-Mining Model 177
Preparing the SQL Server Environment 178
Creating the Package 182
Summary
208
9 Using Decision Support Objects (DSO)
209
Scripting
vs. Visual Basic 210
The Server Object 211
The Database Object 219
Creating
the Relational Data-Mining Model Using DSO 221
Creating
the OLAP Data-Mining Model Using DSO 230
The DataSource Object 232
Data-Mining Model (Decision Support Objects) 233
Adding
a New Data Source 233
Analysis
Server Roles 234
Data-Mining Model Roles 235
Summary
236
10 Understanding Data-Mining Structures
237
The Structure
of the Data-Mining Model Case 237
Data-Mining Models Look Like Tables 237
Using
Code to Browse Data-Mining Models 238
Using
the Schema Rowsets 243
MINING_MODELS Schema Rowset 243
MINING_COLUMNS Schema Rowset 249
MINING_MODEL_CONTENT Schema Rowset 259
MINING_SERVICES Schema Rowset 262
SERVICE_PARAMETERS Schema Rowset 266
MODEL_CONTENT_PMML Schema Rowset 268
Summary
269
11 Data Mining Using PivotTable Service
271
Redistributing
Components 272
Installing
and Registering Components 273
File Locations 274
Installation Registry Settings 275
Redistribution Setup Programs 275
Connecting
to the PivotTable Service 276
Connect to Analysis Services Using PivotTable Service 276
Connect to Analysis Services Using HTTP 280
Building
a Local Data-Mining Model 280
Storage of Local Mining Models 284
SELECT INTO Statement 286
INSERT INTO Statement 286
OPENROWSET Syntax 287
Nested Tables and the SHAPE Statement 289
Using
XML in Data Mining 290
The PMML Standard 290
Summary
296
12 Data-Mining Queries
297
Components
of a Prediction Query 297
The Basic Prediction Query 298
Specifying the Test Case Source 298
Specifying Columns 300
The PREDICTION JOIN Clause 300
Using Functions as Columns 304
Using Tabular Values as Columns 304
The WHERE Clause 306
Prediction Functions 307
Predict 307
PredictProbability 308
PredictSupport 308
PredictVariance 309
PredictStdev 310
PredictProbabilityVariance 310
PredictProbabilityStdev 310
PredictHistogram 310
TopCount 313
TopSum 313
TopPercent 314
RangeMin 314
RangeMid 314
RangeMax 314
PredictScore 314
PredictNodeId 315
Prediction
Queries with Clustering Models 315
Cluster 315
ClusterProbability 316
ClusterDistance 316
Using
DTS to Run Prediction Queries 317
Summary
322
APPENDIX
325
GLOSSARY
349
INDEX
359 |