• Information – Data that are processed to be useful – Provides answers to "who", "what", "where", and "when" questions
• Knowledge – Application of data and information – Answers "how" questions
KDD
• KDD (Knowledge Discovery in Databases) – Describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data
• Selection
• Preprocessing
• Transformation
• Data Mining
• Interpretation/Evaluation
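The five KDD steps can be read as a chain of functions, each feeding the next. The sketch below is only an illustration under stated assumptions: the records, the field names (`segment`, `income`, `balance`), the balance-to-income feature, and the 0.05 threshold are all made up, and the "mining" step is a plain mean standing in for a real algorithm.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"segment": "retail", "income": 40000, "balance": 1500},
    {"segment": "retail", "income": 75000, "balance": None},
    {"segment": "business", "income": 50000, "balance": 3000},
    {"segment": "retail", "income": 50000, "balance": 3000},
]

def select(records, segment):
    """Selection: keep only the subset relevant to the question."""
    return [r for r in records if r["segment"] == segment]

def preprocess(records):
    """Preprocessing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: derive the feature the mining step will use."""
    return [r["balance"] / r["income"] for r in records]

def mine(ratios):
    """Data mining (stand-in): summarize the feature with its mean."""
    return mean(ratios)

def evaluate(pattern, threshold=0.05):
    """Interpretation/Evaluation: decide whether the result is worth reporting."""
    return pattern >= threshold

ratios = transform(preprocess(select(RAW, "retail")))
pattern = mine(ratios)
interesting = evaluate(pattern)
```

Each stage only sees the output of the previous one, so any single step can be swapped out without touching the others.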
Data mining
• Definition – The analysis step of the Knowledge Discovery in Databases process – Discovers previously unknown patterns
• Example – Home equity loan
Case : Home equity loan
• Select the subset of customer records that have received a home equity loan offer

Income     Number of children   Average checking account balance   Response
$40,000    2                    $1,500                             Yes
$75,000    0                    $5,000                             No
$50,000    1                    $3,000                             No
Case : Home equity loan
• Find rules to predict whether or not a customer would respond to a home equity loan offer
IF (Salary < 40k) and (numChildren > 0) and (ageChild1 > 18 and ageChild1 < 22) THEN YES
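The rule on this slide translates directly into a predicate. In the sketch below the function name `will_respond` and the handling of customers with no first child (`age_child1=None`) are assumptions; the thresholds are the ones from the rule.

```python
def will_respond(salary, num_children, age_child1=None):
    """Return True when the slide's IF ... THEN YES rule fires.

    age_child1 is None for customers without a first child; the rule
    cannot fire for them because it requires an age between 18 and 22.
    """
    if age_child1 is None:
        return False
    return salary < 40_000 and num_children > 0 and 18 < age_child1 < 22
```

For example, `will_respond(35_000, 1, 20)` is `True`, while a salary of exactly 40,000 does not satisfy the strict `<` in the rule.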
Case : Home equity loan
• Group customers into clusters and investigate the clusters
[Figure: customers grouped into four clusters, Group 1 to Group 4]
Case : Home equity loan
• Evaluate results – Many “uninteresting” clusters – One interesting cluster! Customers with both business and personal accounts; unusually high percentage of likely respondents
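One way to make "interesting cluster" concrete is to compare each cluster's response rate with the overall rate. The sketch below assumes cluster labels have already been assigned by some clustering step; the data, the function names, and the "at least twice the overall rate" cutoff are illustrative choices, not part of the case.

```python
from collections import defaultdict

def response_rate_by_cluster(assignments, responded):
    """Map each cluster label to the fraction of its customers who responded."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [respondents, total]
    for cluster, did_respond in zip(assignments, responded):
        counts[cluster][0] += int(did_respond)
        counts[cluster][1] += 1
    return {c: yes / total for c, (yes, total) in counts.items()}

def interesting_clusters(rates, overall, factor=2.0):
    """Clusters whose response rate is at least `factor` times the overall rate."""
    return sorted(c for c, rate in rates.items() if rate >= factor * overall)

# Made-up example: 12 customers in 4 clusters, 1 = responded.
assignments = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
responded = [0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

rates = response_rate_by_cluster(assignments, responded)
overall = sum(responded) / len(responded)
flagged = interesting_clusters(rates, overall)
```

With this toy data only one of the four clusters is flagged, mirroring the "many uninteresting clusters, one interesting cluster" outcome.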
Common classes of tasks
• Association rule learning – Searches for relationships between variables
• Clustering – Discovers groups and structures in the data that are in some way similar
• Anomaly detection – Identification of unusual data records
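Two of these tasks fit in a few lines each. The sketch below scores an association rule by its support and confidence, and flags anomalies with a simple z-score test. The basket contents, the balances, and the cutoff of 2 standard deviations are made-up illustration values; real systems typically use dedicated algorithms rather than these minimal versions.

```python
from statistics import mean, pstdev

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent); assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def zscore_outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing is unusual
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical market baskets and account balances.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
bread_to_milk = confidence(baskets, {"bread"}, {"milk"})
unusual = zscore_outliers([10, 11, 9, 10, 12, 10, 50])
```

Here the rule "bread implies milk" holds in two of the three baskets that contain bread, and the single balance far from the rest is the only value flagged.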
Common classes of tasks
• Classification – Generalizes known structure to apply to new data
• Regression – Finds a function that models the data with the least error
• Summarization – Provides a more compact representation of the data set
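For regression, "least error" is commonly taken to mean the smallest sum of squared errors. The sketch below fits a straight line with the closed-form least-squares solution; the choice of a linear model, the function names, and the sample points are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (needs at least two distinct xs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between the line and the data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))

# Made-up points that lie exactly on y = 2x + 1.
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
```

Any other line through these points has a larger `squared_error` than the fitted one, which is the sense in which the fit has the least error.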
Notable uses
• Business
  - Customer management
    • Marketing
    • Identify purchase patterns
  - In human resource departments
    • Identifying the characteristics of their most successful employees
  - In decision-making support
    • Integrated-circuit production line
Notable uses
• Science and engineering
  - Human genetics
    • Relation between genetics and diseases
  - Electrical power engineering
    • Detect abnormal conditions
    • Estimate the nature of the abnormalities
Notable uses
• Visual Data Mining – Large data sets have been generated, collected, and stored – Finds trends and information hidden in those data sets
Issues
• Reliable data set – Overfitting: the algorithm finds patterns in the training set which are not present in the general data set
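Overfitting shows up as a gap between accuracy on the training set and accuracy on data the model has not seen. The sketch below uses a deliberately extreme model that memorizes its training examples; the feature tuples (income in thousands, number of children), the labels, and the held-out records are all made up for illustration.

```python
def train_memorizer(examples):
    """Memorize every (features, label) pair: an extreme case of overfitting.

    Unseen inputs fall back to the most common training label.
    """
    table = dict(examples)
    labels = [label for _, label in examples]
    default = max(set(labels), key=labels.count)
    return lambda features: table.get(features, default)

def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)

# Hypothetical training and held-out records.
train = [((40, 2), "yes"), ((75, 0), "no"), ((50, 1), "no")]
held_out = [((42, 2), "yes"), ((70, 0), "no")]

model = train_memorizer(train)
train_acc = accuracy(model, train)
held_out_acc = accuracy(model, held_out)
```

The memorizer is perfect on the records it has seen and noticeably worse on the held-out ones, which is why results should be checked on data that was not used for training.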
Issues
• Privacy concerns and ethics – The term "data mining" itself has no ethical implications – Compiled data can allow anyone who has access to the newly compiled data set to identify specific individuals, especially when the data were originally anonymous