Data mining in Wikipedia 2011-09-26

Sinchoo Kim

Terms • Data – Unorganized and unprocessed facts

• Information – Data that are processed to be useful – Provides answers to "who", "what", "where", and "when" questions

• Knowledge – Application of data and information – Answers "how" questions

KDD • KDD (Knowledge Discovery in Databases) – Describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data

• Selection
• Preprocessing
• Transformation
• Data Mining
• Interpretation/Evaluation
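The first four KDD steps can be sketched as a chain of small functions. This is only an illustration: the records, field names, and the trivial "mining" step (an average) are assumptions made up for the sketch, not part of the original slides.

```python
# Hypothetical raw customer records; all field names are invented for this sketch.
raw = [
    {"name": "a", "salary": "35000", "children": "2"},
    {"name": "b", "salary": "", "children": "0"},
    {"name": "c", "salary": "82000", "children": "1"},
]


def select(records):
    # Selection: keep only the fields relevant to the question being asked.
    return [{"salary": r["salary"], "children": r["children"]} for r in records]


def preprocess(records):
    # Preprocessing: drop records that have missing values.
    return [r for r in records if r["salary"] and r["children"]]


def transform(records):
    # Transformation: convert the string fields into numbers.
    return [(int(r["salary"]), int(r["children"])) for r in records]


def mine(rows):
    # Data mining (deliberately trivial): average salary of customers with children.
    with_children = [salary for salary, children in rows if children > 0]
    return sum(with_children) / len(with_children)


result = mine(transform(preprocess(select(raw))))
print(result)  # 58500.0
```

Real pipelines replace each function with far more involved logic, but the order of the steps stays the same.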


Data mining • Definition – The analysis step of the Knowledge Discovery in Databases process – Discovering previously unknown patterns – Example • Home equity loan

Case : Home equity loan • Select the subset of customer records for customers who have received a home equity loan offer

[Figure: customer attributes shown include income, number of children, and average checking account balance]

Case : Home equity loan • Find rules to predict whether or not a customer would respond to a home equity loan offer – IF (Salary < 40k) and (numChildren > 0) and (ageChild1 > 18 and ageChild1 < 22) THEN YES
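The example rule can be written as a small predicate and applied to customer records. The function name, the parameter names, and the three sample customers are hypothetical; only the thresholds come from the slide.

```python
def likely_responder(salary, num_children, age_child1):
    """Return True when the slide's example rule predicts a response."""
    return salary < 40_000 and num_children > 0 and 18 < age_child1 < 22


# Hypothetical customers, invented for this sketch.
customers = [
    {"salary": 35_000, "num_children": 2, "age_child1": 19},
    {"salary": 35_000, "num_children": 0, "age_child1": 0},
    {"salary": 90_000, "num_children": 1, "age_child1": 20},
]

predictions = [likely_responder(**c) for c in customers]
print(predictions)  # [True, False, False]
```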

Case : Home equity loan • Group customers into clusters and investigate the clusters

[Figure: customers grouped into four clusters, Group 1 to Group 4]
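The slides do not say which clustering algorithm was used. As one possible illustration, the sketch below uses plain k-means on two hypothetical attributes (income and average balance, both in thousands). The data, the two starting centroids, and the choice of two clusters instead of four are all assumptions.

```python
def kmeans(points, centroids, iterations=10):
    """Minimal k-means for 2-D points with caller-supplied starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers: (income, average checking balance), in thousands.
points = [(30, 1), (32, 2), (35, 1), (80, 10), (85, 12), (90, 11)]
centroids, clusters = kmeans(points, centroids=[(30, 1), (90, 11)])
print(clusters[0])   # [(30, 1), (32, 2), (35, 1)]
print(centroids[1])  # (85.0, 11.0)
```

Once the groups exist, each one can be inspected by hand, which is the "investigate clusters" part of the slide.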

Case : Home equity loan • Evaluate results – Many “uninteresting” clusters – One interesting cluster! Customers with both business and personal accounts; unusually high percentage of likely respondents

Common classes of tasks • Association rule learning – Searches for relationships between variables

• Clustering – Discovers groups and structures in the data that are in some way similar

• Anomaly detection – Identification of unusual data records
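For the first task on this slide, association rule learning, a relationship between variables is commonly scored by its support and confidence. The sketch below computes both for a handful of made-up shopping baskets; the data and the helper names are assumptions for illustration.

```python
# Hypothetical shopping baskets, invented for this sketch.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction that
    # also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)


print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # about 0.667
```

A rule such as "bread implies milk" is usually kept only when both numbers exceed thresholds chosen by the analyst.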

Common classes of tasks • Classification – Generalizes known structure to apply to new data

• Regression – Finds a function which models the data with the least error

• Summarization – Provides a more compact representation of the data set
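As an illustration of the regression task, the sketch below fits a straight line by ordinary least squares, one common way to make "least error" precise (here, the smallest sum of squared errors). The function name and the sample points are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums from the closed-form solution.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```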

Notable uses • Business
- Customer management
  • Marketing
  • Identifying purchase patterns
- In human resources departments
  • Identifying the characteristics of the most successful employees
- In decision-making support
  • Integrated-circuit production lines

Notable uses • Science and engineering
- Human genetics
  • Relation between genetics and diseases
- Electrical power engineering
  • Detecting abnormal conditions
  • Estimating the nature of the abnormalities

Notable uses • Visual data mining – Large data sets have been generated, collected, and stored – Finds trends and information hidden in those data sets

Issues • Reliable data set – Overfitting • Finding patterns in the training set which are not present in the general data set
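Overfitting can be made visible by comparing accuracy on the training set with accuracy on data the model has not seen. In the sketch below, a model that merely memorizes the training records looks perfect on them and does poorly on a holdout set, while a simpler rule generalizes. All data and both models are hypothetical.

```python
# Hypothetical (salary in thousands, responded) pairs.
train = [(20, True), (25, True), (30, True), (38, False), (45, False), (60, False)]
holdout = [(22, True), (35, True), (39, True), (50, False), (70, False)]


def accuracy(model, data):
    # Fraction of records for which the model's prediction matches the label.
    return sum(model(x) == y for x, y in data) / len(data)


memory = dict(train)


def memorizer(salary):
    # Overfitted model: recalls training answers, guesses False for anything new.
    return memory.get(salary, False)


def simple_rule(salary):
    # Simpler model: a single threshold.
    return salary < 40


print(accuracy(memorizer, train), accuracy(memorizer, holdout))  # 1.0 0.4
print(accuracy(simple_rule, holdout))                            # 1.0
```

Checking patterns against held-out data in this way is the usual guard against reporting patterns that exist only in the training set.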

Issues • Privacy concerns and ethics – The term data mining itself has no ethical implications – Compiled data may allow anyone who has access to the newly compiled data set to identify specific individuals, especially when the data were originally anonymous
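The re-identification risk can be illustrated by linking an "anonymous" data set with a public list on attributes the two share. Everything in the sketch is hypothetical: the records, the field names, and the choice of ZIP code and birth year as linking attributes.

```python
# Hypothetical "anonymous" records: no names, but two shared attributes remain.
anonymous_records = [
    {"zip": "12345", "birth_year": 1970, "purchase": "garden tools"},
    {"zip": "12345", "birth_year": 1985, "purchase": "camping gear"},
]

# Hypothetical public list that carries names alongside the same attributes.
public_list = [
    {"name": "Alice Example", "zip": "12345", "birth_year": 1970},
    {"name": "Bob Example", "zip": "54321", "birth_year": 1985},
]


def link(anon, public, keys=("zip", "birth_year")):
    # Index the public list by the shared attributes.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    # An anonymous record is re-identified when exactly one name matches it.
    linked = []
    for record in anon:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:
            linked.append((names[0], record["purchase"]))
    return linked


reidentified = link(anonymous_records, public_list)
print(reidentified)  # [('Alice Example', 'garden tools')]
```

Neither data set names a purchase by itself; the identification only appears once the two are compiled together.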

