Introduction to Data Mining


Deng Cai (蔡登), College of Computer Science, Zhejiang University, dengcai@

Short Bio

Dr. Deng Cai (蔡登)
dengcai@
CS PhD, UIUC
Associate professor at the College of Computer Science (State Key Lab of CAD&CG).
Room 508, Computing Center Building, Zijingang Campus
/dengcai/

Research interests:
  Data mining
  Machine learning
  Information retrieval
  …

© Deng Cai, College of Computer Science, Zhejiang University

Course Information

Web: http://10.76.7.166/Courses/DM/
Time: Tuesday & Thursday
  8:00am-8:45am, 8:50am-9:35am
  8:00am-9:30am
Place: Room 104, Cao Guangbiao Building West, Yuquan Campus
TA: Lin Yue (林悦)
  zjudm2012@

Course Information

Prerequisite:
  Linear algebra, analysis, probability theory
  Basic programming skills

Textbook:
  Pattern Classification (2nd Edition), by Richard O. Duda, Peter E. Hart, and David G. Stork
  Other materials are available on the class web page

Objective:
  A basic understanding of some of the important data mining (machine learning) methods.
  A basic ability to use some data mining (machine learning) techniques to solve real-world problems.

Evaluation

Quizzes (15%)
Four assignments (10% each)
  Assignments 1 & 2: exercises, to be completed individually
  Assignments 3 & 4: programming, in teams of at most two
    Code and documents ("Problem", "Algorithm", "Code", "Test" and "Analysis")
Final exam (45%)
Extra credit projects (send me an email)
Programming language: Matlab
  Tutorials:
    http://www.math.ufl.edu/help/matlab-tutorial/
    http://www.math.mtu.edu/~msgocken/intro/node1.html

Course Policies

Cheating:
  No.
Homework:
  You have to write your own solution/program.
Late Policy:
  0~24 hours: 90%
  24~48 hours: 50%
  Over 48 hours: 25%
  Except assignment 4

Questions?

Why Take This Course?

It is NOT
  An easy course with high scores
  A recommendation letter for US school applications
You should
  Work hard
  Be honest

What is Data Mining?

Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  Data mining: a misnomer?
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
  Simple search and query processing

Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes
  Data collection and data availability
    Automated data collection tools, database systems, Web, computerized society
  Major sources of abundant data
    Business: Web, e-commerce, transactions, stocks, …
    Science: Remote sensing, bioinformatics, scientific simulation, …
    Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining, the automated analysis of massive data sets

Evolution of Sciences

Before 1600: empirical science
1600-1950s: theoretical science
  Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s: computational science
  Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
  Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
1990-now: data science
  The flood of data from new scientific instruments and simulations
  The ability to economically store and manage petabytes of data online
  The Internet and computing Grid that make all these archives universally accessible
  Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

Evolution of Database Technology

1960s:
  Data collection, database creation, IMS and network DBMS
1970s:
  Relational data model, relational DBMS implementation
1980s:
  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
  Data mining, data warehousing, multimedia databases, and Web databases
  ACM SIGKDD Conference on Knowledge Discovery and Data Mining (1995 - )
2000s:
  Data mining and its applications

Knowledge Discovery (KDD) Process

This is a view from typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.

[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

KDD Process: A Typical View from ML and Statistics

Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing

Data Pre-Processing:
  Data integration
  Normalization
  Feature selection
  Dimension reduction
Data Mining:
  Pattern discovery
  Association & correlation
  Classification
  Clustering
  Outlier analysis
  …
Post-Processing:
  Pattern evaluation
  Pattern selection
  Pattern interpretation
  Pattern visualization

Data Mining vs. Machine Learning

A lot of common topics
  Clustering
  Classification
  Many others
Different focuses
  ML focuses more on theory (statistics)
  DM focuses more on applications, efficiency

Focus of This Course

What are the typical data mining problems?
  Classification (decision making)
  Cluster analysis
  Component analysis (dimensionality reduction)
What are the basic data mining tools (methods, algorithms)?
Matlab programming

Classification (Decision Making) Example

Training Set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set]
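The train-then-predict flow of the classification example can be sketched in a few lines. This is only an illustration in Python (not the course's Matlab), and it rests on assumptions that are not in the slides: the rule structure (Refund = Yes or Married means "No", otherwise compare income to a threshold) is fixed by hand, and only the income threshold is learned from the training set. The names `TRAIN`, `TEST`, `predict` and `learn_threshold` are hypothetical.

```python
# Training records from the slide: (refund, marital_status, income_in_K, cheat)
TRAIN = [
    ("Yes", "Single",   125, "No"),
    ("No",  "Married",  100, "No"),
    ("No",  "Single",    70, "No"),
    ("Yes", "Married",  120, "No"),
    ("No",  "Divorced",  95, "Yes"),
    ("No",  "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No",  "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),
    ("No",  "Single",    90, "Yes"),
]

# Test records from the slide: the class label is unknown.
TEST = [
    ("No",  "Single",    75),
    ("Yes", "Married",   50),
    ("No",  "Married",  150),
    ("Yes", "Divorced",  90),
    ("No",  "Single",    40),
    ("No",  "Married",   80),
]


def predict(record, threshold):
    """Assumed rule structure (not from the slides): refund or married -> "No",
    otherwise "Yes" when income reaches the threshold."""
    refund, marital, income = record[:3]
    if refund == "Yes" or marital == "Married":
        return "No"
    return "Yes" if income >= threshold else "No"


def learn_threshold(train):
    """Pick the income threshold with the fewest training errors.

    Candidates are midpoints between consecutive distinct incomes;
    ties are broken in favour of the smallest candidate.
    """
    incomes = sorted({r[2] for r in train})
    candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]

    def errors(t):
        return sum(predict(r, t) != r[3] for r in train)

    return min(candidates, key=errors)


threshold = learn_threshold(TRAIN)                 # the learned "model"
predictions = [predict(r, threshold) for r in TEST]
print(threshold, predictions)
```

On this data several thresholds fit the training set equally well, so the prediction for a borderline test record depends on the tie-breaking rule. This is one reason a held-out test set is needed to judge a model.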

Cluster Analysis Example


Factor Analysis

Basics of Probability

An experiment is a well-defined process with observable outcomes.

The set or collection of all outcomes of an experiment is called the sample space, S.

An event E is any subset of outcomes from S.

When all outcomes in S are equally likely, the probability of an event E is P(E) = (number of outcomes in E) / (number of outcomes in S).
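The counting definition of P(E) can be checked on a small sample space. The sketch below is in Python rather than the course's Matlab, and the fair six-sided die is a hypothetical experiment chosen for illustration. The function name `classical_probability` is also an assumption. The formula only applies when every outcome in S is equally likely.

```python
from fractions import Fraction


def classical_probability(event, sample_space):
    """P(E) = |E| / |S|, assuming all outcomes in S are equally likely."""
    event, sample_space = set(event), set(sample_space)
    if not sample_space:
        raise ValueError("sample space must be non-empty")
    if not event <= sample_space:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))


# Hypothetical experiment: roll one fair six-sided die.
S = {1, 2, 3, 4, 5, 6}   # sample space
E = {2, 4, 6}            # event "the roll is even"

p_even = classical_probability(E, S)
print(p_even)  # 1/2
```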


Bayes' Theorem

Conditional probability: P(A|B) = P(A, B) / P(B), defined when P(B) > 0.

Test of Independence: A and B are said to be independent if and only if P(A, B) = P(A) P(B).

Bayes' Theorem: P(A|B) = P(B|A) P(A) / P(B).
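A small numeric check may make the three formulas above concrete. The sketch is in Python rather than the course's Matlab, and the probabilities are hypothetical values chosen only for illustration, as are the names `bayes_posterior` and `is_independent`. Exact fractions are used so the results can be compared without rounding.

```python
from fractions import Fraction


def bayes_posterior(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) P(A) / P(B); requires P(B) > 0."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


def is_independent(p_ab, p_a, p_b):
    """A and B are independent iff P(A, B) = P(A) P(B)."""
    return p_ab == p_a * p_b


# Hypothetical numbers (assumptions, not from the slides).
p_a = Fraction(3, 10)               # P(A)
p_b_given_a = Fraction(4, 5)        # P(B|A)
p_b_given_not_a = Fraction(1, 10)   # P(B|not A)

# Law of total probability gives P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Joint probability from the definition of conditional probability.
p_ab = p_b_given_a * p_a

p_a_given_b = bayes_posterior(p_b_given_a, p_a, p_b)
print(p_b, p_a_given_b, is_independent(p_ab, p_a, p_b))  # 31/100 24/31 False
```

Here observing B raises the probability of A from 3/10 to 24/31, which is consistent with the independence test failing.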
