Introduction to Data Mining
Deng Cai (蔡登)
College of Computer Science, Zhejiang University
dengcai@
Short Bio
Dr. Deng Cai (蔡登)
dengcai@
CS PhD, UIUC
Associate professor at the College of Computer Science (State Key Lab of CAD&CG)
Office: Room 508, Computing Center Building, Zijingang Campus
Research interests: data mining, machine learning, information retrieval, …
/dengcai/
© Deng Cai, College of Computer Science, Zhejiang University
Course Information
Web: http://10.76.7.166/Courses/DM/
Time: Tuesday & Thursday, 8:00am-8:45am, 8:50am-9:35am; 8:00am-9:30am
Place: Room 104, Cao Guangbiao Building (West), Yuquan Campus
TA: Lin Yue (林悦), zjudm2012@
Course Information
Prerequisite:
  Linear algebra, analysis, probability theory
  Basic programming skills
Textbook:
  Pattern Classification (2nd Edition), by Richard O. Duda, Peter E. Hart, and David G. Stork
  Other materials are available at the class web page
Objective:
  A basic understanding of some of the important data mining (machine learning) methods
  A basic ability to use data mining (machine learning) techniques to solve real-world problems
Evaluation
Quizzes (15%)
Four assignments (10% each)
  Assignments 1 & 2: exercises, to be completed individually
  Assignments 3 & 4: programming, in teams of at most two
  Code and documents ("Problem", "Algorithm", "Code", "Test", and "Analysis")
Final exam (45%)
Extra credit projects (send me an email)
Programming language: Matlab
Tutorials:
  http://www.math.ufl.edu/help/matlab-tutorial/
  http://www.math.mtu.edu/~msgocken/intro/node1.html
Course Policies
Cheating: No.
Homework: You have to write your own solution/program.
Late Policy:
  0~24 hours: 90%
  24~48 hours: 50%
  48 hours~: 25%
  Except Assignment 4
Questions?
Why Take This Course?
It is NOT:
  An easy course with high scores
  A recommendation letter for US school applications
You should:
  Work hard
  Be honest
What is Data Mining?
Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  Data mining: a misnomer?
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence, etc.
Watch out: Is everything "data mining"?
  Simple search and query processing
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
  Data collection and data availability
    Automated data collection tools, database systems, Web, computerized society
  Major sources of abundant data
    Business: Web, e-commerce, transactions, stocks, …
    Science: Remote sensing, bioinformatics, scientific simulation, …
    Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets
Evolution of Sciences
Before 1600: empirical science
1600-1950s: theoretical science
  Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s: computational science
  Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
  Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
1990-now: data science
  The flood of data from new scientific instruments and simulations
  The ability to economically store and manage petabytes of data online
  The Internet and computing Grid that make all these archives universally accessible
  Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes.
  Data mining is a major new challenge!
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
1960s:
  Data collection, database creation, IMS and network DBMS
1970s:
  Relational data model, relational DBMS implementation
1980s:
  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
  Data mining, data warehousing, multimedia databases, and Web databases
  ACM SIGKDD Conference on Knowledge Discovery and Data Mining (1995 - )
2000s:
  Data mining and its applications
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
KDD Process: A Typical View from ML and Statistics
Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing
Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
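As a small illustration of the normalization step in pre-processing, here is a minimal sketch of z-score normalization. The slides name only "normalization" and do not say which variant is meant, so z-scoring is an assumption; the function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def zscore_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature has no spread; map every entry to zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical feature column (e.g. income in thousands), for illustration only.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
normalized = zscore_normalize(incomes)
```

After this step, features measured on different scales contribute comparably to distance-based methods such as clustering.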
Data Mining vs. Machine Learning
A lot of common topics
  Clustering
  Classification
  Many others
Different focuses
  ML focuses more on theory (statistics)
  DM focuses more on applications, efficiency
Focus of This Course
What are the typical data mining problems?
  Classification (decision making)
  Cluster analysis
  Component analysis (dimensionality reduction)
What are the basic data mining tools (methods, algorithms)?
Matlab programming
Classification (Decision making) Example
Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test Set
Refund  Marital Status  Taxable Income  Loan
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
[Figure: Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set]
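The Training Set -> Learn Classifier -> Model flow can be sketched in a few lines. The slides do not say which learning algorithm is used, so the small Gini-based binary decision tree below is an assumption, and all function and variable names are hypothetical. Incomes are written in thousands.

```python
from collections import Counter

# Training set from the slide: (refund, marital_status, income_k, cheat).
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Test set from the slide: class label unknown.
TEST = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    counts = Counter(r[-1] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def candidate_tests(rows):
    """Yield (attribute index, value, is_numeric) binary tests."""
    for i in range(len(rows[0]) - 1):
        values = sorted({r[i] for r in rows})
        if isinstance(values[0], (int, float)):
            # Thresholds halfway between consecutive observed values.
            for lo, hi in zip(values, values[1:]):
                yield i, (lo + hi) / 2, True
        else:
            for v in values:
                yield i, v, False

def passes(row, test):
    i, value, numeric = test
    return row[i] < value if numeric else row[i] == value

def learn(rows):
    """Grow a binary tree until each leaf is pure or cannot be split."""
    majority = Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = None
    if gini(rows) > 0:
        for test in candidate_tests(rows):
            left = [r for r in rows if passes(r, test)]
            right = [r for r in rows if not passes(r, test)]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, test, left, right)
    if best is None:
        return majority  # leaf: a class label
    _, test, left, right = best
    return (test, learn(left), learn(right))

def predict(tree, row):
    """Follow tests from the root until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        test, left, right = tree
        tree = left if passes(row, test) else right
    return tree

model = learn(TRAIN)
predictions = [predict(model, row) for row in TEST]
```

Because the tree is grown until its leaves are pure, it reproduces every training label; how well it does on the test set is a separate question, which is the point of holding the test set out.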
Cluster Analysis Example
Factor Analysis
Basics of Probability
An experiment is a well-defined process with observable outcomes.
The set or collection of all outcomes of an experiment is called the sample space, S.
An event E is any subset of outcomes from S.
When all outcomes are equally likely, the probability of an event E is P(E) = (number of outcomes in E) / (number of outcomes in S).
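The counting definition of P(E) can be checked on a tiny experiment. This is a minimal sketch assuming a fair six-sided die, so all outcomes are equally likely; the die and the helper name `probability` are illustrative and not part of the slides.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(E) = |E| / |S|, valid when all outcomes are equally likely."""
    event = set(event)
    sample_space = set(sample_space)
    if not event <= sample_space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

S = {1, 2, 3, 4, 5, 6}                 # sample space: one roll of a fair die
even = {o for o in S if o % 2 == 0}    # event E: the roll is even
p_even = probability(even, S)          # 3 outcomes out of 6
```

Using `Fraction` keeps the result exact, so it can be compared with hand calculations without rounding.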
Bayes' Theorem
Conditional probability: P(A|B) = P(A, B) / P(B), defined when P(B) > 0.
Test of independence: A and B are said to be independent if and only if P(A, B) = P(A) P(B).
Bayes' Theorem: P(A|B) = P(B|A) P(A) / P(B).
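These three formulas can be verified numerically on a small sample space. The sketch below assumes one roll of a fair die; the events A (even), B (greater than 3), and C (at most 4) are hypothetical examples chosen for illustration.

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # outcomes of one fair die roll

def p(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & S), len(S))

def conditional(a, b):
    """P(A|B) = P(A, B) / P(B); requires P(B) > 0."""
    return p(a & b) / p(b)

def independent(a, b):
    """A and B are independent iff P(A, B) = P(A) P(B)."""
    return p(a & b) == p(a) * p(b)

A = frozenset({2, 4, 6})      # the roll is even
B = frozenset({4, 5, 6})      # the roll is greater than 3
C = frozenset({1, 2, 3, 4})   # the roll is at most 4

direct = conditional(A, B)                    # from the definition
via_bayes = conditional(B, A) * p(A) / p(B)   # from Bayes' theorem
```

Here `direct` and `via_bayes` both equal 2/3. A and B are not independent, since P(A, B) = 1/3 while P(A) P(B) = 1/4, whereas A and C are independent, since P(A, C) = 1/3 = P(A) P(C).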