COMP5318 Knowledge Discovery and Data Mining, 2011 Semester 1, Week 3 (Chapter 6): Basic Association Analysis
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6 of Introduction to Data Mining by Tan, Steinbach, Kumar.
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-basket transactions:

TID | Items
 1  | Bread, Milk
 2  | Bread, Diaper, Beer, Eggs
 3  | Milk, Diaper, Beer, Coke
 4  | Bread, Milk, Diaper, Beer
 5  | Bread, Milk, Diaper, Coke

Example of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
Itemset
– A collection of one or more items
– Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

TID | Items
 1  | Bread, Milk
 2  | Bread, Diaper, Beer, Eggs
 3  | Milk, Diaper, Beer, Coke
 4  | Bread, Milk, Diaper, Beer
 5  | Bread, Milk, Diaper, Coke

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold
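The support-count and support definitions above can be sketched in Python over the five example transactions (the function names are illustrative, not from the slides):

```python
# Support count and support for an itemset, over the five example transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(set(itemset) <= t for t in transactions)

def support(itemset):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}))  # 2
print(support({"Milk", "Bread", "Diaper"}))        # 0.4
```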
Definition: Association Rule
Association rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

TID | Items
 1  | Bread, Milk
 2  | Bread, Diaper, Beer, Eggs
 3  | Milk, Diaper, Beer, Coke
 4  | Bread, Milk, Diaper, Beer
 5  | Bread, Milk, Diaper, Coke

Rule evaluation metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
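The two metric computations above can be sketched in Python (transactions as in the table; helper names are illustrative):

```python
# Support and confidence for the rule {Milk, Diaper} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(X, Y):
    """s(X -> Y) = sigma(X union Y) / |T|"""
    return sigma(X | Y) / len(transactions)

def confidence(X, Y):
    """c(X -> Y) = sigma(X union Y) / sigma(X)"""
    return sigma(X | Y) / sigma(X)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X, Y))               # 0.4
print(round(confidence(X, Y), 2))  # 0.67
```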
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Mining Association Rules

Example of rules derived from the itemset {Milk, Diaper, Beer} (using the five transactions above):
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
– All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent itemset generation
   – Generate all itemsets whose support ≥ minsup
2. Rule generation
   – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive
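Step 2 (rule generation) can be sketched as follows: a minimal illustration that enumerates the binary partitions of one frequent itemset and keeps the high-confidence rules. The helper names and the minconf value are assumptions for illustration, not from the slides.

```python
from itertools import combinations

# Enumerate binary partitions X -> Y of a frequent itemset and keep
# those with confidence >= minconf.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count of an itemset."""
    return sum(itemset <= t for t in transactions)

def rules_from_itemset(itemset, minconf):
    """All binary partitions X -> Y of itemset with confidence >= minconf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for X in map(frozenset, combinations(sorted(itemset), r)):
            Y = itemset - X
            conf = sigma(itemset) / sigma(X)
            if conf >= minconf:
                rules.append((set(X), set(Y), conf))
    return rules

rules = rules_from_itemset({"Milk", "Diaper", "Beer"}, minconf=0.6)
# Keeps {Milk,Diaper}->{Beer}, {Milk,Beer}->{Diaper}, {Diaper,Beer}->{Milk}
# and {Beer}->{Milk,Diaper}; the two c=0.5 rules are dropped.
```

All six rules share support s = 0.4, so in practice confidence is the only quantity that has to be re-evaluated per partition.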
Frequent Itemset Generation

[Itemset lattice over d = 5 items {A, B, C, D, E}: the null set at the top, then all 1-itemsets, 2-itemsets, 3-itemsets, 4-itemsets, and ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database: match each of the N transactions (of average width w) against each of the M candidates in the list
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d!
Computational Complexity
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1

If d = 6, R = 602 rules
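A quick numeric check of the rule-count formula (a sketch: k ranges over antecedent sizes, j over consequent sizes drawn from the remaining items):

```python
from math import comb

def num_rules(d):
    """Count rules X -> Y over d items: choose k items for X, then a
    non-empty Y of size j from the remaining d - k items."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(num_rules(6))  # 602
assert num_rules(6) == 3**6 - 2**7 + 1
```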
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– The support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle

[Itemset lattice over {A, B, C, D, E}: once an itemset is found to be infrequent, all of its supersets are pruned from the lattice and never need to be counted.]
Illustrating Apriori Principle

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Minimum support count = 3

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Triplets (3-itemsets):
Itemset               | Count
{Bread, Milk, Diaper} | 2

If every subset were considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
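A quick check of the two candidate counts quoted above (a sketch using the standard library):

```python
from math import comb

# Without pruning: every 1-, 2-, and 3-itemset over the 6 items is a candidate.
no_pruning = sum(comb(6, k) for k in (1, 2, 3))
# With support-based pruning: 6 items + 6 pairs + 1 triplet from the tables.
with_pruning = 6 + 6 + 1
print(no_pruning, with_pruning)  # 41 13
```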
Apriori Algorithm
Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent
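The method above can be sketched end-to-end in Python. This is a minimal illustration over the five example transactions, not the book's optimized implementation (real implementations use hash trees or tries for candidate counting, and a more careful join step):

```python
from itertools import combinations

# Minimal Apriori sketch over the five example transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions)

def apriori(transactions, minsup):
    """Return {frequent itemset: support count} for all lengths k."""
    items = {i for t in transactions for i in t}
    # Frequent 1-itemsets
    frequent = {frozenset([i]) for i in items
                if support_count(frozenset([i]), transactions) >= minsup}
    all_frequent, k = {}, 1
    while frequent:
        all_frequent.update(
            {f: support_count(f, transactions) for f in frequent})
        # Generate length-(k+1) candidates by joining length-k frequent sets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # Apriori pruning: every length-k subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Count support by scanning the DB; eliminate infrequent candidates
        frequent = {c for c in candidates
                    if support_count(c, transactions) >= minsup}
        k += 1
    return all_frequent

result = apriori(transactions, minsup=3)
# With minimum support count 3, the frequent itemsets are the four frequent
# items and four frequent pairs; {Bread, Milk, Diaper} has count 2 and drops out.
```

Note how the pruning line implements the anti-monotone property: a (k+1)-candidate survives only if all of its k-subsets were frequent in the previous pass.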
Reducing Number of Comparisons
Candidate counting:
– Scan the database of N transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure with k buckets
– Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets
Introduction to Hash Functions
A hash function h is a mapping from a set X to a range of integers [0..k−1]; thus each element of the set is mapped into one of k buckets, and each bucket contains all the elements that h maps into it.
Example
A mod function is a simple example of a hash function. Suppose we use h(x) = x mod 7: 0 to 6 map to themselves, but 7 maps to 0 and 8 maps to 1. Thus the range of h is [0..6], and these seven values are the buckets of mod 7.
Example

Suppose X is the set of integers 1..1000 and h(x) = x mod 7. The seven buckets then contain:
bucket 0: 7, 14, 21, …
bucket 1: 1, 8, 15, 22, …
…
bucket 6: 6, 13, 20, 27, …
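The bucketing above can be sketched directly (a minimal illustration; the helper name h is from the example):

```python
# Bucket the integers 1..1000 with h(x) = x mod 7.
def h(x):
    return x % 7

buckets = {b: [] for b in range(7)}
for x in range(1, 1001):
    buckets[h(x)].append(x)

print(buckets[0][:3])  # [7, 14, 21]
print(buckets[1][:3])  # [1, 8, 15]
```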
Factors Affecting Complexity
Choice of minimum support threshold
– Lowering the support threshold results in more frequent itemsets
– This may increase the number of candidates and the max length of frequent itemsets

Dimensionality (number of items) of the data set
– More space is needed to store the support count of each item
– If the number of frequent items also increases, both computation and I/O costs may increase

Size of database
– Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
– Transaction width increases with denser data sets
– This may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
Compact Representation of Frequent Itemsets
Some itemsets are redundant because they have identical support to their supersets.

[TID × item matrix with 15 transactions over 30 items: transactions 1–5 contain exactly {A1, …, A10}, transactions 6–10 exactly {B1, …, B10}, and transactions 11–15 exactly {C1, …, C10}.]

  Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k) = 3 × (2^10 − 1) = 3069

⇒ Need a compact representation
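A quick check of the count above (a sketch: each of the three 10-item blocks contributes every non-empty subset as a frequent itemset, assuming minsup is at most 5/15):

```python
from math import comb

# Three disjoint 10-item blocks; each non-empty subset of a block is frequent.
n_frequent = 3 * sum(comb(10, k) for k in range(1, 11))
print(n_frequent)  # 3069
```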