Estimating the quality of data in relational databases
Amihai Motro and Igor Rakov
Department of Information and Software Systems Engineering
George Mason University
Fairfax, VA 22030-4444
{ami,irakov}@gmu.edu
Abstract
With more and more electronic information sources becoming widely available, the issue of the quality of these often-competing sources has become germane. We propose a standard for rating information sources with respect to their quality. An important consideration is that the quality of information sources often varies considerably when specific areas within these sources are considered. This implies that the assignment of a single rating of quality to an information source is usually unsatisfactory. Of course, to the user of an information source the overall quality of the source may not be as important as the quality of the specific information that this user is extracting from the source. Therefore, methods must be developed that will derive reliable estimates of the quality of the information provided to users, from the quality specifications that have been assigned to the sources. Our work here bears on all these concerns. We describe an approach that uses dual quality measures that gauge the distance of the information in a database from the truth. We then propose to combine manual verification with statistical methods to arrive at useful estimates of the quality of databases. We consider the variance in quality by isolating areas of databases that are homogeneous with respect to quality, and then estimating the quality of each separate area. These composite estimates may be regarded as quality specifications that will be affixed to each database. Finally, we show how to derive quality estimates for individual queries from such quality specifications.
This work was supported in part by DARPA grants N0014-92-J-4038 and N0060-96-D-3202.
1 Introduction
The importance of data quality in the information age cannot be overestimated. People, businesses, and governments rely more and more on information in their everyday operations, and databases of different kinds are the primary source of this information. Our dependence on databases grows simultaneously with their size, yet most large databases contain errors and inconsistencies. There is a growing awareness in the database research community [13, 19] and among database practitioners [1] of the problem of data quality. By now, the need for data quality metrics and for methods for incorporating them in database systems is well understood. Data quality can be measured in a number of different ways, depending on which aspects of information are considered important [18, 5]. The addition of data quality capabilities to database systems will enhance decision-making processes, improve the quality of information services, and, in general, provide more accurate pictures of reality. On the other hand, these new capabilities of databases should not be demanding in terms of resources; e.g., they must not add too much complexity to query processing or require much more memory than existing databases.
The recent advances in the field of data quality concern data at an attribute value level [18] and at a relation level [14]. A comprehensive survey of the state of the art in the field is given in [19]. A relational algebra extended with data accuracy estimates, based on the assumption of uniform distributions of incorrect values across tuples and attributes, was first described in [14].
With more and more electronic information sources becoming widely available, the issue of the quality of these often-competing sources has become germane. We propose a standard for rating information sources with respect to their quality. An important consideration is that the quality of information sources often varies considerably when specific areas within these sources are considered. This implies that the assignment of a single rating of quality to an information source is usually unsatisfactory. Of course, to the user of an information source the overall quality of the source may not be as important as the quality of the specific information that this user is extracting from the source. Therefore, methods must be developed that will derive reliable estimates of the quality of the information provided to users, from the quality specifications that have been assigned to the sources.
Our work here bears on all these concerns. We describe an approach that uses dual quality measures that gauge the distance of the information in a database from the truth. We then propose to combine manual verification with statistical methods to arrive at useful estimates of the quality of databases. We consider the variance in quality by isolating areas of databases that are homogeneous with respect to quality, and then estimating the quality of each separate area. These composite estimates may be regarded as quality specifications that will be affixed to the database. Finally, we show how to derive quality estimates for individual queries from such quality specifications.
2 Overall Approach
Our overall approach for achieving the goals stated in the introduction can be described as a sequence of problems.
We begin, in Section 3, by describing the dual measures that will be used to gauge the quality of database information. We claim that these measures capture, in a most natural way, the relationship of the stored information to the truth, and are therefore excellent indicators of quality.
Our measures require the authentication of database information, which is a process that needs to be done by humans. However, we advocate the use of statistical methods (essentially, sampling) to keep the manual work within acceptable limits. This subject is discussed in Section 4.1.
Having obtained accurate information about the quality of the samples, we proceed to partition the given database into a set of components that are homogeneous with respect to our quality measures. We then estimate the quality of these components, using the samples. This implies that when information is extracted from a single component, its quality ratings are inherited from the containing component. These methods, described in Sections 4.2 and 4.3, provide us with quality specifications for the given databases.
Finally, in Section 5, we describe the process of inferring the quality of answers to arbitrary queries from the quality specifications that have been assigned to the database.
Our treatment of the problem is in the context of relational databases, and we assume the standard definitions of the relational model [17]. In particular, the database components mentioned earlier are defined using the mechanism of views. We also make the following assumptions.
1. Queries and views use only the projection, selection, and Cartesian product operations; selections use only range conditions, and projections always retain the key attribute(s).
2. The stored information (the database instances) is relatively static, and hence the quality of data does not change frequently.
Because of space limitations, several key issues and solutions are only sketched in this paper, and fuller discussions are provided in [11].
3 Soundness and Completeness as Measures of Data Quality
We define two measures of data quality that are general enough to encompass most existing measures and aspects of data quality [5, 19]. The basic ideas underlying these measures were first stated in [7]. In that paper the author suggested that declarations of the portions of the database that are known to be perfect models of the real world (and thereby the portions that are possibly imperfect) be included in the definition of each database. With this information, the database system can qualify the accuracy of the answers it issues in response to queries: each answer is accompanied by statements that define the portions of the answer that are guaranteed to be perfect. This approach uses views to specify the portions of the database or the portions of answers that are perfect models of the real world.
More specifically, this approach interprets information quality, which it terms integrity, as a combination of soundness and completeness. A database view is sound if it includes only information that occurs in the real world; a database view is complete if it includes all the information that occurs in the real world. Hence, a database view has integrity if it includes the whole truth (completeness) and nothing but the truth (soundness). A prototype database system that is based on these ideas is described in [10]. These ideas were further developed in [9] and are summarized below.
Given a database scheme $D$, we assume the existence of a hypothetical database instance $d_0$ that captures perfectly that portion of the real world that is modeled by $D$ (the ideal or true database). In addition, we assume one or more actual instances $d_i$ ($i \ge 1$). The actual instances are considered approximations of the ideal instance $d_0$.
Given a view $V$, we denote by $v_0$ its extension in the ideal database $d_0$ (the ideal or true extension of $V$) and we denote by $v_i$ its extension in the actual database $d_i$. Again, the extensions $v_i$ are approximations of the ideal extension $v_0$.
Consider a view $V$, its ideal extension $v_0$, and an approximation $v$. If $v \supseteq v_0$, then $v$ is a complete extension. If $v \subseteq v_0$, then $v$ is a sound extension. Obviously, an extension which is sound and complete is the ideal extension.
With these definitions, each view extension is either complete or incomplete, and either sound or nonsound. We now refine these definitions by assigning each extension a value that denotes how well it approximates the ideal extension. We shall term this value the goodness of the extension. We require that the goodness of each extension be a value between 0 and 1, that the goodness of the ideal extension be 1, and that the goodness of extensions that are entirely disjoint from the ideal extension be 0. Formally, a goodness measure is a function $g$ on the set of all possible extensions that satisfies
$$\forall v:\; g(v) \in [0, 1]$$
$$\forall v:\; v \cap v_0 = \emptyset \implies g(v) = 0$$
$$g(v_0) = 1$$
A simple approach to goodness is to consider the intersection of the extensions; that is, the tuples that appear in both $v$ and $v_0$. Let $|v|$ denote the number of tuples in $v$. Then
$$\frac{|v \cap v_0|}{|v|}$$
expresses the proportion of the database extension that appears in the true extension. Hence, it is a measure of the soundness of $v$. Similarly,
$$\frac{|v \cap v_0|}{|v_0|}$$
expresses the proportion of the true extension that appears in the database extension. Hence, it is a measure of the completeness of $v$.
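To make the two ratios concrete, the following is a minimal sketch in Python, assuming that extensions are represented as sets of tuples. The function names `soundness` and `completeness` and the sample data are illustrative assumptions, not part of the original model; the treatment of empty extensions follows the convention given in footnote 1.

```python
def soundness(v, v0):
    """Proportion of the stored extension v that appears in the ideal extension v0."""
    v, v0 = set(v), set(v0)
    if not v:
        # 0/0 case: defined as 1 if the ideal extension is empty too, else 0.
        return 1.0 if not v0 else 0.0
    return len(v & v0) / len(v)


def completeness(v, v0):
    """Proportion of the ideal extension v0 that appears in the stored extension v."""
    v, v0 = set(v), set(v0)
    if not v0:
        # 0/0 case: defined as 1 if the stored extension is empty too, else 0.
        return 1.0 if not v else 0.0
    return len(v & v0) / len(v0)


# Hypothetical stored and ideal extensions of an employee view (key, name, salary).
stored = {("e1", "Smith", 50), ("e2", "Jones", 60), ("e3", "Brown", 70), ("e4", "Lee", 80)}
ideal = {("e1", "Smith", 50), ("e2", "Jones", 65), ("e3", "Brown", 70),
         ("e5", "Kim", 90), ("e6", "Park", 40)}

s = soundness(stored, ideal)      # 2 of the 4 stored tuples are true
c = completeness(stored, ideal)   # 2 of the 5 true tuples are stored
```

In this example the tuple for `e2` counts as entirely wrong although only its salary is incorrect, which is the weakness of tuple-level measures.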
It is easy to verify that soundness and completeness satisfy all the requirements of a goodness measure.¹ Soundness and completeness are similar to precision and recall in information retrieval [15].
A disadvantage of these measures is that a database tuple is assumed to be sound (and to contribute to the soundness measure) only if it is identical to a tuple of the ideal database (similarly in the case of completeness). Thus, a tuple that is correct in all but one attribute and a tuple that is incorrect in all its attributes are treated identically. An essential refinement of these measures is to consider the goodness of individual attributes.
Assume a view $V$ has attributes $A_0, A_1, \ldots, A_n$, where $A_0$ is the key.² We decompose $V$ into $n$ key-attribute pairs $(A_0, A_i)$ ($i = 1, \ldots, n$). Using decomposed extensions in the previously defined measures improves their usefulness considerably, and we shall assume decomposed extensions throughout.
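The decomposition into key-attribute pairs can be sketched as follows. This is an illustration only: the helper name `decompose`, the use of the attribute position to tell pairs of different attributes apart, and the sample data are assumptions made for the example.

```python
def decompose(extension, key_index=0):
    """Return the set of (key value, attribute position, attribute value) pairs."""
    pairs = set()
    for row in extension:
        key = row[key_index]
        for position, value in enumerate(row):
            if position != key_index:
                pairs.add((key, position, value))
    return pairs


# Hypothetical stored and ideal extensions (key, name, salary).
stored = {("e1", "Smith", 50), ("e2", "Jones", 60), ("e3", "Brown", 70), ("e4", "Lee", 80)}
ideal = {("e1", "Smith", 50), ("e2", "Jones", 65), ("e3", "Brown", 70),
         ("e5", "Kim", 90), ("e6", "Park", 40)}

stored_pairs = decompose(stored)          # 8 pairs
ideal_pairs = decompose(ideal)            # 10 pairs
correct = stored_pairs & ideal_pairs      # 5 pairs, including ("e2", 1, "Jones")

pair_soundness = len(correct) / len(stored_pairs)
pair_completeness = len(correct) / len(ideal_pairs)
```

With decomposed extensions the partially correct tuple for `e2` now contributes its correct name, so both measures are higher than their tuple-level counterparts on the same data.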
Soundness and completeness can also be approached by means of probability theory [11]. For example, the definition of soundness can be interpreted as the probability of drawing a correct pair from a given extension. Probabilistic interpretations give new insight into the notions of soundness and completeness and also help us to connect this research with a large body of work on uncertainty management in information systems [8].
The data quality measures that have been mentioned most frequently as essential are accuracy, completeness, currentness, and consistency [5, 18]. It is possible to relate these quality measures to our own goodness measures [11].
¹ When $v$ is empty, soundness is 0/0. If $v_0$ is also empty then soundness is defined to be 1; otherwise it is defined to be 0. Similarly for completeness, when $v_0$ is empty.
² We consider a tuple as a representation of the real-world entity identified by a key attribute; the nonkey attributes then capture the properties of this entity. For simplicity, we assume that keys consist of a single attribute.
4 Rating the Quality of Databases
4.1 Necessary Procedures for Goodness Estimation
The amount of data in practical databases is often large. To compute the exact soundness and completeness of a particular view we would need to (1) authenticate every value pair in the stored view, and (2) determine how many pairs are missing from this view. This method is clearly infeasible in any real system. Thus, we must resort to sampling techniques [16, 4]. Sampling techniques allow us to estimate the mean and variance of a particular parameter of a population by using a sample which is usually only a fraction of the size of the entire population. The theory of statistics also gives us methods for establishing a sample size to achieve predetermined accuracy of the estimates. It is then possible to supplement our estimates with confidence intervals. For a more detailed discussion of sampling from databases the reader is referred to the literature on the topic (see, for example, [12] for a good survey).
Note that two different populations must be sampled. To estimate soundness we sample the given (stored) view, whereas to estimate completeness we sample the ideal view. To establish both soundness and completeness it is necessary to have access to the ideal database. For soundness, we need to determine whether a specific value pair of the stored database is in the ideal database. For completeness, it is necessary to determine whether a specific pair from the ideal database is in the stored database. These procedures (verifying a pair from a stored database against the ideal database, and retrieving an arbitrary pair from the ideal database) must be implemented in an ad hoc manner [1]. For each concrete database, human expertise will be required. The expert will access a variety of available sources to perform these two procedures. Note that this effort is performed only once and only for a sample, which then helps estimate the overall goodness.
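As a rough illustration of the sampling step, the sketch below estimates soundness from a simple random sample of stored pairs. It assumes simple random sampling and a normal-approximation confidence interval, neither of which is prescribed by the text; `is_correct` is a hypothetical stand-in for the human expert who verifies a pair, and the population is synthetic.

```python
import math
import random


def required_sample_size(margin, z=1.96, p_guess=0.5):
    """Sample size for estimating a proportion within the given margin of error."""
    return math.ceil(z * z * p_guess * (1 - p_guess) / margin ** 2)


def estimate_proportion(population, is_correct, sample_size, z=1.96, seed=0):
    """Return the sample proportion of correct pairs and an approximate interval."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample = rng.sample(population, sample_size)
    hits = sum(1 for pair in sample if is_correct(pair))
    p_hat = hits / sample_size
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))


# Synthetic stored view of 1,000 pairs; the flag records what an expert would find.
stored_pairs = [("k%d" % i, i < 900) for i in range(1000)]

n = required_sample_size(margin=0.05)
s_hat, interval = estimate_proportion(stored_pairs, lambda pair: pair[1], n)
```

To estimate completeness, the same function would be applied to a sample drawn from the ideal view, with `is_correct` replaced by a check that the pair is present in the stored database.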
A critical stage of our solution is to build a set of homogeneous views on a stored database, called a goodness basis. The goodness of the views of this basis will be measured and thereafter used in establishing the goodness of answers to arbitrary queries against this database. Since we cannot guarantee a single set of views that will be homogeneous with respect to both quality measures, we construct two separate sets: a soundness basis and a completeness basis.
In constructing each basis, we consider each database relation individually. Each relation may be partitioned both horizontally (by a selection) and vertically (by a projection), and the basis comprises the union of all such partitions. Selections are limited to ranges; i.e., the selection criterion is a conjunction of conditions, where each individual condition specifies an attribute and a range of permitted values for this attribute.
We assign to an incorrect value pair the value 0 and to a correct pair the value 1. Thus, we can represent an error distribution pattern in a view extension as a two-dimensional matrix of 0s and 1s, in which rows correspond to the tuples and columns correspond to the attributes of the view. A value in a particular cell of this matrix is either 0 or 1 depending on the correctness of the corresponding pair of attribute values. We call this new data structure a view map or a relation map, as appropriate. Now, the task is to partition this two-dimensional array into areas in which elements are distributed homogeneously with respect to our quality measures.
Note that the correctness of a particular nonkey attribute value can be determined only in reference to the key attribute of that tuple; i.e., in determining the correctness of an attribute value, and hence whether a specific cell should be 0 or 1, we consider the correctness of the pair (key value, nonkey value). The pair is correct if and only if both elements of the pair are correct. This means, in particular, that if a key attribute value is incorrect, then all pairs corresponding to this key attribute value are considered incorrect.
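The construction of a relation map can be sketched as follows. This is only an illustration: the function name `build_relation_map` is an assumption, and the `ideal_rows` argument stands in for the expert verification of sampled tuples, which in practice is a manual process.

```python
def build_relation_map(stored_rows, ideal_rows, key_index=0):
    """Return a 0/1 matrix: one row per stored tuple, one column per nonkey attribute."""
    ideal_by_key = {row[key_index]: row for row in ideal_rows}
    relation_map = []
    for row in stored_rows:
        truth = ideal_by_key.get(row[key_index])  # None when the key itself is incorrect
        cells = []
        for position, value in enumerate(row):
            if position == key_index:
                continue
            # A pair is correct only if the key is correct and the value matches.
            cells.append(1 if truth is not None and truth[position] == value else 0)
        relation_map.append(cells)
    return relation_map


# Hypothetical sample of stored tuples and the corresponding verified facts.
stored_rows = [("e1", "Smith", 50), ("e2", "Jones", 60), ("e4", "Lee", 80)]
ideal_rows = [("e1", "Smith", 50), ("e2", "Jones", 65), ("e3", "Brown", 70)]

relation_map = build_relation_map(stored_rows, ideal_rows)
```

The last stored tuple has a key that does not occur in the ideal data, so its whole row in the map is 0.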
The technique we use for partitioning the view map is a nonparametric statistical method called CART (Classification and Regression Trees) [2]. This method has been widely used for data analysis in biology, social science, environmental research, and pattern recognition. Closer to our area, this method was used in [3] for estimating the selectivity of selection queries. We assume that tuples and attributes of a relation are ordered uniquely.
4.2 Homogeneity Measure
Intuitively, a view is perfectly homogeneous with respect to a given property if every subview of the view contains the same proportion of pairs with this property as the view itself. Moreover, the more homogeneous a view, the closer its distribution of the pairs with the given property is to the distribution in the perfectly homogeneous view. Hence, the difference between the proportion of the pairs with the given property in the view itself and in each of its subviews can be used to measure the degree of homogeneity of the given view.
Specifically, let $\bar{v}$ denote an extension of a view of a relation in a stored database, let $v_1, \ldots, v_N$ be the set of all possible projection-selection views of $\bar{v}$, and let $s(\bar{v})$ and $s(v_i)$ denote the proportion of pairs in views $\bar{v}$ and $v_i$ ($i = 1, \ldots, N$), respectively, that occur in their corresponding ideal representations (i.e., proportions of correct pairs in these views). Then
$$\frac{1}{N} \sum_{v_i \subseteq \bar{v}} \left( s(\bar{v}) - s(v_i) \right)^2$$
measures the homogeneity of the view $\bar{v}$ with respect to soundness. The homogeneity with respect to completeness is defined analogously. Similar measures of homogeneity were proposed in [6, 3].
Due to the large number of possible views, computation of these measures is often prohibitively expensive. The Gini index [2, 3] was proposed as a simple alternative to these homogeneity measures.
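For a very small relation map the homogeneity measure can be evaluated by brute force, which also shows why it does not scale. The sketch below assumes that subviews are formed by contiguous ranges of rows (range selections over the assumed tuple ordering) combined with arbitrary nonempty subsets of columns (projections); this enumeration and the function names are assumptions made for the example.

```python
from itertools import combinations


def proportion_of_ones(relation_map, rows, cols):
    """Proportion of correct pairs (1s) in the given rows and columns."""
    cells = [relation_map[r][c] for r in rows for c in cols]
    return sum(cells) / len(cells)


def homogeneity(relation_map):
    """Mean squared difference between the view's proportion and each subview's."""
    n_rows, n_cols = len(relation_map), len(relation_map[0])
    s_view = proportion_of_ones(relation_map, range(n_rows), range(n_cols))
    deviations = []
    for start in range(n_rows):
        for stop in range(start + 1, n_rows + 1):
            for width in range(1, n_cols + 1):
                for cols in combinations(range(n_cols), width):
                    s_sub = proportion_of_ones(relation_map, range(start, stop), cols)
                    deviations.append((s_view - s_sub) ** 2)
    return sum(deviations) / len(deviations)


uniform = homogeneity([[1, 1], [1, 1]])   # every subview matches the whole view
mixed = homogeneity([[1, 1], [0, 0]])     # single-row subviews deviate by 0.5
```

A value of 0 corresponds to a perfectly homogeneous view, and larger values indicate less homogeneity. The number of subviews grows quadratically with the number of rows and exponentially with the number of columns.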
Consider a view $\bar{v}$ and a relation map $M$. We call the part of $M$ that corresponds to $\bar{v}$ a node.³ The Gini index of this node, denoted $G(\bar{v})$, is $2p(1-p)$, where $p$ denotes the proportion of 1s in the node.⁴
³ We use the terms node and view interchangeably.
The search for homogeneous nodes involves repeated splitting of nodes. The Gini index guarantees that any split improves (or maintains) the homogeneity of descendant nodes [2]. Formally, let $v$ be a relation map node which is split into two subnodes $v_1$ and $v_2$. Then $G(v) \ge \alpha_1 G(v_1) + \alpha_2 G(v_2)$, where $\alpha_i$ is $|v_i|/|v|$. In other words, the reduction of a split, defined as $\Delta G = G(v) - \alpha_1 G(v_1) - \alpha_2 G(v_2)$, is non-negative.
Obviously, the best split is a split that maximizes $\Delta G$. We call such a split a maximal split. If the number of possible splits is finite, there necessarily exists such a split. The method of generating soundness and completeness bases is founded on the search for a split that maximizes the gain in homogeneity. This method is discussed next.
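The Gini index of a binary node and the reduction achieved by a split can be computed directly from the definitions above. The sketch below treats a node as a flat list of 0s and 1s; the function names and the sample node are illustrative assumptions.

```python
def gini(cells):
    """Gini index 2p(1 - p) of a binary node, where p is the proportion of 1s."""
    p = sum(cells) / len(cells)
    return 2 * p * (1 - p)


def split_reduction(parent, left, right):
    """Reduction G(v) - a1*G(v1) - a2*G(v2) for a split of parent into left and right."""
    a1 = len(left) / len(parent)
    a2 = len(right) / len(parent)
    return gini(parent) - a1 * gini(left) - a2 * gini(right)


node = [1, 1, 1, 0, 0, 0, 1, 0]                      # p = 0.5, so G = 0.5
weak = split_reduction(node, node[:4], node[4:])     # halves with p = 0.75 and p = 0.25
sorted_node = sorted(node, reverse=True)             # [1, 1, 1, 1, 0, 0, 0, 0]
perfect = split_reduction(sorted_node, sorted_node[:4], sorted_node[4:])
```

The first split reduces the index only slightly, while the second separates the two types of elements completely and therefore achieves the largest possible reduction for this node, namely its full Gini index.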
⁴ In general, the Gini index is defined for maps whose elements are of $k$ different types; the index used here is much simpler, because our maps are binary.
4.3 Finding a Goodness Basis
We describe a procedure for building a soundness basis. The procedure for building a completeness basis is similar.
It is important to note that the procedures to be discussed in this section are performed on samples of the relations. Therefore, in the discussion that follows, the terms relation and relation map usually refer to samples of the relations and maps of these samples. Thus, although the best splits are found using only samples, the resulting views are later used as a goodness basis for the entire relation. Care should be taken to ensure that we draw samples whose sizes are sufficient for representing distribution patterns of the original relation.
A soundness basis is a partition of the stored relations, in which each relation is partitioned into views that are homogeneous with respect to soundness. Since the procedure of partitioning is repeated for each relation, it is sufficient to consider this procedure for a single relation. We assume that information on the correctness of a relation instance has already been converted to a corresponding relation map.
Finding a homogeneous partition of a relation can be viewed as a tree-building procedure, where the root node of the tree is the entire relation, its leaf nodes are homogeneous views of this relation, and its intermediate nodes are views produced by the searches for maximal splits. We call this tree structure a soundness tree. We start by labeling the entire relation map as the root of the tree. We then consider all the possible splits, either horizontal or vertical (but not both), and select the split that gives maximum gain in homogeneity. Obviously, the brute-force technique described here is extremely expensive. In practice we apply several substantive improvements [11].
When the maximal split is found, we break the root node into the two subnodes that achieved the maximal split. Next, we search for a maximal split in each of the two subnodes of the root and divide each of them into two descendant nodes. The procedure is repeated on each current leaf node of the tree until a heuristic stop-splitting rule is satisfied on every leaf node: splitting of a node stops when it can provide only marginal improvement in homogeneity. This situation usually arises when a maximal split on a node cannot separate elements of one type from elements of the other type in this node. This indicates that this node has a fairly homogeneous distribution of both types of elements.
The stop-splitting rules mentioned earlier are necessary, because otherwise a tree could grow until all the elements of every leaf are of one type. This could result in a large number of small nodes. It also means that there might be too few sample elements in a node, which makes the soundness estimate of the node unreliable. Our stop-splitting rule is $\Delta G \cdot n \ge \text{threshold}$, where $n$ is the number of elements in the node [11]. An analogous procedure is used for building a completeness tree.
Each leaf node of every soundness tree contributes one view to the soundness basis and each leaf node of every completeness tree contributes one view to the completeness basis. Together, these soundness and completeness bases form a goodness basis. Note that this process is performed only once on every relation, and the goodness basis need not be changed or updated later. The assumption here is that the information is static. When a leaf node is converted to a view, in addition to the rows and columns of the node, the view includes the key attribute for these tuples.
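The brute-force tree-building procedure can be sketched as follows. This is a simplified illustration, not the optimized method of [11]: nodes are rectangles `(row_start, row_stop, col_start, col_stop)` of the relation map, and a split is accepted only when $\Delta G \cdot n$ reaches the threshold, which is one reading of the stop-splitting rule. All function names and the threshold value are assumptions.

```python
def gini(cells):
    p = sum(cells) / len(cells)
    return 2 * p * (1 - p)


def cells_of(relation_map, r0, r1, c0, c1):
    return [relation_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]


def best_split(relation_map, node):
    """Try every horizontal and vertical split; return (reduction, (left, right))."""
    r0, r1, c0, c1 = node
    parent = cells_of(relation_map, *node)
    candidates = [((r0, r, c0, c1), (r, r1, c0, c1)) for r in range(r0 + 1, r1)]
    candidates += [((r0, r1, c0, c), (r0, r1, c, c1)) for c in range(c0 + 1, c1)]
    best = (0.0, None)
    for left, right in candidates:
        lc, rc = cells_of(relation_map, *left), cells_of(relation_map, *right)
        reduction = (gini(parent)
                     - len(lc) / len(parent) * gini(lc)
                     - len(rc) / len(parent) * gini(rc))
        if reduction > best[0]:
            best = (reduction, (left, right))
    return best


def build_basis(relation_map, threshold):
    """Split nodes repeatedly; return the leaf rectangles that form the basis."""
    leaves, pending = [], [(0, len(relation_map), 0, len(relation_map[0]))]
    while pending:
        node = pending.pop()
        reduction, halves = best_split(relation_map, node)
        n = (node[1] - node[0]) * (node[3] - node[2])
        if halves is None or reduction * n < threshold:
            leaves.append(node)      # marginal gain only: keep the node as a leaf
        else:
            pending.extend(halves)
    return sorted(leaves)


sample_map = [[1, 1, 0],
              [1, 1, 0],
              [1, 1, 0],
              [0, 0, 0]]
basis = build_basis(sample_map, threshold=0.5)
```

On this small map the first maximal split is vertical (it isolates the all-zero last column), and the remaining node is then split horizontally. Each resulting rectangle would be converted to a view by adding the key attribute and expressing the row range as a range selection.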
5 Estimating the Quality of Queries
5.1 Projection-Selection Queries
Assume now a query is submitted to this database extension. At this point, we consider only selection-projection queries on a single relation (and in which selections are based on ranges). In this section we discuss the estimation of soundness of such queries. The considerations for estimating completeness are nearly identical. In the next section we discuss queries that involve Cartesian products.
Because a basis partitions each relation, an answer to a query intersects with a certain number of basis views. Hence, each of these basis views contains a component of the answer as its subview. The key feature of basis views is their homogeneity with respect to soundness. Consequently, each component of the answer inherits its soundness from a basis view. As shown in Proposition 1 (see [11] for proof), the soundness of a view which comprises disjoint components is a weighted sum of the soundness of the individual components. This provides us with an easy way to determine the soundness of the entire answer. As a special case, when the entire answer is contained in a single basis view, the soundness of the answer is simply the soundness of the containing view.
Proposition 1. Let $t_1$ and $t_2$ be leaf nodes of a soundness tree with soundness $s_1$ and $s_2$ respectively, and let $q$ be an answer to a query $Q$. Suppose also that $q = (q \cap t_1) \cup (q \cap t_2)$. The soundness of $q$ is
$$s(q) = s_1 \cdot \frac{|q \cap t_1|}{|q|} + s_2 \cdot \frac{|q \cap t_2|}{|q|}$$
This proposition is easily generalized to $n$ leaf nodes, and the analogous proposition is true for completeness. In practice, we only have estimates $\hat{s}_1$ and $\hat{s}_2$ of $s_1$ and $s_2$. Hence, the formula becomes:
$$\hat{s}(q) = \hat{s}_1 \cdot \frac{|q \cap t_1|}{|q|} + \hat{s}_2 \cdot \frac{|q \cap t_2|}{|q|}$$
The variance of the estimate $\hat{s}(q)$ can also be computed [11].
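The weighted sum of Proposition 1, generalized to any number of leaf nodes, can be sketched as follows. The function name `answer_soundness`, the dictionary representation, and the numbers are assumptions made for the example; the sketch relies on the leaves being disjoint, so that the overlap sizes add up to $|q|$.

```python
def answer_soundness(overlap_sizes, leaf_soundness):
    """Weighted sum of leaf soundness estimates.

    overlap_sizes maps a leaf name to |q ∩ t_i|, the number of answer pairs in that leaf;
    leaf_soundness maps the same leaf name to its estimated soundness.
    """
    total = sum(overlap_sizes.values())  # equals |q| because the leaves are disjoint
    return sum(leaf_soundness[leaf] * size / total
               for leaf, size in overlap_sizes.items())


# Hypothetical answer of 40 pairs: 30 fall in leaf t1 and 10 in leaf t2.
estimate = answer_soundness({"t1": 30, "t2": 10}, {"t1": 0.9, "t2": 0.5})

# Special case: the whole answer lies in one basis view and inherits its soundness.
single = answer_soundness({"t1": 25}, {"t1": 0.9})
```

Here the estimate is $0.9 \cdot 30/40 + 0.5 \cdot 10/40 = 0.8$.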
5.2 Estimating the Goodness of Cartesian Products
To allow more general queries, we consider now queries that include Cartesian products. The following proposition (see [11] for proof) describes how to compute the soundness and completeness of the Cartesian product given the soundness and completeness of its operands.
Proposition 2. Let $r_1$ and $r_2$ be relations with soundness and completeness $s_1, c_1$ and $s_2, c_2$ respectively. The soundness and completeness of $r_1 \times r_2$ are
$$s(r_1 \times r_2) = \frac{k \cdot s_1 + p \cdot s_2}{k + p}, \qquad c(r_1 \times r_2) = \frac{k \cdot c_1 + p \cdot c_2}{k + p}$$
respectively, where $k$ and $p$ are the numbers of non-key attributes in the relations $r_1$ and $r_2$ respectively.
In practice, we have only estimates of the soundness and completeness, and the formulas from the proposition become:
$$\hat{s}(r_1 \times r_2) = \frac{k \cdot \hat{s}_1 + p \cdot \hat{s}_2}{k + p}, \qquad \hat{c}(r_1 \times r_2) = \frac{k \cdot \hat{c}_1 + p \cdot \hat{c}_2}{k + p}$$
where $\hat{s}_1, \hat{s}_2, \hat{c}_1, \hat{c}_2$ are estimates for soundness and completeness of the corresponding relations. For the derivation of the variance of the estimates see [11].
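The formulas of Proposition 2 translate directly into code. In the sketch below the function name `product_goodness` and the numeric values are illustrative assumptions.

```python
def product_goodness(k, s1, c1, p, s2, c2):
    """Soundness and completeness of r1 x r2.

    k and p are the numbers of non-key attributes of r1 and r2;
    (s1, c1) and (s2, c2) are their soundness and completeness estimates.
    """
    soundness = (k * s1 + p * s2) / (k + p)
    completeness = (k * c1 + p * c2) / (k + p)
    return soundness, completeness


# Hypothetical operands: r1 has 3 non-key attributes, r2 has 1.
s_hat, c_hat = product_goodness(k=3, s1=0.9, c1=0.8, p=1, s2=0.5, c2=0.4)
```

Each operand contributes in proportion to its number of non-key attributes, so the result here (soundness 0.8, completeness 0.7) is closer to the goodness of the wider relation.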
5.3 Estimating the Goodness of General Queries
So far we have shown how to estimate the soundness and completeness of selection-projection queries on a single relation, and of Cartesian products of two relations. To compute soundness and completeness of arbitrary Cartesian product-selection-projection queries it is necessary to show how to compute goodness estimates over sequences of relational algebra operations. The estimation of each operation in a sequence requires soundness and completeness bases with each view having its associated soundness or completeness estimate. In [11] we extend our methods so that each operation delivers, in addition to a goodness estimate of its result, the necessary bases for future operations. This provides us with the ability to perform sequences of operations.
A legitimate question at this point is whether these estimates depend on the order in which they are computed, i.e., whether the estimates of the goodness of equivalent relational algebra expressions are the same. The answer to this question is that the estimates are independent of the particular expression used [11].
6 Conclusions and Future Research
We introduced a new model for data quality in relational databases, which is based on the dual measures of soundness and completeness. The purpose of this model is to provide answers to arbitrary queries with an estimation of their quality. We achieved this by adopting the concept of a basis, which is a partition of the database into views that are homogeneous with respect to the goodness measures. These bases are constructed using database samples, whose goodness is established manually. Once the bases and their goodness estimates are in place, the goodness of answers to arbitrary queries is inferred rather simply.
We plan to develop the complete set of procedures for calculating soundness and completeness of the answers to other relational algebra operations; i.e., add procedures for union, difference, and intersection of views. One of our major goals is to use these methods to estimate the goodness of answers to queries against multidatabases, where the same query could be answered differently by different databases, and goodness information can help resolve such inconsistencies.
We have already discussed the advantage of considering the correctness of individual attributes over the correctness of entire tuples. Still, an individual value is either correct or incorrect, and, when incorrect, we do not consider the proximity of a stored value to the true value. This direction, which is closely related to several uncertainty modeling techniques, merits further investigation.
Because of the cost of establishing goodness estimations, we have noted that our methods are most suitable for static information. When the information is dynamic, it would be advisable to timestamp the estimations at the time that they were obtained and attach these timestamps to all quality inferences. One may also consider the automatic attenuation of quality estimations as time progresses. This direction is still outside our immediate objectives.
References
[1] World, 17(51), December 1995.
[2] L. Breiman, J. Friedman, R. Olshen, and Ch. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
[3] M. C. Chen, L. McNamee, and N. Matloff. Selectivity estimation using homogeneity measurement. In Proceedings of the International Conference on Data Engineering, 1990.
[4] W. Cochran. Sampling Techniques. John Wiley & Sons, 1963.
[5] C. Fox, A. Levitin, and T. Redman. The notion of data and its quality dimensions. Information Processing and Management, 30(1), 1994.
[6] N. Kamel and R. King. Exploiting data distribution patterns in modeling tuple selec... Information Sciences, 69(1-2), 1993.
[7] A. Motro. Integrity = validity + completeness. ACM Transactions on Database Systems, 14(4):480–502, December 1989.
[8] A. Motro. Sources of uncertainty in information systems. In A. Motro and Ph. Smets, editors, Proceedings of the Workshop on Uncertainty Management in Information Systems: From Needs to Solutions, 1992.
[9] A. Motro. A formal framework for integrating inconsistent answers from multiple information sources. Technical Report ISSE-TR-93-106, Department of Information and Software Systems Engineering, George Mason University, 1993.
[10] A. Motro. Panorama: A database system that annotates its answers to queries with their properties. Journal of Intelligent Information Systems, 7(1), 1996.
[11] A. Motro and I. Rakov. On the specification, measurement, and inference of the quality of data. Technical report, Department of Information and Software Systems Engineering, George Mason University, 1996.
[12] F. Olken and D. Rotem. Random sampling from databases - a survey. Statistics and Computing, 5(1), 1995.
[13] K. Parsaye and M. Chignell. Intelligent Database Tools and Applications. John Wiley & Sons, 1993.
[14] M. P. Reddy and R. Wang. Estimating data accuracy in a federated database environment. In Proceedings of CISMOD, 1995.
[15] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, New York, 1983.
[16] S. Thompson. Sampling. John Wiley & Sons, 1992.
[17] J. D. Ullman. Database and Knowledge-Base Systems, [...]. Computer Science Press, Rockville, Maryland, 1988.
[18] R. Wang, M. Reddy, and H. Kon. Toward quality data: An attribute-based approach. Decision Support Systems, 13(3-4), 1995.
[19] R. Wang, V. Storey, and Ch. Firth. A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering, 7(4), August 1995.