Estimating the quality of data in relational databases

更新时间:2023-05-31 00:31:01 阅读量: 实用文档 文档下载

说明:文章内容仅供预览,部分内容可能不全。下载后的文档,内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的,是否完整无缺。

EstimatingtheQualityofDatainRelational

Databases

AmihaiMotroandIgorRakov

DepartmentofInformationandSoftwareSystemsEngineering

GeorgeMasonUniversity

Fairfax,VA22030-4444

{ami,irakov}@gmu.edu

Abstract

Withmoreandmoreelectronicinformationsourcesbecomingwidelyavailable,theissueofthequalityofthese,often-competing,sourceshasbecomegermane.Weproposeastandardforratinginformationsourceswithrespecttotheirquality.Animportantconsiderationisthatthequalityofinformationsourcesoftenvariesconsiderablywhenspeci careaswithinthesesourcesareconsidered.Thisimpliesthattheassignmentofasingleratingofqualitytoaninformationsourceisusuallyunsatisfactory.Ofcourse,totheuserofaninformationsourcetheoverallqualityofthesourcemaynotbeasimportantasthequalityofthespeci cinformationthatthisuserisextractingfromthesource.Therefore,methodsmustbedevelopedthatwillderivereliableestimatesofthequalityoftheinformationprovidedtousers,fromthequalityspeci cationsthathavebeenassignedtothesources.Ourworkherebearsonalltheseconcerns.Wedescribeanapproachthatusesdualqualitymeasuresthatgaugethedistanceoftheinformationinadatabasefromthetruth.Wethenproposetocombinemanualveri cationwithstatisticalmethodstoarriveatusefulestimatesofthequalityofdatabases.Weconsiderthevarianceinqualitybyisolatingareasofdatabasesthatarehomogeneouswithrespecttoquality,andthenestimatingthequalityofeachseparatearea.Thesecompositeestimatesmayberegardedasqualityspeci cationsthatwillbea xedtoeachdatabase.Finally,weshowhowtoderivequalityestimatesforindividualqueriesfromsuchqualityspeci cations.

ThisworkwassupportedinpartbyDARPAgrantsN0014-92-J-4038andN0060-96-D-3202.

1Introduction

Theimportanceofdataqualityintheinformationagecannotbeoverestimated.People,businesses,andgovernmentsrelymoreandmoreoninformationintheireverydayoperations,anddatabasesofdi erentkindsaretheprimarysourceofthisinformation.Ourdependenceondatabasesgrowssimultaneouslywiththeirsize,yetmostlargedatabasescontainerrorsandinconsistencies.Thereisagrowingawarenessinthedatabaseresearchcommunity

[13,19]andamongdatabasepractitioners[1]oftheproblemofdataquality.Bynow,theneedfordataqualitymetricsandformethodsforincorporatingthemindatabasesystemsiswellunderstood.Dataqualitycanbemetricizedinanumberofdi erentwaysdependingonwhichaspectofinformationareconsideredimportant[18,5].Theadditionofdataqualitycapabilitiestodatabasesystemswillenhancedecision-makingprocesses,improvethequalityofinformationservices,and,ingeneral,providemoreaccuratepicturesofreality.Ontheotherhand,thesenewcapabilitiesofdatabasesshouldnotbedemandingintermsofresources,e.g.,theymustnotaddtoomuchcomplexitytoqueryprocessingorrequiremuchmorememorythanexistingdatabases.

Therecentadvancesinthe eldofdataqualityconcerndataatanattributevaluelevel[18]andatarelationlevel[14].Thecomprehensivesurveyofthestate-of-the-artinthe eldisgivenin[19].Therelationalalgebraextendedwithdataaccuracyestimatesbasedontheassumptionsofuniformdistributionsofincorrectvaluesacrosstuplesandattributeswas rstdescribedin[14].

Withmoreandmoreelectronicinformationsourcesbecomingwidelyavailable,theissueofthequalityofthese,often-competing,sourceshasbecomegermane.Weproposeastandardforratinginformationsourceswithrespecttotheirquality.Animportantconsiderationisthatthequalityofinformationsourcesoftenvariesconsiderablywhenspeci careaswithinthesesourcesareconsidered.Thisimpliesthattheassignmentofasingleratingofqualitytoaninformationsourceisusuallyunsatisfactory.Ofcourse,totheuserofaninformationsourcetheoverallqualityofthesourcemaynotbeasimportantasthequalityofthespeci cinformationthatthisuserisextractingfromthesource.Therefore,methodsmustbedevelopedthatwillderivereliableestimatesofthequalityoftheinformationprovidedtousers,fromthequalityspeci cationsthathavebeenassignedtothesources.

Ourworkherebearsonalltheseconcerns.Wedescribeanapproachthatusesdualqualitymeasuresthatgaugethedistanceoftheinformationinadatabasefromthetruth.Wethenproposetocombinemanualveri cationwithstatisticalmethodstoarriveatusefulestimatesofthequalityofdatabases.Weconsiderthevarianceinqualitybyisolatingareasofdatabasesthatarehomogeneouswithrespecttoquality,andthenestimatingthequalityofeachseparatearea.Thesecompositeestimatesmayberegardedasqualityspeci cationthatwillbea xedtothedatabase.Finally,weshowhowtoderivequalityestimatesforindividualqueriesfromsuchqualityspeci cations.

2OverallApproach

Ouroverallapproachforachievingthegoalsthatwerestatedintheintroductioncanbedescribedasasequenceofproblems.

Webegin,inSection3,bydescribingthedualmeasuresthatwillbeusedtogaugethequalityofdatabaseinformation.Weclaimthatthesemeasurescaptureinamostnaturalway,therelationshipofthestoredinformationtotruth,andarethereforeexcellentindicatorsofquality.

Ourmeasuresrequiretheauthenticationofdatabaseinformation,whichisaprocessthatneedstobedonebyhumans.However,weadvocatetheuseofstatisticalmethods(essentially,sampling)tokeepthemanualworkwithinacceptablelimits.ThissubjectisdiscussedinSection4.1.

Havingobtainedaccurateinformationaboutthequalityofthesamples,weproceedtopartitionthegivendatabasetoasetofcomponentsthatarehomogeneouswithrespecttoourqualitymeasures.Wethenestimatethequalityofthesecomponents,usingthesamples.Thisimpliesthatwheninformationisextractedfromasinglecomponent,itsqualityratingsareinheritedfromthecontainingcomponent.Thesemethods,describedinSections4.2and4.3,provideuswithqualityspeci cationsforthegivendatabases.

Finally,inSection5,wedescribetheprocessofinferringthequalityofanswerstoarbi-traryqueriesfromthequalityspeci cationsthathavebeenassignedtothedatabase.Ourtreatmentoftheproblemisinthecontextofrelationaldatabases,andweassumethestandardde nitionsoftherelationalmodel[17].Inparticular,thedatabasecomponentsmentionedearlierarede nedusingthemechanismofviews.Wealsomakethefollowingassumptions.

1.Queriesandviewsuseonlytheprojection,selection,andCartesianproductoperations,selectionsuseonlyrangeconditions,andprojectionsalwaysretainthekeyattribute(s).

2.Thestoredinformation(thedatabaseinstances)arerelativelystatic,andhencethequalityofdatadoesnotchangefrequently.

Becauseofspacelimitations,severalkeyissuesandsolutionsareonlysketchedinthispaper,andfullerdiscussionsareprovidedin[11].

3SoundnessandCompletenessasMeasuresofData

Quality

Wede netwomeasuresofdataqualitythataregeneralenoughtoencompassmostexistingmeasuresandaspectsofdataquality[5,19].Thebasicideasunderlyingthesemeasureswere rststatedin[7].Inthatpapertheauthorsuggestedthatdeclarationsoftheportionsofthedatabasethatareknowntobeperfectmodelsoftherealworld(andtherebytheportionsthatarepossiblyimperfect)beincludedinthede nitionofeachdatabase.Withthisinformation,thedatabasesystemcanqualifytheaccuracyoftheanswersitissuesinresponsetoqueries:eachanswerisaccompaniedbystatementsthatde netheportionsoftheanswerthatareguaranteedtobeperfect.Thisapproachusesviewstospecifytheportionsofthedatabaseortheportionsofanswersthatareperfectmodelsoftherealworld.

Morespeci cally,thisapproachinterpretsinformationquality,whichittermsintegrity,asacombinationofsoundnessandcompleteness.Adatabaseviewissoundifitincludesonlyinformationthatoccursintherealworld;adatabaseviewiscompleteifitincludesalltheinformationthatoccursintherealworld.Hence,adatabaseviewhasintegrity,ifitincludesthewholetruth(completeness)andnothingbutthetruth(soundness).Aprototypedatabasesystemthatisbasedontheseideasisdescribedin[10].Theseideaswerefurtherdevelopedin[9]andaresummarizedbelow.

GivenadatabaseschemeD,weassumetheexistenceofahypotheticaldatabaseinstanced0thatcapturesperfectlythatportionoftherealworldthatismodeledbyD(theidealortruedatabase).Inaddition,weassumeoneormoreactualinstancesdi(i≥1).Theactualinstancesareconsideredapproximationsoftheidealinstanced0.

GivenaviewV,wedenotebyv0itsextensionintheidealdatabased0(theidealortrueextensiontoV)andwedenotebyviitsextensionintheactualdatabasedi.Again,theextensionsviareapproximationsoftheidealextensionv0.

ConsiderviewV,itsidealextensionv0,andanapproximationv.Ifv v0,thenvisacompleteextension.Ifv v0,thenvisasoundextension.Obviously,anextensionwhichissoundandcompleteistheidealextension.

Withthesede nitions,eachviewextensioniseithercompleteorincomplete,andeithersoundornonsound.Wenowre nethesede nitionsbyassigningeachextensionavaluethatdenoteshowwellitapproximatestheidealextension.Weshalltermthisvaluethegoodnessoftheextension.Werequirethatthegoodnessofeachextensionbeavaluebetween0and1,thatthegoodnessoftheidealextensionbe1,andthatthegoodnessofextensionsthatareentirelydisjointfromtheidealextensionbe0.Formally,agoodnessmeasureisafunctiongonthesetofallpossibleextensionsthatsatis es

v:g(v)∈[0,1]

v:v∩v0= = g(v)=0

g(v0)=1

Asimpleapproachtogoodnessistoconsidertheintersectionoftheextensions;thatis,thetuplesthatappearinbothvandv0.Let|v|denotethenumberoftuplesinv.Then

|v∩v0|

|v|

expressestheproportionofthedatabaseextensionthatappearsinthetrueextension.Hence,itisameasureofthesoundnessofv.Similarly,

|v∩v0|

|v0|

expressestheproportionofthetrueextensionthatappearsinthedatabaseextension.Hence,itisameasureofthecompletenessofv.

Itiseasytoverifythatsoundnessandcompletenesssatisfyalltherequirementsofagood-nessmeasure.1Soundnessandcompletenessaresimilartoprecisionandrecallininformationretrieval[15].

Adisadvantageofthesemeasuresisthatadatabasetupleisassumedtobesound(andcontributetothesoundnessmeasure)onlyifitidenticaltoatupleoftheidealdatabase(sim-ilarlyinthecaseofcompleteness).Thus,atuplethatiscorrectinallbutoneattribute,andatuplethatisincorrectinallitsattributesaretreatedidentically.Anessentialre nementofthesemeasuresistoconsiderthegoodnessofindividualattributes.

AssumeaviewVhasattributesA0,A1,...,An,whereA0isthekey.2WedecomposeVintonkey-attributepairs(A0,Ai)(i=1,...,n),ingdecom-posedextensionsinthepreviously-de nedmeasuresimprovestheirusefulnessconsiderably,andweshallassumedecomposedextensionsthroughout.

Soundnessandcompletenesscanalsobeapproachedbymeansofprobabilitytheory[11].Forexample,thede nitionofsoundnesscanbeinterpretedastheprobabilityofdrawingacorrectpairfromagivenextension.Probabilisticinterpretationsgivenewinsightintothenotionsofsoundnessandcompletenessandalsohelpustoconnectthisresearchwithalargebodyofworkonuncertaintymanagementininformationsystems[8].

Thedataqualitymeasuresthathavebeenmentionedmostfrequentlyasessentialareaccuracy,completeness,currentness,andconsistency[5,18].Itispossibletorelatethesequalitymeasurestoourowngoodnessmeasures[11].

Whenvisempty,soundnessis0/0.Ifv0isalsoemptythensoundnessisde nedtobe1;otherwiseitisde nedtobe0.Similarlyforcompleteness,whenv0isempty.

2Weconsideratupleasarepresentationoftherealworldentityidenti edbyakeyattribute;thenonkeyattributesthencapturethepropertiesofthisentity.Forsimplicity,weassumethatkeysconsistofasingleattribute.1

4

4.1RatingtheQualityofDatabasesNecessaryProceduresforGoodnessEstimation

Theamountofdatainpracticaldatabasesisoftenlarge.Tocomputetheexactsoundnessandcompletenessofaparticularviewwewouldneedto(1)authenticateeveryvaluepairinthestoredview,and(2)determinehowmanypairsaremissingfromthisview.Thismethodisclearlyinfeasibleinanyrealsystem.Thus,wemustresorttosamplingtechniques[16,4].Samplingtechniquesallowustoestimatethemeanandvarianceofaparticularparameterofapopulationbyusingasamplewhichisusuallyonlyafractionofthesizeoftheentirepopulation.Thetheoryofstatisticsalsogivesusmethodsforestablishingasamplesizetoachievepredeterminedaccuracyoftheestimates.Itisthenpossibletosupplementourestimateswithcon denceintervals.Formoredetaileddiscussiononsamplingfromdatabasesthereaderisreferredtotheliteratureonthetopic(see,forexample,[12]foragoodsurvey).Notethattwodi erentpopulationsmustbesampled.Toestimatesoundnesswesamplethegiven(stored)view,whereastoestimatecompleteness,wesampletheidealview.Toestablishbothsoundnessandcompletenessitisnecessarytohaveaccesstotheidealdatabase.Forsoundness,weneedtodeterminewhetheraspeci cvaluepairofthestoreddatabaseisintheidealdatabase.Forcompleteness,itisnecessarytodeterminewhetheraspeci cpairfromtheidealdatabaseisinthestoreddatabase.Theseprocedures(verifyapairfromastoreddatabaseagainsttheidealdatabaseandretrieveanarbitrarypairfromtheidealdatabase)mustbeimplementedinanad-hocmanner[1].Foreachconcretedatabase,humanexpertisewillberequired.Theexpertwillaccessavarietyofavailablesourcestoperformthesetwoprocedures.Notethatthise ortisperformedonlyonceandonlyforasample,whichthenhelpsestimatetheoverallgoodness.

Acriticalstageofoursolutionistobuildasetofhomogeneousviewsonastoreddatabase,calledagoodnessbasis.Thegoodnessoftheviewsofthisbasiswillbemeasuredandthere-afterusedinestablishingthegoodnessofanswerstoarbitraryqueriesagainstthisdatabase.Sincewecannotguaranteeasinglesetofviewsthatwillbehomogeneouswithrespecttobothqualitymeasures,weconstructtwoseparatesets:asoundnessbasisandacompletenessbasis.Inconstructingeachbasis,weconsidereachdatabaserelationindividually.Eachre-lationmaybepartitionedbothhorizontally(byaselection)andvertically(byaprojection),andthebasiscomprisestheunionofallsuchpartitions.Selectionsarelimitedtoranges;i.e.,theselectioncriteriaisaconjunctionofconditions,whereeachindividualconditionspeci esanattributeandarangeofpermittedvaluesforthisattribute.

Weassigntoanincorrectvaluepairthevalue0andtoacorrectpairthevalue1.Thus,wecanrepresentanerrordistributionpatterninaviewextensionasatwo-dimensionalmatrixof0sand1s,inwhichrowscorrespondtothetuplesandcolumnscorrespondtotheattributesoftheview.Avalueinaparticularcellofthismatrixiseither0or1dependingonthecorrectnessofthecorrespondingpairofattributevalues.Wecallthisnewdatastructure

aviewmaporarelationmap,asappropriate.Now,thetaskistopartitionthistwo-dimensionalarrayintoareasinwhichelementsaredistributedhomogeneouslywithrespecttoourqualitymeasures.

Notethatthecorrectnessofaparticularnonkeyattributevaluecanbedeterminedonlyinreferencetothekeyattributeofthattuple,i.e.,indeterminingwhetheraspeci ccellshouldbe0or1weconsiderthecorrectnessofthepair:(keyvalue;nonkeyvalue)determiningthecorrectnessofanattributevalue.Thepairiscorrectifandonlyifbothelementsofthepairarecorrect.Thismeans,inparticular,thatifakeyattributevalueisincorrect,thenallpairscorrespondingtothiskeyattributevalueareconsideredincorrect.

ThetechniqueweuseforpartitioningtheviewmapisanonparametricstatisticalmethodcalledCART(Classi cationandRegressionTrees)[2].Thismethodhasbeenwidelyusedfordataanalysisinbiology,socialscience,environmentalresearch,andpatternrecognition.Closertoourarea,thismethodwasusedin[3]forestimatingtheselectivityofselectionqueries.Weassumethattuplesandattributesofarelationareordereduniquely.

4.2HomogeneityMeasure

Intuitively,aviewisperfectlyhomogeneouswithrespecttoagivenpropertyifeverysubviewoftheviewcontainsthesameproportionofpairswiththispropertyastheviewitself.Moreover,themorehomogeneousaview,thecloseritsdistributionofthepairswiththegivenpropertyistothedistributionintheperfectlyhomogeneousview.Hence,thedi erencebetweentheproportionofthepairswiththegivenpropertyintheviewitselfandineachofitssubviewscanbeusedtomeasurethedegreeofhomogeneityofthegivenview.

Speci cally,letv¯denoteanextensionofaviewofarelationinastoreddatabase,letv1,...,vNbethesetofallpossibleprojection-selectionviewsofv¯,lets(¯v)ands(vi)denotetheproportionofpairsinviewsv¯andvi(i=1,...,N),respectively,thatoccurintheircorrespondingidealrepresentations(i.e.,proportionsofcorrectpairsintheseviews).Then1 (s(¯v) s(vi))2

Nvi v¯

measuresthehomogeneityoftheviewv¯withrespecttosoundness.Thehomogeneitywithrespecttocompletenessisde nedanalogously.Similarmeasuresofhomogeneitywerepro-posedin[6,3].

Duetothelargenumberofpossibleviews,computationofthesemeasuresisoftenpro-hibitivelyexpensive.TheGiniindex[2,3]wasproposedasasimplealternativetothesehomogeneitymeasures.

Consideraviewv¯andarelationmapM.WecallthepartofMthatcorrespondsto

3v¯anode.TheGiniindexofthisnode,denotedG(¯v),is2p(1 p),wherepdenotesthe3Weusethetermsnodeandviewinterchangeably.

proportionof1sinthenode.4

Thesearchforhomogeneousnodesinvolvesrepeatedsplittingofnodes.TheGiniindexguaranteesthatanysplitimproves(ormaintains)thehomogeneityofdescendantnodes[2].Formally,letvbearelationmapnodewhichissplitintotwosubnodesv1andv2.ThenG(v)≥α1G(v1)+α2G(v2),whereαiis|vi|/|v|.Inotherwords,thereductionofasplit,de nedas G=G(v) α1G(v1) α2G(v2),isnon-negative.

Obviously,thebestsplitisasplitthatmaximizes G.Wecallsuchasplitamaximalsplit.Ifthenumberofpossiblesplitsis nite,therenecessarilyexistssuchasplit.Themethodofgeneratingsoundnessandcompletenessbasesisfoundedonthesearchforasplitthatmaximizesthegaininhomogeneity.Thismethodisdiscussednext.

4.3FindingaGoodnessBasis

Wedescribeaprocedureofbuildingasoundnessbasis.Theprocedureofbuildingacom-pletenessbasisissimilar.

Itisimportanttonotethattheprocedurestobediscussedinthissectionareperformedonsamplesoftherelations.Therefore,inthediscussionthatfollows,thetermsrelationandrelationmapusuallyrefertosamplesoftherelationsandmapsofthesesamples.Thus,althoughthebestsplitsarefoundusingonlysamples,theresultingviewsarelaterusedasagoodnessbasisfortheentirerelation.Careshouldbetakentoensurethatwedrawsampleswhosesizesaresu cientforrepresentingdistributionpatternsoftheoriginalrelation.

Asoundnessbasisisapartitionofthestoredrelations,inwhicheachrelationispar-titionedintoviewsthatarehomogeneouswithrespecttosoundness.Sincetheprocedureofpartitioningisrepeatedforeachrelation,itissu cienttoconsiderthisprocedureforasinglerelation.Weassumethatinformationonthecorrectnessofarelationinstancehasalreadybeenconvertedtoacorrespondingrelationmap.

Findingahomogeneouspartitionofarelationcanbeviewedasatree-buildingprocedure,wheretherootnodeofthetreeistheentirerelation,itsleafnodesarehomogeneousviewsofthisrelation,anditsintermediatenodesareviewsproducedbythesearchesformaximalsplits.Wecallthistreestructureasoundnesstree.Westartbylabelingtheentirerelationmapastherootofthetree.Wethenconsiderallthepossiblesplits,eitherhorizontalorvertical(butnotboth),andselectthesplitthatgivesmaximumgaininhomogeneity.Obviously,thebrute-forcetechniquedescribedhereisextremelyexpensive.Inpracticeweapplyseveral,substantiativeimprovements[11].

Whenthemaximalsplitisfound,webreaktherootnodeintothetwosubnodesthatachievedthemaximalsplit.Next,wesearchforamaximalsplitineachofthetwosubnodesoftherootanddividethemintwodescendentnodeseach.Theprocedureisrepeatedoneach

Ingeneral,theGiniindexisde nedformapswhoseelementsareofkdi erenttypes;theindexusedhereismuchsimpler,becauseourmapsarebinary.4

currentleafnodeofthetreeuntilaheuristicstop-splittingruleissatis edoneveryleafnode:splittingofanodestopswhenitcanprovideonlymarginalimprovementinhomogeneity.Thissituationusuallyariseswhenamaximalsplitonanodecannotseparateelementsofonetypefromelementsoftheothertypeinthisnode.Thisindicatesthatthisnodehasafairlyhomogeneousdistributionofbothtypesofelements.

Thestop-splittingrulesmentionedearlierarenecessary,becauseotherwiseatreecouldgrowuntilalltheelementsofeveryleafareofonetype.Thiscouldresultinalargenumberofsmallnodes.Italsomeansthattheremightbetoofewsampleelementsinthisnode,whichmakesthesoundnessestimateofthenodeunreliable.Ourstop-splittingruleis G·n≥threshold,wherenisthenumberofelementsinthenode[11].Ananalogousprocedureisusedforbuildingacompletenesstree.

Eachleafnodeofeverysoundnesstreecontributesoneviewtothesoundnessbasisandeachleafnodeofeverycompletenesstreecontributesoneviewtothecompletenessbasis.Together,thesesoundnessandcompletenessbasesformagoodnessbasis.Notethatthisprocessisperformedonlyonceoneveryrelation,andthegoodnessbasisneednotbechangedorupdatedlater.Theassumptionhereisthattheinformationisstatic.Whenaleafnodeisconvertedtoaview,inadditiontotherowsandcolumnsofthenode,theviewincludesthekeyattributeforthesetuples.

5

5.1EstimatingtheQualityofQueriesProjection-SelectionQueries

Assumenowaqueryissubmittedtothisdatabaseextension.Atthispoint,weconsideronlyselection-projectionqueriesonasinglerelation(andinwhichselectionsarebasedonranges).Inthissectionwediscusstheestimationofsoundnessofsuchqueries.Theconsiderationsforestimatingcompletenessarenearlyidentical.InthenextsectionwediscussqueriesthatinvolveCartesianproducts.

Becauseabasispartitionseachrelation,ananswertoaqueryintersectswithacertainnumberofbasisviews.Hence,eachofthesebasisviewscontainsacomponentoftheanswerasitssubview.Thekeyfeatureofbasisviewsistheirhomogeneitywithrespecttosoundness.Consequently,eachcomponentoftheanswerinheritsitssoundnessfromabasisview.AsshowninProposition1(see[11]forproof),thesoundnessofaviewwhichcomprisesdisjointcomponentsisaweightedsumofthesoundnessoftheindividualcomponents.Thisprovidesuswithaneasywaytodeterminethesoundnessoftheentireanswer.Asaspecialcase,whentheentireansweriscontainedinasinglebasisview,thesoundnessoftheanswerissimplythesoundnessofthecontainingview.

Proposition1Lett1andt2beleafnodesofasoundnesstreewithsoundnesss1ands2respectively,andletqbeananswertoaqueryQ.Supposealsothatq=(q∩t1)∪(q∩t2).

Thesoundnessofqis

s(q)=s1·|q∩t1||q∩t2|+s2·|q||q|

Thispropositioniseasilygeneralizedfornleafnodes,andtheanalogouspropositionistrueforcompleteness.Inpractice,weonlyhaveestimatesofs1ands2.Hence,theformulabecomes:|q∩t2||q∩t1|+s ·s (q)=s ·21|q||q|

Thevarianceoftheestimates (q)canbealsocomputed[11].

5.2EstimatingtheGoodnessofCartesianProducts

Toallowmoregeneralqueries,weconsidernowqueriesthatincludeCartesianproducts.Thefollowingproposition(see[11]forproof)describeshowtocomputethesoundnessandcompletenessoftheCartesianproductgiventhesoundnessandcompletenessofitsoperands.Proposition2Letr1andr2berelationswithsoundnessandcompletenesss1,c1ands2,c2respectively.Thesoundnessandcompletenessofther1×r2are

s(r1×r2)=k·s1+p·s2k·c1+p·c2,c(r1×r2)=k+pk+p

respectively,wherekandparethenumberofnon-keyattributesintherelationsr1andr2respectively.

Inpractice,wehaveonlyestimatesofthesoundnessandcompleteness,andtheformulasfromthepropositionbecome:

s (r1×r2)=k·c 1+p·c 2k·s 1+p·s 2,c (r1×r2)=k+pk+p

wheres 1,s 2,c 1,c 2areestimatesforsoundnessandcompletenessofthecorrespondingrela-tions.Forderivationofthevarianceoftheestimatessee[11].

5.3EstimatingtheGoodnessofGeneralQueries

Sofarwehaveshownhowtoestimatethesoundnessandcompletenessofselection-projectionqueriesonasinglerelation,andofCartesianproductsoftworelations.TocomputesoundnessandcompletenessofarbitraryCartesianproduct-selection-projectionqueriesitisnecessarytoshowhowtocomputegoodnessestimatesoversequencesofrelationalalgebraoperations.Theestimationofeachoperationinasequencerequiressoundnessandcompletenessbaseswitheachviewhavingitsassociatedsoundnessorcompletenessestimate.In[11]we

extendourmethodssothateachoperationdelivers,inadditiontoagoodnessestimateofitsresult,thenecessarybasesforfutureoperations.Thisprovidesuswiththeabilitytoperformsequencesofoperations.

Alegitimatequestionatthispointiswhethertheseestimatesdependontheorderinwhichtheyarecomputed,i.e.,whethertheestimatesofthegoodnessofequivalentrelationalalgebraexpressionsarethesame.Theanswertothisquestionisthattheestimatesareindependentoftheparticularexpressionused[11].

6ConclusionsandFutureResearch

Weintroducedanewmodelfordataqualityinrelationaldatabases,whichisbasedonthedualmeasuresofsoundnessandcompleteness.Thepurposeofthismodelistoprovideanswerstoarbitraryquerieswithanestimationoftheirquality.Weachievedthisbyadoptingtheconceptofabasis,whichisapartitionofthedatabaseintoviewsthatarehomogeneouswithrespecttothegoodnessmeasures.Thesebasesareconstructedusingdatabasesamples,whosegoodnessisestablishedmanually.Oncethebasesandtheirgoodnessestimatesareinplace,thegoodnessofanswerstoarbitraryqueriesisinferredrathersimply.

Weplantodevelopthecompletesetofproceduresforcalculatingsoundnessandcom-pletenessoftheanswerstootherrelationalalgebraoperations;i.e.,addproceduresforunion,di erence,andintersectionofviews.Oneofourmajorgoalsistousethesemethodstoesti-matethegoodnessofanswerstoqueriesagainstmultidatabases,wherethesamequerycouldbeanswereddi erentlybydi erentdatabases,andgoodnessinformationcanhelpresolvesuchinconsistencies.

Wehavealreadydiscussedtheadvantageofconsideringthecorrectnessofindividualattributesoverthecorrectnessofentiretuples.Still,anindividualvalueiseithercorrectorincorrect,and,whenincorrect,wedonotconsidertheproximityofastoredvaluetothetruevalue.Thisdirection,whichiscloselyrelatedtoseveraluncertaintymodelingtechniques,meritsfurtherinvestigation.

Becauseofthecostofestablishinggoodnessestimations,wehavenotedthatourmeth-odsaremostsuitableforstaticinformation.Whentheinformationisdynamic,itwouldbeadvisabletotimestamptheestimationsatthetimethattheywereobtainedandattachthesetimestampstoallqualityinferences.Onemayalsoconsidertheautomaticattenua-tionofqualityestimationsastimeprogresses.Thisdirectionisstilloutsideourimmediateobjectives.

References

[1]World,17(51),December1995.

[2]L.Breiman,J.Friedman,R.Olshen,andCh.Stone.Classi cationandRegressionTrees.

WadsworthInternationalGroup,1984.

[3]M.C.Chen,L.McNamee,andN.Matlo .Selectivityestimationusinghomogeneity

measurement.InProceedingoftheInternationalConferenceonDataEngineering,1990.

[4]W.Cochran.SamplingTechniques.JohnWiley&Sons,1963.

[5]C.Fox,A.Levitin,andT.Redman.Thenotionofdataanditsqualitydimensions.

Informationprocessingandmanagement,30(1),1994.

[6]N.KamelandR.King.Exploitingdatadistributionpatternsinmodelingtupleselec-

rmationSciences,69(1-2),1993.

[7]A.Motro.Integrity=validity+completeness.ACMTransactionsonDatabaseSystems,

14(4):480–502,December1989.

[8]A.Motro.Sourcesofuncertaintyininformationsystems.InA.MotroandPh.Smets,

editors,ProceedingsoftheWorkshoponUncertaintyManagementinInformationSys-tems:FromNeedstoSolutions,1992.

[9]A.Motro.Aformalframeworkforintegratinginconsistentanswersfrommultiplein-

formationsources.TechnicalReportISSE-TR-93-106,rmationandSoftwareSystemsEngineering,GeorgeMasonUniversity,1993.

[10]A.Motro.Panorama:Adatabasesystemthatannotatesitsanswerstoquerieswith

theirproperties.JournalofIntelligentInformationSystems,7(1),1996.

[11]A.MotroandI.Rakov.Onthespeci cation,measurement,andinferenceofthequality

ofdata.Technicalreport,rmationandSoftwareSystemsEngineering,GeorgeMasonUniversity,1996.

[12]F.OlkenandD.Rotem.Randomsamplingfromdatabases—asurvey.Statisticsand

Computing,5(1),1995.

[13]K.ParsayeandM.Chignell.IntelligentDatabaseToolsandApplications.JohnWiley

&Sons,1993.

[14]M.P.ReddyandR.Wang.Estimatingdataaccuracyinafederateddatabaseenviron-

ment.InProceedingsofCISMOD,1995.

[15]G.SaltonandM.J.McGill.IntroductiontoModernInformationRetrieval.McGraw-

Hill,NewYork,NewYork,1983.

[16]S.Thompson.Sampling.JohnWiley&Sons,1992.

[17]J.D.Ullman.DatabaseandKnowledge-BaseSystems,puterScience

Press,Rockville,Maryland,1988.

[18]R.Wang,M.Reddy,andH.Kon.Towardqualitydata:Anattribute-basedapproach.

DecisionSupportSystems,13(3-4),1995.

[19]R.Wang,V.Storey,andCh.Firth.Aframeworkforanalysisofdataqualityresearch.

IEEETransactionsonKnowledgeandDataEngineering,7(4),August1995.

本文来源:https://www.bwwdw.com/article/c134.html

Top