Estimating the quality of data in relational databases
更新时间:2023-05-31 00:31:01 阅读量: 实用文档 文档下载
- estimating推荐度:
- 相关推荐
EstimatingtheQualityofDatainRelational
Databases
AmihaiMotroandIgorRakov
DepartmentofInformationandSoftwareSystemsEngineering
GeorgeMasonUniversity
Fairfax,VA22030-4444
{ami,irakov}@gmu.edu
Abstract
Withmoreandmoreelectronicinformationsourcesbecomingwidelyavailable,theissueofthequalityofthese,often-competing,sourceshasbecomegermane.Weproposeastandardforratinginformationsourceswithrespecttotheirquality.Animportantconsiderationisthatthequalityofinformationsourcesoftenvariesconsiderablywhenspeci careaswithinthesesourcesareconsidered.Thisimpliesthattheassignmentofasingleratingofqualitytoaninformationsourceisusuallyunsatisfactory.Ofcourse,totheuserofaninformationsourcetheoverallqualityofthesourcemaynotbeasimportantasthequalityofthespeci cinformationthatthisuserisextractingfromthesource.Therefore,methodsmustbedevelopedthatwillderivereliableestimatesofthequalityoftheinformationprovidedtousers,fromthequalityspeci cationsthathavebeenassignedtothesources.Ourworkherebearsonalltheseconcerns.Wedescribeanapproachthatusesdualqualitymeasuresthatgaugethedistanceoftheinformationinadatabasefromthetruth.Wethenproposetocombinemanualveri cationwithstatisticalmethodstoarriveatusefulestimatesofthequalityofdatabases.Weconsiderthevarianceinqualitybyisolatingareasofdatabasesthatarehomogeneouswithrespecttoquality,andthenestimatingthequalityofeachseparatearea.Thesecompositeestimatesmayberegardedasqualityspeci cationsthatwillbea xedtoeachdatabase.Finally,weshowhowtoderivequalityestimatesforindividualqueriesfromsuchqualityspeci cations.
ThisworkwassupportedinpartbyDARPAgrantsN0014-92-J-4038andN0060-96-D-3202.
1Introduction
Theimportanceofdataqualityintheinformationagecannotbeoverestimated.People,businesses,andgovernmentsrelymoreandmoreoninformationintheireverydayoperations,anddatabasesofdi erentkindsaretheprimarysourceofthisinformation.Ourdependenceondatabasesgrowssimultaneouslywiththeirsize,yetmostlargedatabasescontainerrorsandinconsistencies.Thereisagrowingawarenessinthedatabaseresearchcommunity
[13,19]andamongdatabasepractitioners[1]oftheproblemofdataquality.Bynow,theneedfordataqualitymetricsandformethodsforincorporatingthemindatabasesystemsiswellunderstood.Dataqualitycanbemetricizedinanumberofdi erentwaysdependingonwhichaspectofinformationareconsideredimportant[18,5].Theadditionofdataqualitycapabilitiestodatabasesystemswillenhancedecision-makingprocesses,improvethequalityofinformationservices,and,ingeneral,providemoreaccuratepicturesofreality.Ontheotherhand,thesenewcapabilitiesofdatabasesshouldnotbedemandingintermsofresources,e.g.,theymustnotaddtoomuchcomplexitytoqueryprocessingorrequiremuchmorememorythanexistingdatabases.
Therecentadvancesinthe eldofdataqualityconcerndataatanattributevaluelevel[18]andatarelationlevel[14].Thecomprehensivesurveyofthestate-of-the-artinthe eldisgivenin[19].Therelationalalgebraextendedwithdataaccuracyestimatesbasedontheassumptionsofuniformdistributionsofincorrectvaluesacrosstuplesandattributeswas rstdescribedin[14].
Withmoreandmoreelectronicinformationsourcesbecomingwidelyavailable,theissueofthequalityofthese,often-competing,sourceshasbecomegermane.Weproposeastandardforratinginformationsourceswithrespecttotheirquality.Animportantconsiderationisthatthequalityofinformationsourcesoftenvariesconsiderablywhenspeci careaswithinthesesourcesareconsidered.Thisimpliesthattheassignmentofasingleratingofqualitytoaninformationsourceisusuallyunsatisfactory.Ofcourse,totheuserofaninformationsourcetheoverallqualityofthesourcemaynotbeasimportantasthequalityofthespeci cinformationthatthisuserisextractingfromthesource.Therefore,methodsmustbedevelopedthatwillderivereliableestimatesofthequalityoftheinformationprovidedtousers,fromthequalityspeci cationsthathavebeenassignedtothesources.
Ourworkherebearsonalltheseconcerns.Wedescribeanapproachthatusesdualqualitymeasuresthatgaugethedistanceoftheinformationinadatabasefromthetruth.Wethenproposetocombinemanualveri cationwithstatisticalmethodstoarriveatusefulestimatesofthequalityofdatabases.Weconsiderthevarianceinqualitybyisolatingareasofdatabasesthatarehomogeneouswithrespecttoquality,andthenestimatingthequalityofeachseparatearea.Thesecompositeestimatesmayberegardedasqualityspeci cationthatwillbea xedtothedatabase.Finally,weshowhowtoderivequalityestimatesforindividualqueriesfromsuchqualityspeci cations.
2OverallApproach
Ouroverallapproachforachievingthegoalsthatwerestatedintheintroductioncanbedescribedasasequenceofproblems.
Webegin,inSection3,bydescribingthedualmeasuresthatwillbeusedtogaugethequalityofdatabaseinformation.Weclaimthatthesemeasurescaptureinamostnaturalway,therelationshipofthestoredinformationtotruth,andarethereforeexcellentindicatorsofquality.
Ourmeasuresrequiretheauthenticationofdatabaseinformation,whichisaprocessthatneedstobedonebyhumans.However,weadvocatetheuseofstatisticalmethods(essentially,sampling)tokeepthemanualworkwithinacceptablelimits.ThissubjectisdiscussedinSection4.1.
Havingobtainedaccurateinformationaboutthequalityofthesamples,weproceedtopartitionthegivendatabasetoasetofcomponentsthatarehomogeneouswithrespecttoourqualitymeasures.Wethenestimatethequalityofthesecomponents,usingthesamples.Thisimpliesthatwheninformationisextractedfromasinglecomponent,itsqualityratingsareinheritedfromthecontainingcomponent.Thesemethods,describedinSections4.2and4.3,provideuswithqualityspeci cationsforthegivendatabases.
Finally,inSection5,wedescribetheprocessofinferringthequalityofanswerstoarbi-traryqueriesfromthequalityspeci cationsthathavebeenassignedtothedatabase.Ourtreatmentoftheproblemisinthecontextofrelationaldatabases,andweassumethestandardde nitionsoftherelationalmodel[17].Inparticular,thedatabasecomponentsmentionedearlierarede nedusingthemechanismofviews.Wealsomakethefollowingassumptions.
1.Queriesandviewsuseonlytheprojection,selection,andCartesianproductoperations,selectionsuseonlyrangeconditions,andprojectionsalwaysretainthekeyattribute(s).
2.Thestoredinformation(thedatabaseinstances)arerelativelystatic,andhencethequalityofdatadoesnotchangefrequently.
Becauseofspacelimitations,severalkeyissuesandsolutionsareonlysketchedinthispaper,andfullerdiscussionsareprovidedin[11].
3SoundnessandCompletenessasMeasuresofData
Quality
Wede netwomeasuresofdataqualitythataregeneralenoughtoencompassmostexistingmeasuresandaspectsofdataquality[5,19].Thebasicideasunderlyingthesemeasureswere rststatedin[7].Inthatpapertheauthorsuggestedthatdeclarationsoftheportionsofthedatabasethatareknowntobeperfectmodelsoftherealworld(andtherebytheportionsthatarepossiblyimperfect)beincludedinthede nitionofeachdatabase.Withthisinformation,thedatabasesystemcanqualifytheaccuracyoftheanswersitissuesinresponsetoqueries:eachanswerisaccompaniedbystatementsthatde netheportionsoftheanswerthatareguaranteedtobeperfect.Thisapproachusesviewstospecifytheportionsofthedatabaseortheportionsofanswersthatareperfectmodelsoftherealworld.
Morespeci cally,thisapproachinterpretsinformationquality,whichittermsintegrity,asacombinationofsoundnessandcompleteness.Adatabaseviewissoundifitincludesonlyinformationthatoccursintherealworld;adatabaseviewiscompleteifitincludesalltheinformationthatoccursintherealworld.Hence,adatabaseviewhasintegrity,ifitincludesthewholetruth(completeness)andnothingbutthetruth(soundness).Aprototypedatabasesystemthatisbasedontheseideasisdescribedin[10].Theseideaswerefurtherdevelopedin[9]andaresummarizedbelow.
GivenadatabaseschemeD,weassumetheexistenceofahypotheticaldatabaseinstanced0thatcapturesperfectlythatportionoftherealworldthatismodeledbyD(theidealortruedatabase).Inaddition,weassumeoneormoreactualinstancesdi(i≥1).Theactualinstancesareconsideredapproximationsoftheidealinstanced0.
GivenaviewV,wedenotebyv0itsextensionintheidealdatabased0(theidealortrueextensiontoV)andwedenotebyviitsextensionintheactualdatabasedi.Again,theextensionsviareapproximationsoftheidealextensionv0.
ConsiderviewV,itsidealextensionv0,andanapproximationv.Ifv v0,thenvisacompleteextension.Ifv v0,thenvisasoundextension.Obviously,anextensionwhichissoundandcompleteistheidealextension.
Withthesede nitions,eachviewextensioniseithercompleteorincomplete,andeithersoundornonsound.Wenowre nethesede nitionsbyassigningeachextensionavaluethatdenoteshowwellitapproximatestheidealextension.Weshalltermthisvaluethegoodnessoftheextension.Werequirethatthegoodnessofeachextensionbeavaluebetween0and1,thatthegoodnessoftheidealextensionbe1,andthatthegoodnessofextensionsthatareentirelydisjointfromtheidealextensionbe0.Formally,agoodnessmeasureisafunctiongonthesetofallpossibleextensionsthatsatis es
v:g(v)∈[0,1]
v:v∩v0= = g(v)=0
g(v0)=1
Asimpleapproachtogoodnessistoconsidertheintersectionoftheextensions;thatis,thetuplesthatappearinbothvandv0.Let|v|denotethenumberoftuplesinv.Then
|v∩v0|
|v|
expressestheproportionofthedatabaseextensionthatappearsinthetrueextension.Hence,itisameasureofthesoundnessofv.Similarly,
|v∩v0|
|v0|
expressestheproportionofthetrueextensionthatappearsinthedatabaseextension.Hence,itisameasureofthecompletenessofv.
Itiseasytoverifythatsoundnessandcompletenesssatisfyalltherequirementsofagood-nessmeasure.1Soundnessandcompletenessaresimilartoprecisionandrecallininformationretrieval[15].
Adisadvantageofthesemeasuresisthatadatabasetupleisassumedtobesound(andcontributetothesoundnessmeasure)onlyifitidenticaltoatupleoftheidealdatabase(sim-ilarlyinthecaseofcompleteness).Thus,atuplethatiscorrectinallbutoneattribute,andatuplethatisincorrectinallitsattributesaretreatedidentically.Anessentialre nementofthesemeasuresistoconsiderthegoodnessofindividualattributes.
AssumeaviewVhasattributesA0,A1,...,An,whereA0isthekey.2WedecomposeVintonkey-attributepairs(A0,Ai)(i=1,...,n),ingdecom-posedextensionsinthepreviously-de nedmeasuresimprovestheirusefulnessconsiderably,andweshallassumedecomposedextensionsthroughout.
Soundnessandcompletenesscanalsobeapproachedbymeansofprobabilitytheory[11].Forexample,thede nitionofsoundnesscanbeinterpretedastheprobabilityofdrawingacorrectpairfromagivenextension.Probabilisticinterpretationsgivenewinsightintothenotionsofsoundnessandcompletenessandalsohelpustoconnectthisresearchwithalargebodyofworkonuncertaintymanagementininformationsystems[8].
Thedataqualitymeasuresthathavebeenmentionedmostfrequentlyasessentialareaccuracy,completeness,currentness,andconsistency[5,18].Itispossibletorelatethesequalitymeasurestoourowngoodnessmeasures[11].
Whenvisempty,soundnessis0/0.Ifv0isalsoemptythensoundnessisde nedtobe1;otherwiseitisde nedtobe0.Similarlyforcompleteness,whenv0isempty.
2Weconsideratupleasarepresentationoftherealworldentityidenti edbyakeyattribute;thenonkeyattributesthencapturethepropertiesofthisentity.Forsimplicity,weassumethatkeysconsistofasingleattribute.1
4
4.1RatingtheQualityofDatabasesNecessaryProceduresforGoodnessEstimation
Theamountofdatainpracticaldatabasesisoftenlarge.Tocomputetheexactsoundnessandcompletenessofaparticularviewwewouldneedto(1)authenticateeveryvaluepairinthestoredview,and(2)determinehowmanypairsaremissingfromthisview.Thismethodisclearlyinfeasibleinanyrealsystem.Thus,wemustresorttosamplingtechniques[16,4].Samplingtechniquesallowustoestimatethemeanandvarianceofaparticularparameterofapopulationbyusingasamplewhichisusuallyonlyafractionofthesizeoftheentirepopulation.Thetheoryofstatisticsalsogivesusmethodsforestablishingasamplesizetoachievepredeterminedaccuracyoftheestimates.Itisthenpossibletosupplementourestimateswithcon denceintervals.Formoredetaileddiscussiononsamplingfromdatabasesthereaderisreferredtotheliteratureonthetopic(see,forexample,[12]foragoodsurvey).Notethattwodi erentpopulationsmustbesampled.Toestimatesoundnesswesamplethegiven(stored)view,whereastoestimatecompleteness,wesampletheidealview.Toestablishbothsoundnessandcompletenessitisnecessarytohaveaccesstotheidealdatabase.Forsoundness,weneedtodeterminewhetheraspeci cvaluepairofthestoreddatabaseisintheidealdatabase.Forcompleteness,itisnecessarytodeterminewhetheraspeci cpairfromtheidealdatabaseisinthestoreddatabase.Theseprocedures(verifyapairfromastoreddatabaseagainsttheidealdatabaseandretrieveanarbitrarypairfromtheidealdatabase)mustbeimplementedinanad-hocmanner[1].Foreachconcretedatabase,humanexpertisewillberequired.Theexpertwillaccessavarietyofavailablesourcestoperformthesetwoprocedures.Notethatthise ortisperformedonlyonceandonlyforasample,whichthenhelpsestimatetheoverallgoodness.
Acriticalstageofoursolutionistobuildasetofhomogeneousviewsonastoreddatabase,calledagoodnessbasis.Thegoodnessoftheviewsofthisbasiswillbemeasuredandthere-afterusedinestablishingthegoodnessofanswerstoarbitraryqueriesagainstthisdatabase.Sincewecannotguaranteeasinglesetofviewsthatwillbehomogeneouswithrespecttobothqualitymeasures,weconstructtwoseparatesets:asoundnessbasisandacompletenessbasis.Inconstructingeachbasis,weconsidereachdatabaserelationindividually.Eachre-lationmaybepartitionedbothhorizontally(byaselection)andvertically(byaprojection),andthebasiscomprisestheunionofallsuchpartitions.Selectionsarelimitedtoranges;i.e.,theselectioncriteriaisaconjunctionofconditions,whereeachindividualconditionspeci esanattributeandarangeofpermittedvaluesforthisattribute.
Weassigntoanincorrectvaluepairthevalue0andtoacorrectpairthevalue1.Thus,wecanrepresentanerrordistributionpatterninaviewextensionasatwo-dimensionalmatrixof0sand1s,inwhichrowscorrespondtothetuplesandcolumnscorrespondtotheattributesoftheview.Avalueinaparticularcellofthismatrixiseither0or1dependingonthecorrectnessofthecorrespondingpairofattributevalues.Wecallthisnewdatastructure
aviewmaporarelationmap,asappropriate.Now,thetaskistopartitionthistwo-dimensionalarrayintoareasinwhichelementsaredistributedhomogeneouslywithrespecttoourqualitymeasures.
Notethatthecorrectnessofaparticularnonkeyattributevaluecanbedeterminedonlyinreferencetothekeyattributeofthattuple,i.e.,indeterminingwhetheraspeci ccellshouldbe0or1weconsiderthecorrectnessofthepair:(keyvalue;nonkeyvalue)determiningthecorrectnessofanattributevalue.Thepairiscorrectifandonlyifbothelementsofthepairarecorrect.Thismeans,inparticular,thatifakeyattributevalueisincorrect,thenallpairscorrespondingtothiskeyattributevalueareconsideredincorrect.
ThetechniqueweuseforpartitioningtheviewmapisanonparametricstatisticalmethodcalledCART(Classi cationandRegressionTrees)[2].Thismethodhasbeenwidelyusedfordataanalysisinbiology,socialscience,environmentalresearch,andpatternrecognition.Closertoourarea,thismethodwasusedin[3]forestimatingtheselectivityofselectionqueries.Weassumethattuplesandattributesofarelationareordereduniquely.
4.2HomogeneityMeasure
Intuitively,aviewisperfectlyhomogeneouswithrespecttoagivenpropertyifeverysubviewoftheviewcontainsthesameproportionofpairswiththispropertyastheviewitself.Moreover,themorehomogeneousaview,thecloseritsdistributionofthepairswiththegivenpropertyistothedistributionintheperfectlyhomogeneousview.Hence,thedi erencebetweentheproportionofthepairswiththegivenpropertyintheviewitselfandineachofitssubviewscanbeusedtomeasurethedegreeofhomogeneityofthegivenview.
Speci cally,letv¯denoteanextensionofaviewofarelationinastoreddatabase,letv1,...,vNbethesetofallpossibleprojection-selectionviewsofv¯,lets(¯v)ands(vi)denotetheproportionofpairsinviewsv¯andvi(i=1,...,N),respectively,thatoccurintheircorrespondingidealrepresentations(i.e.,proportionsofcorrectpairsintheseviews).Then1 (s(¯v) s(vi))2
Nvi v¯
measuresthehomogeneityoftheviewv¯withrespecttosoundness.Thehomogeneitywithrespecttocompletenessisde nedanalogously.Similarmeasuresofhomogeneitywerepro-posedin[6,3].
Duetothelargenumberofpossibleviews,computationofthesemeasuresisoftenpro-hibitivelyexpensive.TheGiniindex[2,3]wasproposedasasimplealternativetothesehomogeneitymeasures.
Consideraviewv¯andarelationmapM.WecallthepartofMthatcorrespondsto
3v¯anode.TheGiniindexofthisnode,denotedG(¯v),is2p(1 p),wherepdenotesthe3Weusethetermsnodeandviewinterchangeably.
proportionof1sinthenode.4
Thesearchforhomogeneousnodesinvolvesrepeatedsplittingofnodes.TheGiniindexguaranteesthatanysplitimproves(ormaintains)thehomogeneityofdescendantnodes[2].Formally,letvbearelationmapnodewhichissplitintotwosubnodesv1andv2.ThenG(v)≥α1G(v1)+α2G(v2),whereαiis|vi|/|v|.Inotherwords,thereductionofasplit,de nedas G=G(v) α1G(v1) α2G(v2),isnon-negative.
Obviously,thebestsplitisasplitthatmaximizes G.Wecallsuchasplitamaximalsplit.Ifthenumberofpossiblesplitsis nite,therenecessarilyexistssuchasplit.Themethodofgeneratingsoundnessandcompletenessbasesisfoundedonthesearchforasplitthatmaximizesthegaininhomogeneity.Thismethodisdiscussednext.
4.3FindingaGoodnessBasis
Wedescribeaprocedureofbuildingasoundnessbasis.Theprocedureofbuildingacom-pletenessbasisissimilar.
Itisimportanttonotethattheprocedurestobediscussedinthissectionareperformedonsamplesoftherelations.Therefore,inthediscussionthatfollows,thetermsrelationandrelationmapusuallyrefertosamplesoftherelationsandmapsofthesesamples.Thus,althoughthebestsplitsarefoundusingonlysamples,theresultingviewsarelaterusedasagoodnessbasisfortheentirerelation.Careshouldbetakentoensurethatwedrawsampleswhosesizesaresu cientforrepresentingdistributionpatternsoftheoriginalrelation.
Asoundnessbasisisapartitionofthestoredrelations,inwhicheachrelationispar-titionedintoviewsthatarehomogeneouswithrespecttosoundness.Sincetheprocedureofpartitioningisrepeatedforeachrelation,itissu cienttoconsiderthisprocedureforasinglerelation.Weassumethatinformationonthecorrectnessofarelationinstancehasalreadybeenconvertedtoacorrespondingrelationmap.
Findingahomogeneouspartitionofarelationcanbeviewedasatree-buildingprocedure,wheretherootnodeofthetreeistheentirerelation,itsleafnodesarehomogeneousviewsofthisrelation,anditsintermediatenodesareviewsproducedbythesearchesformaximalsplits.Wecallthistreestructureasoundnesstree.Westartbylabelingtheentirerelationmapastherootofthetree.Wethenconsiderallthepossiblesplits,eitherhorizontalorvertical(butnotboth),andselectthesplitthatgivesmaximumgaininhomogeneity.Obviously,thebrute-forcetechniquedescribedhereisextremelyexpensive.Inpracticeweapplyseveral,substantiativeimprovements[11].
Whenthemaximalsplitisfound,webreaktherootnodeintothetwosubnodesthatachievedthemaximalsplit.Next,wesearchforamaximalsplitineachofthetwosubnodesoftherootanddividethemintwodescendentnodeseach.Theprocedureisrepeatedoneach
Ingeneral,theGiniindexisde nedformapswhoseelementsareofkdi erenttypes;theindexusedhereismuchsimpler,becauseourmapsarebinary.4
currentleafnodeofthetreeuntilaheuristicstop-splittingruleissatis edoneveryleafnode:splittingofanodestopswhenitcanprovideonlymarginalimprovementinhomogeneity.Thissituationusuallyariseswhenamaximalsplitonanodecannotseparateelementsofonetypefromelementsoftheothertypeinthisnode.Thisindicatesthatthisnodehasafairlyhomogeneousdistributionofbothtypesofelements.
Thestop-splittingrulesmentionedearlierarenecessary,becauseotherwiseatreecouldgrowuntilalltheelementsofeveryleafareofonetype.Thiscouldresultinalargenumberofsmallnodes.Italsomeansthattheremightbetoofewsampleelementsinthisnode,whichmakesthesoundnessestimateofthenodeunreliable.Ourstop-splittingruleis G·n≥threshold,wherenisthenumberofelementsinthenode[11].Ananalogousprocedureisusedforbuildingacompletenesstree.
Eachleafnodeofeverysoundnesstreecontributesoneviewtothesoundnessbasisandeachleafnodeofeverycompletenesstreecontributesoneviewtothecompletenessbasis.Together,thesesoundnessandcompletenessbasesformagoodnessbasis.Notethatthisprocessisperformedonlyonceoneveryrelation,andthegoodnessbasisneednotbechangedorupdatedlater.Theassumptionhereisthattheinformationisstatic.Whenaleafnodeisconvertedtoaview,inadditiontotherowsandcolumnsofthenode,theviewincludesthekeyattributeforthesetuples.
5
5.1EstimatingtheQualityofQueriesProjection-SelectionQueries
Assumenowaqueryissubmittedtothisdatabaseextension.Atthispoint,weconsideronlyselection-projectionqueriesonasinglerelation(andinwhichselectionsarebasedonranges).Inthissectionwediscusstheestimationofsoundnessofsuchqueries.Theconsiderationsforestimatingcompletenessarenearlyidentical.InthenextsectionwediscussqueriesthatinvolveCartesianproducts.
Becauseabasispartitionseachrelation,ananswertoaqueryintersectswithacertainnumberofbasisviews.Hence,eachofthesebasisviewscontainsacomponentoftheanswerasitssubview.Thekeyfeatureofbasisviewsistheirhomogeneitywithrespecttosoundness.Consequently,eachcomponentoftheanswerinheritsitssoundnessfromabasisview.AsshowninProposition1(see[11]forproof),thesoundnessofaviewwhichcomprisesdisjointcomponentsisaweightedsumofthesoundnessoftheindividualcomponents.Thisprovidesuswithaneasywaytodeterminethesoundnessoftheentireanswer.Asaspecialcase,whentheentireansweriscontainedinasinglebasisview,thesoundnessoftheanswerissimplythesoundnessofthecontainingview.
Proposition1Lett1andt2beleafnodesofasoundnesstreewithsoundnesss1ands2respectively,andletqbeananswertoaqueryQ.Supposealsothatq=(q∩t1)∪(q∩t2).
Thesoundnessofqis
s(q)=s1·|q∩t1||q∩t2|+s2·|q||q|
Thispropositioniseasilygeneralizedfornleafnodes,andtheanalogouspropositionistrueforcompleteness.Inpractice,weonlyhaveestimatesofs1ands2.Hence,theformulabecomes:|q∩t2||q∩t1|+s ·s (q)=s ·21|q||q|
Thevarianceoftheestimates (q)canbealsocomputed[11].
5.2EstimatingtheGoodnessofCartesianProducts
Toallowmoregeneralqueries,weconsidernowqueriesthatincludeCartesianproducts.Thefollowingproposition(see[11]forproof)describeshowtocomputethesoundnessandcompletenessoftheCartesianproductgiventhesoundnessandcompletenessofitsoperands.Proposition2Letr1andr2berelationswithsoundnessandcompletenesss1,c1ands2,c2respectively.Thesoundnessandcompletenessofther1×r2are
s(r1×r2)=k·s1+p·s2k·c1+p·c2,c(r1×r2)=k+pk+p
respectively,wherekandparethenumberofnon-keyattributesintherelationsr1andr2respectively.
Inpractice,wehaveonlyestimatesofthesoundnessandcompleteness,andtheformulasfromthepropositionbecome:
s (r1×r2)=k·c 1+p·c 2k·s 1+p·s 2,c (r1×r2)=k+pk+p
wheres 1,s 2,c 1,c 2areestimatesforsoundnessandcompletenessofthecorrespondingrela-tions.Forderivationofthevarianceoftheestimatessee[11].
5.3EstimatingtheGoodnessofGeneralQueries
Sofarwehaveshownhowtoestimatethesoundnessandcompletenessofselection-projectionqueriesonasinglerelation,andofCartesianproductsoftworelations.TocomputesoundnessandcompletenessofarbitraryCartesianproduct-selection-projectionqueriesitisnecessarytoshowhowtocomputegoodnessestimatesoversequencesofrelationalalgebraoperations.Theestimationofeachoperationinasequencerequiressoundnessandcompletenessbaseswitheachviewhavingitsassociatedsoundnessorcompletenessestimate.In[11]we
extendourmethodssothateachoperationdelivers,inadditiontoagoodnessestimateofitsresult,thenecessarybasesforfutureoperations.Thisprovidesuswiththeabilitytoperformsequencesofoperations.
Alegitimatequestionatthispointiswhethertheseestimatesdependontheorderinwhichtheyarecomputed,i.e.,whethertheestimatesofthegoodnessofequivalentrelationalalgebraexpressionsarethesame.Theanswertothisquestionisthattheestimatesareindependentoftheparticularexpressionused[11].
6ConclusionsandFutureResearch
Weintroducedanewmodelfordataqualityinrelationaldatabases,whichisbasedonthedualmeasuresofsoundnessandcompleteness.Thepurposeofthismodelistoprovideanswerstoarbitraryquerieswithanestimationoftheirquality.Weachievedthisbyadoptingtheconceptofabasis,whichisapartitionofthedatabaseintoviewsthatarehomogeneouswithrespecttothegoodnessmeasures.Thesebasesareconstructedusingdatabasesamples,whosegoodnessisestablishedmanually.Oncethebasesandtheirgoodnessestimatesareinplace,thegoodnessofanswerstoarbitraryqueriesisinferredrathersimply.
Weplantodevelopthecompletesetofproceduresforcalculatingsoundnessandcom-pletenessoftheanswerstootherrelationalalgebraoperations;i.e.,addproceduresforunion,di erence,andintersectionofviews.Oneofourmajorgoalsistousethesemethodstoesti-matethegoodnessofanswerstoqueriesagainstmultidatabases,wherethesamequerycouldbeanswereddi erentlybydi erentdatabases,andgoodnessinformationcanhelpresolvesuchinconsistencies.
Wehavealreadydiscussedtheadvantageofconsideringthecorrectnessofindividualattributesoverthecorrectnessofentiretuples.Still,anindividualvalueiseithercorrectorincorrect,and,whenincorrect,wedonotconsidertheproximityofastoredvaluetothetruevalue.Thisdirection,whichiscloselyrelatedtoseveraluncertaintymodelingtechniques,meritsfurtherinvestigation.
Becauseofthecostofestablishinggoodnessestimations,wehavenotedthatourmeth-odsaremostsuitableforstaticinformation.Whentheinformationisdynamic,itwouldbeadvisabletotimestamptheestimationsatthetimethattheywereobtainedandattachthesetimestampstoallqualityinferences.Onemayalsoconsidertheautomaticattenua-tionofqualityestimationsastimeprogresses.Thisdirectionisstilloutsideourimmediateobjectives.
References
[1]World,17(51),December1995.
[2]L.Breiman,J.Friedman,R.Olshen,andCh.Stone.Classi cationandRegressionTrees.
WadsworthInternationalGroup,1984.
[3]M.C.Chen,L.McNamee,andN.Matlo .Selectivityestimationusinghomogeneity
measurement.InProceedingoftheInternationalConferenceonDataEngineering,1990.
[4]W.Cochran.SamplingTechniques.JohnWiley&Sons,1963.
[5]C.Fox,A.Levitin,andT.Redman.Thenotionofdataanditsqualitydimensions.
Informationprocessingandmanagement,30(1),1994.
[6]N.KamelandR.King.Exploitingdatadistributionpatternsinmodelingtupleselec-
rmationSciences,69(1-2),1993.
[7]A.Motro.Integrity=validity+completeness.ACMTransactionsonDatabaseSystems,
14(4):480–502,December1989.
[8]A.Motro.Sourcesofuncertaintyininformationsystems.InA.MotroandPh.Smets,
editors,ProceedingsoftheWorkshoponUncertaintyManagementinInformationSys-tems:FromNeedstoSolutions,1992.
[9]A.Motro.Aformalframeworkforintegratinginconsistentanswersfrommultiplein-
formationsources.TechnicalReportISSE-TR-93-106,rmationandSoftwareSystemsEngineering,GeorgeMasonUniversity,1993.
[10]A.Motro.Panorama:Adatabasesystemthatannotatesitsanswerstoquerieswith
theirproperties.JournalofIntelligentInformationSystems,7(1),1996.
[11]A.MotroandI.Rakov.Onthespeci cation,measurement,andinferenceofthequality
ofdata.Technicalreport,rmationandSoftwareSystemsEngineering,GeorgeMasonUniversity,1996.
[12]F.OlkenandD.Rotem.Randomsamplingfromdatabases—asurvey.Statisticsand
Computing,5(1),1995.
[13]K.ParsayeandM.Chignell.IntelligentDatabaseToolsandApplications.JohnWiley
&Sons,1993.
[14]M.P.ReddyandR.Wang.Estimatingdataaccuracyinafederateddatabaseenviron-
ment.InProceedingsofCISMOD,1995.
[15]G.SaltonandM.J.McGill.IntroductiontoModernInformationRetrieval.McGraw-
Hill,NewYork,NewYork,1983.
[16]S.Thompson.Sampling.JohnWiley&Sons,1992.
[17]J.D.Ullman.DatabaseandKnowledge-BaseSystems,puterScience
Press,Rockville,Maryland,1988.
[18]R.Wang,M.Reddy,andH.Kon.Towardqualitydata:Anattribute-basedapproach.
DecisionSupportSystems,13(3-4),1995.
[19]R.Wang,V.Storey,andCh.Firth.Aframeworkforanalysisofdataqualityresearch.
IEEETransactionsonKnowledgeandDataEngineering,7(4),August1995.
正在阅读:
Estimating the quality of data in relational databases05-31
财务室安全生产责任状05-22
2022年最新安全员C证考试题库及答案08-01
船舶建造流程8-试验及交船 - 图文05-08
药学基础(中专起)模拟试卷03-08
金庸作品中的女性01-21
浅谈国内外绘本研究12-09
各省军区独立师历史沿革03-10
安徽财经大学商学院2018年专升本《大学语文》考试大纲03-08
学习贯彻《公务员法》实施方案07-31
- 1data guard 操作指南
- 2abaqus operating on xy data
- 3data guard 操作指南
- 4Effects of Pointers on Data Dependences
- 5Effects of Pointers on Data Dependences
- 6Reading Ability- Lexical Quality to Comprehension
- 7Recursive XML schemas, recursive XML queries, and relational
- 8Automatic evaluation of translation quality Outline of methodology and report on pilot expe
- 9Natrosol 250 Hydroxyethylcellulose Product Data
- 10data是什么意思?
- 教学能力大赛决赛获奖-教学实施报告-(完整图文版)
- 互联网+数据中心行业分析报告
- 2017上海杨浦区高三一模数学试题及答案
- 招商部差旅接待管理制度(4-25)
- 学生游玩安全注意事项
- 学生信息管理系统(文档模板供参考)
- 叉车门架有限元分析及系统设计
- 2014帮助残疾人志愿者服务情况记录
- 叶绿体中色素的提取和分离实验
- 中国食物成分表2020年最新权威完整改进版
- 推动国土资源领域生态文明建设
- 给水管道冲洗和消毒记录
- 计算机软件专业自我评价
- 高中数学必修1-5知识点归纳
- 2018-2022年中国第五代移动通信技术(5G)产业深度分析及发展前景研究报告发展趋势(目录)
- 生产车间巡查制度
- 2018版中国光热发电行业深度研究报告目录
- (通用)2019年中考数学总复习 第一章 第四节 数的开方与二次根式课件
- 2017_2018学年高中语文第二单元第4课说数课件粤教版
- 上市新药Lumateperone(卢美哌隆)合成检索总结报告
- Estimating
- relational
- databases
- quality
- data
- 复式楼装修该怎么装修
- 咨询工程师继续教育工程测量卫星定位测量试卷80分
- 2015年中国硝酸银行业市场分析报告
- 好孩子家规新十条
- 8、商业银行理财客户风险评估问卷基本模版
- 小学英语教师培训总结
- 社情民意调查工作
- 人教版初中人教初二物理下册第二学期期末考试试卷
- 11. 一汽-大众经销商基础信息管理系统指导手册
- 新视野大学英语视听说教程(第二版)3册quiz2
- 双城记英语读后感A Tale of Two City
- 胖哥槟榔市场调研策划方案2
- 国内外关于_胜任力_研究综述
- 汕头职业技术学院
- 【全国百强校】山西省太原市第五中学2015-2016学年高二上学期期末考试语文试题
- 新村党组织、村委会工作报告制度
- 孙永福院士中国快速铁路网介绍
- 职工薪酬的常见舞弊及审计对策
- 宝洁的职业发展通道及胜任力模型(荐)
- MTBF寿命试验程序文件