《數(shù)據(jù)庫(kù)系統(tǒng)》教學(xué)課件
《數(shù)據(jù)庫(kù)系統(tǒng)》教學(xué)課件,數(shù)據(jù)庫(kù)系統(tǒng),數(shù)據(jù)庫(kù),系統(tǒng),教學(xué),課件
Implementation of Relational OperationsJianlin FengSchool of SoftwareSUN YAT-SEN UNIVERSITYIntroductionnNext topic:QUERY PROCESSINGnSome database operations are EXPENSIVEnHuge performance gained by being“smart”qWell see 1,000,000 x over nave approachnMain weapons are:qclever implementation techniques for operatorsqexploiting relational algebra“equivalences”qusing statistics and cost models to chooseSimple SQL RefreshernSELECT FROM WHERE SELECT S.name,E.cid FROM Students S,Enrolled E WHERE S.sid=E.sid AND E.grade=AA Really Bad Query OptimizernFor each Select-From-Where query blockqCreate a plan that:nForms the cross product of the FROM clausenApplies the WHERE clausen(Then,as needed:qApply the GROUP BY clauseqApply the HAVING clauseqApply any projections and output expressionsqApply duplicate elimination and/or ORDER BY)predicatestablesCost-based Query Sub-System Query ParserQuery OptimizerPlan GeneratorPlan Cost EstimatorQuery Plan EvaluatorCatalog ManagerSchemaStatisticsSelect*From Blah BWhere B.blah=blahQueriesThe Query Optimization GamenGoal is to pick a“good”planqGood=low expected cost,under cost modelqDegrees of freedom:naccess methodsnphysical operatorsnoperator ordersnRoadmap for this topic:qFirst:implementing individual operatorsqThen:optimizing multiple operatorsRelational OperationsnWe will consider how to implement:qSelection ()Select a subset of rows.qProjection ()Remove unwanted columns.qJoin ()Combine two relations.qSet-difference ()Tuples in reln.1,but not in reln.2.qUnion ()Tuples in reln.1 and in reln.2.nQ:What about Intersection?Schema for ExamplesnSailors:qEach tuple is 50 bytes long,80 tuples per page,500 pages.qS=500,pS=80.nReserves:qEach tuple is 40 bytes,100 tuples per page,1000 pages.qR=1000,pR=100.Sailors(sid:integer,sname:string,rating:integer,age:real)Reserves(sid:integer,bid:integer,day:dates,rname:string)Simple SelectionsnHow best to perform?Depends on:qwhat indexes are availableqexpected size of resultnSize of result approximated as(size of R)*selectivity qselectivity estimated via statistics we will discuss shortly.SELECT *FROM Reserves RWHERE R.rname C%Our options nIf no appropriate index exists:Must scan the whole relationcost=R.For“reserves”=1000 I/Os.nWith index on selection attribute:1.Use index to find qualifying data entries2.Retrieve corresponding data recordsTotal cost=cost of step 1+cost of step 2qFor“reserves”,if selectivity=10%(100 pages,10000 tuples):nIf clustered index,cost is a little over 100 I/Os;nIf unclustered,could be up to 10000 I/Os!unless Our options Index entriesData entriesdirect search for(Index File)(Data file)Data Recordsdata entriesData entriesData RecordsCLUSTEREDUNCLUSTEREDRefinement for unclustered indexes1.Find qualifying data entries.2.Sort the rids of the data records to be retrieved.3.Fetch rids in order.Each data page is looked at just once(though#of such pages likely to be higher than with clustering).(Index File)(Data file)Data entriesData RecordsUNCLUSTEREDGeneral Selection ConditionsnFirst,convert to conjunctive normal form(CNF):q(day8/9/94 OR bid=5 OR sid=3)AND (rname=Paul OR bid=5 OR sid=3)nWe only discuss the case with no ORsnTerminology:qA B-tree index matches terms that involve only attributes in a prefix of the search key.e.g.:qIndex on matches a=5 AND b=3,but not b=3.*(day8/9/94 AND rname=Paul)OR bid=5 OR sid=32 Approaches to General SelectionsApproach I:1.Find the cheapest access path2.retrieve tuples using it3.Apply any remaining terms that dont match the indexqCheapest access path:An index or file scan that we estimate will require the fewest page I/Os.Cheapest Access Path-Examplequery:day 8/9/94 AND bid=5 AND sid=3some options:B+tree index on day;check bid=5 and sid=3 afterward.hash index on;check day8/9/94 afterward.nHow about a B+tree on?nHow about a B+tree on?nHow about a Hash index on?Approach II:use 2 or more matching indexes.1.From each index,get set of rids2.Compute intersection of rid sets3.Retrieve records for rids in intersection4.Apply any remaining termsEXAMPLE:day8/9/94 AND bid=5 AND sid=3Suppose we have an index on day,and another index on sid.qGet rids of records satisfying day8/9/94.qAlso get rids of records satisfying sid=3.qFind intersection,then retrieve records,then check bid=5.2 Approaches to General Selections(Contd.)ProjectionnIssue is removing duplicates.nUse sorting!1.Scan R,extract only the needed attributes2.Sort the resulting set3.Remove adjacent duplicatesCost:Ramakrishnan/Gehrke writes to temp table at each step!Reserves with size ratio 0.25=250 pages.With 20 buffer pages can sort in 2 passes,so:1000+250+2*2*250+250=2500 I/OsSELECT DISTINCT R.sid,R.bidFROM Reserves RProjection-improvednAvoid the temp files,work on the fly:qModify Pass 0 of sort to eliminate unwanted fields.qModify Passes 1+to eliminate duplicates.Cost:Reserves with size ratio 0.25=250 pages.With 20 buffer pages can sort in 2 passes,so:1.Read 1000 pages2.Write 250(in runs of 40 pages each)=7 runs3.Read and merge runs(20 buffers,so 1 merge pass!)Total cost=1000+250+250=1500.Other Projection TricksIf an index search key contains all wanted attrs:nDo index-only scanqApply projection techniques to data entries(much smaller!)If a B+Tree index search key prefix has all wanted attrs:nDo in-order index-only scanqCompare adjacent tuples on the fly(no sorting required!)JoinsnJoins are very common.nR S is large;so,R S followed by a selection is inefficient.nMany approaches to reduce join cost.nJoin techniques we will cover today:1.Nested-loops join2.Index-nested loops join3.Sort-merge joinSELECT *FROM Reserves R1,Sailors S1WHERE R1.sid=S1.sidR S:foreach tuple r in R doforeach tuple s in S doif ri=sj then add to resultSimple Nested Loops JoinCost=(pR*R)*S+R=100*1000*500+1000 IOsqAt 10ms/IO,Total time:?nWhat if smaller relation(S)was“outer”?nWhat assumptions are being made here?nWhat is cost if one relation can fit entirely in memory?R S:Page-Oriented Nested Loops JoinCost=R*S+R=1000*500+1000nIf smaller relation(S)is outer,cost=500*1000+500nMuch better than nave per-tuple approach!foreach page bR in R do foreach page bS in S do foreach tuple r in bR doforeach tuple s in bSdoif ri=sj then add to resultBlock Nested Loops JoinnPage-oriented NL doesnt exploit extra buffers:(nIdea to use memory efficiently:.R&Sblock of R tuples(B-2 pages)Input buffer for SOutput buffer.Join ResultCost:Scan outer+(#outer blocks*scan inner)#outer blocks=Examples of Block Nested Loops JoinnSay we have B=100+2 memory buffersnJoin cost=outer+(#outer blocks*inner)#outer blocks=outer/100nWith R as outer(R=1000):qScanning R costs 1000 IOs(done in 10 blocks)qPer block of R,we scan S;costs 10*500 I/OsqTotal=1000+10*500.nWith S as outer(S=500):qScanning S costs 500 IOs(done in 5 blocks)qPer block of S,we can R;costs 5*1000 IOsqTotal=500+5*1000.R S:Index Nested Loops Joinforeach tuple r in R doforeach tuple s in S where ri=sj doadd to resultData entriesData RecordsINDEX on Slookup(ri)sjS:R:riR S:Cost=R+(R*pR)*cost to find matching S tuples If index uses Alt.1,cost=cost to traverse tree from root to leaf.For Alt.2 or 3:1.Cost to lookup RID(s);typically 2-4 IOs for B+Tree.2.Cost to retrieve records from RID(s);depends on clustering.qClustered index:1 I/O per page of matching S tuples.qUnclustered:up to 1 I/O per matching S tuple.Index Nested Loops Joinforeach tuple r in R doforeach tuple s in S where ri=sj doadd to resultReminder:Schema for ExamplesnSailors:qEach tuple is 50 bytes long,80 tuples per page,500 pages.qS=500,pS=80.nReserves:qEach tuple is 40 bytes,100 tuples per page,1000 pages.qR=1000,pR=100.Sailors(sid:integer,sname:string,rating:integer,age:real)Reserves(sid:integer,bid:integer,day:dates,rname:string)Sort-Merge Join Example:SELECT *FROM Reserves R1,Sailors S1WHERE R1.sid=S1.sid1.Sort R on join attr(s)2.Sort S on join attr(s)3.Scan sorted-R and sorted-S in tandem,to find matchesCost of Sort-Merge JoinnCost:Sort R+Sort S+(R+S)qBut in worst case,last term could be R*S (very unlikely!)qQ:what is worst case?Suppose B=35 buffer pages:nBoth R and S can be sorted in 2 passesnTotal join cost=4*1000+4*500+(1000+500)=7500Suppose B=300 buffer pages:nAgain,both R and S sorted in 2 passesnTotal join cost=7500Block-Nested-Loop cost=2500 15,000Other Considerations 1.An important refinement:Do the join during the final merging pass of sort!If have enough memory,can do:1.Read R and write out sorted runs2.Read S and write out sorted runs3.Merge R-runs and S-runs,while finding R S matchesCost=3*R+3*SQ:how much memory is“enough”(will answer next time)2.Sort-merge join an especially good choice if:one or both inputs are already sorted on join attribute(s)output is required to be sorted on join attributes(s)Q:how to take these savings into account?(stay tuned)SummarynA virtue of relational DBMSs:queries are composed of a few basic operatorsqThe implementation of these operators can be carefully tunednMany alternative implementation techniques for each operatorqNo universally superior technique for most operators.nMust consider available alternativesqCalled“Query optimization”-we will study this topic soon!
收藏