《數(shù)據(jù)庫(kù)系統(tǒng)》英文教學(xué)課件
《數(shù)據(jù)庫(kù)系統(tǒng)》英文教學(xué)課件,數(shù)據(jù)庫(kù)系統(tǒng),數(shù)據(jù)庫(kù),系統(tǒng),英文,教學(xué),課件
Implementation of Relational OperationsR&G-Chapters 12 and 14IntroductionTodays topic:QUERY PROCESSINGSome database operations are EXPENSIVECan greatly improve performance by being“smart”e.g.,can speed up 1,000,000 x over nave approachMain weapons are:1.clever implementation techniques for operators2.exploiting relational algebra“equivalences”3.using statistics and cost models to choose among these.A Really Bad Query OptimizerFor each Select-From-Where query blockCreate a plan that:Forms the cross product of the FROM clauseApplies the WHERE clauseThen,as needed:Apply the GROUP BY clauseApply the HAVING clauseApply any projections and output expressionsApply duplicate elimination and/or ORDER BYpredicatestablesCost-based Query Sub-System Query ParserQuery OptimizerPlan GeneratorPlan Cost EstimatorQuery Plan EvaluatorCatalog ManagerSchemaStatisticsSelect*From Blah BWhere B.blah=blahQueriesThe Query Optimization GameGoal is to pick a“good”planGood=low expected cost,under cost modelDegrees of freedom:access methodsphysical operatorsoperator ordersRoadmap for this topic:First:implementing individual operatorsThen:optimizing multiple operatorsRelational OperationsWe will consider how to implement:Selection ()Select a subset of rows.Projection ()Remove unwanted columns.Join ()Combine two relations.Set-difference (-)Tuples in reln.1,but not in reln.2.Union ()Tuples in reln.1 and in reln.2.Q:What about Intersection?Schema for ExamplesSimilar to old schema;rname added for variations.Sailors:Each tuple is 50 bytes long,80 tuples per page,500 pages.S=500,pS=80.Reserves:Each tuple is 40 bytes,100 tuples per page,1000 pages.R=1000,pR=100.Sailors(sid:integer,sname:string,rating:integer,age:real)Reserves(sid:integer,bid:integer,day:dates,rname:string)Simple SelectionsHow best to perform?Depends on:what indexes are availableexpected size of resultSize of result approximated as(size of R)*selectivity selectivity estimated via statistics we will discuss shortly.SELECT *FROM Reserves RWHERE R.rname C%Our options If no appropriate index exists:Must scan the whole relationcost=R.For“reserves”=1000 I/Os.With index on selection attribute:1.Use index to find qualifying data entries2.Retrieve corresponding data recordsTotal cost=cost of step 1+cost of step 2For“reserves”,if selectivity=10%(100 pages,10000 tuples):If clustered index,cost is a little over 100 I/Os;If unclustered,could be up to 10000 I/Os!unless Our options Index entriesData entriesdirect search for(Index File)(Data file)Data Recordsdata entriesData entriesData RecordsCLUSTEREDUNCLUSTEREDRefinement for unclustered indexes1.Find qualifying data entries.2.Sort the rids of the data records to be retrieved.3.Fetch rids in order.Each data page is looked at just once(though#of such pages likely to be higher than with clustering).(Index File)(Data file)Data entriesData RecordsUNCLUSTEREDGeneral Selection ConditionsFirst,convert to conjunctive normal form(CNF):(day8/9/94 OR bid=5 OR sid=3)AND (rname=Paul OR bid=5 OR sid=3)We only discuss the case with no ORsTerminology:(ref 12.2.2)A B-tree index matches terms that involve only attributes in a prefix of the search key.e.g.:Index on matches a=5 AND b=3,but not b=3.*(day8/9/94 AND rname=Paul)OR bid=5 OR sid=32 Approaches to General SelectionsApproach I:1.Find the cheapest access path2.retrieve tuples using it3.Apply any remaining terms that dont match the indexCheapest access path:An index or file scan that we estimate will require the fewest page I/Os.Cheapest Access Path-Examplequery:day 8/9/94 AND bid=5 AND sid=3some options:B+tree index on day;check bid=5 and sid=3 afterward.hash index on;check day8/9/94 afterward.How about a B+tree on?How about a B+tree on?How about a Hash index on?Approach II:use 2 or more matching indexes.1.From each index,get set of rids2.Compute intersection of rid sets3.Retrieve records for rids in intersection4.Apply any remaining termsEXAMPLE:day8/9/94 AND bid=5 AND sid=3Suppose we have an index on day,and another index on sid.Get rids of records satisfying day8/9/94.Also get rids of records satisfying sid=3.Find intersection,then retrieve records,then check bid=5.2 Approaches to General SelectionsProjectionIssue is removing duplicates.Use sorting!1.Scan R,extract only the needed attributes2.Sort the resulting set3.Remove adjacent duplicatesCost:Reserves with size ratio 0.25=250 pages.With 20 buffer pages can sort in 2 passes,so:1000+250+2*2*250+250=2500 I/OsSELECT DISTINCT R.sid,R.bidFROM Reserves RProjection-improvedModify the external sort algorithm:Modify Pass 0 to eliminate unwanted fields.Modify Passes 1+to eliminate duplicates.Cost:Reserves with size ratio 0.25=250 pages.With 20 buffer pages can sort in 2 passes,so:1.Read 1000 pages2.Write 250(in runs of 40 pages each,Ref 13.3.1)=7 runs3.Read and merge runs(20 buffers,so 1 merge pass!)Total cost=1000+250+250=1500.Other Projection TricksIf an index search key contains all wanted attrs:Do index-only scanApply projection techniques to data entries(much smaller!)If a B+Tree index search key prefix has all wanted attrs:Do in-order index-only scanCompare adjacent tuples on the fly(no sorting required!)JoinsJoins are very common.R S is large;so,R S followed by a selection is inefficient.Many approaches to reduce join cost.Join techniques we will cover today:1.Nested-loops join2.Index-nested loops join3.Sort-merge joinSELECT *FROM Reserves R1,Sailors S1WHERE R1.sid=S1.sidforeach tuple r in R doforeach tuple s in S doif ri=sj then add to resultSimple Nested Loops JoinCost=(pR*R)*S+R=100*1000*500+1000 IOsAt 10ms/IO,Total time:?What if smaller relation(S)was“outer”?What assumptions are being made here?What is cost if one relation can fit entirely in memory?R S:Page-Oriented Nested Loops JoinCost=R*S+R=1000*500+1000If smaller relation(S)is outer,cost=500*1000+500Much better than nave per-tuple approach!foreach page bR in R do foreach page bS in S do foreach tuple r in bR doforeach tuple s in bSdoif ri=sj then add to resultR S:Block Nested Loops JoinPage-oriented NL doesnt exploit extra buffers:(Idea to use memory efficiently:.R&Sblock of R tuples(B-2 pages)Input buffer for SOutput buffer.Join ResultCost:Scan outer+(#outer blocks*scan inner)#outer blocks=Examples of Block Nested Loops JoinSay we have B=100+2 memory buffersJoin cost=outer+(#outer blocks*inner)#outer blocks=outer/100With R as outer(R=1000):Scanning R costs 1000 IOs(done in 10 blocks)Per block of R,we scan S;costs 10*500 I/OsTotal=1000+10*500.With S as outer(S=500):Scanning S costs 500 IOs(done in 5 blocks)Per block of S,we can R;costs 5*1000 IOsTotal=500+5*1000.Index Nested Loops Joinforeach tuple r in R doforeach tuple s in S where ri=sj doadd to resultR S:Data entriesData RecordsINDEX on Slookup(ri)sjS:R:ri Cost=R+(R*pR)*cost to find matching S tuples If index uses Alt.1,cost=cost to traverse tree from root to leaf.For Alt.2 or 3:1.Cost to lookup RID(s);typically 2-4 IOs for B+Tree.2.Cost to retrieve records from RID(s);depends on clustering.Clustered index:1 I/O per page of matching S tuples.Unclustered:up to 1 I/O per matching S tuple.Index Nested Loops Joinforeach tuple r in R doforeach tuple s in S where ri=sj doadd to resultR S:Sort-Merge Join Example:SELECT *FROM Reserves R1,Sailors S1WHERE R1.sid=S1.sid1.Sort R on join attr(s)2.Sort S on join attr(s)3.Scan sorted-R and sorted-S in tandem,to find matchesCost of Sort-Merge JoinCost:Sort R+Sort S+(R+S)But in worst case,last term could be R*S (very unlikely!)Q:what is worst case?Suppose B=35 buffer pages:Both R and S can be sorted in 2 passesTotal join cost=4*1000+4*500+(1000+500)=7500Suppose B=300 buffer pages:Again,both R and S sorted in 2 passesTotal join cost=7500Block-Nested-Loop cost=2500 15,000Other Considerations 1.An important refinement:Do the join during the final merging pass of sort!If have enough memory,can do:1.Read R and write out sorted runs2.Read S and write out sorted runs3.Merge R-runs and S-runs,and find R S matchesCost=3*R+3*SQ:how much memory is“enough”(will answer next time)2.Sort-merge join an especially good choice if:one or both inputs are already sorted on join attribute(s)output is required to be sorted on join attributes(s)Q:how to take these savings into account?(stay tuned)SummaryA virtue of relational DBMSs:queries are composed of a few basic operatorsThe implementation of these operators can be carefully tunedMany alternative implementation techniques for each operatorNo universally superior technique for most operators.Must consider available alternativesCalled“Query optimization”-we will study this topic soon!
收藏