《數(shù)據(jù)庫(kù)系統(tǒng)》教學(xué)課件
《數(shù)據(jù)庫(kù)系統(tǒng)》教學(xué)課件,數(shù)據(jù)庫(kù)系統(tǒng),數(shù)據(jù)庫(kù),系統(tǒng),教學(xué),課件
Final Exam ReviewJianlin FengSchool of SoftwareSUN YAT-SEN UNIVERSITYcourtesy of Joe Hellerstein for some slidesTopics CoverednRelational ModelnDisks,Files,Buffers,IndexesnLanguages:Algebra,Calculus,SQLnQuery Processing:Sorting,Hashing,Join AlgorithmsnQuery OptimizationnSchema Refinement and NormalizationnConcurrency ControlnCrash RecoverynSecuritynData WarehousingOverviewnPurpose of this course:give students bothqAn understanding of what systems do,why we use them,and how to use common databases efficiently,qand an understanding of how databases work internally.IntroductionnWhat are databases?nData models nWhat does a DBMS provide that the OS does not?q Levels of Abstraction,Data Independence,Concurrency Control,Crash Recovery,etc.Introduction(Contd.)nschemas&data independenceqconceptual schemaqphysical schemaqexternal schema(view)qlogical&physical data independenceThe Relational Data Model BasicsnComponents of the model:qRelations,Attributes,TuplesnSQL Data Definition LanguagenIntegrity Constraintsqhow do they come into being?qunderstand what you can learn from schema vs.instance!nReferential Integrityqa state which holds when all foreign key constraints are enforced.Relational Model(Contd.)nKeys,Primary Keys,Foreign Keys,Candidate Keys nForeign key:Set of fields in one relation that is used to refer to a tuple in another relation.qMust correspond to primary key of the second relation.qLike a logical pointer.nA set of fields is a key for a relation if:1.No two tuples can have same values in all key fields,and2.This is not true for any subset of the key.qPart 2 false?A superkey.qIf theres 1 key for a relation,all are candidate keys.One of the candidate keys is chosen to be primary key.nE.g.,sid is a key for Students.(What about name?)The set sid,gpa is a superkey.The Entity-Relationship Data ModelnTypically also tool for conceptual design nComponents of the model:qEntities,Attributes,Relationships qCardinality of relationships,key constraints qParticipation ConstraintsqWeak entities nBe able understand an E-R diagramnBe able to translate an E-R Diagram into Relational schemanBe able to modify an E-R diagramMemory ManagementnHierarchy of storage:RAM,Disk,TapenAdvantages/disadvantages of different types of storage.nBuffer management qYou should know basics from project!nUnderstand data structures requirednnotions of replacement,dirty pages,pinning pagesqUnderstand different replacement policiesnLRU,MRU,CLOCKnbe able to simulate each,understand pros,cons!Memory Management(Contd.)nOrganizing records in pages qFixed&variable-length fields in tuplesn2 alternatives for variable-length fieldsqFixed&variable-length tuples on pagesnknow what a RID is,how it interacts with page layoutnknow details of“slotted page”with slot directorynOrganizing pages in filesFile OrganizationnDifferent file organizations:heap files,sorted files,hashed files qbe able to compute costs of operations over each!nAccess costs for different organizations File Organization:IndexesnUnderstand search keys(vs.key constraints!)n3 alternatives for data entriesData record with key value k qChoice of alternative orthogonal to indexing technique!Tree Structured IndexesnTrees do range and equality searchnISAM,rules for adding and removing entries nB-Trees,rules for adding entries qYou do not have to do“coalescing”on B+-tree deletionB+Tree StructurenEach node contains d=m =2d entries(index or data)qThe parameter d is called the order of the tree.qEach internal node contains m index entries:.qEach leaf node contains m data entries:nThe ROOT node contains between 1 and 2d index entries.qIt is a leaf or has at least two children.nEach path from the ROOT to any leaf has the same length.nSupports equality and range-searches efficiently.Index EntriesData Entries(Sequence set)(Direct search)External Sorting,HashingnHow to sort any file using 3 memory Pages nHow to sort/hash in as few passes given some amount of memory nRelationship between buffers of memory,size of file,and number of passes to mergenDuality of sort and hashnApplication to duplicate elimination,group byRelational AlgebranQuery language operating on relations nKnow the operators!q,r,qknow schemas of output relationsqknow varieties of joinsn(conditional vs.equi vs.natural),divisionqbe able to express complex operations in terms of simple onesnUse relational algebra to express queries written in English,and vice versa.Relational CalculusnDeclarative query language for tuples nformulas,operators,in,not,there-exists,for-all nUse relational calculus to express simple queries written in English,and vice versa SQLnDDL:“Create Table”nDML:Delete From,Insert Into,UpdatenBasic Query:select from where nUse of Distinct clausenSet operations:Union,Except,IntersectSQL Query Languagenbasic queriesnconceptual evaluation nrange variables nexpressions nstrings nUnion,Intersect,In,Except,Exists nnested queries,correlated and not nset comparisonnAggregationqoperators:Count,Avg,Any qGroup By,Having Conceptual SQL EvaluationSELECT DISTINCT target-listFROM relation-listWHERE qualificationGROUP BY grouping-listHAVING group-qualificationSELECTRelation cross-product Apply selections(eliminate rows)Project away columns(just keep those used in SELECT,GBY,HAVING)WHEREFROMGROUP BYForm groups&aggregateHAVINGEliminate groupsDISTINCTEliminate duplicatesConceptual SQL EvaluationSELECT DISTINCT target-listFROM relation-listWHERE qualificationGROUP BY grouping-listHAVING group-qualificationSELECTRelation cross-product Apply selections(eliminate rows)Project away columns(just keep those used in SELECT,GBY,HAVING)WHEREFROMGROUP BYForm groups&aggregateHAVINGEliminate groupsDISTINCTEliminate duplicatesORDER BYSQL ContinuednAggregationqCount,Sum,Avg,Max,MinqGroup By,Having clausesnNested Queriesqin Where or From Clauseqset comparison:n In,Exists,Uniquenop Any,op Allqcorrelated vs.uncorrelatedImplementation of Relational OperatorsnImportant Operators:qSelectionqProjectionqSet operationsnSet-Difference,UnionqAggregationqJoinnUnderstand cost estimation,selectivitySelectionnwith no index,scan entire relationnwith clustered index,use indexnwith unclustered index,sort ridsnhash index only good for equality selectionnwith multiple selection conditions,eitherqscan entire table testing all conditionsquse index on most restrictive condition first,scan result for other conditionsquse index for each condition,do set intersection on RIDsProjectionnHard part:removing duplicates(if necessary)nCan remove duplicates by sorting or hashingnIf index contains projected attr,can do index-only scanSet Operations&AggregationnSet operations:qIntersection,cross product treated like joinsqUnion(Distinct)and Set Difference both involve finding duplicates between two sets,treat similar to ProjectnAggregationqwithout Group By,must scan entire relationqwith Group By,could sort,then scannOr?JoinsnTwo relations:inner N and outer MnSimple Nested Loops:qfor each outer tuple,scan inner for matchesqcost:M+#tuples in M*NnPaged Nested Loopsqfor each page in M,scan inner for matches to any tuple in that M pageqcost:M+M*NnBlocked Nested Loopsqlike paged,except put as much of M in memory as possible,leaving 1 page for N and 1 page for outputqcost:M+(M/(block size)*NJoins(Contd.)nIndexed Nested Loopsqfor each tuple in M,use index to find matches in Nqcost:M+#tuples in M*cost to use index to get tuplesnSort-Merge JoinqSort each table,merge finding like values,qcan be bad if many duplicatesqcost:M log M+N log N+M+NnHash Joinqpartition both relations into buckets,read in one bucket from M at a time,match with elements from same bucket in Nqcost:partioning 2*(M+N)plus matching(M+N)Query OptimizationnSome operations are commutative,associativenMay change the order of many operations in querynDramatic changes in cost depending on operation ordern“Query Plan”-a tree of operations indicating operation implementations and ordernIdeally find optimal plannIn reality avoid terrible plansOptimizer Implementationnneed iterator interface so each operation passes tuples on to the nextnneed cost estimator to determine costs of different plansqReduction Factor of operationsnneed statistics and catalogs to estimate costsSystem R Optimizer in Actionnconvert SQL to relational algebranfind alternate plansqfor each relation,consider all access pathsqfor multiple relations,consider different join algorithms,access pathsqfor join orders,only consider left-deep treesFunctional Dependencies(FDs)nAn FD X Y holds over relation schema R if,for every allowable instance r of R:t1 r,t2 r,X(t1)=X(t2)implies Y(t1)=Y(t2)(t1,t2 are tuples;X,Y are sets of attributes)nIn other words:X Y meansqGiven any two tuples in r,if the X values are the same,then the Y values must be the same.(but not vice versa)nRead“”as“determines”P(pán)roblems Due to R WnUpdate anomaly:Can we modify W in only the 1st tuple of SNLRWH?nInsertion anomaly:What if we want to insert an employee and dont know the hourly wage for his or her rating?(or we get it wrong?)nDeletion anomaly:If we delete all employees with rating 5,we lose the information about the wage for rating 5!Hourly_EmpsRules of InferencenArmstrongs Axioms(X,Y,Z are sets of attributes):qReflexivity:If X Y,then X Y qAugmentation:If X Y,then XZ YZ for any ZqTransitivity:If X Y and Y Z,then X ZnSound and complete inference rules for FDs!qusing AA you get only the FDs in F+and all these FDs.nSome additional rules(that follow from AA):qUnion:If X Y and X Z,then X YZqDecomposition:If X YZ,then X Y and X ZAttribute ClosurenComputing closure F+of a set of FDs F is hard:qexponential in#attrs!nTypically,just check if X Y is in F+.Efficient!qCompute attribute closure of X(denoted X+)wrt F.X+=Set of all attributes A such that X A is in F+nX+:=XnRepeat until no change:if U V F,U X+,then add V to X+qCheck if Y is in X+qApproach can also be used to find the keys of a relation.nIf X+=R,then X is a superkey for R.nQ:How to check if X is a“candidate key”?Boyce-Codd Normal Form (BCNF)nReln R with FDs F is in BCNF if,for all X A in F+qA X (called a trivial FD),orqX is a superkey for R.nIn other words:“R is in BCNF if the only non-trivial FDs over R are key constraints.”Decomposition of a Relation SchemenHow to normalize a relation?qdecompose into multiple normalized relationsnSuppose R contains attributes A1.An.A decomposition of R consists of replacing R by two or more relations such that:qEach new relation scheme contains a subset of the attributes of R,andqEvery attribute of R appears as an attribute of at least one of the new relations.Lossless Join DecompositionsnDecomposition of R into X and Y is lossless-join w.r.t.a set of FDs F if,for every instance r that satisfies F:px(r)pY(r)=rnIt is always true that r px(r)px(r)qIn general,the other direction does not hold!qIf it does,the decomposition is lossless-join.nDefinition extended to decomposition into 3 or more relations in a straightforward way.nIt is essential that all decompositions used to deal with redundancy be lossless!(Avoids Problem#1)Lossy Decomposition(example)A B;C B=Lossless Decomposition(example)A B;C BBut,now we cant check A B without doing a join!=Dependency Preserving Decompositions(Contd.)nDecomposition of R into X and Y is dependency preserving if (FX FY)+=F+qi.e.,if we consider only dependencies in the closure F+that can be checked in X without considering Y,and in Y without considering X,these imply all dependencies in F+.q(just the formalism of our intuition above)nImportant to consider F+in this definition:qABC,A B,B C,C A,decomposed into AB and BC.qIs this dependency preserving?Is C A preserved?nNote:F+contains F A C,B A,C B,soqFAB A B,B A;FBC B C,C BqSo,(FAB FBC)+C AThird Normal Form (3NF)nReln R with FDs F is in 3NF if,for all X A in F+A X (called a trivial FD),orX is a superkey of R,orA is part of some candidate key(not superkey!)for R.(sometimes stated as“A is prime”)nMinimality of a candidate key is crucial in third condition above!nIf R is in BCNF,obviously in 3NF.nIf R is in 3NF,some redundancy is possible.It is a compromise,used when BCNF not achievable(e.g.,no good decomp,or performance considerations).qLossless-join,dependency-preserving decomposition of R into a collection of 3NF relations always possible.Minimal Cover for a Set of FDsnMinimal cover G for a set of FDs F:qClosure of F =closure of G.qRight hand side of each FD in G is a single attribute.qIf we modify G by deleting an FD or by deleting attributes from an FD in G,the closure changes.nIntuitively,every FD in G is needed,and as small as possible in order to get the same closure as F.ne.g.,A B,ABCD E,EF GH,ACDF EG has the following minimal cover:qA B,ACD E,EF G and EF HnM.C.implies Lossless-Join,Dep.Pres.Decomp!q(in book)Concurrency ControlnTransaction:basic unit of operationqmade up of reads and writesnGoal:ACID TransactionsnA&D are provided by Crash RecoverynC&I are provided by Concurrency ControlnBottom line:reads and writes for various transactions MUST be ordered such that the final state of the database is the same as some serial ordering of the transactionsApproaches to Concurrency Controln2PL-all objects have Shared and eXclusive locksqonce one lock is released,no more locks may be acquiredqStrict 2PL:dont release locks until commit timenLocking issuesqmust either prevent or detect deadlockqmay want multiple granularity locks(table,page,record)using IS,IX,SIX,S,X locks(check compatibility matrix!)qlocking in B-trees usually not 2PLqphantom problem:locking all records of a given criteria(e.g.,age 20)Crash RecoverynACID-need way to ensure A&DnWe studied approach of Aries systemnBuffer management Steal,no ForcenEvery Write to a page is first logged in WALqlog record is in stable storage before data page on diskqlog record has Xact#,before value,after valuenCheckpoints record which pages dirty,which XActs runningTransaction Commitnwrite Commit record to lognflush log tail to stable storagenremove Xact from Xact tablenwrite End record to logTransaction Abortnwrite Abort record to logngo back through log,undoing each write(and add CLR to log)nwhen done,write End record to logCrash Recovery-3 phasesnAnalysis:starting from checkpoint,go forward in the log to see:qwhat pages were dirtyqwhat transactions were active at time of crashnRedo:start from oldest log record(oldest recLSN)that wrote to a dirty page,and redo all writes to dirty pages.nUndo:start at the end of the log(time of crash),work backward undoing all writes made by transactions that were active at time of crashSecurity and AuthorizationnThree main security objectives:qsecrecy,integrity,availability.nDBA is responsible for overall security.qDesigns security policy,maintains an audit trail,or history of users accesses to DB.nTwo main approaches to DBMS security:discretionary and mandatory access control.qDiscretionary control based on notion of privileges.qMandatory control based on notion of security classes.WarehousingnData warehouses are large,consolidated data repositories built for decision support.nWarehouses exploited using sophisticated analysis techniques:qData Warehousing Concepts and Architectures;qMulti-Dimensional Data Model;qData Cube;ROLL-UP;DRILL-DOWN;qBitmap indexqView Materialization.
收藏