《數(shù)據(jù)庫系統(tǒng)》教學(xué)課件
《數(shù)據(jù)庫系統(tǒng)》教學(xué)課件,數(shù)據(jù)庫系統(tǒng),數(shù)據(jù)庫,系統(tǒng),教學(xué),課件
Overview of File Organizations and IndexingJianlin FengSchool of SoftwareSUN YAT-SEN UNIVERSITYcourtesy of Joe Hellerstein for some slidesContextQuery Optimizationand ExecutionRelational OperatorsFiles and Access MethodsBuffer ManagementDisk Space ManagementDBGoal for TodaynBig picture of overheads for data accessqWell simplify things to get focusedqStill,a bit of discipline:nClearly identify assumptions nThen estimate cost in a principled waynFoundation for query optimizationqCant choose the fastest scheme without an estimate of speed!Alternative File OrganizationslMany alternatives exist,each good for some situations,and not so good in others:qHeap files:Suitable when typical access is a file scan retrieving all records.qSorted Files:Best for retrieval in search key order,or only a“range”of records is needed.qClustered Files(with Indexes):Coming soonCost Model for AnalysisnB:The number of data blocksnR:Number of records per blocknD:(Average)time to read or write disk blocknAverage-case analyses for uniform random workloadsnWe will ignore:qSequential vs.Random I/O qPre-fetchingqAny in-memory costs*Good enough to show the overall trends!More AssumptionsnSingle record insert and delete.nEquality selectionqexactly one match nFor Heap Files:qInsert always appends to end of file.nFor Sorted Files:qFiles compacted after deletions.qSelections on search key.Cost of Operations B:The number of data pagesR:Number of records per pageD:(Average)time to read or write disk pageHeap FileSorted FileClustered FileScan all recordsEquality SearchRange SearchInsertDeleteCost of Operations B:The number of data pagesR:Number of records per pageD:(Average)time to read or write disk pageHeap FileSorted FileClustered FileScan all recordsBDBDEquality SearchRange SearchInsertDeleteCost of Operations B:The number of data pagesR:Number of records per pageD:(Average)time to read or write disk pageHeap FileSorted FileClustered FileScan all recordsBDBDEquality Search0.5 BD(log2 B)*DRange SearchInsertDeleteCost of Operations B:The number of data pagesR:Number of records per pageD:(Average)time to read or write disk pageHeap FileSorted FileClustered FileScan all recordsBDBDEquality Search0.5 BD(log2 B)*DRange SearchBD(log2 B)+#match pg*DInsertDeleteCost of Operations B:The number of data pagesR:Number of records per pageD:(Average)time to read or write disk pageHeap FileSorted FileClustered FileScan all recordsBDBDEquality Search0.5 BD(log2 B)*DRange SearchBD(log2 B)+#match pg*DInsert2D(log2B)+B)DDeleteCost of Operations B:The number of data pagesR:Number of records per pageD:(Average)time to read or write disk pageHeap FileSorted FileClustered FileScan all recordsBDBDEquality Search0.5 BD(log2 B)*DRange SearchBD(log2 B)+#match pg*DInsert2D(log2B)+B)DDelete0.5BD+D(log2B)+B)DIndexesnAllow record retrieval by value in one or more fieldsqFind all students in the“CS”departmentqFind all students with a gpa 3nIndex:disk-based data structure for fast lookup by valueqSearch key:any subset of columns in the relation.qSearch key need not be a key of the relationnCan have multiple items matching a lookupnIndex contains a collection of data entriesqItems associated with each search key value kqData entries come in various forms,as well see1st Question to Ask About IndexesnWhat kinds of selections(lookups)do they support?qSelection:qEquality selections(op is=)?qRange selections(op is one of,=,BETWEEN)?qMore exotic selections?n2-dimensional ranges(“east of Berkeley and west of Truckee and North of Fresno and South of Eureka”)qOr n-dimensionaln2-dimensional radii(“within 2 miles of Soda Hall”)qOr n-dimensionalnRanking queries(“10 restaurants closest to Berkeley”)nRegular expression matches,genome string matches,etc.nOne common n-dimensional index:R-treeIndex BreakdownnWhat selections does the index supportnRepresentation of data entries in indexqi.e.,what kind of info is the index actually storing?n3 alternatives herenClustered vs.Unclustered IndexesnSingle Key vs.Composite IndexesnTree-based,hash-based,otherAlternatives for Data Entry k*in IndexnThree alternatives:1.Actual data record(with key value k)2.,rid:record id3.nChoice is orthogonal to the indexing technique.qB+trees,hash-based structures,R trees,GiSTs,nCan have multiple(different)indexes per file.qE.g.file sorted by age,with a hash index on salary,and a B+tree index on name.Alternatives for Data Entries(Contd.)nAlternative 1:Actual data record(with key value k)qIndex as a file organization for records nAlongside Heap files or sorted filesqAt most one Alternative 1 index per relation qNo“pointer lookups”to get data records Alternatives for Data Entries(Contd.)Alternative 2 and Alternative 3 qMust use Alternatives 2 or 3 to support 1 index per relation.qAlternative 3 more compact than Alternative 2,but variable sized data entries neven if search keys are of fixed length.qFor large rid lists,data entry spans multiple blocks!Index ClassificationnClustered vs.Unclustered:nClustered index:qorder of data records the same as,or close to,order of index data entriesqA file can be clustered on at most one search key.qCost of retrieving data records through index varies greatly based on whether index is clustered or not!qAlternative 1 implies clustered,but not vice-versa.nNote:another definition of“clustering”qData mining,AI,statisticsClustered vs.Unclustered IndexnAlternative 2 data entries,data records in a Heap file.qTo build clustered index,first sort the Heap file nwith some free space on each block for future insertsqOverflow blocks may be needed for inserts.nThus,order of data records is close to,but not identical to,the sort order.Index entriesData entriesdirect search for(Index File)(Data file)Data Recordsdata entriesData entriesData RecordsCLUSTEREDUNCLUSTEREDUnclustered vs.Clustered IndexesnClustered ProsqEfficient for range searchesqSupports some types of compressionnMore soonqPossible locality benefitsnDisk scheduling,prefetching,etc.nClustered ConsqMore expensive to maintain non the fly or“sloppily”via reorganizationsnHeap file usually only packed to 2/3 to accommodate insertsCost of Operations B:The number of data pagesR:Number of records per pageD:(Average)time to read or write disk pageHeap FileSorted FileClustered FileScan all recordsBDBD1.5 BDEquality Search0.5 BD(log2 B)*D(logF 1.5B+1)*DRange SearchBD(log2 B)+#match pg*D(logF 1.5B)+#match pg*DInsert2D(log2B)+B)D(logF 1.5B)+2)*DDelete0.5BD+D(log2B)+B)D (because R,W 0.5)(logF 1.5B)+2)*DComposite Search KeysnSearch on a combination of fields.qEquality query:Every field value is equal to a constant value.E.g.wrt index:nage=20 and sal=75qRange query:Some field value is not a constant.E.g.:nage 20;or age=20 and sal 10nData entries in index can be sorted by search key to support range queries.qLexicographic order qLike the dictionary,but on fields,not letters!sue 1375bobcaljoe121020801112name age sal12,2012,1011,8013,7520,1210,1275,1380,111112121310207580Data recordssorted by nameData entries in indexsorted by Data entriessorted by Examples of composite keyindexes using lexicographic order.SummarynFile Layer manages access to records in pages.qRecord and page formats depend on fixed vs.variable-length.qFree space management is an important issue.qSlotted page format supports variable length records and allows records to move on page.nMany alternative file organizations exist,each appropriate in some situation.nIf selection queries are frequent,sorting the file or building an index is important.qHash-based indexes only good for equality search.qSorted files and tree-based indexes best for range search;also good for equality search.(Files rarely kept sorted in practice;B+tree index is better.)nIndex is a collection of data entries plus a way to quickly find entries with given key values.Summary(Contd.)nData entries in index can be one of 3 alternatives:(1)actual data records,(2)pairs,or(3)pairs.qChoice orthogonal to indexing structure(i.e.,tree,hash,etc.).nUsually have several indexes on a given file of data records,each with a different search key.nIndexes can be classified as clustered vs.unclusterednDifferences have important consequences for utility/performance.nCatalog relations store information about relations,indexes and views.
收藏