Data warehouses are often very large, so you have to deal with performance problems. You can use index, data compression, columnstored indexed, T-SQL query skill ... to enhance performance.
Indexing Dimensions and Fact Tables
Indexing
clustered indexes
store every table with a clustered index, and surrogate keys are ideal for clustered indexes.
Nonclustered indexes
Nonclustered indexes generally don't help DW queries much. However, this does not mean that you shouldn't create any nonclustered indexes in your DW. An attribute of a dimension is not a good candidate for a nonclustered index key. You should create indexes only if these columns are used in report queries
filtered index
You can create a filtered nonclustered index which are useful when some values in a column occur rarely, whereas other values occur frequently .
Join
Hash Join
SQL Server Query Optimizer recognizes star join patterns and uses bitmap filtered hash joins. Query Optimizer uses hash joins when you join on non-sorted columns from both tables involved in a join. Hash joins can work in parallel threads. With bitmap filtering, they can work on a subset of rows from a dimension and from a fact table in each thread. Such queries are typical for data warehousing environments. For hash joins, you do not index the foreign keys of a fact table.
Nested Loops Joins
If one join input is small (fewer than 10 rows) and the other join input is fairly large and indexed on its join columns, an index nested loops join is the fastest join operation because they require the least I/O and the fewest comparisons.
Merge Joins
If the two join inputs are not small but are sorted on their join column (for example, if they were obtained by scanning sorted indexes), a merge join is the fastest join operation.
Indexed Views
Indexed views are especially useful for following case:
- queries that aggregate data.
- queries that perform multiple joins.
However, to speed up data loads, you can drop or disable the index before load and then recreate or rebuild it after the load.
SET STATISTICS IO ON USE AdventureWorksDW2012; GO SELECT ProductKey, SUM(SalesAmount) AS Sales, COUNT_BIG(*) AS NumberOfRows FROM dbo.FactInternetSales GROUP BY ProductKey; GO
CREATE VIEW SalesByProduct WITH SCHEMABINDING AS SELECT ProductKey, SUM(SalesAmount) AS Sales, COUNT_BIG(*) AS NumberOfRows FROM dbo.FactInternetSales GROUP BY ProductKey; GO CREATE UNIQUE CLUSTERED INDEX CLU_SalesByProduct ON dbo.SalesByProduct (ProductKey); GO SELECT ProductKey, SUM(SalesAmount) AS Sales, COUNT_BIG(*) AS NumberOfRows FROM dbo.FactInternetSales GROUP BY ProductKey; GO
Note that the view must be created with the SCHEMABINDING option if you want to index it. In addition, you must use the COUNT_BIG aggregate function. Nevertheless, after creating the view and the index, execute the query again.
Using Appropriate Query Techniques
若你寫出沒有效率的 TSQL ,資料庫引擎也幫不了你了。例如下面這個總計的查詢:
需求:依訂單號碼順序,列出訂單明細中,各個品項的金額,並且後面加入一個累計金額欄位,如下面結果:
你可以使用 non-equi join 的查詢方式,例如:
SET STATISTICS IO ON SET STATISTICS TIME ON GO WITH InternetSalesGender AS ( SELECT ISA.CustomerKey, C.Gender, ISA.SalesOrderNumber + CAST(ISA.SalesOrderLineNumber AS CHAR(1)) AS OrderLineNumber, ISA.SalesAmount FROM dbo.FactInternetSales AS ISA INNER JOIN dbo.DimCustomer AS C ON ISA.CustomerKey = C.CustomerKey WHERE ISA.CustomerKey <= 12000 ) SELECT ISG1.OrderLineNumber, ISG1.Gender, Min(ISG1.SalesAmount) AS SalesAmount, SUM(ISG2.SalesAmount) AS RunningTotal FROM InternetSalesGender AS ISG1 INNER JOIN InternetSalesGender AS ISG2 ON ISG1.Gender = ISG2.Gender AND ISG1.OrderLineNumber >= ISG2.OrderLineNumber GROUP BY ISG1.Gender, ISG1.OrderLineNumber ORDER BY ISG1.Gender, ISG1.OrderLineNumber
下圖是執行結果, 回傳 6434 列, FactInternetSales 邏輯讀取 4534 , DimCustomer 邏輯讀取 196 , Worktable 邏輯讀取 21314。
下面範例,我們套用 OVER 子句,重新撰寫 TSQL :
SET STATISTICS IO ON SET STATISTICS TIME ON GO WITH InternetSalesGender AS ( SELECT ISA.CustomerKey, C.Gender, ISA.SalesOrderNumber + CAST(ISA.SalesOrderLineNumber AS CHAR(1)) AS OrderLineNumber, ISA.SalesAmount FROM dbo.FactInternetSales AS ISA INNER JOIN dbo.DimCustomer AS C ON ISA.CustomerKey = C.CustomerKey WHERE ISA.CustomerKey <= 12000 ) SELECT ISG1.OrderLineNumber, ISG1.Gender, ISG1.SalesAmount , SUM(ISG1.SalesAmount) OVER( PARTITION BY ISG1.Gender ORDER BY ISG1.OrderLineNumber ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS RunningTotal FROM InternetSalesGender AS ISG1 ORDER BY ISG1.Gender, ISG1.OrderLineNumber;
下圖是執行結果, 回傳 6434 列, FactInternetSales 邏輯讀取 2062 , DimCustomer 邏輯讀取 87 , Worktable 邏輯讀取 0。 整個效能提升很多:
Columnstore Indexes(資料行存放區索引)and Batch Mode
Columnstore Index 和 Batch Mode Processing 都是 SQL 2012 才加進來的新功能,可用來提升資料查詢的效率。
Columnstore Indexes
Columnstore indexes is a new method of storing nonclustered indexes in SQL Server 2012. which can store index data column by column . Columnstore indexes can speed up data warehousing queries by a large factor, from 10 to even 100 times!
- Columnstore indexes accelerate data warehouse queries but are not suitable for OLTP workloads
- If you want to update a table with a columnstore index, you must first drop the columnstore index.
- If you use table partitioning, you can switch a partition to a different table without a columnstore index, update the data there, create a columnstore index on that table (which has a smaller subset of the data), and then switch the new table data back to a partition of the original table.
- In addition to fact tables, very large dimensions could benefit from columnstore indexes as well. Do not use columnstore indexes for small dimensions.
Use columnstore indexes for
- Read-mostly workloads.
- Updates that append new data.
- Workflows that permit partitioning or index drop/rebuild.
- Queries that often scan and aggregate lots of data.
Don't use columnstore indexes when
- You update the data frequently.
- Partition switching or rebuilding indexes doesn’t fit your workflow.
- Your workload includes mostly small lookup queries.
Batch Mode
SQL Server 2012 includes another important improvement for query processing. In batch mode processing, SQL Server processes data in batches rather than processing one row at a time.
ref. [SQL SERVER][Denali] Columnstore Index 執行模式 (Row VS Batch)
Data Compression
Data compression reduces the size of the database, which helps improve query performance because queries on compressed data read fewer pages from disk and thus use less IO.
SQL Server supports three compression implementations:
- Row compression :將欄位的資料型別轉換成可變長度的資料型別
- Page compression:repeated prefixes of values, dictionay compression ..
- Unicode compression:substitutes single-byte storage for Unicode characters that don't truly require two bytes.
You can gain quite a lot from data compression in a data warehouse. Foreign keys are often repeated many times in a fact table. Large dimensions that have Unicode strings in name columns, member properties, and attributes can benefit from Unicode compression.
沒有留言:
張貼留言