Wednesday, February 2, 2022

The Fully Denormalized, United, Columnar Approach

Star schemas considered harmful.  They were born for ROLAP on row-based DBMSs.  In today’s MPP columnar systems, One Big Table (OBT) is the better approach.  You may think this is madness, and I would have agreed with you—until I tried it.

Compared to the use of multiple star schemas, a single-table columnar approach provides improved ease of use, lower maintenance cost, and better query performance.  Read on to find out why.

Some technical knowledge is needed to understand the driving trade-offs between normalized, star (columnar or not), and denormalized columnar structures, and why the history of OLAP architecture has moved the span-of-join from many, to one, and now to none.

Performance Target: Tuned for FTS

The goal of schema design for performance tuning is to establish physical structures that provide acceptable response time when it is most needed.  In the case of business intelligence, that is when ad hoc reports are being iteratively crafted to weigh business strategy.  This performance is achieved by building performant physical structures before users access the data, in essence shifting the time cost from “read” time to “write” time to the extent possible for write-once read-many (WORM) applications.

The slowest component in a computer is its only moving part, the disk drive.  The goal of performance tuning is to maximize RAM use over disk, and if the amount of data exceeds RAM (as is usually the case for data warehouses) the goal is to minimize disk Physical Input/Output (PIO) time.  The insight here is to build structures based on the performance characteristics of random vs. sequential PIO.

Random reads occur when the read/write head of the disk drive jumps back and forth seeking out data at random locations.  This happens when performing index access (for a predicate or join).  99% of the time is wasted as the system waits for disk seek, rotation & transfer.  Sequential PIO occurs when the read/write head is positioned only once and the disk blocks are read sequentially without moving the head to new locations.  This happens when doing Full Table Scans (FTS).  Sequential PIO is orders of magnitude faster: random PIO runs at roughly 300 reads per second while sequential PIO runs at roughly 200 MB per second.  In less time than it takes to read 2 random blocks (8 KB each) you can read 128 contiguous blocks (1 MB).  However, the 1 MB needs to contain useful data or else the time is wasted; if only 8 KB of useful info is found then you would have been better off making a random read.

Optimally using sequential structures requires knowledge of the anticipated data and writing it in the order needed.  An example is a materialized view that pre-loads a fully denormalized data structure ahead of its retrieval (at the cost of storage).
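The trade-off can be checked with a little arithmetic, using the rates quoted above (roughly 300 random reads/sec, 200 MB/sec sequential, 8 KB blocks):

```python
# Back-of-envelope timing from the figures quoted above:
# ~300 random reads/sec vs. ~200 MB/sec sequential, 8 KB blocks.
RANDOM_READS_PER_SEC = 300
SEQ_MB_PER_SEC = 200

two_random_ms = 2 / RANDOM_READS_PER_SEC * 1000   # 2 random 8 KB blocks
one_mb_seq_ms = 1 / SEQ_MB_PER_SEC * 1000         # 128 contiguous 8 KB blocks (1 MB)

print(f"2 random blocks (16 KB): {two_random_ms:.2f} ms")
print(f"1 MB sequential:         {one_mb_seq_ms:.2f} ms")
```

So reading 1 MB of contiguous data (about 6.67 ms vs. 5 ms) really is cheaper than fetching two random 8 KB blocks, provided the megabyte actually contains useful data.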

Access to a Storage Area Network (SAN) can also be a bottleneck, especially if you use time-stamped shadow blocks, as these can silently convert sequential PIO to random PIO.  The solution is to kill the SAN and go with hyperconverged hardware, which is what many cloud-based systems and all TPC-H benchmarks do.  The next significant performance target is node-to-node network bandwidth, which can be reduced by replicating dimension tables across nodes or correlating rows between tables using partition keys.

Normalized vs. Star: Span-of-Join

Normalized structures are best used for transaction processing where small amounts of random data (in KB, not MB) need to be written and read from the disk in about equal measure.  Normalization eliminates data redundancy and thus prevents update anomalies (redundant data falling out of sync).  It makes heavy use of indexes and nested loop index joins to navigate across tables.  Normalized structures are not recommended for reporting over a large volume of data because navigational access with a large span-of-joins would offer poor performance.

Relational online analytical processing (ROLAP) using row-based structures for reporting favors star schemas.  Their secret is the reduction of the span-of-join to 1.  One table is designated the fact table and all other tables are directly joined to it.  When queried, the cost-based optimizer (CBO) loads the dimension tables via FTS into an in-memory structure arranged for a hash join.  The fact table, which can be larger than working RAM, is then scanned using an FTS, filtering & aggregating results as it scans.  Once a row in the fact table is merged/summarized into the growing aggregate, the data from the fact row is discarded while the hashed dimensions remain in memory.  Thus, every block of data is read off the disk only once in a single pass using sequential PIO.  Surprisingly, many do not know that this is the rationale behind star schemas.  If a dimension table becomes too large to fit in memory it must be accessed using the much slower nested-loop index join (with lots of random PIO) or the sort-merge join (with temporary intermediate results materialized to disk).  Either way, query performance suffers.
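The single-pass star join can be sketched in a few lines of Python, using toy dimension and fact data (all table and column names here are hypothetical; a real CBO builds these hash structures internally):

```python
# Sketch of a star-schema hash join: small dimensions are hashed into
# memory, then the large fact table is streamed once, filtering and
# aggregating as it goes.  Each fact row is discarded after aggregation.
from collections import defaultdict

# Small dimensions, hashed by primary key (assumed to fit in RAM).
dim_product = {1: "widget", 2: "gadget"}
dim_region = {10: "east", 20: "west"}

# Large fact table, streamed row by row: (product_fk, region_fk, amount).
fact_rows = [(1, 10, 100), (2, 10, 50), (1, 20, 75), (2, 20, 25), (1, 10, 10)]

totals = defaultdict(int)
for product_fk, region_fk, amount in fact_rows:
    if dim_region[region_fk] != "east":   # filter via the hashed dimension
        continue
    totals[dim_product[product_fk]] += amount  # aggregate, then discard the row

print(dict(totals))  # {'widget': 110, 'gadget': 50}
```

Every fact row is touched exactly once, which is why the fact table can be far larger than RAM while the query still completes in a single sequential scan.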

Business Intelligence Footprint

BI queries typically aggregate a large quantity of rows to produce a time-series view of events and their metrics.  Only a few columns are needed to filter, group by, and summarize the output in a graph.  Graphs limit the number of ‘group by’ columns to around 5 or fewer (based on the number of axes, colors, marks, sizes, etc.) and the number of aggregated data points to what can be perceived.

Enter Columnar Technology

Row-stores favor I/O requiring a few rows and many columns.  Column-stores favor queries that require many rows and few columns; exactly the type of activity used in BI dashboards.  Instead of storing a few rows with all their columns in a disk block a column-store will store only the values of one column in a block.  This eliminates the wasteful reading in of columns that are ignored.
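The layout difference is easy to picture with the same toy data stored both ways (a list of row tuples vs. one list per column; column names are illustrative):

```python
# Row-wise layout: each tuple packs all columns of one row together.
row_store = [
    ("2022-01-01", "blue", 5),
    ("2022-01-02", "green", 3),
    ("2022-01-03", "blue", 7),
]

# Column-wise layout: each column's values are stored contiguously.
column_store = {
    "date":   ["2022-01-01", "2022-01-02", "2022-01-03"],
    "color":  ["blue", "green", "blue"],
    "metric": [5, 3, 7],
}

# SUM(metric) scans one contiguous column, never touching date or color.
total = sum(column_store["metric"])
print(total)  # 15
```

In the row layout that same sum would have to read past every date and color value; in the column layout those blocks are simply never fetched.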

Various extreme data-compression techniques are applied depending on the column datatype.  For example, a string datatype may use global and local dictionary compression with tokenization.  Run-Length Encoding (RLE) is further applied to compress runs of redundant values, and differential encoding is applied to similar values.  Query filtering is applied by matching the compressed values, without the need to decompress, and a min/max zone map performs up-front block elimination.  Query predicates are scanned into memory pages of bitmap vectors that are Boolean-merged using vectorized SIMD instructions, permitting “late materialization” of the selected array index location (i.e. row #).  None of these performance tricks are possible in a row-store.
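Dictionary compression and RLE, and filtering on the compressed form, can be sketched as follows (toy sorted column; real engines use far more elaborate encodings):

```python
# Dictionary + run-length encoding of a sorted string column, with a
# filter evaluated against the compressed tokens -- no decompression.
values = ["blue", "blue", "blue", "green", "green", "red"]  # sorted column

# Dictionary compression: each distinct string becomes a small integer token.
dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
tokens = [dictionary[v] for v in values]

# Run-length encoding over the token stream: (token, run_length) pairs.
rle = []
for t in tokens:
    if rle and rle[-1][0] == t:
        rle[-1] = (t, rle[-1][1] + 1)
    else:
        rle.append((t, 1))

# Filter color = 'green' by comparing tokens, never touching the strings.
target = dictionary["green"]
matches = sum(length for token, length in rle if token == target)
print(rle, matches)  # [(0, 3), (1, 2), (2, 1)] 2
```

Note that the predicate is resolved by one dictionary lookup plus a scan over three RLE pairs, regardless of how many million rows the runs represent.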

ACM A.M. Turing Award winner Michael Stonebraker famously predicted “The only successful ‘Big Data-Small Analytics’ architecture will be a column executor on shared-nothing hardware.  All the successful vendors will have to get there sooner or later.”  He was proven right as all TPC-H benchmark winners today are columnar.

Join Elimination

The cost of doing joins can be significant.  That is why many relational databases use materialized views to tune their performance with denormalized physical structures.  The columnar database Vertica automatically denormalizes its physical structures.  Their whitepaper states the following:

So what about optimizing JOINs in a columnar database? What optimizations exist for a columnar store to do JOINs? Normalized database design often uses a star or snowflake schema model, comprising multiple large fact tables and many smaller dimension tables. Queries typically involve joins between a large fact table and multiple dimension tables. These queries can incur significant overhead, especially as the analytical needs of the organization change but the database design remains the same.

Moreover, it’s common for database schemas to need to be changed. This becomes clear when you ask yourself a simple question—are you running the same reports this year as you were just one year ago? What about two years ago? Schemas are designed to be optimized for some form of reporting, but the data changes, the needs of analysts change and your business changes. When this happens, you can choose to refactor your database, or you can use a feature like flattened tables.

Vertica flattened tables are denormalized tables that get their values by querying other tables. To the analyst, it appears that all of the columns needed are in one table. What is actually happening is that Vertica can create a copy of the data in a flattened schema and still be very efficient in storage requirement. Again, using the efficiencies of compression, late stage materialization and predicate push-down, these normalized tables have two interesting benefits: 1) It makes it easier for the business analyst to write simple queries instead of JOINs, and; 2) performance is faster than a normalized JOIN in many cases.

Eventually, [our store] was replaced by the notion of a super projection that contains all the columns.

One way of picturing a denormalized columnar DB is that physically it actually is a star schema with every string column in its own dimension table (the compression dictionary) with all redundant values eliminated.

Join Complexity

Denormalization, unionization & columnization make the best use of vector bitmap tables for high-performing aggregated queries that combine facts along a time-series or other shared axis.  Using a star schema with multiple columnstore facts & dimensions hinders the use of late-materialization vectorization and drags two more columns (the FK & PK) into the query plan.

Denormalization also eliminates joins to large high-cardinality dimension tables, which significantly hurt performance because they do not lend themselves to hash joins.  Costly complex theta (inequality) joins and calculated joins are also resolved at load time, as are joins that use COALESCE on multiple foreign keys to compute default and override values.  A star schema would have a difficult time with these and would have to resolve them with complex custom surrogate-key generation.

Ease of Use

Denormalization to one table broadens the user’s ability to extract a wide array of meaningful data, with improved performance.  The white paper “Expediting analytical databases with columnar approach” by Nenad Jukic, Boris Jukic, Abhishek Sharma, Svetlozar Nestorov, and Benjamin Korallus Arnold states:

This method improves the feasibility and quality of tactical decision making by making critical information more readily available. It also improves the quality of longer term strategic decision making by widening the range of feasible queries against the vast amounts of available information. The advantages include the improvements in the performance of the ETL process (the most common time-consuming bottleneck in most implementations of data warehousing for quality decision support) and in the performance of the individual analytical queries.

Providing a single table makes it much easier for BI users to find & get what they need.  Users do not need to pick which fact table data source to use. They can pick columns to show or filter, and it will automatically guide them to the appropriate transactions.

Because the OBT approach results in a large number of sparse columns a "generalization layer" should be defined in the BI data source.  Columns should be organized for easy navigation & access using defined hierarchical "Folders" in PowerBI's Fields blade (see picture).  Tableau also has this capability.

BI filter checkboxes are super-fast as the list of possible values is pulled from the dictionary.  Queries are faster too because all joins are pre-computed.

BI charts frequently combine facts along a common time-series or some other axis in a Combo Chart.  With the star schema approach a PowerBI data source is often defined for each fact table. If users need to make a chart that combines facts they are forced to create new special-purpose data sources from scratch and to resolve the complicated issues of merging fact tables and columns and creating joins with many-to-many relationships.  PowerBI allows you to define a data source that has more than one fact table.  As long as the user selects columns from only one fact table they have access to all its columns and those of its related dimensions.  If two fact tables are selected then their measures and shared dimensions may be used, and all measures are aggregated along the shared dimensions; however, no attribute of either fact table or any unshared dimension may be used, as it would cause a Cartesian product.  In contrast, with the one-table approach users do not need to struggle with any of this.

Why Union the Facts?

The greatest value of BI analytics is derived from discovering correlations between different facts (e.g. pricing to churn).  Unfortunately, a star schema solution would entail not one star schema but many.  Direct joins between star schema fact tables are not permitted because they would result in a many-to-many Cartesian join.  The solution is to use either of the following strategies: (A) separately aggregate each fact table and then join on common ‘group by’ columns or shared dimensions, or (B) ‘union’ the fact tables together (aligning columns) and then perform the aggregation.

Our solution is to union the facts into one table, aligning all columns. A “fact_type” column is added so the user can filter on specific facts.  Separate date columns are included, but one date column from each transaction is replicated to an additional “Transaction Date” column so events that occur on the same date can be shown.
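The union step can be sketched with sqlite3 as a stand-in engine (a production deployment would use an MPP columnar DBMS; all table and column names here are made up for illustration):

```python
# Two fact tables are aligned on shared columns, tagged with fact_type,
# and merged with UNION ALL; columns a fact does not carry become NULL.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact_sales  (sale_date TEXT, color TEXT, amount INT);
    CREATE TABLE fact_visits (visit_date TEXT, person TEXT, visits INT);
    INSERT INTO fact_sales  VALUES ('2022-01-01', 'blue', 5);
    INSERT INTO fact_visits VALUES ('2022-01-02', 'Bob', 65);
""")

rows = con.execute("""
    SELECT 'SALES' AS fact_type, sale_date AS transaction_date,
           color, NULL AS person, amount, NULL AS visits
    FROM fact_sales
    UNION ALL
    SELECT 'VISITS', visit_date, NULL, person, NULL, visits
    FROM fact_visits
    ORDER BY fact_type
""").fetchall()
for r in rows:
    print(r)
```

Each fact’s own date column is replicated into the shared `transaction_date` column, so events from different facts line up on the same axis.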

By denormalizing to a single table the source structures are eliminated and become of no concern to the report user.  They can now just focus on the columns needed.

If users do not know which transactions to use, they can select the columns of interest and filter out NULL values; this automatically narrows the results to the facts that are of interest.  The user can then pick among this subset of fact_types.  The user is free to allow multiple fact_types, and these can be shown separately along with the fact_type value, or the fact_type can be removed from the display and the metrics aggregated together along shared dimension attributes.  If the relationship between the facts is n:m then showing the detail level will show the n and m records on separate rows.  You can also aggregate on an unshared attribute, and the output will show the aggregated measure beside each attribute value, plus one row with NULL for the attribute holding the other fact_type’s attributes and measures.  This approach to storing the data allows many difficult scenarios to be handled easily, without the need to define new certified data sources.

In the simplified example below fact_type “A” has metrics on Color, fact_type “B” has metrics on Person, fact_type “C” has metrics on both Color & Person, and fact_type “D” has metrics on both Color, Person & Skill.  When you aggregate on Color & Person you get the following output:

FACT_TYPE | Color  | Person | A_Metric | B_Metric | C_Metric | D_Metric
----------|--------|--------|----------|----------|----------|---------
A         | blue   |        | 5        |          |          |
A         | green  |        | 3        |          |          |
A         | red    |        | 5        |          |          |
A         | orange |        | 2        |          |          |
A         | yellow |        | 1        |          |          |
B         |        | Bob    |          | 65       |          |
B         |        | Mary   |          | 45       |          |
B         |        | Joe    |          | 23       |          |
C&D       | blue   | Bob    |          |          | 11       | 547
C&D       | blue   | Mary   |          |          | 21       | 271
C&D       | green  | Mary   |          |          | 43       | 828
C&D       | green  | Joe    |          |          | 78       | 459
C&D       | yellow | Bob    |          |          | 67       | 451

Note that this allows the display of complex n:m relationships without issue.  The Color & Person columns are aligned across every fact_type and are NULL when unused.  Each row of output may aggregate many detail records.  Additional shared & unshared attributes may be added, which could increase the number of aggregated output rows.  Fact_types C & D have their metrics combined because Skill is not used as a grouping attribute in this example, causing the metrics of D to be rolled up to the level of C.  The rows could be sorted in a number of different ways and subtotaling may be added.
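The C & D rollup can be reproduced in plain Python with a few invented detail rows (the 547 in the table above could, for instance, be the sum of D metrics over Skill values that were dropped from the grouping):

```python
# fact_type D carries a Skill attribute, but since Skill is not in the
# GROUP BY, D's metrics roll up to the Color/Person level and land on the
# same output row as C's metrics.
from collections import defaultdict

# (fact_type, color, person, skill, metric) detail rows -- toy data.
detail = [
    ("C", "blue", "Bob", None,    11),
    ("D", "blue", "Bob", "weld",  500),
    ("D", "blue", "Bob", "paint", 47),
]

grouped = defaultdict(lambda: {"C_Metric": 0, "D_Metric": 0})
for fact_type, color, person, skill, metric in detail:
    grouped[(color, person)][fact_type + "_Metric"] += metric

print(grouped[("blue", "Bob")])  # {'C_Metric': 11, 'D_Metric': 547}
```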

A star schema would find it very difficult to show data at different levels of details as done in the above example.

Load Time

Load time is slower with the denormalized columnar approach than with the star schema approach.  This is by design: the time cost has been shifted from “read” time to “write” time.

Development & Maintenance Time

If you know your source system OLTP schema it is easy to build an OBT solution:

  1. Create a target table that has all the desired columns.
  2. Pick each of the lowest level transaction tables needed.
  3. For each transaction table code a SELECT that recursively joins all their parent tables.
  4. Convert each SELECT into an INSERT target_table (columns…) SELECT columns… FROM … .  Note: this automatically loads NULL into the unused columns.
  5. Put each INSERT into a stored procedure and execute it.  (The multiple INSERTs act like a UNION ALL.)
  6. If needed add a VIEW or Data Source definition that gives the columns understandable names.
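The INSERT … SELECT pattern of steps 3–5 can be sketched with sqlite3 as a stand-in engine (table and column names here are hypothetical):

```python
# Each INSERT ... SELECT joins one transaction table to its parents; columns
# a given fact does not populate load as NULL, and multiple INSERTs into the
# same target behave like a UNION ALL.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_color (color_id INT PRIMARY KEY, color TEXT);
    CREATE TABLE txn_sale  (sale_id INT, color_id INT, amount INT);
    CREATE TABLE obt (fact_type TEXT, color TEXT, amount INT, visits INT);
    INSERT INTO dim_color VALUES (1, 'blue');
    INSERT INTO txn_sale  VALUES (100, 1, 5);

    -- Steps 3 & 4: join the transaction to its parent, load into the OBT.
    INSERT INTO obt (fact_type, color, amount)
    SELECT 'SALE', c.color, s.amount
    FROM txn_sale s JOIN dim_color c ON c.color_id = s.color_id;
""")

obt_rows = con.execute("SELECT * FROM obt").fetchall()
print(obt_rows)  # [('SALE', 'blue', 5, None)]
```

The unused `visits` column arrives as NULL without being mentioned, which is what keeps each per-fact INSERT short even when the target table is very wide.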

Because you included every parent table your solution should be comprehensive and complete.  By recursively following every join path the data mart suffers minimal data loss, reducing maintenance requests when data is later found to be missing.

If you have all referential constraints defined in the database, then it is possible to generate all the code automatically from a few metadata tables.  Auto-generation of code saves hundreds of hours, making it possible to deliver the data mart in one day.  A stored procedure can be created that automatically generates the stored procedure that builds the OBT table, view & code from the metadata tables.  Maintenance of the metadata tables is easy to understand, and the entire stored procedure can be regenerated in under 10 minutes, eliminating code maintenance.  A later post will show the code for this.  Another post will show how to capture volumetrics on the tables and profile the columns so you can tell at a glance which tables & columns are data-rich.
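A minimal sketch of this metadata-driven generation, using SQLite’s foreign-key catalog (`PRAGMA foreign_key_list`) as a stand-in for the warehouse’s own constraint metadata (all table names are invented; the post’s actual generator is stored-procedure based):

```python
# Recursively follow every foreign key up to its parent table and emit
# the join SQL -- the core of auto-generating the OBT load statement.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_region (region_id INT PRIMARY KEY, region TEXT);
    CREATE TABLE dim_store  (store_id INT PRIMARY KEY,
                             region_id INT REFERENCES dim_region(region_id));
    CREATE TABLE txn_sale   (sale_id INT,
                             store_id INT REFERENCES dim_store(store_id));
    INSERT INTO dim_region VALUES (1, 'east');
    INSERT INTO dim_store  VALUES (10, 1);
    INSERT INTO txn_sale   VALUES (100, 10);
""")

def join_clauses(table):
    """Walk each foreign key of `table` to its parent, then recurse upward."""
    clauses = []
    for _, _, parent, child_col, parent_col, *_ in con.execute(
            f"PRAGMA foreign_key_list({table})"):
        clauses.append(f"JOIN {parent} ON {parent}.{parent_col} = {table}.{child_col}")
        clauses += join_clauses(parent)
    return clauses

sql = "SELECT * FROM txn_sale " + " ".join(join_clauses("txn_sale"))
print(sql)
rows = con.execute(sql).fetchall()
print(rows)
```

The generated SELECT recursively spans the whole parent chain (txn_sale → dim_store → dim_region), which is exactly the query that step 3 of the recipe would otherwise be written by hand.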

A manually created star schema could not be created as fast.

Real World Example

At one company I loaded the following data warehouse into AWS Redshift, which is an MPP columnar DBMS.  It had 13 fact tables with 35 billion rows, shown in the middle of the diagram. The top of the diagram shows the conformed shared dimensions, and the unshared dimensions are at the bottom of the ER diagram; 25 dimensions in all.


For comparison, this was loaded into a single flat table by joining every fact to its dimensions and then merging the joined facts by UNION ALL, aligning the dimensional columns.  Tests were then done to see which performed better, the star or the OBT, and it was found that the single table solution was fastest under every type of BI query.  It was then given to the Tableau BI users and they found the one-table solution to be much easier to use than the star schemas, especially for strategic queries that span multiple facts.

The solution went into production and over the following 5 years it was found to be easy to maintain.  Over time, the system grew to hold 40 billion rows and 200 sparse columns and to ingest 21 facts and 36 dimensions while still offering exceptional performance.

Conclusion

The fully denormalized columnar solution is complete and comprehensive, and provides better performance.  It is easily maintained through metadata tables.  End users will find that it satisfies the widest possible set of queries and provides the greatest insight & guidance when used in an ad hoc manner for strategic goals.

If you too have tried this approach I'd love to get your experience.  It seems no one has written about this.