Star schemas considered harmful. They were born for ROLAP on row-based DBMSs. In today’s MPP columnar systems, One Big Table (OBT) is the better approach. You may think this is madness, and I would have agreed with you—until I tried it.
Compared to the use of multiple star schemas, a single-table
columnar approach provides improved ease of use, lower maintenance cost, and
better query performance. Read on to
find out why.
Some technical knowledge is needed to understand the driving
trade-offs between normalized, star (columnar or not), and denormalized
columnar structures, and why the history of OLAP architecture varied the
span-of-join from many, to one, and now, to none.
Performance Target: Tuned for FTS
The goal of schema design for performance tuning is to
establish physical structures that provide acceptable response time when it is
most needed. In the case of business intelligence, that is when ad hoc reports are being iteratively crafted to weigh business strategy. This is achieved by building performant physical structures before users access the data, in essence shifting the time cost from “read” time to “write” time to the extent possible for write-once read-many (WORM) applications.
The slowest component in a computer is its only moving part, the disk drive. The goal of performance tuning is to favor RAM over disk, and if the amount of data exceeds RAM (which is usually the case for data warehouses) the goal is to minimize disk Physical Input/Output (PIO) time. The insight here
is to build structures based on the performance characteristics of random vs.
sequential PIO. Random reads are when
the read/write head of the disk drive jumps back and forth seeking out data at
random locations. This happens when
performing index access (for predicate or join). 99% of the time is wasted as the system waits for disk seek, rotation & transfer.
Sequential PIO is when the read/write head is positioned only once, and
the disk blocks are read sequentially off the disk without moving the head to
new locations. This happens when doing
Full Table Scans (FTS). Per byte delivered, sequential PIO is dramatically faster than random PIO: random PIO runs at roughly 300 reads per second while sequential PIO runs at roughly 200 MB per second, about 85 times the effective throughput. In less time than it takes to read 2 random blocks (8 KB each) you can read 128 contiguous blocks (1 MB). However, the 1 MB needs to contain useful
data or else the time is wasted, for if only 8K of useful info is found then
you would be better off to have made a random read. Optimally using sequential structures
requires a knowledge of the anticipated data and writing it in the order
needed. An example is the use of materialized
views that pre-loads a fully denormalized data structure ahead of its retrieval
(at the cost of storage).
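A quick back-of-envelope check of these figures (the rates are the article's round numbers, not measurements):

```python
# Back-of-envelope comparison of random vs. sequential PIO, using the
# round figures quoted above (300 random reads/s, 200 MB/s sequential).
RANDOM_IOPS = 300            # random 8 KB reads per second
SEQ_MBPS = 200               # sequential throughput in MB/s
BLOCK_KB = 8

time_2_random = 2 / RANDOM_IOPS          # seconds to read 2 random blocks
time_1mb_seq = 1 / SEQ_MBPS              # seconds to read 1 MB (128 blocks)

random_mbps = RANDOM_IOPS * BLOCK_KB / 1024   # effective MB/s of random reads
speedup_per_byte = SEQ_MBPS / random_mbps

print(f"2 random blocks:  {time_2_random * 1000:.1f} ms")   # 6.7 ms
print(f"128 seq. blocks:  {time_1mb_seq * 1000:.1f} ms")    # 5.0 ms
print(f"per-byte speedup: {speedup_per_byte:.0f}x")         # 85x
```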
Access to a Storage Area Network (SAN) can also be a
bottleneck, especially if you use time-stamped shadow blocks as these can silently
convert sequential PIO to random PIO.
The solution is to kill the SAN and go with hyperconverged hardware,
which is what many cloud-based systems and all TPC-H benchmark winners do. The next significant performance target is node-to-node network bandwidth, whose demand can be reduced by replicating dimension tables across nodes or by co-locating related rows between tables using partition keys.
Normalized vs. Star: Span-of-Join
Normalized structures are best used for transaction
processing where small amounts of random data (in KB, not MB) need to be
written and read from the disk in about equal measure. Normalization eliminates data redundancy and
thus prevents update anomalies (redundant data falling out of sync). It makes heavy use of indexes and nested loop
index joins to navigate across tables. Normalized
structures are not recommended for reporting over a large volume of data
because navigational access with a large span-of-joins would offer poor
performance.
Relational online analytical processing (ROLAP) using
row-based structures for reporting favors star schemas. Their secret is the reduction of the
span-of-join to 1. One table is
designated the fact table and all other tables are directly joined to it. When queried, the cost-based optimizer (CBO)
loads the dimension tables via FTS into an in-memory structure arranged for a
hash table join. The fact table, which
can be larger than working Ram, is then scanned using an FTS, filtering &
aggregating results as it scans. Once a
row in the fact table is merged/summarized into the growing aggregate the data
from the fact row is discarded while the hashed dimensions remain in memory. Thus, every block of data is read off the
disk only once in a single pass using sequential PIO. Surprisingly, many do not know that this is
the rationale behind star schemas. If a
dimension table becomes too large to fit in memory it must be accessed using
the much slower nested-loop index join (with lots of random PIO) or the
sort-merge join (with temporary intermediate materialized disk PIO). Either way, the performance of the query
suffers.
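The single-pass hash-join plan described above can be sketched in a few lines of Python (table contents are made up for illustration):

```python
# Sketch of the star-schema hash join: the dimension fits in memory,
# the fact table is streamed once via FTS and aggregated on the fly.
dim_product = {1: "widget", 2: "gadget"}          # product_id -> name (hash table in RAM)
fact_sales = [                                     # (product_id, amount) rows
    (1, 10.0), (2, 5.0), (1, 7.5), (2, 2.5), (1, 1.0),
]

totals = {}
for product_id, amount in fact_sales:              # one sequential pass over the fact table
    name = dim_product[product_id]                 # O(1) in-memory hash lookup, no disk seek
    totals[name] = totals.get(name, 0.0) + amount  # fact row is summarized, then discarded

print(totals)  # {'widget': 18.5, 'gadget': 7.5}
```

Every fact block is read once; only the small hashed dimension and the growing aggregate stay in memory.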
Business Intelligence Footprint
BI queries typically aggregate a large quantity of rows to
produce a time-series view of events and their metrics. Only a few columns are needed to filter, group
by, and summarize the output in a graph.
Graphs limit the number of ‘group by’ columns to around 5 or fewer (based on the number of axes, colors, marks, sizes, etc.) and the amount of aggregated data points to that which can be perceived.
Enter Columnar Technology
Row-stores favor I/O requiring a few rows and many
columns. Column-stores favor queries
that require many rows and few columns; exactly the type of activity used in BI
dashboards. Instead of storing a few
rows with all their columns in a disk block a column-store will store only the
values of one column in a block. This
eliminates the wasteful reading in of columns that are ignored.
Various extreme data compression techniques are applied
depending on the column datatype. For
example, a string datatype may apply global and local dictionary compression
with tokenization. Run Length Encoding
(RLE) is further applied to compress redundant values and differential encoding
is applied to similar values. Query
filtering is applied by matching the compressed values, without the need to
decompress, and a content min/max zone map performs up-front block
elimination. Query predicates are scanned into memory pages of bitmap vectors that are Boolean-merged using vector instructions, permitting “late materialization” of the selected array index location (i.e., the row number). None of these
performance tricks are possible in a row-store.
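A minimal sketch of two of these techniques, dictionary encoding plus RLE, with a predicate evaluated directly on the compressed tokens (toy data, not any particular engine's on-disk format):

```python
# Dictionary encoding + run-length encoding (RLE) of one string column,
# with the predicate  color = 'red'  evaluated without decompressing.
column = ["red", "red", "red", "blue", "blue", "red", "green"]

# Dictionary compression: each distinct string becomes a small integer token.
dictionary = {v: i for i, v in enumerate(dict.fromkeys(column))}
tokens = [dictionary[v] for v in column]

# RLE: collapse runs of identical tokens into [token, run_length] pairs.
rle = []
for t in tokens:
    if rle and rle[-1][0] == t:
        rle[-1][1] += 1
    else:
        rle.append([t, 1])

# Filter on the compressed form: look the predicate value up in the
# dictionary once, compare tokens, and expand run lengths to row positions.
target = dictionary["red"]
matches, pos = [], 0
for t, n in rle:
    if t == target:
        matches.extend(range(pos, pos + n))
    pos += n

print(rle)      # [[0, 3], [1, 2], [0, 1], [2, 1]]
print(matches)  # [0, 1, 2, 5]
```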
ACM A.M. Turing Award winner Michael Stonebraker famously predicted
“The only successful ‘Big Data-Small Analytics’ architecture will be a column
executor on shared-nothing hardware. All
the successful vendors will have to get there sooner or later.” He was proven right as all TPC-H benchmark
winners today are columnar.
Join Elimination
The cost of doing joins can be significant. That is why many relational databases use
materialized views to tune their performance with denormalized physical
structures. The columnar database Vertica
automatically denormalizes its physical structures. Their whitepaper states the following:
So what about optimizing JOINs in a
columnar database? What optimizations exist for a columnar store to do JOINs?
Normalized database design often uses a star or snowflake schema model,
comprising multiple large fact tables and many smaller dimension tables.
Queries typically involve joins between a large fact table and multiple
dimension tables. These queries can incur significant overhead, especially as
the analytical needs of the organization change but the database design remains
the same.
Moreover, it’s common for database
schemas to need to be changed. This becomes clear when you ask yourself a
simple question—are you running the same reports this year as you were just one
year ago? What about two years ago? Schemas are designed to be optimized for
some form of reporting, but the data changes, the needs of analysts change and
your business changes. When this happens, you can choose to refactor your
database, or you can use a feature like flattened tables.
Vertica flattened tables are denormalized tables that get their values by querying other tables. To the
analyst, it appears that all of the columns needed are in one table. What is
actually happening is that Vertica can create a copy of the data in a flattened
schema and still be very efficient in storage requirement. Again, using the
efficiencies of compression, late stage materialization and predicate
push-down, these denormalized tables have two interesting benefits: 1) It makes it easier for the business analyst to write simple queries instead of JOINs, and; 2) performance is faster than a normalized JOIN in many cases.
Eventually, [our store] was
replaced by the notion of a super projection that contains all the columns.
One way of picturing a denormalized columnar DB is that
physically it actually is a star schema with every string column in its
own dimension table (the compression dictionary) with all redundant values
eliminated.
Join Complexity
Denormalization, unionization & columnization make the best use of vector bitmap structures for high-performing aggregate queries that combine facts along a time-series or other shared axis. Using a star schema with multiple columnstore facts & dimensions hinders the use of late-materialization vectorization and drags 2 more columns (the FK & PK) into the query plan.
Denormalization also eliminates joins to large high
cardinality dimension tables that significantly hurt performance because they
do not lend themselves to hash joins. Costly
complex theta (inequality) joins and calculated joins are also resolved at load
time. Joins that use COALESCE on
multiple foreign keys to compute default and override are also resolved at load
time. A star schema would have a difficult time handling these and would have to resort to complex custom surrogate key generation.
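To illustrate resolving a COALESCE default/override join once at load time, here is a minimal sqlite3 sketch (the table and column names are hypothetical):

```python
import sqlite3

# Sketch: a region-specific price override falls back to a default price.
# The COALESCE is computed once at write time, so queries never join.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE price_default (product_id INT, price REAL);
    CREATE TABLE price_override (product_id INT, region TEXT, price REAL);
    CREATE TABLE fact_order (order_id INT, product_id INT, region TEXT);
    INSERT INTO price_default VALUES (1, 10.0), (2, 20.0);
    INSERT INTO price_override VALUES (1, 'EU', 12.0);
    INSERT INTO fact_order VALUES (100, 1, 'EU'), (101, 1, 'US'), (102, 2, 'EU');

    -- One Big Table: the effective price is resolved at load time.
    CREATE TABLE obt AS
    SELECT o.order_id,
           o.product_id,
           o.region,
           COALESCE(ov.price, d.price) AS effective_price
    FROM fact_order o
    LEFT JOIN price_override ov
           ON ov.product_id = o.product_id AND ov.region = o.region
    LEFT JOIN price_default d
           ON d.product_id = o.product_id;
""")
rows = con.execute(
    "SELECT order_id, effective_price FROM obt ORDER BY order_id"
).fetchall()
print(rows)  # [(100, 12.0), (101, 10.0), (102, 20.0)]
```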
Ease of Use
Denormalization to one table improves the user’s ability to
extract a wide array of meaningful data, with improved performance. The white paper “Expediting analytical
databases with columnar approach” by Nenad Jukic, Boris Jukic, Abhishek Sharma,
Svetlozar Nestorov, Benjamin Korallus Arnold states:
This method improves the
feasibility and quality of tactical decision making by making critical
information more readily available. It also improves the quality of longer term
strategic decision making by widening the range of feasible queries against the
vast amounts of available information. The advantages include the improvements
in the performance of the ETL process (the most common time-consuming
bottleneck in most implementations of data warehousing for quality decision
support) and in the performance of the individual analytical queries.
Providing a single table makes it much easier for BI users
to find & get what they need. Users do not need to pick out which fact table data source to use. They can pick columns to show or filter, and the data will automatically guide them to the appropriate transactions.
BI filter checkboxes are super-fast as the list of possible
values is pulled from the dictionary. Queries
are faster too because all joins are pre-computed.
BI charts frequently combine facts along a common
time-series or some other axis in a Combo Chart. With the star schema approach a PowerBI data
source often is defined for each fact table. If the user needs to make a chart
that combines facts they are forced to create new special purpose data sources
from scratch and to resolve the complicated issues on merging fact tables and
columns and creating joins with many-to-many relationships. PowerBI does allow you to define a data source that has more than one fact table. As long as the user selects columns from only one fact table, they have access to all its columns and those of its related dimensions. If 2 fact tables are selected then their measures and shared dimensions may be used, and all measures are aggregated along the shared dimensions; however, no attribute of any fact table or any unshared dimension may be used, as it would cause a Cartesian product. In contrast, with the one-table approach users do not need to struggle with any of this.
Why Union the Facts?
The greatest value of BI analytics is derived from
discovering correlations between different facts (e.g. pricing to churn). Unfortunately, a star schema solution would
entail not one star schema but many. Direct joins between star schema fact tables are not permitted because they would result in a many-to-many Cartesian join. The solution is to use either of the
following strategies: (A) separately aggregate each fact table and then join on
common ‘group by’ columns or shared dimensions, or (B) ‘union’ the fact tables
together (aligning columns) and then perform the aggregation.
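Strategy (B) can be sketched as follows, assuming two hypothetical fact tables that share a month and product axis:

```python
import sqlite3

# Sketch of strategy (B): union two fact tables with aligned columns,
# tag rows with a fact_type, then aggregate over the shared axis.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact_pricing (month TEXT, product TEXT, revenue REAL);
    CREATE TABLE fact_churn (month TEXT, product TEXT, churned INT);
    INSERT INTO fact_pricing VALUES ('2024-01','widget',100),('2024-02','widget',90);
    INSERT INTO fact_churn VALUES ('2024-01','widget',3),('2024-02','widget',7);
""")

# Align columns and NULL-fill the measures that each fact lacks.
unioned = """
    SELECT 'PRICING' AS fact_type, month, product, revenue, NULL AS churned
      FROM fact_pricing
    UNION ALL
    SELECT 'CHURN', month, product, NULL, churned
      FROM fact_churn
"""
rows = con.execute(f"""
    SELECT month, SUM(revenue), SUM(churned)
      FROM ({unioned})
     GROUP BY month
     ORDER BY month
""").fetchall()
print(rows)  # [('2024-01', 100.0, 3), ('2024-02', 90.0, 7)]
```

SUM ignores the NULL-filled measures, so both facts line up on the shared month axis without a fact-to-fact join.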
Our solution is to union the facts into one table, aligning
all columns. A “fact_type” column is added so the user can filter on specific
facts. Separate date columns are included, but one date column from each transaction is replicated to an additional “Transaction Date” column so events that occur on the same date can be shown together.
By denormalizing to a single table the source structures are
eliminated and become of no concern to the report user. They can now just focus on the columns
needed.
If the user does not know which transactions to use, they can select the columns of interest and exclude the rows where those columns are NULL. This will automatically filter the results to the facts that are of interest.
The user can then pick among this subset of fact_types. The user is free to allow multiple fact_types
and these can be shown separately along with the fact_type value, or the fact_type
can be removed from the display and the metrics can be aggregated together
along shared dimension attributes. If
the relationship between the facts is n:m then showing the detail level will
show the n and m records on separate rows.
You also can aggregate on an unshared attribute and the output will show
the aggregated measure beside each attribute value, plus one row with NULL for
the attribute with the other fact_type attributes and measures. This approach to storing the data allows many
different difficult scenarios to be handled easily, without the need of
defining new certified data sources.
In the simplified example below fact_type “A” has metrics on
Color, fact_type “B” has metrics on Person, fact_type “C” has metrics on both
Color & Person, and fact_type “D” has metrics on both Color, Person &
Skill. When you aggregate on Color &
Person you get the following output:
| FACT_TYPE | Color  | Person | A_Metric | B_Metric | C_Metric | D_Metric |
|-----------|--------|--------|----------|----------|----------|----------|
| A         | blue   |        | 5        |          |          |          |
| A         | green  |        | 3        |          |          |          |
| A         | red    |        | 5        |          |          |          |
| A         | orange |        | 2        |          |          |          |
| A         | yellow |        | 1        |          |          |          |
| B         |        | Bob    |          | 65       |          |          |
| B         |        | Mary   |          | 45       |          |          |
| B         |        | Joe    |          | 23       |          |          |
| C&D       | blue   | Bob    |          |          | 11       | 547      |
| C&D       | blue   | Mary   |          |          | 21       | 271      |
| C&D       | green  | Mary   |          |          | 43       | 828      |
| C&D       | green  | Joe    |          |          | 78       | 459      |
| C&D       | yellow | Bob    |          |          | 67       | 451      |
Note that this allows the display of complex n:m relationships without issue. The Color & Person columns are aligned across every fact_type and are NULL when unused. Each row of output may aggregate many detail records. Additional shared & unshared attributes may be added, which could increase the number of aggregated output rows. Fact_types C & D have their metrics combined because Skill is not used as a grouping attribute in this example, causing the metrics of D to be rolled up to the level of C. The rows could be sorted in a number of different ways and subtotaling may be added.
A star schema would find it very difficult to show
data at different levels of details as done in the above example.
Load Time
Load time is slower with the denormalized columnar approach compared to the star schema approach. This is an acceptable trade-off because the time cost has been shifted from “read” time to “write” time, where it is paid once per load instead of on every query.
Development & Maintenance Time
If you know your source system OLTP schema it is easy to build an OBT solution:
- Create a target table that has all the desired columns.
- Pick each of the lowest level transaction tables needed.
- For each transaction table, code a SELECT that recursively joins all of its parent tables.
- Convert each SELECT into an INSERT target_table (columns…) SELECT columns… FROM … . Note: this automatically loads NULL into the unused columns.
- Put each INSERT into a stored procedure and execute it. (The multiple INSERTs act like a UNION ALL.)
- If needed add a VIEW or Data Source definition that gives the columns understandable names.
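The steps above can be sketched with sqlite3 and two hypothetical transaction tables; each INSERT … SELECT joins a transaction to its parents and leaves the other fact_type's columns NULL:

```python
import sqlite3

# Sketch of the OBT load: one INSERT ... SELECT per transaction table,
# joining its parent tables; together the INSERTs act like a UNION ALL.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (customer_id INT, name TEXT);
    CREATE TABLE orders (order_id INT, customer_id INT, amount REAL);
    CREATE TABLE tickets (ticket_id INT, customer_id INT, severity TEXT);
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO orders VALUES (10, 1, 99.0);
    INSERT INTO tickets VALUES (77, 1, 'high');

    -- Step 1: target table with all the desired columns.
    CREATE TABLE obt (fact_type TEXT, customer_name TEXT,
                      order_amount REAL, ticket_severity TEXT);

    -- Steps 3-4: unused columns are automatically loaded with NULL.
    INSERT INTO obt (fact_type, customer_name, order_amount)
    SELECT 'ORDER', c.name, o.amount
      FROM orders o JOIN customer c ON c.customer_id = o.customer_id;

    INSERT INTO obt (fact_type, customer_name, ticket_severity)
    SELECT 'TICKET', c.name, t.severity
      FROM tickets t JOIN customer c ON c.customer_id = t.customer_id;
""")
rows = con.execute("SELECT * FROM obt ORDER BY fact_type").fetchall()
print(rows)  # [('ORDER', 'Acme', 99.0, None), ('TICKET', 'Acme', None, 'high')]
```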
Because you included every parent table, your solution should be comprehensive and complete. By recursively following every join path the data mart has minimal data loss, reducing maintenance requests when data is later found to be missing.
If you have all referential constraints defined in the database,
then it is possible to generate all the code automatically from a few metadata
tables. Auto-generation of code saves hundreds of hours, making it possible to deliver the data mart in one day. A stored procedure can be created that
automatically generates the stored procedure that generates the OBT table, view
& code from the metadata tables.
Maintenance of the metadata tables is easy to understand and the entire
stored procedure can be regenerated in under 10 minutes, eliminating code
maintenance. A later post will show the
code for this. Another post will also
show how to capture volumetrics on the tables and profile the columns so you can tell at a glance which tables & columns are data-rich.
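As a toy illustration of the code-generation idea (the metadata shape and all names here are invented, not the promised stored procedure):

```python
# Toy generator: given FK metadata (a stand-in for real metadata tables),
# emit one INSERT ... SELECT per transaction table.
metadata = {
    # transaction table -> list of (parent_table, fk_column, pk_column)
    "orders": [("customer", "customer_id", "customer_id")],
    "tickets": [("customer", "customer_id", "customer_id")],
}
target_columns = {"orders": ["amount"], "tickets": ["severity"]}

def gen_insert(tx: str) -> str:
    """Build the INSERT ... SELECT statement for one transaction table."""
    joins = " ".join(
        f"JOIN {p} ON {p}.{pk} = {tx}.{fk}" for p, fk, pk in metadata[tx]
    )
    cols = ", ".join(f"{tx}.{c}" for c in target_columns[tx])
    return (f"INSERT INTO obt (fact_type, {', '.join(target_columns[tx])}) "
            f"SELECT '{tx.upper()}', {cols} FROM {tx} {joins};")

print(gen_insert("orders"))
```

A real implementation would read the join paths from the database's referential constraints instead of a hand-written dict.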
A manually created star schema could not be created as fast.
Real World Example
At one company I loaded the following data warehouse into
AWS Redshift, which is an MPP columnar DBMS.
It had 13 fact tables with 35 billion rows, shown in the middle of the diagram. The top of the diagram shows the conformed shared dimensions, and the unshared dimensions are at the bottom of the ER-diagram; 25 dimensions in all.
For comparison, this was loaded into a single flat table by joining every fact to its dimensions and then merging the joined facts by UNION ALL, aligning the dimensional columns. Tests were then done to see which performed better, the star or the OBT, and it was found that the single table solution was fastest under every type of BI query. It was then given to the Tableau BI users and they found the one-table solution to be much easier to use than the star schemas, especially for strategic queries that span multiple facts.
The solution went into production and over the coming 5
years it was found to be easy to maintain.
Over time, the system grew to hold 40 billion rows and 200 sparse columns and to ingest 21 facts and 36 dimensions, while still offering exceptional performance.
Conclusion
The fully denormalized columnar solution is complete and
comprehensive, and provides better performance.
It is easily maintained through metadata tables. End users will find that it satisfies the
widest possible set of queries and provides the greatest insight & guidance
when used in an ad hoc manner for strategic goals.
If you too have tried this approach, I'd love to hear about your experience. It seems no one has written about this.