Wednesday, September 3, 2025

Show Referential Integrity Relationship Constraints and Rules

The following commands show referential foreign key/primary key constraints and the rules that are defined in the database. The first command uses the INFORMATION_SCHEMA views and thus may work across multiple DBMS vendors. It also generates an ALTER TABLE command that could recreate each relationship. The second command is based on SQL Server system tables, but can be easily adapted to other DBMSs. Both use the STRING_AGG function to aggregate the column names for compound keys.


--Show referential constraints and rules.
SELECT RC.CONSTRAINT_NAME, MATCH_OPTION, UPDATE_RULE, DELETE_RULE
,KCU1.CONSTRAINT_NAME AS FK_CONSTRAINT
,KCU1.TABLE_SCHEMA AS FK_SCHEMA 
,KCU1.TABLE_NAME AS FK_TABLE
,FK_COLUMNS
,RC.UNIQUE_CONSTRAINT_NAME AS PK_CONSTRAINT
,KCU2.TABLE_SCHEMA AS PK_SCHEMA 
,KCU2.TABLE_NAME AS PK_TABLE
,PK_COLUMNS
,'ALTER TABLE '+KCU1.TABLE_SCHEMA+'.'+KCU1.TABLE_NAME+' ADD CONSTRAINT '+RC.CONSTRAINT_NAME
+' FOREIGN KEY ('+FK_COLUMNS+') REFERENCES '+KCU2.TABLE_SCHEMA+'.'+KCU2.TABLE_NAME+' ('+PK_COLUMNS
+') ON DELETE '+DELETE_RULE+' ON UPDATE '+UPDATE_RULE+';' AS COMMAND
FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS AS RC 
JOIN (
	SELECT CONSTRAINT_CATALOG, CONSTRAINT_SCHEMA, CONSTRAINT_NAME, TABLE_SCHEMA, TABLE_NAME
	,STRING_AGG(COLUMN_NAME, ', ') WITHIN GROUP (ORDER BY ORDINAL_POSITION) AS FK_COLUMNS
	FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
	GROUP BY CONSTRAINT_CATALOG, CONSTRAINT_SCHEMA, CONSTRAINT_NAME, TABLE_SCHEMA, TABLE_NAME
) AS KCU1 
ON  KCU1.CONSTRAINT_CATALOG = RC.CONSTRAINT_CATALOG  
AND KCU1.CONSTRAINT_SCHEMA = RC.CONSTRAINT_SCHEMA 
AND KCU1.CONSTRAINT_NAME = RC.CONSTRAINT_NAME 
JOIN (
	SELECT CONSTRAINT_CATALOG, CONSTRAINT_SCHEMA, CONSTRAINT_NAME, TABLE_SCHEMA, TABLE_NAME
	,STRING_AGG(COLUMN_NAME, ', ') WITHIN GROUP (ORDER BY ORDINAL_POSITION) AS PK_COLUMNS
	FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
	GROUP BY CONSTRAINT_CATALOG, CONSTRAINT_SCHEMA, CONSTRAINT_NAME, TABLE_SCHEMA, TABLE_NAME
) AS KCU2
ON  KCU2.CONSTRAINT_CATALOG = RC.UNIQUE_CONSTRAINT_CATALOG  
AND KCU2.CONSTRAINT_SCHEMA = RC.UNIQUE_CONSTRAINT_SCHEMA 
AND KCU2.CONSTRAINT_NAME = RC.UNIQUE_CONSTRAINT_NAME 
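As an illustration, here is the kind of statement the COMMAND column produces for a two-column foreign key (the schema, table, and constraint names here are hypothetical):

```sql
--Illustrative output only; the names are hypothetical.
ALTER TABLE Sales.OrderDetail ADD CONSTRAINT FK_OrderDetail_Order
 FOREIGN KEY (OrderID, CompanyID) REFERENCES Sales.[Order] (OrderID, CompanyID)
 ON DELETE CASCADE ON UPDATE NO ACTION;
```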

 --Show PK/FK referential integrity relationships.
SELECT FK.name	ForeignKeyName
, OBJECT_NAME(FK.referenced_object_id)	ParentTable
, OBJECT_NAME(FK.parent_object_id)	ChildTable
, STRING_AGG(PC.Name,', ') WITHIN GROUP (ORDER BY FKC.constraint_column_id)	PrimaryColumns
, STRING_AGG(FC.Name,', ') WITHIN GROUP (ORDER BY FKC.constraint_column_id)	ForeignColumns
, is_disabled, is_not_trusted, delete_referential_action_desc, update_referential_action_desc
FROM SYS.FOREIGN_KEYS FK
JOIN SYS.FOREIGN_KEY_COLUMNS FKC
ON FKC.constraint_object_id = FK.object_id
JOIN SYS.COLUMNS PC
ON PC.Object_id = FKC.referenced_object_id
AND PC.Column_id = FKC.referenced_column_id
JOIN SYS.COLUMNS FC
ON FC.Object_id = FKC.parent_object_id
AND FC.Column_id = FKC.parent_column_id
--WHERE OBJECT_NAME(FK.referenced_object_id) IN ('LM04')
GROUP BY FK.Name, FK.parent_object_id, FK.referenced_object_id
, is_disabled, is_not_trusted, delete_referential_action_desc, update_referential_action_desc
ORDER BY FK.referenced_object_id, FK.Name

Tuesday, September 2, 2025

How to Profile Columns and their Data

Profiling the metadata and data in every column of every table is usually done with a purchased package; doing it yourself is a rather big undertaking. Most do-it-yourself approaches perform a full table scan (FTS) for every column in the table, which can be unacceptably time-consuming. The following script instead collects this data from the source tables 50 columns at a time, so a single pass over a table profiles up to 50 columns at once.
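To make the mechanics concrete, this is the shape of the INSERT statement the cursor loop below generates and executes for each batch of columns, shown for a hypothetical two-column table Financials.Invoice and wrapped here for readability:

```sql
--Illustrative generated statement; the schema, table, and column names are hypothetical.
insert into [dbo].Column_Profile_50 (SchemaName, TableName, Records, c1, n1, d1, a1, b1, c2, n2, d2, a2, b2)
select 'Financials', 'Invoice', count(*)
, '[InvoiceID]', count([InvoiceID]), count(distinct [InvoiceID]), min([InvoiceID]), max([InvoiceID])
, '[Amount]', count([Amount]), count(distinct [Amount]), min([Amount]), max([Amount])
 from Financials.[Invoice];
```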

The data collected will show for every table and column the following information in a [Column_Profile] table:

  • Schema Name
  • Table Name
  • Column_ID
  • Column Name
  • Extended Property
  • [Column Name] (shown with square brackets and embedded spaces)
  • [Keep] (a subjective determination of whether the column has any significant data; e.g. all NULL or only one value)
  • Percentage data with non-NULL values
  • Data Type
  • data_type_ext (includes size definition)
  • Is the column Nullable? Yes/No
  • Default_Value
  • Is a Primary Key column "PK"
  • Is a Foreign Key column "FK"
  • Is a Unique Key column "UK"
  • Check_Constraint
  • Computed Column Definition
  • Column Collation Map
  • Number of Records
  • Number of Not NULL Values
  • Number of Distinct Values
  • Minimum Value in this column over the table (this is a SQL_VARIANT column)
  • Maximum Value in this column

DROP TABLE IF EXISTS [dbo].[Column_Profile_50];

--Create staging table to hold data profile.
CREATE TABLE [dbo].[Column_Profile_50]
( [SchemaName] varchar(50) NOT NULL	
, [TableName]  varchar(50) NOT NULL
, [Records] int NULL
, c1	varchar(70) --[ColumnName]			
, n1	int 		--[Not_NULL]
, d1	int 		--[Distinct_Values]
, a1	SQL_VARIANT	--[Min_Value]		
, b1	SQL_VARIANT	--[Max_Value]		
, c2 varchar(70), n2 int, d2 int, a2 SQL_VARIANT, b2 SQL_VARIANT 
, c3 varchar(70), n3 int, d3 int, a3 SQL_VARIANT, b3 SQL_VARIANT 
, c4 varchar(70), n4 int, d4 int, a4 SQL_VARIANT, b4 SQL_VARIANT 
, c5 varchar(70), n5 int, d5 int, a5 SQL_VARIANT, b5 SQL_VARIANT 
, c6 varchar(70), n6 int, d6 int, a6 SQL_VARIANT, b6 SQL_VARIANT 
, c7 varchar(70), n7 int, d7 int, a7 SQL_VARIANT, b7 SQL_VARIANT 
, c8 varchar(70), n8 int, d8 int, a8 SQL_VARIANT, b8 SQL_VARIANT 
, c9 varchar(70), n9 int, d9 int, a9 SQL_VARIANT, b9 SQL_VARIANT 
, c10 varchar(70), n10 int, d10 int, a10 SQL_VARIANT, b10 SQL_VARIANT 
, c11 varchar(70), n11 int, d11 int, a11 SQL_VARIANT, b11 SQL_VARIANT 
, c12 varchar(70), n12 int, d12 int, a12 SQL_VARIANT, b12 SQL_VARIANT 
, c13 varchar(70), n13 int, d13 int, a13 SQL_VARIANT, b13 SQL_VARIANT 
, c14 varchar(70), n14 int, d14 int, a14 SQL_VARIANT, b14 SQL_VARIANT 
, c15 varchar(70), n15 int, d15 int, a15 SQL_VARIANT, b15 SQL_VARIANT 
, c16 varchar(70), n16 int, d16 int, a16 SQL_VARIANT, b16 SQL_VARIANT 
, c17 varchar(70), n17 int, d17 int, a17 SQL_VARIANT, b17 SQL_VARIANT 
, c18 varchar(70), n18 int, d18 int, a18 SQL_VARIANT, b18 SQL_VARIANT 
, c19 varchar(70), n19 int, d19 int, a19 SQL_VARIANT, b19 SQL_VARIANT 
, c20 varchar(70), n20 int, d20 int, a20 SQL_VARIANT, b20 SQL_VARIANT 
, c21 varchar(70), n21 int, d21 int, a21 SQL_VARIANT, b21 SQL_VARIANT 
, c22 varchar(70), n22 int, d22 int, a22 SQL_VARIANT, b22 SQL_VARIANT 
, c23 varchar(70), n23 int, d23 int, a23 SQL_VARIANT, b23 SQL_VARIANT 
, c24 varchar(70), n24 int, d24 int, a24 SQL_VARIANT, b24 SQL_VARIANT 
, c25 varchar(70), n25 int, d25 int, a25 SQL_VARIANT, b25 SQL_VARIANT 
, c26 varchar(70), n26 int, d26 int, a26 SQL_VARIANT, b26 SQL_VARIANT 
, c27 varchar(70), n27 int, d27 int, a27 SQL_VARIANT, b27 SQL_VARIANT 
, c28 varchar(70), n28 int, d28 int, a28 SQL_VARIANT, b28 SQL_VARIANT 
, c29 varchar(70), n29 int, d29 int, a29 SQL_VARIANT, b29 SQL_VARIANT 
, c30 varchar(70), n30 int, d30 int, a30 SQL_VARIANT, b30 SQL_VARIANT 
, c31 varchar(70), n31 int, d31 int, a31 SQL_VARIANT, b31 SQL_VARIANT 
, c32 varchar(70), n32 int, d32 int, a32 SQL_VARIANT, b32 SQL_VARIANT 
, c33 varchar(70), n33 int, d33 int, a33 SQL_VARIANT, b33 SQL_VARIANT 
, c34 varchar(70), n34 int, d34 int, a34 SQL_VARIANT, b34 SQL_VARIANT 
, c35 varchar(70), n35 int, d35 int, a35 SQL_VARIANT, b35 SQL_VARIANT 
, c36 varchar(70), n36 int, d36 int, a36 SQL_VARIANT, b36 SQL_VARIANT 
, c37 varchar(70), n37 int, d37 int, a37 SQL_VARIANT, b37 SQL_VARIANT 
, c38 varchar(70), n38 int, d38 int, a38 SQL_VARIANT, b38 SQL_VARIANT 
, c39 varchar(70), n39 int, d39 int, a39 SQL_VARIANT, b39 SQL_VARIANT 
, c40 varchar(70), n40 int, d40 int, a40 SQL_VARIANT, b40 SQL_VARIANT 
, c41 varchar(70), n41 int, d41 int, a41 SQL_VARIANT, b41 SQL_VARIANT 
, c42 varchar(70), n42 int, d42 int, a42 SQL_VARIANT, b42 SQL_VARIANT 
, c43 varchar(70), n43 int, d43 int, a43 SQL_VARIANT, b43 SQL_VARIANT 
, c44 varchar(70), n44 int, d44 int, a44 SQL_VARIANT, b44 SQL_VARIANT 
, c45 varchar(70), n45 int, d45 int, a45 SQL_VARIANT, b45 SQL_VARIANT 
, c46 varchar(70), n46 int, d46 int, a46 SQL_VARIANT, b46 SQL_VARIANT 
, c47 varchar(70), n47 int, d47 int, a47 SQL_VARIANT, b47 SQL_VARIANT 
, c48 varchar(70), n48 int, d48 int, a48 SQL_VARIANT, b48 SQL_VARIANT 
, c49 varchar(70), n49 int, d49 int, a49 SQL_VARIANT, b49 SQL_VARIANT 
, c50 varchar(70), n50 int, d50 int, a50 SQL_VARIANT, b50 SQL_VARIANT 
)

--SELECT * FROM [dbo].[Column_Profile_50];
--TRUNCATE TABLE [dbo].[Column_Profile_50];

--Collect column profile data 50 columns at a time in Full Table Scans.		
DECLARE @table_schema varchar(100)		
DECLARE @table_name varchar(100)		
DECLARE @column_name varchar(100)		
DECLARE @current_table varchar(100)		
DECLARE @current_schema varchar(100)		
DECLARE @sql varchar(MAX)		
DECLARE @sql1 varchar(MAX)		
DECLARE @sql2 varchar(MAX)		
DECLARE @col int = 0

--Get columns for data profile.
DECLARE Cursor01 CURSOR LOCAL FAST_FORWARD FOR		
	SELECT A.TABLE_SCHEMA, A.TABLE_NAME, COLUMN_NAME 
	FROM INFORMATION_SCHEMA.COLUMNS a	
--	left join [DW].[Column_Profile_temp] x on a.TABLE_SCHEMA=x.SchemaName and a.TABLE_NAME=x.tableName and concat('[',a.COLUMN_NAME,']')=x.columnname  --this is to continue after failure
--	where x.TableName is null
	WHERE DATA_TYPE not in ('bit','text','ntext')	--Do not profile data in columns with these data types.
	AND ISNULL(CHARACTER_OCTET_LENGTH, 0) <> -1	--Exclude (MAX) columns; CHARACTER_OCTET_LENGTH is NULL for non-character columns, which should be kept.
--	AND a.TABLE_SCHEMA = 'Financials' --Change this filter if needed.
--	AND a.TABLE_NAME NOT LIKE '%TEST%'
	order by 1,2,3
OPEN Cursor01;		
FETCH NEXT FROM Cursor01 into @table_schema, @table_name, @column_name;		
WHILE @@FETCH_STATUS = 0		
BEGIN	
	--If new table then start building query
	IF @col=0
	BEGIN
		set @sql1 = 'insert into [dbo].Column_Profile_50 (SchemaName, TableName, Records'
		set @sql2 = replace(replace('select ''!'', ''^'', count(*)', '^',@table_name), '!',@table_schema)
	END
	--Add column to query
	set @col=@col+1
	set @sql1=concat(@sql1, replace(', c^, n^, d^, a^, b^', '^', @col))
	set @sql2=concat(@sql2, replace(', ''[?]'', count([?]), count(distinct [?]), min([?]), max([?])', '?',@column_name))
	--Get next column
	set @current_table  = @table_name
	set @current_schema = @table_schema
	FETCH NEXT FROM Cursor01 into @table_schema, @table_name, @column_name;	
	--If table name changed then finish query and execute
	IF @table_name<>@current_table OR @table_schema<>@current_schema OR @col=50 OR @@FETCH_STATUS <> 0
	BEGIN
		set @sql=concat(@sql1, ') ', @sql2, replace(replace(' from [!].[^];', '^', @current_table), '!', @current_schema))
		PRINT @sql
		EXEC(@sql)
		set @col=0
	END
END		
CLOSE Cursor01;		
DEALLOCATE Cursor01;
GO

DROP TABLE IF EXISTS [dbo].[Column_Profile_temp]		

CREATE TABLE [dbo].[Column_Profile_temp](		
	[SchemaName] [varchar](50) NOT NULL,	
	[TableName] [varchar](50) NOT NULL,	
	[ColumnName] [varchar](70) NOT NULL,	
	[Records] [int] NULL,	
	[Not_NULL] [int] NULL,	
	[Distinct_Values] [int] NULL,	
	[Min_Value] SQL_VARIANT,	
	[Max_Value] SQL_VARIANT	
)

--Convert from 50 source columns per row to 1 row per source column.
INSERT INTO [dbo].[Column_Profile_temp]
SELECT [SchemaName], [TableName], c1, [Records], n1, d1, a1, b1 FROM [dbo].[Column_Profile_50] WHERE c1 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c2, [Records], n2, d2, a2, b2 FROM [dbo].[Column_Profile_50] WHERE c2 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c3, [Records], n3, d3, a3, b3 FROM [dbo].[Column_Profile_50] WHERE c3 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c4, [Records], n4, d4, a4, b4 FROM [dbo].[Column_Profile_50] WHERE c4 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c5, [Records], n5, d5, a5, b5 FROM [dbo].[Column_Profile_50] WHERE c5 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c6, [Records], n6, d6, a6, b6 FROM [dbo].[Column_Profile_50] WHERE c6 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c7, [Records], n7, d7, a7, b7 FROM [dbo].[Column_Profile_50] WHERE c7 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c8, [Records], n8, d8, a8, b8 FROM [dbo].[Column_Profile_50] WHERE c8 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c9, [Records], n9, d9, a9, b9 FROM [dbo].[Column_Profile_50] WHERE c9 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c10, [Records], n10, d10, a10, b10 FROM [dbo].[Column_Profile_50] WHERE c10 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c11, [Records], n11, d11, a11, b11 FROM [dbo].[Column_Profile_50] WHERE c11 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c12, [Records], n12, d12, a12, b12 FROM [dbo].[Column_Profile_50] WHERE c12 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c13, [Records], n13, d13, a13, b13 FROM [dbo].[Column_Profile_50] WHERE c13 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c14, [Records], n14, d14, a14, b14 FROM [dbo].[Column_Profile_50] WHERE c14 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c15, [Records], n15, d15, a15, b15 FROM [dbo].[Column_Profile_50] WHERE c15 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c16, [Records], n16, d16, a16, b16 FROM [dbo].[Column_Profile_50] WHERE c16 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c17, [Records], n17, d17, a17, b17 FROM [dbo].[Column_Profile_50] WHERE c17 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c18, [Records], n18, d18, a18, b18 FROM [dbo].[Column_Profile_50] WHERE c18 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c19, [Records], n19, d19, a19, b19 FROM [dbo].[Column_Profile_50] WHERE c19 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c20, [Records], n20, d20, a20, b20 FROM [dbo].[Column_Profile_50] WHERE c20 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c21, [Records], n21, d21, a21, b21 FROM [dbo].[Column_Profile_50] WHERE c21 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c22, [Records], n22, d22, a22, b22 FROM [dbo].[Column_Profile_50] WHERE c22 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c23, [Records], n23, d23, a23, b23 FROM [dbo].[Column_Profile_50] WHERE c23 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c24, [Records], n24, d24, a24, b24 FROM [dbo].[Column_Profile_50] WHERE c24 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c25, [Records], n25, d25, a25, b25 FROM [dbo].[Column_Profile_50] WHERE c25 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c26, [Records], n26, d26, a26, b26 FROM [dbo].[Column_Profile_50] WHERE c26 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c27, [Records], n27, d27, a27, b27 FROM [dbo].[Column_Profile_50] WHERE c27 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c28, [Records], n28, d28, a28, b28 FROM [dbo].[Column_Profile_50] WHERE c28 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c29, [Records], n29, d29, a29, b29 FROM [dbo].[Column_Profile_50] WHERE c29 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c30, [Records], n30, d30, a30, b30 FROM [dbo].[Column_Profile_50] WHERE c30 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c31, [Records], n31, d31, a31, b31 FROM [dbo].[Column_Profile_50] WHERE c31 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c32, [Records], n32, d32, a32, b32 FROM [dbo].[Column_Profile_50] WHERE c32 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c33, [Records], n33, d33, a33, b33 FROM [dbo].[Column_Profile_50] WHERE c33 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c34, [Records], n34, d34, a34, b34 FROM [dbo].[Column_Profile_50] WHERE c34 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c35, [Records], n35, d35, a35, b35 FROM [dbo].[Column_Profile_50] WHERE c35 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c36, [Records], n36, d36, a36, b36 FROM [dbo].[Column_Profile_50] WHERE c36 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c37, [Records], n37, d37, a37, b37 FROM [dbo].[Column_Profile_50] WHERE c37 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c38, [Records], n38, d38, a38, b38 FROM [dbo].[Column_Profile_50] WHERE c38 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c39, [Records], n39, d39, a39, b39 FROM [dbo].[Column_Profile_50] WHERE c39 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c40, [Records], n40, d40, a40, b40 FROM [dbo].[Column_Profile_50] WHERE c40 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c41, [Records], n41, d41, a41, b41 FROM [dbo].[Column_Profile_50] WHERE c41 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c42, [Records], n42, d42, a42, b42 FROM [dbo].[Column_Profile_50] WHERE c42 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c43, [Records], n43, d43, a43, b43 FROM [dbo].[Column_Profile_50] WHERE c43 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c44, [Records], n44, d44, a44, b44 FROM [dbo].[Column_Profile_50] WHERE c44 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c45, [Records], n45, d45, a45, b45 FROM [dbo].[Column_Profile_50] WHERE c45 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c46, [Records], n46, d46, a46, b46 FROM [dbo].[Column_Profile_50] WHERE c46 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c47, [Records], n47, d47, a47, b47 FROM [dbo].[Column_Profile_50] WHERE c47 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c48, [Records], n48, d48, a48, b48 FROM [dbo].[Column_Profile_50] WHERE c48 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c49, [Records], n49, d49, a49, b49 FROM [dbo].[Column_Profile_50] WHERE c49 IS NOT NULL UNION ALL
SELECT [SchemaName], [TableName], c50, [Records], n50, d50, a50, b50 FROM [dbo].[Column_Profile_50] WHERE c50 IS NOT NULL
ORDER BY 1,2,3;
GO

DROP TABLE IF EXISTS [dbo].[Column_Profile];

CREATE TABLE [dbo].Column_Profile(
	[Schema_Name] nvarchar(128) NULL,
	Table_Name sysname NOT NULL,
	Column_ID int NULL,
	Column_Name sysname NOT NULL,
	Extended_Property SQL_VARIANT NULL,
	[Column Name] varchar(130) NULL,
	[Keep] varchar(1) NULL,
	Pct_Valued int NULL,
	Data_Type sysname NULL,
	data_type_ext nvarchar(269) NULL,
	Nullable varchar(3) NULL,
	Default_Value nvarchar(max) NULL,
	Primary_Key varchar(2) NULL,
	Foreign_Key varchar(2) NULL,
	Unique_Key varchar(2) NULL,
	Check_Constraint varchar(5) NULL,
	computed_column_definition nvarchar(max) NULL,
	ColumnCollation varchar(50),
	Records int NULL,
	Not_NULL int NULL,
	Distinct_Values int NULL,
	Min_Value SQL_VARIANT,
	Max_Value SQL_VARIANT
)
GO

 --Load column profile.
INSERT INTO [dbo].[Column_Profile]
SELECT schema_name(tbl.schema_id) AS Schema_Name
,tbl.NAME AS Table_Name
,col.Column_ID
,col.NAME AS Column_Name
,ep.value AS Extended_Property
,'['+TRIM(CAST(eq.value AS VARCHAR(50)))+' - '+TRIM(CAST(ep.value AS VARCHAR(75)))+']' AS [Column Name]

,case
 when ep.value is NULL THEN 'N'
 when P.[Records] < 2 then 'N'
 when P.[Distinct_Values] = 1 then 'N'
-- when 100.0*P.[Not_NULL]/CASE P.[Records] when 0 then 1 else P.[Records] end < 10 then 'N'
 when CAST(P.[Not_NULL] AS FLOAT)/CAST(P.[Records] AS FLOAT) < 0.005 then 'N'
 else 'Y'
 end AS Keep	--Determine if this column has any significant data (this is subjective).
 
,FLOOR(100.0*P.[Not_NULL]/CASE P.[Records] when 0 then 1 else P.[Records] end) AS Pct_Valued
,t.NAME AS Data_Type
,t.NAME 
+	CASE 
	WHEN t.is_user_defined = 0
	THEN isnull( '(' + 
		CASE 
		WHEN t.NAME IN ('binary', 'char', 'nchar', 'varchar', 'nvarchar', 'varbinary')
		THEN
			CASE col.max_length
			WHEN - 1
			THEN 'MAX'
			ELSE 
				CASE 
				WHEN t.NAME IN ('nchar', 'nvarchar')
				THEN cast(col.max_length / 2 AS VARCHAR(4))
				ELSE cast(col.max_length AS VARCHAR(4))
				END
			END
		WHEN t.NAME IN ('datetime2', 'datetimeoffset', 'time')
		THEN cast(col.scale AS VARCHAR(4))
		WHEN t.NAME IN ('decimal', 'numeric')
		THEN cast(col.precision AS VARCHAR(4)) + ', ' + cast(col.scale AS VARCHAR(4))
		END + ')', '')
	ELSE ':' + (
		SELECT c_t.NAME + isnull('(' + 
			CASE 
			WHEN c_t.NAME IN ('binary', 'char', 'nchar', 'varchar', 'nvarchar', 'varbinary')
			THEN 
				CASE c.max_length
				WHEN - 1
				THEN 'MAX'
				ELSE 
					CASE 
					WHEN c_t.NAME IN ('nchar', 'nvarchar')
					THEN cast(c.max_length / 2 AS VARCHAR(4))
					ELSE cast(c.max_length AS VARCHAR(4))
					END
				END
			WHEN c_t.NAME IN ('datetime2', 'datetimeoffset', 'time')
			THEN cast(c.scale AS VARCHAR(4))
			WHEN c_t.NAME IN ('decimal', 'numeric')
			THEN cast(c.precision AS VARCHAR(4)) + ', ' + cast(c.scale AS VARCHAR(4))
			END + ')', '')
		FROM sys.columns AS c
		INNER JOIN sys.types AS c_t ON c.system_type_id = c_t.user_type_id
		WHERE c.object_id = col.object_id AND c.column_id = col.column_id AND c.user_type_id = col.user_type_id
			)
	END AS data_type_ext
,CASE WHEN col.is_nullable = 0 THEN 'No' ELSE 'Yes' END AS Nullable
,CASE WHEN def.DEFINITION IS NOT NULL THEN def.DEFINITION ELSE '' END AS Default_Value
,CASE WHEN pk.column_id IS NOT NULL THEN 'PK' ELSE '' END AS Primary_Key
,CASE WHEN fk.parent_column_id IS NOT NULL THEN 'FK' ELSE '' END AS Foreign_Key
,CASE WHEN uk.column_id IS NOT NULL THEN 'UK' ELSE '' END AS Unique_Key
,CASE WHEN ch.check_const IS NOT NULL THEN ch.check_const ELSE '' END AS Check_Constraint
,cc.DEFINITION AS computed_column_definition
,col.collation_name AS ColumnCollation
,P.[Records]
,P.[Not_NULL]
,P.[Distinct_Values]
,P.[Min_Value]
,P.[Max_Value]
FROM sys.tables AS tbl
JOIN sys.columns AS col ON tbl.object_id = col.object_id
JOIN sys.types AS t ON col.user_type_id = t.user_type_id
LEFT JOIN sys.default_constraints AS def ON def.object_id = col.default_object_id
LEFT JOIN (
	SELECT index_columns.object_id, index_columns.column_id
	FROM sys.index_columns
	INNER JOIN sys.indexes ON index_columns.object_id = indexes.object_id AND index_columns.index_id = indexes.index_id
	WHERE indexes.is_primary_key = 1
) AS pk ON col.object_id = pk.object_id AND col.column_id = pk.column_id
LEFT JOIN (
	SELECT fc.parent_column_id, fc.parent_object_id
	FROM sys.foreign_keys AS f
	INNER JOIN sys.foreign_key_columns AS fc ON f.object_id = fc.constraint_object_id
	GROUP BY fc.parent_column_id, fc.parent_object_id
) AS fk ON fk.parent_object_id = col.object_id AND fk.parent_column_id = col.column_id
LEFT JOIN (
	SELECT c.parent_column_id, c.parent_object_id, 'Check' check_const
	FROM sys.check_constraints AS c
	GROUP BY c.parent_column_id, c.parent_object_id
) AS ch ON col.column_id = ch.parent_column_id AND col.object_id = ch.parent_object_id
LEFT JOIN (
	SELECT index_columns.object_id, index_columns.column_id
	FROM sys.index_columns
	INNER JOIN sys.indexes ON indexes.index_id = index_columns.index_id AND indexes.object_id = index_columns.object_id
	WHERE indexes.is_unique_constraint = 1
	GROUP BY index_columns.object_id, index_columns.column_id
) AS uk ON col.column_id = uk.column_id AND col.object_id = uk.object_id
LEFT JOIN sys.extended_properties AS eq ON eq.major_id=tbl.object_id AND eq.minor_id=0 AND eq.class=1 AND eq.NAME = 'MS_Description'
LEFT JOIN sys.extended_properties AS ep 
	ON tbl.object_id = ep.major_id
	AND col.column_id = ep.minor_id
	AND ep.NAME = 'MS_Description'
	AND ep.class_desc = 'OBJECT_OR_COLUMN'
LEFT JOIN sys.computed_columns AS cc ON tbl.object_id = cc.object_id AND col.column_id = cc.column_id
LEFT JOIN [dbo].[Column_Profile_temp] P ON P.SchemaName = schema_name(tbl.schema_id) AND P.TableName = tbl.NAME AND P.ColumnName = CONCAT('[',col.NAME,']') --If you created this table		
--WHERE SCHEMA_NAME(tbl.schema_id) = 'Financials' --Filter if needed.
--AND tbl.name NOT LIKE '%TEST%'
ORDER BY 1,2,3;
GO

SELECT * FROM [dbo].[Column_Profile] ORDER BY 1,2,3
GO

Monday, September 1, 2025

Table Profile

The following SQL Server code will create a "Table_Profile" that shows the following table metadata and static & dynamic volumetrics: 
  • Table Owner 
  • Schema Name 
  • Table Name 
  • Number of Columns 
  • Number of Rows 
  • Size (MB) 
  • Has Primary Key 
  • Has Identity Column 
  • Has Unique Constraint 
  • Has Cluster Index 
  • Has an Index 
  • Has a Foreign Key Constraint to a Parent Table 
  • Is Referenced by other Child tables having a Foreign Key Constraint 
  • Number of Referenced Parent Tables 
  • Number of Referenced Child Tables 
  • Has Check Constraint 
  • Has Text Image (Blob) 
  • Has Active Full Text Index 
  • Has Row Guid Column 
  • Has Timestamp 
  • Has Trigger 
  • Number of Index Seeks 
  • Number of Index Scans 
  • Number of Index Lookups 
  • Number of Index Updates 
  • Create Date 
  • Modify Date 
  • Last User Update 
  • Last User Index Read 
  • Extended Property Name 
  • Extended Property Value (This is a table description you can assign using EXEC sys.sp_addextendedproperty)
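For example, a table description can be assigned ahead of time like this (the schema, table, and description used here are hypothetical):

```sql
--Assign a table-level description; the query below reads it back as the extended property.
EXEC sys.sp_addextendedproperty
  @name = N'MS_Description'
, @value = N'Monthly general-ledger balances'
, @level0type = N'SCHEMA', @level0name = N'Financials'
, @level1type = N'TABLE',  @level1name = N'GL_Balance';
```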


 --Display table metadata and static & dynamic volumetrics.
DROP TABLE IF EXISTS [dbo].[Table_Profile];	--Allow rerun; the SELECT INTO below recreates the table.
SELECT 
  COALESCE((SELECT pr.name 		
            FROM sys.database_principals pr 		
            WHERE pr.principal_id = tbl.principal_id)		
           , SCHEMA_NAME(tbl.schema_id)) AS [Owner]		
, SCHEMA_NAME(tbl.schema_id) as [Schema]
, tbl.Name		
, tbl.max_column_id_used as [Columns]
--, (SELECT COUNT(*) 
--	FROM INFORMATION_SCHEMA.COLUMNS a
--	WHERE a.table_schema = SCHEMA_NAME(tbl.schema_id) AND a.table_name = tbl.Name ) AS [#Columns] 
, Coalesce( (Select sum (spart.rows) from sys.partitions spart 		
    Where spart.object_id = tbl.object_id and spart.index_id < 2), 0) AS [Rows]		
, "Size (MB)"
, OBJECTPROPERTY(tbl.object_id, 'TableHasPrimaryKey') AS [PK]
, OBJECTPROPERTY(tbl.object_id, 'TableHasIdentity')   AS [Identity Col]
, OBJECTPROPERTY(tbl.object_id, 'TableHasUniqueCnst') AS [Unique Cnt]
, OBJECTPROPERTY(tbl.object_id, 'TableHasClustIndex') AS [ClustIdx]
, OBJECTPROPERTY(tbl.object_id, 'TableHasIndex')      AS [Indexed]
, OBJECTPROPERTY(tbl.object_id, 'TableHasForeignKey') AS [FK Cnt]
, OBJECTPROPERTY(tbl.object_id, 'TableHasForeignRef') AS [FK Ref]
, Coalesce(A.Parents, 0) Parents
, Coalesce(C.Children, 0 ) Children
, OBJECTPROPERTY(tbl.object_id, 'TableHasCheckCnst')  AS [Check Cnt]
, OBJECTPROPERTY(tbl.object_id, 'TableHasTextImage')  AS [Blob]
, OBJECTPROPERTY(tbl.object_id, 'TableHasActiveFulltextIndex') AS [FT index]
, OBJECTPROPERTY(tbl.object_id, 'TableHasRowGuidCol') AS [ROWGUIDCOL]
, OBJECTPROPERTY(tbl.object_id, 'TableHasTimestamp')  AS [Timestamp]
, CASE 
  WHEN OBJECTPROPERTY(tbl.object_id, 'HasDeleteTrigger') = 1 THEN 1 
  WHEN OBJECTPROPERTY(tbl.object_id, 'HasInsertTrigger') = 1 THEN 1
  WHEN OBJECTPROPERTY(tbl.object_id, 'HasUpdateTrigger') = 1 THEN 1
  WHEN OBJECTPROPERTY(tbl.object_id, 'HasAfterTrigger')  = 1 THEN 1
  WHEN OBJECTPROPERTY(tbl.object_id, 'HasInsteadOfTrigger') = 1 THEN 1
  ELSE 0
  END AS [Trigger]
, L.index_seeks [Idx Seeks], L.index_scans [Idx Scan], L.index_lookups [Idx Lookups], L.index_updates [Idx Updates]		
, tbl.create_date, tbl.modify_date		
, L.last_user_update		
, L.last_user_index_read		
, P.name AS ExtendedPropertyName		
, CAST(P.value AS sql_variant) AS ExtendedPropertyValue
INTO [dbo].[Table_Profile]
FROM sys.tables AS tbl		
LEFT JOIN (SELECT object_id, SUM(reserved_page_count)*8.0/1024 AS "Size (MB)" FROM sys.dm_db_partition_stats GROUP BY object_id) D ON D.object_id=tbl.object_id
LEFT JOIN (SELECT PARENT_OBJECT_ID, COUNT(*) AS Parents FROM SYS.FOREIGN_KEY_COLUMNS GROUP BY PARENT_OBJECT_ID) A ON A.PARENT_OBJECT_ID=tbl.object_id
LEFT JOIN (SELECT REFERENCED_OBJECT_ID, COUNT(*) AS Children FROM SYS.FOREIGN_KEY_COLUMNS GROUP BY REFERENCED_OBJECT_ID) C ON C.REFERENCED_OBJECT_ID=tbl.object_id
LEFT JOIN (		
	SELECT OBJECT_ID	
	, SUM(user_seeks) index_seeks, SUM(user_scans) index_scans, SUM(user_lookups) index_lookups, MAX(user_updates) index_updates	
	, MAX(last_user_update) last_user_update	
	, MAX(last_user_index_read) last_user_index_read	
	FROM (	
		SELECT OBJECT_ID, user_seeks, user_scans, user_lookups, user_updates, last_user_update
		, (SELECT MAX(LastUpdateDate) FROM (VALUES (last_user_seek),(last_user_scan),(last_user_lookup)) AS last_user_read(LastUpdateDate)) last_user_index_read
		FROM sys.dm_db_index_usage_stats
		) L
	GROUP BY OBJECT_ID	
	) L	
ON L.object_id = tbl.object_id		
LEFT JOIN sys.extended_properties AS p ON p.major_id=tbl.object_id AND p.minor_id=0 AND p.class=1
--WHERE SCHEMA_NAME(tbl.schema_id) = 'Financials' --Filter if needed
ORDER BY 1,2,3

Tuesday, May 31, 2022

Log T-SQL processing events

I frequently see questions about the best way to log events in T-SQL processing, including success and failure messages.  This method has the following advantages:

  • Minimal code needs to be added to your procedures.
  • Can log both informational status messages and error messages.
  • Events go into a log table and are displayed in the SSMS Messages pane; if an event is an error, the procedure “throws” the error to the calling program with the error detail info.
  • Error messages include the following:
    • Date and time
    • Application that is running the procedure
    • User that is running the procedure
    • Server name
    • Database name
    • Schema name
    • Procedure name (this is the calling procedure where the event happened)
    • Message to record, e.g. “Start of MyProc”, “Insert sales”
    • Line number of the error
    • Error Message
    • Error number
    • Error severity
    • Error state
    • (optionally) the SQL command causing the error

Start by creating the following table to log events:

CREATE TABLE DW.EventLog (
id            int IDENTITY(1,1) NOT NULL,
DateLogged    datetime NOT NULL DEFAULT (GETDATE()),
AppName       nvarchar(240),
UserName      nvarchar(128) DEFAULT (USER_NAME()),
ServerName    nvarchar(240),
DBName        nvarchar(240),
SchemaName    nvarchar(128),
ProcName      nvarchar(128),
[Message]     varchar(240),
ErrorLine     int,
ErrorMessage  nvarchar(4000),
ErrorNumber   int,
ErrorSeverity int,
ErrorState    int
)

Create the following procedure to log the event:

--Procedure to log error and success information.
--Call this using: EXEC DW.LogEvent 'Description of next step', @@PROCID
CREATE OR ALTER PROCEDURE [DW].[LogEvent] @Message VARCHAR(240) = '', @ProcID int = NULL
AS  
BEGIN

DECLARE 
  @AppName  		NVARCHAR(128)
, @UserName 		NVARCHAR(128)
, @ServerName		NVARCHAR(128)
, @DBName   		NVARCHAR(128)
, @SchemaName		NVARCHAR(128)
, @ProcName 		NVARCHAR(128)
, @ErrorLine		INT
, @ErrorMessage		NVARCHAR(4000)
, @ErrorNumber		INT
, @ErrorSeverity	INT
, @ErrorState		INT
, @FullMessage		NVARCHAR(max);

SET @AppName		= APP_NAME()
SET @UserName		= USER_NAME()
SET @ServerName		= @@SERVERNAME
SET @DBName 		= DB_NAME()
SET @SchemaName		= COALESCE(LEFT(ERROR_PROCEDURE(),NULLIF(CHARINDEX('.',ERROR_PROCEDURE()),0)-1), OBJECT_SCHEMA_NAME(@ProcID), OBJECT_SCHEMA_NAME(@@PROCID))	--NULLIF guards against a procedure name with no '.'
SET @ProcName		= COALESCE(SUBSTRING(ERROR_PROCEDURE(),CHARINDEX('.',ERROR_PROCEDURE())+1,128), OBJECT_NAME(@ProcID), OBJECT_NAME(@@PROCID))
SET @ErrorLine		= ERROR_LINE()
SET @ErrorMessage	= ERROR_MESSAGE()
SET @ErrorNumber	= ERROR_NUMBER()
SET @ErrorSeverity	= ERROR_SEVERITY()
SET @ErrorState		= ERROR_STATE()

--Log the error.
INSERT INTO [DW].[EventLog](
AppName,
ServerName,
DBName,
SchemaName,
ProcName,
[Message],
ErrorLine, 
ErrorMessage, 
ErrorNumber,  
ErrorSeverity, 
ErrorState
) 
SELECT
@AppName,
@ServerName,
@DBName,
@SchemaName,
@ProcName,
@Message,
@ErrorLine,  
@ErrorMessage,  
@ErrorNumber,
@ErrorSeverity,  
@ErrorState;

SET @FullMessage = CONCAT
( '['	, @ServerName
, '].['	, @DBName
, '].['	, @SchemaName
, '].['	, @ProcName
, ']: '	, @Message
);

IF @ErrorNumber IS NOT NULL
SET @FullMessage = LEFT(CONCAT
( @FullMessage
, '. ErrorLine: '      , TRIM(STR(@ErrorLine))
, '. ErrorMessage: '   , @ErrorMessage
, '. ErrorNumber: '    , TRIM(STR(@ErrorNumber))
, '. ErrorSeverity: '  , TRIM(STR(@ErrorSeverity))
, '. ErrorState: '     , TRIM(STR(@ErrorState))
, '. Application: '    , @AppName
, '. User: '           , @UserName
), 2048);

--Display the message.
PRINT @FullMessage;

--If this is an error then throw the error to the calling application, otherwise return to calling application.
IF @ErrorNumber IS NOT NULL
	THROW 50000, @FullMessage, @ErrorState;
ELSE
	RETURN

END

Here is an example of a procedure that is modified to log these events.  The changes you need to make to your code are:

  • Add “DECLARE @step VARCHAR(240);”
  • Add the TRY/CATCH.  Note: the @@PROCID is needed to accurately record the name of the current procedure.
  • Add the “EXEC DW.LogEvent…” lines to announce information that you want recorded.
  • Add “SET @step = 'what is next'” prior to each SQL command to describe where it failed.

CREATE PROCEDURE [DW].[test_error]
AS
BEGIN TRY
	DECLARE @step VARCHAR(240);

	EXEC DW.LogEvent 'Starting error test', @@PROCID;	--announce start of run


	--Body of the procedure follows
	SET @step = 'second step'	--describe each coming step
	SELECT 1/0


	EXEC DW.LogEvent 'Error test completed', @@PROCID;	--announce end of run

	RETURN

END TRY
BEGIN CATCH
	EXEC DW.LogEvent @step, @@PROCID;
END CATCH

The only thing missing is the actual text of the SQL command that caused the error. If that is needed you can execute your commands using the following procedure:

CREATE PROCEDURE [DW].[ExecuteSQL] @SQL NVARCHAR(max), @ProcID int = NULL
AS
--Procedure to execute a sql command and log the error and the offending sql text.
--Call this using: EXEC DW.ExecuteSQL 'SQL Command here', @@PROCID
BEGIN
	BEGIN TRY  
		DECLARE @SqlCommand	NVARCHAR(240)
		SET @SqlCommand = LEFT(CONCAT('SQL: ', @SQL), 240)
		--execute the sql command
		EXECUTE sp_executesql @SQL
	END TRY  
	BEGIN CATCH  
		PRINT 'ERROR WITH THE FOLLOWING COMMAND:'
		PRINT @SQL
		-- Log the error.  
		EXECUTE DW.LogEvent @SqlCommand, @ProcID;
	END CATCH;
END;
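For example, a load step can be wrapped in DW.ExecuteSQL so that a failure logs the statement text along with the error details (the table and column names below are hypothetical):

```sql
--Hypothetical usage: run a risky statement through DW.ExecuteSQL
--so a failure logs the offending statement text with the error.
DECLARE @SQL NVARCHAR(max) =
    N'UPDATE DW.SomeFact SET UnitPrice = Amount / NULLIF(Quantity, 0);';
EXEC DW.ExecuteSQL @SQL, @@PROCID;
```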

Wednesday, May 11, 2022

Introducing the Unified Star Schema

A few months after my prior blog post on the OneBigTable approach I learned of a new way to design Data Warehouses that is somewhat similar.  It is a new way to design star schemas and is known as the Unified Star Schema (USS).  It is recommended by the father of Data Warehousing, Bill Inmon, in a book that came out in the past year.  Once I heard about it I immediately went out and bought The Unified Star Schema: An Agile and Resilient Approach to Data Warehouse and Analytics Design, by Bill Inmon and Francesco Puppini, and read & underlined all 288 pages.  I agree this new approach is better than the Kimball multi-star approach.  USS is the best approach for building a ROLAP data warehouse using row-oriented tables.  My OneBigTable approach is the best way to build a data warehouse for a columnar shared-nothing hyper-converged DBMS.  (The book portrays the OBT in passing in Figure 14.15.)

The core of the approach is the construction of a single “Puppini Bridge” table that works like a single super-fact table of only keys and transforms all other tables to fact/dimensions that users can blend as they see fit.  Actually, the distinction between fact and dimension is gone.  The Puppini Bridge table has no attributes and no metrics, only primary & foreign keys.  It joins to all other tables like a star schema with no snowflaking.  It is built quickly by performing a UNION ALL of the primary key (PK) & foreign keys (FK) of every table. This transforms an OLTP normalized database into a performant OLAP star data warehouse by the addition of just one table and no other ETL.
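As a minimal sketch, assuming a hypothetical Customers → Orders → OrderLines chain (all names here are assumptions, not from the book), the bridge can be built as a view that UNION ALLs each table’s keys, with a Stage column identifying the source table:

```sql
--Minimal sketch of a Puppini Bridge: UNION ALL of each table's PK & FKs.
--Customers, Orders, and OrderLines are hypothetical tables.
CREATE VIEW DW.PuppiniBridge AS
SELECT 'Customers'  AS Stage, CustomerID
     , CAST(NULL AS int) AS OrderID, CAST(NULL AS int) AS OrderLineID
FROM dbo.Customers
UNION ALL
SELECT 'Orders', CustomerID, OrderID, NULL      --own PK plus FK to Customers
FROM dbo.Orders
UNION ALL
SELECT 'OrderLines', O.CustomerID, OL.OrderID, OL.OrderLineID
FROM dbo.OrderLines AS OL
JOIN dbo.Orders     AS O ON O.OrderID = OL.OrderID;  --join pulls in the ancestor FK (CustomerID)
```

Each table then joins to the bridge on its own key, star-style, with no snowflaking.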

USS works well with BI tools that use “associations” in relationships to solve the “fan trap”, such as Microsoft’s Power BI, QlikView, and Tableau (starting with version 2020.2), but not so well with BI tools that use SQL JOIN, like Tibco Spotfire and Business Objects.  (A fan trap is when a metric in a parent table gets duplicated across its child rows, causing incorrect aggregations.)
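The fan trap is easy to reproduce in plain SQL.  In this hypothetical example (table names are assumptions), the order total repeats once per order line, so summing it after the join inflates the result:

```sql
--Hypothetical fan-trap demo: Orders.OrderTotal repeats once per order line,
--so the SUM is inflated by the number of child rows per order.
SELECT O.OrderID, SUM(O.OrderTotal) AS InflatedTotal
FROM dbo.Orders     AS O
JOIN dbo.OrderLines AS OL ON OL.OrderID = O.OrderID
GROUP BY O.OrderID;
```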

The classic star schema approach causes a problem known as the “chasm trap” when two or more fact tables are used in a BI tool.  The display of data from multiple fact tables results in cartesian-join duplication.  The Puppini Bridge & OneBigTable overcome this situation, letting the details show in a logical list of one fact followed by the other fact.

The book left out one point that I think is essential to building the Puppini Bridge:  One should include not only each record’s PK & FKs but also the FKs to all ancestor tables (parent of parent, etc.).

I also learned an important point when building the OneBigTable:  attributes should be duplicated down but metrics should only show on the record that provided the PK.

Most exciting, I believe USS allows the rapid generation of code that creates the solution if the referential constraints are all defined in the database.  This also makes the solution more maintainable and robust.  I plan to share this technique in a future post.

BTW, a company named Incorta pitches a data warehouse that they say is the “Death of the Star Schema.”  The USS book mentions it in Figure 14.6 and it appears to have nothing in common with USS or OBT.

CONCLUSION:  USS & OBT are better than classic star design in usability, maintenance and development time.  Multi-star solutions should be replaced with USS in row-oriented databases and OBT should be considered for MPP columnar databases like AWS Redshift & Azure Synapse.

At this point no one has a valid reason to stick with classic star schemas.

Wednesday, February 2, 2022

The Fully Denormalized, United, Columnar Approach

Star schemas considered harmful.  They were born for ROLAP using row-based DBMSs.  In today’s MPP columnar systems using One Big Table (OBT) is the better approach.  You may think this is madness, and I would have agreed with you—until I tried it.

Compared to the use of multiple star schemas, a single-table columnar approach provides improved ease of use, lower maintenance cost, and better query performance.  Read on to find out why.

Some technical knowledge is needed to understand the driving trade-offs between normalized, star (columnar or not), and denormalized columnar structures, and why the history of OLAP architecture varied the span-of-join from many, to one, and now, to none.

Performance Target: Tuned for FTS

The goal of schema design for performance tuning is to establish physical structures that provide acceptable response time when it is most needed.  In the case of business intelligence that is when ad hoc reports are being iteratively crafted for weighing business strategy.  The approach is to build performant physical structures before users demand the data, in essence shifting the time cost from “read” time to “write” time to the extent possible for write-once read-many (WORM) applications.

The slowest component in a computer is its only moving part, the disk drive.  The goal of performance tuning is to maximize the use of RAM over disk, and if the amount of data exceeds RAM (which is usually the case for data warehouses) the goal is to minimize disk Physical Input/Output (PIO) time.  The insight here is to build structures based on the performance characteristics of random vs. sequential PIO.  Random reads are when the read/write head of the disk drive jumps back and forth seeking out data at random locations.  This happens when performing index access (for predicate or join).  99% of the time is wasted as the system waits for disk seek, rotation & transfer.  Sequential PIO is when the read/write head is positioned only once, and the disk blocks are read sequentially off the disk without moving the head to new locations.  This happens when doing Full Table Scans (FTS).  Per block delivered, sequential PIO can be roughly two orders of magnitude faster than random PIO: random PIO runs at about 300 reads per second while sequential PIO runs at about 200 MB per second.  In less time than it takes to read 2 random blocks (8K ea) you can read 128 contiguous blocks (1MB).  However, the 1 MB needs to contain useful data or else the time is wasted, for if only 8K of useful info is found then you would have been better off making a random read.  Optimally using sequential structures requires knowledge of the anticipated data and writing it in the order needed.  An example is the use of materialized views that pre-load a fully denormalized data structure ahead of its retrieval (at the cost of storage).

Access to a Storage Area Network (SAN) can also be a bottleneck, especially if you use time-stamped shadow blocks, as these can silently convert sequential PIO to random PIO.  The solution is to kill the SAN and go with hyperconverged hardware, which is what many cloud-based systems and all TPC-H benchmark winners do.  The next significant performance target is node-to-node network bandwidth, which can be reduced by replicating dimensions across nodes or correlating rows between tables using partition keys.

Normalized vs. Star: Span-of-Join

Normalized structures are best used for transaction processing where small amounts of random data (in KB, not MB) need to be written and read from the disk in about equal measure.  Normalization eliminates data redundancy and thus prevents update anomalies (redundant data falling out of sync).  It makes heavy use of indexes and nested loop index joins to navigate across tables.  Normalized structures are not recommended for reporting over a large volume of data because navigational access with a large span-of-joins would offer poor performance.

Relational online analytical processing (ROLAP) using row-based structures for reporting favors star schemas.  Their secret is the reduction of the span-of-join to 1.  One table is designated the fact table and all other tables are directly joined to it.  When queried the cost-based optimizer (CBO) loads the dimension tables via FTS into an in-memory structure arranged for a hash table join.  The fact table, which can be larger than working RAM, is then scanned using an FTS, filtering & aggregating results as it scans.  Once a row in the fact table is merged/summarized into the growing aggregate the data from the fact row is discarded while the hashed dimensions remain in memory.  Thus, every block of data is read off the disk only once in a single pass using sequential PIO.  Surprisingly, many do not know that this is the rationale behind star schemas.  If a dimension table becomes too large to fit in memory it must be accessed using the much slower nested-loop index join (with lots of random PIO) or the sort-merge join (with temporary intermediate materialized disk PIO).  Either way, the performance of the query suffers.
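A typical star query of this shape (a hypothetical sales star; the table names are assumptions) lets the CBO hash the small dimensions in memory and stream the fact table in one sequential pass:

```sql
--Hypothetical star query: DimDate & DimProduct are hashed in memory,
--FactSales is read once via full table scan, aggregating as it streams.
SELECT D.CalendarYear, P.Category, SUM(F.SalesAmount) AS Sales
FROM dbo.FactSales  AS F
JOIN dbo.DimDate    AS D ON D.DateKey    = F.DateKey
JOIN dbo.DimProduct AS P ON P.ProductKey = F.ProductKey
GROUP BY D.CalendarYear, P.Category;
```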

Business Intelligence Footprint

BI queries typically aggregate a large quantity of rows to produce a time-series view of events and their metrics.  Only a few columns are needed to filter, group by, and summarize the output in a graph.  Graphs limit the number of ‘group by’ columns to around 5 or fewer (based on # of axes, colors, marks, size, etc.) and the amount of aggregated data points to that which can be perceived.

Enter Columnar Technology

Row-stores favor I/O requiring a few rows and many columns.  Column-stores favor queries that require many rows and few columns; exactly the type of activity used in BI dashboards.  Instead of storing a few rows with all their columns in a disk block a column-store will store only the values of one column in a block.  This eliminates the wasteful reading in of columns that are ignored.

Various extreme data compression techniques are applied depending on the column datatype.  For example, a string datatype may apply global and local dictionary compression with tokenization.  Run Length Encoding (RLE) is further applied to compress redundant values and differential encoding is applied to similar values.  Query filtering is applied by matching the compressed values, without the need to decompress, and a content min/max zone map performs up-front block elimination.  Query predicates are scanned into memory pages of bitmap vectors that are Boolean merged using GPU instructions, permitting “late materialization” of the selected array index location (i.e. row #).  None of these performance tricks are possible in a row-store.

ACM A.M. Turing Award winner Michael Stonebraker famously predicted “The only successful ‘Big Data-Small Analytics’ architecture will be a column executor on shared-nothing hardware.  All the successful vendors will have to get there sooner or later.”  He was proven right as all TPC-H benchmark winners today are columnar.

Join Elimination

The cost of doing joins can be significant.  That is why many relational databases use materialized views to tune their performance with denormalized physical structures.  The columnar database Vertica automatically denormalizes its physical structures.  Their whitepaper states the following:

So what about optimizing JOINs in a columnar database? What optimizations exist for a columnar store to do JOINs? Normalized database design often uses a star or snowflake schema model, comprising multiple large fact tables and many smaller dimension tables. Queries typically involve joins between a large fact table and multiple dimension tables. These queries can incur significant overhead, especially as the analytical needs of the organization change but the database design remains the same.

Moreover, it’s common for database schemas to need to be changed. This becomes clear when you ask yourself a simple question—are you running the same reports this year as you were just one year ago? What about two years ago? Schemas are designed to be optimized for some form of reporting, but the data changes, the needs of analysts change and your business changes. When this happens, you can choose to refactor your database, or you can use a feature like flattened tables.

Vertica flattened tables are denormalized tables that get their values by querying other tables. To the analyst, it appears that all of the columns needed are in one table. What is actually happening is that Vertica can create a copy of the data in a flattened schema and still be very efficient in storage requirement. Again, using the efficiencies of compression, late stage materialization and predicate push-down, these denormalized tables have two interesting benefits: 1) It makes it easier for the business analyst to write simple queries instead of JOINs, and; 2) performance is faster than a normalized JOIN in many cases.

Eventually, [our store] was replaced by the notion of a super projection that contains all the columns.

One way of picturing a denormalized columnar DB is that physically it actually is a star schema with every string column in its own dimension table (the compression dictionary) with all redundant values eliminated.

Join Complexity

Denormalization, unionization & columnization make the best use of vector bitmap tables for high-performing aggregated queries that combine facts along a time-series or other shared axis.  Using a star schema with multiple columnstore facts & dimensions hinders the use of late-materialization vectorization and drags two more columns (the FK & PK) into the query plan.

Denormalization also eliminates joins to large high-cardinality dimension tables, which significantly hurt performance because they do not lend themselves to hash joins.  Costly complex theta (inequality) joins and calculated joins are also resolved at load time.  Joins that use COALESCE on multiple foreign keys to compute defaults and overrides are also resolved at load time.  A star schema would have a difficult time handling these and would have to resolve them with complex custom surrogate-key generation.
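The default-and-override join mentioned above might look like this at load time (all table and column names here are hypothetical):

```sql
--Hypothetical load-time default/override resolution: prefer the
--customer-specific price list, else fall back to the region default.
SELECT O.OrderID
     , COALESCE(C.PriceListID, R.DefaultPriceListID) AS PriceListID
FROM dbo.Orders    AS O
JOIN dbo.Customers AS C ON C.CustomerID = O.CustomerID
JOIN dbo.Regions   AS R ON R.RegionID   = C.RegionID;
```

Resolving this once during the load means no BI user ever has to reproduce the COALESCE logic in a report.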

Ease of Use

Denormalization to one table improves the user’s ability to extract a wide array of meaningful data, with improved performance.  The white paper “Expediting analytical databases with columnar approach” by Nenad Jukic, Boris Jukic, Abhishek Sharma, Svetlozar Nestorov, Benjamin Korallus Arnold states:

This method improves the feasibility and quality of tactical decision making by making critical information more readily available. It also improves the quality of longer term strategic decision making by widening the range of feasible queries against the vast amounts of available information. The advantages include the improvements in the performance of the ETL process (the most common time-consuming bottleneck in most implementations of data warehousing for quality decision support) and in the performance of the individual analytical queries.

Providing a single table makes it much easier for BI users to find & get what they need.  Users do not need to pick out which fact table data source to use. They can pick columns to show or filter and it will automatically guide them to appropriate transactions. 

Because the OBT approach results in a large number of sparse columns a "generalization layer" should be defined in the BI data source.  Columns should be organized for easy navigation & access using defined hierarchical "Folders" in PowerBI's Fields blade (see picture).  Tableau also has this capability.

BI filter checkboxes are super-fast as the list of possible values is pulled from the dictionary.  Queries are faster too because all joins are pre-computed.

BI charts frequently combine facts along a common time-series or some other axis in a combo chart.  With the star schema approach a PowerBI data source is often defined for each fact table.  If users need to make a chart that combines facts they are forced to create new special-purpose data sources from scratch and to resolve the complicated issues of merging fact tables and columns and creating joins with many-to-many relationships.  PowerBI allows you to define a data source that has more than one fact table.  As long as the user selects columns from only one fact table they have access to all its columns and those of its related dimensions.  If 2 fact tables are selected then their measures and shared dimensions may be used and all measures are aggregated along the shared dimensions; however, no attribute of either fact table or any unshared dimension may be used, as it would cause a cartesian product.  In contrast, with the one-table approach users do not need to struggle with any of this.

Why Union the Facts?

The greatest value of BI analytics is derived from discovering correlations between different facts (e.g. pricing to churn).  Unfortunately, a star schema solution would entail not one star schema but many.  Direct joins between star schema fact tables are not permitted because they would result in a many-to-many cartesian join.  The solution is to use either of the following strategies: (A) separately aggregate each fact table and then join on common ‘group by’ columns or shared dimensions, or (B) ‘union’ the fact tables together (aligning columns) and then perform the aggregation.
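The two strategies can be sketched in SQL as follows, assuming two hypothetical fact tables, FactPricing and FactChurn, that share a CustomerID:

```sql
--Strategy A (hypothetical): aggregate each fact separately, then join on the shared key.
SELECT P.CustomerID, P.AvgPrice, C.ChurnCount
FROM (SELECT CustomerID, AVG(Price) AS AvgPrice   FROM dbo.FactPricing GROUP BY CustomerID) AS P
JOIN (SELECT CustomerID, COUNT(*)   AS ChurnCount FROM dbo.FactChurn   GROUP BY CustomerID) AS C
  ON C.CustomerID = P.CustomerID;

--Strategy B (hypothetical): union the facts with aligned columns, then aggregate.
--AVG ignores the NULL Price on churn rows; COUNT(ChurnID) ignores pricing rows.
SELECT CustomerID, AVG(Price) AS AvgPrice, COUNT(ChurnID) AS ChurnCount
FROM (
    SELECT 'Pricing' AS fact_type, CustomerID, Price, CAST(NULL AS int) AS ChurnID
    FROM dbo.FactPricing
    UNION ALL
    SELECT 'Churn', CustomerID, NULL, ChurnID
    FROM dbo.FactChurn
) AS U
GROUP BY CustomerID;
```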

Our solution is to union the facts into one table, aligning all columns. A “fact_type” column is added so the user can filter on specific facts.  Separate date columns are included, but one date column from each transaction is replicated to an additional “Transaction Date” column so events that occur on the same date can be shown.

By denormalizing to a single table the source structures are eliminated and become of no concern to the report user.  They can now just focus on the columns needed.

If the user does not know which transactions to use, they can select the columns of interest and filter out NULL values.  This will automatically narrow the results to the facts that carry those columns.  The user can then pick among this subset of fact_types.  The user is free to allow multiple fact_types, and these can be shown separately along with the fact_type value, or the fact_type can be removed from the display and the metrics aggregated together along shared dimension attributes.  If the relationship between the facts is n:m then showing the detail level will show the n and m records on separate rows.  You can also aggregate on an unshared attribute, and the output will show the aggregated measure beside each attribute value, plus one row with NULL for the attribute holding the other fact_type’s attributes and measures.  This approach to storing the data allows many difficult scenarios to be handled easily, without the need of defining new certified data sources.

In the simplified example below fact_type “A” has metrics on Color, fact_type “B” has metrics on Person, fact_type “C” has metrics on both Color & Person, and fact_type “D” has metrics on Color, Person & Skill.  When you aggregate on Color & Person you get the following output:

FACT_TYPE  Color   Person  A_Metric  B_Metric  C_Metric  D_Metric
A          blue            5
A          green           3
A          red             5
A          orange          2
A          yellow          1
B                  Bob               65
B                  Mary              45
B                  Joe               23
C&D        blue    Bob                         11        547
C&D        blue    Mary                        21        271
C&D        green   Mary                        43        828
C&D        green   Joe                         78        459
C&D        yellow  Bob                         67        451

Note that this allows the display of complex n:m relationships without issue.  The Color & Person columns are aligned across every fact_type and are NULL when unused.  Each row of output may aggregate many detail records.  Additional shared & unshared attributes may be added, which could increase the number of aggregated output rows.  Fact_types C & D have their metrics combined because Skill is not used as a grouping attribute in this example, causing the metrics of D to be rolled up to the level of C.  The rows could be sorted in a number of different ways and subtotaling may be added.

A star schema would find it very difficult to show data at different levels of details as done in the above example.

Load Time

Load time is slower with the denormalized columnar approach than with the star schema approach.  This is by design: the time cost has been shifted from “read” time to “write” time.

Development & Maintenance Time

If you know your source system OLTP schema it is easy to build an OBT solution:

  1. Create a target table that has all the desired columns.
  2. Pick each of the lowest level transaction tables needed.
  3. For each transaction table code a SELECT that recursively joins all their parent tables.
  4. Convert each SELECT into an INSERT target_table (columns…) SELECT columns… FROM … .  Note: this automatically loads NULL into the unused columns.
  5. Put each INSERT into a stored procedure and execute it.  (The multiple INSERTs acts like a UNION ALL.)
  6. If needed add a VIEW or Data Source definition that gives the columns understandable names.
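Steps 3–5 above might look like this for one hypothetical transaction table (all names are assumptions):

```sql
--Hypothetical sketch of steps 3-5: one INSERT...SELECT per transaction table.
--Target columns not listed (the other facts' metrics) default to NULL,
--so the multiple INSERTs behave like a UNION ALL.
INSERT INTO DW.OneBigTable (fact_type, TransactionDate, CustomerName, RegionName, SalesAmount)
SELECT 'Sales', S.SaleDate, C.CustomerName, R.RegionName, S.SalesAmount
FROM oltp.Sales     AS S
JOIN oltp.Customers AS C ON C.CustomerID = S.CustomerID   --parent
JOIN oltp.Regions   AS R ON R.RegionID   = C.RegionID;    --parent of parent
```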

Because you included every parent table your solution should be comprehensive and complete.  By recursively following every join path the data mart has minimal data loss, reducing maintenance requests that arise when data is later found to be missing.

If you have all referential constraints defined in the database, then it is possible to generate all the code automatically from a few metadata tables.  Auto-generation of code saves hundreds of hours, making it possible to deliver the data mart in one day.  A stored procedure can be created that automatically generates the stored procedure that generates the OBT table, view & code from the metadata tables.  Maintenance of the metadata tables is easy to understand and the entire stored procedure can be regenerated in under 10 minutes, eliminating code maintenance.  A later post will show the code for this.  Another post will also show how to capture volumetrics on the tables and profile the columns so you can tell at a glance which tables & columns are data-rich.

A star schema created by hand could not be delivered as quickly.

Real World Example

At one company I loaded the following data warehouse into AWS Redshift, which is an MPP columnar DBMS.  It had 13 fact tables with 35 billion rows, shown in the middle of the diagram.  The top of the diagram shows the conformed shared dimensions, and the unshared dimensions are at the bottom of the ER diagram; 25 dimensions in all.


For comparison, this was loaded into a single flat table by joining every fact to its dimensions and then merging the joined facts by UNION ALL, aligning the dimensional columns.  Tests were then done to see which performed better, the star or the OBT, and it was found that the single table solution was fastest under every type of BI query.  It was then given to the Tableau BI users and they found the one-table solution to be much easier to use than the star schemas, especially for strategic queries that span multiple facts.

The solution went into production and over the following 5 years it was found to be easy to maintain.  Over time, the system grew to hold 40 billion rows and 200 sparse columns and to ingest 21 facts and 36 dimensions while still offering exceptional performance.

Conclusion

The fully denormalized columnar solution is complete and comprehensive, and provides better performance.  It is easily maintained through metadata tables.  End users will find that it satisfies the widest possible set of queries and provides the greatest insight & guidance when used in an ad hoc manner for strategic goals.

If you too have tried this approach I'd love to hear about your experience.  It seems no one else has written about this.