Dennis and Jim's Data Engineering Blog: Adding Min and Max Date Functionality to a Snapshot Fact Table

When generating periodic snapshots of your data, analysts may want to view that data in ways that may be different than the time frame of capture. For example, you could take snapshots of your company's parts inventory several times a day, once a day, several times a month, once a month, once a quarter, etc. But some analysts may want to report on this data by day, by month, by quarter, by year and by either the state of the inventory by the beginning of the time period or by the end of that period. How do we model this data to allow for this kind of analysis?

For this example, let's expand upon the company parts and imagine we're dealing with a piece of data that looks like the following and the snapshot schedule is daily:

Figure 1. Source System Inventory Schema

This schema represents the parts inventory quantities, grouped by departments, for a company. Whenever the data warehouse snapshots the data from this schema a date and time of the snapshot is captured so we can deduce what the inventory state was at that time.

In order to fulfill the requirement to report on these snapshots on different time groups, we need a way to categorize this data as such. Now with every data warehouse there will be a date dimension at the level of granularity of the day, in which we could place min/max day/month/quarter/year flags that can help us with this problem. But as you may soon discover, these flags would only pertain to inventory exclusively and if you wanted to use this functionality with other snapshots you would have to add flags specific to that piece of data as well. In addition to this, we may not run this table ETL every day, or if we have an ETL job failure we may miss a day's snapshot. Thus a failure on Sept. 30 means that the Sept 29th snapshot is the maximum one captured for that month. With all of these scenarios, this solution would become untenable very fast. The best way to attack this problem, and provide a re-usable solution, would be to separate these flags out into their own junk dimension.

First we need to create this dimension, then determine how we're going to integrate it with the ETL of the snapshot fact table. Our goal is a table that looks like this:

Figure 2. Junk Dimension with Time Flags

-->Create dimension to hold our date categories

CREATE TABLE DIM_TIME_FLAG(

FLAG_KEY int NOT NULL,

MIN_MONTH_FLG char(1) NULL,

MAX_MONTH_FLG char(1) NULL,

MIN_QUARTER_FLG char(1) NULL,

MAX_QUARTER_FLG char(1) NULL,

MIN_YEAR_FLG char(1) NULL,

MAX_YEAR_FLG char(1) NULL,

CONSTRAINT PK_DIM_TIME_FLAG PRIMARY KEY CLUSTERED (FLAG_KEY)

)

-->Hold our Y/N enum

DECLARE @enum TABLE

(

field char(1)

)

INSERT INTO @enum(field)values('Y'),('N')

-->Temp table to contain cartesian of Y/N for all fields

DECLARE @final TABLE

(

flag_key int identity(1,1),

field1 char(1),

field2 char(1),

field3 char(1),

field4 char(1),

field5 char(1),

field6 char(1)

)

-->Generate cartesian

INSERT INTO @final

SELECT

a.field,

B.field,

c.field,

d.field,

e.field,

f.field

FROM

@enum a,

@enum b,

@enum c,

@enum d,

@enum e,

@enum f

-->Populate DIM_TIME_FLAG

INSERT INTO DIM_TIME_FLAG

SELECT

null,

null

UNION

SELECT

flag_key,

field1,

field2,

field3,

field4,

field5,

field6

FROM

@final

-->View Table Rows

SELECT

FLAG_KEY,

MIN_MONTH_FLG,

MAX_MONTH_FLG,

MIN_QUARTER_FLG,

MAX_QUARTER_FLG,

MIN_YEAR_FLG,

MAX_YEAR_FLG

FROM

DIM_TIME_FLAG

We use this table to determine if the record in the snapshot fact is the first/last for the month recorded/quarter recorded/year recorded. The MIN/MAX flags will have values of Y/N and the reporting solution can filter on these flags when trying to report on the various times. For our daily snapshot of parts inventory, we could add this table to a data warehouse schema like this:

Figure 3. Parts Inventory Data Warehouse Schema

Now, we need to figure out how to get DIM_TIME_FLAG integrated into this schema in our ETL plan. Key thing to keep in mind is that these flags will change as more and more snapshots are taken from the source system. Today's records that have MAX_MONTH_FLG=Y may need to be set to N the following day. After populating the other columns in the table you would have to execute this SQL statement to update the FLAG_KEY:

UPDATE A

SET

A.FLAG_KEY=Z.FLAG_KEY

FROM

FCT_PART_INVENTORY A

INNER JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY

LEFT JOIN

(

SELECT

MIN(B.FULL_DATE) AS MIN_MONTH

FROM

FCT_PART_INVENTORY A

INNER JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY

GROUP BY

B.YEARMO

)MIN_MONTH ON B.FULL_DATE=MIN_MONTH.MIN_MONTH

LEFT JOIN

(

SELECT

MAX(B.FULL_DATE) AS MAX_MONTH

FROM

FCT_PART_INVENTORY A

INNER JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY

GROUP BY B.YEARMO

)MAX_MONTH ON B.FULL_DATE=MAX_MONTH.MAX_MONTH

LEFT JOIN

(

SELECT

MIN(B.FULL_DATE) AS MIN_QUARTER

FROM

FCT_PART_INVENTORY A

INNER JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY

GROUP BY

CONVERT(VARCHAR,B.QUARTER)+CONVERT(VARCHAR,B.YEAR)

)MIN_QUARTER ON B.FULL_DATE=MIN_QUARTER.MIN_QUARTER

LEFT JOIN

(

SELECT

MAX(B.FULL_DATE) AS MAX_QUARTER

FROM

FCT_PART_INVENTORY A

INNER JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY

GROUP BY

CONVERT(VARCHAR,B.QUARTER)+CONVERT(VARCHAR,B.YEAR)

)MAX_QUARTER ON B.FULL_DATE=MAX_QUARTER.MAX_QUARTER

LEFT JOIN

(

SELECT

MIN(B.FULL_DATE) AS MIN_YEAR

FROM

FCT_PART_INVENTORY A

INNER JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY

GROUP BY

B.YEAR

)MIN_YEAR ON B.FULL_DATE=MIN_YEAR.MIN_YEAR

LEFT JOIN

(

SELECT

MAX(B.FULL_DATE) AS MAX_YEAR

FROM

FCT_PART_INVENTORY A

INNER JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY

GROUP BY B.YEAR

)MAX_YEAR ON B.FULL_DATE=MAX_YEAR.MAX_YEAR

LEFT JOIN DIM_TIME_FLAG Z

ON CASE WHEN MIN_MONTH.MIN_MONTH IS NULL THEN 'N' ELSE 'Y' END =Z.MIN_MONTH_FLG

AND CASE WHEN MAX_MONTH.MAX_MONTH IS NULL THEN 'N' ELSE 'Y' END =Z.MAX_MONTH_FLG

AND CASE WHEN MIN_QUARTER.MIN_QUARTER IS NULL THEN 'N' ELSE 'Y' END =Z.MIN_QUARTER_FLG

AND CASE WHEN MAX_QUARTER.MAX_QUARTER IS NULL THEN 'N' ELSE 'Y' END =Z.MAX_QUARTER_FLG

AND CASE WHEN MIN_YEAR.MIN_YEAR IS NULL THEN 'N' ELSE 'Y' END =Z.MIN_YEAR_FLG

AND CASE WHEN MAX_YEAR.MAX_YEAR IS NULL THEN 'N' ELSE 'Y' END =Z.MAX_YEAR_FLG

With this table now integrated into this schema, we can not only use it at the level of day, but now month, quarter and year all using the same fact. For example, if we wanted to view data across days we can ignore the DIM_TIME_FLAG and simply filter our query across DIM_DAY and report on something that looks like this:

Figure 4. Daily Parts Inventory

If we need to report on trending across months we could query the exact same snapshot fact, but this time join to DIM_TIME_FLAG where MIN_MONTH_FLG or MAX_MONTH_FLG=Y (depending of if you want to trend over months based on what the inventory looked like at the beginning of the month or end of the month). This would produce a report that would look something like this:

Figure 5. Monthly Parts Inventory

The same could be done by quarter, as well, using the quarter flags:

Figure 6. Quarterly Parts Inventory

And finally by year:

Figure 7. Yearly Parts Inventory

This design can come in very handy, especially when building dashboards and executives want to be able to trend across the different time frames. It also negates the need to create separate snapshot tables that capture this data for each of these time frames(i.e. a snapshot for months, a snapshot for quarters, etc.).

Dennis and Jim's Data Engineering Blog

Monday, September 30, 2013

Adding Min and Max Date Functionality to a Snapshot Fact Table

No comments:

Post a Comment