Monday, September 30, 2013

Adding Min and Max Date Functionality to a Snapshot Fact Table

     When generating periodic snapshots of your data, analysts may want to view that data in ways that may be different than the time frame of capture. For example, you could take snapshots of your company's parts inventory several times a day, once a day, several times a month, once a month, once a quarter, etc. But some analysts may want to report on this data by day, by month, by quarter, by year and by either the state of the inventory by the beginning of the time period or by the end of that period. How do we model this data to allow for this kind of analysis?

     For this example, let's expand upon the company parts and imagine we're dealing with a piece of data that looks like the following and the snapshot schedule is daily:


Figure 1. Source System Inventory Schema


This schema represents the parts inventory quantities, grouped by departments, for a company.  Whenever the data warehouse snapshots the data from this schema a date and time of the snapshot is captured so we can deduce what the inventory state was at that time. 

     In order to fulfill the requirement to report on these snapshots on different time groups, we need a way to categorize this data as such. Now with every data warehouse there will be a date dimension at the level of granularity of the day, in which we could place min/max day/month/quarter/year flags that can help us with this problem. But as you may soon discover, these flags would only pertain to inventory exclusively and if you wanted to use this functionality with other snapshots you would have to add flags specific to that piece of data as well. In addition to this, we may not run this table ETL every day, or if we have an ETL job failure we may miss a day's snapshot. Thus a failure on Sept. 30 means that the Sept 29th snapshot is the maximum one captured for that month. With all of these scenarios, this solution would become untenable very fast. The best way to attack this problem, and provide a re-usable solution, would be to separate these flags out into their own junk dimension

     First we need to create this dimension, then determine how we're going to integrate it with the ETL of the snapshot fact table. Our goal is a table that looks like this:


Figure 2. Junk Dimension with Time Flags

-->Create dimension to hold our date categories
CREATE  TABLE  DIM_TIME_FLAG(
    FLAG_KEY        int        NOT NULL,
    MIN_MONTH_FLG   char(1)    NULL,
    MAX_MONTH_FLG   char(1)    NULL,
    MIN_QUARTER_FLG char(1)    NULL,
    MAX_QUARTER_FLG char(1)    NULL,
    MIN_YEAR_FLG    char(1)    NULL,
    MAX_YEAR_FLG    char(1)    NULL,
    CONSTRAINT PK_DIM_TIME_FLAG PRIMARY KEY CLUSTERED (FLAG_KEY)
)

 -->Hold our Y/N enum
DECLARE @enum TABLE
  (
       field char(1)
  )

INSERT INTO @enum(field)values('Y'),('N')

-->Temp table to contain cartesian of Y/N for all fields
DECLARE @final TABLE
 (
        flag_key int identity(1,1),
        field1   char(1),
        field2   char(1),
        field3   char(1),
        field4   char(1),
        field5   char(1),
        field6   char(1)
 )
-->Generate cartesian
INSERT INTO @final
SELECT
        a.field,
        B.field,
        c.field,
        d.field,
        e.field,
        f.field
FROM
        @enum a,
        @enum b,
        @enum c,
        @enum d,
        @enum e,
        @enum f

-->Populate DIM_TIME_FLAG
INSERT INTO DIM_TIME_FLAG
SELECT
       0,
       null,
       null,
       null,
       null,
       null,
       null
UNION
SELECT
       flag_key,
       field1,
       field2,
       field3,
       field4,
       field5,
       field6 
FROM
       @final

-->View Table Rows
SELECT
     FLAG_KEY,
     MIN_MONTH_FLG,
     MAX_MONTH_FLG,
     MIN_QUARTER_FLG,
     MAX_QUARTER_FLG,
     MIN_YEAR_FLG,
     MAX_YEAR_FLG   
FROM
     DIM_TIME_FLAG

We use this table to determine if the record in the snapshot fact is the first/last for the month recorded/quarter recorded/year recorded. The MIN/MAX flags will have values of Y/N and the reporting solution can filter on these flags when trying to report on the various times. For our daily snapshot of parts inventory, we could add this table to a data warehouse schema like this:

Figure 3. Parts Inventory Data Warehouse Schema

     Now, we need to figure out how to get DIM_TIME_FLAG integrated into this schema in our ETL plan. Key thing to keep in mind is that these flags will change as more and more snapshots are taken from the source system. Today's records that have MAX_MONTH_FLG=Y may need to be set to N the following day. After populating the other columns in the table you would have to execute this SQL statement to update the FLAG_KEY:



UPDATE A
SET
       A.FLAG_KEY=Z.FLAG_KEY
FROM
       FCT_PART_INVENTORY A
       INNER JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY
       LEFT JOIN
(
SELECT
       MIN(B.FULL_DATE) AS MIN_MONTH
FROM
       FCT_PART_INVENTORY A
       INNER  JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY
GROUP BY
       B.YEARMO
)MIN_MONTH ON B.FULL_DATE=MIN_MONTH.MIN_MONTH
       LEFT JOIN
(
SELECT
       MAX(B.FULL_DATE) AS MAX_MONTH
FROM
       FCT_PART_INVENTORY A
       INNER  JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY
GROUP BY B.YEARMO
)MAX_MONTH ON B.FULL_DATE=MAX_MONTH.MAX_MONTH
       LEFT JOIN
(
SELECT
       MIN(B.FULL_DATE) AS MIN_QUARTER
FROM
       FCT_PART_INVENTORY A
       INNER  JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY
GROUP BY
       CONVERT(VARCHAR,B.QUARTER)+CONVERT(VARCHAR,B.YEAR)
)MIN_QUARTER ON B.FULL_DATE=MIN_QUARTER.MIN_QUARTER
       LEFT JOIN
(
SELECT
       MAX(B.FULL_DATE) AS MAX_QUARTER
FROM
       FCT_PART_INVENTORY A
       INNER  JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY
GROUP BY
       CONVERT(VARCHAR,B.QUARTER)+CONVERT(VARCHAR,B.YEAR)
)MAX_QUARTER ON B.FULL_DATE=MAX_QUARTER.MAX_QUARTER
       LEFT JOIN
(
SELECT
       MIN(B.FULL_DATE) AS MIN_YEAR
FROM
       FCT_PART_INVENTORY A
       INNER  JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY
GROUP BY
       B.YEAR
)MIN_YEAR ON B.FULL_DATE=MIN_YEAR.MIN_YEAR
LEFT JOIN
(
SELECT
       MAX(B.FULL_DATE) AS MAX_YEAR
FROM
       FCT_PART_INVENTORY A
       INNER  JOIN DIM_DAY B ON A.DAY_KEY=B.DAY_KEY
GROUP BY B.YEAR
)MAX_YEAR ON B.FULL_DATE=MAX_YEAR.MAX_YEAR
LEFT JOIN DIM_TIME_FLAG Z
ON CASE WHEN MIN_MONTH.MIN_MONTH IS NULL THEN 'N' ELSE 'Y' END =Z.MIN_MONTH_FLG
AND CASE WHEN MAX_MONTH.MAX_MONTH IS NULL THEN 'N' ELSE 'Y' END =Z.MAX_MONTH_FLG
AND CASE WHEN MIN_QUARTER.MIN_QUARTER IS NULL THEN 'N' ELSE 'Y' END =Z.MIN_QUARTER_FLG
AND CASE WHEN MAX_QUARTER.MAX_QUARTER IS NULL THEN 'N' ELSE 'Y' END =Z.MAX_QUARTER_FLG
AND CASE WHEN MIN_YEAR.MIN_YEAR IS NULL THEN 'N' ELSE 'Y' END =Z.MIN_YEAR_FLG

AND CASE WHEN MAX_YEAR.MAX_YEAR IS NULL THEN 'N' ELSE 'Y' END =Z.MAX_YEAR_FLG

     With this table now integrated into this schema, we can not only use it at the level of day, but now month, quarter and year all using the same fact. For example, if we wanted to view data across days we can ignore the DIM_TIME_FLAG and simply filter our query across DIM_DAY and report on something that looks like this:

Figure 4. Daily Parts Inventory

     If we need to report on trending across months we could query the exact same snapshot fact, but this time join to DIM_TIME_FLAG where MIN_MONTH_FLG or MAX_MONTH_FLG=Y (depending of if you want to trend over months based on what the inventory looked like at the beginning of the month or end of the month). This would produce a report that would look something like this:

Figure 5. Monthly Parts Inventory

     The same could be done by quarter, as well, using the quarter flags:

Figure 6. Quarterly Parts Inventory

     And finally by year:

Figure 7. Yearly Parts Inventory

     This design can come in very handy, especially when building dashboards and executives want to be able to trend across the different time frames. It also negates the need to create separate snapshot tables that capture this data for each of these time frames(i.e. a snapshot for months, a snapshot for quarters, etc.).