Friday, November 22, 2013

Designing and Populating a Mini-Dimension

     Dealing with master data can be a headache.....a big one. One of these headaches is when large dimensions (millions of rows) contains rapidly changing attributes, or what Kimball refers to as monster dimensions. Now to solve this problem you could go with a traditional SCD type 2, and simply insert a new record every time a change occurs. But due to the rate of change this just doesn't seem tenable. If you had 2 million master data records in your dimension, and even if they only change once a year, that's a million new records in that table every year. Imagine if it was every month, week, day? 

     Luckily, to address this issue we can take advantage of a modeling trick called a mini-dimension. With a mini-dimension, we separate the rapidly changing attributes from the dimension and add them to their own dimension. This eliminates the need to add more records to the dimension every time a change takes place. In this post I'll be describing how to implement a SCD type 4 in your schema, which takes advantage of a mini-dimension.

     For this example, lets use a source system schema such as this:

Figure 1. Source System Schema
     This represents a service transaction on a product, for example's sake it doesn't matter what kind of service, just that a transaction at a point in time involved a customer's product. We have the product being serviced, the customer who owns the product, the employee who serviced the product and when the service took place.  Now, deriving a dimensional schema from this appears rather simple, and would probably look something like this:

Figure 2. Dimensional Schema

     It's simple and easy to understand, reporting out of this schema should not be a problem. However; what if the version of the product changes very often, impacting the product level and service level attributes, and the number of installed products goes up into the millions? Implementing an SCD type 2 probably wouldn't be a good choice. For this type of situation, were going to pull the rapidly changing attributes out of DIM_INSTALLED_PRODUCTS and put them into their own dimension:

Figure 3. Dimensional Schema with Mini-Dimension

     With the addition of DIM_PRODUCT_VERSION we can see what the product version, product level and service level were at the time of the transaction as well as what they are now (make those attributes SCD type 1 in DIM_INSTALLED_PRODUCTS).

     Now that we have this modeled out we need to figure out how to populate this new dimension and link it to FCT_SERVICE.  Let's generate some sample data, stage it into a persistent staging database (integration layer in your architecture)and then populate the dimension. 

-->Source System Table
CREATE TABLE INSTALLED_PRODUCTS
(
       PRODUCT_ID INT NOT NULL,
       NAME VARCHAR(20),
       PRODUCT_LEVEL INT,
       SERVICE_LEVEL INT,
       VERSION VARCHAR(10),
       INSTALL_DATE DATETIME,
       PRODUCT_GROUP VARCHAR(25),
       SERIAL_NUMBER VARCHAR(10),
       CUSTOMER_ID INT ,
       CONSTRAINT PK_INSTALLED_PRODUCTS PRIMARY KEY(PRODUCT_ID)
)
GO

-->Sample Data
INSERT INTO
       INSTALLED_PRODUCTS
       (PRODUCT_ID,NAME,PRODUCT_LEVEL,SERVICE_LEVEL,VERSION,INSTALL_DATE,PRODUCT_GROUP,SERIAL_NUMBER,CUSTOMER_ID)
VALUES
       (1, 'PRODUCT A', 1, 2, 'V1.1', '09/01/2013', 'HIGH END', '12345GR4', 1),
       (2, 'PRODUCT B', 5, 1, 'V2.1', '11/14/2010', 'LOW END', '7555655YY8', 4),
       (3, 'PRODUCT C', 7, 5, 'V3.5', '02/17/2008', 'HIGH END', '43789HDKH', 8),
       (4, 'PRODUCT D', 2, 5, 'V7.2', '05/22/2012', 'LOW END', '2956DHD55', 9)
GO

When adding this data to a persistent staging database table, let's call it P_STG_INSTALLED_PRODUCTS, it may end up looking something like this:

Figure 4. Data in Persistent Staging
     Here we inserted all the data from the source system, marked the date it was valid from, set the valid to date to some time way out in the future, and generated hash values for each record. Say 10 days later a change happens to PRODUCT A, we compare the hash of the data coming in from staging with the hash existing in persistent staging, determine if they are different and if so insert a new record:

Figure 5. Record Change in Persistent Staging
     When inserting this new record, we need to be sure to update the previous version of this record to have the VALID_TO date set the end of the day right before the change. With some sample data generated, lets populate our mini-dimension. To populate this dimension, were going to treat it just like a junk dimension  by just inserting unique combinations of these attributes and only combinations that exist in the source system table.

SELECT DISTINCT
       PRODUCT_LEVEL,
       SERVICE_LEVEL,
       VERSION
FROM
       P_STG_INSTALLED_PRODUCTS P_STG
WHERE
    NOT EXISTS(
                 SELECT
                        1
                 FROM
                        DIM_PRODUCT_VERSION DPV
                 WHERE
                        P_STG.PRODUCT_LEVEL=DPV.PRODUCT_LEVEL
                        AND P_STG.SERVICE_LEVEL=DPV.SERVICE_LEVEL
                        AND P_STG.VERSION=DPV.VERSION
              )

     This query, along with generating a surrogate key for the dimension, can be used to populate our DIM_PRODUCT_VERSION mini-dimension. Next, we need to figure out how to incorporate this mini-dimension into the loading of the FCT_SERVICE table. Lets create some sample data to work with for this example.

-->Source System Table
CREATE TABLE SERVICE
(
       SERVICE_ID INT NOT NULL,
       SERVICE_OPEN_DATE DATETIME NOT NULL,
       SERVICE_CLOSE_DATE DATETIME,
       PRODUCT_ID INT NOT NULL,
       CUSTOMER_ID INT NOT NULL,
       EMPLOYEE_ID INT NOT NULL,
       CONSTRAINT PK_SERVICE PRIMARY KEY(SERVICE_ID),
       CONSTRAINT FK_SERVICE_PRODUCT_ID  FOREIGN KEY  (PRODUCT_ID)  REFERENCES  INSTALLED_PRODUCTS(PRODUCT_ID),
       CONSTRAINT FK_SERVICE_CUSTOMER_ID FOREIGN KEY  (CUSTOMER_ID) REFERENCES  CUSTOMER(CUSTOMER_ID),
       CONSTRAINT FK_SERVICE_EMPLOYEE_ID FOREIGN KEY  (EMPLOYEE_ID) REFERENCES  EMPLOYEE(EMPLOYEE_ID)
)
GO

-->Sample Data
INSERT INTO
       SERVICE
       (SERVICE_ID,SERVICE_OPEN_DATE,SERVICE_CLOSE_DATE,PRODUCT_ID,CUSTOMER_ID,EMPLOYEE_ID)
VALUES
       (1, '11/25/2013 10:22:33.324','11/25/2013 11:45:22.432', 1, 1, 7),
       (2, '11/26/2013 16:37:21.876','11/27/2013 10:21:54.765', 2, 4, 9),
       (3, '11/28/2013 19:29:48.768','11/28/2013 22:25:12.472', 3, 8, 7),
       (4, '12/05/2013 05:09:12.354','12/05/2013 07:16:08.009', 4, 9, 7),
       (5, '12/11/2013 07:12:32.312','12/12/2013 23:25:14.764', 1, 1, 7)
      
GO

     This data would eventually be inserted into a persistent staging table just like we did for INSTALLED_PRODUCTS, with P_STG_INSTALLED_PRODUCTS, lets call this table P_STG_SERVICE. This will be the primary table we will look at when trying to generate the data for our FCT_SERVICE table. You'll notice that from the data in SERVICE, that there are 2 transactions for PRODUCT_A. One that occurred before the row in INSTALLED_PRODUCTS was updated and one after. So were going to want to join to DIM_PRODUCT_VERSION on values that matched the values in INSTALLED_PRODUCTS at the time of the transaction for that product in SERVICE. This can be achieved, by populating FCT_SERVICE, in a query like this:


SELECT
  P_SRV.SERVICE_ID,
  CUST.CUSTOMER_KEY,
  PROD.PRODUCT_KEY,
  EMP.EMPLOYEE_KEY,
  OPEND.DAY_KEY AS SERVICE_OPEN_DAY_KEY,
  ISNULL(CLOSED.DAY_KEY,0) AS SERVICE_CLOSE_DAY_KEY,
  PVR.PRODUCT_VERSION_KEY,
  CASE WHEN P_SRV.SERVICE_CLOSE_DATE IS NOT NULL
       THEN DATEDIFF(MI,P_SRV.SERVICE_OPEN_DATE, P_SRV.SERVICE_CLOSE_DATE)
       ELSE NULL
       END AS MINUTES_TO_CLOSE
FROM
  P_STG_SERVICE P_SRV
  INNER JOIN DIM_CUSTOMER CUST ON P_SRV.CUSTOMER_ID=CUST.CUSTOMER_ID
  INNER JOIN DIM_INSTALLED_PRODUCTS PROD ON P_SRV.PRODUCT_ID=PROD.PRODUCT_ID
  INNER JOIN DIM_EMPLOYEE EMP ON P_SRV.EMPLOYEE_ID = EMP.EMPLOYEE_ID
  INNER JOIN DIM_DAY OPEND ON TRY_PARSE(P_SRV.SERVICE_OPEN_DATE AS DATE)=OPEND.FULL_DATE
  LEFT  JOIN DIM_DAY CLOSED ON TRY_PARSE(P_SRV.SERVICE_CLOSE_DATE AS DATE)=CLOSED.FULL_DATE
  INNER JOIN P_STG_INSTALLED_PRODUCTS P_PRD ON P_SRV.PRODUCT_ID=P_PRD.PRODUCT_ID AND P_SRV.SERVICE_OPEN_DATE BETWEEN P_PRD.VALID_FROM AND P_PRD.VALID_TO
  INNER JOIN DIM_PRODUCT_VERSION PVR ON PVR.PRODUCT_LEVEL = P_PRD.PRODUCT_LEVEL AND PVR.SERVICE_LEVEL=P_PRD.SERVICE_LEVEL AND PVR.VERSION=P_PRD.VERSION

     With this solution, we'll be able to see what the product's product level, service level and version were at the time of service in DIM_PRODUCT_VERSION as well as seeing what these values currently are today in DIM_INSTALLED_PRODUCTS.

No comments:

Post a Comment