Luckily, to address this issue we can take advantage of a modeling trick called a mini-dimension. With a mini-dimension, we separate the rapidly changing attributes from the dimension and add them to their own dimension. This eliminates the need to add more records to the dimension every time a change takes place. In this post I'll be describing how to implement a SCD type 4 in your schema, which takes advantage of a mini-dimension.
For this example, lets use a source system schema such as this:
Figure 1. Source System Schema |
This represents a service transaction on a product, for example's sake it doesn't matter what kind of service, just that a transaction at a point in time involved a customer's product. We have the product being serviced, the customer who owns the product, the employee who serviced the product and when the service took place. Now, deriving a dimensional schema from this appears rather simple, and would probably look something like this:
Figure 2. Dimensional Schema |
It's simple and easy to understand, reporting out of this schema should not be a problem. However; what if the version of the product changes very often, impacting the product level and service level attributes, and the number of installed products goes up into the millions? Implementing an SCD type 2 probably wouldn't be a good choice. For this type of situation, were going to pull the rapidly changing attributes out of DIM_INSTALLED_PRODUCTS and put them into their own dimension:
Figure 3. Dimensional Schema with Mini-Dimension |
With the addition of DIM_PRODUCT_VERSION we can see what the product version, product level and service level were at the time of the transaction as well as what they are now (make those attributes SCD type 1 in DIM_INSTALLED_PRODUCTS).
Now that we have this modeled out we need to figure out how to populate this new dimension and link it to FCT_SERVICE. Let's generate some sample data, stage it into a persistent staging database (integration layer in your architecture)and then populate the dimension.
-->Source System Table
CREATE TABLE INSTALLED_PRODUCTS
(
PRODUCT_ID INT NOT NULL,
NAME VARCHAR(20),
PRODUCT_LEVEL INT,
SERVICE_LEVEL INT,
VERSION VARCHAR(10),
INSTALL_DATE DATETIME,
PRODUCT_GROUP VARCHAR(25),
SERIAL_NUMBER VARCHAR(10),
CUSTOMER_ID INT ,
CONSTRAINT PK_INSTALLED_PRODUCTS PRIMARY KEY(PRODUCT_ID)
)
GO
-->Sample Data
INSERT INTO
INSTALLED_PRODUCTS
(PRODUCT_ID,NAME,PRODUCT_LEVEL,SERVICE_LEVEL,VERSION,INSTALL_DATE,PRODUCT_GROUP,SERIAL_NUMBER,CUSTOMER_ID)
VALUES
(1, 'PRODUCT A', 1, 2, 'V1.1', '09/01/2013', 'HIGH END', '12345GR4', 1),
(2, 'PRODUCT B', 5, 1, 'V2.1', '11/14/2010', 'LOW END', '7555655YY8', 4),
(3, 'PRODUCT C', 7, 5, 'V3.5', '02/17/2008', 'HIGH END', '43789HDKH', 8),
(4, 'PRODUCT D', 2, 5, 'V7.2', '05/22/2012', 'LOW END', '2956DHD55', 9)
GO
When adding this data to a persistent staging database table, let's call it P_STG_INSTALLED_PRODUCTS, it may end up looking something like this:
Figure 4. Data in Persistent Staging |
Here we inserted all the data from the source system, marked the date it was valid from, set the valid to date to some time way out in the future, and generated hash values for each record. Say 10 days later a change happens to PRODUCT A, we compare the hash of the data coming in from staging with the hash existing in persistent staging, determine if they are different and if so insert a new record:
Figure 5. Record Change in Persistent Staging |
When inserting this new record, we need to be sure to update the previous version of this record to have the VALID_TO date set the end of the day right before the change. With some sample data generated, lets populate our mini-dimension. To populate this dimension, were going to treat it just like a junk dimension by just inserting unique combinations of these attributes and only combinations that exist in the source system table.
SELECT DISTINCT
PRODUCT_LEVEL,
SERVICE_LEVEL,
VERSION
FROM
P_STG_INSTALLED_PRODUCTS P_STG
WHERE
NOT EXISTS(
SELECT
1
FROM
DIM_PRODUCT_VERSION DPV
WHERE
P_STG.PRODUCT_LEVEL=DPV.PRODUCT_LEVEL
AND P_STG.SERVICE_LEVEL=DPV.SERVICE_LEVEL
AND P_STG.VERSION=DPV.VERSION
)
This query, along with generating a surrogate key for the dimension, can be used to populate our DIM_PRODUCT_VERSION mini-dimension. Next, we need to figure out how to incorporate this mini-dimension into the loading of the FCT_SERVICE table. Lets create some sample data to work with for this example.
-->Source System Table
CREATE TABLE
SERVICE
(
SERVICE_ID
INT NOT NULL,
SERVICE_OPEN_DATE
DATETIME NOT
NULL,
SERVICE_CLOSE_DATE
DATETIME,
PRODUCT_ID
INT NOT NULL,
CUSTOMER_ID
INT NOT NULL,
EMPLOYEE_ID
INT NOT NULL,
CONSTRAINT
PK_SERVICE PRIMARY
KEY(SERVICE_ID),
CONSTRAINT
FK_SERVICE_PRODUCT_ID FOREIGN KEY (PRODUCT_ID) REFERENCES INSTALLED_PRODUCTS(PRODUCT_ID),
CONSTRAINT
FK_SERVICE_CUSTOMER_ID FOREIGN
KEY (CUSTOMER_ID) REFERENCES CUSTOMER(CUSTOMER_ID),
CONSTRAINT
FK_SERVICE_EMPLOYEE_ID FOREIGN
KEY (EMPLOYEE_ID) REFERENCES EMPLOYEE(EMPLOYEE_ID)
)
GO
-->Sample Data
INSERT INTO
SERVICE
(SERVICE_ID,SERVICE_OPEN_DATE,SERVICE_CLOSE_DATE,PRODUCT_ID,CUSTOMER_ID,EMPLOYEE_ID)
VALUES
(1, '11/25/2013 10:22:33.324','11/25/2013 11:45:22.432',
1, 1, 7),
(2, '11/26/2013 16:37:21.876','11/27/2013 10:21:54.765',
2, 4, 9),
(3, '11/28/2013 19:29:48.768','11/28/2013 22:25:12.472',
3, 8, 7),
(4, '12/05/2013 05:09:12.354','12/05/2013 07:16:08.009',
4, 9, 7),
(5, '12/11/2013 07:12:32.312','12/12/2013 23:25:14.764',
1, 1, 7)
GO
This data would eventually be inserted into a persistent staging table just like we did for INSTALLED_PRODUCTS, with P_STG_INSTALLED_PRODUCTS, lets call this table P_STG_SERVICE. This will be the primary table we will look at when trying to generate the data for our FCT_SERVICE table. You'll notice that from the data in SERVICE, that there are 2 transactions for PRODUCT_A. One that occurred before the row in INSTALLED_PRODUCTS was updated and one after. So were going to want to join to DIM_PRODUCT_VERSION on values that matched the values in INSTALLED_PRODUCTS at the time of the transaction for that product in SERVICE. This can be achieved, by populating FCT_SERVICE, in a query like this:
With this solution, we'll be able to see what the product's product level, service level and version were at the time of service in DIM_PRODUCT_VERSION as well as seeing what these values currently are today in DIM_INSTALLED_PRODUCTS.
SELECT
P_SRV.SERVICE_ID,
CUST.CUSTOMER_KEY,
PROD.PRODUCT_KEY,
EMP.EMPLOYEE_KEY,
OPEND.DAY_KEY AS SERVICE_OPEN_DAY_KEY,
ISNULL(CLOSED.DAY_KEY,0) AS SERVICE_CLOSE_DAY_KEY,
PVR.PRODUCT_VERSION_KEY,
CASE WHEN P_SRV.SERVICE_CLOSE_DATE IS NOT NULL
THEN DATEDIFF(MI,P_SRV.SERVICE_OPEN_DATE, P_SRV.SERVICE_CLOSE_DATE)
ELSE NULL
END AS MINUTES_TO_CLOSE
FROM
P_STG_SERVICE P_SRV
INNER JOIN DIM_CUSTOMER CUST ON P_SRV.CUSTOMER_ID=CUST.CUSTOMER_ID
INNER JOIN DIM_INSTALLED_PRODUCTS PROD ON P_SRV.PRODUCT_ID=PROD.PRODUCT_ID
INNER JOIN DIM_EMPLOYEE EMP ON P_SRV.EMPLOYEE_ID = EMP.EMPLOYEE_ID
INNER JOIN DIM_DAY OPEND ON TRY_PARSE(P_SRV.SERVICE_OPEN_DATE AS DATE)=OPEND.FULL_DATE
LEFT JOIN DIM_DAY CLOSED ON TRY_PARSE(P_SRV.SERVICE_CLOSE_DATE AS DATE)=CLOSED.FULL_DATE
INNER JOIN P_STG_INSTALLED_PRODUCTS P_PRD ON P_SRV.PRODUCT_ID=P_PRD.PRODUCT_ID AND P_SRV.SERVICE_OPEN_DATE BETWEEN P_PRD.VALID_FROM AND P_PRD.VALID_TO
INNER JOIN DIM_PRODUCT_VERSION PVR ON PVR.PRODUCT_LEVEL = P_PRD.PRODUCT_LEVEL AND PVR.SERVICE_LEVEL=P_PRD.SERVICE_LEVEL AND PVR.VERSION=P_PRD.VERSION
With this solution, we'll be able to see what the product's product level, service level and version were at the time of service in DIM_PRODUCT_VERSION as well as seeing what these values currently are today in DIM_INSTALLED_PRODUCTS.
No comments:
Post a Comment