Take this sample schema:
Here we have a schema that represents the audit trail of shipments of products to customers. It has the employee involved in preparing the shipment, the customer the product is being shipped to, what product and in what quantity, and different shipping categories(priority, type and container). These different shipment categories are prime candidates for a junk dimension in the data warehouse. Let's take a closer look at the data in these tables:
SHIP_PRIORITY
Figure 2. SHIP_PRIORITY Records |
SHIP_TYPE
Figure 3. SHIP_TYPE Records |
SHIP_CONTAINER
Figure 4. SHIP_CONTAINER Records |
SELECT
PRIORITY_CODE,
PRIORITY_DESCR,
SHIP_TYPE_CODE,
SHIP_TYPE_DESCR,
SHIP_CONTAINER_CODE,
SHIP_CONTAINER_DESCR
FROM
SHIP_PRIORITY CROSS JOIN
SHIP_TYPE CROSS JOIN
SHIP_CONTAINER
Which returns:
Figure 5. Cross Join Results |
INSERT INTO SHIPMENT VALUES
(1, '1/1/2013', '1/3/2013', '1/2/2013', 2, 'L',1, '2', 'WCR'),
(2, '2/1/2013', '2/20/2013', '2/19/2013',3,'M',5, 'STD', 'SCR'),
(3, '3/1/2013', '3/2/2013', '3/2/2013',4,'H',7, 'O', 'LBX')
If we changed our source query to:
SELECT
PRIORITY_CODE,
PRIORITY_DESCR,
SHIP_TYPE_CODE,
SHIP_TYPE_DESCR,
SHIP_CONTAINER_CODE,
SHIP_CONTAINER_DESCR
FROM
SHIP_PRIORITY CROSS JOIN
SHIP_TYPE CROSS JOIN
SHIP_CONTAINER
WHERE EXISTS (SELECT
1
FROM
SHIPMENT
WHERE
SHIP_PRIORITY.PRIORITY_CODE=SHIPMENT.PRIORITY_CODE AND
SHIP_TYPE.SHIP_TYPE_CODE=SHIPMENT.SHIP_TYPE_CODE AND
SHIPMENT.SHIP_CONTAINER_CODE=SHIP_CONTAINER.SHIP_CONTAINER_CODE
)
We would return the following result set:
Figure 6. Combinations that Exist in the SHIPMENT Table |
This cuts down the combinations we deal with in the data warehouse from 60 to 3. Now while this isn't that much, keep in mind if you had thousands and thousands of values for each shipment look-up, this solution would help to keep your junk dimension on a diet.
Another scenario worth considering is say we do have these thousands of entries in one of these look-up tables and many of these combinations are pertinent. But maybe only certain sub-sets of these look-up tables are highly correlated. Perhaps there is a high correlation between priorities and containers that is noteworthy, but when it comes to types we may have thousands and thousands of variations that have no real bearing on the other 2. This may be a situation where adding types to the junk dimension may unnecessarily exacerbate our record number problem. It may be proper to keep this out of the junk dimension and make types its own dimension. This may also be pertinent if we have a second set of codes related to shipments such as product codes. These codes may be highly correlated to each other, but not to the shipment codes. Putting these all together in one junk dimension may not be the right design choice, splitting them in 2 may be best. Keep in mind, these dimensions may be used in conjunction with other facts where shipment codes are relevant and product codes are not and vice versa.
Populating our Junk Dimension:
After considering our design options, our junk dimension for the given schema would probably look something like this:
Figure 7. DIM_SHIPMENT_CODES |
Figure 8. Junk Dimension Data Flow |
Figure 9. Redirect Non-Existing Rows |
On the connection table we specify our look-up query against the dimension:
Figure 10. Look-up Query Against Dimension |
Figure 11. Restricting New Records |
Figure 12. No Match Output |
After we generate our surrogate keys for this data stream, we should see this in our data viewer:
Figure 13. Data Viewer with Records to be Inserted |
Figure 14. Junk Dimension Usage in a Dimensional Schema |
Our data warehouse now has a junk dimension for our shipment codes. Your end users and report developers are happy with the simplicity of use and your DBA is happy to only have to maintain 1 table instead of 3 and monitor 1 SSIS package instead of 3.
Very nicely explained with ETL example.
ReplyDeleteThanks!!!