Monday, May 13, 2013

Modeling a Junk Dimension and Populating it with Data using SSIS

     In data warehousing, a junk dimension provides a way to group together highly correlated attributes that are external to, but related to, a transaction in your source system. These are most often the codes you use to categorize a transaction, sourced from look-up tables (which are essentially enumerations stored in the database). The benefit is that instead of having to create n dimensions with n joins to the fact table, you only create 1 dimension with 1 join to the fact table. This reduces the complexity of your schema and reduces the number of joins your SQL statements have to make to get a full picture of the transaction. 

Take this sample schema:

     
Figure 1. Shipment Schema
     Here we have a schema that represents the audit trail of shipments of products to customers. It has the employee involved in preparing the shipment, the customer the product is being shipped to, what product and in what quantity, and the different shipping categories (priority, type and container). These shipment categories are prime candidates for a junk dimension in the data warehouse. Let's take a closer look at the data in these tables:

SHIP_PRIORITY
Figure 2. SHIP_PRIORITY Records

SHIP_TYPE

Figure 3. SHIP_TYPE Records

SHIP_CONTAINER

Figure 4. SHIP_CONTAINER Records
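
For reference, the three look-up tables might be defined along these lines. This is only a sketch: the column names are taken from the queries below, but the data types and constraints are assumptions about the source system.

```sql
-- Hypothetical DDL for the shipment look-up tables.
-- Column names match the queries in this post; types are assumed.
CREATE TABLE SHIP_PRIORITY (
    PRIORITY_CODE        VARCHAR(10) NOT NULL PRIMARY KEY,
    PRIORITY_DESCR       VARCHAR(50) NOT NULL
);

CREATE TABLE SHIP_TYPE (
    SHIP_TYPE_CODE       VARCHAR(10) NOT NULL PRIMARY KEY,
    SHIP_TYPE_DESCR      VARCHAR(50) NOT NULL
);

CREATE TABLE SHIP_CONTAINER (
    SHIP_CONTAINER_CODE  VARCHAR(10) NOT NULL PRIMARY KEY,
    SHIP_CONTAINER_DESCR VARCHAR(50) NOT NULL
);
```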

     Shipments can have any number of combinations of these 3 codes. In order to populate our junk dimension from these sources, we have to be wary of what these combinations mean and what impact they have on design and ETL. In the example here there is little data, so I would suggest using a cross join (Cartesian product) as the junk dimension source, as such: 

SELECT
      PRIORITY_CODE,
      PRIORITY_DESCR,
      SHIP_TYPE_CODE,
      SHIP_TYPE_DESCR,
      SHIP_CONTAINER_CODE,
      SHIP_CONTAINER_DESCR
FROM
      SHIP_PRIORITY CROSS JOIN
      SHIP_TYPE CROSS JOIN
      SHIP_CONTAINER

 Which returns:

Figure 5. Cross Join Results
     Now this solution seems to be the best for this particular data source, since we only have 60 combinations. However, you may encounter a few variations on this scenario in your source system. The first is the exact same schema, but with many more look-up values where most of the combinations will never occur. This can be due to business rules in the source system, or because certain combinations simply never happen. For this scenario a cross join just isn't feasible: why bring in tons of combinations whose likelihood of ever appearing in a transaction is remote? Instead, we only bring in the combinations that actually exist in the transactions. Say we went and added these values to the SHIPMENT table:

INSERT INTO SHIPMENT VALUES
(1, '1/1/2013', '1/3/2013', '1/2/2013', 2, 'L', 1, '2', 'WCR'),
(2, '2/1/2013', '2/20/2013', '2/19/2013', 3, 'M', 5, 'STD', 'SCR'),
(3, '3/1/2013', '3/2/2013', '3/2/2013', 4, 'H', 7, 'O', 'LBX')

If we changed our source query to:

SELECT
      PRIORITY_CODE,
      PRIORITY_DESCR,
      SHIP_TYPE_CODE,
      SHIP_TYPE_DESCR,
      SHIP_CONTAINER_CODE,
      SHIP_CONTAINER_DESCR
FROM
      SHIP_PRIORITY CROSS JOIN
      SHIP_TYPE CROSS JOIN
      SHIP_CONTAINER 
WHERE EXISTS (SELECT 1
              FROM SHIPMENT
              WHERE SHIP_PRIORITY.PRIORITY_CODE = SHIPMENT.PRIORITY_CODE AND
                    SHIP_TYPE.SHIP_TYPE_CODE = SHIPMENT.SHIP_TYPE_CODE AND
                    SHIP_CONTAINER.SHIP_CONTAINER_CODE = SHIPMENT.SHIP_CONTAINER_CODE)

We would return the following result set:

Figure 6. Combinations that Exist in the SHIPMENT Table
This cuts the combinations we deal with in the data warehouse down from 60 to 3. While that isn't much of a savings here, keep in mind that if you had thousands and thousands of values in each shipment look-up, this solution would help keep your junk dimension on a diet.
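
An equivalent way to pull only the observed combinations is to start from SHIPMENT itself and join back to the look-up tables for the descriptions, de-duplicating with DISTINCT. A sketch; either form should return the same 3 rows for this data:

```sql
SELECT DISTINCT
      P.PRIORITY_CODE,
      P.PRIORITY_DESCR,
      T.SHIP_TYPE_CODE,
      T.SHIP_TYPE_DESCR,
      C.SHIP_CONTAINER_CODE,
      C.SHIP_CONTAINER_DESCR
FROM
      SHIPMENT S
      INNER JOIN SHIP_PRIORITY P  ON P.PRIORITY_CODE = S.PRIORITY_CODE
      INNER JOIN SHIP_TYPE T      ON T.SHIP_TYPE_CODE = S.SHIP_TYPE_CODE
      INNER JOIN SHIP_CONTAINER C ON C.SHIP_CONTAINER_CODE = S.SHIP_CONTAINER_CODE
```

The EXISTS form tends to read more naturally as "the cross join, restricted", while the DISTINCT form can be cheaper when SHIPMENT is small relative to the full Cartesian product.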

      Another scenario worth considering: say we do have thousands of entries in one of these look-up tables, and many of the combinations are pertinent, but only certain subsets of the look-up tables are highly correlated. Perhaps there is a noteworthy correlation between priorities and containers, but types come in thousands of variations that have no real bearing on the other 2. In that situation, adding types to the junk dimension would unnecessarily exacerbate our record-count problem; it may be proper to keep types out of the junk dimension and make them their own dimension. The same reasoning applies if we have a second set of codes related to shipments, such as product codes. Those codes may be highly correlated with each other, but not with the shipment codes. Putting them all together in one junk dimension may not be the right design choice; splitting them into 2 may be best. Keep in mind, these dimensions may be used in conjunction with other facts where shipment codes are relevant and product codes are not, and vice versa.

Populating our Junk Dimension:

     After considering our design options, our junk dimension for the given schema would probably look something like this:

Figure 7.  DIM_SHIPMENT_CODES
This table could be populated from either the full cross join source or the restricted-to-SHIPMENT source; either way, the table design is sound. There are 2 ETL questions we need to answer as we decide how to populate DIM_SHIPMENT_CODES: do code descriptions ever change, and if so, do we lock down the description at the time of the transaction? Handling changes would definitely increase the complexity of your ETL: we would need to either update the dimension with the changed description, or insert the changed combination from the source system as a new row in DIM_SHIPMENT_CODES. For example's sake, let's assume that descriptions are locked down, thus requiring only inserts and not updates. Our Data Flow would look similar to:
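
If you prefer to do this insert-only load in T-SQL rather than in the Data Flow, the same logic can be sketched as one set-based statement. The DIM_SHIPMENT_CODES column names here are assumptions based on Figure 7, with SHIPMENT_CODE_KEY assumed to be an IDENTITY surrogate key the database assigns on insert:

```sql
-- Insert only the combinations not already present in the dimension.
-- SHIPMENT_CODE_KEY is assumed to be an IDENTITY column, so the
-- surrogate keys are generated for us.
INSERT INTO DIM_SHIPMENT_CODES (
      PRIORITY_CODE, PRIORITY_DESCR,
      SHIP_TYPE_CODE, SHIP_TYPE_DESCR,
      SHIP_CONTAINER_CODE, SHIP_CONTAINER_DESCR)
SELECT
      SRC.PRIORITY_CODE, SRC.PRIORITY_DESCR,
      SRC.SHIP_TYPE_CODE, SRC.SHIP_TYPE_DESCR,
      SRC.SHIP_CONTAINER_CODE, SRC.SHIP_CONTAINER_DESCR
FROM
      (SELECT PRIORITY_CODE, PRIORITY_DESCR,
              SHIP_TYPE_CODE, SHIP_TYPE_DESCR,
              SHIP_CONTAINER_CODE, SHIP_CONTAINER_DESCR
       FROM SHIP_PRIORITY CROSS JOIN SHIP_TYPE CROSS JOIN SHIP_CONTAINER) SRC
WHERE NOT EXISTS (
      SELECT 1
      FROM DIM_SHIPMENT_CODES D
      WHERE D.PRIORITY_CODE = SRC.PRIORITY_CODE
        AND D.SHIP_TYPE_CODE = SRC.SHIP_TYPE_CODE
        AND D.SHIP_CONTAINER_CODE = SRC.SHIP_CONTAINER_CODE)
```

The SSIS Data Flow below does the same thing, but gives you the row redirection, data viewers and logging that make the load easier to monitor.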

Figure 8. Junk Dimension Data Flow
     The OLE DB Source query would be either our cross join query or the query restricted with EXISTS against the SHIPMENT table. In our Look-up transformation we determine whether each unique combination of values in the data stream already exists in the dimension table. If it doesn't, we want to redirect those rows to our no match output:
Figure 9. Redirect Non-Existing Rows 
On the Connection page we specify our look-up query against the dimension:

Figure 10. Look-up Query Against Dimension
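
The look-up query itself only needs the natural key columns (plus, for the fact load later, the surrogate key). Something along these lines, with the DIM_SHIPMENT_CODES column names assumed from the dimension design above:

```sql
-- Look-up query: pull the surrogate key and the three natural keys
-- so the Look-up transformation can match incoming combinations.
SELECT
      SHIPMENT_CODE_KEY,
      PRIORITY_CODE,
      SHIP_TYPE_CODE,
      SHIP_CONTAINER_CODE
FROM
      DIM_SHIPMENT_CODES
```

Keeping the descriptions out of the look-up query trims the memory footprint of the Look-up transformation's cache.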

On the Columns page we specify the join criteria between the incoming data stream and the dimension. Any records that do not match these criteria are new combinations and will be redirected to our no match output:

Figure 11. Restricting New Records

Once we have determined these are new records, we want to generate surrogate keys for them and insert them into our dimension. Be sure to select the correct output when connecting to the transformation that generates the surrogate keys:

Figure 12. No Match Output


After we generate our surrogate keys for this data stream, we should see this in our data viewer:

Figure 13. Data Viewer with Records to be Inserted
     This dimension would ultimately end up being used in a data warehouse schema such as this (after related tables are denormalized):


Figure 14. Junk Dimension Usage in a Dimensional Schema

     When creating the package that populates FCT_SHIPMENT, we use the same look-up logic we did for populating DIM_SHIPMENT_CODES, except this time we include SHIPMENT_CODE_KEY in the look-up query, add SHIPMENT_CODE_KEY to the data stream output, and direct matching rows, rather than non-matching ones, on down the data stream. 
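
Expressed in plain SQL, that fact-load look-up amounts to joining each incoming shipment to the dimension on the three natural keys to pick up the surrogate key. A sketch, with SHIPMENT_ID and QUANTITY as assumed source column names; the other dimension keys would be resolved the same way:

```sql
-- Resolve the junk dimension surrogate key for each shipment.
SELECT
      S.SHIPMENT_ID,
      D.SHIPMENT_CODE_KEY,
      S.QUANTITY
FROM
      SHIPMENT S
      INNER JOIN DIM_SHIPMENT_CODES D
              ON D.PRIORITY_CODE = S.PRIORITY_CODE
             AND D.SHIP_TYPE_CODE = S.SHIP_TYPE_CODE
             AND D.SHIP_CONTAINER_CODE = S.SHIP_CONTAINER_CODE
```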

     Our data warehouse now has a junk dimension for our shipment codes. Your end users and report developers are happy with the simplicity of use and your DBA is happy to only have to maintain 1 table instead of 3 and monitor 1 SSIS package instead of 3.
