Wednesday, May 8, 2013

Modeling Blobs for a Data Warehouse and Concatenating Nvarchar(max), Varchar(max), TEXT, NTEXT in a SSIS Scripting Component

     When dealing with facts and dimensions in a data warehouse, you usually model for attributes of the event such as employee involved, day it occurred, customer associated, etc. (dimensions). These are then associated with measures to quantify the event such as minutes to completion, cost, quantity used, etc. (facts). This scenario is easy, lots of documentation out there. What's troubling is when we introduce a blob to this scenario. If you deal with data in any kind of service related field, or medical field, you are bound to have to load and report on large text fields for these repair notes or doctor's notes. 

     When attempting to model this scenario, you really only have 4 options.
  • Create a fact-less fact 
    • Have the blob contained in a column in the fact and have dimensions hang off of it such as employee dimension (who created the note) day dimension (when was it created) and customer dimension (customer who was serviced) as well as a degenerate dimension for a transaction number and note identifier(or creation date if a combination of transaction number and creation date creates a unique identifier for a compound key)
    • Every record will correspond to an individual note entered into the source system
    • Easy to load and, if you're using SQL Server, will be stored separately from the other data in the table. So queries that don't involve the blob will not affect query performance
Figure 1. Fact-less Fact Solution
  • Create a bridge table 
    • Move the notes to a separate dimension, will have to also move creation time and employee information(since the employee that may be associated with the fact record may not be the same employee that wrote the note)
    • create a "bridge" between the transaction fact and the notes dimension with an ordering column to sort by
    • Can work with facts that contain multiple records for same transaction
    • Hard to load and maintain, not easy for end users to understand

Figure 2. Bridge Table Solution
  • Concatenate notes for transaction and place in column in fact
    • Will only be viable if the transaction can be expressed in a single row in the fact, else you will have to concatenate and copy for each row associated with that transaction....yuck.
    • Creation Date and Employee information will have to be concatenated with the note and notes will have to be separated by some kind of line break
    • Breaks some design paradigms (See rule #7) 

Figure 3. Note Concatenated in Fact Solution 

  • Concatenate notes for transaction in a single dimension
    • Can work with facts that contain multiple records for same transaction
    • Creation Date and Employee information will have to be concatenated with the note and notes will have to be separated by some kind of line break
    • This means that your dimension will grow at the same rate as your transaction growth, this can get ugly 
Figure 4. Note Concatenated in Dimension Solution
     The first solution is by far the easiest to deal with from an ETL stand point. The second  is definitely the hardest of the 4 solutions to implement. The last 2 are very similar in that both need to have the notes concatenated in association with the transaction. The next section will demonstrate how to prep the data for either solution 3 or solution 4 using a script component in SSIS.  

Sample Data Set:
Figure 5. Sample Data Set

     The sample data set we're going to use for this will consist of a transaction identifier (TRANS_NO nvarchar(10)), a note creation date (CREATE_DATE datetime), and a note written about the transaction (SERVICE_NOTE nvarchar(max)). This will be the data in the OLE DB Source. The goal is to take the notes and create dates of these 2 transactions and concatenate them into one note, the result being 2 records (1 for each transaction). It is very important that this data set is sorted. The transaction numbers need to be grouped together as shown in this example or the following script component won't work.

Example Data Flow:
Figure 6. Example Data Flow
(Ignore the Union All transformation, I have that here so that I can later add a data viewer to the data stream. You would usually have your data destinations here instead)

     First thing is to configure the amount of rows we want to go through the scripting component at a time. I find that 1000 seems to be good for my projects, anything more than that I see sporadic performance. You may experience differences in your project. In the design surface of your data flow, right click and select properties. Then go to the DefaultBufferMaxRows and type 1000 for the value:

Figure 7. DefaultBufferMaxRows Setting



Creating the Script Component:
     Drag a scripting component onto the data flow design surface from the tool box. Connect the output of the OLE DB source to the script component. Now lets configure. Open the script component. Leave everything on the script screen at its default:


Figure 8. Script Screen

     On the input screen select the 3 columns from our OLE DB Source:

Figure 9. Input Columns Screen

     On the Inputs and Outputs screen expand Output0 and click on Output Folders node. We're going to add 2 columns to this output by clicking the Add Column button. One column we'll name TRANSNOOUT and make it a Unicode string [DT_WSTR] with a length of 10. The second column we'll name NOTE_OUT and make it a Unicode text stream [DT_NTEXT]:

Figure 10. Adding Columns to Output

Before we leave this screen we have one more important thing to do. Click on the Output0 node. In common properties section we need to change the SynchronousInputID selection to none: 

Figure 11. Changing the SynchronousInputID Value

According to Microsoft's description (see previous link) "For a synchronous output, where rows are passed through to the selected output or outputs without adding any new rows, this property should contain the ID of the component's input." Since we are going to replace the rows going in with new concatenated rows, we need to select None.
Since we're not using any Connection Managers lets go back to the Script screen, click the Edit Script button and paste the following code:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Windows.Forms;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Text;


[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()]

public class ScriptMain : UserComponent
{
    //Holds current transaction number in buffer
    private string transNo;

    //String builders that concatenates the notes
    StringBuilder concatNote = new StringBuilder();

    //Boolean for determining if loop is on first row of series or not
    private bool firstRow = true;


    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        //Converts our note from the input buffer to string
        int noteLen = Convert.ToInt32(Row.SERVICENOTE.Length);
        byte[] noteBlob = new byte[noteLen];
        noteBlob = Row.SERVICENOTE.GetBlobData(0, noteLen);
        string ServiceStr = System.Text.Encoding.Unicode.GetString(noteBlob);
        //Holds the create date from the input buffer
        string createDate = Row.CREATEDATE.ToString();

        //If were still within the same transaction(will be false for first row coming in the input buffer)
        if (Row.TRANSNO == transNo)
        {
            //Pass notes to the concat function
            concatNote.Append(createDate+ "\r\n"+ServiceStr + "\r\n");
        }
        else
        {
            //If this is the first record in a new series we dont want to add a record to the output yet
            if (firstRow)
            {
            //We set the indicator to false now so that when we compare transaction numbers again, and were
            //not on the same transaction, we need to output a record
                firstRow = false;
            }
            else
            {
          
                // If the transaction number has changed we output the string for the current transaction no
                //Add a row to the output buffer
                Output0Buffer.AddRow();
                //Pass through the transaction number
                Output0Buffer.TRANSNOOUT = transNo;
                //Output the concatenated note
                Output0Buffer.NOTEOUT.AddBlobData(Encoding.Unicode.GetBytes(concatNote.ToString()));
        
                //Reset the string builder for the next transaction(done in .net 2.0, so no .Clear() method
                concatNote.Length = 0;
                concatNote.Capacity = 0;

            }
            //Prep for next transaction
            transNo = Row.TRANSNO;
            concatNote.Append(createDate + "\r\n" + ServiceStr + "\r\n");
        }
    }

    public override void Input0_ProcessInput(Input0Buffer Buffer)
    {
        base.Input0_ProcessInput(Buffer);

        //Makes sure last record in buffer gets outputted
            if (Buffer.EndOfRowset())
            {
                Output0Buffer.AddRow();
                Output0Buffer.TRANSNOOUT = transNo;
                Output0Buffer.NOTEOUT.AddBlobData(Encoding.Unicode.GetBytes(concatNote.ToString()));
                Output0Buffer.SetEndOfRowset();
            }
       

    }

}




     Lets step through and explain some of this code. if (Row.TRANSNO == transNo)determines if the transaction number were dealing with is the same as the previous one in the buffer. This way we know if we need to still keep concatenating notes or not. You'll notice when setting the note blob to a string the encoding is Unicode string ServiceStr = System.Text.Encoding.Unicode.GetString(noteBlob);. This is because the blob coming through the input buffer is an nvarchar(max). If you were using a TEXT or varchar(max) column the encoding should be changed to ASCII as: string ServiceStr = System.Text.Encoding.ASCII.GetString(noteBlob);.  Output0Buffer.AddRow(); is where we add a new record to our output buffer and output the concatenated note with the transaction number:

Figure 12. Data Viewer

As you can see the original 8 records have been condensed to 2 with notes concatenated in the NOTE_OUT field. Since the data viewer cant display blobs I created a message box (MessageBox.Show("Transaction Number: " + transNo + "\r\n" + "Note: "+concatNote.ToString());that will display the data for us:



Figure 13. Result for Transaction 1122424566

Figure 14. Result for Transaction 1122580388
Now we're ready to either add this to a fact table or slap surrogate keys on them and insert them into a dimension.




No comments:

Post a Comment