Tuesday, May 28, 2013

Parse Delimited Values from a Column using a SSIS Script Component

     I get some weird data sometimes from our source system developers. Not too long ago, I received a piece of data that had a column with a number of values that were semi-colon delimited and needed to be parsed out during ETL. I don't even know how they got the data in this format or why data, coming from a database, should ever look like this. Since this isn't a flat file, I can't use a flat file connection manager to parse by the delimiter. So, we'll have to use some custom code in a script component to do this. 

     Say we were given a piece of data that looks like this:


Figure 1. Auditor Assignments


This data represents the departments 3 different auditors are assigned to, delimited by a comma. Now, you would have hoped to have a piece of data that has the AUDITOR_ID and each individual department in separate rows, but this was delivered instead. In order to work with this data, we will have to parse out the departments.

Create/Populate Script:

CREATE TABLE AUDIT_ASSIGNMENT
(
AUDITOR_FIRST_NAME VARCHAR(25),
AUDITOR_LAST_NAME VARCHAR(30),
AUDITOR_ID INT,
ASSIGNED_DEPARTMENTS VARCHAR(100)
)
GO

INSERT INTO AUDIT_ASSIGNMENT VALUES
('John', 'Smith', 10134, 'Finance, Marketing, Operations'),
('Bob',  'Jones', 10135, 'Accounting, Product Management, R&D'),
('Jane', 'Foley', 10136, 'HR, Sales, Account Management, Customer Support')
GO


     The Data Flow we're going to use in this example is going to look like this:


Figure 2. Data Flow
     
     The OLE DB Source is going to be our AUDIT_ASSIGNMENT table. The Script Component is going to have 2 outputs, 1 that is going to push out data "as is" from the source and a second that is going to parse out the departments and pair them with the AUDITOR_ID of the auditor assigned. Ignore the Union All Components, they are there so I can add Data Viewers to the Data Flow.

     We also need to set up the delimiter for this data using a SSIS variable. In this example we are using a comma, but this can be anything we want it to be, a semi-colon, pipe, etc.:


Figure 3.  SSIS Variable

This will be used to configure our Script Component that will do the parsing. Drag a Script Component transformation from the tool box onto the Data Flow design surface. When prompted to select a script component type, choose Transformation:


Figure 4. Script Component Type

On the Script screen, make sure to add our SSIS Variable as a read only variable:


Figure 5. Script Component Script Screen
     
     With this done, let's go to the Input Columns screen and select the AUDITOR_ID and ASSIGNED_DEPARTMENTS columns, since these are the ones we need to produce our data set that pairs auditors with departments:


Figure 6. Script Component Input Columns Screen

     Next, we need to click on Input and Outputs to create a new output for our auditor/department pairing. Click on the Add Output button and give it a name, in this example I named it Output_Department. Under Output_Department, create 2 columns: AUDITOR_ID DT_I4 and  DEPARTMENT DT_STR 50:


Figure 7. Script Component Inputs and Outputs Screen

     Notice we are leaving alone the Output 0 output. This output is going to redirect the untransformed data from the OLE DB Source. Now we're ready to start coding, go back to the Script screen and click the Edit Script button. Paste the following code in main.cs:


#region Namespaces
using System;
using System.Data;
using System.Windows.Forms;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
#endregion

#region Class
/// <summary>
/// This class parses out delimited values from a column in a database
/// </summary>
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{

#region Methods
    /// <summary>
    /// This method is called once for every row that passes through the component from Input0.
    ///
    /// Example of reading a value from a column in the the row:
    ///  string zipCode = Row.ZipCode
    ///
    /// Example of writing a value to a column in the row:
    ///  Row.ZipCode = zipCode
    /// </summary>
    /// <param name="Row">The row that is currently passing through the component</param>
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {

        try
        {
            //Get delimiter
            char delimiter = (char)Variables.DELIMITER;
            //Get parsed departments
            string[] department = Row.ASSIGNEDDEPARTMENTS.Split(delimiter);
         
            //Output new rows for each department
            foreach (string dept in department)
            {
                OutputDepartmentBuffer.AddRow();
                OutputDepartmentBuffer.AUDITORID = Row.AUDITORID;
                OutputDepartmentBuffer.DEPARTMENT = dept.Trim();
            }
        }
        catch (Exception e)
        {
            failComponent(e.ToString());
        }
    }

    /// <summary>
    /// Outputs an error message
    /// </summary>
    /// <param name="errorMsg">The exception message</param>
    private void failComponent(string errorMsg)
    {
        bool fail = false;
        IDTSComponentMetaData100 compMetadata = this.ComponentMetaData;
        compMetadata.FireError(1, "Error Parsing Departments!"errorMsg"", 0, out fail);

    }
#endregion
}
#endregion

     Lets go through and explain some of this code. First thing we want to do is override the public override void Input0_ProcessInputRow(Input0Buffer Row) method. For every record that comes in we want to parse out the departments into an array  string[] department = Row.ASSIGNEDDEPARTMENTS.Split(delimiter) based on the delimiter we declared in our SSIS variable char delimiter = (char)Variables.DELIMITER. We then loop through the array and output each department/auditor id to the OutputDepartmentBuffer:


            //Output new rows for each department
            foreach (string dept in department)
            {
                OutputDepartmentBuffer.AddRow();
                OutputDepartmentBuffer.AUDITORID = Row.AUDITORID;
                OutputDepartmentBuffer.DEPARTMENT = dept.Trim();
            }

     Build, save and lets run the package and see what our outputs look like:

Figure 8. Data Viewers
     As you can see, we were able to parse out each department and add it to its own output. Once these 2 outputs are sent to their data destinations we can use SQL to join them on AUDITOR_ID, if we need to project auditor attributes along with department attributes.

6 comments:

  1. Love it!, Thanks for the really clear walk through

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. 1. Session slot has value of “08-May-2014 (AM, PM)
    09-May-2014 (AM, PM)”
    We need to explore if this can be read by ETL to breakdown the record as below,
    Date Slot Status
    8-May-2014 AM Confirmed
    8-May-2014 PM Confirmed
    9-May-2014 AM Confirmed
    9-May-2014 PM Confirmed

    ReplyDelete
    Replies
    1. I guess I'm not understanding. Do you need help with something?

      Delete