Dennis and Jim's Data Engineering Blog: May 2013

Thursday, May 30, 2013

Creating Daily Snapshots of Data Warehouse Growth

When running ETL jobs, its wise to track the number of records processed, inserted, updated etc. This is easy to do when using an ETL tool such as SSIS. This data contributes to decisions made when planning for resources for your data warehouse. In addition to this activity, it is also good to keep track of the size of these databases to determine what kind of scalability you're going to need. This also helps you to forecast what the growth will be in a month, a year, 5 years, etc. You can track and model this data just like any other piece of data your data warehouse imports. It contains dimensions(databases, tables, dates) and facts(rows, size, index sizes, etc.).

To aid us with this, we can take advantage of a number of system stored procedures that come with SQL Server.

SP_HELPDB

Returns basic and extended information about a database, depending on if the database name is passed as a parameter

SP_HELPFILE

Returns information about the physical files associated with the database

SP_SPACEUSED

Returns information about a table. This includes rows, disk space, index sizes, etc.

SP_MSFOREACHTABLE

This is an undocumented stored procedure that allows you to execute a statement against every table in a database

For database information we can execute an SQL statement like this(The temp tables can be replaced with tables used for staging data):

-->DECLARATIONS

-->HOLDS DATABASE NAMES

DECLARE @TEMP_DB_NAME TABLE

(

DB_KEY INT IDENTITY(1,1),

DB_NAME VARCHAR(200)

)

-->HOLDS THE NUMBER OF DATABASES TO LOOP FOR

DECLARE @NUMBER_DBS INT

-->ITERATOR

DECLARE @COUNTER INT

-->DB NAME IN LOOP

DECLARE @DB_NAME VARCHAR(200)

-->HOLD EXECUTE STATEMENT

DECLARE @STATEMENT VARCHAR(200)

-->LOADS THE TEMP TABLE WITH OUR DATABASE NAMES

INSERT INTO @TEMP_DB_NAME

SELECT

NAME

FROM

SYS.DATABASES

WHERE

DATABASE_ID>4

AND HAS_DBACCESS(NAME)=1

-->INITIALIZE ITERATOR

SET @COUNTER=1

-->SETS THE NUMBER OF DATABASES WE HAVE

SELECT

@NUMBER_DBS=MAX(DB_KEY)

FROM

@TEMP_DB_NAME

CREATE TABLE #DATABASEINFO

(

NAME VARCHAR(128),

DB_SIZE NVARCHAR(13),

OWNER VARCHAR(255),

DBID SMALLINT,

CREATED SMALLDATETIME,

STATUS NVARCHAR(600),

COMPATIBILITY_LEVEL TINYINT,

PROCESSDATE DATETIME

)

-->EXTENDED DATABASE INFO

CREATE TABLE #DATABASEEXTINFO

(

NAME VARCHAR(128),

FILEID SMALLINT,

FILENAME NCHAR(260),

FILEGROUP NVARCHAR(128),

DBSIZE NVARCHAR(18),

MAXDBSIZE NVARCHAR(18),

GROWTH NVARCHAR(18),

USAGE VARCHAR(9),

DATABASENAME VARCHAR(128),

PROCESSDATE DATETIME

)

-->LOOP

WHILE @COUNTER<= @NUMBER_DBS

BEGIN

SELECT

@DB_NAME = DB_NAME

FROM

@TEMP_DB_NAME

WHERE

DB_KEY=@COUNTER

-->GET EXTENDED INFORMATION

EXEC('USE '+ @DB_NAME + ' INSERT INTO #DATABASEEXTINFO (NAME, FILEID, FILENAME, FILEGROUP, DBSIZE, MAXDBSIZE, GROWTH, USAGE)

EXEC SP_HELPFILE')

UPDATE

#DATABASEEXTINFO

SET

DATABASENAME=@DB_NAME

WHERE

DATABASENAME IS NULL

SET @COUNTER= @COUNTER+1

END

-->ALL DATABASE GENERAL INFO

INSERT INTO #DATABASEINFO (NAME, DB_SIZE, OWNER, DBID, CREATED, STATUS, COMPATIBILITY_LEVEL)

EXEC SP_HELPDB

-->REMOVE SYSTEM DATABASES

DELETE FROM #DATABASEINFO WHERE DBID<5

-->SET PROCESSEDATE

UPDATE #DATABASEINFO SET PROCESSDATE=GETDATE()

UPDATE #DATABASEEXTINFO SET PROCESSDATE=GETDATE()

-->SHOW RESULTS

SELECT

NAME,

DB_SIZE,

OWNER,

DBID,

CREATED,

STATUS,

COMPATIBILITY_LEVEL,

PROCESSDATE

FROM

#DATABASEINFO

SELECT

NAME,

FILEID,

FILENAME,

FILEGROUP,

DBSIZE,

MAXDBSIZE,

GROWTH,

USAGE,

DATABASENAME,

PROCESSDATE

FROM

#DATABASEEXTINFO

-->DROP TEMP TABLES

DROP TABLE #DATABASEEXTINFO

DROP TABLE #DATABASEINFO

This statement will produce 2 recordsets (tuples) that look something like this:

Figure 1. SP_HELP with No Parameter

Figure 2. SP_HELP with Database Name Parameter

These 2 data sources can then be used to generate the tables behind this schema:

Figure 3. Database Growth Schema

This schema represents a periodic snapshot that captures the database file and log file sizes daily. You sum on FILE_SIZE and roll up to the database to get the size of the entire database on a given day. If you have all of your inserts and updates for the entire database logged in another table, you can roll those up to the database/day level and add those as measures to the fact as well. From this schema you could produce daily database reports giving an overview of what the growth looks like:

Figure 4. Database Growth Report

To get down to the level of granularity of individual tables, we can execute a SQL statement similar to this (The temp tables can be replaced with tables used for staging data):

-->DECLARATIONS

-->HOLDS DATABASES TO LOOP THROUGH

DECLARE @TEMP_DB_NAME TABLE

(

DB_KEY INT IDENTITY(1,1),

DB_NAME VARCHAR(200)

)

-->HOLDS DATA ABOUT EACH TABLE

CREATE TABLE #TABLEINFO (

NAME VARCHAR(200),

ROWS INT,

RESERVED VARCHAR(255),

DATA VARCHAR(20),

INDEX_SIZE VARCHAR(20),

UNUSED VARCHAR(250),

DBNAME VARCHAR(50),

PROCESSDATE DATETIME

)

-->HOLDS THE NUMBER OF DATABASES TO LOOP FOR

DECLARE @NUMBER_DBS INT

-->ITERATOR

DECLARE @COUNTER INT

-->DB NAME IN LOOP

DECLARE @DB_NAME VARCHAR(200)

-->LOADS THE TEMP TABLE WITH OUR DATABASE NAMES

INSERT INTO @TEMP_DB_NAME

SELECT

NAME

FROM

SYS.DATABASES

WHERE

DATABASE_ID>4

AND HAS_DBACCESS(NAME)=1

-->INITIALIZE ITERATOR

SET @COUNTER=1

-->SETS THE NUMBER OF DATABASES WE HAVE

SELECT

@NUMBER_DBS=MAX(DB_KEY)

FROM

@TEMP_DB_NAME

-->PERFORMS AN INSERT OF TABLE STATS FOR EACH DATABASE

WHILE @COUNTER<= @NUMBER_DBS

BEGIN

SELECT

@DB_NAME = DB_NAME

FROM

@TEMP_DB_NAME

WHERE

DB_KEY=@COUNTER

-->GET TABLE STATISTICS

EXECUTE(

'USE '+@DB_NAME +

' INSERT INTO #TABLEINFO

(

NAME,

ROWS,

RESERVED,

DATA,

INDEX_SIZE,

UNUSED

)

EXEC SP_MSFOREACHTABLE @COMMAND1="EXEC SP_SPACEUSED ''?''"')

UPDATE A

SET

A.DBNAME=@DB_NAME,

A.PROCESSDATE=GETDATE()

FROM

#TABLEINFO A

WHERE

A.DBNAME IS NULL

AND A.PROCESSDATE IS NULL

SET @COUNTER= @COUNTER+1

END

-->SELECT TABLE STATS

SELECT

NAME,

ROWS,

replace(RESERVED,'KB','') as RESERVED,

replace(DATA, 'KB','') as DATA,

replace(INDEX_SIZE,'KB','') as INDEX_SIZE,

replace(UNUSED,'KB','') as UNUSED,

DBNAME,

PROCESSDATE

FROM

#TABLEINFO

WHERE

NAME<>'SYSDIAGRAMS'

-->DROP OUR TEMP TABLE

DROP TABLE #TABLEINFO

This statement will produce the following recordset:

Figure 5. SP_SPACEUSED

From this data source we can derive a schema that looks something like this:

Figure 6. Table Growth Schema

This schema represents a periodic snapshot that captures the table growth daily. If you have all of your inserts and updates for the each table logged, you can roll those up to the table/database/day level and add those as measures to the fact as well. From this schema you could produce daily table reports giving an overview of what the growth looks like:

Figure 7. Table Growth Report

These 4 system stored procedures give us a good place to start from in recording database metrics that pertain to scalability. The sky is basically the limit for how much you want to log, but this is a pretty good example of how simple it is to achieve.

Tuesday, May 28, 2013

Parse Delimited Values from a Column using a SSIS Script Component

I get some weird data sometimes from our source system developers. Not too long ago, I received a piece of data that had a column with a number of values that were semi-colon delimited and needed to be parsed out during ETL. I don't even know how they got the data in this format or why data, coming from a database, should ever look like this. Since this isn't a flat file, I can't use a flat file connection manager to parse by the delimiter. So, we'll have to use some custom code in a script component to do this.

Say we were given a piece of data that looks like this:

Figure 1. Auditor Assignments

This data represents the departments 3 different auditors are assigned to, delimited by a comma. Now, you would have hoped to have a piece of data that has the AUDITOR_ID and each individual department in separate rows, but this was delivered instead. In order to work with this data, we will have to parse out the departments.

Create/Populate Script:

CREATE TABLE AUDIT_ASSIGNMENT

(

AUDITOR_FIRST_NAME VARCHAR(25),

AUDITOR_LAST_NAME VARCHAR(30),

AUDITOR_ID INT,

ASSIGNED_DEPARTMENTS VARCHAR(100)

)

INSERT INTO AUDIT_ASSIGNMENT VALUES

('John', 'Smith', 10134, 'Finance, Marketing, Operations'),

('Bob', 'Jones', 10135, 'Accounting, Product Management, R&D'),

('Jane', 'Foley', 10136, 'HR, Sales, Account Management, Customer Support')

The Data Flow we're going to use in this example is going to look like this:

Figure 2. Data Flow

The OLE DB Source is going to be our AUDIT_ASSIGNMENT table. The Script Component is going to have 2 outputs, 1 that is going to push out data "as is" from the source and a second that is going to parse out the departments and pair them with the AUDITOR_ID of the auditor assigned. Ignore the Union All Components, they are there so I can add Data Viewers to the Data Flow.

We also need to set up the delimiter for this data using a SSIS variable. In this example we are using a comma, but this can be anything we want it to be, a semi-colon, pipe, etc.:

Figure 3. SSIS Variable

This will be used to configure our Script Component that will do the parsing. Drag a Script Component transformation from the tool box onto the Data Flow design surface. When prompted to select a script component type, choose Transformation:

Figure 4. Script Component Type

On the Script screen, make sure to add our SSIS Variable as a read only variable:

Figure 5. Script Component Script Screen

With this done, let's go to the Input Columns screen and select the AUDITOR_ID and ASSIGNED_DEPARTMENTS columns, since these are the ones we need to produce our data set that pairs auditors with departments:

Figure 6. Script Component Input Columns Screen

Next, we need to click on Input and Outputs to create a new output for our auditor/department pairing. Click on the Add Output button and give it a name, in this example I named it Output_Department. Under Output_Department, create 2 columns: AUDITOR_ID DT_I4 and DEPARTMENT DT_STR 50:

Figure 7. Script Component Inputs and Outputs Screen

Notice we are leaving alone the Output 0 output. This output is going to redirect the untransformed data from the OLE DB Source. Now we're ready to start coding, go back to the Script screen and click the Edit Script button. Paste the following code in main.cs:

#region Namespaces

using System;

using System.Data;

using System.Windows.Forms;

using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

using Microsoft.SqlServer.Dts.Runtime.Wrapper;

#endregion

#region Class

/// <summary>

/// This class parses out delimited values from a column in a database

/// </summary>

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]

public class ScriptMain : UserComponent

{

#region Methods

/// <summary>

/// This method is called once for every row that passes through the component from Input0.

///

/// Example of reading a value from a column in the the row:

/// string zipCode = Row.ZipCode

///

/// Example of writing a value to a column in the row:

/// Row.ZipCode = zipCode

/// </summary>

/// <param name="Row">The row that is currently passing through the component</param>

public override void Input0_ProcessInputRow(Input0Buffer Row)

{

try

{

//Get delimiter

char delimiter = (char)Variables.DELIMITER;

//Get parsed departments

string[] department = Row.ASSIGNEDDEPARTMENTS.Split(delimiter);

//Output new rows for each department

foreach (string dept in department)

{

OutputDepartmentBuffer.AddRow();

OutputDepartmentBuffer.AUDITORID = Row.AUDITORID;

OutputDepartmentBuffer.DEPARTMENT = dept.Trim();

}

catch (Exception e)

{

failComponent(e.ToString());

}

/// <summary>

/// Outputs an error message

/// </summary>

/// <param name="errorMsg">The exception message</param>

private void failComponent(string errorMsg)

{

bool fail = false;

IDTSComponentMetaData100 compMetadata = this.ComponentMetaData;

compMetadata.FireError(1, "Error Parsing Departments!", errorMsg, "", 0, out fail);

}

#endregion

}

#endregion

Lets go through and explain some of this code. First thing we want to do is override the public override void Input0_ProcessInputRow(Input0Buffer Row) method. For every record that comes in we want to parse out the departments into an array string[] department = Row.ASSIGNEDDEPARTMENTS.Split(delimiter) based on the delimiter we declared in our SSIS variable char delimiter = (char)Variables.DELIMITER. We then loop through the array and output each department/auditor id to the OutputDepartmentBuffer:

//Output new rows for each department

foreach (string dept in department)

{

OutputDepartmentBuffer.AddRow();

OutputDepartmentBuffer.AUDITORID = Row.AUDITORID;

OutputDepartmentBuffer.DEPARTMENT = dept.Trim();

}

Build, save and lets run the package and see what our outputs look like:

Figure 8. Data Viewers

As you can see, we were able to parse out each department and add it to its own output. Once these 2 outputs are sent to their data destinations we can use SQL to join them on AUDITOR_ID, if we need to project auditor attributes along with department attributes.