Monday, March 28, 2011

A One-Package, Generic SSIS Staging Process

I recently worked on a project that was flat file and staging intensive. By intensive I mean we had essentially 1.5 ETL developers and over 50 flat files to stage. We also had a very aggressive timeline. Two data marts with source data coming from 5 totally disparate and disconnected (hence the flat files) subsidiaries/source systems all over the world (think currency conversion, translation, master data management, etc.) and only three and a half months to get it done.

Was I going to create a package to stage each flat file? Ummm . . . no! So what was I to do? I was going to create a single, metadata driven package that would loop through a data set and populate my staging tables one by one. Below is a screenshot of all the components necessary to make this work. There is no Data Flow.

Before going into the details there are a few things to keep in mind.

  1. This solution is based on text files. It can easily be tailored to read from relational sources and also non-relational sources with a little more tweaking.
  2. What makes this design possible is a Script Task that alleviates the need for custom data flows for each extract file/table to staged.
  3. The solution assumes that the staging tables can hold more than one day's/load's worth of data and therefore adds an ExtractFileID column to each staging table and to the source data as it is being loaded.
  4. For the script task to work as-is it is necessary to create a staging table for each data source that will be staged. Each staging table must meet the following criteria:
    • The staging table must contain column names that match the data source column names exactly.
    • The staging table columns must have data types that are appropriate for the incoming data. If they are character data types they must be large enough that they will not truncate any incoming data.  
The metadata that drives the solution might be specific to your project so I won't go into it in this post. What you'll need to do is create the metadata tables necessary to create a data set that can by used by a Foreach Loop container that will loop through each file or table that needs to be staged.

However, you can test the Script Task by hard-coding a single file's metadata within the Script Task. That is exactly what I did to create the script. Once the script was working as expected I created the necessary metadata tables and data and passed the values into the script to make it more dynamic. Below is the method that does the bulk of the work (no pun intended). The entire script can be downloaded here.

Private Sub BulkCopyDelimitedFile(ByVal batchSize As Integer, _
            ByVal columnDelimiter As String, _
            ByVal extractFileID As Int32, _
            ByVal extractFileFullPath As String, _
            ByVal stageTableName As String)

        Dim SqlConnectionManager As Microsoft.SqlServer.Dts.Runtime.ConnectionManager

        'Set a connection manager object equal to the connection manager named "Stage"
        SqlConnectionManager = Dts.Connections("Stage")

        'Since the "Stage" connection manager is of type OLEDB we need to modify it to make it compatible with an ADO.NET connection
        Dim dbConn As SqlConnection = New SqlConnection(SqlConnectionManager.ConnectionString.Replace("Provider=SQLNCLI10.1;", ""))

        'Create a new StreamReader object and pass in the path of the file to be read in
        Dim sr As StreamReader = New StreamReader(extractFileFullPath)

        'Read in the first line of the file
        Dim line As String = sr.ReadLine()

        'Create an array of strings and fill it with each column within the extract file by using the Split function
        Dim strArray As String() = line.Split(columnDelimiter)

        'Create a DataTable object
        Dim dt As DataTable = New DataTable()

        'Create a DataRow object
        Dim row As DataRow

        'Open the database connection

        'Create a SQLTransaction object in which to execute the Bulk Copy process so that if it fails, all writes are rolled back
        Dim bulkCopyTransaction As SqlTransaction = dbConn.BeginTransaction()

        'Instantiate our SqlBulkCopy object and set appropriate properties
        Dim bc As SqlBulkCopy = New SqlBulkCopy(dbConn, SqlBulkCopyOptions.Default, bulkCopyTransaction)

        'Set the BulkCopy destination table name equal to the staging table name provided by the Foreach Loop Container
        bc.DestinationTableName = stageTableName

        'Set the BulkCopy batch size equal to the size contained in the metadata provided by the Foreach Loop Container
        bc.BatchSize = batchSize

        'For each column that was found in the extract file
        For Each columnName As String In strArray
            'Add a column in the data table
            dt.Columns.Add(New DataColumn(columnName))

            'Add a column mapping in the SqlBulkCopy object
            bc.ColumnMappings.Add(columnName, columnName)

        'Add the ExtractFileID column to the data table since it doesn't exist in the extract file and wasn't added in the previous For Each loop
        dt.Columns.Add(New DataColumn("ExtractFileID", System.Type.GetType("System.String")))

        'Add the ExtractFileID column mapping since it doesn't exist in the extract file and wasn't added in the previous For Each loop
        bc.ColumnMappings.Add("ExtractFileID", "ExtractFileID")

        'Move the the first row in the extract file after the header
        line = sr.ReadLine()

        Dim rowCount As Integer

        'Loop through all rows in the extract file
        While Not line = String.Empty
            rowCount += 1

            'Create a new data table row to store the extract file's data in
            row = dt.NewRow()

            'Add all column values to the row
            row.ItemArray = line.Split(columnDelimiter)

            'Add the ExtractFileId 
            row("ExtractFileID") = extractFileID

            'Add the newly created row to the data table

            'Move to the next row in the extract file
            line = sr.ReadLine()
        End While

            'Write the data table to the staging table

            'If successful, commit the transaction
        Catch ex As Exception
            'If an exception occurs, rollback the transaction

            'Set the Task result to Failure
            Dts.TaskResult = ScriptResults.Failure

            Throw ex
        End Try

        'Close the database connection

        'Close the BulkCopy object

        'Set the SSIS metadata variable equal to the number of rows loaded
        Dts.Variables("ExtractActualRowCount").Value = rowCount
    End Sub

Download script

SSIS Interview Questions

Below are some questions I recently asked while interviewing for an SSIS developer. Thought they might be helpful to others looking to hire. Answers to follow . . .
  1. Have you used SSIS in a project/product type of environment as opposed to an operational/once-off type of an environment?
  2. Have you worked with a hierachical/modular type of SSIS project with packages calling other packages? If so, did you pass values back and forth between the packages and if so, how?
  3. Have you dealt with package versioning and if so, how?
  4. Have you used any of the package configuration functionality within SSIS? If so, which types and for what types of values?
  5. Have you used any kind of restart mechanisms in SSIS? If so, which one(s)?
  6. Have you developed packages that could be migrated from one environment to another without any code changes (directory locations, database connections, etc.)? If so, how?
  7. Which data flow transformations have you used?
  8. Are you familiar with dimensional/star-schema modeling and concepts such as junk dimensions and slowly changing dimensions?
  9. Have you used the Slowly Changing Dimension transformation? If so, what if any issues did you encounter while using it?
  10. What kind of logging have you used with SSIS, if any?
  11. Have you used package event handlers for any kind of processing logic?
  12. What types of debugging tools/methods have you used with SSIS?
  13. Have you used script components and if so, what language are you most comfortable with?
  14. Where have you stored/deployed your production SSIS packages (file system or msdb)? What is your preference and why?

Wednesday, March 16, 2011

SSIS Error On OLE DB Command - Operand type clash: int is incompatible with date

Here's a good one - Operand type clash: int is incompatible with date.

It appears, although I have not fully confirmed, that SSIS does not play well with the new SQL Server DATE data type.

I ran into this issue when calling a stored procedure from an OLE DB Command transformation. The stored procedure had a parameter of data type DATE.

While not entirely satisfying, the quick fix for this is to change the parameter data type to DATETIME.

An issue was opened with Microsoft but they state they have not been able to reproduce the error.

Tuesday, March 15, 2011

VS Studio 2010 Table Designer

Embarrassingly enough this one eluded me for quite some time. I was surprised to stumble onto the fact that there is indeed table designer functionality within Visual Studio. It is very much comparable to the SQL Server Management Studio table designer.

However, as far as I can tell, it is dependent on a database server connection which goes against one of Microsoft's biggest database project selling points - disconnected development.

If you're like me and haven't been able to find this functionality I included a screen shot below to help you find it. The key is the Server Explorer tab.

  1. Create a connection if you don't already have one.
  2. Expand your connection and right-click on Tables -> Add New Table.

Once you have defined your table you can use the Table Designer - > Generate Change Script . . . menu option to create a DDL script. If you save the definition it will actually save it to the database.