Wednesday, June 22, 2011

SQL Bulk Copy

In an earlier post I talked about a generic SSIS package for staging data using SQL Bulk Copy. I have been using it extensively without fail . . . until just recently.

I got a new set of files from a new business unit that we were bringing into our data warehouse solution. Of their 20 files 14 loaded without issue and the remaining 6 failed with an error similar to the one below.

System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.InvalidOperationException: The given value of type String from the data source cannot be converted to type int of the specified target column.

I kept going back to the developers that created the extract process and telling them that they were doing something wrong with their encoding or delimiters but neither us nor them could actually find the issue. Until now.

Turns out the issue is related to SQL Bulk Copy, numeric columns, and null values. Apparently SQL Bulk Copy does not, by default, keep nulls. Instead, it converts them to empty strings. Well, as I'm sure you know, SQL Server doesn't like it when you try to load empty strings, or any strings for that matter, into a numerically typed column.

There are a couple of solutions.

  1. Make sure to replace all nulls with zeros when created flat files with numeric data that will be consumed by BCP.
  2. Use the KeepNulls option when executing BCP. In my SSIS solution I changed my SqlBulkCopy object creation to look like this . . .
Dim bc As SqlBulkCopy = New SqlBulkCopy(dbConn, SqlBulkCopyOptions.KeepNulls, bulkCopyTransaction)

Tuesday, June 14, 2011

Data Warehouse QA Technical Screen Questions

I was recently asked to perform a technical phone screen of a data warehouse QA candidate. The project manager wanted to ensure that this person knew enough SQL to be able to efficiently test our data warehouse contents (not end-to-end processing testing).

So that I didn't sound like a bumbling fool when I made the call I put together a list of questions. It's by no means exhaustive but I figured it could be helpful to some . . .

When was the last time you used SQL on a day to day basis (how often do they use it and how rusty might they be)?

How would you rate your SQL from 1-10 (sets expectations for how well they will answer questions below)?

What database platforms/versions have you used (Oracle, SQL Server, MS Acess, MySQL, etc.)?

1) What are the two keywords necessary for any SQL statement?

2) How do you limit the results of a query based one or more criterion?
A1) WHERE clause

  What is a join?
A1) It Combines records from two or more tables.

4)      What types of joins are there and what do they do?

5)      What is the syntax for joining two tables?
 A1) FROM tableA
INNER JOIN tableB ON tableA.column1 = tableB.column2

6)      How do you get a single instance of each value of a column in a table?
            How do you get the number of occurrences of each distinct value in a table?
            How do you get the largest value of a column in a table?
9) What is the difference between putting criteria in a WHERE clause or a JOIN clause?
A1) When it is an INNER JOIN it does not affect the results.      
A2) When it is an OUTER JOIN, putting the criteria in the WHERE clause effectively makes it an INNER JOIN which is not usually the intended behavior.
A3) When it is an OUTER JOIN, putting the criteria in the JOIN clause allows the criteria to be applied to only those rows that are outer-joined and does not create the INNER JOIN – effect.
A4) Depending on the database platform and version, the performance could be affected.
10)   What is the difference between a WHERE clause and a HAVING clause?
A1) The WHERE clause applies the criteria at the row level whereas the HAVING clause applies it at the aggregate level.

11)   What is a derived table/inline view?
A1) It is a query that is aliased as a table and used like any other table in a FROM clause.

12)   What is a nested subquery?
A1) It is a query that is nested within a SELECT or a WHERE clause.

13)   What is a correlated subquery and when are they useful?
A1) It is a query that is used within a parent query and refers to the parent query.
A2) A correlated subquery is evaluated once for each row processed by the parent statement.
A3) They are useful in providing a set of values to be used in a WHERE clause.

Our data warehouse solution consists of the following.
       1)      Flat files that provide all source data
       2)      A staging database that the flat files are loaded into
       3)      A data warehouse database that contains the transformed data based on a documented set of rules.

1)      If you could assume that the flat files were loaded to the staging database without any errors, how would you test that the data was correctly transformed from the staging database to the data warehouse? What tools, if any, would you need?
A1) Write queries against the staging database that follow the transformation rules. Then write queries to pull the related rows from the
A2) Some kind of a query execution tool is necessary. Toad, RapidSQL, SQL Server Management Studio, etc.
2)      How do you test large sets of data where there could be millions of rows?
A1) Narrow down the results based on specific criteria.
A2) Apply aggregate functions such as SUM to determine if the test query aggregates to the same value as the data warehouse query.
A3) Join the test query/tables to the data warehouse query/tables and apply a WHERE clause that filters on only those values that do not equal each other.

3)      What data would you provide to the development team when you found inconsistencies between the documented transformations and what was in the data warehouse?
A1) The transformation rule
A2) The test query
A3) The data warehouse query (if independent from above)
A4) Explanation of expected versus actual results