Thursday, May 30, 2013

Creating Daily Snapshots of Data Warehouse Growth

       When running ETL jobs, it's wise to track the number of records processed, inserted, updated, etc. This is easy to do with an ETL tool such as SSIS, and the data feeds into decisions made when planning resources for your data warehouse. In addition to this activity, it is also good to track the size of these databases to determine what kind of scalability you're going to need, and to forecast what the growth will be in a month, a year, 5 years, etc. You can track and model this data just like any other piece of data your data warehouse imports. It contains dimensions (databases, tables, dates) and facts (rows, data sizes, index sizes, etc.).

    To aid us with this, we can take advantage of a number of system stored procedures that come with SQL Server.

  • SP_HELPDB
    • Returns basic information about every database, or extended information about a single database when its name is passed as a parameter
  • SP_HELPFILE
    • Returns information about the physical files associated with the current database
  • SP_SPACEUSED
    • Returns information about a table, including row counts, disk space, index sizes, etc.
  • SP_MSFOREACHTABLE
    • This is an undocumented stored procedure that executes a statement against every table in the current database (see the example calls just below)
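     To get oriented, here is how each of these can be called on its own; the database and table names below are just placeholders to substitute with your own:

-->BASIC INFO FOR ALL DATABASES
EXEC SP_HELPDB
-->EXTENDED INFO FOR A SINGLE DATABASE
EXEC SP_HELPDB 'MYDATABASE'
-->FILE INFO FOR THE CURRENT DATABASE
EXEC SP_HELPFILE
-->SPACE USED BY A SINGLE TABLE
EXEC SP_SPACEUSED 'DBO.MYTABLE'
-->SPACE USED BY EVERY TABLE ('?' IS REPLACED WITH EACH TABLE NAME)
EXEC SP_MSFOREACHTABLE 'EXEC SP_SPACEUSED ''?'''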
     For database information, we can execute a SQL statement like this (the temp tables can be replaced with tables used for staging data):

-->DECLARATIONS

-->HOLDS DATABASE NAMES
DECLARE @TEMP_DB_NAME TABLE
(
       DB_KEY INT IDENTITY(1,1),
       DB_NAME VARCHAR(200)
)
-->HOLDS THE NUMBER OF DATABASES TO LOOP FOR
DECLARE @NUMBER_DBS INT
-->ITERATOR
DECLARE @COUNTER INT
-->DB NAME IN LOOP
DECLARE @DB_NAME VARCHAR(200)
-->LOADS THE TEMP TABLE WITH OUR DATABASE NAMES
INSERT INTO @TEMP_DB_NAME
SELECT
       NAME
FROM
       SYS.DATABASES 
WHERE
       DATABASE_ID>4
       AND HAS_DBACCESS(NAME)=1


-->INITIALIZE ITERATOR
SET @COUNTER=1
-->SETS THE NUMBER OF DATABASES WE HAVE
SELECT
       @NUMBER_DBS=MAX(DB_KEY) 
FROM
       @TEMP_DB_NAME

CREATE TABLE #DATABASEINFO  
(
       NAME VARCHAR(128),
       DB_SIZE NVARCHAR(13),
       OWNER VARCHAR(255),
       DBID SMALLINT,
       CREATED SMALLDATETIME,
       STATUS NVARCHAR(600),
       COMPATIBILITY_LEVEL TINYINT,
       PROCESSDATE DATETIME
)

-->EXTENDED DATABASE INFO
CREATE TABLE #DATABASEEXTINFO  
(
       NAME VARCHAR(128),
       FILEID SMALLINT,
       FILENAME NCHAR(260),
       FILEGROUP NVARCHAR(128),
       DBSIZE NVARCHAR(18),
       MAXDBSIZE NVARCHAR(18),
       GROWTH NVARCHAR(18),
       USAGE VARCHAR(9),
       DATABASENAME VARCHAR(128),
       PROCESSDATE DATETIME
)

-->LOOP
WHILE @COUNTER<= @NUMBER_DBS
BEGIN
SELECT
       @DB_NAME = DB_NAME
FROM
       @TEMP_DB_NAME
WHERE
       DB_KEY=@COUNTER
-->GET EXTENDED INFORMATION
EXEC('USE '+ QUOTENAME(@DB_NAME) + ' INSERT INTO #DATABASEEXTINFO (NAME, FILEID, FILENAME, FILEGROUP, DBSIZE, MAXDBSIZE, GROWTH, USAGE)
EXEC SP_HELPFILE')

UPDATE 
       #DATABASEEXTINFO
SET
       DATABASENAME=@DB_NAME
WHERE
       DATABASENAME IS NULL

SET @COUNTER= @COUNTER+1
END

-->ALL DATABASE GENERAL INFO
INSERT INTO #DATABASEINFO (NAME, DB_SIZE, OWNER, DBID, CREATED, STATUS, COMPATIBILITY_LEVEL)
EXEC SP_HELPDB 
-->REMOVE SYSTEM DATABASES
DELETE FROM #DATABASEINFO WHERE DBID<5

-->SET PROCESSDATE
UPDATE #DATABASEINFO SET PROCESSDATE=GETDATE()
UPDATE #DATABASEEXTINFO SET PROCESSDATE=GETDATE()

-->SHOW RESULTS
SELECT
       NAME,
       DB_SIZE,
       OWNER,
       DBID,
       CREATED,
       STATUS,
       COMPATIBILITY_LEVEL,
       PROCESSDATE
FROM
       #DATABASEINFO

SELECT
       NAME,
       FILEID,
       FILENAME,
       FILEGROUP,
       DBSIZE,
       MAXDBSIZE,
       GROWTH,
       USAGE,
       DATABASENAME,
       PROCESSDATE
FROM
       #DATABASEEXTINFO
        
-->DROP TEMP TABLES
DROP TABLE #DATABASEEXTINFO
DROP TABLE #DATABASEINFO

     This statement will produce two result sets that look something like this:

Figure 1. SP_HELPDB with No Parameter

Figure 2. SP_HELPFILE Results for Each Database

     These two data sources can then be used to populate the tables behind this schema:

Figure 3. Database Growth Schema
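
     Since the figure only shows the shape of the schema, here is one possible DDL sketch of it. The table and column names are hypothetical; adapt them to your own conventions:

-->HYPOTHETICAL DIMENSION AND FACT TABLES FOR DATABASE GROWTH
CREATE TABLE DIM_DATABASE_FILE
(
       FILE_KEY INT IDENTITY(1,1) PRIMARY KEY,
       DATABASE_NAME VARCHAR(128),
       FILE_NAME NVARCHAR(260),
       USAGE VARCHAR(9) -->DATA OR LOG
)

CREATE TABLE FACT_DATABASE_GROWTH
(
       DATE_KEY INT NOT NULL,
       FILE_KEY INT NOT NULL,
       FILE_SIZE BIGINT NOT NULL -->SIZE IN KB; ONE ROW PER FILE PER DAY
)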
    
      This schema represents a periodic snapshot that captures the database file and log file sizes daily. Summing FILE_SIZE and rolling up to the database level gives you the size of the entire database on a given day. If you have all of your inserts and updates for the entire database logged in another table, you can roll those up to the database/day level and add them as measures to the fact as well. From this schema you could produce daily database reports giving an overview of what the growth looks like:

Figure 4. Database Growth Report
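
     A report like the one above could be driven by a roll-up over the fact table sketched earlier, summing file sizes to the database/day level:

-->DATABASE SIZE PER DAY, ROLLED UP FROM THE FILE LEVEL
SELECT
       F.DATE_KEY,
       D.DATABASE_NAME,
       SUM(F.FILE_SIZE) AS DATABASE_SIZE
FROM
       FACT_DATABASE_GROWTH F
       INNER JOIN DIM_DATABASE_FILE D
              ON F.FILE_KEY=D.FILE_KEY
GROUP BY
       F.DATE_KEY,
       D.DATABASE_NAME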


     To get down to the granularity of individual tables, we can execute a SQL statement similar to this (the temp tables can be replaced with tables used for staging data):

-->DECLARATIONS

-->HOLDS DATABASES TO LOOP THROUGH
DECLARE @TEMP_DB_NAME TABLE
(
       DB_KEY INT IDENTITY(1,1),
       DB_NAME VARCHAR(200)
)
-->HOLDS DATA ABOUT EACH TABLE
CREATE TABLE #TABLEINFO (
       NAME VARCHAR(200),
       ROWS INT,
       RESERVED VARCHAR(255),
       DATA VARCHAR(20),
       INDEX_SIZE VARCHAR(20),
       UNUSED VARCHAR(250),
       DBNAME VARCHAR(50),
       PROCESSDATE DATETIME
       )
-->HOLDS THE NUMBER OF DATABASES TO LOOP FOR
DECLARE @NUMBER_DBS INT
-->ITERATOR
DECLARE @COUNTER INT
-->DB NAME IN LOOP
DECLARE @DB_NAME VARCHAR(200)

-->LOADS THE TEMP TABLE WITH OUR DATABASE NAMES
INSERT INTO @TEMP_DB_NAME
SELECT
       NAME
FROM
       SYS.DATABASES 
WHERE
       DATABASE_ID>4
       AND HAS_DBACCESS(NAME)=1

-->INITIALIZE ITERATOR
SET @COUNTER=1
-->SETS THE NUMBER OF DATABASES WE HAVE
SELECT
       @NUMBER_DBS=MAX(DB_KEY) 
FROM
       @TEMP_DB_NAME

-->PERFORMS AN INSERT OF TABLE STATS FOR EACH DATABASE
WHILE @COUNTER<= @NUMBER_DBS
BEGIN
SELECT
       @DB_NAME = DB_NAME
FROM
       @TEMP_DB_NAME
WHERE
       DB_KEY=@COUNTER

-->GET TABLE STATISTICS
EXECUTE(

       'USE '+ QUOTENAME(@DB_NAME) +

       ' INSERT INTO #TABLEINFO
 (
       NAME,
       ROWS,
       RESERVED,
       DATA,
       INDEX_SIZE,
       UNUSED
 )
       EXEC SP_MSFOREACHTABLE @COMMAND1="EXEC SP_SPACEUSED ''?''"')


UPDATE A
SET
       A.DBNAME=@DB_NAME,
       A.PROCESSDATE=GETDATE()
FROM
       #TABLEINFO A
WHERE
       A.DBNAME IS NULL
       AND A.PROCESSDATE IS NULL

SET @COUNTER= @COUNTER+1
END

-->SELECT TABLE STATS (RTRIM REMOVES THE SPACE LEFT BEHIND BY 'N KB')
SELECT
       NAME,
       ROWS,
       RTRIM(REPLACE(RESERVED,'KB','')) AS RESERVED,
       RTRIM(REPLACE(DATA,'KB','')) AS DATA,
       RTRIM(REPLACE(INDEX_SIZE,'KB','')) AS INDEX_SIZE,
       RTRIM(REPLACE(UNUSED,'KB','')) AS UNUSED,
       DBNAME,
       PROCESSDATE
FROM
       #TABLEINFO
WHERE
       NAME<>'SYSDIAGRAMS'

-->DROP OUR TEMP TABLE
DROP TABLE #TABLEINFO

This statement will produce the following result set:

Figure 5. SP_SPACEUSED
    
      From this data source we can derive a schema that looks something like this:

Figure 6. Table Growth Schema
     
      This schema represents a periodic snapshot that captures table growth daily. If you have all of your inserts and updates for each table logged, you can roll those up to the table/database/day level and add them as measures to the fact as well. From this schema you could produce daily table reports giving an overview of what the growth looks like:

Figure 7. Table Growth Report
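
     As one example of a report query, day-over-day growth per table can be computed with a windowed LAG (available in SQL Server 2012 and later). FACT_TABLE_GROWTH and its columns are hypothetical names, standing in for a fact table loaded from the statement above:

-->DAY-OVER-DAY ROW GROWTH PER TABLE
SELECT
       TABLE_KEY,
       DATE_KEY,
       ROW_COUNT,
       ROW_COUNT - LAG(ROW_COUNT) OVER (PARTITION BY TABLE_KEY ORDER BY DATE_KEY) AS ROW_GROWTH
FROM
       FACT_TABLE_GROWTH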
     
     These four system stored procedures give us a good starting point for recording database metrics that pertain to scalability. The sky is basically the limit for how much you want to log, but this is a good example of how simple it is to get started.

Tuesday, May 28, 2013

Parse Delimited Values from a Column using a SSIS Script Component

     I sometimes get weird data from our source system developers. Not too long ago, I received a piece of data that had a column with a number of semicolon-delimited values that needed to be parsed out during ETL. I don't even know how they got the data into this format, or why data coming from a database should ever look like this. Since this isn't a flat file, I can't use a Flat File Connection Manager to parse by the delimiter, so we'll have to use some custom code in a Script Component instead.

     Say we were given a piece of data that looks like this:


Figure 1. Auditor Assignments


This data represents the departments that three different auditors are assigned to, delimited by commas. Ideally, we would have received data with the AUDITOR_ID and each individual department in separate rows, but this was delivered instead. In order to work with this data, we will have to parse out the departments.

Create/Populate Script:

CREATE TABLE AUDIT_ASSIGNMENT
(
AUDITOR_FIRST_NAME VARCHAR(25),
AUDITOR_LAST_NAME VARCHAR(30),
AUDITOR_ID INT,
ASSIGNED_DEPARTMENTS VARCHAR(100)
)
GO

INSERT INTO AUDIT_ASSIGNMENT VALUES
('John', 'Smith', 10134, 'Finance, Marketing, Operations'),
('Bob',  'Jones', 10135, 'Accounting, Product Management, R&D'),
('Jane', 'Foley', 10136, 'HR, Sales, Account Management, Customer Support')
GO


     The Data Flow we're going to use in this example is going to look like this:


Figure 2. Data Flow
     
     The OLE DB Source is our AUDIT_ASSIGNMENT table. The Script Component is going to have two outputs: one that pushes the data out "as is" from the source, and a second that parses out the departments and pairs them with the AUDITOR_ID of the assigned auditor. Ignore the Union All components; they are only there so I can add Data Viewers to the Data Flow.

     We also need to set up the delimiter for this data using an SSIS variable. In this example we are using a comma, but it can be anything we want: a semicolon, a pipe, etc.:


Figure 3.  SSIS Variable

This will be used to configure the Script Component that does the parsing. Drag a Script Component from the toolbox onto the Data Flow design surface. When prompted to select a script component type, choose Transformation:


Figure 4. Script Component Type

On the Script screen, make sure to add our SSIS variable as a read-only variable:


Figure 5. Script Component Script Screen
     
     With this done, let's go to the Input Columns screen and select the AUDITOR_ID and ASSIGNED_DEPARTMENTS columns, since these are the ones we need to produce the data set that pairs auditors with departments:


Figure 6. Script Component Input Columns Screen

     Next, we need to click on the Inputs and Outputs screen to create a new output for our auditor/department pairing. Click the Add Output button and give it a name; in this example I named it Output_Department. Under Output_Department, create two columns: AUDITOR_ID (DT_I4) and DEPARTMENT (DT_STR, length 50):


Figure 7. Script Component Inputs and Outputs Screen

     Notice we are leaving the default Output 0 alone. This output is going to redirect the untransformed data from the OLE DB Source. Now we're ready to start coding: go back to the Script screen and click the Edit Script button. Paste the following code into main.cs:


#region Namespaces
using System;
using System.Data;
using System.Windows.Forms;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
#endregion

#region Class
/// <summary>
/// This class parses out delimited values from a column in a database
/// </summary>
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{

#region Methods
    /// <summary>
    /// This method is called once for every row that passes through the component from Input0.
    ///
    /// Example of reading a value from a column in the row:
    ///  string zipCode = Row.ZipCode
    ///
    /// Example of writing a value to a column in the row:
    ///  Row.ZipCode = zipCode
    /// </summary>
    /// <param name="Row">The row that is currently passing through the component</param>
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {

        try
        {
            //Get delimiter
            char delimiter = (char)Variables.DELIMITER;
            //Get parsed departments
            string[] department = Row.ASSIGNEDDEPARTMENTS.Split(delimiter);
         
            //Output new rows for each department
            foreach (string dept in department)
            {
                OutputDepartmentBuffer.AddRow();
                OutputDepartmentBuffer.AUDITORID = Row.AUDITORID;
                OutputDepartmentBuffer.DEPARTMENT = dept.Trim();
            }
        }
        catch (Exception e)
        {
            failComponent(e.ToString());
        }
    }

    /// <summary>
    /// Outputs an error message
    /// </summary>
    /// <param name="errorMsg">The exception message</param>
    private void failComponent(string errorMsg)
    {
        bool fail = false;
        IDTSComponentMetaData100 compMetadata = this.ComponentMetaData;
        compMetadata.FireError(1, "Error Parsing Departments!", errorMsg, "", 0, out fail);

    }
#endregion
}
#endregion

     Let's go through and explain some of this code. The first thing we do is override the public override void Input0_ProcessInputRow(Input0Buffer Row) method. For every record that comes in, we parse the departments into an array (string[] department = Row.ASSIGNEDDEPARTMENTS.Split(delimiter)) based on the delimiter we declared in our SSIS variable (char delimiter = (char)Variables.DELIMITER). We then loop through the array and output each department/auditor ID pair to the OutputDepartmentBuffer:


            //Output new rows for each department
            foreach (string dept in department)
            {
                OutputDepartmentBuffer.AddRow();
                OutputDepartmentBuffer.AUDITORID = Row.AUDITORID;
                OutputDepartmentBuffer.DEPARTMENT = dept.Trim();
            }

     Build and save the script, then run the package and see what our outputs look like:

Figure 8. Data Viewers
     As you can see, we were able to parse out each department into its own row on the new output. Once these two outputs are sent to their data destinations, we can use SQL to join them on AUDITOR_ID if we need to project auditor attributes along with department attributes.
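
     For example, if the two outputs landed in destination tables named AUDITOR and AUDITOR_DEPARTMENT (hypothetical names for this sketch), the join would look like this:

-->PAIR AUDITOR ATTRIBUTES WITH THEIR PARSED DEPARTMENTS
SELECT
       A.AUDITOR_FIRST_NAME,
       A.AUDITOR_LAST_NAME,
       D.DEPARTMENT
FROM
       AUDITOR A
       INNER JOIN AUDITOR_DEPARTMENT D
              ON A.AUDITOR_ID=D.AUDITOR_ID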