Wednesday, September 30, 2009

How to read data from Excel Sheet in SQL Server?


The agenda for today's session is to discuss how to read data from an Excel sheet in SQL Server.

We all must have faced a scenario where we receive Excel sheet data from a user and need to either update the contents in the database or do some calculations using that data. Most of the time the user is unaware of the codes/IDs used as primary keys, so they provide only textual data.

The scenario, then, is to read data from the Excel sheet and join it with existing tables to get the proper data or do some calculations.

There are many ways to achieve this:

Let's do it without using SSIS / DTS. Let's use OpenRowset function to do this.


1. The Excel sheet must be available on the SQL Server machine.
2. The 'Ad Hoc Distributed Queries' option must be enabled. By default, OpenRowset and OpenQuery are disabled; we need to enable them to read data from external sources.

There are 2 ways in which you can enable OpenRowset and OpenQuery functions.

One way is to do it using SQL Server Surface Area Configuration Tool/Utility.

This tool can also be accessed by typing SQLSAC.EXE from Command Line or
using menu:

Following window will appear:

Once the "Surface Area Configuration for Features" link within the SQL Server 2005 Surface Area Configuration screen is clicked, a new screen opens with the list of SQL Server 2005 Database Engine, Analysis Services and Reporting Services features. These features include CLR Integration, DAC, Database Mail, Service Broker, xp_cmdshell, Linked Objects etc., which are turned off by default within SQL Server 2005.

As you can see the option is unchecked; once it is checked, we will be able to use this feature.

Another way to do this is using T-SQL sp_configure command.

The OPENROWSET and OPENDATASOURCE functions support ad hoc connections to remote data sources without using linked or remote servers. Enable this feature either by using SQLSAC or by executing the below TSQL code.

USE master
GO
sp_configure 'show advanced options', 1
RECONFIGURE
GO
sp_configure 'Ad Hoc Distributed Queries', 1
RECONFIGURE
GO

Now that Ad Hoc Distributed Queries are enabled, we can start fetching data from the Excel sheet.

SELECT *
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Excel 8.0;DATABASE=C:\TestData.xls', 'Select * from [Sheet1$]')

OPENROWSET takes the provider name as its 1st parameter.
The 2nd parameter is the data source / connection string; a User Id and Password can also be passed here.
The 3rd parameter is the query to run against that source.

This will display all records from the Excel sheet.
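To make the three parameters concrete, here is a minimal sketch; the file path, sheet name and the HDR setting below are placeholders, not values from a real setup:

```sql
-- Sketch: OPENROWSET with provider, connection string and query.
-- C:\TestData.xls and [Sheet1$] are placeholder names; HDR=YES tells the
-- Jet provider that the first row of the sheet holds column headers.
SELECT *
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
                'Excel 8.0;Database=C:\TestData.xls;HDR=YES',
                'SELECT * FROM [Sheet1$]');
```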

The following example accesses the Customers table in the Microsoft Access Northwind database through the Microsoft OLE DB Provider for Jet.

SELECT CustomerID, CompanyName
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'C:\Program Files\Microsoft Office\OFFICE11\SAMPLES\Northwind.mdb';
'admin';'', Customers);

The following example selects all data from the Customers table in the local SQL Server Northwind database and from the Orders table in the Access Northwind database stored on the same computer.

USE Northwind
GO
SELECT c.*, o.*
FROM Northwind.dbo.Customers AS c
INNER JOIN OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'C:\Program Files\Microsoft Office\OFFICE11\SAMPLES\Northwind.mdb';'admin';'', Orders)
AS o
ON c.CustomerID = o.CustomerID

The following example creates a small table for demonstration purposes, and inserts file data from a file named Text1.txt located in the C: root directory into a varbinary(max) column.

USE AdventureWorks
CREATE TABLE myTable(FileName nvarchar(60), 
FileType nvarchar(60), Document varbinary(max))

INSERT INTO myTable(FileName, FileType, Document)
SELECT 'Text1.txt' AS FileName,
'.txt' AS FileType,
* FROM OPENROWSET(BULK N'C:\Text1.txt', SINGLE_BLOB) AS Document;

How to do the same in SQL Server 2008?

You may have become comfortable with the Surface Area Configuration tool in SQL Server 2005 and expect to use it for configuration once you install SQL Server 2008. Well, the tool no longer exists in SQL Server 2008: it has been added to the deprecated feature list.

So, how can you go about making configuration changes via the GUI? It has been divided up into different tools now:

Right click on the instance and select Facets as shown in the image below:

In the next window, scroll down and select "Surface Area Configuration"

Now you can set the different permissions using SSMS, but you can still use T-SQL, by executing the following commands:

USE master
GO
sp_configure 'show advanced options', 1
RECONFIGURE
GO
sp_configure 'Ad Hoc Distributed Queries', 1
RECONFIGURE
GO

The other way to read data from Excel is to use DTS/SSIS.

Hope this information is useful, let me know your feedback / comments.



Wednesday, September 23, 2009

Singapore SQL UG Meeting on 24th Sep


It's the last Thursday of the month and we have scheduled our Singapore SQL UG meeting for 24th Sep 2009 at Microsoft.

Venue: Microsoft Office, 22nd Floor One Marina Boulevard

Time: Session starts promptly at 7PM. Registration starts at 6:30 PM

Session 1:

Topic: Self Service BI

Speaker: Sudhir Chawla

In his session, Sudhir will discuss one of the features we can expect in the next version of SQL Server -- SQL Server 2008 R2.

Known as "Project Gemini", SQL Server 2008 R2 gives Excel users real self-service business intelligence. Users can import data into Excel directly from different sources, relate the data sets to each other, and share and evaluate interactive pivot tables and charts. The session will cover how to manage relationships, use DAX, and perform analysis using PivotTables and charts.

Session 2:

Topic: Data warehouse “how to” design session

Speaker: Prem Pandey

In our effort to empower our user group members for Business Intelligence, we have come up with 101 sessions that will jumpstart your knowledge for Business Intelligence. For this session we will focus on designing your data warehouse. Part 1 will purely focus on “dimension design” and Part 2 which will happen in future user group sessions will focus on fact/measure designs.

Here’s a quick overview of what you can expect from this session:

  • Understanding star and snowflake schema

  • Data Source Views

  • Creating AS dimensions for:

  • Star Schema

  • Snowflake Schema

  • Basic Elements of AS Dimension Design

  • Registration:

    Send your registration at:

    Wednesday, September 16, 2009

    Basic SQL Questions


    Today we will discuss some basic SQL questions.

    Q. What is normalization ?
    Ans: Normalization is the process of reducing the redundancy in data in the RDBMS.

    Q. How to create a table structure, but not the data, from another table in the same database, in one query ?
    Ans: SELECT * INTO table2 FROM table1 WHERE 1 = 2
    Here we can give any WHERE clause that does not return any rows.

    Q. What is a database transaction ?
    Ans: A database transaction is the smallest unit of work in a database.

    Q. What are the properties of a database transaction ?
    Ans: A database transaction always complies with the ACID properties -- Atomicity, Consistency, Isolation and Durability.

    Q. What are the different Isolation levels ?
    Ans: Read uncommitted, Read committed, Repeatable Read and Serializable.
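As a quick illustration (the Customers table and customer_id column here are placeholders), an isolation level is set per session before the transaction begins:

```sql
-- Sketch: choosing an isolation level for the current session.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN TRANSACTION;
    -- Rows read here keep their shared locks until COMMIT,
    -- so rereading them inside the transaction returns the same values.
    SELECT * FROM Customers WHERE customer_id = 1;
COMMIT TRANSACTION;
```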

    Q. What is the difference between DELETE and TRUNCATE ?
    Ans: DELETE is a fully logged operation, so each row deleted is recorded in the transaction log. TRUNCATE is a minimally logged operation, so individual rows are not recorded in the transaction log. The logging makes DELETE slower than TRUNCATE.
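A minimal sketch of the two statements (Orders and OrderStaging are placeholder tables):

```sql
-- DELETE: fully logged, row by row; supports a WHERE clause.
DELETE FROM Orders
WHERE order_date < '20080101';

-- TRUNCATE: minimally logged; removes all rows at once,
-- deallocating the data pages instead of logging each deleted row.
TRUNCATE TABLE OrderStaging;
```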

    Q. Compare UDFs and stored procedures.
    Ans: UDFs can be used in SELECT, WHERE, JOIN or CASE statements, while stored procedures cannot. UDFs cannot contain DML statements that modify data, while stored procedures can.

    Q. When do you use cursor ?
    Ans: If you have to process one row at a time and loop through a set of rows, you can use cursors.
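A minimal cursor sketch (the Customers table and customer_name column are placeholder names):

```sql
DECLARE @name VARCHAR(35);

DECLARE cur CURSOR FOR
    SELECT customer_name FROM Customers;

OPEN cur;
FETCH NEXT FROM cur INTO @name;

WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @name;                       -- process one row at a time
    FETCH NEXT FROM cur INTO @name;
END

CLOSE cur;
DEALLOCATE cur;
```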

    Q. What are the special tables that can be used only within a trigger ?
    Ans: "Inserted", to check the rows inserted by the triggering statement, and "Deleted", to check the rows deleted by it.

    Q. What is the difference between clustered and non-clustered index ?
    Ans: A clustered index is a special kind of index in which the leaf nodes of the B-tree hold the actual data. A clustered index physically sorts the data, whereas a non-clustered index sorts it logically. In a clustered index the leaf pages are the data pages, whereas in a non-clustered index the leaf pages point to the data pages.

    Q. What is new in date and time in SQL Server 2008 ?
    Ans: In previous versions of SQL Server, date and time could not be stored separately. In 2008, there are two new separate data types -- date and time.
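For example, the new types can be declared and used separately:

```sql
-- New in SQL Server 2008: date and time as separate types.
DECLARE @d DATE = '2009-09-16';   -- date only, no time portion
DECLARE @t TIME(7) = '18:30:00';  -- time only, with fractional seconds

SELECT @d AS JustDate, @t AS JustTime;
```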

    Q. What are the new spatial datatypes in sql server 2008 ?
    Ans: Geography and Geometry. Flat data -- such as planar is supported by the geometry data type. Round earth data, such as latitude and longitude, is supported by the geography data type.

    Q. What is the new hierarchyid data type in SQL Server 2008 ?
    Ans: hierarchyid is a new data type introduced to easily store hierarchical data, such as an organisational structure.
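A small sketch of the idea (the Org table and its columns are placeholders):

```sql
-- Sketch: storing an organisational tree with hierarchyid.
CREATE TABLE Org (
    node     hierarchyid PRIMARY KEY,
    emp_name VARCHAR(50) NOT NULL
);

INSERT INTO Org VALUES (hierarchyid::GetRoot(), 'CEO');
INSERT INTO Org VALUES (hierarchyid::GetRoot().GetDescendant(NULL, NULL), 'Manager');

-- ToString() renders each node's position in the tree, e.g. '/' and '/1/'.
SELECT node.ToString() AS path, emp_name FROM Org;
```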

    Q. What is a sparse column in SQL Server 2008 ?
    Ans: A sparse column is a special type of column that has been optimized for columns with a lot of NULL values. It is recommended to declare a column sparse if the column has, or is expected to have, more than 20% NULLs.
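For example, a mostly-NULL column can be declared sparse (the table and column names are placeholders):

```sql
CREATE TABLE Documents (
    doc_id         INT NOT NULL PRIMARY KEY,
    title          VARCHAR(100) NOT NULL,
    review_comment VARCHAR(500) SPARSE NULL  -- optimized storage for NULLs
);
```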

    Q. What is new in Filtered Indexes and Statistics in SQL Server 2008 ?
    Ans: In 2008, we can create filtered indexes and statistics on a specific, well-defined subset of rows -- for example, only the rows where a column is not NULL.
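A sketch of a filtered index limited to the non-NULL rows (the index, table and column names are placeholders):

```sql
CREATE NONCLUSTERED INDEX IX_Documents_ReviewComment
ON Documents (review_comment)
WHERE review_comment IS NOT NULL;  -- index covers only this subset of rows
```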

    Q. What is new in SQL Server 2008 TDE ?
    Ans: There is a new option for the whole database to be automatically encrypted using Transparent Data Encryption (TDE).

    Q. What is new in backup in SQL Server 2008 ?
    Ans: Backup Compression in SQL Server 2008 supports compressing backups and storing them in less disk space.

    Q. What is new in Auditing in SQL Server 2008 ?
    Ans: Customized audits of database events can be created using the SQL Server 2008 Audit functionality.

    Q. What is Resource Governor in SQL Server 2008 ?
    Ans: The Resource Governor is a tool given by SQL Server 2008 to control and allocate CPU and memory resources depending on the priority of applications.

    Q. What is External Key Management in SQL Server 2008 ?
    Ans: Storing encryption keys separately from the encrypted data is best for security. External Key Management enables this and allows keys to be stored by a third party.

    Q. What are Table-Valued Parameters in SQL Server 2008 ?
    Ans: Table-Valued Parameters (TVPs) let stored procedures accept an entire table as an input parameter, rather than just one value at a time as in earlier versions of SQL Server.
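A minimal TVP sketch (the type, procedure and table names are placeholders):

```sql
-- 1. Define a table type.
CREATE TYPE OrderIdList AS TABLE (order_id INT NOT NULL PRIMARY KEY);
GO
-- 2. Use it as a stored procedure parameter (TVPs must be READONLY).
CREATE PROCEDURE GetOrders
    @ids OrderIdList READONLY
AS
SELECT o.*
FROM Orders AS o
JOIN @ids AS i ON o.order_id = i.order_id;
GO
-- 3. Fill a variable of that type and pass the whole table at once.
DECLARE @list OrderIdList;
INSERT INTO @list VALUES (1), (2), (3);
EXEC GetOrders @list;
```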

    Q. What is a Large User Defined Type in SQL Server 2008 ?
    Ans: In SQL Server 2008, large user-defined types (UDTs) allow custom user-defined data types up to 2 GB in size, compared to the previous 8 KB limit.

    Q. What is new in backup compression in SQL Server 2008 ?
    Ans: Backups can be compressed in SQL Server 2008 without compressing the database itself. The data is automatically decompressed when it is restored.

    Q. What is hot-add CPUs in SQL Server 2008 ?
    Ans: One of the new features in SQL Server 2008 is hot-add CPUs. This feature allows extra CPUs to be added to the server without having to shut it down.

    Q. What is special about data encryption in SQL Server 2008 ?
    Ans: In SQL Server 2008, data can be encrypted and stored securely using Transparent Data Encryption.

    Q. Name one CLR improvement in SQL Server 2008.
    Ans: Common Language Runtime (CLR) table-valued functions can now include an ORDER clause in CREATE FUNCTION, which lets queries run more efficiently.

    Q. What is a spatial index in SQL Server 2008 ?
    Ans: SQL Server 2008 introduced support for spatial data types as a new feature. An index created on a spatial column of a table is called a spatial index.

    Q. What is the maximum number of columns per select or insert statement in SQL Server 2008 ?
    Ans: 4096

    Q. What is the maximum Database size in SQL Server 2008 ?
    Ans: 524,272 terabytes

    Q. What is the maximum number of databases per instance of SQL Server 2008 ?
    Ans: 32,767

    Q. What is maximum number of non-clustered index per table in SQL Server 2008 ?
    Ans: 999

    Q. What is the maximum number of parameters in stored procedures or user defined functions in SQL Server 2008 ?
    Ans: 2,100

    Q. What is the nesting maximum for stored procedures/triggers/subqueries in SQL Server 2008.
    Ans: 32

    Q. What is the maximum bytes per row in SQL Server 2008 ?
    Ans: 8,060

    Q. What is the use of sp_who ?
    Ans: Currently running users and processes in SQL Server can be found by running sp_who.

    Q. What is sp_who2 ?
    Ans: It shows a little more than sp_who, including locking, blocking, and what the logged-in users are doing.

    Q. What is an INSTEAD OF trigger in SQL Server ?
    Ans: INSTEAD OF triggers are triggers attached to a table in which the code inside them is executed in place of the original INSERT/DELETE/UPDATE statement.

    Q. What is DBCC DBREINDEX in SQL Server 2008 ?
    Ans: DBCC DBREINDEX is used to rebuild the indexes of a table. Please note that it is deprecated in SQL Server 2008 and will be removed in a future version; it is replaced by the REBUILD option of ALTER INDEX.
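Side by side (dbo.Orders is a placeholder table):

```sql
-- Deprecated form:
DBCC DBREINDEX ('dbo.Orders');

-- Replacement in SQL Server 2008:
ALTER INDEX ALL ON dbo.Orders REBUILD;
```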

    Q. What is a Common Table Expression (CTE) ?
    Ans: A Common Table Expression can be thought of as a temporary result set, but it has a lot of power, including recursive usage. For example:

    WITH MYCTE (ID, NumberOfOrders) AS
    (SELECT CustomerID, COUNT(*)
     FROM Orders
     GROUP BY CustomerID)
    SELECT * FROM MYCTE

    Q. What is row_number() in SQL Server ?
    Ans: row_number() returns the sequential number of each row in a result set. It also has a partition clause that restarts the sequential numbering with every partition.
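For example, numbering each customer's orders by date (Orders and its columns are placeholder names):

```sql
SELECT customer_nbr,
       order_date,
       ROW_NUMBER() OVER (PARTITION BY customer_nbr
                          ORDER BY order_date) AS rn  -- restarts at 1 per customer
FROM Orders;
```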

    Q. What is newid() in SQL Server ?
    Ans: NEWID() creates a unique value. The type of value returned by NEWID() is uniqueidentifier.

    Q. What are the differences between local and global temporary tables ?
    Ans: Local temporary tables are named with a # prefix (e.g. #temp) and global ones with ## (e.g. ##temp). A local temporary table can be seen/accessed only by the session that creates it, and it is deleted when that session disconnects from the current instance of SQL Server. A global temporary table can be seen/accessed by any user and is deleted only when all sessions referencing the table disconnect from the instance.
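The naming difference in a two-line sketch:

```sql
CREATE TABLE #temp  (id INT);  -- local: visible only to the creating session
CREATE TABLE ##temp (id INT);  -- global: visible to all sessions
```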

    Q. What is a Control Flow element in SSIS ?
    Ans: A Control Flow element is one that performs a function, provides structure, or controls the flow of the elements. There must be at least one Control Flow element in an SSIS package.

    Q. What are report definition and client report definition files ?
    Ans: Report definition files, with the .rdl extension, are created when you create a report in Report Builder. They may contain data sources, queries, the report layout, data and other necessary reporting needs. Client report definition files, with the .rdlc extension, are created when you create a report using the Visual Studio Report Designer for use with the ReportViewer control.

    Q. What is a Rendered Report ?
    Ans: The Report Server processes the published report into a viewable output format that contains the necessary data and the designed layout. This is called the rendered report, which can be HTML, XML or any other suitable format.

    Q. What are the different types of reports that can be created by SSRS 2008 ?
    Ans: Ad hoc reports, Cached reports, Clickthrough reports, Drilldown reports, Drillthrough reports, Linked reports, Parameterized reports, Snapshot reports and Subreports.

    Q. What is a data region in SSRS 2008 ?
    Ans: A data region is the area of the report where data from the dataset is displayed, arranged in rows and columns.

    Q. What is a Tablix ?
    Ans: Tablix is the common term for a data region that supports multiple rows and columns. The rows can be static or dynamic. The Table, Matrix and List data regions are all based on the tablix.

    Will post more later.

    Happy SQL Coding.

    Tuesday, September 15, 2009

    Ten Common SQL Programming Mistakes


    Today I will share 10 common SQL programming mistakes.

    Programming in SQL can be both a fun and a challenging task. Those with a background in traditional programming languages such as Java, C, C++, and VB often find it difficult to adapt to the "set-based" mindset. Even seasoned SQL developers and DBAs can get trapped in one of the many pitfalls of the SQL language. Mastering the fundamentals takes time, and even then some mistakes are hard to spot.

    The intention of this article is to highlight some of the more common mistakes that people make when coding SQL. The list could be regarded as the result of the following query:

    SELECT TOP(10) mistake
    FROM CommonSQLProgrammingMistakes

    And the result is :

  • NULLs and the NOT IN predicate

  • Functions on indexed columns in predicates

  • Incorrect subquery column

  • Data type mismatch in predicates

  • Predicate evaluation order

  • Outer joins and placement of predicates

  • Subqueries that return more than one value

  • Use of SELECT *

  • Scalar user-defined functions

  • Overuse of cursors

    NULLs and the NOT IN predicate

    One of the most common requests is to retrieve data based on some column value not included in a list of values. The following two tables illustrate the scenario. We have tables with colors and products:

    Colors table:

    color
    ------
    Red
    Green
    Blue
    Black

    Products table:
    sku product_description color
    ---- -------------------- ------
    1 Ball Red
    2 Bike Blue
    3 Tent NULL

    Note that these tables do not represent a perfect design, following normalization rules and best practices. Rather, it is a simplified scenario to help illustrate this example better. In reality, the colors table would most likely contain a color code key column that would be referenced in the products table.

    The request is to select a list of colors that have not previously been used on products. In other words, we need to construct a query that returns only those colors for which there is no product with that color. It might seem, at first glance, that the NOT IN predicate provides a very intuitive way to satisfy this request, very close to how the problem would be stated in plain English:

    SELECT C.color
    FROM Colors AS C
    WHERE C.color NOT IN (SELECT P.color
    FROM Products AS P);

    You may have been expecting this query to return two rows (for 'black' and 'green') but, in fact, it returns an empty result set:


    (0 row(s) affected)

    Obviously this is 'incorrect'. What is the problem? It's simply that SQL uses three-valued logic, driven by the existence of NULL, which is not a value but a marker to indicate missing (or UNKNOWN) information. When the NOT operator is applied to the list of values from the subquery, in the IN predicate, it is translated like this:

    "color NOT IN (Red, Blue, NULL)"

    This is equivalent to:

    "NOT(color=Red OR color=Blue OR color=NULL)"

    The expression "color=NULL" evaluates to UNKNOWN and, according to the rules of three-valued logic, NOT UNKNOWN also evaluates to UNKNOWN. As a result, all rows are filtered out and the query returns an empty set.
    This mistake will often surface if requirements change, and a non-nullable column is altered to allow NULLs. It also highlights the need for thorough testing. Even if, in the initial design, a column disallows NULLs, you should make sure your queries continue to work correctly with NULLs.

    One solution is to use the EXISTS predicate in place of IN, since EXISTS uses two-valued predicate logic evaluating to TRUE/FALSE:

    SELECT C.color
    FROM Colors AS C
    WHERE NOT EXISTS (SELECT *
                      FROM Products AS P
                      WHERE C.color = P.color);

    This query correctly returns the expected result set:

    color
    ------
    Green
    Black

    Other possible solutions are as follows:

    /* IS NOT NULL in the subquery */
    SELECT C.color
    FROM Colors AS C
    WHERE C.color NOT IN (SELECT P.color
    FROM Products AS P
    WHERE P.color IS NOT NULL);

    /* EXCEPT */
    SELECT color
    FROM Colors
    EXCEPT
    SELECT color
    FROM Products;

    /* LEFT OUTER JOIN with IS NULL check */
    SELECT C.color
    FROM Colors AS C
    LEFT OUTER JOIN Products AS P
    ON C.color = P.color
    WHERE P.color IS NULL;

    While all solutions produce the desired results, using EXCEPT may be the easiest to understand and use. Note that the EXCEPT operator returns distinct values, which works fine in our scenario but may not be correct in another situation.

    Functions on indexed columns in predicates

    We often tend to write code as a direct translation of a given request. For example, if we are asked to retrieve all customers whose name starts with the letter L, it feels very natural to write the query like this, using the LEFT function to return the first character of their name:

    SELECT customer_name
    FROM Customers
    WHERE LEFT(customer_name, 1) = 'L';

    Alternatively, if we are asked to calculate the total sales for January 2009, we might write a query like the following, which uses the DATEPART function to extract the relevant month and year from the sale_date column:

    SELECT SUM(sale_amount) AS total_sales
    FROM Sales
    WHERE DATEPART(YEAR, sale_date) = 2009
    AND DATEPART(MONTH, sale_date) = 1;

    While these queries look very intuitive, you will find that the indexes that you (of course!) have on your customer_name and sale_date columns remain unused, and that the execution plan for these queries reveals index scans.

    The problem arises from the fact that the index columns are being passed to a function, which the query engine must then evaluate for every single row in the table. In cases such as these, the WHERE clause predicate is deemed "non-SARGable" and the best that the query optimizer can do is perform a full index or table scan.

    To make sure the indexes get used, we need to avoid the use of functions on the indexed columns. In our two examples, it is a relatively simple task to rewrite the queries to use SARGable predicates. The first request can be expressed with this logically equivalent query:

    SELECT customer_name
    FROM Customers
    WHERE customer_name LIKE 'L%';

    The equivalent for the second query is as follows:

    SELECT SUM(sale_amount) AS total_sales
    FROM Sales
    WHERE sale_date >= '20090101'
    AND sale_date < '20090201';

    These two queries are most likely to utilize an index seek to retrieve the data quickly and efficiently. It's worth noting that SQL Server is getting "smarter" as it evolves. For example, consider the following query, which uses the CAST function on the indexed sale_date column:

    SELECT SUM(sale_amount) AS total_sales
    FROM Sales
    WHERE CAST(sale_date AS DATE) = '20090101';

    If you run this query on SQL 2005 or earlier, you'll see an index scan. However, on SQL Server 2008 you'll see an index seek, despite the use of the CAST function. The execution plan reveals that the predicate is transformed into something like the following:

    SELECT SUM(sale_amount) AS total_sales
    FROM Sales
    WHERE sale_date >= '20090101'
    AND sale_date < '20090102';

    However, in general, you should use SARGable predicates where possible, rather than rely on the evolving intelligence of the optimizer.

    Incorrect subquery column

    When writing a subquery, it is very easy to abstract yourself from the main query logic and concentrate on the subquery itself. This can lead to the innocent mistake of substituting a column from the subquery source table for a column with similar name from the main query.
    Let's look at two very simple tables; one is a Sales table containing sales data, and the other is an auxiliary Calendar table that has all calendar dates and holidays (abbreviated here):

    Sales table:

    sale_date sale_amount
    ---------- -----------
    2009-01-01 120.50
    2009-01-02 115.00
    2009-01-03 140.80
    2009-01-04 100.50

    Calendar table:

    calendar_date holiday_name
    ------------- ----------------
    2009-01-01 New Year's Day
    2009-01-02 NULL
    2009-01-03 NULL
    2009-01-04 NULL
    2009-01-05 NULL

    The task is to retrieve sales data for holiday dates only. It seems like a trivial query to write:

    SELECT sale_date, sale_amount
    FROM Sales AS S
    WHERE sale_date IN (SELECT sale_date
    FROM Calendar AS C
    WHERE holiday_name IS NOT NULL);

    However, you'll find that the query simply returns all rows from the Sales table! A closer look at the query reveals the culprit to be the SELECT list of the subquery. It accidentally references the sale_date column from the Sales table, instead of the calendar_date column from the Calendar table.

    If that is the case, why did we not get an error? Although the outcome was not what we expected, this is still a valid SQL statement. When using a subquery, the outer query's columns are exposed to the inner query. Here, we unintentionally converted the self-contained subquery, to be executed once and the value passed to the outer query, to a correlated subquery, logically executed once for every row returned by the outer query.

    As a result, the subquery evaluates to sale_date IN (sale_date) which is always true, as long as there is at least one holiday date in the Calendar table, and so our result set returns all rows from the Sales table. Of course, the fix is easy in this case; we simply use the correct date column from the Calendar table:

    SELECT sale_date, sale_amount
    FROM Sales AS S
    WHERE sale_date IN (SELECT C.calendar_date
    FROM Calendar AS C
    WHERE C.holiday_name IS NOT NULL);

    This illustrates another important point: it is a best practice to prefix columns in subqueries with table aliases. For example, if we had used an alias like this:

    SELECT sale_date, sale_amount
    FROM Sales AS S
    WHERE sale_date IN (SELECT C.sale_date
    FROM Calendar AS C
    WHERE holiday_name IS NOT NULL);

    Then this query would have resulted in an error – "Error: Invalid column name 'sale_date'".

    Data type mismatch in predicates

    This is another typical mistake that is sometimes hard to catch. It is very easy to mismatch data types in predicates. It could be in a stored procedure where the parameter is passed as one data type and then used in a query to filter data on a column of different data type. Another example is joining tables on columns with different data types, or simply using a predicate where data types are mismatched.

    For example, we may have a Customers table where the last_name column is of type VARCHAR:

    CREATE TABLE Customers (
    customer_nbr INT NOT NULL PRIMARY KEY,
    first_name VARCHAR(35) NOT NULL,
    last_name VARCHAR(35) NOT NULL);

    Then the following stored procedure is used to retrieve the customer information by customer last name:

    CREATE PROCEDURE GetCustomerByLastName
    @last_name NVARCHAR(35)
    AS
    SELECT first_name, last_name
    FROM Customers
    WHERE last_name = @last_name;

    Notice here the parameter @last_name is of data type NVARCHAR. Although the code "works", SQL Server will have to perform implicit conversion of the last name column to NVARCHAR, because NVARCHAR is of higher data precedence. This can result in a performance penalty. The implicit conversion is visible in the query plan as CONVERT_IMPLICIT. Based on collation, and other factors, a data type mismatch may also preclude the use of an index seek. Use of the correct data type resolves the problem:

    CREATE PROCEDURE GetCustomerByLastName
    @last_name VARCHAR(35)
    AS
    SELECT first_name, last_name
    FROM Customers
    WHERE last_name = @last_name;

    In many cases, this mistake is the result of splitting responsibilities on the team, and having one team member design the tables and another implement stored procedures or code. Another reason could be using different data sources to join data where the join columns have different data types in the source systems. The same advice applies not only to character data type mismatches, but also to mismatches between numeric data types (like INT and FLOAT), or the mixing of numeric and character data types.

    Predicate evaluation order

    If you are familiar with the logical query processing order, then you may expect that a query is executed in the following order:

    1. FROM
    2. WHERE
    3. GROUP BY
    4. HAVING
    5. SELECT

    The sequence above outlines the logical order for executing a query. Logically, the FROM clause is processed first, defining the source data set; next the WHERE predicates are applied, followed by GROUP BY, and so on.

    However, physically, the query is processed differently and the query optimizer is free to move expressions in the query plan in order to produce the most cost efficient plan for retrieving the data. This leads to a common misunderstanding that a filter in the WHERE clause is applied before the next phases are processed. In fact, a predicate can be applied much later in the physical execution plan. Also, there is no left to right order for execution of predicates. For example, if you have a WHERE clause containing "WHERE x=1 AND y=2", there is no guarantee that "x=1" will be evaluated first. They can be executed in any order.

    For example, consider the following Accounts table where, in the account_reference column, Business accounts are denoted by a numeric reference and Personal accounts by a character reference:

    account_nbr account_type account_reference
    ----------- --------------- -----------------
    1 Personal abc
    2 Business Basic 101
    3 Personal def
    4 Business Plus 5

    In general, this table indicates bad design. The account_reference column should be represented as two different attributes, specific to business and personal accounts and each with the correct data type (not even belonging to the same table). However, in practice, we very often have to deal with systems designed with shortcomings, where altering the design is not an option.

    Given the above scenario, a valid request is to retrieve all business type accounts with an account reference that is greater than 20 (assuming account reference has some meaningful numeric value for business type accounts). The query may look like this:

    SELECT account_nbr, account_reference AS account_ref_nbr
    FROM Accounts
    WHERE account_type LIKE 'Business%'
    AND CAST(account_reference AS INT) > 20;

    However, the query results in an error:

    "Conversion failed when converting the varchar value 'abc' to data type int"

    The query fails because, as noted earlier, there is no prescribed order for executing predicates and nothing guarantees that the predicate "account_type LIKE 'Business%'" will be evaluated before the predicate "CAST(account_reference AS INT) > 20". In our case, the second predicate is evaluated first, resulting in a conversion error due to the incompatible values in the account_reference column for personal accounts.

    One attempt to resolve this issue might be to use a derived table (or common table expression) to filter the business type accounts first, and then apply the predicate for account_reference column:

    SELECT account_nbr, account_ref_nbr
    FROM (SELECT account_nbr,
    CAST(account_reference AS INT) AS account_ref_nbr
    FROM Accounts
    WHERE account_type LIKE 'Business%') AS A
    WHERE account_ref_nbr > 20;

    However, this results in the exact same error because derived tables and CTEs are expanded in the query plan and a single query plan is produced, where predicates can again be pushed up or down in the plan.

    As indicated earlier, the problem here is a mix of bad design and misunderstanding of how SQL Server performs physical query execution. What is the solution? The best solution is to design the table correctly and avoid storing mixed data in a single column. In this case, a work around is to use a CASE expression to guarantee that only valid numeric values will be converted to INT data type:

    SELECT account_nbr, account_reference AS account_ref_nbr
    FROM Accounts
    WHERE account_type LIKE 'Business%'
    AND CASE WHEN account_reference NOT LIKE '%[^0-9]%'
    THEN CAST(account_reference AS INT)
    END > 20;

    The CASE expression uses a LIKE pattern to check for valid numeric values (a double negation logic is used which can be translated as "there is not a single character that is not a digit"), and only for those values performs the CAST. For the rest of the values the CASE expression results in NULL, which is filtered out because NULL is not matched with any value (even with NULL).
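    The double-negation check can be tried in isolation. The following sketch uses a hypothetical list of sample values and the SQL Server 2008 VALUES row constructor to show which values the CASE expression converts and which it maps to NULL:

```sql
-- Demonstration of the double-negation LIKE check on hypothetical sample values.
-- Values containing any non-digit character yield NULL instead of failing the CAST.
SELECT sample_value,
       CASE WHEN sample_value NOT LIKE '%[^0-9]%'
            THEN CAST(sample_value AS INT)
       END AS converted_value
FROM (VALUES ('123'), ('abc'), ('45x'), ('7')) AS T(sample_value);
```

    Only the purely numeric values are converted; the others produce NULL and would be filtered out by any comparison predicate.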

    Outer joins and placement of predicates

    Outer joins are such a great tool but are also much misunderstood and abused. Some people seem to like them so much that they throw one into almost every query, regardless of whether or not it is needed!

    The key to correct use of outer joins is an understanding of the logical steps required to process an outer join in a query. Here are the relevant steps from the query processing phases:

  • A cross join (Cartesian product) is formed for the two input tables in the FROM clause. The result of the Cartesian product is every possible combination of a row from the first table and a row from the second table.

  • The ON clause predicates are applied, keeping only the rows that satisfy the predicate logic.

  • Any outer rows filtered out by the predicates in step 2 are added back. Rows from the preserved table are added with their actual attribute values (column values), and the attributes (columns) from the non-preserved table are set to NULL.

  • The WHERE clause predicates are applied.

    An outer join query can produce completely different results depending on how you write it, and where predicates are placed in that query. Let's look at one example, based on the following two tables, Customers and Orders:

    Customers table:

    customer_nbr customer_name
    ------------ --------------
    1 Jim Brown
    2 Jeff Gordon
    3 Peter Green
    4 Julie Peters

    Orders table:

    order_nbr order_date customer_nbr order_amt
    ----------- ---------- ------------ ----------
    1 2008-10-01 1 15.50
    2 2008-12-15 2 25.00
    3 2009-01-02 1 18.00
    4 2009-02-20 3 10.25
    5 2009-03-05 1 30.00
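    To follow along, here is a setup sketch for these two sample tables; the data types are assumptions chosen to match the sample data:

```sql
-- Hypothetical setup script for the sample tables (data types are assumed).
CREATE TABLE Customers (
    customer_nbr INT PRIMARY KEY,
    customer_name VARCHAR(50));

CREATE TABLE Orders (
    order_nbr INT PRIMARY KEY,
    order_date DATETIME,
    customer_nbr INT REFERENCES Customers (customer_nbr),
    order_amt DECIMAL(15, 2));

INSERT INTO Customers VALUES (1, 'Jim Brown');
INSERT INTO Customers VALUES (2, 'Jeff Gordon');
INSERT INTO Customers VALUES (3, 'Peter Green');
INSERT INTO Customers VALUES (4, 'Julie Peters');

INSERT INTO Orders VALUES (1, '20081001', 1, 15.50);
INSERT INTO Orders VALUES (2, '20081215', 2, 25.00);
INSERT INTO Orders VALUES (3, '20090102', 1, 18.00);
INSERT INTO Orders VALUES (4, '20090220', 3, 10.25);
INSERT INTO Orders VALUES (5, '20090305', 1, 30.00);
```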

    Our task is to retrieve a list of all customers, and the total amount they have spent on orders, since the beginning of year 2009. Instinctively, one may write the following query:

    SELECT C.customer_name, SUM(COALESCE(O.order_amt, 0)) AS total_2009
    FROM Customers AS C
    LEFT OUTER JOIN Orders AS O
    ON C.customer_nbr = O.customer_nbr
    WHERE O.order_date >= '20090101'
    GROUP BY C.customer_name;

    But the results do not look good:

    customer_name total_2009
    -------------- ------------
    Jim Brown 48.00
    Peter Green 10.25

    Customers Jeff and Julie are missing from the result set. Where is the problem? In order to understand what went wrong, let’s play back this query one step at a time following the logical processing order. The first step is a cross join between the two input tables:

    SELECT C.customer_name, O.order_amt, O.order_date
    FROM Customers AS C
    CROSS JOIN Orders AS O;

    This results in every possible combination of rows from both input tables:

    customer_name order_amt order_date
    ---------------- ---------- ----------
    Jim Brown 15.50 2008-10-01
    Jim Brown 25.00 2008-12-15
    Jim Brown 18.00 2009-01-02
    Jim Brown 10.25 2009-02-20
    Jim Brown 30.00 2009-03-05
    Jeff Gordon 15.50 2008-10-01
    Jeff Gordon 25.00 2008-12-15
    Jeff Gordon 18.00 2009-01-02
    Jeff Gordon 10.25 2009-02-20
    Jeff Gordon 30.00 2009-03-05
    Peter Green 15.50 2008-10-01
    Peter Green 25.00 2008-12-15
    Peter Green 18.00 2009-01-02
    Peter Green 10.25 2009-02-20
    Peter Green 30.00 2009-03-05
    Julie Peters 15.50 2008-10-01
    Julie Peters 25.00 2008-12-15
    Julie Peters 18.00 2009-01-02
    Julie Peters 10.25 2009-02-20
    Julie Peters 30.00 2009-03-05

    The next step is applying the ON predicates of the JOIN clause:

    SELECT C.customer_name, O.order_amt, O.order_date
    FROM Customers AS C
    INNER JOIN Orders AS O
    ON C.customer_nbr = O.customer_nbr;

    The result of this query includes only customers with orders. Since customer Julie does not have any orders, she is excluded from the result set:

    customer_name order_amt order_date
    -------------- ---------- ----------
    Jim Brown 15.50 2008-10-01
    Jeff Gordon 25.00 2008-12-15
    Jim Brown 18.00 2009-01-02
    Peter Green 10.25 2009-02-20
    Jim Brown 30.00 2009-03-05

    The third step of the logical processing order is adding back the outer rows. These rows were excluded in the prior step because they did not satisfy the join predicates.

    SELECT C.customer_name, O.order_amt, O.order_date
    FROM Customers AS C
    LEFT OUTER JOIN Orders AS O
    ON C.customer_nbr = O.customer_nbr;

    Now customer Julie is added back in the result set. Notice the added outer rows from the preserved table (Customers) have values for the selected attributes (customer_name) and the non-preserved table (Orders) rows have NULL for their attributes (order_amt and order_date):

    customer_name order_amt order_date
    -------------- ---------- ----------
    Jim Brown 15.50 2008-10-01
    Jim Brown 18.00 2009-01-02
    Jim Brown 30.00 2009-03-05
    Jeff Gordon 25.00 2008-12-15
    Peter Green 10.25 2009-02-20
    Julie Peters NULL NULL

    The last step is applying the WHERE clause predicates:

    SELECT C.customer_name, O.order_amt, O.order_date
    FROM Customers AS C
    LEFT OUTER JOIN Orders AS O
    ON C.customer_nbr = O.customer_nbr
    WHERE O.order_date >= '20090101';

    Now the picture is clear! The culprit is the WHERE clause predicate. Customer Jeff is filtered out from the result set because he does not have orders past January 1, 2009, and customer Julie is filtered out because she has no orders at all (since the outer row added for Julie has NULL for the order_date column). In effect, in this case, the predicate in the WHERE clause turns the outer join into an inner join.

    To correct our initial query, it is sufficient to move the WHERE predicate into the join condition.

    SELECT C.customer_name, SUM(COALESCE(O.order_amt, 0)) AS total_2009
    FROM Customers AS C
    LEFT OUTER JOIN Orders AS O
    ON C.customer_nbr = O.customer_nbr
    AND O.order_date >= '20090101'
    GROUP BY C.customer_name;

    Now, the query returns correct results because Jeff and Julie are filtered out in the join predicates, but then added back when the outer rows are added.

    customer_name total_2009
    -------------- ------------
    Jeff Gordon 0.00
    Jim Brown 48.00
    Julie Peters 0.00
    Peter Green 10.25

    In a more complex example, with multiple joins, the incorrect filtering may happen on a subsequent table operator (like a join to another table) instead of in the WHERE clause. For example, say we have an OrderDetails table containing product SKU and quantity, and the request is to retrieve a list of all customers, with order amount and quantity, for selected product SKUs. The following query may seem correct:

    SELECT C.customer_name, O.order_amt, D.qty
    FROM Customers AS C
    LEFT OUTER JOIN Orders AS O
    ON C.customer_nbr = O.customer_nbr
    INNER JOIN OrderDetails AS D
    ON D.order_nbr = O.order_nbr
    AND D.sku = 101;

    However, here the INNER join with the OrderDetails table plays the exact same role as the predicate in the WHERE clause in our previous example, in effect turning the LEFT OUTER join to INNER join. The correct query to satisfy this request needs to use a LEFT OUTER join to join to the OrderDetails table:

    SELECT C.customer_name, O.order_amt, D.qty
    FROM Customers AS C
    LEFT OUTER JOIN Orders AS O
    ON C.customer_nbr = O.customer_nbr
    LEFT OUTER JOIN OrderDetails AS D
    ON D.order_nbr = O.order_nbr
    AND D.sku = 101;

    Subqueries that return more than one value

    A very frequent request is to retrieve a value based on some correlation with the main query table. For example, consider the following two tables, storing details of products and the plants that manufacture these products:

    Products table:

    sku product_description
    ----- ------------------
    1 Bike
    2 Ball
    3 Phone

    ProductPlants table:
    sku plant_nbr
    ----- -----------
    1 1
    2 1
    3 2

    The request is to extract the manufacturing plant for each product. One way to satisfy the request is to write the following query using correlated subquery to retrieve the plant:

    SELECT sku, product_description,
    (SELECT plant_nbr
    FROM ProductPlants AS B
    WHERE B.sku = A.sku) AS plant_nbr
    FROM Products AS A;

    Note that the point here is to illustrate a technique; there could be a more efficient way to accomplish the same task. However, all works fine and we get the correct result set:

    sku product_description plant_nbr
    ---- ------------------- -----------
    1 Bike 1
    2 Ball 1
    3 Phone 2

    The query will continue to work happily until the day arrives that the company decides to start manufacturing Balls at plant 3, to cope with increasing demand. The ProductPlants table now looks like this:

    sku plant_nbr
    ----- -----------
    1 1
    2 1
    2 3
    3 2

    All of a sudden, our query starts generating the following error:

    Msg 512, Level 16, State 1, Line 1
    Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

    The error is descriptive enough. Instead of the expected scalar value, our subquery returns a result set, which breaks the query. Based on our business requirements, the fix is simple. To list all plants manufacturing a particular product, we simply use a JOIN:

    SELECT A.sku, A.product_description, B.plant_nbr
    FROM Products AS A
    JOIN ProductPlants AS B
    ON A.sku = B.sku;

    Now the query completes without errors and returns the correct results:

    sku product_description plant_nbr
    ---- -------------------- -----------
    1 Bike 1
    2 Ball 1
    2 Ball 3
    3 Phone 2

    Note that the same error can occur in a predicate where a column or expression is tested against a subquery, for example "… column = (SELECT value FROM Table)". In that case, the solution is to use the IN predicate in place of "=".
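    As a sketch, a predicate of that form can be rewritten like this (the Plants table and the literal sku value are hypothetical):

```sql
-- Breaks as soon as the subquery returns more than one plant:
-- WHERE plant_nbr = (SELECT plant_nbr FROM ProductPlants WHERE sku = 2)

-- Safe with any number of rows returned by the subquery:
SELECT *
FROM Plants
WHERE plant_nbr IN (SELECT plant_nbr
                    FROM ProductPlants
                    WHERE sku = 2);
```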

    Use of SELECT *

    On first encounter with SQL we always praise the genius who invented the syntax SELECT *! It's so handy and easy to use! Instead of explicitly listing all column names in our query, we just use the magic wildcard '*' and retrieve all columns. For example, a common misuse of SELECT * is to extract a set of all plastic products and insert them into another table with the same structure:

    INSERT INTO PlasticProducts
    SELECT *
    FROM Products
    WHERE material_type = 'plastic';

    Job done! However, one day business requirements change and two new columns are added to the Products table:

    ALTER TABLE Products
    ADD effective_start_date DATETIME,
    effective_end_date DATETIME;

    The next time the INSERT statement is executed, it fails with the following error:

    Msg 213, Level 16, State 1, Line 1
    Insert Error: Column name or number of supplied values does not match table definition.

    The fix is to explicitly list the column names in the query:

    INSERT INTO PlasticProducts (sku, product_description, material_type)
    SELECT sku, product_description, material_type
    FROM Products
    WHERE material_type = 'plastic';

    The situation can get even worse if a view is created using SELECT *, and later the base tables are modified to add or drop columns.

    Note: If a view is created using the SCHEMABINDING option, then the base tables cannot be modified in a way that would affect the view definition.
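    As a sketch, a schema-bound view over the Products table might look like this (the view name is hypothetical); with it in place, dropping or altering a referenced column fails:

```sql
-- Schema binding requires two-part names and an explicit column list (no SELECT *).
CREATE VIEW dbo.PlasticProductsView
WITH SCHEMABINDING
AS
SELECT sku, product_description, material_type
FROM dbo.Products
WHERE material_type = 'plastic';

-- This now fails because the view is bound to the column:
-- ALTER TABLE dbo.Products DROP COLUMN material_type;
```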

    To conclude, do not use SELECT * in production code! One exception here is when using the EXISTS predicate. The select list in the subquery for the EXISTS predicate is ignored since only the existence of rows is important.
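    For instance, SELECT * is perfectly fine inside an EXISTS predicate, as in this sketch against the sample Customers and Orders tables from earlier:

```sql
-- The select list inside EXISTS is never evaluated; only row existence matters.
SELECT C.customer_name
FROM Customers AS C
WHERE EXISTS (SELECT *
              FROM Orders AS O
              WHERE O.customer_nbr = C.customer_nbr);
```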

    Scalar user-defined functions

    Reuse of code is one of the fundamental principles we learn when programming in any language, and the SQL language is no exception. It provides many means by which to logically group code and reuse it.

    One such means in SQL Server is the scalar user-defined function. It seems so convenient to hide away all those complex calculations in a function, and then simply invoke it in our queries. However, the hidden "sting in the tail" is that it can bring a heavy toll in terms of performance. When used in a query, scalar functions are evaluated for each row and, with large tables, this can result in very slow running queries. This is especially true when the scalar function needs to access another table to retrieve data.

    Here is one example. Given tables with products and sales for products, the request is to retrieve total sales per product. Since the total sales value can be reused in another place, you decide to use a scalar function to calculate the total sales for a product:

    CREATE FUNCTION dbo.GetTotalSales(@sku INT)
    RETURNS DECIMAL(15, 2)
    AS
    BEGIN
    RETURN(SELECT SUM(sale_amount)
    FROM Sales
    WHERE sku = @sku);
    END

    Then the query to retrieve the total sales for each product will look like this:

    SELECT sku, product_description, dbo.GetTotalSales(sku) AS total_sales
    FROM Products;

    Isn't this a very neat and good looking query? But just wait until you run it over a large data set. The total sales calculation will be repeated for each and every row, and the overhead will be proportional to the number of rows. The correct way to handle this is, if possible, to rewrite the function as an inline table-valued function, or simply perform the calculation in the main query. In our example, the function rewritten as an inline table-valued function looks like this:

    CREATE FUNCTION dbo.GetTotalSales(@sku INT)
    RETURNS TABLE
    AS
    RETURN(SELECT SUM(sale_amount) AS total_sales
    FROM Sales
    WHERE sku = @sku);

    Now the table-valued function can be invoked in the query using the APPLY operator:

    SELECT sku, product_description, total_sales
    FROM Products AS P
    CROSS APPLY dbo.GetTotalSales(P.sku) AS S;
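    Alternatively, as mentioned above, the calculation can be performed directly in the main query, avoiding the function altogether. One sketch uses a correlated subquery:

```sql
-- The optimizer can fold this into a single plan, unlike a scalar UDF
-- that is invoked once per row.
SELECT P.sku, P.product_description,
       (SELECT SUM(S.sale_amount)
        FROM Sales AS S
        WHERE S.sku = P.sku) AS total_sales
FROM Products AS P;
```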

    Overuse of cursors

    Let's face it – we love loops! Whether we start programming with VB, C, C++, Java, or C#, one of the first constructs we encounter is some form of a loop. They can helpfully solve pretty much any challenge you might face.

    And so, it is only natural on the day we start programming with SQL to seek out our favorite loop construct. And here it is – the mighty cursor (and its little WHILE brother)! Then we hurry to put the well known tool to use in solving our problems.

    Let's look at one example. Given a table with product prices, we have to perform a monthly update of prices for products; the price updates are stored in another table with new prices.

    ProductPrices table:

    sku price effective_start_date effective_end_date
    ---- ------ -------------------- ------------------
    1 10.50 2009-01-01 NULL
    2 11.50 2009-01-01 NULL
    3 19.00 2009-01-01 NULL
    4 11.25 2009-01-01 NULL

    NewPrices table:

    sku price
    ---- ------
    2 11.25
    4 12.00
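    A setup sketch for these two tables (data types are assumed to match the sample data):

```sql
CREATE TABLE ProductPrices (
    sku INT PRIMARY KEY,
    price DECIMAL(15, 2),
    effective_start_date DATETIME,
    effective_end_date DATETIME);

CREATE TABLE NewPrices (
    sku INT PRIMARY KEY,
    price DECIMAL(15, 2));

INSERT INTO ProductPrices VALUES (1, 10.50, '20090101', NULL);
INSERT INTO ProductPrices VALUES (2, 11.50, '20090101', NULL);
INSERT INTO ProductPrices VALUES (3, 19.00, '20090101', NULL);
INSERT INTO ProductPrices VALUES (4, 11.25, '20090101', NULL);

INSERT INTO NewPrices VALUES (2, 11.25);
INSERT INTO NewPrices VALUES (4, 12.00);
```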

    A cursor solution may look like this:

    DECLARE @sku INT;
    DECLARE @price DECIMAL(15, 2);

    DECLARE PriceUpdates CURSOR
    FOR SELECT sku, price
    FROM NewPrices;

    OPEN PriceUpdates;

    FETCH NEXT FROM PriceUpdates
    INTO @sku, @price;

    WHILE @@FETCH_STATUS = 0
    BEGIN

    UPDATE ProductPrices
    SET price = @price,
    effective_start_date = CURRENT_TIMESTAMP
    WHERE sku = @sku;

    FETCH NEXT FROM PriceUpdates
    INTO @sku, @price;

    END

    CLOSE PriceUpdates;
    DEALLOCATE PriceUpdates;

    Mission accomplished! Now we can take a well-deserved break while the query is running. Soon, the realization dawns that procedural row by row processing is not working well in SQL. Besides being very slow, our solution is long, hard to read and maintain. This is the moment we understand the power of SQL is its set-based nature. The same task can be accomplished using a very efficient set-based query that is easier to understand and maintain:

    UPDATE ProductPrices
    SET price = (SELECT N.price
    FROM NewPrices AS N
    WHERE N.sku = ProductPrices.sku),
    effective_start_date = CURRENT_TIMESTAMP
    WHERE EXISTS(SELECT *
    FROM NewPrices AS N
    WHERE N.sku = ProductPrices.sku);

    There are different ways to write a set based query to solve this problem: using the MERGE statement, update with Common Table Expression, or the SQL Server specific update with join. But the point is to utilize the natural power of the SQL language and use set based techniques to solve problems and to avoid procedural solutions.
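    For illustration, the same update written with the MERGE statement (available in SQL Server 2008 and later) might look like this sketch:

```sql
-- Update each product price from its matching row in NewPrices.
MERGE INTO ProductPrices AS P
USING NewPrices AS N
    ON P.sku = N.sku
WHEN MATCHED THEN
    UPDATE SET price = N.price,
               effective_start_date = CURRENT_TIMESTAMP;
```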

    Note: While you should avoid cursors as much as possible, there are certain problems, such as running total aggregations, that today are still best solved using cursors. We can be optimistic that future enhancements will provide better tools to solve those problems in a set based way.

    Maybe now is a good time for the next code review of your SQL code. Hopefully some of these examples will help you be a better developer/DBA and produce more reliable and efficient code. It is not always easy to spot these patterns, especially in more complex queries. But as long as you understand the fundamentals, avoiding these mistakes is easy.

    Happy SQL coding!

    Monday, September 14, 2009

    Script that generates data from an existing table


    The following script will help generate an INSERT data script from an existing table.

    NOTE: This does not work if you have MONEY data type field in your table.

    CREATE Procedure [dbo].[usp_Generate_Inserts] ( @p_tablename varchar(50) )
    AS
    -- This is a utility stored procedure to generate insert statements.
    Begin
    Declare @strSQLMain varchar(8000)
    Declare @sSQLInsert varchar(500)
    Declare @sSQLFrom varchar(8000)
    Declare @sComma char(1)
    Declare @sOpenParenthesis char(1)
    Declare @sCloseParenthesis char(1)
    Declare @singleQuote varchar(10)
    Declare @concat varchar(10)
    Set @sComma = ','
    Set @sOpenParenthesis = '('
    Set @sCloseParenthesis = ')'
    Set @singleQuote = ''''
    Set @concat = '''+'''
    Set @sSQLFrom = ''
    -- Drop the temp table if it was not deleted during a previous run.
    If Exists ( Select 1 From Information_Schema.Tables where Table_Type = 'BASE TABLE' and Table_Name = 'tmpResultsuspGI')
    Drop table tmpResultsuspGI
    -- Check if the table exists.
    If Exists ( Select 1 From Information_Schema.Tables where Table_Type = 'BASE TABLE' and Table_Name = @p_tablename )
    Begin
    -- Get the columns.
    declare @name varchar(50), @xtype varchar(50)
    declare curColumns cursor
    for Select s.name, st.name
    from sysColumns s
    inner join sysTypes st On s.xtype = st.xtype
    where id = Object_ID(@p_tablename) and st.status = 0
    -- Build the insert and select lists based on the columns' data types.
    select @sSQLInsert = 'INSERT INTO [dbo].[' + @p_tablename + ']' + @sOpenParenthesis
    open curColumns;
    fetch next from curColumns into @name, @xtype
    while @@fetch_status = 0
    Begin
    /** Query Format
    select cast(countryID as varchar(30)) + ',''' + CountryCode + '''' + ',''' + countryname + ''''
    from Country **/
    select @sSQLInsert = @sSQLInsert + @name + @sComma
    if @xtype in ('char','varchar','datetime','smalldatetime','nvarchar','nchar','uniqueidentifier')
    select @sSQLFrom = @sSQLFrom + '''''''''' + '+ IsNull(cast(' + @name + ' as varchar(500)),''NULL'') +' + '''''''''' + '+' + ''',''' + '+'
    else
    select @sSQLFrom = @sSQLFrom + 'cast(IsNull(cast(' + @name + ' as varchar(500)),''NULL'') as varchar(500)) ' + '+' + ''',''' + '+'
    fetch next from curColumns into @name, @xtype
    End
    close curColumns;
    deallocate curColumns;
    select @sSQLInsert = substring(@sSQLInsert, 1, Len(@sSQLInsert) - 1)
    select @sSQLInsert = @sSQLInsert + @sCloseParenthesis
    select @sSQLFrom = substring(@sSQLFrom, 1, Len(@sSQLFrom) - 5)
    select @sSQLFrom = @sSQLFrom + ' as DText'
    Set @strSQLMain = 'Select ' + @sSQLFrom + ' into [dbo].tmpResultsuspGI From [' + @p_tablename + ']'
    --print @strSQLMain
    exec (@strSQLMain)
    -- Check if there is an identity column or not.
    If ObjectProperty(Object_ID(@p_tablename),'TableHasIdentity') = 1
    Select 'Set IDENTITY_INSERT [' + @p_tablename + '] ON '
    Select @sSQLInsert + ' VALUES' + @sOpenParenthesis + Replace(DText,'''NULL''','NULL') + @sCloseParenthesis As [--Statements]
    From [dbo].tmpResultsuspGI
    If ObjectProperty(Object_ID(@p_tablename),'TableHasIdentity') = 1
    Select 'Set IDENTITY_INSERT [' + @p_tablename + '] OFF '
    Drop table [dbo].tmpResultsuspGI
    End
    Else
    Print 'Table does not exist.'
    End

    Now execute it like this:
    use northwind
    Exec usp_Generate_Inserts 'Order Details'

    Exec usp_Generate_Inserts 'Customers'

    Do let me know if you have any comments.

    Friday, September 4, 2009

    Questions on SSIS

    Question 1 - True or False - Using a checkpoint file in SSIS is just like issuing the CHECKPOINT command against the relational engine. It commits all of the data to the database.

    False. SSIS provides a Checkpoint capability which allows a package to restart at the point of failure.

    Additional information: Checkpoints in SQL Server Integration Services (SSIS) Packages to restart from the point of failure

    Question 2 - Can you explain what the Import\Export tool does and the basic steps in the wizard?

    The Import\Export tool is accessible via BIDS or executing the dtswizard command.
    The tool identifies a data source and a destination to move data either within a single database, between instances, or even from a database to a file (or vice versa).

    Additional information:
    SQL Server 2000 vs SQL Server 2005 Import and Export Wizard

    Question 3 - What are the command line tools to execute SQL Server Integration Services packages?

    DTSEXECUI - When this command line tool is run a user interface is loaded in order to configure each of the applicable parameters to execute an SSIS package.

    DTEXEC - This is a pure command line tool where all of the needed switches must be passed into the command for successful execution of the SSIS package.

    Additional information:
    Execute a SQL Server 2000 DTS Package vs a SQL Server 2005 Integration Services Package

    Question 4 - Can you explain the SQL Server Integration Services functionality in Management Studio?

    You have the ability to do the following:

  • Login to the SQL Server Integration Services instance

  • View the SSIS log

  • View the packages that are currently running on that instance

  • Browse the packages stored in MSDB or the file system

  • Import or export packages

  • Delete packages

  • Run packages

    Question 5 - Can you name some of the core SSIS components in the Business Intelligence Development Studio you work with on a regular basis when building an SSIS package?

  • Connection Managers

  • Control Flow

  • Data Flow

  • Event Handlers

  • Variables window

  • Toolbox window

  • Output window

  • Logging

  • Package Configurations

    Question 6 - True or False: SSIS has a default means to log all records updated, deleted or inserted on a per table basis.

    False, but a custom solution can be built to meet these needs.

    Additional information: Custom Logging in SQL Server Integration Services Packages (SSIS)

    Question 7 - What is a breakpoint in SSIS? How is it setup? How do you disable it?

    A breakpoint is a stopping point in the code. The breakpoint can give the Developer\DBA an opportunity to review the status of the data, variables and the overall status of the SSIS package.

    10 unique conditions exist for each breakpoint.

    Breakpoints are setup in BIDS. In BIDS, navigate to the control flow interface. Right click on the object where you want to set the breakpoint and select the 'Edit Breakpoints...' option.

    Additional information:
    Breakpoints in SQL Server 2005 Integration Services

    Question 8 - Can you name 5 or more of the native SSIS connection managers?

  • OLEDB connection - Used to connect to any data source requiring an OLEDB connection (i.e., SQL Server 2000)

  • Flat file connection - Used to make a connection to a single file in the File System. Required for reading information from a File System flat file

  • ADO.Net connection - Uses the .Net Provider to make a connection to SQL Server 2005 or other connection exposed through managed code (like C#) in a custom task

  • Analysis Services connection - Used to make a connection to an Analysis Services database or project. Required for the Analysis Services DDL Task and Analysis Services Processing Task

  • File connection - Used to reference a file or folder. The options are to either use or create a file or folder

  • Excel

  • FTP

  • HTTP

  • MSMQ

  • SMO

  • SMTP

  • SQLMobile

  • WMI

    Additional information:
    Connection Managers in SQL Server 2005 Integration Services

    Question 9 - How do you eliminate quotes from being uploaded from a flat file to SQL Server?

    In the SSIS package on the Flat File Connection Manager Editor, enter quotes into the Text qualifier field then preview the data to ensure the quotes are not included.

    Additional information: How to strip out double quotes from an import file in SQL Server Integration Services

    Question 10 - Can you name 5 or more of the main SSIS tool box widgets and their functionality?
  • For Loop Container

  • Foreach Loop Container

  • Sequence Container

  • ActiveX Script Task

  • Analysis Services Execute DDL Task

  • Analysis Services Processing Task

  • Bulk Insert Task

  • Data Flow Task

  • Data Mining Query Task

  • Execute DTS 2000 Package Task

  • Execute Package Task

  • Execute Process Task

  • Execute SQL Task

  • etc.

    Question 11 - Can you explain one approach to deploy an SSIS package?

    One option is to build a deployment manifest file in BIDS, then copy the directory to the applicable SQL Server then work through the steps of the package installation wizard

    A second option is using the dtutil utility to copy, paste, rename, delete an SSIS Package

    A third option is to login to SQL Server Integration Services via SQL Server Management Studio, then navigate to the 'Stored Packages' folder, then right click on one of the children folders or an SSIS package to access the 'Import Packages...' or 'Export Packages...' option.

    A fourth option in BIDS is to navigate to File > Save Copy of Package and complete the interface.

    Additional information:
    Deploying a SQL Server 2000 DTS Package vs. a SQL Server 2005 Integration Services Package (SSIS)

    Import, Export, Copy and Delete Integration Services Packages in SQL Server 2005

    Question 12 - Can you explain how to setup a checkpoint file in SSIS?

    The following items need to be configured on the properties tab for SSIS package:

    CheckpointFileName - Specify the full path to the Checkpoint file that the package uses to save the value of package variables and log completed tasks. Rather than using a hard-coded path, it's a good idea to use an expression that concatenates a path defined in a package variable and the package name.

    CheckpointUsage - Determines if/how checkpoints are used. Choose from these options: Never (default), IfExists, or Always. Never indicates that you are not using Checkpoints. IfExists is the typical setting and implements the restart at the point of failure behavior. If a Checkpoint file is found it is used to restore package variable values and restart at the point of failure. If a Checkpoint file is not found the package starts execution with the first task. The Always choice raises an error if the Checkpoint file does not exist.

    SaveCheckpoints - Choose from these options: True or False (default). You must select True to implement the Checkpoint behavior.

    Additional information: Checkpoints in SQL Server Integration Services (SSIS) Packages to restart from the point of failure

    Question 13 - Can you explain different options for dynamic configurations in SSIS?

  • Use an XML file

  • Use custom variables

  • Use a database per environment with the variables

  • Use a centralized database with all variables

    Additional information: Using XML Package Configuration with SQL Server Integration Services (SSIS) Packages

    Question 14 - How do you upgrade an SSIS Package?

    Depending on the complexity of the package, one or two techniques are typically used:
  • Recode the package based on the functionality in SQL Server DTS

  • Use the Migrate DTS 2000 Package wizard in BIDS then recode any portion of the package that is not accurate

    Additional information:
    Upgrade SQL Server DTS Packages to Integration Services Packages

    Question 15 - Can you name five of the Perfmon counters for SSIS and the value they provide?

    SQLServer:SSIS Service
  • SSIS Package Instances - Total number of simultaneous SSIS Packages running

    SQLServer:SSIS Pipeline

  • BLOB bytes read - Total bytes read from binary large objects during the monitoring period.

  • BLOB bytes written - Total bytes written to binary large objects during the monitoring period.

  • BLOB files in use - Number of binary large objects files used during the data flow task during the monitoring period.

  • Buffer memory - The amount of physical or virtual memory used by the data flow task during the monitoring period.

  • Buffers in use - The number of buffers in use during the data flow task during the monitoring period.

  • Buffers spooled - The number of buffers written to disk during the data flow task during the monitoring period.

  • Flat buffer memory - The total number of blocks of memory in use by the data flow task during the monitoring period.

  • Flat buffers in use - The number of blocks of memory in use by the data flow task at a point in time.

  • Private buffer memory - The total amount of physical or virtual memory used by data transformation tasks in the data flow engine during the monitoring period.

  • Private buffers in use - The number of blocks of memory in use by the transformations in the data flow task at a point in time.

  • Rows read - Total number of input rows in use by the data flow task at a point in time.

  • Rows written - Total number of output rows in use by the data flow task at a point in time.

    Additional information: Perfmon Counters for the Data Flow Engine in SQL Server Integration Services (SSIS)

    Useful Info on Triggers

    Hi All,

    I came across a nice article from SQL-SERVER-PERFORMANCE.COM on triggers by Brad McGehee.

    If you need to implement cascading referential integrity (such as cascading deletes) in your SQL databases, use the cascading referential integrity constraint instead of triggers to perform the cascading delete, as they are much more efficient and can boost performance. If you have an older (7.0 and older) application that you have moved to SQL Server 2000/2005, and it used triggers to perform cascading deletes, consider removing the triggers and using cascading referential integrity instead.
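    As a sketch, a cascading delete declared as a foreign key constraint (using the hypothetical Customers/Orders tables from the earlier examples) looks like this:

```sql
-- Deleting a customer automatically deletes that customer's orders,
-- with no trigger code involved.
ALTER TABLE Orders
ADD CONSTRAINT FK_Orders_Customers
    FOREIGN KEY (customer_nbr)
    REFERENCES Customers (customer_nbr)
    ON DELETE CASCADE;
```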


    While INSTEAD OF triggers are technically interchangeable with conventional AFTER triggers, the main reason to use an INSTEAD OF trigger is to allow you to update certain types of views. What does this mean in regards to performance? Assuming most of the triggers you write only rollback transactions rarely, then you will want to continue using AFTER triggers. This is because the overhead of an INSTEAD OF trigger is higher than an AFTER trigger, assuming rollbacks are rare. But if rollbacks are common (more than half the time), then an INSTEAD OF trigger would be the better deal because its overhead is less than an AFTER trigger that rolls back. So for most triggers, stick with the conventional AFTER trigger and save INSTEAD OF triggers for updating views.


    SQL Server 2000/2005 allows you to somewhat control the order in which triggers are fired. I say "somewhat" because you don't have full control. You have the option to specify which trigger fires first and last, but if you have more than two triggers on a table, you can't control the order in which the other ones fire. So how can selecting the order that a trigger is fired help your application's performance? To optimize trigger performance, you should specify the trigger most likely to roll back (for whatever reason) as the first trigger to fire. This way, if the trigger does cause a rollback, it only affects the first trigger.

    Let's say you have three triggers on a table, but instead of having the most likely to rollback trigger specified as the first trigger, you have it specified as the last. In this case, assuming it is rolled back, then all three triggers have to be rolled back. But if the trigger were the first trigger, and not the last, then only one trigger, not three triggers, would have to be rolled back. Reducing the number of triggers rolled back reduces SQL Server's overhead and boosts its performance.
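    The firing order is set with sp_settriggerorder. A minimal sketch (the trigger name is hypothetical):

```sql
-- Make the trigger most likely to roll back fire first for INSERTs;
-- @order accepts 'First', 'Last' or 'None'
EXEC sp_settriggerorder
    @triggername = 'trg_validate_sale',
    @order = 'First',
    @stmttype = 'INSERT'
```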


    The amount of time that a trigger takes to run is mostly a function of the number of tables referenced in the trigger and the number of rows affected by the code inside the trigger. Because of this, always try to minimize the number of tables referenced in a trigger, and minimize the number of rows being affected.

    In addition, keep the code in your triggers to the very minimum to reduce overhead. This is important because triggers typically fire during INSERTs, UPDATEs, and DELETEs, all of which can be common occurrences in OLTP applications. The more code that runs in the trigger, the slower each INSERT, UPDATE, and DELETE that fires will be.


    If your trigger includes a WHERE clause, don't forget to include any appropriate indexes for it to use. WHERE clauses hidden in triggers are easy to forget, and as with any statement, appropriate indexes can significantly affect their performance.

    One way to do this is to run your trigger code from Query Analyzer or Management Studio and then check the resulting execution plan. Doing so will quickly tell you if you need to add an appropriate index.


    If you have an INSERT, UPDATE, or DELETE statement that seems to be taking longer to run than you would expect, be sure to check whether there is a trigger associated with that table. The performance problem you are seeing may well be because of the trigger, not the data modification statement itself. Don't forget to tune trigger code just like you would any other code. Because trigger code is "hidden," many people forget about it and don't realize the potential performance problems it can cause.

    You can use Profiler and Query Analyzer/Management Studio to find out how triggers in your database are working.


    Don't use a trigger to enforce referential integrity if you have the option to use SQL Server's built-in referential integrity instead. Using SQL Server's built-in referential integrity is much faster than using a trigger to perform the same task.


    If you have a choice between using a trigger or a CHECK constraint to enforce rules or defaults within your SQL Server databases, you will generally want to choose a CHECK constraint as they are faster than using triggers when performing the same task.
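    For example, a rule like "support level must be between 1 and 3" (the table and constraint names here are illustrative) is better expressed as a CHECK constraint than as trigger code:

```sql
-- Enforce the rule declaratively instead of in a trigger
ALTER TABLE customers
ADD CONSTRAINT ck_customers_support_level
    CHECK (support_level BETWEEN 1 AND 3)
```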


    Try to avoid rolling back triggers because of the overhead involved. Instead of letting the trigger find a problem and rolling back a transaction, instead catch the error before it can get to the trigger (if possible based on your code). Catching an error early (before the trigger fires) consumes much fewer server resources than letting the trigger roll back.

    Catching errors before a transaction can fire a trigger can be done through code in the initiating transaction, and it sometimes can be accomplished by adding a constraint to the table to catch common mistakes. If a constraint catches an error, the trigger will not fire.


    Sometimes, for performance reasons, it is necessary to maintain denormalized data. For example, you might need to maintain derived data (such as cumulative data), in a table because it is too time consuming to calculate it on the fly from within SELECT statements. One way to easily maintain denormalized data is to use triggers. For example, every time a new sale is added to a Sales table, a trigger could fire, adding the value of the sale to a SalesTotal table.
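    A minimal sketch of such a trigger, assuming hypothetical Sales and SalesTotal tables:

```sql
CREATE TRIGGER trg_sales_total ON Sales
AFTER INSERT
AS
BEGIN
    -- Add the value of the newly inserted sales to the running total;
    -- "inserted" is the pseudo-table holding the new rows
    UPDATE SalesTotal
    SET total = total + (SELECT SUM(amount) FROM inserted)
END
```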


    The code that is included inside an UPDATE trigger runs every time its related table is updated. In most UPDATE triggers, the code in the trigger affects only certain columns, not all of them. Because of this, it is pointless (and a waste of SQL Server resources) to run all of the code in the trigger if the column or columns you are interested in have not been updated. In other words, even if a column you are not interested in is updated, the UPDATE trigger will fire and run its code.

    To help reduce the unnecessary running of code in an UPDATE trigger, you can take advantage of one of two functions: UPDATE() (available in SQL Server 2000/2005) or COLUMNS_UPDATED() (available in SQL Server 7.0 and 2000/2005). Either function can be used to test whether a particular column you are interested in has changed. Because of this, you can write code in your trigger that runs only if the column you are interested in has changed, and is skipped otherwise. This reduces the amount of work the trigger needs to do, boosting the overall performance of your database.

    The UPDATE() function is used to check only one column at a time. The COLUMNS_UPDATED() function can be used to check multiple columns at a time.
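    A sketch of both functions inside a hypothetical UPDATE trigger (the trigger name and the tested column are assumptions for illustration):

```sql
CREATE TRIGGER trg_customers_upd ON customers
AFTER UPDATE
AS
BEGIN
    -- Run the expensive logic only when support_level actually changed
    IF UPDATE(support_level)
    BEGIN
        PRINT 'support_level changed - do the real work here'
    END

    -- COLUMNS_UPDATED() returns a bitmask; for the first 8 columns,
    -- bit 2^(n-1) is set when the n-th column (by ordinal) was updated.
    -- Here 16 = 2^4 tests the hypothetical 5th column.
    IF (COLUMNS_UPDATED() & 16) > 0
        PRINT '5th column changed (bitmask test)'
END
```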

    xp_cmdshell stored procedure


    The extended stored procedure xp_cmdshell allows you to shell out and execute any valid operating system command. By default, all users allocated to the fixed server role sysadmin (i.e. the SA account) have execute access. This is a really tough command to administer. Why? Every project I have worked on to date has some need for it. Because SA is often the only login with sysadmin access, rather than creating a special account or looking at other work-arounds, the SA account is used to execute the task. Classic examples are stored procedures wrapped up in DTS jobs or supposed global "administrative" functions. This is problematic because:

  • Now "application" databases are using the SA account and rely on it to run their jobs.

  • Altering the SA password has application impact

  • xp_cmdshell will be executed under the security context in which the SQL Server service is running

    The first two points are obvious, and the fix is a simple matter of explicitly granting execute access to xp_cmdshell via a role and allocating that role to a managed and secure database user whose actual login is rarely used (i.e. DTS connectivity only, to run the command shell tasks). The third point is the most important. If the database user is a member of sysadmin then, more than likely, the user will have "local administrative" privileges to the server, as this is the user running the MS SQL Server and associated services. This is even more the reason not to use the SA account. If the user is not a sysadmin but has been granted execute access as described above, then the SQL Server Agent proxy account will be used instead.
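    The grant itself is a one-liner (the user name here is a hypothetical, dedicated low-privilege login):

```sql
-- Allow a managed, rarely-used login to run xp_cmdshell
-- instead of relying on the SA account
USE master
GRANT EXECUTE ON xp_cmdshell TO cmdshell_exec_user
```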

    The SQL Server Agent proxy account can be altered via xp_sqlagent_proxy_account (undocumented in BOL), which defines the account used to run the SQLServerAgent service. This may be your SQLServer NT user if you are not using the Administrator account (which is bad security practice), so you may decide to alter this to another user with restricted access rights for finer control to the operating system.

    In the end, xp_cmdshell should be carefully evaluated before using it. Ideally it should be totally disabled (revoke execute permission) for ultimate security. Look at use isql jobs scheduled via NT and look closer at the options available to you via DTS.
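    On SQL Server 2005, revoking it completely can be done through sp_configure (after enabling advanced options):

```sql
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
-- 0 disables xp_cmdshell; 1 would re-enable it
EXEC sp_configure 'xp_cmdshell', 0
RECONFIGURE
```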

    Example: Stored Procedure to delete files in a folder

    When your only way of managing SQL Server is through Enterprise Manager or Query Analyzer, you will have to do a lot of tasks through the command shell. If you find yourself routinely conducting some activities using xp_cmdshell, it is a good idea to wrap those routines into a stored procedure. This example takes an input parameter for the directory name and deletes all files within that directory. The directory name can be a UNC path. I put some comments within this procedure, so it should be easy to follow.
    CREATE PROC usp_DeleteFileInFolder
    @FolderName varchar(150)
    AS
    BEGIN
    DECLARE @DOSCommand varchar(200)
    --Check whether the user supplied a trailing \ in the directory name
    IF NOT (RIGHT(@FolderName, 1) = '\')
    SET @FolderName = @FolderName + '\'
    --Delete all files within this folder. Note: del *.* only deletes files,
    --not folders. /Q is the quiet switch so DOS will not ask for confirmation
    SET @DOSCommand = 'del /Q ' + '"' + @FolderName + '*.*' + '"'
    PRINT @DOSCommand
    EXEC master..xp_cmdshell @DOSCommand
    END
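    Calling it is then straightforward (the path below is just an example):

```sql
-- Deletes every file (but no subfolders) under the given directory
EXEC usp_DeleteFileInFolder 'C:\Temp\ExportFiles'
```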

    Thursday, September 3, 2009

    Using Indexes to Bypass Locks

    One of the issues you'll face with SQL Server is blocking, which is caused by other processes holding locks on objects. Until the locks on an object are released, the next process waits before proceeding. This is normal behavior within SQL Server to ensure data integrity, but depending on how transactions are run it can cause issues. Are there ways to get around blocking by using different indexes to cover the queries that may be running?


    In order to explain my point I am going to use one table and run queries from two different sessions. The table that we will be using has the following columns:

  • display name - my application uses this name to display customer-related information.

  • current quota - a number stating the current quota (of some irrelevant item) of a customer; in my scenario this field is monitored by a specific process for specific customers.

  • next month's quota - a number stating the planned quota for next month; note this field is used in long-running transactions calculating quotas for all customers.

  • support level - a number between 1 and 3 stating the level of support granted to a customer (1=high, 3=low); we have several applications, processes and transactions (business transactions) that deal with different customers according to their support level.

    Create Table Customers
    IF OBJECT_ID ('customers') IS NOT NULL
    DROP TABLE customers
    CREATE TABLE customers
    ( id INT NOT NULL,
    display_name VARCHAR(10) NOT NULL,
    current_quota bigint NOT NULL,
    next_month_quota bigint NOT NULL,
    support_level smallint NOT NULL,
    some_other_fields_size_1k CHAR(1000) NULL )
    ALTER TABLE customers
    ADD CONSTRAINT customers_pk PRIMARY KEY (id)
    As you can see I have added an additional column called "some_other_fields_size_1k". This column is simulating an additional 1K of customer data; I like adding it to my tests to make the optimizer respond to more authentic requests.

    Let's fill the table with some data, we will do the following:

  • Insert 1000 records

  • Every 150 customers I'll place a customer with support level = 1 and the rest will have to settle for support level = 3.

  • Give each customer some current quota and set next month's quota to 0.

    Table Customers - fill with data
    SET NOCOUNT ON
    DECLARE @i INT
    SET @i = 0
    WHILE @i < 1000
    BEGIN
    SET @i = @i + 1
    INSERT INTO customers ( id, display_name, current_quota, support_level, next_month_quota, some_other_fields_size_1k)
    VALUES ( @i, 'name-' + CAST (@i AS VARCHAR (10)), 100000 + @i,
    --making customers with id 150, 300, 450, 600, 750
    -- and 900 with support level 1
    CASE @i%150 WHEN 0 THEN 1 ELSE 3 END, 0, 'some values ...')
    END
    SET NOCOUNT OFF

    As mentioned earlier, there are two processes that will be running:

  • A general dashboard application that checks the status of the current quota of our top customers.

  • A planning module that performs various calculations and changes quotas of customers based on calculations done by the application.

    Dashboard Application SELECT Statement
    SELECT display_name, current_quota 
    FROM customers 
    WHERE support_level = 1 
    ORDER BY 1
    Planning Module UPDATE Statement

    UPDATE customers 
    SET next_month_quota = <any quota>
    WHERE id = <any id>
    I would like to state this scenario assumes that we are working with the default isolation level (READ COMMITTED) of Microsoft SQL Server.

    READ COMMITTED isolation level allows good concurrency with reasonable integrity (some might argue it is not that good and not that reasonable - and we'll get to that but hey, most of the OLTP applications we know use it).
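    If you are not sure what isolation level your session is running under, you can check it on SQL Server 2005 before reproducing the scenario:

```sql
-- 1 = READ UNCOMMITTED, 2 = READ COMMITTED (the default),
-- 3 = REPEATABLE READ, 4 = SERIALIZABLE, 5 = SNAPSHOT
SELECT transaction_isolation_level
FROM sys.dm_exec_sessions
WHERE session_id = @@SPID
```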

    The core of the READ COMMITTED isolation level is composed of two building blocks called exclusive locks and shared locks that comply with the following guidelines:

  • An exclusive lock (A.K.A. "X lock") is taken for any resource that requires a write.

  • A shared lock (A.K.A. "S lock") is taken for any resource that requires a read.

  • S lock can be taken on a specific resource unless there is an X lock already taken for it.

  • X lock can be taken on a specific resource as long as there is not already another lock taken (not even an S lock).

    Now to the challenge: there are cases where the planning module (long running transactions) locks a specific record and does not allow the dashboard application to view specific information for a long time; believe me, it happens - I know - I have built this scenario :).

    We all know that indexes boost database performance by keeping an ordered list (tree based) of keys and a linkage to the actual data location. So when our application has some performance issues the intuitive solution would be to look at the SQL statement that does not perform well and possibly add indexes. Here is our statement that we are trying to run that is getting blocked.

    Dashboard Application SELECT Statement
    SELECT display_name, current_quota
    FROM customers 
    WHERE support_level = 1 
    ORDER BY 1
    We could add an index on support_level, but since that index alone will not solve our blocking issue, we will not bother adding it yet.
    Let's start from the beginning...

    Phase 1 - Check for bottlenecks

    In short, we first check the database machine's CPU, disk and memory. When we see it is not that, some of us would turn to the event log. Not finding anything unusual in the event log, scratching our head over enough time - the locks issue may pop up (hard to admit but the locks part is somehow always a surprising part :) - I usually turn to check the locks when I see that some SQL is stuck and the machine is saying: "I am going to sleep, please wake me when you need something".

    Phase 2 - Drill down into locks

    In order to view the locks there is of course the famous sp_lock and even sp_lock2, but I like to use my own get_locks procedure, which also gives additional information regarding the locked objects (e.g. owner object, index name, etc.). It can be created on any SQL Server 2005 server with the following SQL script:

    get_locks stored procedure - creation script

    CREATE PROCEDURE get_locks 
    AS
    SELECT OBJECT_NAME(tl.resource_associated_entity_id%2147483647) obj,
    tl.request_session_id, tl.request_mode, tl.request_status, tl.resource_type,
    OBJECT_NAME (pa.OBJECT_ID) owner_obj, pa.rows, tl.resource_description,
    si.name index_name, si.type_desc index_type_desc 
    FROM sys.dm_tran_locks tl 
    LEFT OUTER JOIN sys.partitions pa 
    ON (tl.resource_associated_entity_id = pa.hobt_id) 
    LEFT OUTER JOIN sys.indexes si 
    ON (pa.OBJECT_ID = si.OBJECT_ID AND pa.index_id = si.index_id) 
    WHERE resource_type NOT LIKE 'DATABASE' 
    ORDER BY tl.request_session_id, tl.resource_type 

    Now let's check whether there are any locks, by running:

    exec get_locks
    We get the following result, which shows there are no current locks.

    Now let's open two sessions (SQL Server Management Studio query windows), in the first (we'll call it Session A) we'll run the following SQL:

    Session A - Dashboard Application SELECT

    SELECT display_name, current_quota 
    FROM customers 
    WHERE support_level = 1 
    ORDER BY 1 
    This statement (when not blocked by others) is fetching the display_name and current_quota of customers with support level = 1:

    In the second session (we'll call it Session B) let's run the following update statement:

    Session B - Planning Module UPDATE


    BEGIN TRAN --opened explicitly and deliberately left uncommitted,
    --so the locks remain visible in the dm views.

    UPDATE customers 
    SET next_month_quota = 2500 
    WHERE id=150 
    This is returned:

    The above statement opened a transaction, updated a record and did not close the transaction (with either commit or rollback command), so it is currently locking some of the resources.

    Let's review the locks with get_locks

    exec get_locks 
    This time the result would be:

    What we can see here is the process # 56 (Our Session B) has locked:

  • The object 'customers' which is a table with "IX lock".

  • The page '1:334' which is actually page 334 in file #1 with "IX lock".

  • The key '(96009b9e9046)', which is an indicator describing the record with id = 150 in 'customers_pk', with an "X lock".

    An "IX lock" is a way to notify the database not to allow any shared locking that would block my update later on. When an object is locked with an "IX lock":

  • It is assumed that a lock of lower granularity will acquire a specific "X lock" on it.

  • Only IX and IS locks are allowed to be acquired on the same resource.

    Now let's re-run the dashboard application query within Session A:
    Session A - Dashboard Application SELECT
    SELECT display_name, current_quota 
    FROM customers 
    WHERE support_level = 1 
    ORDER BY 1 
    This time we can see that the query is blocked: it stays in an executing state and never returns.

    Re-checking the locks with get_locks, we can see the following lock status:
    exec get_locks

    As you can see the previous locks from session B (spid = 56) are still there and we've got some new locks from session A (spid = 55):

  • Same as Session B, this session also succeeded in placing an intent lock (this time an "IS lock") on:

  • 'customers' object (table).

  • Page '1:334' (which is actually page 334 in file #1).

  • Unlike Session B, when trying to place an "S lock" on the key '(96009b9e9046)' (the indicator describing the record with id = 150 in 'customers_pk'), this session finds that there is already an "X lock" there and goes into a 'WAIT' state (see the request_status column).

    Phase 3 - Trying an index

    We started the solution description by saying that adding an index on the support_level column in order to boost performance still won't help, but it will get us closer - let's try:
    -- Before doing that we'll need to:
    -- 1. Stop Session A (which is still running).
    -- 2. Rollback Session B (which is still locking).
    -- 3. Run:
    CREATE INDEX customers_ix1
    ON customers (support_level)
    -- 4. Rerun Session B (for locking)
    -- 5. Re-run Session A (to be blocked)

    OK, now that we have done the above if we run Session B UPDATE, Session A SELECT and get_locks, we will get the following:

    As you can see, the same locks as before are in place, plus the customers_ix1 index got a new page-level "IS lock" (page 77 in file 1) and a new "S lock" on key '(970081334a1d)', both placed by Session A.

    Some might ask why Session A succeeded in placing a lock on customers_ix1 (line 1 in the above table). Let's check the execution plan for the select statement by running:
    -- 1. Stop running Session B
    -- 2. Press Ctrl-T to change results to text:
    -- 3. Run:

    SELECT display_name, current_quota
    FROM customers
    WHERE support_level = 1
    ORDER BY 1
    This will yield:

    As you can see Session A is accessing the data through the customers_ix1, so it tries to place an "S Lock" on it (and of course succeeds).

    Well, this encapsulates two great hints for our solution:

  • When performing a write operation, SQL Server does not lock indexes that do not contain any of the columns being modified (e.g. our Session B did not lock customers_ix1; note it would not have locked it whether the index was used in the plan or not, since next_month_quota is not part of it); only the relevant data row is locked.

  • When performing a read operation, SQL Server locks only the objects (e.g. indexes, data rows, etc.) that it found and used within its access path.

    So the current status is that we can use indexes as a solution for Session A's activity, as long as they do not require access to the actual data row.

    By adding an index that will cover all columns of the table that are required for Session A's specific query (A.K.A. Covering Index), we will have a solid bypass without requiring specific access to the actual data row. Having achieved this, Session A will not be blocked while fetching data from a row that is actually blocked for writing (as done by Session B).

    Adding the index is done by the following:

    -- Before doing that we'll need to: 
    -- 1. Rollback Session B which is still locking. 
    -- 2. Run: 
    CREATE INDEX customers_ix2 
    ON customers (support_level, display_name, current_quota) 
    Now let's re-run the locking update of Session B and the select for Session A.

    Session A
    SELECT display_name, current_quota  
    FROM customers  
    WHERE support_level = 1 
    ORDER BY 1   
    Trying to run Session A's SELECT statement (while Session B's transaction is still locking) will produce the following results without being blocked:

    Review New Query Plan

    SELECT display_name, current_quota 
    FROM customers 
    WHERE support_level = 1 
    ORDER BY 1  

    the result is:

    As you can see the only database object that is participating in fetching the data is customers_ix2 which we've just created. By using this covering index we were able to use a totally different index than the update statement and therefore have the statement complete without issue.

    Some might argue that we did not need to go through a whole article just to add a covering index, and they might be correct. Still, there are certain situations where a covering index is not that straightforward, but it can be used to overcome locking challenges such as the one in the example above.