I am going to start this post with a bit of a disclaimer. At heart, I am a .NET developer and not a SQL expert so if there is anything below you disagree with, I welcome the constructive criticism in the comments! Hopefully these SQL query performance tuning tips will be helpful all the same.
These SQL tips and tricks have come from experience of optimising SQL queries over the last year. In several cases these changes have taken queries taking hours down to a few minutes or even seconds.
SQL optimisation is a bit of an art. What will work on one server and data set might not always work on another system, so please take the below with a pinch of salt and do your own analysis.
Loops are a stable construct in any programmers tool box. SQL can do loops as well however they should be used with caution. In .NET we are used to looping over data. Dealing with data a row at a time can also make the logic simpler.
SQL has While loops and Cursors for this process. However, in most cases where I have seen cursors used they can be replaced with set based queries and joins. Loops and Cursors are fine for small one off tasks or batch operations but in my opinion, they have no place in a high performance systems.
If you are doing a large number of updates or inserts I would always recommend using a loop (see batching below).
Functions in the Where clause
Sometimes you need to transform the data before you can specify your filter criteria. This is sometimes done by using a function in the where clause. The most common place I have seen this is around dates. In some cases you aren’t bothered at what time something happened, just that it happened that day. So you will see something like this:
SELECT * FROM dbo.SomeTable WHERE CONVERT(DATE, CreatedDate) = '2017-01-10'
This is a problem. Especially if you are using user defined functions, as SQL can’t interpret what the value is going to be until after it calls the function. Dependent on your other criteria this could cause SQL to do a full table scan. It is therefore much better to use a date range in these cases.
SELECT * FROM dbo.SomeTable WHERE CreatedDate >= '2017-01-10' AND CreatedDate < '2017-01-11'
Doing the above makes it easy for SQL to interpret your query and therefore use the correct index. Note, I have used ‘2017-01-11’ instead of ‘2017-01-10 23:59:59’ as the later will miss off the last second of the day.
Logic in the Where clause
In a similar vain to my point above. Any logic that is done in the where clause needs to be removed. I have often seen this done with CASE statements in the criteria. If SQL has to do some processing before it can understand your criteria then it is almost certainly going to be doing more work than it needs to.
SELECT * FROM dbo.SomeTable WHERE CASE WHEN Number > 1000 THEN 'High' ELSE 'Low' END = Threshold
The above is an overly simplified case but if you find you need to do some transformation of the data to be able to query it then this should be done in a temp table.
CREATE TABLE #Temp ( Id BIGINT, Number BIGINT, Threshold VARCHAR(4) ) INSERT INTO #Temp (Id, Number, Threshold) SELECT Id, Number, CASE WHEN Number > 1000 THEN 'High' ELSE 'Low' END AS Threshold FROM dbo.SomeTable SELECT * FROM #Temp WHERE Threshold = 'High'DROP TABLE #Temp
Again still too simple but I hope you get the point. I would then recommend if the data is large enough to put an index on your temp table. Depending on the query you have to take into account the cost of creating the index on top of how fast it makes your query. After all it is going to be doing this every time your SP runs.
Implicit Data Conversion
This one isn’t going to have as much of an impact compared to the 3 above but every millisecond counts! The most common of these is specifying a varchar instead of nvarchar in your query criteria.
However, I have seen integer columns being queried with varchar and varchars being queried with integers.
Every time you do this SQL has to convert the value on the fly instead of doing a direct comparison. This uses a little bit more CPU and can take a little longer while doing it.
SELECT * FROM dbo.SomeFrom WHERE Counter = ‘5'
where you should be using:
SELECT * FROM dbo.SomeTable WHERE Counter = 5
If you look at the information supplied by an actual query plan then it will tell you about these implicit conversions.
UPPER and LOWER
It is quite common in C# to convert the case of a string to upper and lower when doing comparisons. SQL also has upper and lower functions so it is quite common to see this:
SELECT * FROM dbo.Currency WHERE UPPER(CurrencySymbol) = 'GBP'
Apart from violating my no functions in criteria rule it does seem sensible. However, SQL has a thing called collation. By default the collations for UK is Latin1_General_CI_AS and US is SQL_Latin1_General_CP1_CI_A (source).
The CI in those names stands for Case Insensitive. Which means you can actually do this:
SELECT * FROM dbo.Currency WHERE CurrencySymbol = 'GBP'
or even this:
SELECT * FROM dbo.Currency WHERE CurrencySymbol = 'gBp'
and still get the same result while saving yourself some CPU cycles in the process.
Temp Tables vs Table Variables
If you need to store data temporarily as part of your query you have 2 options.
Temp Tables which are created like this:
CREATE TABLE #TempTableName ( SomeId BIGINT, SomeOtherData varchar(50) )
Or a table variable which is created like this:
DECLARE @TempTableVariableName TABLE ( SomeId BIGINT, SomeOtherData varchar(50) )
Both can be used to store data, however table variables are only really meant to be used for 1 or 2 rows of data. This is because SQL doesn’t provide any statistics for table variables and you can’t add indexes to them to improve performance.
In some cases such as functions, temp tables aren’t an option, although if you need a big table in your function it is probably doing too much. In either case, putting large amounts of data into a temp table or table variable is not good either as this could bloat tempDB.
If you have to deal with large amounts of data it is sometimes better to put it into a staging table to work off. This only really works for batch reports, I am still looking for a better way other than temp tables in real-time reports.
If you need to do transformation as well of summation in a large dataset you might be tempted to use a temp table. This is fine if your data set is of a medium size but what if you have over a million records to process? You don’t really want to be storing all of that information in temp DB while your query runs.
You could insert the transformed data into a table however if you do this in bulk the initial query will again be stored in temp DB before being inserted into your table. The solution is to do your transformation in batches and store the data in a staging table.
DECLARE @Start DATETIME = '2016-01-01' DECLARE @End DATETIME = '2017-01-01' DECLARE @Next DATETIME = DATEADD(DAY,1,@Start) WHILE (@Next <= @End) BEGIN INSERT INTO dbo.SomeStagingTable (Id, Value1, Value2) SELECT Id, CASE WHEN IsFlag = 1 THEN 'Y' ELSE 'N' END, CreatedDate FROM dbo.SomeTable WHERE CreatedDate >= @Start AND CreatedDate < @Next SET @Start = @Next SET @Next = DATEADD(DAY,1,@Start) END
By batching your data by day, the amount of data stored at once in temp DB will be smaller and you will usually find it will be a lot quicker. Just remember to add indexes to your staging table if you have a lot of data.
SELECT only what you need
This is more apparent on tables with a large number of columns but it is common sense to only select what your application needs. If you have a 100 column table and your application does a
SELECT * then that is a lot of data that is being retrieved and sent to your application. If your application then only uses 1 or 2 columns this is a massive waste of resources.
You may find if you only need a few columns that those will be covered by an index saving a lot of expensive key lookups to get data you aren’t even using.
Avoid using correlated sub queries
New SQL developers will sometimes take the easy route and use a sub query to return some related data.
SELECT C.Name, C.Email, (SELECT CountryName FROM dbo.Country WHERE CountryId = C.CountryId) FROM dbo.Customers C
The problem with this query is they tend to run row by row and will therefore have bad performance.
This could be done with a left join which will run much quicker.
SELECT C.Name, C.Email, CN.CountryName FROM dbo.Customers C LEFT JOIN dbo.Country CN ON CN.CountryId = C.CountryId
This bad coding is often done in C# as well and is around retrieving too much information when only a little will do.
IF (SELECT COUNT(*) FROM dbo.Customers WHERE FirstName = @FirstName AND LastName = @LastName) > 0) BEGIN PRINT ‘Customer exists' END
Why do we need to count all rows when we only need to see if one exists.
SQL has EXISTS for this purpose. So how about this:
IF EXISTS (SELECT * FROM dbo.Customers WHERE FirstName = @FirstName AND LastName = @LastName) BEGIN PRINT 'Customer exists' END
We are still grabbing too much information here. SQL might be clever and ignore the fact you have done
SELECT * but I wouldn’t give SQL the option.
I have found the fastest thing to use for existence checks is
SELECT TOP 1 1 . With this you will get a 1 if it exists and a NULL if it doesn’t. I have seen people do just
SELECT 1 before but this will give you a 1 for every record returned. The code can now be changed to this.
IF EXISTS (SELECT TOP 1 1 FROM dbo.Customers WHERE FirstName = @FirstName AND LastName = @LastName) BEGIN PRINT 'Customer exists' END
Analysing Execution Plans
A post about SQL performance wouldn’t be complete without talking about SQL execution plans. To really improve the performance of your queries you need to understand and use execution plans. I have found SentryOne Plan Explorer to be really helpful here in making sense of these plans and highlighting particular pain points. I am not going to go into details here though as execution plans warrant a post all by themselves.
Execution plans are incredibly useful, especially the actual execution plans. These are the things you need to pay particular attention to when reading execution plans.
- Missing indexes – if you query would benefit from an index then SQL will tell you. This can obviously have massive performance gains.
- Actual vs Estimated rows – The estimated rows come from SQL statistics on the tables you are using. If there is a massive difference here then this is likely to have performance implications as SQL probably isn’t using the correct execution plan. It will be a bit of trial and error to see if your query can be written differently to improve this.
- Scan vs Seek – SQL will tell you what operation was done to retrieve the information. Generally you want to avoid Full Table Scans like the plague and in some cases Index Scans as these require reading through the whole table or index until it finds the correct value. Ideally you want Index Seeks especially Clustered Index Seeks as these will be the fastest. If your index doesn’t include the columns that you need, a Key Lookup is done to retrieve the data. These can’t aways be avoided but if required you can add an index that uses these columns to make the key lookup go away.
I will write a post in the future about how to read execution plans as there is so much I haven’t covered here.
It almost goes without saying that using the correct indexes can improve performance by huge amounts. Again this probably warrants a post by itself and I am not knowledgable enough yet to write it!
The execution plans will tell you about missing indexes just make sure they are added where appropriate. You will want to consult your resident DBA before adding in any new indexes though.
Dynamic SQL seems like a blessing. It gives you the ability to create fewer SPs that you can bend to your willing from the application layer. I am sure there are some great uses for dynamic SQL but generally it should be approached with great caution (like when you have forgotten something you wife has asked you to do 🙂 ).
The problem with dynamic SQL is that the execution plans for your queries won’t be cached. It is much better and safer (avoids SQL Injection Attacks) to use parameterised SQL. Why not even do it via Dapper or some other ORM and avoid the stored procedure altogether.
Use SQL or Not Use SQL
SQL is great for doing calculations and large scale data manipulation. As your application grows it is very easy to add that bit of logic at the stored procedure level. Before you know it all you business logic is stored in the database with no unit tests validating what it is doing.
Keep your database for data and leave the business logic in your application. Your database CPU will be able to take a breather and your DBA will thank you for it.
Well Formatted Code
Did you know that making sure your code is nicely formatted with correct indentation can actually improve performance?
OK, I am pulling your leg but seriously do your fellow developers a favour and make sure your code is easily readable. SQL is code too and there is nothing worse than trying to make sense of messy SQL code.
I have tried to cover the main aspects that I have seen that have caused performance issues for me. SQL is a funny beast and what works in one environment or even one query might not work in another. Therefore, it is always important to performance test these queries especially in high traffic systems.