April 21, 2017
One benefit of being covered under our Managed Services program is daily monitoring of SQL Server wait times and tuning of the database to decrease those waits. Moser just had a project-based client transition into our Managed Services program and it was time to begin evaluating their query wait stats and determining which ones to attack first. There was an obvious one that stood out like a sore thumb that was screaming for attention.
It was a single query that ran for approximately 9 hours and 30 minutes every single day. It was full of linked server references from inserting from the local instance to the remote and vice versa, multiple joins on tables from linked servers, and a NOT IN sub-select to a local table.
Hard to spot I know...
Alright kids, can you find the slow query?
Looking at the waits from the instance that ran the job, there were mostly CPU waits with a portion of OLEDB waits. On linked instance, it shows all OLEDB waits for the same timeframe. CPU waits are just the processor(s) crunching on data in the way that SQL Server determined was the best way based on the SQL code written and the underlying table and index structure. OLEDB waits are time taken the transfer data over the network from one machine to another.
The first thing I noticed in the query plan was that a Clustered Index Scan was happening for the local table sub-select. This was due to there being no non-clustered index for the column in the WHERE clause of this query. I thought it odd not to have an index on a date field. The query optimizer had no choice but to go to the Clustered Index which was not in order by CreatedDate, thus the Index Scan having to thumb through every 8k page looking for this value. I modified the clustered index to be more SARGable and added a non-clustered index that included the CreatedDate column so the plan was now performing an Index Seek operation. The old operation took 97% of the plan and the new one took only 86%. However, the entire SQL job dropped from 9.5 hours to 1.5 hours with that simple change. There is a lot more going on in that query than just this one INSERT -> SELECT statement, so I was very surprised that this change made such a difference. However, I wasn't done yet.
Falling off a cliff
Falling off a smaller cliff, but a cliff nonetheless
Most concerning was a query that involved joins amongst seven tables on a linked server with the only local table reference being in the NOT IN sub-select tuned above. Most the tables in this query were remote, so my idea was to translate this stored procedure to run native on the remote server and only grab items from the (currently) local server as needed. A typical linked server trick is to grab the data needed into temp tables and deal with the data there as needed, particularly if there are joins involved between servers. The plan was to use that trick as well as one more. Instead of a "NOT IN (SELECT column", often a LEFT JOIN between the tables with a "WHERE column IS NULL" is faster. In fact, it was by a huge margin again faster. With this change, the job now ran in just under 16 minutes.
How 'bout them apples?
So from nine and half hours (570 minutes), to 1 and a half hours (90 minutes), to a quarter hour (16 minutes) I took the longest running query on two instances to something that gets lost in the noise of many other queries' wait times. So much fun!