A Case Study in SQL Performance Troubleshooting

The Problem

A few months ago, we started to get reports of performance issues from one of our clients. The reports were pretty light on detail, but the gist was that SSAS cube reports built in excel were running slower than usual and sometimes failing to refresh, and our application was slow to load data (which usually occurs if queries to the SQL database are running slowly).

Both SSAS and SQL Server run on the same server for this client, so performance issues across both would typically point towards resource issues on the server. We logged in to check server resource usage and everything looked normal except that SQL Server had an unusually large amount of paged memory. Of 120GB of process memory, only 12GB was in physical memory and the rest was paged, and this despite the fact that there was around 100GB of free physical memory!

Our initial guess was memory constraints on the VM Host which is managed by our client, but their IT department reported no memory issues on the host. Without much else to go on, we restarted the SQL Server service, and hoped for the best. Unfortunately the issue continued to re-occur sporadically, and we couldn’t figure out why SQL Server was paging its memory when there seemed to be so much free memory on the server.

The Root Cause

A week later we logged onto the server again after reports of another performance incident, and this time we found the culprit in the very act of the crime! SSAS memory usage was unusually high (40GB+ vs a normal 1-5GB), and as we watched, this rose to 100GB before rapidly dropping back down to the normal range. A few minutes later this occurred again, and we were able to use the SSAS DMVs to isolate this down to a single user running a cube report in excel. It seems SSAS was gobbling up memory for running this particular query, forcing SQL Server to page its less used memory to disk. Although SSAS quickly dropped back to normal RAM usage after the query finished, SQL Server was much slower in restoring its paged memory, and in most cases needed a restart to get back to normal.

We got the client to send us a copy of this godzilla excel report, and it actually wasn’t too large, but the query hitting the cube was quite complex, with 32 columns, and 6 levels of dimensionality in the rows (making about 6000 rows). With this design, the report took several minutes to run, and 100GB of memory on the server…

With the root cause clearly identified, it only took about half an hour to make a few simple changes to restructure and split the query to get the same data in about a minute with roughly 10GB of RAM usage on SSAS.

The Moral

Transient performance issues are a classic time sink and challenge to replicate and debug, but as it turns out there is a better way to do this than just hoping you connect to the server at the right time. We could (and should) have shortcut straight to the root cause if we had just set up some simple logging on the server when the issue was first identified. In this case, this we could have used the easy and simple tracing/logging mechanisms built into Windows, SSAS and SQL Server, e.g:

  1. A trace in Performance Monitor, including performance counters for SSAS and SQL Server memory usage
  2. A trace on SSAS for all queries slower than X seconds
  3. A trace on SQL Server for all queries slower than X seconds

The Code

And because you should always script everything, here’s the code to create/delete these simple traces in SQL Server and SSAS.

    -- Create Trace
    declare @trace_id int, @maxfilesize bigint = 50
    exec sp_trace_create @trace_id output, 2, N'C:\temp\DBQueryTrace', @maxfilesize = @maxfilesize, @stoptime = null, @filecount = 20

    -- Add RPC:Completed event
    exec sp_trace_setevent @trace_id, 10, null, 1
    exec sp_trace_setevent @trace_id, 10, 1, 1
    exec sp_trace_setevent @trace_id, 10, 10, 1
    exec sp_trace_setevent @trace_id, 10, 12, 1
    exec sp_trace_setevent @trace_id, 10, 13, 1
    exec sp_trace_setevent @trace_id, 10, 14, 1
    exec sp_trace_setevent @trace_id, 10, 15, 1
    exec sp_trace_setevent @trace_id, 10, 16, 1
    exec sp_trace_setevent @trace_id, 10, 17, 1
    exec sp_trace_setevent @trace_id, 10, 18, 1
    exec sp_trace_setevent @trace_id, 10, 35, 1

    -- Add SQL:BatchCompleted event
    exec sp_trace_setevent @trace_id, 12, null, 1
    exec sp_trace_setevent @trace_id, 12, 1, 1
    exec sp_trace_setevent @trace_id, 12, 10, 1
    exec sp_trace_setevent @trace_id, 12, 12, 1
    exec sp_trace_setevent @trace_id, 12, 13, 1
    exec sp_trace_setevent @trace_id, 12, 14, 1
    exec sp_trace_setevent @trace_id, 12, 15, 1
    exec sp_trace_setevent @trace_id, 12, 16, 1
    exec sp_trace_setevent @trace_id, 12, 17, 1
    exec sp_trace_setevent @trace_id, 12, 18, 1
    exec sp_trace_setevent @trace_id, 12, 35, 1

    -- Add filter on Duration > 1000
    declare @duration bigint = 1000000
    exec sp_trace_setfilter @trace_id, 13, 0, 2, @duration
    exec sp_trace_setfilter @trace_id, 1, 0, 7, N'waitfor%'

    -- Start the trace
    exec sp_trace_setstatus @trace_id, 1
    -- Delete SQL Trace
    declare @trace_id int = (select traceid from fn_trace_getinfo(null) where cast(value as varchar(max)) like 'C:\temp\DBQueryTrace%')
    if @trace_id is not null
    begin
      exec sp_trace_setstatus @trace_id, 0
      exec sp_trace_setstatus @trace_id, 2
    end
    <!-- Create Trace for Command, Query and Error events -->
    <Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
        <Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
            <ObjectDefinition>
                <Trace>
                    <ID>Blueshift Query Trace</ID>
                    <Name>Blueshift Query Trace</Name>
                    <LogFileName>C:\temp\SSASQueryTrace.trc</LogFileName>
                    <LogFileAppend>1</LogFileAppend>
                    <AutoRestart>0</AutoRestart>
                    <LogFileSize>5000</LogFileSize>
                    <LogFileRollover>1</LogFileRollover>
                    <Events>
                        <Event>
                            <EventID>15</EventID>
                            <Columns>
                                <ColumnID>32</ColumnID>
                                <ColumnID>1</ColumnID>
                                <ColumnID>25</ColumnID>
                                <ColumnID>33</ColumnID>
                                <ColumnID>2</ColumnID>
                                <ColumnID>3</ColumnID>
                                <ColumnID>28</ColumnID>
                                <ColumnID>36</ColumnID>
                                <ColumnID>37</ColumnID>
                                <ColumnID>41</ColumnID>
                                <ColumnID>42</ColumnID>
                                <ColumnID>44</ColumnID>
                                <ColumnID>45</ColumnID>
                            </Columns>
                        </Event>
                        <Event>
                            <EventID>16</EventID>
                            <Columns>
                                <ColumnID>1</ColumnID>
                                <ColumnID>2</ColumnID>
                                <ColumnID>3</ColumnID>
                                <ColumnID>5</ColumnID>
                                <ColumnID>6</ColumnID>
                                <ColumnID>24</ColumnID>
                                <ColumnID>25</ColumnID>
                                <ColumnID>28</ColumnID>
                                <ColumnID>32</ColumnID>
                                <ColumnID>33</ColumnID>
                                <ColumnID>36</ColumnID>
                                <ColumnID>37</ColumnID>
                                <ColumnID>41</ColumnID>
                                <ColumnID>42</ColumnID>
                            </Columns>
                        </Event>
                        <Event>
                            <EventID>17</EventID>
                            <Columns>
                                <ColumnID>24</ColumnID>
                                <ColumnID>3</ColumnID>
                                <ColumnID>25</ColumnID>
                                <ColumnID>28</ColumnID>
                                <ColumnID>32</ColumnID>
                                <ColumnID>33</ColumnID>
                                <ColumnID>36</ColumnID>
                                <ColumnID>37</ColumnID>
                                <ColumnID>41</ColumnID>
                                <ColumnID>42</ColumnID>
                            </Columns>
                        </Event>
                        <Event>
                            <EventID>9</EventID>
                            <Columns>
                                <ColumnID>32</ColumnID>
                                <ColumnID>1</ColumnID>
                                <ColumnID>25</ColumnID>
                                <ColumnID>33</ColumnID>
                                <ColumnID>2</ColumnID>
                                <ColumnID>3</ColumnID>
                                <ColumnID>28</ColumnID>
                                <ColumnID>36</ColumnID>
                                <ColumnID>37</ColumnID>
                                <ColumnID>41</ColumnID>
                                <ColumnID>42</ColumnID>
                                <ColumnID>44</ColumnID>
                                <ColumnID>45</ColumnID>
                            </Columns>
                        </Event>
                        <Event>
                            <EventID>10</EventID>
                            <Columns>
                                <ColumnID>1</ColumnID>
                                <ColumnID>2</ColumnID>
                                <ColumnID>3</ColumnID>
                                <ColumnID>5</ColumnID>
                                <ColumnID>6</ColumnID>
                                <ColumnID>24</ColumnID>
                                <ColumnID>25</ColumnID>
                                <ColumnID>28</ColumnID>
                                <ColumnID>32</ColumnID>
                                <ColumnID>33</ColumnID>
                                <ColumnID>36</ColumnID>
                                <ColumnID>37</ColumnID>
                                <ColumnID>41</ColumnID>
                                <ColumnID>42</ColumnID>
                            </Columns>
                        </Event>
                    </Events>
                </Trace>
            </ObjectDefinition>
        </Create>
    </Batch>
    <!-- Delete Trace -->
    <Delete xmlns="http://schemas.microsoft.com/analysisservices/2003/engine" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
        <Object>
            <TraceID>Blueshift Query Trace</TraceID>
        </Object>
    </Delete>