performance tuning, SQL Server

Slow SQL Server Query and Key Lookup

Why do some queries perform poorly even with seemingly correct indexes in place?  On top of that, your SQL Server seems to be running hot, gobbling up disk IO and running up the CPU.  The culprit might be a Key Lookup, and fortunately this can be a relatively easy problem to fix, once it is found.  Let me illustrate with a simple query on a table containing 20 million rows of film ratings, with a non-clustered index on the movieId column.

SELECT movieId, rating
FROM rating
WHERE movieId in (1694, 1946, 6166);
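
For reference, the pre-existing single-column index would have looked something like this (the index name is assumed here, since only the key column matters for the example):

-- Assumed definition of the original index on movieId only (name is illustrative)
CREATE NONCLUSTERED INDEX [IX_rating_movieId] ON [dbo].[rating]
([movieId] ASC);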

This query returned just under 5500 rows, taking 2.1 seconds on my laptop and almost 100 ms of CPU time.  These measurements were taken by running the following diagnostic commands in SQL Server Management Studio (before executing the query):

dbcc freeproccache
dbcc dropcleanbuffers
set statistics io on
set statistics time on

We will run the above commands each time we test the query, ensuring that SQL Server does not reuse cached execution plans or table data in memory, as well as capturing our diagnostic data.  In addition, we can look at the query execution plan in Management Studio by enabling “Show Actual Execution Plan” in the toolbar.  You can also get to this by right-clicking in the query window or from Query in the menu bar.

[Figure: Key Lookup query plan]

The execution plan shows that 100 percent of the query cost (it’s really over 99.9%, just rounded up) comes from the Key Lookup operator.  Above that is the seek operation on the movieId index, which we expect to see for this query.  Since SQL Server execution plans read from right to left, the next operator is a Nested Loops join.  We don’t have any joins in this query, so what exactly is going on here?

The index in this case satisfies the filter condition on movieId in the WHERE clause.  This returns almost 5500 matching rows, but we also want the rating in the result (the SELECT portion of the statement).  Since this column is not part of the index, the optimizer runs a Key Lookup to retrieve it.  This manifests as a clustered index seek, returning the entire row for every match on movieId.  That’s around 5500 clustered index seeks, adding significant IO cost to this query.  In addition, the Nested Loops join which follows ties together the results of the two preceding operators.  This adds CPU overhead proportional to the number of Key Lookups, that is to say one for every row matching our WHERE condition.
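
If you want to hunt for this pattern across an entire workload rather than one query at a time, you can search the plan cache for Key Lookup operators.  Something along these lines should do it (a sketch only; querying plan XML is not cheap, so run it when the system is quiet):

-- Look for cached plans that contain Key Lookups.
-- The Lookup attribute on the IndexScan element marks key/RID lookups in showplan XML.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
       cp.usecounts,
       st.text AS query_text,
       qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE qp.query_plan.exist('//IndexScan[@Lookup="1" or @Lookup="true"]') = 1
ORDER BY cp.usecounts DESC;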

The solution is to add an Included Column to the existing nonclustered index. Ordinarily an index consists of one or more Key Columns that are used for retrieving data based on the WHERE or JOIN conditions.  When the remaining columns in your query are defined as Included Columns in the same index, there is no need to perform Key Lookups to retrieve this data.  This concept is known as creating a covering index on your query.  In the past, creating an index to cover a query meant cramming as many columns as possible into the index key, hopefully arranged in the correct order.  This resulted in overly large indexes and often difficult tuning efforts.  Included Columns provide the query covering benefit in a more compact, less complex form.

Now we drop the existing nonclustered index and create a new one with the Included Column:

CREATE NONCLUSTERED INDEX [IX_rating_incl] ON [dbo].[rating]
([movieId] ASC)
INCLUDE ([rating]);

Now the query execution time drops from over 2 seconds to a mere 79 ms (0.079 seconds)!  Logical reads are a measure of IO cost, which can translate into disk and memory pressure.  Before adding this index our query ran up more than 32,000 logical reads.  With the new index, logical reads decreased by 99.9%, to just 26!  The CPU overhead from the Nested Loops join is gone as well.  Take a look at the new query plan:

[Figure: query plan for the covering index]

There is a single index seek operator, assigned a cost of 0.018 by the optimizer.  Formerly, the query plan had Key Lookups adding up to a cost of around 15.7.  Note that query plan costs in SQL Server do not mean much as absolute values, as they are intended for relative comparison.  With this purpose in mind, the costs here do clearly indicate a tremendous improvement with the new covering index.

This example illustrates the benefit of solidly understanding query tuning principles and strategy.  If a query like the above executes several times per second on a busy system, improving it more than 99% is a big win. Doing so without expensive hardware purchases or lengthy refactoring of the application is an even bigger win.  The challenge here is in finding these database optimization opportunities quickly and efficiently.  Having access to an experienced SQL Server tuning expert puts you on the fast track.  Contact me to find out how I can improve your database performance!

 

Geospatial, MongoDB

Import GeoJSON Data Into MongoDB

If you want good support for location data and maps in your database, MongoDB can be a great choice.  One of the most widely supported formats on the Web for geospatial data is GeoJSON, and MongoDB location data looks and behaves pretty much exactly like GeoJSON.  This is very cool if you want to create a map using a JavaScript library like Leaflet, or run common types of geospatial queries on your data.
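
For example, a single point looks essentially the same whether it is sitting in a .geojson file or stored as a document in a MongoDB collection (note that coordinates are longitude first, then latitude; the values below are just an illustration):

{
  "type": "Feature",
  "properties": { "name": "example point" },
  "geometry": { "type": "Point", "coordinates": [-73.9857, 40.7484] }
}

The 2dsphere index and query operators such as $near and $geoWithin work directly against that geometry field.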

This is all pretty awesome, until you find or generate some GeoJSON data in a file outside of MongoDB and decide you want to import it.

The native mongoimport tool, which imports plain JSON just fine (well, mostly, anyway), falls to pieces when dealing with GeoJSON.  The root of the problem is that a GeoJSON file is one big FeatureCollection object, whereas mongoimport wants either one document per line or (with the --jsonArray flag) a top-level array.  There are multiple ways to hack your way through this, as any Web search will tell you, but they generally involve opening your HUGE GeoJSON file in a text editor to strip out the FeatureCollection wrapper (or using a command line tool for this), then running mongoimport on the cleaned-up file.  I went ahead and created a Python script that does the cleanup and import in one step.  The script is also fast at importing, as it leverages the bulk-write operations available in MongoDB 3.2 and later.  This is a 10x improvement over using normal inserts with the PyMongo driver and is very noticeable with large GeoJSON files.  The script tops things off at the end by creating a 2dsphere index on your collection.  Without this index, $near queries will not run at all and other geospatial queries will be painfully slow.
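
If you just want the gist of what the script does, the core of the approach looks something like this (a minimal sketch rather than the full script; the file, database, and collection names simply mirror the example command shown below):

import json
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("localhost", 27017)
collection = client["geospatial"]["points"]

# Load the FeatureCollection and bulk-insert its features
# (a production version would batch this to limit memory use on huge files).
with open("points.geojson") as f:
    features = json.load(f)["features"]

result = collection.insert_many(features, ordered=False)
print("Inserted", len(result.inserted_ids), "features")

# A 2dsphere index on the geometry field enables geospatial queries such as $near.
collection.create_index([("geometry", GEOSPHERE)])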

My script is freely available on GitHub, and to run it you could just do this:

python geojson-mongo-import.py -f points.geojson -d geospatial -c points

This assumes you are running against a local instance of Mongo.  There are additional parameters for the host name and user credentials, along with other things to know about the script, in the README.

So there you have it, a complete, fast, easy solution to importing your GeoJSON data into MongoDB!

In addition, if your geospatial data happens to be in shapefile format, I have a similar tool for importing that into MongoDB.

performance tuning, SQL Server

Fix SQL Server Error 1919 During Index Create

When working on a Microsoft SQL Server database tuning project with a client, I encountered a puzzling error message. Based on performance metrics and analysis of query execution plans, I had recommended a couple of new indexes. The index creation scripts looked similar to the following (I redacted the actual object names here):

CREATE NONCLUSTERED INDEX [IX_tblA_colX] ON [dbo].[tblA] 
    ([ColX] ASC) ON [PRIMARY]
GO

The first index went in just fine, while the second failed with an error message similar to this one:

Msg 1919, Level 16, State 1, Line 1
Column 'ColX' in table 'dbo.tblA' is of a type that is invalid for use as a key column in an index.

This error commonly occurs when the key column for the index is of type NVARCHAR(MAX), which is not supported, or when the index key exceeds 900 bytes. Neither of these conditions was true, as the column was defined as something on the order of NVARCHAR(150), well within the constraints for a non-clustered index key.
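
A quick sanity check on what the engine currently thinks the column is (using the redacted names from the example above) can be done against the catalog views:

-- Check the column's current type and length; max_length is in bytes, -1 means (MAX)
SELECT c.name AS column_name,
       t.name AS type_name,
       c.max_length
FROM sys.columns AS c
JOIN sys.types AS t
    ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID(N'dbo.tblA')
  AND c.name = N'ColX';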

One thing I was already aware of is that this column had actually started out as an NVARCHAR(MAX). In fact there were many columns like this across the entire database, which I suspect was a result of using Microsoft Entity Framework to develop the application. On my advice the client had converted all of these because of the performance issues they were experiencing.

Converting the column type was straightforward using this SQL statement:

ALTER TABLE [dbo].[tblA]
    ALTER COLUMN [ColX] [nvarchar](150) NOT NULL
GO

So how is it that this error message was encountered and we could not create the index? This one nearly had me stumped. I was fairly certain this had something to do with the data type conversion somehow appearing incomplete from the viewpoint of the index creation statement.

I suggested to my client that they run two T-SQL maintenance commands similar to the following:

UPDATE STATISTICS tblA WITH FULLSCAN

Followed by:

dbcc updateusage (mydb,'dbo.tblA')

Once these commands were finished, creating the index succeeded.

Evidently what had happened is that either the table's column statistics or its space allocation had not been updated to reflect the column data type change. By default SQL Server maintains this information automatically, but on large, busy databases it should be taken care of proactively. This is typically done with a weekly or even nightly SQL Server Agent job. I subsequently learned that my client up to this point had no such maintenance job in place. They were in startup mode on a project that had gone live fairly recently. As is often the case, they did not have an in-house DBA and relied on application developers to run the database.

Typically, out-of-date statistics result in less-than-optimal query performance, while inaccurate space allocation usually shows up when reporting the size of your tables. Getting an error when creating an index because of these conditions is extremely rare, and did not turn up in my online search. Fortunately I was able to draw upon my many years of experience as a SQL Server DBA to make an educated guess and root out the problem.

The best defense against this type of issue is to always implement a good scheduled maintenance regimen in your database environment. Automatic backups go without saying regardless of the kind of database you use (SQL or NoSQL). In the case of MS SQL Server this regularly scheduled maintenance also includes index defragmentation and statistics updates. An excellent script for doing this can be found here.
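
As a rough illustration of the statistics and index side of that maintenance, a scheduled job might run something along these lines (a bare-bones sketch only; the dedicated maintenance scripts referenced above handle fragmentation thresholds, logging, and scheduling far more thoroughly):

-- Refresh out-of-date statistics across the current database
EXEC sp_updatestats;

-- Rebuild a specific index when fragmentation is high
-- (candidates can be identified with sys.dm_db_index_physical_stats)
ALTER INDEX [IX_tblA_colX] ON [dbo].[tblA] REBUILD;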

Rick Ciesla owns and operates VenaData, a provider of database administration solutions, services and consulting. Contact Rick for more information.

Uncategorized

Introducing VenaData

Your databases deserve respect, pure and simple.

Database DevOps and DBA Services

Delivered remotely, anywhere on the planet.

  • Professional Remote DBA Services and Consulting
    • Installation and configuration
    • Backups and high availability
    • Problem troubleshooting
  • Query and database performance tuning
  • Process automation for database environments
  • Database upgrades and migration
  • Database architecture and development

With core expertise in Microsoft SQL Server and MongoDB, the VenaData goal is cross-platform support for a wide variety of relational and NoSQL systems, in your data center or in the cloud.

To learn more about how VenaData can help your organization have reliable, high-performing database systems, contact Rick Ciesla at rick@venadata.com
