Many developers believe that database performance tuning is entirely the responsibility of a database administrator (DBA). After all, a DBA’s job is to administer database servers, right? And shouldn’t that responsibility encompass the smooth, efficient operation of the database? Well, yes, strictly speaking. But it’s important to understand that a DBA is ultimately responsible for the safety of the entire organization’s data. That is a huge responsibility, and one that overlaps with many other IT roles. A DBA’s job also includes data security, storage planning and implementation, architecture, SANs, hardware, and so on. Performance is just one of many concerns, and when it comes to performance, DBAs are mainly concerned with the performance of the database management system (DBMS) as a whole, not necessarily the individual databases that back your applications.
Databases are typically designed in their entirety by teams of developers as they work on applications. Databases often grow alongside the application, with more tables and other functionality added in each iteration of development. Developers not only design the databases, they also write the queries their applications send to those databases. The developers of a software application know best how the application stores, uses, and queries data. They have more contextual knowledge of the application data than anyone else, including the DBAs. Based on those facts, one could (very strongly) argue that responsibility for the performance of an application database rests with the development teams. That’s not to say you should never involve a DBA in your planning or never ask a DBA for help - especially if you don’t know what you’re doing! In fact, some companies (particularly big corporations) have policies that require a DBA to approve changes or SQL scripts before they are applied to databases. But a working knowledge of how database management systems answer queries, and how you can optimize and tailor this for your specific application, will yield great rewards. Your application will perform much faster and your users will be happier.
In this article I will be using Microsoft SQL Server as an example, but many of these same principles are applicable across different relational DBMSs.
Introducing the Query Optimizer
How is a database table with hundreds of thousands of rows able to very quickly answer a query that returns only the rows matching given criteria? It can’t possibly be checking each and every row in the table. The answer is the query optimizer! When you submit a query, the query optimizer analyzes your SQL code and determines the quickest way to return the desired data. It produces what is known as an execution plan, which is like a blueprint for the storage engine to process. An execution plan is a series of precise instructions for the storage engine to follow that will yield the expected result. The query optimizer uses heuristics and statistical analysis to produce several candidate execution plans, all of which produce the same query result. The one that is estimated to cost the least in terms of performance is chosen and sent to the storage engine, which executes it. As you can probably imagine, producing multiple execution plans and deciding which one is best is an expensive operation that delays your query response. To address this, SQL Server maintains a cache of execution plans for frequently submitted queries. When an identical query is submitted and a plan for it already exists in the cache, SQL Server reuses that plan instead of compiling a new one. You can try this out for yourself - if you run the exact same query twice, you will notice that the second time, the results are returned almost instantaneously. The execution plan is cached in memory, and the data pages that were read are held in the buffer pool, which avoids both plan compilation and redundant round trips to the disk.
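You can peek at the plan cache yourself. The dynamic management views below are real SQL Server objects, but the query itself is only an illustrative sketch:

```sql
-- Show the ten most frequently reused plans in the cache,
-- along with the SQL text that produced each plan.
SELECT TOP (10)
       cp.usecounts,   -- how many times the cached plan has been reused
       cp.objtype,     -- e.g. 'Adhoc', 'Prepared', 'Proc'
       st.[text]       -- the original SQL statement
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC;
```

If you run the same query twice and then re-check this view, you should see its usecounts climb, confirming the plan was reused rather than recompiled.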
Introduction to Indexes
Saying “the query optimizer figures it out” and “caching” doesn’t really answer the original question. Wouldn’t SQL Server still have to examine each and every row in the table, regardless of which algorithm the query optimizer decides to use? Does choosing one execution plan over another really make any difference? Like many things in computer science, the answer is - it depends. If your table were merely a collection of rows with no organizing structure, then yes, SQL Server would always have to examine every row to answer a query. Such a table is called a heap in database terminology. Fortunately, most tables are organized by ancillary data structures called indexes. You are most likely using indexes on your tables without even realizing it, because SQL Server creates them automatically when certain types of constraints are added to your table columns.
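For example, a PRIMARY KEY constraint creates a clustered index by default, and a UNIQUE constraint creates a nonclustered index behind the scenes. The table below is a hypothetical sketch (the table and column names are assumptions for illustration):

```sql
CREATE TABLE dbo.Employees
(
    EmployeeID INT IDENTITY(1,1) NOT NULL,
    FirstName  NVARCHAR(50)      NOT NULL,
    LastName   NVARCHAR(50)      NOT NULL,
    DateOfHire DATE              NOT NULL,
    Email      NVARCHAR(255)     NOT NULL,

    -- Creates a clustered index on EmployeeID by default:
    CONSTRAINT PK_Employees PRIMARY KEY (EmployeeID),

    -- Creates a nonclustered index on Email behind the scenes:
    CONSTRAINT UQ_Employees_Email UNIQUE (Email)
);
```

Neither statement says “index” anywhere, yet this table ends up with two of them.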
The best analogy to describe how an index works is a phonebook. A phonebook is sorted alphabetically by last name, so if you are looking for “Smith”, you can discard the first half of the book right away because you intuitively know that last names starting with ‘S’ will be closer to the end of the book. A binary search algorithm operates in a similar fashion - assuming a sorted list, start in the middle of the list, and if the thing being searched for is greater than the middle element, discard the lower half of the list, otherwise discard the higher half. Then recompute the midpoint, and repeat the process until the element is found or there is nothing left in the list to divide. Each iteration of the binary search reduces the search window by half, meaning we can discard many thousands of elements at a time without even needing to examine them. As a software developer, you are probably familiar with the concept of a binary search tree. Trees have a root node, with connectors to other nodes, forming “branches”. The bottom-most level nodes in the tree are the “leaf level” nodes of the tree. Binary search trees can be searched very quickly because the nodes are ordered by their relative values, and traversing such a tree has the same net effect as searching a sorted list using the binary-search process described above. Here is an example of such a tree:
In this tree, 8 is the root. If we search this tree for 6, 6 is less than 8 so we move “left” and compare 6 with 3. 6 is larger than 3 so we move “right” and 6 is located! Database indexes resemble an upside-down tree like the one above. However, a database index is not a binary search tree; it is a B+ tree, to be exact. For the purposes of this introductory article, all you need to know is that B+ tree nodes can be connected to more than just two other nodes. Each node in a database index tree represents a page of data. In SQL Server, a page is an 8KB unit of storage containing table data and pointers to other pages. All data in SQL Server are stored in pages, and they are the smallest unit of data that can be read or written. At the intermediate level (also called the “branch level”) of the index tree, pages contain pointers to the child pages they are connected to. At the intermediate and leaf levels of the index tree, adjacent pages are linked together in doubly-linked-list fashion. In other words, each page contains pointers to the page immediately before it and the page immediately after it. This allows sequential data to be retrieved quickly by following the pointers to the next and previous leaf pages. If I submit a query and all of the matching rows are not on the same page, SQL Server can simply go to the next page to get the rest of the rows without having to re-traverse the entire tree from the top.
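You can actually inspect the shape of an index tree with the sys.dm_db_index_physical_stats dynamic management function (a real SQL Server object; the dbo.Employees table name is an assumption for illustration):

```sql
-- Show the depth of each index on dbo.Employees and how many
-- 8KB pages sit at each level of its B+ tree (level 0 = leaf).
SELECT index_id,
       index_depth,   -- number of levels from root to leaf
       index_level,   -- 0 is the leaf level
       page_count     -- pages at this level
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Employees'),
         NULL, NULL, 'DETAILED');
```

Even on large tables the index_depth is usually only 2 to 4, which is why a traversal touches so few pages.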
For a database index to be useful, it must have some knowledge of the table data it’s organizing. Therefore, an index is redundancy built into your database. When you create an index on a table, the index tree is ordered by the column you create the index on, and the index contains an exact copy of that column’s data. The column becomes the “search key” for the index. For example, if I create an index on an Employees table using the LastName column, the index will be sorted alphabetically in ascending order by LastName. You can also use descending order if desired, but that’s primarily advantageous if you have a composite index with multiple columns - a concept beyond the scope of this article. When I query my Employees table and use the LastName column in my WHERE clause, the query optimizer will see that an index exists for the LastName column on Employees, so it may decide to traverse that index structure. As an added benefit, if my query also has an ORDER BY LastName clause, SQL Server does not need to take any additional steps to sort, because the result set it produces from the index is already sorted by LastName! If my table had hundreds of thousands of rows, this would improve query performance tremendously, because the index can be quickly traversed to find only those last names which satisfy the query conditions.
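A minimal sketch of that scenario (the index name is an assumption):

```sql
-- Build a nonclustered index sorted ascending by LastName.
CREATE NONCLUSTERED INDEX IX_Employees_LastName
    ON dbo.Employees (LastName ASC);

-- Both the filter and the sort below can be satisfied by
-- traversing the index; no separate sort step is needed.
SELECT LastName
FROM dbo.Employees
WHERE LastName = N'Smith'
ORDER BY LastName;
```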
Cost of Indexes
Because indexes introduce redundancy by their very nature, it is important to recognize that all of the benefits indexes provide come at a cost. Whenever you write data to the table, whether via INSERT, UPDATE, or DELETE, the index must also be updated in order to remain consistent with the table it is on. SQL Server will automatically update the associated index(es) for you, but this comes at a performance cost when making changes to table data. Therefore, indexes should be applied judiciously. Having too many indexes on a table will slow your database down during writes. It is also a colossal waste of disk space, as each index maintains its own copy of column data. As a general best practice, indexes should be created only on columns that your application actually filters, joins, or sorts on.
Types of Indexes
There are two fundamental types of indexes - clustered and nonclustered indexes. SQL Server provides a few other types of indexes, but in this introductory article I am going to focus on these two, as these fundamental structures must be understood before you can make an informed decision about other types of indexes and index configuration options. As I stated before, database indexes are composed of pages connected to one another in a sorted tree fashion via their internal pointers, and all SQL Server data are stored in pages. The B+ tree structures of clustered and nonclustered indexes are identical, and both are traversed in an identical fashion. The main difference lies in the leaf level of the two index types. In a clustered index, the leaf level of the index is composed of the actual table data pages. That is, the entire table is managed by the index. Because the clustered index essentially becomes the table itself, a table can have only one clustered index. In a nonclustered index, the leaf level pages merely contain pointers to the table data, which are stored elsewhere. Because a nonclustered index simply refers to the table, you may have several nonclustered indexes on a table if you wish.
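Both kinds can also be created explicitly. A sketch (the index names are assumptions, and the first statement assumes dbo.Employees does not already have a clustered index):

```sql
-- Only one clustered index is allowed per table; it *is* the table.
CREATE CLUSTERED INDEX CIX_Employees_EmployeeID
    ON dbo.Employees (EmployeeID);

-- Nonclustered indexes are separate structures whose leaf pages
-- refer back to the table; a table may have several of them.
CREATE NONCLUSTERED INDEX IX_Employees_DateOfHire
    ON dbo.Employees (DateOfHire);
```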
Performance Tuning with Indexes
When SQL Server traverses a nonclustered index, the way it ultimately fetches the table data depends on the type of table the index refers to. If the table does not have a clustered index (that is, a heap), the leaf level contains pointers to the table rows. However, if the table is a clustered index, the leaf level pages of the nonclustered index contain the search key for the corresponding row in the clustered index - known as the “clustering key” - and SQL Server uses that key to search the clustered index for the row. To illustrate, imagine you are querying the Employees table for the last name of “Smith”. The table is a clustered index built on the primary key of the table (say, EmployeeID). The table also has a nonclustered index on the LastName field. In this scenario, SQL Server will first use the nonclustered index to find all rows containing the last name of “Smith”. But because the table is a clustered index, the leaf level of the nonclustered index will contain a copy of the EmployeeID key for each matching row. After locating the EmployeeID values for all rows having a LastName of “Smith”, SQL Server then uses the EmployeeID field to traverse the clustered index for the matching rows. This is known as a “Key Lookup” in SQL Server, and it can be detrimental to the performance of your application. Key lookups require more traversal time and I/O activity. The solution is to carefully analyze the queries sent by your application, paying particular attention to the SELECT columns as well as the columns in the WHERE clause, and to build indexes which fully “cover” queries. This is better explained through example. Consider the following simple query:
SELECT FirstName, LastName
FROM Employees
WHERE DateOfHire >= '2012-02-01' AND DateOfHire <= '2013-02-01'
Assume the Employees table has a clustered index on the primary key (EmployeeID), and a non-clustered index on DateOfHire. You won’t notice any performance issues now, as this is a relatively new table with very few rows in it. In fact, you might think that because we have the additional non-clustered index on DateOfHire - a column which is used in the WHERE clause - we don’t have a problem. However, as time goes on, and more rows are added to the Employees table, the performance of this query is going to rapidly deteriorate. Why? Because even though the DateOfHire index is being used, the clustered index must still be traversed. SQL Server must still perform a key lookup for each row found in the non-clustered index on DateOfHire, because it still needs to fetch the FirstName and LastName fields to return to the application. The solution is to modify the non-clustered index on DateOfHire, and to add FirstName and LastName as “included” columns in this index. This way, the non-clustered index will have all of the requested information, and can answer the query directly without needing to traverse the clustered index. The modified DateOfHire index is now referred to as a “covering index” because it fully “covers” the query above.
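A sketch of that change (the index name is an assumption; INCLUDE is the T-SQL keyword for carrying non-key columns in a nonclustered index’s leaf pages, and DROP_EXISTING assumes an index with this name already exists):

```sql
-- Rebuild the DateOfHire index so its leaf pages also carry
-- FirstName and LastName, making it a covering index for the query.
CREATE NONCLUSTERED INDEX IX_Employees_DateOfHire
    ON dbo.Employees (DateOfHire)
    INCLUDE (FirstName, LastName)
    WITH (DROP_EXISTING = ON);  -- replaces the old index in place
```

Included columns are stored only at the leaf level, so they fatten the index less than adding them as extra key columns would.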
It’s also important to note that just because an index exists does not mean it will be used every time, even for the same query. There are situations where traversing the index would be slower than just scanning the entire table. This can happen if the table does not contain many rows, or if the query would return the majority of the rows in the table anyway. If the query optimizer determines this to be the case, it will ignore the index and just scan the entire table. So do not be alarmed if your index does not get used right away. However, there are many undesirable situations where indexes are not being properly utilized, and thus your database is not performing to its full potential. Missing, unused, or duplicate indexes can all cause performance problems. To diagnose these problems, you will need to look at the execution plans being generated for your application’s queries. In SQL Server, you can show the actual execution plan being used by pressing CTRL + M, or by selecting “Include Actual Execution Plan” in the Query menu. When you turn this on before submitting a query, SQL Server will include the chosen plan in a graphical format, along with the query results. If you notice that your application is sending a query that isn’t performing as quickly as it could be, look at the execution plan being generated. Is it always scanning the clustered index (or heap) when it could benefit from a covering index? Is the query always filtering on a certain column for which no index exists? If so, the entire table (or clustered index) must be scanned to find matching rows, defeating the purpose of indexing in the first place. If writes seem to be slow, are there duplicate indexes, perhaps because someone created one without first checking whether an index already existed for that query? Duplicate indexes are a waste of disk space and hurt I/O write performance.
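SQL Server’s index usage statistics can help with this diagnosis. The query below is an illustrative sketch using the real sys.dm_db_index_usage_stats view and sys.indexes catalog view (the counters reset whenever the server restarts, so read them with that in mind):

```sql
-- Indexes with many writes but few or no reads are candidates
-- for review; they cost I/O on every write yet serve no queries.
SELECT OBJECT_NAME(s.[object_id])  AS table_name,
       i.[name]                    AS index_name,
       s.user_seeks + s.user_scans + s.user_lookups AS reads,
       s.user_updates              AS writes
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.[object_id] = s.[object_id]
 AND i.index_id    = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY writes DESC;
```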
To summarize, database performance tuning is a large, complicated subject. Whole books have been written on the many different facets of the topic, and I have only begun to scratch the surface in this article. I haven’t even discussed other important details like statistics, index fragmentation, filtered indexes, the implications of unique vs. non-unique indexes, deciphering execution plans, index maintenance, and so on. If you’re concerned about the performance of your website (and why wouldn’t you be?), I would strongly encourage you to study these topics further. The performance of your database can make a real difference to your site’s visitors.