SUMMARY:
Azure SQL Automatic Tuning created and then automatically reverted a new nonclustered index; an investigation with the SQL Server Query Store revealed that the 108.15% DTU regression behind the reversal was caused by the index’s maintenance overhead on frequent BULK INSERT write operations, not by inefficient data reads.
- Automatic Tuning’s validation showed the overall database workload regressing by 3.89% DTU, while the specific “affected queries” regressed by a massive 108.15% DTU.
- The investigation was initially complicated because the Azure Portal validation reports and the underlying sys.dm_db_tuning_recommendations DMV failed to provide the query_id or query text linking the index to the negatively impacted operation.
- Using Query Store and filtering by table name and time window, the team discovered that the top resource consumer was a BULK INSERT statement, confirming the index was harming data writes rather than helping SELECT queries.
- The standard best practice for environments with heavy data loading is to temporarily disable or drop nonclustered indexes before running the bulk load, and then rebuild them afterward, as this is far more efficient than row-by-row maintenance during the load.
This case validates that while automated systems protect the workload by reverting harmful changes, database optimization still requires human skill, diagnostic tools like Query Store, and collaboration across teams to understand the total impact on a mixed read/write workload.
The Initial Anomaly: A 108% Performance Regression
One of the most powerful features within Azure SQL Managed Instance is Automatic Tuning. It monitors your database workload and applies performance-enhancing actions without manual intervention, such as creating a potentially missing index. But what happens when the machine gets it wrong? Or, more accurately, what happens when it gets it so right that it has to immediately undo its own work?
That’s the scenario we recently faced. On a reporting database, Automatic Tuning suggested and created a new nonclustered index. Shortly after, it automatically reverted the change, providing a validation report with some puzzling metrics:
- DTU regression (overall): 3.89%
- DTU regression (affected queries): 108.15%
For context, a Database Transaction Unit (DTU) is Azure’s blended measure of CPU, memory, and I/O. The report told us that while the overall database workload got slightly more expensive (by ~4%), the specific query the index was meant to help got more than twice as expensive (a 108% regression).
Because this was a reporting environment, our initial analysis was laser-focused on read performance. Our first logical hypothesis was that the index was a good idea in principle, but that it needed page compression. In reporting scenarios with large tables, creating an index with page compression can drastically reduce its size on disk, leading to fewer I/O operations and better query performance. Our theory was that Azure created the standard index, its I/O cost was still too high, and this caused the regression.
This hypothesis seemed sound and provided a clear path forward: find the affected query, and manually recreate the index with page compression to fix the issue. This kicked off a deeper investigation that would ultimately prove our initial theory completely wrong.
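Had the theory held, the manual fix would have looked roughly like the sketch below, which recreates the index using the table, key columns, and index name that appear later in the investigation, with page compression added. The ONLINE option is an assumption for avoiding blocking on a busy table, not something taken from the tuning log.
SQL
-- Planned (and ultimately abandoned) fix: recreate the auto-tuned index with page compression.
-- Index name and key columns come from the tuning recommendation; ONLINE = ON is assumed.
CREATE NONCLUSTERED INDEX [IX_AutoTuned_ReportData_EventTimestamp]
    ON [dbo].[Staging_ReportData] ([RecordId], [EventTimestamp])
    WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);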
The Investigation Begins: Finding the “Affected Query”
Our first step was to identify the query that was so negatively impacted. This should be easy, right? The recommendation details in the Azure Portal should have the query text or at least a query_id.
Unfortunately, this was our first dead end. The validation report in the portal UI, while showing the regression percentages, had no mention of a query_id.
No problem. We can go deeper. The next logical step is to query the underlying Dynamic Management View (DMV) that feeds the portal: sys.dm_db_tuning_recommendations. This DMV contains a details column with rich JSON data that should hold the key. We ran a query to parse this JSON, but to our surprise, the queryId field was NULL. After inspecting the raw JSON, we confirmed it wasn’t just in the wrong place—it was completely missing from the log for this specific event.
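For reference, the probe looked roughly like the sketch below. The JSON paths are taken from Microsoft’s documented samples for this DMV (they are populated for plan-forcing recommendations); for the index-creation event in our log, neither path produced a query identifier.
SQL
-- Probe the tuning recommendation log for a query identifier.
-- The JSON paths below may legitimately return NULL depending on recommendation type;
-- in our case the queryId was missing entirely for the CREATE INDEX event.
SELECT  name,
        type,
        reason,
        score,
        JSON_VALUE(details, '$.planForceDetails.queryId')     AS query_id,
        JSON_VALUE(details, '$.implementationDetails.script') AS implementation_script,
        details                                                AS raw_json
FROM    sys.dm_db_tuning_recommendations;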
This was a frustrating but realistic roadblock. We had hit a limitation in the diagnostic data. It was time to pivot. If we couldn’t link the reverted index to the query via an ID, we had to link them by circumstance. We had two critical clues:
- The Object: The index was created on the table [dbo].[Staging_ReportData] on the columns ([RecordId], [EventTimestamp]).
- The Time Window: The JSON details gave us the exact timestamps for when the index was created and when it was reverted.
The mission was now clear: find a resource-intensive query that ran against [dbo].[Staging_ReportData] within that specific time window.
The Breakthrough: Unmasking the Culprit with Query Store
With our new mission defined, we turned to the ultimate tool for historical query analysis: the SQL Server Query Store. To simplify the process, we used the excellent open-source stored procedure sp_QuickieStore, which provides a user-friendly way to analyze Query Store data.
Since we couldn’t filter by query_id, we used the time window and the @query_text_search parameter to find our culprit. The command was simple and precise:
SQL
EXEC sp_QuickieStore
@start_date = '2025-07-04 05:50:11',
@end_date = '2025-07-04 16:07:10',
@sort_order = 'cpu',
@query_text_search = 'Staging_ReportData';
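If sp_QuickieStore is not available, roughly the same search can be approximated against the Query Store catalog views directly. The following is a sketch using the same time window and text filter; note that avg_cpu_time is reported in microseconds and interval times are stored as datetimeoffset (UTC).
SQL
-- Rough equivalent using Query Store catalog views: statements touching
-- Staging_ReportData during the validation window, ordered by total CPU.
SELECT TOP (20)
        q.query_id,
        qt.query_sql_text,
        SUM(rs.count_executions)                     AS executions,
        SUM(rs.avg_cpu_time * rs.count_executions)   AS total_cpu_time_us
FROM    sys.query_store_query_text AS qt
JOIN    sys.query_store_query AS q
        ON q.query_text_id = qt.query_text_id
JOIN    sys.query_store_plan AS p
        ON p.query_id = q.query_id
JOIN    sys.query_store_runtime_stats AS rs
        ON rs.plan_id = p.plan_id
JOIN    sys.query_store_runtime_stats_interval AS i
        ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
WHERE   qt.query_sql_text LIKE N'%Staging_ReportData%'
  AND   i.start_time >= '2025-07-04 05:50:11'
  AND   i.end_time   <= '2025-07-04 16:07:10'
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_cpu_time_us DESC;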
We ran the sp_QuickieStore command, expecting to see a complex reporting query at the top of the list: one with a WHERE clause that was somehow using the new index inefficiently.
The result was something else entirely. The top resource-consumer, by a wide margin, was a BULK INSERT statement targeting our [dbo].[Staging_ReportData] table. It had run over 700 times during the validation window with a high average CPU cost. At first, this seemed wrong. Why would a data loading operation be the “affected query”?
The “Aha!” Moment: Understanding the Real Problem
Then, it all clicked into place. The BULK INSERT wasn’t the query the index was meant to help; it was the victim of the index.
This revealed a fundamental database tuning principle: Indexes speed up data reads, but they slow down data writes.
Every time you INSERT, UPDATE, or DELETE data, the database engine doesn’t just change the data in the table; it must also update every single nonclustered index to reflect that change. When Automatic Tuning created the new index [IX_AutoTuned_ReportData_EventTimestamp], it inadvertently imposed a “tax” on our data loading process.
For each of the 712 BULK INSERT executions, the database now had to perform the extra work of inserting a new entry into our new index. That extra CPU and I/O overhead, multiplied 712 times, was the source of the massive 108% performance regression. The “affected query” was the one being harmed by the index’s existence. This discovery also invalidated our initial theory about page compression. Adding compression to the index would have increased the CPU ‘tax’ on the BULK INSERT, likely making the 108% regression even worse.
Key Lessons and Best Practices
This investigation highlighted several crucial takeaways for anyone managing a database with a mixed workload.
Consider the Entire Workload
Don’t assume an index is only for SELECT queries. You must always consider the impact on the entire workload, especially data modification patterns. An index that helps a monthly report run faster might be unacceptable if it cripples a data loading process that runs every five minutes.
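A quick way to check this on a live system is to compare how often each nonclustered index is actually read against how often it must be maintained. Below is a minimal sketch against the standard index-usage DMV; the counters reset on restart, so treat the numbers as indicative rather than definitive.
SQL
-- Compare read usage (seeks/scans/lookups) to write maintenance (updates)
-- for every nonclustered index on the staging table.
SELECT  i.name                                           AS index_name,
        us.user_seeks + us.user_scans + us.user_lookups  AS reads,
        us.user_updates                                  AS writes
FROM    sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
        ON  us.object_id  = i.object_id
        AND us.index_id   = i.index_id
        AND us.database_id = DB_ID()
WHERE   i.object_id = OBJECT_ID(N'dbo.Staging_ReportData')
  AND   i.type_desc = 'NONCLUSTERED';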
The Best Practice for Bulk Loads 🚀
Our scenario is precisely why a standard best practice exists for environments with heavy data loading operations. To maximize performance, the recommended pattern is as follows (a minimal T-SQL sketch appears after the list):
- Disable or Drop Indexes: Before the data load begins, disable or drop the nonclustered indexes on the target table.
- Perform the Data Load: Run the BULK INSERT or other loading process. This will now run much faster without the index maintenance overhead.
- Rebuild or Recreate Indexes: After the load is complete, rebuild or recreate the indexes. This single, large operation is far more efficient than maintaining the indexes row-by-row during the load.
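Here is a minimal sketch of that disable/load/rebuild pattern, using the index from our scenario; the source file path and load options are hypothetical placeholders, not values from our actual workload.
SQL
-- 1. Disable the nonclustered index so it is not maintained during the load.
ALTER INDEX [IX_AutoTuned_ReportData_EventTimestamp]
    ON [dbo].[Staging_ReportData] DISABLE;

-- 2. Run the bulk load (the source path below is a placeholder).
BULK INSERT [dbo].[Staging_ReportData]
FROM 'C:\loads\report_data.csv'
WITH (TABLOCK);

-- 3. Rebuild the index once, after the load completes.
ALTER INDEX [IX_AutoTuned_ReportData_EventTimestamp]
    ON [dbo].[Staging_ReportData] REBUILD;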
Trust Your Automation (But Know How to Verify It)
Azure’s Automatic Tuning worked perfectly. It identified a potential improvement, tested it, and, upon seeing the massive negative impact, protected the workload by reverting the change. This event reinforces the value of automated tuning, but also shows the importance of having the skills and tools like Query Store to investigate why a decision was made.
To emphasize this, consider the scenario outlined above: a database is autotuning itself by adding indexes for reporting queries, yet many of those indexes are being rolled back by the same mechanism that created them. Maintaining database optimization is a team effort. Reporting and ETL developers are often separate people or teams, and recognizing how their workloads interact sooner rather than later will save time and resources.
In short, Azure’s Automatic Tuning feature is excellent! However, it requires people to work together for maximum optimization.
Conclusion
What started as a confusing alert about a reverted index turned into a valuable reminder about the fundamentals of database performance. By following the clues and digging deep into the diagnostic data, we were able to solve the mystery and confirm that our automated systems were making the right call, even if the reason wasn’t immediately obvious.
FAQs
What is Azure Automatic Tuning, and what was the initial problem encountered?
Azure Automatic Tuning is a powerful feature in Azure SQL Managed Instance that automatically monitors database workloads and applies performance-enhancing actions, such as creating missing indexes, without manual intervention. In the described scenario, Automatic Tuning created a new nonclustered index, but then quickly reverted it. The puzzling part was the validation report: while the overall database workload experienced a minor 3.89% DTU regression, the specific “affected query” showed a massive 108.15% DTU regression, meaning it became more than twice as expensive.
Why was identifying the “affected query” so challenging?
The process of identifying the negatively impacted query faced several roadblocks. Initially, the Azure Portal’s validation report, while showing regression percentages, did not provide the query_id or query text. Further investigation by querying the sys.dm_db_tuning_recommendations Dynamic Management View (DMV) also revealed that the queryId field was NULL and completely missing from the rich JSON details for this specific event. This diagnostic data limitation meant the team couldn’t directly link the reverted index to a specific query ID.
How was the “affected query” ultimately identified without a query_id?
Unable to link by query_id, the team pivoted to linking by circumstance, using two critical clues: the specific table and columns on which the index was created ([dbo].[Staging_ReportData] on ([RecordId], [EventTimestamp])), and the exact time window provided by the JSON details when the index was created and reverted. They then used the SQL Server Query Store, specifically the sp_QuickieStore stored procedure, to find resource-intensive queries against Staging_ReportData within that time window. This led them to discover that a BULK INSERT statement was the top resource consumer.
What was the “Aha!” moment regarding the BULK INSERT?
The “Aha!” moment came when it was realized that the BULK INSERT wasn’t the query the index was intended to help; instead, it was the query being significantly harmed by the index’s presence. This highlighted a fundamental database tuning principle: while indexes speed up data reads (SELECT queries), they inherently slow down data writes (INSERT, UPDATE, DELETE operations). Every time data is modified, all associated nonclustered indexes must also be updated. The new index imposed an “extra tax” on each of the 712 BULK INSERT executions, leading to increased CPU and I/O overhead, which manifested as the 108% performance regression.
Why did the initial hypothesis about page compression prove wrong?
The initial hypothesis was that the index was conceptually good but needed page compression to reduce I/O costs, especially in a reporting environment with large tables. However, once the BULK INSERT was identified as the culprit, this hypothesis was invalidated. Adding page compression to the index would have further increased the CPU overhead for the BULK INSERT operation, making the 108% regression even worse, as compression adds more processing during write operations.
What key lessons were learned about considering the entire workload?
A crucial lesson learned was the importance of considering the entire database workload, not just SELECT queries, when implementing indexes. An index beneficial for read performance might severely impact write operations if those operations are frequent or critical. The example showed that an index helping a monthly report could be detrimental if it cripples a data loading process that runs much more frequently, like every five minutes. Database optimization requires a holistic view of all operations.
What is the recommended best practice for environments with heavy data loading operations?
For environments with heavy data loading operations (like BULK INSERT), the standard best practice is to:
Disable or Drop Indexes: Temporarily remove nonclustered indexes on the target table before the data load.
Perform the Data Load: Execute the BULK INSERT or other loading process, which will run much faster without the index maintenance overhead.
Rebuild or Recreate Indexes: After the data load is complete, rebuild or recreate the indexes. This single, large operation is far more efficient than maintaining indexes row-by-row during the load.
How does this scenario reinforce the value of automated tuning and collaboration?
This scenario reinforces that Azure’s Automatic Tuning works effectively; it identified a potential improvement, tested it, and correctly reverted the change upon detecting a significant negative impact, thus protecting the workload. It highlights the value of automated systems. However, it also underscores the importance of human skills and tools like Query Store to investigate why a decision was made. The incident demonstrates that while automation is excellent, maximum optimization often requires collaboration between different teams (e.g., reporting and ETL developers) who understand the full scope of database operations to address issues proactively.
For more on effortless Azure Management, contact us today!