One of the most interesting things in SQL Server, from my point of view, is statistics and how the optimizer interprets them to produce optimal query plans. But today I want to write about something that just happened to me and made me think. It refers to the very first stage of the process: creation.
If we haven’t changed anything in our databases (or in model), auto create statistics should be on by default, which means SQL Server will automatically create statistics whenever it thinks that information would be beneficial for generating a query plan.
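If you want to double-check that setting on your databases, a minimal query will do (the database name in the ALTER is just a placeholder):

-- Check whether AUTO_CREATE_STATISTICS is enabled per database
SELECT name, is_auto_create_stats_on
FROM sys.databases

-- Turn it back on if somebody switched it off (YourDatabase is a placeholder)
ALTER DATABASE [YourDatabase] SET AUTO_CREATE_STATISTICS ON
GO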
This sounds like a treat, right? Sure, it is nice not having to think about which statistics to create and so on, but like everything, it can have a dark side.
Background
Maybe I’m going to spoil the whole story but this is how things happened.
I was reviewing the logs from the last database maintenance, which includes, of course, updating statistics. Then I saw there was one specific statistic that took more than 10 hours to complete, which made me suspicious. So I tried to find out why.
Basically everything was correct, and the reasons it took so long were:
- The table has 41 million rows
- The statistics were updated using a FULLSCAN
I didn’t specify FULLSCAN, but RESAMPLE reuses the most recent sample rate, and if the last update was a FULLSCAN… there you go.
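To illustrate the difference (table and statistic names are placeholders, not the real ones from my system):

-- RESAMPLE inherits the previous sample rate; if the last update was a FULLSCAN,
-- the whole table gets scanned again
UPDATE STATISTICS dbo.YourTable YourStat WITH RESAMPLE
GO

-- Specifying the sample explicitly keeps the cost predictable
UPDATE STATISTICS dbo.YourTable YourStat WITH SAMPLE 10 PERCENT
GO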
OK, so we have this statistic named _WA_xxxxxxx, hence auto created, on a column of my table. And this column was (drum roll) NVARCHAR(MAX) (badum tss).
My table contains XML documents stored in an NVARCHAR(MAX) column, so my statistic looked like this.
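(If you want to look at one of your own statistics the same way, DBCC SHOW_STATISTICS gives you that view; the table and statistic names below are just examples.)

-- Header: Rows, Rows Sampled, Steps, last updated...
DBCC SHOW_STATISTICS ('dbo.YourTable', '_WA_Sys_00000002_0EA330E9') WITH STAT_HEADER
GO

-- The histogram, up to 200 steps
DBCC SHOW_STATISTICS ('dbo.YourTable', '_WA_Sys_00000002_0EA330E9') WITH HISTOGRAM
GO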
See how [Rows Sampled] is the same as [Rows], so a FULLSCAN was performed. And the number is not small at all.
But the worst part is the piece that is supposed to be the most helpful, where the query optimizer gathers insights about the data to produce nice query plans: the histogram. In this case it doesn’t look very useful, right?
Also, you need to know that statistics on [N]VARCHAR() columns store a maximum of 200 characters in the RANGE_HI_KEY, so for a MAX data type like this one it is absolutely pointless.
Auto Create statistics and Data types
After seeing this, I got a bit concerned and wondered whether all the other data types will get statistics created automatically. So I decided to create a table with all the data types and check it out!
USE master
GO
CREATE DATABASE stats_sample
GO
USE stats_sample
GO
CREATE TABLE stats_sample (
      c_bigint bigint NULL DEFAULT CONVERT(bigint, '1')
    , c_binary binary NULL DEFAULT CONVERT(binary, '1')
    , c_bit bit NULL DEFAULT CONVERT(bit, '1')
    , c_char char NULL DEFAULT CONVERT(char, '0')
    , c_date date NULL DEFAULT GETDATE()
    , c_datetime datetime NULL DEFAULT GETDATE()
    , c_datetime2 datetime2 NULL DEFAULT GETDATE()
    , c_datetimeoffset datetimeoffset NULL DEFAULT GETDATE()
    , c_decimal decimal NULL DEFAULT CONVERT(decimal, '1')
    , c_float float NULL DEFAULT CONVERT(float, '1')
    , c_geography geography NULL DEFAULT 'CURVEPOLYGON(CIRCULARSTRING(-1 1, -1 1, -1 1, -1 1, -1 1))'
    , c_geometry geometry NULL DEFAULT geometry::Parse('POINT(1 1 NULL NULL)')
    , c_hierarchyid hierarchyid NULL DEFAULT '/'
    , c_image image NULL DEFAULT CONVERT(image, '1')
    , c_int int NULL DEFAULT CONVERT(int, '1')
    , c_money money NULL DEFAULT CONVERT(money, '1')
    , c_nchar nchar NULL DEFAULT CONVERT(nchar, '0')
    , c_ntext ntext NULL DEFAULT CONVERT(ntext, '0')
    , c_numeric numeric NULL DEFAULT CONVERT(numeric, '1')
    , c_nvarchar nvarchar NULL DEFAULT CONVERT(nvarchar, '0')
    , c_real real NULL DEFAULT CONVERT(real, '1')
    , c_smalldatetime smalldatetime NULL DEFAULT GETDATE()
    , c_smallint smallint NULL DEFAULT CONVERT(smallint, '1')
    , c_smallmoney smallmoney NULL DEFAULT CONVERT(smallmoney, '1')
    , c_sql_variant sql_variant NULL DEFAULT CONVERT(sql_variant, '1')
    , c_sysname sysname NULL DEFAULT CONVERT(sysname, '0')
    , c_text text NULL DEFAULT CONVERT(text, '0')
    , c_time time NULL DEFAULT GETDATE()
    , c_timestamp timestamp NULL
    , c_tinyint tinyint NULL DEFAULT CONVERT(tinyint, '1')
    , c_uniqueidentifier uniqueidentifier NULL DEFAULT NEWID()
    , c_varbinary varbinary NULL DEFAULT CONVERT(varbinary, '1')
    , c_varchar varchar NULL DEFAULT CONVERT(varchar, '0')
    , c_xml xml NULL DEFAULT CONVERT(xml, '1'))
GO
INSERT INTO dbo.stats_sample DEFAULT VALUES
GO 1000
Note that I had to add a dummy default value to each column so there is some data to build stats on.
Once my table is in place and loaded, auto create stats will be triggered whenever the query optimizer thinks it’d be nice to have them in order to get estimates and generate a better query plan, so there we go:
SELECT * FROM stats_sample WHERE c_bigint IS NOT NULL
SELECT * FROM stats_sample WHERE c_binary IS NOT NULL
SELECT * FROM stats_sample WHERE c_bit IS NOT NULL
SELECT * FROM stats_sample WHERE c_char IS NOT NULL
SELECT * FROM stats_sample WHERE c_date IS NOT NULL
SELECT * FROM stats_sample WHERE c_datetime IS NOT NULL
SELECT * FROM stats_sample WHERE c_datetime2 IS NOT NULL
SELECT * FROM stats_sample WHERE c_datetimeoffset IS NOT NULL
SELECT * FROM stats_sample WHERE c_decimal IS NOT NULL
SELECT * FROM stats_sample WHERE c_float IS NOT NULL
SELECT * FROM stats_sample WHERE c_geography IS NOT NULL
SELECT * FROM stats_sample WHERE c_geometry IS NOT NULL
SELECT * FROM stats_sample WHERE c_hierarchyid IS NOT NULL
SELECT * FROM stats_sample WHERE c_image IS NOT NULL
SELECT * FROM stats_sample WHERE c_int IS NOT NULL
SELECT * FROM stats_sample WHERE c_money IS NOT NULL
SELECT * FROM stats_sample WHERE c_nchar IS NOT NULL
SELECT * FROM stats_sample WHERE c_ntext IS NOT NULL
SELECT * FROM stats_sample WHERE c_numeric IS NOT NULL
SELECT * FROM stats_sample WHERE c_nvarchar IS NOT NULL
SELECT * FROM stats_sample WHERE c_real IS NOT NULL
SELECT * FROM stats_sample WHERE c_smalldatetime IS NOT NULL
SELECT * FROM stats_sample WHERE c_smallint IS NOT NULL
SELECT * FROM stats_sample WHERE c_smallmoney IS NOT NULL
SELECT * FROM stats_sample WHERE c_sql_variant IS NOT NULL
SELECT * FROM stats_sample WHERE c_sysname IS NOT NULL
SELECT * FROM stats_sample WHERE c_text IS NOT NULL
SELECT * FROM stats_sample WHERE c_time IS NOT NULL
SELECT * FROM stats_sample WHERE c_timestamp IS NOT NULL
SELECT * FROM stats_sample WHERE c_tinyint IS NOT NULL
SELECT * FROM stats_sample WHERE c_uniqueidentifier IS NOT NULL
SELECT * FROM stats_sample WHERE c_varbinary IS NOT NULL
SELECT * FROM stats_sample WHERE c_varchar IS NOT NULL
SELECT * FROM stats_sample WHERE c_xml IS NOT NULL
GO
Cool, that should be enough. Let’s see whether we have statistics in our table or not.
SELECT OBJECT_NAME(stc.object_id) AS object_name
     , st.auto_created
     , st.name AS stats_name
     , c.name AS column_name
     , t.name AS type_name
FROM sys.stats_columns AS stc
INNER JOIN sys.stats AS st
    ON st.object_id = stc.object_id
    AND st.stats_id = stc.stats_id
INNER JOIN sys.columns AS c
    ON c.object_id = stc.object_id
    AND c.column_id = stc.column_id
INNER JOIN sys.types AS t
    ON t.user_type_id = c.user_type_id
WHERE stc.object_id = OBJECT_ID('dbo.stats_sample')
Yeah, there you go. All these _WA_Sys_ names tell me the stats have been automatically created (there is an auto_created flag in sys.stats if you don’t believe me), but I can see there are only 31 of them, while I created 34 columns.
That’s funny; let’s see which data types did not get statistics.
SELECT *
FROM sys.types AS t
LEFT JOIN (
    SELECT c.user_type_id
    FROM sys.stats_columns AS stc
    INNER JOIN sys.columns AS c
        ON c.object_id = stc.object_id
        AND c.column_id = stc.column_id
    WHERE stc.object_id = OBJECT_ID('dbo.stats_sample')
) AS st
    ON st.user_type_id = t.user_type_id
WHERE st.user_type_id IS NULL
There they are. XML and CLR data types (geometry and geography are CLR) do not have statistics, and that means you cannot manually create them either.
Just for the sake of it:
CREATE STATISTICS st_stats_sample_geometry ON dbo.stats_sample (c_geometry)
GO
CREATE STATISTICS st_stats_sample_geography ON dbo.stats_sample (c_geography)
GO
CREATE STATISTICS st_stats_sample_xml ON dbo.stats_sample (c_xml)
GO
So apart from these special data types, the rest just got their stats created without question, for better and for worse.
The good part
Pretty obvious: statistics are very important for the query optimizer, so getting them off the shelf (it’s the default behavior) is a big win.
The bad part
Sometimes statistics are not as helpful as we’d like, and if we mix that with other bad practices, like skipping full-text indexes for [N]VARCHAR(MAX) columns and querying them like any other column, which triggers auto create stats, the result can be a disaster.
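Just as a rough sketch of the alternative (all the names here are hypothetical, and full-text requires a unique, single-column, non-nullable index to act as the key):

CREATE FULLTEXT CATALOG ft_catalog AS DEFAULT
GO
CREATE FULLTEXT INDEX ON dbo.Documents (doc_body)
    KEY INDEX PK_Documents
GO
-- Search through the full-text index instead of scanning the big column
SELECT doc_id
FROM dbo.Documents
WHERE CONTAINS(doc_body, 'invoice')
GO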
What to do?
Is it time to panic and run in circles? Not really.
Check your maintenance and your stats in general. If you are on SQL Server 2008 R2 SP2, SQL Server 2012 SP1 or higher, you can use the dynamic management function sys.dm_db_stats_properties, which is great, but for earlier versions you might have to script something yourself using DBCC SHOW_STATISTICS.
A query to start with might look like this:
SELECT OBJECT_NAME(st.object_id) AS [object_name]
     , st.name AS stats_name
     , st.object_id
     , st.stats_id
     , stp.last_updated
     , stp.rows
     , stp.rows_sampled
     , stp.modification_counter
     -- , *
FROM sys.stats AS st
CROSS APPLY sys.dm_db_stats_properties(st.object_id, st.stats_id) AS stp
WHERE OBJECTPROPERTYEX(st.object_id, 'IsMSShipped') = 0
If you spot something that makes you suspicious, there are still some things we can do about it. This post is getting kind of long, but whatever, let’s dive into some light theory.
Auto Update Stats
You probably use either your own scripts or Ola Hallengren’s to maintain both indexes and stats, but even if you don’t, SQL Server still wants your stats to be up to date, so it will automatically trigger an update when they become outdated.
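You can confirm the database-level setting quickly:

-- Is auto update statistics enabled for each database?
SELECT name
     , is_auto_update_stats_on
FROM sys.databases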
So, when are stats outdated? Long story short, more or less when the modification counter from dm_db_stats_properties reaches about 20% of the rows in the table (see the query sketch after the list below), and this is important because BOL states:
- SQL Server (2014 and earlier) uses a threshold based on the percent of rows changed. This is regardless of the number of rows in the table.
- SQL Server (starting with 2016 and under the compatibility level 130) uses a threshold that adjusts according to the number of rows in the table. With this change, statistics on large tables will be updated more often.
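As a rough way of checking how close each statistic is to those thresholds, something like this works; note that the 20% + 500 and SQRT(1000 * rows) formulas are the commonly cited approximations, not values exposed by any official API:

SELECT OBJECT_NAME(st.object_id) AS [object_name]
     , st.name AS stats_name
     , stp.rows
     , stp.modification_counter
     , CAST(stp.rows * 0.20 + 500 AS bigint) AS legacy_threshold    -- 2014 and earlier
     , CAST(SQRT(1000.0 * stp.rows) AS bigint) AS dynamic_threshold -- 2016+ (and trace flag 2371) approximation
FROM sys.stats AS st
CROSS APPLY sys.dm_db_stats_properties(st.object_id, st.stats_id) AS stp
WHERE OBJECTPROPERTYEX(st.object_id, 'IsMSShipped') = 0
  AND stp.modification_counter > 0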
See how it speaks about the percentage of rows changed. That is misleading, for me at least, since every modification to the column counts, so if you modify the very same row again and again the counter will keep increasing, and even though the rest of the table is totally unchanged, auto update stats (or even your scripts) will believe it’s time to update.
And going back to BOL, that update will happen before the query we want to run is executed, unless AUTO_UPDATE_STATISTICS_ASYNC is ON, in which case the query does not wait and the update runs in the background.
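If blocking queries on a statistics update is a concern, the asynchronous behavior can be enabled per database (YourDatabase is a placeholder; keep in mind queries may then run with slightly stale stats):

ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS_ASYNC ON
GO
-- Confirm it
SELECT name, is_auto_update_stats_async_on
FROM sys.databases
WHERE name = 'YourDatabase'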
Now go figure whether my 10-hour statistics update was triggered because someone ran a query, instead of happening during a maintenance window… ugly stuff there.
You can still do some tricks, like updating the stats manually with a zero-row sample and NORECOMPUTE, which will effectively blank out the statistic and stop it from being updated until you revert it, but please be cautious and don’t jump into it before evaluating all the consequences.
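A minimal sketch of that trick, assuming a hypothetical table and statistic name; please try it somewhere safe and understand the impact before touching production:

-- Empty the statistic and mark it so automatic (and maintenance) updates skip it
UPDATE STATISTICS dbo.YourTable _WA_Sys_00000002_0EA330E9
WITH SAMPLE 0 ROWS, NORECOMPUTE
GO
-- To revert, rebuild it without NORECOMPUTE, which re-enables automatic updates
UPDATE STATISTICS dbo.YourTable _WA_Sys_00000002_0EA330E9
WITH FULLSCAN
GO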
Conclusion
As most of the time, there is no single solution that fits all situations, so the better we understand the internals, the better choices we will be able to make, and the better we can forecast the caveats of those choices.
Some useful links
- UPDATE STATISTICS
- Statistics 2008 Whitepaper
- Understanding When Statistics Will Automatically Update
Thanks for reading, and if you have any questions, please ask in the comments below.