Feb 9, 2017

Statistics and Data Types: What Can Go Wrong?

One of the most interesting things in SQL Server, from my point of view, is statistics and how the optimizer interprets them to produce optimal query plans. But today I want to write about something that just happened to me and made me think. It refers to the very first stage of the process: creation.

If we haven't changed anything in our databases (or in the model database), auto create statistics should be on by default, which means SQL Server will automatically create statistics whenever it thinks that information would be beneficial for generating a query plan.

This sounds like a treat, right? Sure, it is nice not to have to think about which statistics to create and so on but, like everything, it can have a dark side.
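Checking where your databases stand takes one query; the ALTER DATABASE line below is just a sketch, and [my_database] is a placeholder name:

-- Check whether auto create statistics is enabled per database
SELECT name, is_auto_create_stats_on
	FROM sys.databases

-- Re-enable it where someone turned it off ([my_database] is a placeholder)
ALTER DATABASE [my_database] SET AUTO_CREATE_STATISTICS ON
GO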

 
Background

Maybe I'm going to spoil the whole story, but this is how things happened.

I was reviewing the logs for the last database maintenance which includes, of course, updating statistics. There I saw that one specific statistic took more than 10 hours to update, which made me suspicious, so I tried to find out why.

Basically everything was correct, and the reasons it took so long were:

  • The table has 41 million rows
  • The statistics were updated using FULLSCAN

 
I didn't specify FULLSCAN, but using RESAMPLE reuses the last sample rate, and if the last update happened to be a FULLSCAN… there you go.
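To make it concrete, this is roughly the sequence that bites you; the table and statistic names are placeholders:

-- Someone (or something) once updates the statistic with a full scan
UPDATE STATISTICS dbo.myBigTable [_WA_Sys_00000003_0EA330E9] WITH FULLSCAN
GO

-- From now on, a maintenance job using RESAMPLE inherits that full scan,
-- so every run reads all 41 million rows again
UPDATE STATISTICS dbo.myBigTable [_WA_Sys_00000003_0EA330E9] WITH RESAMPLE
GO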

OK, so we have this statistic named _WA_Sys_xxxxxxx, hence auto created, on a column of my table. And this column was (drum roll) NVARCHAR(MAX) (ba-dum tss).

My table contains XML documents stored in an NVARCHAR(MAX) column, so my statistics looked like this:

[Image: 01_dbcc_show_statistics — statistics header, density vector and histogram]

See how [Rows Sampled] is the same as [Rows], so a FULLSCAN was performed. And the number is not small at all.
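If you want to reproduce this output for your own statistics, DBCC SHOW_STATISTICS is the tool; the names below are placeholders for whatever _WA_Sys_ name you find in sys.stats:

-- Returns the header, density vector and histogram of the statistic
DBCC SHOW_STATISTICS ('dbo.myBigTable', '_WA_Sys_00000003_0EA330E9')
GO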

But the worst part is the piece that is supposed to be the most helpful, where the query optimizer gathers insight into the data to produce nice query plans: the histogram. In this case it doesn't look very useful, right?

Also, you need to know that statistics on [N]VARCHAR() columns store a maximum of 200 characters in the RANGE_HI_KEY, so for a MAX data type like this one they are absolutely pointless.

 
Auto Create Statistics and Data Types

After seeing this, I got a bit concerned and wondered whether all other data types would get statistics created automatically. So I decided to create a table with all data types and check it out!

USE master
GO
CREATE DATABASE stats_sample
GO

USE stats_sample
GO
-- One column per data type, each with a dummy default so rows contain data
CREATE TABLE stats_sample (c_bigint bigint NULL  DEFAULT CONVERT(bigint, '1')
, c_binary binary NULL  DEFAULT CONVERT(binary, '1')
, c_bit bit NULL  DEFAULT CONVERT(bit, '1')
, c_char char NULL  DEFAULT CONVERT(char, '0')
, c_date date NULL  DEFAULT GETDATE()
, c_datetime datetime NULL  DEFAULT GETDATE()
, c_datetime2 datetime2 NULL  DEFAULT GETDATE()
, c_datetimeoffset datetimeoffset NULL  DEFAULT GETDATE()
, c_decimal decimal NULL  DEFAULT CONVERT(decimal, '1')
, c_float float NULL  DEFAULT CONVERT(float, '1')
, c_geography geography NULL  DEFAULT 'CURVEPOLYGON(CIRCULARSTRING(-1 1, -1 1, -1 1, -1 1, -1 1))'
, c_geometry geometry NULL  DEFAULT geometry::Parse('POINT(1 1 NULL NULL)')
, c_hierarchyid hierarchyid NULL  DEFAULT '/'
, c_image image NULL  DEFAULT CONVERT(image, '1')
, c_int int NULL  DEFAULT CONVERT(int, '1')
, c_money money NULL  DEFAULT CONVERT(money, '1')
, c_nchar nchar NULL  DEFAULT CONVERT(nchar, '0')
, c_ntext ntext NULL  DEFAULT CONVERT(ntext, '0')
, c_numeric numeric NULL  DEFAULT CONVERT(numeric, '1')
, c_nvarchar nvarchar NULL  DEFAULT CONVERT(nvarchar, '0')
, c_real real NULL  DEFAULT CONVERT(real, '1')
, c_smalldatetime smalldatetime NULL  DEFAULT GETDATE()
, c_smallint smallint NULL  DEFAULT CONVERT(smallint, '1')
, c_smallmoney smallmoney NULL  DEFAULT CONVERT(smallmoney, '1')
, c_sql_variant sql_variant NULL  DEFAULT CONVERT(sql_variant, '1')
, c_sysname sysname NULL  DEFAULT CONVERT(sysname, '0')
, c_text text NULL  DEFAULT CONVERT(text, '0')
, c_time time NULL  DEFAULT GETDATE()
, c_timestamp timestamp NULL 
, c_tinyint tinyint NULL  DEFAULT CONVERT(tinyint, '1')
, c_uniqueidentifier uniqueidentifier NULL  DEFAULT NEWID()
, c_varbinary varbinary NULL  DEFAULT CONVERT(varbinary, '1')
, c_varchar varchar NULL  DEFAULT CONVERT(varchar, '0')
, c_xml xml NULL  DEFAULT CONVERT(xml, '1'))
GO

INSERT INTO dbo.stats_sample DEFAULT VALUES
GO 1000

Note that I had to give every column a dummy default value in order to have some data to generate stats on.

Once my table is in place and loaded, auto create stats will be triggered whenever the query optimizer thinks it'd be nice to have them in order to get estimates and generate a better query plan, so there we go:

SELECT * FROM stats_sample WHERE c_bigint IS NOT NULL
SELECT * FROM stats_sample WHERE c_binary IS NOT NULL
SELECT * FROM stats_sample WHERE c_bit IS NOT NULL
SELECT * FROM stats_sample WHERE c_char IS NOT NULL
SELECT * FROM stats_sample WHERE c_date IS NOT NULL
SELECT * FROM stats_sample WHERE c_datetime IS NOT NULL
SELECT * FROM stats_sample WHERE c_datetime2 IS NOT NULL
SELECT * FROM stats_sample WHERE c_datetimeoffset IS NOT NULL
SELECT * FROM stats_sample WHERE c_decimal IS NOT NULL
SELECT * FROM stats_sample WHERE c_float IS NOT NULL
SELECT * FROM stats_sample WHERE c_geography IS NOT NULL
SELECT * FROM stats_sample WHERE c_geometry IS NOT NULL
SELECT * FROM stats_sample WHERE c_hierarchyid IS NOT NULL
SELECT * FROM stats_sample WHERE c_image IS NOT NULL
SELECT * FROM stats_sample WHERE c_int IS NOT NULL
SELECT * FROM stats_sample WHERE c_money IS NOT NULL
SELECT * FROM stats_sample WHERE c_nchar IS NOT NULL
SELECT * FROM stats_sample WHERE c_ntext IS NOT NULL
SELECT * FROM stats_sample WHERE c_numeric IS NOT NULL
SELECT * FROM stats_sample WHERE c_nvarchar IS NOT NULL
SELECT * FROM stats_sample WHERE c_real IS NOT NULL
SELECT * FROM stats_sample WHERE c_smalldatetime IS NOT NULL
SELECT * FROM stats_sample WHERE c_smallint IS NOT NULL
SELECT * FROM stats_sample WHERE c_smallmoney IS NOT NULL
SELECT * FROM stats_sample WHERE c_sql_variant IS NOT NULL
SELECT * FROM stats_sample WHERE c_sysname IS NOT NULL
SELECT * FROM stats_sample WHERE c_text IS NOT NULL
SELECT * FROM stats_sample WHERE c_time IS NOT NULL
SELECT * FROM stats_sample WHERE c_timestamp IS NOT NULL
SELECT * FROM stats_sample WHERE c_tinyint IS NOT NULL
SELECT * FROM stats_sample WHERE c_uniqueidentifier IS NOT NULL
SELECT * FROM stats_sample WHERE c_varbinary IS NOT NULL
SELECT * FROM stats_sample WHERE c_varchar IS NOT NULL
SELECT * FROM stats_sample WHERE c_xml IS NOT NULL
GO

Cool, that should be enough. Let’s see whether we have statistics in our table or not.

SELECT OBJECT_NAME(stc.object_id) AS object_name, st.auto_created, st.name AS stats_name, c.name AS column_name, t.name AS type_name
    FROM sys.stats_columns AS stc
        INNER JOIN sys.stats AS st
            ON st.object_id = stc.object_id
                AND st.stats_id = stc.stats_id
        INNER JOIN sys.columns AS c
            ON c.object_id = stc.object_id
                AND c.column_id = stc.column_id 
        INNER JOIN sys.types AS t
            ON t.user_type_id = c.user_type_id  
    WHERE stc.object_id = OBJECT_ID('dbo.stats_sample')

[Image: 02_auto_created_stats — the statistics created on dbo.stats_sample]

Yeah, there you go: all these _WA_Sys_ stats tell me they have been automatically created (there is the auto_created flag in sys.stats if you don't believe me), but I can see there are only 31 of them, while I created 34 columns.
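In case you want to double-check that flag directly:

-- auto_created = 1 flags the statistics SQL Server created on its own
SELECT name, auto_created, user_created
	FROM sys.stats
	WHERE object_id = OBJECT_ID('dbo.stats_sample')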

That's funny; let's see which data types did not get statistics.

SELECT * 
	FROM sys.types AS t
		LEFT JOIN (SELECT c.user_type_id
						FROM sys.stats_columns AS stc
							INNER JOIN sys.columns AS c
								ON c.object_id = stc.object_id
									AND c.column_id = stc.column_id 
						WHERE stc.object_id = OBJECT_ID('dbo.stats_sample')) AS st
			ON st.user_type_id = t.user_type_id
	WHERE st.user_type_id IS NULL

[Image: 03_data_types_without_statistics — data types that did not get statistics]

 
There they are. XML and CLR data types (geometry and geography are CLR types) do not get statistics, and that also means you cannot create them manually either.

Just for the sake of it:

CREATE STATISTICS st_stats_sample_geometry		ON dbo.stats_sample (c_geometry)
GO
CREATE STATISTICS st_stats_sample_geography		ON dbo.stats_sample (c_geography)
GO
CREATE STATISTICS st_stats_sample_xml			ON dbo.stats_sample (c_xml)
GO

[Image: 04_create_stats_error — CREATE STATISTICS errors for xml, geometry and geography]

 
So apart from these special data types, the rest just got their stats created without question, for better and for worse.

 
The good part

Pretty obvious: statistics are very important for the query optimizer, so getting them off the shelf (it's the default behavior) is a big win.

 
The bad part

Sometimes statistics are not as helpful as we'd like, and if we mix that with some other bad practices, like skipping full-text indexes for [N]VARCHAR(MAX) columns and querying them like any other column, which will trigger auto create stats, the result can be a disaster.
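A minimal sketch of the anti-pattern, assuming a hypothetical documents table with the XML stored as NVARCHAR(MAX):

-- This predicate cannot benefit from the (truncated) histogram, yet it
-- still triggers the auto-creation of statistics on the NVARCHAR(MAX) column
SELECT id
	FROM dbo.documents
	WHERE document_body LIKE '%<customer>%'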

 
What to do?

Is it time to panic and run in circles? Not really.

Check your maintenance and your stats in general. If you are on SQL Server 2008 R2 SP2 or SQL Server 2012 SP1 and higher, you can use sys.dm_db_stats_properties, which is great; for earlier versions you might have to script something yourself using DBCC SHOW_STATISTICS.

A query to start with might look like this:

SELECT OBJECT_NAME(st.object_id) AS [object_name]
		, st.name AS stats_name
		, st.object_id
		, st.stats_id
		, stp.last_updated
		, stp.rows
		, stp.rows_sampled
		, stp.modification_counter
		-- , *
	FROM sys.stats AS st
		CROSS APPLY sys.dm_db_stats_properties(st.object_id, st.stats_id) AS stp
	WHERE OBJECTPROPERTYEX(st.object_id, 'IsMSShipped') = 0
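Building on that, here is a variation I'd use to spot candidates for trouble; the row threshold is completely arbitrary, so adjust it to your environment:

-- Statistics whose last update read every row of a large table
SELECT OBJECT_NAME(st.object_id) AS [object_name]
		, st.name AS stats_name
		, stp.rows
		, stp.rows_sampled
		, stp.last_updated
	FROM sys.stats AS st
		CROSS APPLY sys.dm_db_stats_properties(st.object_id, st.stats_id) AS stp
	WHERE stp.rows_sampled = stp.rows	-- last update was (effectively) a FULLSCAN
		AND stp.rows > 10000000			-- arbitrary threshold for "big"
		AND OBJECTPROPERTYEX(st.object_id, 'IsMSShipped') = 0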

If you spot something that makes you suspicious, there are still some things we can do about it. This post is getting kind of long but, whatever, let's dive into some light theoretical talk.

 
Auto Update Stats

You probably use either your own scripts or Ola Hallengren's to maintain both indexes and stats, but if you don't, SQL Server still wants your stats to be up to date, so it will automatically trigger an update of the statistics when they are outdated.

So, when are stats outdated? Long story short, more or less when the modification counter from dm_db_stats_properties reaches about 20% of the rows, and this is important because BOL states (see the sketch after the list for rough numbers):

  • SQL Server (2014 and earlier) uses a threshold based on the percent of rows changed. This is regardless of the number of rows in the table.
  • SQL Server (starting with 2016 and under the compatibility level 130) uses a threshold that adjusts according to the number of rows in the table. With this change, statistics on large tables will be updated more often.
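A rough sketch of both rules for our 41-million-row table; the dynamic formula SQRT(1000 * rows) is the one documented for trace flag 2371 and compatibility level 130, so treat the numbers as approximations:

DECLARE @rows bigint = 41000000

-- Classic rule: 500 rows plus 20% of the table has to change
SELECT 500 + (0.20 * @rows)	AS classic_threshold	-- ~8.2 million changes

-- Dynamic rule (2016 / compatibility level 130): SQRT(1000 * rows)
SELECT SQRT(1000.0 * @rows)	AS dynamic_threshold	-- ~202,485 changes
GO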

 
See how it speaks about the percentage of rows changed? That is misleading, for me at least, since every modification on that column counts; if you modify the very same row again and again, the counter will keep increasing, and even though the rest of the table is totally unchanged, auto update stats (or even your scripts) will believe it's time to update.

And, again back to BOL, that update will happen before the query we want to run is executed, unless AUTO_UPDATE_STATISTICS_ASYNC is ON, in which case the query compiles with the existing statistics and the update is deferred to run asynchronously.
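Checking and flipping that option is straightforward; as always, evaluate the trade-offs before enabling it:

-- Check the current setting for every database
SELECT name, is_auto_update_stats_async_on
	FROM sys.databases

-- Enable asynchronous statistics updates on one of them
ALTER DATABASE stats_sample SET AUTO_UPDATE_STATISTICS_ASYNC ON
GO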

Now go figure whether my 10-hour stats update was triggered because someone ran a query, instead of happening during a maintenance window… ugly stuff there.

You can still do some tricks, like manually updating the stats with no rows and NORECOMPUTE, which will effectively kill those stats until you revert it, but please be cautious and don't jump into it before evaluating all the consequences.
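A sketch of that trick, using placeholder names: SAMPLE 0 ROWS leaves the statistics object without data, and NORECOMPUTE stops auto update stats from touching it again.

-- Empty the statistic and disable auto update on it (handle with care)
UPDATE STATISTICS dbo.myBigTable [_WA_Sys_00000003_0EA330E9]
	WITH SAMPLE 0 ROWS, NORECOMPUTE
GO

-- Revert: rebuild the statistic; omitting NORECOMPUTE re-enables auto update
UPDATE STATISTICS dbo.myBigTable [_WA_Sys_00000003_0EA330E9]
	WITH FULLSCAN
GO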

 
Conclusion

As is true most of the time, there is no single solution to fit all situations, so the better we understand the internals, the better choices we will be able to make and the better we can forecast the caveats of each choice.

Thanks for reading, and if you have any questions, please ask in the comments below.

 
