{"id":323,"date":"2016-09-30T08:24:02","date_gmt":"2016-09-30T07:24:02","guid":{"rendered":"https:\/\/sqldoubleg.live-website.com\/?p=323"},"modified":"2017-02-11T17:21:40","modified_gmt":"2017-02-11T17:21:40","slug":"multi-column-statistics-are-back-in-sql-server-2016","status":"publish","type":"post","link":"https:\/\/www.sqldoubleg.com\/es\/2016\/09\/30\/multi-column-statistics-are-back-in-sql-server-2016\/","title":{"rendered":"Multi-column statistics are back in SQL Server 2016"},"content":{"rendered":"<p>The new cardinality estimator in SQL Server 2014 brought new and improved query performance but as always there are caveats. Let&#8217;s check a specific case to see the different behaviors for the different versions of the CE&nbsp;Don&#8217;t get me wrong, they&#8217;ve always been there, but somehow someone forgot about them and weren&#8217;t used.<\/p>\n<p>Since last week my mind has been on other things, I&#8217;m intending to write a series of articles and didn&#8217;t have much of my brain to get me write a post for this week, so finally I decided to speak about some issues which have been already described, but I believe they are good to be reminded.<\/p>\n<p>&nbsp;<br \/>\n<b>What am I talking about? yeah, statistics<\/b><\/p>\n<p>So statistics are the base for the cardinality estimator (CE from now on) to figure out how many rows will be returned by an operator during the execution of a query, I described some behaviors already in previous <a href=\"https:\/\/sqldoubleg.live-website.com\/2016\/03\/03\/fast-with-values-slow-with-variables\/\" target=\"_blank\">posts<\/a>, so this time I&#8217;ll go a bit more to the specifics.<\/p>\n<p>So far and for recent versions of SQL Server, we have 2 (I&#8217;d say 3) different versions of the CE. 
It was one of the big announcements for SQL Server 2014, which left the picture like this.<\/p>\n<ul>\n<li>SQL Server 2000 up to SQL Server 2012, CE version 70<\/li>\n<li>SQL Server 2014, <a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/dn600374(v=sql.120).aspx\" target=\"_blank\">new CE version 120<\/a><\/li>\n<\/ul>\n<p>&nbsp;<br \/>\nThese numbers match the compatibility levels of the SQL Server versions they were released with. Coincidence? I don&#8217;t think so \ud83d\ude42<\/p>\n<p>In SQL Server 2016 we have been given an improved version of the CE which, following the same pattern, would be version 130.<\/p>\n<p>I&#8217;m sure there are many small differences, but I just want to talk about how the estimated number of rows differs from one version to another when we query a single table using two predicates and have multi-column statistics that cover the combination.<\/p>\n<p>To run the 3 examples, I used my SQL Server 2016 Developer Edition instance with the sample database AdventureWorks2014.<\/p>\n<p>Let&#8217;s get a fresh copy of the database and create the statistics<\/p>\n<pre class=\"brush: tsql; title: ; notranslate\" title=\"\">\r\nUSE master\r\nGO\r\nIF DB_ID('AdventureWorks2014') IS NOT NULL BEGIN\r\n\tALTER DATABASE AdventureWorks2014 SET SINGLE_USER WITH ROLLBACK IMMEDIATE\r\n\tDROP DATABASE AdventureWorks2014\r\nEND\r\n\r\nRESTORE DATABASE AdventureWorks2014\r\nFROM DISK = '.\\AdventureWorks2014.bak'\r\nWITH RECOVERY\r\n, MOVE 'AdventureWorks2014_Data' TO 'C:\\Program Files\\Microsoft SQL Server\\MSSQL13.MSSQL2016\\MSSQL\\DATA\\AdventureWorks2014_Data.mdf'\r\n, MOVE 'AdventureWorks2014_Log' TO 'C:\\Program Files\\Microsoft SQL Server\\MSSQL13.MSSQL2016\\MSSQL\\DATA\\AdventureWorks2014_Log.ldf'\r\nGO\r\n\r\nUSE AdventureWorks2014\r\n\r\nCREATE STATISTICS Person_Address_PostalCode_City ON Person.Address (PostalCode, City) WITH FULLSCAN\r\n\r\n<\/pre>\n<p>Once it&#8217;s done, we can test the different 
CE versions<\/p>\n<pre class=\"brush: tsql; title: ; notranslate\" title=\"\">\r\nALTER DATABASE AdventureWorks2014 SET COMPATIBILITY_LEVEL = 110 WITH ROLLBACK IMMEDIATE\r\nGO\r\nSELECT * \r\nFROM Person.Address\r\nWHERE City = N'Wokingham'\r\nAND PostalCode = N'RG41 1QW'\r\nOPTION (RECOMPILE)\r\nGO\r\n<\/pre>\n<p><a href=\"https:\/\/sqldoubleg.live-website.com\/wp-content\/uploads\/2016\/09\/01_estimate_CE_70.png\" target=\"_blank\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/sqldoubleg.live-website.com\/wp-content\/uploads\/2016\/09\/01_estimate_CE_70.png\" alt=\"01_estimate_ce_70\" width=\"800\" height=\"451\" class=\"aligncenter size-full wp-image-324\" srcset=\"https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/01_estimate_CE_70.png 800w, https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/01_estimate_CE_70-300x169.png 300w, https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/01_estimate_CE_70-150x85.png 150w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/a><\/p>\n<p>This estimate comes from the statistics we have just created; if we have a look at them, I&#8217;ll show you the maths.<\/p>\n<pre class=\"brush: tsql; title: ; notranslate\" title=\"\">\r\nDBCC SHOW_STATISTICS ('Person.Address', 'Person_Address_PostalCode_City') \r\n<\/pre>\n<p><a href=\"https:\/\/sqldoubleg.live-website.com\/wp-content\/uploads\/2016\/09\/02_DBCC_SHOW_STATISTICS_multi_column.png\" target=\"_blank\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/sqldoubleg.live-website.com\/wp-content\/uploads\/2016\/09\/02_DBCC_SHOW_STATISTICS_multi_column.png\" alt=\"02_dbcc_show_statistics_multi_column\" width=\"800\" height=\"215\" class=\"aligncenter size-full wp-image-325\" \/><\/a><\/p>\n<p>By multiplying the number of rows by the &#8216;All density&#8217; for the combination of columns, we get the number shown above: <b>19614 * 0.001485884 = 29.144128776<\/b>.<\/p>\n<p>The actual row count is 31, so in this case the estimate is not far off the 
actual.<\/p>\n<p>This is with the so-called &#8216;legacy CE&#8217;, so let&#8217;s see how it goes with the first revision of the new CE<\/p>\n<pre class=\"brush: tsql; title: ; notranslate\" title=\"\">\r\nALTER DATABASE AdventureWorks2014 SET COMPATIBILITY_LEVEL = 120 WITH ROLLBACK IMMEDIATE\r\nGO\r\nSELECT * \r\nFROM Person.Address\r\nWHERE City = N'Wokingham'\r\nAND PostalCode = N'RG41 1QW'\r\nOPTION (RECOMPILE)\r\nGO\r\n<\/pre>\n<p><a href=\"https:\/\/sqldoubleg.live-website.com\/wp-content\/uploads\/2016\/09\/03_estimate_CE_120.png\" target=\"_blank\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/sqldoubleg.live-website.com\/wp-content\/uploads\/2016\/09\/03_estimate_CE_120.png\" alt=\"03_estimate_ce_120\" width=\"800\" height=\"450\" class=\"aligncenter size-full wp-image-327\" srcset=\"https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/03_estimate_CE_120.png 800w, https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/03_estimate_CE_120-300x169.png 300w, https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/03_estimate_CE_120-150x84.png 150w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/a><\/p>\n<p>See how this one is way off; all that marketing about the benefits of the new CE, and then this&#8230; Disappointing!<\/p>\n<p>This version of the CE does not care about the existence of multi-column stats and just does the maths looking at single-column stats.<\/p>\n<p>If we look at the single-column stats, we can work the numbers out<\/p>\n<p><strong><em>**Thanks to Fraser Watson for pointing this out<\/em><\/strong><\/p>\n<pre class=\"brush: tsql; title: ; notranslate\" title=\"\">\r\nDBCC SHOW_STATISTICS ('Person.Address', 'City') WITH HISTOGRAM\r\nDBCC SHOW_STATISTICS ('Person.Address', 'Person_Address_PostalCode_City') WITH HISTOGRAM\r\n<\/pre>\n<p><a href=\"https:\/\/sqldoubleg.live-website.com\/wp-content\/uploads\/2016\/09\/04_DBCC_SHOW_STATISTICS_single_column_Historgram.png\" target=\"_blank\"><img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/sqldoubleg.live-website.com\/wp-content\/uploads\/2016\/09\/04_DBCC_SHOW_STATISTICS_single_column_Historgram.png\" alt=\"04_dbcc_show_statistics_single_column_historgram\" width=\"934\" height=\"347\" class=\"aligncenter size-full wp-image-353\" srcset=\"https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/04_DBCC_SHOW_STATISTICS_single_column_Historgram.png 934w, https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/04_DBCC_SHOW_STATISTICS_single_column_Historgram-300x111.png 300w, https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/04_DBCC_SHOW_STATISTICS_single_column_Historgram-150x56.png 150w\" sizes=\"(max-width: 934px) 100vw, 934px\" \/><\/a><\/p>\n<p>The new CE assumes some correlation between predicates within a single table, using the following formula<\/p>\n<blockquote><p>\nmost selective * SQRT(2nd most selective) * number of rows\n<\/p><\/blockquote>\n<p>In our case the numbers are <\/p>\n<pre class=\"brush: tsql; title: ; notranslate\" title=\"\">SELECT  (CONVERT(FLOAT, 31) \/ 19614) * SQRT((CONVERT(FLOAT, 31) \/ 19614)) * 19614\r\n-- 1.23242203675524\r\n<\/pre>\n<p>which makes the estimated value. <\/p>\n<p>Forgetting about multi-column is bad in this case, probably it works better for some cases, but let&#8217;s see the facts <\/p>\n<ul>\n<li>&#8216;Wokingham&#8217; has an estimate of 31 in the histogram of the column stats [City]<\/li>\n<li>&#8216;RG41 1QW&#8217; has an estimate of 31 in the histogram of the multi column stats [Person_Address_PostalCode_City]<\/li>\n<\/ul>\n<p>&nbsp;<br \/>\nThe average is 29 and the current is 31 just because both are directly correlated, but what if we were requesting a post code from a different city? 
The result would be zero, so 1.23 would be way better&#8230; It&#8217;s difficult to estimate correctly every time.<\/p>\n<p>And finally, how do things look in SQL Server 2016?<\/p>\n<pre class=\"brush: tsql; title: ; notranslate\" title=\"\">\r\nALTER DATABASE AdventureWorks2014 SET COMPATIBILITY_LEVEL = 130 WITH ROLLBACK IMMEDIATE\r\nGO\r\nSELECT * \r\nFROM Person.Address\r\nWHERE City = N'Wokingham'\r\nAND PostalCode = N'RG41 1QW'\r\nOPTION (RECOMPILE)\r\nGO\r\n<\/pre>\n<p><a href=\"https:\/\/sqldoubleg.live-website.com\/wp-content\/uploads\/2016\/09\/05_estimate_CE_130.png\" target=\"_blank\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/sqldoubleg.live-website.com\/wp-content\/uploads\/2016\/09\/05_estimate_CE_130.png\" alt=\"05_estimate_ce_130\" width=\"800\" height=\"475\" class=\"aligncenter size-full wp-image-329\" srcset=\"https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/05_estimate_CE_130.png 800w, https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/05_estimate_CE_130-300x178.png 300w, https:\/\/www.sqldoubleg.com\/wp-content\/uploads\/2016\/09\/05_estimate_CE_130-150x89.png 150w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/a><\/p>\n<p>Again we get the 29.1 estimate, so the newest CE is looking at multi-column stats again.<\/p>\n<p>&nbsp;<br \/>\n<b>Conclusion<\/b><\/p>\n<p>What we can learn from this is that it is impossible to always be right when estimating the number of rows if your only resource is statistics; single or multi-column, it doesn&#8217;t matter, there is always a set of values out there ready to defeat your logic.<\/p>\n<p>However, I think it&#8217;s a good idea that SQL Server 2016 goes back to looking at multi-column stats, for a simple reason: these are user-created stats and therefore give us (DBAs, devs) more power over how rows are estimated.<\/p>\n<p>And <em><a href=\"https:\/\/en.wikiquote.org\/wiki\/Stan_Lee\" target=\"_blank\">With great power there must also come great responsibility<\/a><\/em>; if we 
create multi-column stats randomly, we might hurt performance for a lot of queries that would benefit from the assumed correlation; but imagine, for example, dynamically loaded select lists where each list depends on the previous one: there this can be a great enhancement.<\/p>\n<p>As always, there&#8217;s no single truth. It will vary for each and every case, so test, test and test before opting for one option or the other.<\/p>\n<p>Just to wrap up, a few links for reference.<\/p>\n<ul>\n<li><a href=\"https:\/\/technet.microsoft.com\/en-us\/library\/dd535534%28v=sql.100%29.aspx?f=255&#038;MSPPError=-2147217396\" target=\"_blank\">Statistics Used by the Query Optimizer in Microsoft SQL Server 2008<\/a> (white paper)<\/li>\n<li><a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/dn600374(v=sql.120).aspx\" target=\"_blank\">Cardinality Estimator (SQL Server 2014)<\/a><\/li>\n<li><a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/dn600374(v=sql.130).aspx\" target=\"_blank\">Cardinality Estimator (SQL Server 2016)<\/a><\/li>\n<li><a href=\"http:\/\/www.sqlskills.com\/blogs\/kimberly\/multi-column-statistics-exponential-backoff\/\" target=\"_blank\">Multi-column statistics and exponential backoff<\/a><\/li>\n<\/ul>\n<p>&nbsp;<br \/>\nThanks for reading!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The new cardinality estimator in SQL Server 2014 brought new and improved query performance but as always there are caveats. 
Let&#8217;s check a specific case to see the different behaviors for the&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,21],"tags":[32,48,23,26],"_links":{"self":[{"href":"https:\/\/www.sqldoubleg.com\/es\/wp-json\/wp\/v2\/posts\/323"}],"collection":[{"href":"https:\/\/www.sqldoubleg.com\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sqldoubleg.com\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sqldoubleg.com\/es\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sqldoubleg.com\/es\/wp-json\/wp\/v2\/comments?post=323"}],"version-history":[{"count":0,"href":"https:\/\/www.sqldoubleg.com\/es\/wp-json\/wp\/v2\/posts\/323\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.sqldoubleg.com\/es\/wp-json\/wp\/v2\/media?parent=323"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sqldoubleg.com\/es\/wp-json\/wp\/v2\/categories?post=323"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.sqldoubleg.com\/es\/wp-json\/wp\/v2\/tags?post=323"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}