oct
17
2015

Divide and Conquer Date Search Hack

Divide and Conquer Date Search Hack (because binary search sounds boring)

This week has been an amazing experience, not only I have attended to one of the most valuable SQL Server training you can find (IEPTO2 by @SqlSkills team), I’ve also met really nice people and brains, all of them with something in common, a big passion for technology and specially SQL Server.

Speaking of crazy ideas and ‘hacks’ we’ve done through the years to make things to work, one of them got my attention for its simplicity, the idea of hacking a Date search by hitting a numeric key.

The idea of binary search has been there for a while, but taking advantage of the algorithm and applying it to SQL Server how it’s described in a blog post by Rainer Unwin (See original post) was so great that I couldn’t stop myself for writing my own version, so I got to it.

The approach is such that having a numeric clustered key, with an ever increasing pattern (Identity fulfils that perfectly), by doing individual seeks for particular keys and checking the date values for those rows, you can narrow down the range to end up pointing the desired date, understanding that the date we’re looking for, should be also ever increasing as the clustered key is.

Let’s say this is the whole range of possible values for the clustered key within our table

divide_01

If the value we’re looking for is neither of those, we divide the range and ask in which side our value can be, so we can eliminate half of the values at once.

divide_01

Doing consecutive divisions of the range, we will narrow down the subset to the value we are looking for (if exists) and then we can return the key for that value

divide_01

This can be extremely useful if you have a medium or big size table without a nonclustered index on the date column you want to search on. The performance of individual seeks on the clustered primary key comparing to a clustered index scan can go from several minutes to less than a second.

An that scales like a treat, just for a second imagine the biggest number you can… Not that one… BIGGER!!!

Take it to the extreme… the limit of BIGINT in SQL Server is 9,223,372,036,854,775,807… 9 quintillions.

To be able to insert that amount of records in a table, you would need to insert 1 million rows per second during 292471 YEARS!!!!

And even though, the loop would only need 63 cycles… yes sixty three cycles.

DECLARE @n		BIGINT = 9223372036854775807
DECLARE @count	INT = 0

WHILE @n / 2. > 1 BEGIN
	SET @n = @n / 2.
	SET @cycles += 1
END

SELECT @cycles

Or also can be calculated as follows

SELECT FLOOR(LOG(9223372036854775807)/LOG(2))

First I’ll show you a specific example for a table and then I’ll provide you with a stored procedure which will take some parameters and dynamically will look for absolutely any date within any table on any database that might exist in your server

USE AdventureWorks2014

DECLARE @maxID			INT 
		, @minID		INT
		, @cutoff		INT
		, @dateToFind	DATETIME = '2011-10-14 00:00:00.000'

SELECT @minID = MIN(SalesOrderID), @maxID = MAX(SalesOrderID) FROM [Sales].[SalesOrderHeader]

WHILE ISNULL((SELECT 1 FROM [Sales].[SalesOrderHeader] WHERE SalesOrderID IN (@minID, @maxID) AND OrderDate = @dateToFind),0) <> 1 BEGIN

	-- To avoid infinite loop if not found
	IF @maxID = (SELECT MIN(SalesOrderID) FROM [Sales].[SalesOrderHeader] WHERE SalesOrderID > @minID)  BREAK
	
	SET @cutoff = @maxID - (@maxID - @minID) / 2

	IF @dateToFind < (SELECT OrderDate FROM [Sales].[SalesOrderHeader] WHERE SalesOrderID = @cutoff) 
		SET @maxID = @cutoff
	ELSE
		SET @minID = @cutoff

END

SELECT * FROM [Sales].[SalesOrderHeader] WHERE SalesOrderID IN (@minID, @maxID) AND (OrderDate = @dateToFind)

In this example we have 31465 rows and the loop does 14 cycles to find a matching row, which can become 14*3 Selects, (one of them to 2 key values so 4), 56 index seeks (plus the bookmark lookups when we check for the date, don’t take this as true as it’s a pretty rough calculation). Still nothing if we compare it to a Clustered index scan.

Please see the original post for more info about IO

And you’ve seen than the bigger the number of rows, the better performance you get out of it, just imagine that table in the billions of rows.

The stored procedure would be something as follows

USE master
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF OBJECT_ID('dbo.sqlg_Divide_And_Conquer') IS NULL EXECUTE sp_executesql N'CREATE PROCEDURE dbo.sqlg_Divide_And_Conquer AS RETURN'
GO
-- =============================================
-- Author:		Raul Gonzalez (@SQLDoubleG)
-- Create date: 15/10/2015
-- Description:	Returns the row(s) within the given table which match the date searching criteria
--
-- Parameters:
--				@DatabaseName	, Optional, if not specified will search on the current database	
--				@SchemaName		, Optional
--				@TableName		, Table where we want to search
--				@KeyColumnName	, Name of the Clustered key column
--				@DateColumnName	, Name of the Date(/time) column 
--				@DateToFind		, Date(/time) we are looking for
--				@Aprox			, Will return the closest values in case no there is no exact matching
--
-- Assumptions:	
--				The Clustered Key column provided is a number ever increasing.
--					In case of Clustered composite keys, use the leftmost key column if that follows the desired pattern.
--				The Date to find follows the same pattern, is ever increasing, hence correlated to the Clustered Key value
--
-- Log History:	
--				15/10/2015	RAG	Created
--
-- Copyright:	(C) 2015 Raul Gonzalez (@SQLDoubleG http://www.sqldoubleg.com)
--
--				THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF 
--				ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED 
--				TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
--				PARTICULAR PURPOSE.
--
--				THE AUTHOR SHALL NOT BE LIABLE TO YOU OR ANY THIRD PARTY FOR ANY INDIRECT, 
--				SPECIAL, INCIDENTAL, PUNITIVE, COVER, OR CONSEQUENTIAL DAMAGES OF ANY KIND
--
--				YOU MAY ALTER THIS CODE FOR YOUR OWN *NON-COMMERCIAL* PURPOSES. YOU MAY
--				REPUBLISH ALTERED CODE AS LONG AS YOU INCLUDE THIS COPYRIGHT AND GIVE DUE CREDIT. 
--
-- =============================================
ALTER PROCEDURE dbo.sqlg_Divide_And_Conquer 
		@DatabaseName		SYSNAME = NULL
		, @SchemaName		SYSNAME = NULL
		, @TableName		SYSNAME 
		, @KeyColumnName	SYSNAME 
		, @DateColumnName	SYSNAME 
		, @DateToFind		DATETIME
		, @Aprox			BIT = 0
AS 
BEGIN

SET NOCOUNT ON

DECLARE @sql NVARCHAR(4000)
DECLARE @FullObjectName NVARCHAR(386)

-- sanitize the inputs
SET @DatabaseName		= QUOTENAME(REPLACE(REPLACE(@DatabaseName	, '[', ''), ']', ''))
SET @SchemaName			= QUOTENAME(REPLACE(REPLACE(@SchemaName		, '[', ''), ']', ''))
SET @TableName			= QUOTENAME(REPLACE(REPLACE(@TableName		, '[', ''), ']', ''))
SET @KeyColumnName		= QUOTENAME(REPLACE(REPLACE(@KeyColumnName	, '[', ''), ']', ''))
SET @DateColumnName		= QUOTENAME(REPLACE(REPLACE(@DateColumnName	, '[', ''), ']', ''))

SET @FullObjectName		= ISNULL(@DatabaseName + '.', '') + ISNULL(@SchemaName + '.', '') + @TableName

SET @sql = '

DECLARE  @max			INT 
		, @min			INT
		, @cutoff		INT

	SELECT @min=MIN(' + @KeyColumnName +') , @max=MAX(' + @KeyColumnName +') FROM ' + @FullObjectName + '

	WHILE ISNULL((SELECT TOP 1 1 FROM ' + @FullObjectName + ' WHERE ' + @KeyColumnName +' IN (@min, @max) AND ' + @DateColumnName + ' = @DateToFind),0) <> 1 BEGIN

		-- To avoid infinite loop if not found
		IF @max = (SELECT MIN(' + @KeyColumnName +') FROM ' + @FullObjectName + ' WHERE ' + @KeyColumnName +' > @min)  BREAK
	
		--PRINT ''min => ''  + CONVERT(VARCHAR,@min) + '', max => ''+ CONVERT(VARCHAR,@max)

		SET @cutoff = @max - (@max - @min) / 2

		IF @DateToFind < (SELECT MAX(' + @DateColumnName + ') FROM ' + @FullObjectName + ' WHERE ' + @KeyColumnName +' = @cutoff) 
			SET @max = @cutoff
		ELSE
			SET @min = @cutoff

	END

	SELECT * FROM ' + @FullObjectName + ' WHERE ' + @KeyColumnName +' IN (@min, @max) AND (' + @DateColumnName + ' = @DateToFind OR @Aprox=1)
'
--PRINT @sql

EXECUTE sp_executesql @sql, @params = N'@DateToFind DATETIME, @Aprox BIT', @DateToFind = @DateToFind, @Aprox = @Aprox

END
GO

 
And you can test it against the [AdventureWorks2014] database
 

EXECUTE dbo.sqlg_Divide_And_Conquer 'AdventureWorks2014', '[Sales]', '[SalesOrderHeader]', '[SalesOrderID]', '[OrderDate]', '2013-05-30 00:00:00.000', 0

 
The SP needs a bit more of code and doesn’t look at neat and tidy as the script, but it does what is supposed to do, so for me this is “Mission Accomplished”

Hope you guys like it and throw me any question you might have (about the script, not in general 🙂 )

5 comments
  1. Dom dice:

    Great article mate! Keep it rolling Sql hero!

  2. Darek dice:

    Hi. Code that’s written without paying attention to the length of the line is hard to read. Forcing the reader to have to scroll sideways is a very bad practice. The best software companies in the world have a very good rule when it comes to code formatting: stick to the 80-char limit. Yes, that’s true. This rule not only promotes readability but also, and that’s even more important, forces the writer to think how to simplify the code. Each SQL statement should be put on its own line and not crammed together with other statements and constructs; for instance, when selecting many things, each of the things should be put on its own line. What this rule does for you is that debugging and commenting (in and out) code is much easier (without using the mouse and without the need for changing the structure of the code). Some things to think of… Apart from that, it’s a great article. Thanks.

    • Raul dice:

      Thanks for your comment! I totally agree with coding standards but in this case the SELECT statements are within the IF – ELSE and WHILE structures and I wanted to make those neater than the SELECT themselves, also because they are fairly simple.
      Cheers!

      • Darek dice:

        Raul, what looks to be simple for you will not necessarily be simple for others. I can tell you honestly that the code where the first line is USE AdventureWorks2014; is not easy to read (mainly because of formatting) and therefore not easy to understand, either. Writing code that a computer can understand is easy, almost anyone can do it with little effort. Writing code that humans can easily understand… well, that takes a bit more thought and effort. When you write for others, you’ll be much better off if you don’t assume anything; just make things simple and as clear as the sun. They will thank you. By the way… Statements should always be terminated by semicolons. This will become mandatory very soon (as Microsoft say). Sorry to be that critical but… I’ve been working with poor code written by others for such a long time that when I see it I just can’t help it 🙂 Some people don’t care, they just want to copy and run. I’m different. When I see code I want to be able to quickly understand what it does (because this is where power comes from). This is also what I’m being paid good money for – to understand things and make them understandable for others.

        • Raul dice:

          We won’t live long enough to see Microsoft making code without semicolons to break 🙂

          I know what dealing with poorly written / commented code feels like and I always try to get mine as clear and documented as possible, but there is always room for improvements so I’ll keep your advice in mind.

          Thanks for your time to commenting, I appreciate it.

Deja un comentario

Tu dirección de correo electrónico no será publicada. Los campos obligatorios están marcados con *