
Monday, September 17, 2012

SQL and finding a Floating Point Error

So we have some SQL which goes out and grabs some data, and then we use that data for processing. Pretty normal. But the fun part is, sometimes we reuse that data in the same SQL. No, this is not about stored SQL injection; it's really just about a dumb bug we ran into recently (though a reuse bug would be pretty fun too).

So we have a list of people, and we've compared this list of people to another list of people using the Levenshtein algorithm. Then we store the results of that comparison in the database so some users can check it out later, a time/memory trade-off which has actually been fairly useful (especially with the speed increases we've given it over the years).

Now when a user loads this list of comparisons, we track the current "match_score": the decimal percentage difference between the two people's information, weighted to our liking. So our application gets the match_score:
SELECT TOP 1 * FROM People_Comparisons P WHERE ID = 13689385
Return: 13689385 |  0.214736842105263
From the perspective of the application and the database, ID is effectively random. We sort the results of comparisons by match_score, then by ID, so that when doing reviews we're looking at each ID (x) within the subset of records sharing the same match_score (y). The point is that ID is only sorted within each match_score across the domain of y values. To get the next record we run a query with the current ID and match_score, N(x, y), which uses the following SQL to return the next record of interest:
SELECT TOP 1 * FROM People_Comparisons P WHERE ID = 13689385 AND MATCH_SCORE > {0}
In the above, {0} is replaced with the previously retrieved match_score, in this case the value:
0.214736842105263
Now some experienced database folks probably already see the problem. But for less experienced individuals, this can be quite the snake-bite, because when we run the SQL we get a somewhat unexpected result:

SELECT TOP 1 * FROM People_Comparisons P WHERE ID = 13689385 AND MATCH_SCORE > 0.214736842105263
Return: 13689385 |  0.214736842105263
Obviously something is wrong: we've used the same value (0.21473...), and it seems impossible that this SQL should return the same row when we're expecting a row with a match_score greater than that value! Enter floating-point precision.

The quick rundown of it is this: what the database reports is a truncated version of the raw stored value. We see 15 significant figures displayed, but the database is storing more precision than that. Parsing that truncated text back in gives a raw binary value which is different from the one that is stored - and in this case, less than what is stored. Thus the next match_score greater than the truncated value is the same match_score! To solve this I select the value of the match_score into a SQL variable, which maintains the full precision of the data type, and reuse this variable for comparisons.
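Here is a minimal sketch of that fix in T-SQL, reusing the table and column names from the snippets above (the real schema may differ). The key is that the value never round-trips through a lossy text representation:

DECLARE @match_score FLOAT;

-- Capture the stored value at full precision; no truncation to display text occurs.
SELECT @match_score = MATCH_SCORE
FROM People_Comparisons
WHERE ID = 13689385;

-- Compare against the variable instead of a truncated literal.
SELECT TOP 1 *
FROM People_Comparisons
WHERE ID = 13689385 AND MATCH_SCORE > @match_score;

A parameterized query from the application, passing the value through a FLOAT parameter rather than formatting it into the SQL string, accomplishes the same thing.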

Thursday, October 14, 2010

Optimizing the Levenshtein Algorithm in C#

For fun and profit. So maybe that phrase is used too much, but it is very applicable here. Optimization of any program can be fun and, more importantly, profitable. There are just some algorithms which take a while to run, and the Levenshtein Distance is one of them. It's not terribly complex, and its typical run time is O(n * m), i.e. quadratic. A brief description is that it is basically an extended Hamming Distance: it works in a very similar manner, but it also handles insertions and deletions rather than just substitutions. We generally concern ourselves with ASCII or Unicode as our alphabets because it's useful.

For small amounts of data it's really not that bad; a quadratic run time is pretty damned good, in fact - I mean, at least it's not exponential!

So let's paste some code here to look at:

static public int LevenshteinDistance(string s, string t)
{
    int n = s.Length; // length of s
    int m = t.Length; // length of t
    int[,] d = new int[n + 1, m + 1]; // matrix
    int cost; // cost

    // Step 1
    if (n == 0) return m;
    if (m == 0) return n;

    // Step 2
    for (int i = 0; i <= n; d[i, 0] = i++);
    for (int j = 0; j <= m; d[0, j] = j++);

    // Step 3
    for (int i = 1; i <= n; i++)
    {
        // Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1); // *We'll be looking at this line*
            // Step 6
            d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
        }
    }

    // Step 7
    return d[n, m];
}
Originally authored by: Lasse Johansen
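As a quick sanity check, here's the classic example, assuming the method above is in scope:

// "kitten" -> "sitting" takes 3 edits: k->s, e->i, and appending g.
int distance = LevenshteinDistance("kitten", "sitting");
System.Console.WriteLine(distance); // prints 3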

So most of this is pretty straightforward, and we can skip to the meat of the implementation, Steps 5 and 6. Here we see some pretty simple stuff: get two characters and compare them. If they match, cool, don't increase the cost for this iteration. Add that cost to the corresponding table cell and go to the next one. Rinse and repeat for the length of string s (n) times the length of string t (m).

So everyone can see this is the part that runs the most and consumes the most time and resources, thus we will focus here. Some of you may be having calculus flashbacks and are currently being punched in the face with the derivative-based fact that, for a fixed total input length, the run time m * n hits its maximum when n = m (set c = n + m and maximize n(c - n)), and you'd like to cry yourself to sleep while holding a teddy bear.

But I digress...

Let's talk about objects and why optimization may be necessary =)
//Just the t.Substring(j-1,1) from above.
  IL_009a:  ldarg.1
  IL_009b:  ldloc.s    j
  IL_009d:  ldc.i4.1
  IL_009e:  sub
  IL_009f:  ldc.i4.1
  IL_00a0:  callvirt   instance string [mscorlib]System.String::Substring(int32, int32)


So what is happening here? We push an object onto the stack, in this case t; then we push j and push 1; sub does two pops and one push, leaving j - 1; and then yet another constant is pushed to the stack. This leaves us with a stack that kinda looks like "t | (j - 1) | 1", and we call Substring, which goes off and runs lots more logic and loops doing its thing. Sure seems like a lot, doesn't it? I propose a transform:

t.Substring(j-1,1) => t[j - 1]

Post transform we'll get something like...
//After transform to t[j - 1]
  IL_0097:  ldarg.1
  IL_0098:  ldloc.s    j
  IL_009a:  ldc.i4.1
  IL_009b:  sub
  IL_009c:  callvirt   instance char [mscorlib]System.String::get_Chars(int32)


So this already looks better. As you can see, we've reduced the total number of instructions by one (there is no second ldc.i4 instruction). We're also using the get_Chars(int32) method instead of Substring, and as you can guess, the internal workings of this method have less logic/looping/work to do than the Substring technique - leading to a performance gain. Yet there is more occurring here than just that: the return types are different too, string vs. char.

So here is yet another kicker for the equation: notice how the == operation is handled. In the original, both operands are strings, which yields the following instructions for the final ternary operation:
//Original: cost = [string] == [string] ? 0 : 1;
  IL_00b0:  call       bool [mscorlib]System.String::op_Equality(string, string)
  IL_00b5:  brtrue.s   IL_00ba
  IL_00b7:  ldc.i4.1
  IL_00b8:  br.s       IL_00bb
  IL_00ba:  ldc.i4.0
  IL_00bb:  stloc.3


Compared to:
//Post Transform: cost = [char] == [char] ? 0 : 1;
  IL_00ab:  beq.s      IL_00b0
  IL_00ad:  ldc.i4.1
  IL_00ae:  br.s       IL_00b1
  IL_00b0:  ldc.i4.0
  IL_00b1:  stloc.3

Here we see the call to [mscorlib]System.String::op_Equality(string, string), which of course pushes some return stuff onto the stack and then executes -more- code. Compare that to our transformed code, which simply executes a branch-if-equal (short) on the two chars sitting on the stack and then sets our cost. This is another case where a little saving has occurred.
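Putting the transform back into the method, Step 5 becomes a plain char comparison. Only this line changes; the rest of the method stays as-is:

// Step 5 - compare chars directly instead of allocating one-character strings
cost = (t[j - 1] == s[i - 1] ? 0 : 1);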


So in personal experience this code change has shown me a huge time saving at a critical point on a project in development. It was fun to analyze and, most importantly, practical and profitable. This change, combined with a number of other stricter constraints, took a project from an estimated run time of 30 days on our systems to an estimated run time of 18 hours!
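Your mileage will vary, of course. If you want a rough feel for the Substring-versus-indexer difference in isolation, here is a small micro-benchmark sketch (the string, iteration count, and structure are arbitrary choices of mine; results depend on runtime and hardware):

using System;
using System.Diagnostics;

static class SubstringVsIndexer
{
    static void Main()
    {
        string s = "abcdefghijklmnopqrstuvwxyz";
        const int iterations = 10000000;
        int matches = 0;

        // Time the Substring-based single-character comparison.
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            if (s.Substring(5, 1) == "f") matches++;
        sw.Stop();
        Console.WriteLine("Substring: {0} ms", sw.ElapsedMilliseconds);

        // Time the indexer-based char comparison.
        sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            if (s[5] == 'f') matches++;
        sw.Stop();
        Console.WriteLine("Indexer:   {0} ms", sw.ElapsedMilliseconds);

        Console.WriteLine(matches); // keeps the loops from being optimized away
    }
}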

I am curious if anyone has any insight on the System.Math.Min methods which are called and how they could be sped up?
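I haven't profiled this myself, so treat it purely as a sketch, but one thing worth measuring is replacing the nested System.Math.Min calls in Step 6 with explicit comparisons, which avoids two method calls per cell if the JIT doesn't inline them:

// Step 6, unrolled: compute the minimum of the three candidates by hand.
int deletion = d[i - 1, j] + 1;
int insertion = d[i, j - 1] + 1;
int substitution = d[i - 1, j - 1] + cost;
int min = deletion < insertion ? deletion : insertion;
d[i, j] = min < substitution ? min : substitution;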