Thursday, October 14, 2010

Optimizing the Levenshtein Algorithm in C#

For fun and profit. So maybe that phrase is used too much, but it is very applicable here. Optimization of any program can be fun and more importantly profitable. There are just some algorithms which take a while to run. The Levenshtein Distance is one of these algorithms which can take a while. Its not terribly complex and its O function in a typical case is n2. A brief description is that it is basically an extended Hamming Distance and it works in a very similar manner, but has a larger alphabet. We generally concern ourselves with ASCII, or Unicode as our alphabets because it's useful.

For small amounts of data it's really not that bad, a quadratic run time is pretty damned good in fact - I mean at least its not exponential!

So let's paste some code here to look at:

static public int LevenshteinDistance (string s, string t)
int n = s.Length; //length of s
int m = t.Length; //length of t
int[,] d = new int[n + 1, m + 1]; // matrix
int cost; // cost
// Step 1
if(n == 0) return m;
if(m == 0) return n;
// Step 2
for(int i = 0; i <= n; d[i, 0] = i++);
for(int j = 0; j <= m; d[0, j] = j++);
// Step 3
for(int i = 1; i <= n;i++)
//Step 4
for(int j = 1; j <= m;j++)
// Step 5
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1); // *We'll be looking at this line*.
// Step 6
d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
// Step 7
return d[n, m];
Originally authored by: Lasse Johansen

So most of this is pretty straight forward, and we can just skip to the meat of the implementation, Step 5 & 6. Here we see some pretty simple stuff, get two characters and compare them. If they match, cool, don't increase the cost for this iteration.  Add that cost to the corresponding table cell and go to the next one. Rinse and repeat for the length of string s (n) times the length of string t (m).

So everyone can see this is the part that runs the most and consumes the most time and resources, thus we will focus here. Some of you may be having calculus flash backs and are currently being punched in the face with a derivative concept that the maximum run time of m * n occurs when n=m (maxima) and you'd like to cry yourself to sleep while holding a teddy bear.

But I digress...

Lets talk about objects and why optimization may be necessary =)
//Just the t.Substring(j-1,1) from above.
  IL_009a:  ldarg.1
  IL_009b:  ldloc.s    j
  IL_009d:  ldc.i4.1
  IL_009e:  sub
  IL_009f:  ldc.i4.1
  IL_00a0:  callvirt   instance string [mscorlib]System.String::Substring(int32, int32)

So what is happening here? We push an object onto the stack, in this case t; then we push j, push 1, sub does two pops and 1 push, and yet another constant pushed to the stack. This leaves us with a stack that kinda looks like "t | (j -1) | 1" and we call Substring which goes off and runs lots more logic and loops for doing its thing. Sure seems like a lot, doesn't it? I propose a transform:

t.Substring(j-1,1) => t[j - 1]

Post transform we'll get something like...
//After transform to t[j - 1]
  IL_0097:  ldarg.1
  IL_0098:  ldloc.s    j
  IL_009a:  ldc.i4.1
  IL_009b:  sub
  IL_009c:  callvirt   instance char [mscorlib]System.String::get_Chars(int32)

So this already looks better. As you can see we've reduced the total number of instructions by 1 (there is no second ldc.i4 instruction). Also we're using the get_Chars(int32) method instead of the Substring. As you can guess the internal workings of this method has less logic/looping/work to do than the Substring technique - therefore leading to a performance gain. Yet, there is more occurring here than just this, the return types are different too: strings vs. chars.

So here is yet another kicker for the equation, notice how we are handling the == operation on. In the original the return types are both Strings, this yields the following instruction for the final ternary operation:
//Original: cost = [string] == [string] ? 0 : 1;
  IL_00b0:  call       bool [mscorlib]System.String::op_Equality(string, string)
  IL_00b5:  brtrue.s   IL_00ba
  IL_00b7:  ldc.i4.1
  IL_00b8:  br.s       IL_00bb
  IL_00ba:  ldc.i4.0
  IL_00bb:  stloc.3

Compared to:
//Post Transform: cost = [char] == [char] ? 0 : 1;
  IL_00ab:  beq.s      IL_00b0
  IL_00ad:  ldc.i4.1
  IL_00ae:  br.s       IL_00b1
  IL_00b0:  ldc.i4.0
  IL_00b1:  stloc.3

Here we see the call to [mscorlib]System.String::op_Equality(string, string) which of course pushes some return stuff onto the stack and then executes -more- code. Compared to our transformed code which simply executes a branch if equal (short) on the two char's sitting on the stack and then we set our cost. This is of course another case where a little saving has occurred.

So in personal experience this code change has shown me a huge time saving at a critical point on a project in development. It was fun to analyze and most importantly practical and profitable. This combined with a number of other stricter constraints took a project from an estimated constant run time of 30 days on our systems to an estimated runtime of 18 hours!

I am curious if anyone has any insight on the System.Math.Min methods which are called and how they could be sped up?


  1. Dunno how much inlining C# does, but it couldn't hurt to replace the Min calls with the logical operator:

    // Step 6
    int a = d[i - 1, j] + 1;
    int b = d[i, j - 1] + 1;
    int c = d[i - 1, j - 1] + cost;
    a = (a < b) ? a : b;
    d[i, j] = (a<c) ? a : c;

  2. That's what I was thinking too Abe. I'm curious how it'd clock out.

  3. Actually, I am so curious, I'm going to try it and I'll have an article up about the results. I have an article in the pipeline about creating stop watch objects using the real time timers (applicable to benching these little one liner changes).

  4. So I opened mscorlib.dll in the disassembler... and here is it's output:

    .maxstack 8
    IL_0000: ldarg.0
    IL_0001: ldarg.1
    IL_0002: ble.s IL_0006
    IL_0004: ldarg.1
    IL_0005: ret
    IL_0006: ldarg.0
    IL_0007: ret

    This is on code which has been "optimized", probably doesn't inline it because of the callvirt (polymorphic) part. But since it's a special scenario I'll skip the virtual call and just inline myself - Nice idea Abe!