Thursday, October 21, 2010

Accessing an MS SQL database in R

I'm going to preface this post with a disclaimer: I'm not really a programmer - I'm actually an ecologist. That being said, I am slowly learning R in the course of my work. R is a programming language and a software environment for doing statistics and making pictures with data. A lot of programmers have never heard of R, and those who come to R after learning other languages may complain bitterly about it. Since R is the first language I've learned, I can't really tell you how it's different from other languages, but I can tell you that I struggled quite a bit with the syntax, because there are many seemingly equivalent ways to do the same thing - and the very subtle differences between them will kill your code.

With all this talk of death, you're probably as frightened of R as the average ecologist, but there are very good reasons to use R (even the NY Times thinks so!). If you're doing data analysis, you can do it with R and you can probably do it better. Plus, it's free.

To get you familiar with R and some of its capabilities, I'm going to walk you through one of my current projects over a few posts. My lab group has data from several remote weather stations stored in a Microsoft SQL Database. Temperature and precipitation have been measured hourly for about eight years. For a variety of reasons, we have quite a few missing measurements. I need to create a model that (reasonably) predicts those missing values.

The first step is to actually get to the data. Those of you familiar with databases probably wouldn't have had to spend an entire afternoon trying to do that. Those of you not familiar with databases hopefully won't spend all afternoon trying to do that after reading this post. A quick note: I use Windows at work. If you're using something else, this may not be as helpful as I'd hope.

Create Data Source Name
We're going to use ODBC to get R to communicate with the database. Before today, I didn't have a good handle on ODBC. Wikipedia was totally there for me:
Open Database Connectivity (ODBC) provides a standard software interface for accessing database management systems (DBMS). The designers of ODBC aimed to make it independent of programming languages, database systems, and operating systems. Thus, any application can use ODBC to query data from a database, regardless of the platform it is on or DBMS it uses. ODBC accomplishes this by using a driver as a translation layer between the application and the DBMS. The application thus only needs to know ODBC syntax, and the driver can then pass the query to the DBMS in its native format, returning the data in a format the application can understand.
Cool, right? So to get this all working, the first thing you need to do is create a Database Source Name for your database. I didn't really know what that was either, but of course Wikipedia did:

A DSN specifies a data structure that contains the information about a specific data source (database or other data source) that an Open Database Connectivity (ODBC) driver needs in order to connect to it.
Luckily, this is really easy to do and Microsoft will hold your hand the whole way:

Create a System DSN in Windows XP

  1. Click Start, point to Control Panel, double-click Administrative Tools, and then double-click Data Sources (ODBC).
  2. Click the System DSN tab, and then click Add.
  3. Click the database driver that corresponds with the database type to which you are connecting, and then click Finish.
  4. Type the data source name. Make sure that you choose a name that you can remember. You will need to use this name later.
  5. Click Select.
  6. Click the correct database, and then click OK.
  7. Click OK, and then click OK.

Install RODBC package
The next step is even easier. Assuming you already have R installed, you need to install and load the RODBC package. There are a lot of ways to install packages in R, but I installed and loaded the package like so:

install.packages('RODBC') #install
library('RODBC') #load
Anything after a # is a comment in R.

Connect to the database
And we're ready to connect! All I need to do now is give the odbcConnect function the name of the DSN I created and my username and password. I'll call this connection con. You can call it anything you like, as long as it doesn't have spaces or start with a number.
con <- odbcConnect('DSN', uid='username', pwd='password')
The quotes are important. If you don't use them, R thinks you're referring to an object.

Now that I have access to the database through con, I want to know a little bit about the database. odbcGetInfo(con) will tell you some basic information about the database like what database management system you're using, what version you're using, and the name. That's not especially helpful since you should already know these things.

The sqlTables function will return all table-like objects. I just want to see what the actual tables are, so I pass it an additional argument, like so:
sqlTables(con, tableType='TABLE')
The parameter tableType is case sensitive, but the argument TABLE isn't. That won't always be the case with an R function. Figuring out errors caused by using the wrong case is a great time.

If I want to know the column names and data types in a particular table I can use the function sqlColumns.
sqlColumns(con, sqtable='hrlyweather')
I could have returned the same output with
sqlColumns(con, 'hrlyweather')
R functions will usually let you get away with leaving off the parameter name (sqtable in this case) as long as you pass the function arguments in the order it expects, but it's a good habit to use the parameter names. It'll make it easier to figure out problems and make your code much more readable.

You can learn more about what you can do with the RODBC package in this handy document (pdf). In my next post I'll query the database through the connection I've created and do some basic data manipulation in preparation for running a model.
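As a taste of what's coming, here's a sketch of what an actual query looks like through the connection. The table name is the hrlyweather table from above; the TOP 10 query itself is just an illustration - adjust it for your own database:

```r
# sqlQuery sends an SQL string to the database and returns a data frame
weather <- sqlQuery(con, 'SELECT TOP 10 * FROM hrlyweather')
head(weather)

# when you're finished, close the connection
odbcClose(con)
```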

Thursday, October 14, 2010

Optimizing the Levenshtein Algorithm in C#

For fun and profit. So maybe that phrase is used too much, but it is very applicable here. Optimization of any program can be fun and, more importantly, profitable. There are just some algorithms which take a while to run, and the Levenshtein distance is one of them. It's not terribly complex, and its typical running time is O(n * m), where n and m are the lengths of the two strings. A brief description: it's closely related to the Hamming distance and works in a very similar manner, but where Hamming only counts substitutions between equal-length strings, Levenshtein also allows insertions and deletions. For example, the distance between "kitten" and "sitting" is 3 (two substitutions and one insertion). We generally concern ourselves with ASCII or Unicode as our alphabets because it's useful.

For small amounts of data it's really not that bad; a quadratic run time is pretty damned good, in fact - I mean, at least it's not exponential!

So let's paste some code here to look at:

static public int LevenshteinDistance(string s, string t)
{
    int n = s.Length; // length of s
    int m = t.Length; // length of t
    int[,] d = new int[n + 1, m + 1]; // distance matrix
    int cost; // cost

    // Step 1: an empty string is all insertions
    if (n == 0) return m;
    if (m == 0) return n;

    // Step 2: initialize the first column and row
    for (int i = 0; i <= n; d[i, 0] = i++) ;
    for (int j = 0; j <= m; d[0, j] = j++) ;

    // Step 3
    for (int i = 1; i <= n; i++)
    {
        // Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1); // *We'll be looking at this line*

            // Step 6
            d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
        }
    }

    // Step 7
    return d[n, m];
}
Originally authored by: Lasse Johansen

So most of this is pretty straightforward, and we can just skip to the meat of the implementation, Steps 5 & 6. Here we see some pretty simple stuff: get two characters and compare them. If they match, cool, don't increase the cost for this iteration. Add that cost to the corresponding table cell and go to the next one. Rinse and repeat the length of string s (n) times the length of string t (m).

So everyone can see this is the part that runs the most and consumes the most time and resources, thus we will focus here. Some of you may be having calculus flashbacks and are currently being punched in the face with the realization that, for strings of a fixed combined length, the product m * n is maximized when n = m (a maximum), and you'd like to cry yourself to sleep while holding a teddy bear.

But I digress...

Let's talk about objects and why optimization may be necessary =)
//Just the t.Substring(j-1,1) from above.
  IL_009a:  ldarg.1
  IL_009b:  ldloc.s    j
  IL_009d:  ldc.i4.1
  IL_009e:  sub
  IL_009f:  ldc.i4.1
  IL_00a0:  callvirt   instance string [mscorlib]System.String::Substring(int32, int32)

So what is happening here? We push an object onto the stack, in this case t; then we push j, push 1, sub does two pops and 1 push, and yet another constant pushed to the stack. This leaves us with a stack that kinda looks like "t | (j -1) | 1" and we call Substring which goes off and runs lots more logic and loops for doing its thing. Sure seems like a lot, doesn't it? I propose a transform:

t.Substring(j-1,1) => t[j - 1]

Post transform we'll get something like...
//After transform to t[j - 1]
  IL_0097:  ldarg.1
  IL_0098:  ldloc.s    j
  IL_009a:  ldc.i4.1
  IL_009b:  sub
  IL_009c:  callvirt   instance char [mscorlib]System.String::get_Chars(int32)

So this already looks better. As you can see, we've reduced the total number of instructions by one (there is no second ldc.i4 instruction). We're also calling the get_Chars(int32) method (the string indexer) instead of Substring. As you can guess, its internal workings have less logic/looping/work to do than the Substring technique - therefore leading to a performance gain. Yet there is more occurring here than just this: the return types are different too, strings vs. chars.

So here is yet another kicker for the equation: notice what the == operation is operating on. In the original, both operands are strings, which yields the following instructions for the final ternary operation:
//Original: cost = [string] == [string] ? 0 : 1;
  IL_00b0:  call       bool [mscorlib]System.String::op_Equality(string, string)
  IL_00b5:  brtrue.s   IL_00ba
  IL_00b7:  ldc.i4.1
  IL_00b8:  br.s       IL_00bb
  IL_00ba:  ldc.i4.0
  IL_00bb:  stloc.3

Compared to:
//Post Transform: cost = [char] == [char] ? 0 : 1;
  IL_00ab:  beq.s      IL_00b0
  IL_00ad:  ldc.i4.1
  IL_00ae:  br.s       IL_00b1
  IL_00b0:  ldc.i4.0
  IL_00b1:  stloc.3

Here we see the call to [mscorlib]System.String::op_Equality(string, string), which of course pushes its return value onto the stack and then executes -more- code. Compare that to our transformed code, which simply executes a branch-if-equal (short) on the two chars sitting on the stack and then sets our cost. This is of course another case where a little saving has occurred.
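In source terms, the whole optimization is just Step 5 rewritten; here's the inner-loop fragment as it looks after the transform (same variables as in the listing above):

```csharp
// Step 5, post-transform: index into the strings directly.
// No temporary one-character strings are allocated, and the
// comparison compiles down to a simple beq.s on two chars.
cost = (t[j - 1] == s[i - 1] ? 0 : 1);

// Step 6 is unchanged.
d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
```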

So in my personal experience, this code change showed me a huge time saving at a critical point on a project in development. It was fun to analyze and, most importantly, practical and profitable. This, combined with a number of other stricter constraints, took a project from an estimated run time of 30 days on our systems to an estimated run time of 18 hours!

I am curious if anyone has any insight into the System.Math.Min methods that are called, and how they could be sped up?