So we have a list of people, and we've compared this list of people to another list of people using the Levenshtein Algorithm. Then we store the results of that comparison in the database so some users can check it out later, a time/memory trade off which has actually been fairly useful (especially with the speed increases we've given it over the years).
Now when a user loads this list of comparisons, we track what the current "match_score" is. This is the decimal percentage difference between the two people's information. Weighted to our liking. So our application gets the match_score:
SELECT TOP 1 * FROM People_Comparisons P WHERE ID = 13689385
Return: 13689385 | 0.214736842105263From the perspective of the application, and the database, ID is effectively random. We sort the results of comparisons by the match_score then by the ID, so that when doing reviews we're looking at each ID (x) in the subset of records with the same match_score (y). But the point is, the ID is sorted only within each match_score across the domain of Y values. To get the next record we run a query with the currenct ID and and match_score, N(x,y), which uses the following SQL to return the next record of interest:
SELECT TOP 1 * FROM People_Comparisons P WHERE ID= 13689385 AND MATCH_SCORE > {0}In the above {0} is replaced with the previously retrieved match_score, in this case the value:
0.214736842105263Now some experienced database type individuals probably already see the problem. But to less experienced individuals, this can be quite the snake-bite. Because when we run the SQL we have a somewhat unexpected result:
SELECT TOP 1 * FROM People_Comparisons P WHERE ID= 13689385 AND MATCH_SCORE > 0.214736842105263
Return: 13689385 | 0.214736842105263Obviously there is a mistake, we've used the same value (0.21473...) and it seems impossible that this SQL should return the same row when we're expecting a row with a match_score greater than that value! Enter Floating-Point Precision.
The quick rundown of it is this: What the database reports is a truncated version of the raw stored value. We have 15 significant figures displayed, but the database is storing much more data than this. This truncation gives us a raw binary value which is different than that which is stored - and in this case, less than what is stored. Thus, the next greater match_score, after truncation/rounding, is the same match_score! To solve this I select the value of the match score into a SQL variable then us this value to maintain the full precision of the data type and reuse this variable for comparisons.
No comments:
Post a Comment