Mostly Ramblings...: Approximate string matching using Longest Common Substrings

Approximate string matching is the process of determining that two strings, while not identical, share enough characteristics that the likelihood of them referring to each other is high. For example, in my haste to write this post, I typed “likelyhood” which the spell checker quickly identified as not being a word in the dictionary but probably referred to the word “likelihood.” (Thank you Mr. Spellchecker!) Puzzle 3nail puzzle - undone Objects / Tools

There are many algorithms that can be used to perform approximate string matching. Levenshtein Distance being an example of one of them. While reviewing some old projects, I came across the implementation of another: Longest Common Substrings.

The method shown below finds and remove the longest common substring from the two strings being compared. It continues until the length falls below a specified threshold. A measure of equality is determined by taking the sum of the characters removed and dividing it by the length of the smaller string. You end up with a value from 0.0 (completely different or smaller than the threshold) to 1.0 (equal).

Here’s the code.

using System;
using System.Text;
 
public partial class StringFunctions
{
    public static Double ApproximateStringMatchingUsingLCS(string str1, string str2, int requiredMinLength)
    {
 
        if (string.IsNullOrEmpty(str1) || string.IsNullOrEmpty(str2)) return 0;
 
        if (str1 == str2) return 1.0;
 
        int totalMatchedCharacters = 0;
        string foundSubstring;
        string targetString1 = str1.ToLowerInvariant();
        string targetString2 = str2.ToLowerInvariant();
        while((foundSubstring = LongestCommonSubstring(targetString1, targetString2)).Length >= requiredMinLength)
        {
            totalMatchedCharacters += foundSubstring.Length;
 
            targetString1 = RemoveFirstOccurence(targetString1, foundSubstring);
            targetString2 = RemoveFirstOccurence(targetString2, foundSubstring);
        }
 
        return (1.0*totalMatchedCharacters/Math.Min(str1.Length, str2.Length));
    }
 
    /// <summary>
    /// Removes the first occurance of a substring
    /// </summary>
    /// <param name="targetString">The string to search</param>
    /// <param name="stringToRemove">The string to search for</param>
    /// <returns>The string to search with the first occurance of the substring removed.</returns>
    public static string RemoveFirstOccurence(string targetString, string stringToRemove)
    {
        if (String.IsNullOrEmpty(targetString))
        {
            throw new ArgumentNullException("targetString", "A non-null value must be supplied for 'targetString'");
        }
 
        if (String.IsNullOrEmpty(stringToRemove))
        {
            throw new ArgumentNullException("stringToRemove", "A non-null value must be supplied for 'stringToRemove'");
        }
 
        int start = targetString.IndexOf(stringToRemove);
        return -1 == start ? targetString : targetString.Remove(start, stringToRemove.Length);
    }
 
    /// <summary>
    /// Finds the longest substring that two strings have in common
    /// </summary>
    /// <param name="str1">The first string to search.</param>
    /// <param name="str2">The second string to search</param>
    /// <returns>The longest substring that the two strings have in common.</returns>
    private static string LongestCommonSubstring(string str1, string str2)
    {
        if (String.IsNullOrEmpty(str1) || String.IsNullOrEmpty(str2))
            return string.Empty;
 
        int[,] num = new int[str1.Length, str2.Length];
        int maxlen = 0;
        int lastSubstringStartingPosition = 0;
        StringBuilder sequenceBuilder = new StringBuilder();
 
        for (int i = 0; i < str1.Length; i++)
        {
            for (int j = 0; j < str2.Length; j++)
            {
                if (str1[i] != str2[j])
                    num[i, j] = 0;
                else
                {
                    if ((i == 0) || (j == 0))
                        num[i, j] = 1;
                    else
                        num[i, j] = 1 + num[i - 1, j - 1];
 
                    if (num[i, j] > maxlen)
                    {
                        maxlen = num[i, j];
                        int thisSubstringStartingPosition = i - num[i, j] + 1;
                        if (lastSubstringStartingPosition == thisSubstringStartingPosition)
                        {
                            //if the current LCS is the same as the last time this block ran
                            sequenceBuilder.Append(str1[i]);
                        }
                        else 
                        {
                            //this block resets the string builder if a different LCS is found
                            lastSubstringStartingPosition = thisSubstringStartingPosition;
                            
                            // reset the string builder
                            sequenceBuilder.Remove(0, sequenceBuilder.Length);
                            sequenceBuilder.Append(str1.Substring(lastSubstringStartingPosition, (i + 1) - lastSubstringStartingPosition));
                        }
                    }
                }
            }
        }
        return sequenceBuilder.ToString();
    }
    
};

If you’re interested, here’s a (rather old) whitepaper that compares the performance of some of the approximate string matching algorithms: Real World Performance of Approximate String Comparators for use in Patient Matching

Mostly Ramblings...

Sunday, December 13, 2009

Approximate string matching using Longest Common Substrings

No comments:

Post a Comment