Finding similarity percentage between Strings

In text mining, a common task is measuring how similar two words are, expressed as a percentage, for clustering or other purposes. I am not very familiar with text mining, but it sounds like an interesting topic and I would like to study it further. In my case, I was working on a piece of code that checks how similar two names are: if the similarity is above some threshold X (e.g. 80%), the names are considered almost identical; otherwise the user is prompted.

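I did a bit of Googling and found a bunch of useful material, as well as some sample Java code on Stack Overflow, which I have copied here with slight modifications. It computes the Levenshtein edit distance between the two strings and converts it into a percentage of the longer string's length: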

public class FindSimilarityPercentage
{
    public static void main(String[] args)
    {
        System.out.println("Similarity between Hello and Yellow is " + similarity("Hello","Yellow"));
    }
    public static int similarity(String s1, String s2)
    {
        String longer = s1, shorter = s2;
        if (s1.length() < s2.length())
        { // longer should always have greater length
            longer = s2; shorter = s1;
        }
        int longerLength = longer.length();
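        if (longerLength == 0)
        {
            return 100; /* both strings are zero length, so they are identical */
        }
        // Share of the longer string that needs no editing, as a whole percentage (truncated)
        double dValue = (longerLength - editDistance(longer, shorter)) / (double) longerLength;
        return (int) (dValue * 100);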
    }
    // Case-insensitive Levenshtein distance, computed with a single rolling row of costs
    public static int editDistance(String s1, String s2)
    {
        s1 = s1.toLowerCase();
        s2 = s2.toLowerCase();
        int[] costs = new int[s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++)
        {
            int lastValue = i;
            for (int j = 0; j <= s2.length(); j++)
            {
                if (i == 0)
                {
                    costs[j] = j;
                }
                else
                {
                    if (j > 0)
                    {
                        int newValue = costs[j - 1];
                        if (s1.charAt(i - 1) != s2.charAt(j - 1))
                        {
                            newValue = Math.min(Math.min(newValue, lastValue),costs[j]) + 1;
                        }
                        costs[j - 1] = lastValue;
                        lastValue = newValue;
                    }
                }
            }
            if (i > 0)
            {
                costs[s2.length()] = lastValue;
            }
        }
        return costs[s2.length()];
    }
}
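
To tie this back to the name check I described at the beginning, here is a minimal sketch of how the percentage could be compared against a threshold. It is my own illustration rather than the Stack Overflow code: the class name NameSimilarityCheck, the constant THRESHOLD_PERCENT and the method isAlmostIdentical are hypothetical names, the 80 is just the example value from above, and the edit distance is a plain two-row Levenshtein implementation so that the block runs on its own.

```java
import java.util.Locale;

class NameSimilarityCheck {
    // Assumed threshold: the 80% example value mentioned in the text
    static final int THRESHOLD_PERCENT = 80;

    // Case-insensitive Levenshtein distance using two rolling rows
    static int editDistance(String a, String b) {
        a = a.toLowerCase(Locale.ROOT);
        b = b.toLowerCase(Locale.ROOT);
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty prefix takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity as a whole percentage of the longer string's length (truncated)
    static int similarityPercent(String a, String b) {
        int longerLength = Math.max(a.length(), b.length());
        if (longerLength == 0) {
            return 100; // two empty strings are identical
        }
        return (longerLength - editDistance(a, b)) * 100 / longerLength;
    }

    // True when the two names are similar enough to be treated as the same
    static boolean isAlmostIdentical(String a, String b) {
        return similarityPercent(a, b) >= THRESHOLD_PERCENT;
    }

    public static void main(String[] args) {
        String[][] pairs = { {"Hello", "Yellow"}, {"Jonathan", "Jonathon"} };
        for (String[] pair : pairs) {
            int percent = similarityPercent(pair[0], pair[1]);
            String verdict = isAlmostIdentical(pair[0], pair[1])
                    ? "almost identical"
                    : "different enough to prompt the user";
            System.out.println(pair[0] + " vs " + pair[1] + ": " + percent + "% (" + verdict + ")");
        }
    }
}
```

For "Hello" and "Yellow" the edit distance is 2 (replace "h" with "y", then add "w") and the longer string has 6 characters, so the similarity is (6 - 2) / 6, which truncates to 66% and falls below the threshold. "Jonathan" and "Jonathon" differ by a single letter out of 8, giving 87%, so they would be treated as almost identical.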

For further reading on text mining, please refer to the following links:

http://jtmt.sourceforge.net/

http://en.wikipedia.org/wiki/Text_mining

https://www.coursera.org/course/textanalytics

http://searchbusinessanalytics.techtarget.com/definition/text-mining

http://www.kdnuggets.com/software/text.html

http://people.ischool.berkeley.edu/~hearst/text-mining.html

Send your ideas and information to kasra@madadipouya.com
