Computer Association of SIUE - Forums

CAOS Forums => Questions and Answers => Topic started by: Bryan on 2009-06-04T08:59:14-05:00 (Thursday)

Title: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: Bryan on 2009-06-04T08:59:14-05:00 (Thursday)
A little bit of back story.  We have both old and new data in our databases that is wrong.  There are many reasons it's wrong, but it almost always ties back to user entry errors.  For example one of our show hosts is Ted Allen (if you've ever watched Iron Chef America, you know him).  Somewhere along the lines someone was looking at a data entry screen and type in Ted Allan and went with it instead of using the suggested entry (if we had that feature when this was entered).  Either way now we have multiple entries for the same person.  This has downstream effects as you can imagine.  What I'm looking for is data cleaning and de-duping techniques that can be implemented without a third party application.

Anyone have any advice?  I've done some googling, but without any specific patten (such as with cleaning phone numbers) it's making things difficult.  There will also have to be a great deal of human involvement, which I don't mind.  I mean, two John Smiths is way too common.
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: raptor on 2009-06-04T09:36:48-05:00 (Thursday)
Brush up on your SQL, (did something similar recently) I'm not sure any existing software would recognize two different John Smiths.

Scott
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: Bryan on 2009-06-04T09:38:47-05:00 (Thursday)
I don't see any reason to brush up on something I'm buried in all day.  This is less about the actual SQL and more about probable pattern matching techniques.  I can run a query to return everyone who's last names are the same, but that's not the problem.  I want a query that will return me probably dupes, like Ted Allen and Ted Allan.
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: Bryan on 2009-06-04T10:03:06-05:00 (Thursday)
I'm probably going to go with something along the lines of trigram matching as described here http://www.python.org/~jeremy/pubs/thesis/node6.html
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: William Grim on 2009-06-04T10:58:44-05:00 (Thursday)
Is the problem such that you can't use SQL's "sounds like" algorithm?

If you have to do your own thing, this sounds a lot like LCS with points for inexact matches being weighted slightly less than exact matches by how different their sounds or distance between keyboard keys are.  This can give you a statistical difference, which should cut down immensely on your human work.  There is a word for this version of the LCS, but I don't have my bioinformatics text with me to help me remember.

My variation on the LCS will also, in my opinion, be better than the trigrams algorithm if your strings are already properly separated into their own table columns (i.e. first name, last name, etc.) but may require a bit of extra work doing cross-joins on the columns if they're sometimes put into the table in the wrong order.  However, if you have lots of entries where people are putting multiple parts of the name into a single field, you could use the trigrams to find candidates for my LCS algorithm, tokenize the strings, and apply my LCS algorithm to find the statistical differences.

Hope that provides something useful for you.
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: Bryan on 2009-06-04T11:34:00-05:00 (Thursday)
Gonna throw this up there for those that are left asking "what's LCS?"

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem


I'm also considering trying some stuff I found on the Soundex function.  I mean when you say "Allen" and "Allan" they sound very similar.
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: Bryan on 2009-06-04T11:47:11-05:00 (Thursday)
and Mike, totally love the Avatar.  I have 1 disc left of 3.0.
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: William Grim on 2009-06-04T17:23:35-05:00 (Thursday)
Quote from: Bryan on 2009-06-04T11:47:11-05:00 (Thursday)
and Mike, totally love the Avatar.  I have 1 disc left of 3.0.

Ha ha; thanks!  Man, you're that far back?!  I thought you were already finished with them.  Make sure you see Caprica as well :-D
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: Bryan on 2009-06-05T05:41:14-05:00 (Friday)
I would be done with them except a friend of mine is insisting on rewatching the series with me.  She's the one that got me into it in the first place, so I humor her.  It's worth it.
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: Bryan on 2009-06-05T07:45:07-05:00 (Friday)
A very interesting overview and look at some methods of cleaning and string identification.

http://www.google.com/url?sa=t&source=web&ct=res&cd=4&url=http%3A%2F%2Fwww.andreiko.com%2Fdesignpatterns%2FText%2520Search%2520and%2520Fuzzy%2520Matching.ppt&ei=0xEpSvuVFJO0MYTssOMJ&usg=AFQjCNFVk8x949jmeKzpJ5uttWfWN8r_IQ&sig2=GbfQLJ0XE8ipitueAc9h_w (http://www.google.com/url?sa=t&source=web&ct=res&cd=4&url=http%3A%2F%2Fwww.andreiko.com%2Fdesignpatterns%2FText%2520Search%2520and%2520Fuzzy%2520Matching.ppt&ei=0xEpSvuVFJO0MYTssOMJ&usg=AFQjCNFVk8x949jmeKzpJ5uttWfWN8r_IQ&sig2=GbfQLJ0XE8ipitueAc9h_w)
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: blacklee on 2009-06-05T08:22:06-05:00 (Friday)
Mike, it was nice seeing you on campus again. (sorry for the off-topic)
Title: Re: Dating Cleaning/De-Duping Identification Techniques (Help)
Post by: William Grim on 2009-06-15T00:51:39-05:00 (Monday)
Ella, it was very good to see you too.  Sorry I didn't get to stop by the first floor lab, but Cat and I had a lot of things to see.  I'll be back in the next 6 mos. though, and maybe I'll run into you :-D