• Welcome to Computer Association of SIUE - Forums.
 

Dating Cleaning/De-Duping Identification Techniques (Help)

Started by Bryan, 2009-06-04T08:59:14-05:00 (Thursday)

Previous topic - Next topic

Bryan

A little bit of back story.  We have both old and new data in our databases that is wrong.  There are many reasons it's wrong, but it almost always ties back to user entry errors.  For example one of our show hosts is Ted Allen (if you've ever watched Iron Chef America, you know him).  Somewhere along the lines someone was looking at a data entry screen and type in Ted Allan and went with it instead of using the suggested entry (if we had that feature when this was entered).  Either way now we have multiple entries for the same person.  This has downstream effects as you can imagine.  What I'm looking for is data cleaning and de-duping techniques that can be implemented without a third party application.

Anyone have any advice?  I've done some googling, but without any specific patten (such as with cleaning phone numbers) it's making things difficult.  There will also have to be a great deal of human involvement, which I don't mind.  I mean, two John Smiths is way too common.
Bryan Grubaugh
Quickly aging alumni with too much time on his hands
Business Systems Analyst, Scripps Networks.

raptor

Brush up on your SQL, (did something similar recently) I'm not sure any existing software would recognize two different John Smiths.

Scott
President of CAOS
Software Engineer NASA Nspires/Roses Grant

Bryan

I don't see any reason to brush up on something I'm buried in all day.  This is less about the actual SQL and more about probable pattern matching techniques.  I can run a query to return everyone who's last names are the same, but that's not the problem.  I want a query that will return me probably dupes, like Ted Allen and Ted Allan.
Bryan Grubaugh
Quickly aging alumni with too much time on his hands
Business Systems Analyst, Scripps Networks.

Bryan

I'm probably going to go with something along the lines of trigram matching as described here http://www.python.org/~jeremy/pubs/thesis/node6.html
Bryan Grubaugh
Quickly aging alumni with too much time on his hands
Business Systems Analyst, Scripps Networks.

William Grim

Is the problem such that you can't use SQL's "sounds like" algorithm?

If you have to do your own thing, this sounds a lot like LCS with points for inexact matches being weighted slightly less than exact matches by how different their sounds or distance between keyboard keys are.  This can give you a statistical difference, which should cut down immensely on your human work.  There is a word for this version of the LCS, but I don't have my bioinformatics text with me to help me remember.

My variation on the LCS will also, in my opinion, be better than the trigrams algorithm if your strings are already properly separated into their own table columns (i.e. first name, last name, etc.) but may require a bit of extra work doing cross-joins on the columns if they're sometimes put into the table in the wrong order.  However, if you have lots of entries where people are putting multiple parts of the name into a single field, you could use the trigrams to find candidates for my LCS algorithm, tokenize the strings, and apply my LCS algorithm to find the statistical differences.

Hope that provides something useful for you.
William Grim
IT Associate, Morgan Stanley

Bryan

Gonna throw this up there for those that are left asking "what's LCS?"

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem


I'm also considering trying some stuff I found on the Soundex function.  I mean when you say "Allen" and "Allan" they sound very similar.
Bryan Grubaugh
Quickly aging alumni with too much time on his hands
Business Systems Analyst, Scripps Networks.

Bryan

and Mike, totally love the Avatar.  I have 1 disc left of 3.0.
Bryan Grubaugh
Quickly aging alumni with too much time on his hands
Business Systems Analyst, Scripps Networks.

William Grim

Quote from: Bryan on 2009-06-04T11:47:11-05:00 (Thursday)
and Mike, totally love the Avatar.  I have 1 disc left of 3.0.

Ha ha; thanks!  Man, you're that far back?!  I thought you were already finished with them.  Make sure you see Caprica as well :-D
William Grim
IT Associate, Morgan Stanley

Bryan

I would be done with them except a friend of mine is insisting on rewatching the series with me.  She's the one that got me into it in the first place, so I humor her.  It's worth it.
Bryan Grubaugh
Quickly aging alumni with too much time on his hands
Business Systems Analyst, Scripps Networks.

Bryan

Bryan Grubaugh
Quickly aging alumni with too much time on his hands
Business Systems Analyst, Scripps Networks.

blacklee

Mike, it was nice seeing you on campus again. (sorry for the off-topic)

William Grim

Ella, it was very good to see you too.  Sorry I didn't get to stop by the first floor lab, but Cat and I had a lot of things to see.  I'll be back in the next 6 mos. though, and maybe I'll run into you :-D
William Grim
IT Associate, Morgan Stanley