Ambiguous Conditions in Entity Resolution Systems
Most entity resolution systems don’t handle ambiguous records properly. This tricky and subtle condition creates false positives that are difficult to find.
In entity resolution, we use the term “ambiguous” to mean “multiple good answers.”
The great American boxer George Foreman named all five of his boys George. Imagine having to perform entity resolution on a record containing only his name, home address, and home phone — nothing else. In a typical household, a record containing a name, home address and phone would likely be unique to a single person. In the case of George Foreman, such a record could be any one of six people.
Look at this simple example:
Most entity resolution algorithms will arbitrarily resolve Record 3 into either Record 1 (the senior, born in 1970) or Record 2 (the junior, born in 1990). For example, imagine this outcome:
Even on human inspection, this match looks good, doesn’t it? That is the tricky thing about ambiguous records like Record 3: they can create invisible false positives. They are invisible in that you cannot see the false positive until you become aware of Record 2 (the junior).
Once Record 2 (the junior) exists, Record 3 could be either Record 1 (the senior) or Record 2 (the junior), and nothing in Record 3 favors one over the other.
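To make the condition concrete, here is a minimal sketch of how a matcher could flag an ambiguous record instead of picking a winner arbitrarily. It is not how Senzing or any particular engine works. The field values, the Record class, the agreement and resolve functions, and the min_score threshold are all assumptions made up for illustration; only the two birth years come from the example above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    record_id: str
    name: str
    address: str
    phone: str
    birth_year: Optional[int] = None  # None when the record omits it


FIELDS = ("name", "address", "phone", "birth_year")


def agreement(a: Record, b: Record) -> Optional[int]:
    """Count fields present on both records that agree.

    Returns None if any field present on both records conflicts.
    """
    score = 0
    for field in FIELDS:
        x, y = getattr(a, field), getattr(b, field)
        if x is None or y is None:
            continue  # a missing value is neither agreement nor conflict
        if x != y:
            return None
        score += 1
    return score


def resolve(incoming: Record, known: List[Record],
            min_score: int = 3) -> Tuple[str, List[str]]:
    """Return a decision plus the record IDs that support it."""
    candidates = []
    for k in known:
        score = agreement(incoming, k)
        if score is not None and score >= min_score:
            candidates.append((score, k))
    if not candidates:
        return ("new_entity", [])
    best = max(score for score, _ in candidates)
    top = [k.record_id for score, k in candidates if score == best]
    if len(top) == 1:
        return ("resolved", top)
    # Several equally good answers: report them all, do not pick one.
    return ("ambiguous", top)


# Hypothetical values; only the birth years come from the article.
record_1 = Record("1", "George Foreman", "123 Main St", "555-0100", 1970)
record_2 = Record("2", "George Foreman", "123 Main St", "555-0100", 1990)
record_3 = Record("3", "George Foreman", "123 Main St", "555-0100")

print(resolve(record_3, [record_1]))
print(resolve(record_3, [record_1, record_2]))
```

With only Record 1 known, the sketch resolves Record 3 to it, which is the match that looks good on inspection. Once Record 2 is also known, the same call returns "ambiguous" with both candidates, so a downstream system can keep Record 3 separate and relate it to both rather than merging it into one.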
Handling ambiguous records properly is very important, especially in systems that can impinge on someone’s freedom or opportunity, e.g., a government watch list or a background check system. Imagine if Record 3 represented derogatory information, e.g., “terrorist” or “criminal record.” Arbitrarily matching this derogatory data to the junior or the senior would result in a 50/50 chance of adversely impacting the wrong person.
If you want to see how your entity resolution engine handles this ambiguous condition compared to Senzing, check out these three records and more in our Synthetic Truth Set.