Invisible False Positives in Entity Resolution Systems
By Jeff Jonas, published August 5, 2019
[I’ve attempted to make this subtle entity resolution accuracy issue understandable to the average reader. Not easy.]
The phrase “false positive” basically means you are sure, but nonetheless wrong. Think arresting the wrong person. Whoops!
An “invisible false positive” is an error that can only be detected in the presence of additional information. No matter how closely an invisible false positive is inspected, the error is undetectable until that additional information is considered.
Such errors surface only later, when new information arrives. You have likely experienced this firsthand. If you have ever worked on a jigsaw puzzle, you may have connected two pieces together with great confidence, only to discover later that you connected them in error.
When invisible false positives occur in entity resolution systems, the result can be very bad news, e.g., denying a loan to a credit-worthy person or targeting the wrong person for police questioning.
Such false positives are bummers for two reasons:
- An innocent person is harmed or hindered
- False positives can overwhelm and waste resources, e.g., analysts spend time on dead-end leads
While somewhat rare in big data, invisible false positives can affect thousands of people, if not tens of thousands.
Let’s look at a realistic example:
Are these two observations about the same entity?
Yes, quite likely since they have the same name, address and phone.
One year later, Observation 3 arrives. Notice that Observation 3 contains the same name as Observations 1 and 2, a very different date of birth than Observation 1, and the same ID as Observation 2.
Given the presence of Observation 3, it becomes clear that Observation 2 is not the same person as described in Observation 1. Most likely a Junior/Senior mismatch.
Observation 2 is now exposed. It has been lurking there all along as an invisible false positive!
Just like when working a jigsaw puzzle at home, upon such a discovery, you fix it right then and there. In this case, Observation 2 conjoins with Observation 3, becoming Entity 2, because they share the same name and ID.
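For readers who think in code, here is a minimal toy sketch of that fix. It is not how any production entity resolution engine works, only an illustration under stated assumptions. The name, address, phone, dates of birth and ID below are made up, and the scoring rule (same name and ID beats same name, address and phone; differing dates of birth or IDs are a conflict) is a simplification chosen for this example. When a new observation arrives, only the entities it touches are re-evaluated, with the strongest evidence applied first.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Set


@dataclass(frozen=True)
class Observation:
    key: str
    name: str
    address: Optional[str] = None
    phone: Optional[str] = None
    dob: Optional[str] = None
    id_number: Optional[str] = None


def conflicts(a: Observation, b: Observation) -> bool:
    """True when both records carry a DOB (or an ID) and the values differ."""
    for field in ("dob", "id_number"):
        x, y = getattr(a, field), getattr(b, field)
        if x is not None and y is not None and x != y:
            return True
    return False


def score(a: Observation, b: Observation) -> int:
    """Toy rule: 0 = no match, 1 = name + address + phone, 2 = name + ID."""
    if a.name != b.name or conflicts(a, b):
        return 0
    if a.id_number is not None and a.id_number == b.id_number:
        return 2
    if None not in (a.address, a.phone) and (a.address, a.phone) == (b.address, b.phone):
        return 1
    return 0


def resolve(pool: List[Observation]) -> List[Set[Observation]]:
    """Cluster a small pool, merging on the strongest evidence first."""
    clusters = {o.key: {o} for o in pool}
    pairs = sorted(combinations(pool, 2), key=lambda p: score(*p), reverse=True)
    for a, b in pairs:
        if score(a, b) == 0:
            break  # remaining pairs do not match at all
        ca, cb = clusters[a.key], clusters[b.key]
        if ca is cb or any(conflicts(x, y) for x in ca for y in cb):
            continue  # already together, or the merge would contradict itself
        merged = ca | cb
        for o in merged:
            clusters[o.key] = merged
    return [set(c) for c in {frozenset(c) for c in clusters.values()}]


def add_observation(entities: List[Set[Observation]], new: Observation) -> List[Set[Observation]]:
    """Re-evaluate only the entities that the new observation matches."""
    touched = [e for e in entities if any(score(new, o) > 0 for o in e)]
    untouched = [e for e in entities if e not in touched]
    pool = [o for e in touched for o in e] + [new]
    return untouched + resolve(pool)


def keys(entities: List[Set[Observation]]) -> List[List[str]]:
    return sorted(sorted(o.key for o in e) for e in entities)


# Hypothetical data: every value below is invented for illustration.
obs1 = Observation("obs1", "Robert Smith", "123 Main St", "555-0100", dob="1995-03-14")
obs2 = Observation("obs2", "Robert Smith", "123 Main St", "555-0100", id_number="ID-1001")
obs3 = Observation("obs3", "Robert Smith", dob="1961-07-02", id_number="ID-1001")

entities: List[Set[Observation]] = []
for obs in (obs1, obs2):
    entities = add_observation(entities, obs)
print(keys(entities))  # [['obs1', 'obs2']]  <- the invisible false positive

entities = add_observation(entities, obs3)
print(keys(entities))  # [['obs1'], ['obs2', 'obs3']]  <- fixed on arrival
```

The design choice worth noticing is that the arrival of the third record reopens a decision that was already made. A system that only ever appends new records to existing entities would have no way to pull Observation 2 back out.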
What’s the big deal? Well, what if Observation 1 (the son) presents a perfect credit history and Observation 2 (the father) indicates a horrible credit history? What a flub it would be if you denied the son a loan because of his father’s poor credit.
If you can’t fix such false positives in real time as new data arrives, accuracy drifts away from the truth. In traditional batch systems, this is typically remedied by reloading all the data, which is unfortunate because poor decisions are being made between periodic reloads.
The hard part: finding and fixing such invisible false positives at thousands of records per second over billions of records is a non-trivial form of real-time machine learning. In fact, it is so hard that we’ve spent 10+ years slowly but surely chipping away at this challenge, which plagues virtually all entity resolution algorithms.
When we say Smarter Entity Resolution™ — we mean it. Check out our Uniquely Senzing White Paper.