Entity Resolution Accuracy: Tips for Optimal Test Results
Testing your entity resolution technology for accuracy is essential. Watch this video as Jeff Jonas provides tips on the best methods for testing entity resolution accuracy. He’ll cover why you should use your own data (versus synthetic data), two methods for testing accuracy, and why accuracy degrades in batch-based systems.
Accuracy is at the core of any effective entity resolution solution. To test accuracy, whether for your own system, other commercial systems, or Senzing® entity resolution, we recommend that you resolve a snapshot of your data (up to 10 million records) and also audit accuracy rates over time. Senzing offers audit tools on our GitHub to help you speed up and simplify your audit.
To learn more about other open source entity resolution tools that Senzing provides, read our blog.
Video Transcript
Timestamps
0:00 Intro
0:25 Assessing Entity Resolution Accuracy
0:50 Entity Resolution Accuracy Test & Entity Resolution Audit
1:16 Testing Entity Resolution Accuracy: Real Data
1:43 Testing Entity Resolution Accuracy: Types of Measurements
1:55 Run an Entity Resolution Audit using Python Tools
2:22 Reduced Accuracy with Batch Entity Resolution
3:35 Test Accuracy with Entity Resolution Proof of Concept (PoC)
I thought I would take a minute and talk about entity resolution accuracy. First of all, it’s kind of like on your mirror in your car, “objects in the mirror are closer than they appear.” Accuracy is particular to you and your data. If a company’s telling you about average accuracy somewhere else, that probably has very little to do with the accuracy you’re going to experience.
0:25 Assessing Entity Resolution Accuracy
So, let’s say you want to assess accuracy. I’m going to start on the basis that you want to assess the accuracy you’re actually going to get. You’re going to want to use real data, not synthetic data. We’ve got some articles on how to create synthetic data if you have to, but you want to try to stay away from that. You’re going to use your data. You’re going to take a vertical slice. Remember to separate a scalability performance test from an accuracy test.
0:50 Entity Resolution Accuracy Test & Entity Resolution Audit
In an accuracy test, when you’re going to be doing an audit and looking at pairs and combinations and clusters of records, you’re going to want to run something under 10 million [records]… Could be hundreds of thousands, could be a million, but probably under 10 million. You’re going to want to do a vertical slice. It might be one state, it might be everybody with the last name [starting with] the letter A. What you’re not going to do is take a random sampling of a bunch of data and then see what your entity resolution [accuracy] is.
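As a rough illustration of the difference between a vertical slice and a random sample, here is a minimal Python sketch. The record layout, the field names (`last_name`, `state`), and both helper functions are hypothetical and are not part of any Senzing tool. A likely reason to prefer the slice is that it keeps records that might match each other in the same test set, while a thin random sample tends to separate them.

```python
import random

def vertical_slice(records, key, predicate):
    """Keep every record whose `key` field satisfies `predicate`."""
    return [r for r in records if predicate(r.get(key, ""))]

def random_sample(records, fraction, seed=0):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

# Hypothetical toy data: records 1 and 2 are probably the same person.
records = [
    {"id": 1, "last_name": "Adams", "state": "NV"},
    {"id": 2, "last_name": "Adams", "state": "NV"},
    {"id": 3, "last_name": "Baker", "state": "CA"},
    {"id": 4, "last_name": "Allen", "state": "CA"},
    {"id": 5, "last_name": "Baker", "state": "CA"},
]

# Vertical slice: everybody whose last name starts with the letter A.
slice_a = vertical_slice(
    records, "last_name", lambda value: value.upper().startswith("A")
)

# Vertical slice: one state.
slice_ca = vertical_slice(records, "state", lambda value: value == "CA")

# A random sample, by contrast, may keep only one record of a matching pair.
sample = random_sample(records, fraction=0.4)
```

Both Adams records land in `slice_a` together, so the entity resolution engine still has the chance to match them during the test.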
1:16 Testing Entity Resolution Accuracy: Real Data
You’re going to use real data. You may already have an audit set where you already know the answers. If you don’t have an audit set, you’re going to want to have human beings actually look at results and decide what are matches, what are possible matches, and what are just relationships – that becomes your audit set. That’s what you’re going to test against. You’ll test Senzing against that, you’ll test your own homegrown [entity resolution] methods against that or other entity resolution tools.
1:43 Testing Entity Resolution Accuracy: Types of Measurements
You’re going to want to test a few things on accuracy. One is you want to do snapshot accuracy. You’re going to take this, I’ll say, one to 10 million records. You’re going to take that vertical slice of real data and you’re going to run it through entity resolution.
1:55 Run an Entity Resolution Audit using Python Tools
Now you’re going to compare that to the other entity resolution methods, or the golden [audit] set, whatever you’re going to compare it against. Some people try to do this on their own.
We actually have an open source piece of code. It’s Apache-2.0 licensed Python. You can just download [open source entity resolution tools] off our GitHub and actually just run the audit. It will tell you what the differences are and really speed up that process. So, you can kind of drill in and say why did this record match or why not and really understand the differences between [entity resolution] engines for accuracy.
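To give a sense of what such an audit compares, here is a minimal Python sketch of a pairwise comparison between an audit set and one engine's output. This is not the Senzing audit tool and does not reflect its interface. The input format (a mapping from record ID to entity ID), the function names, and the sample data are all assumptions made for illustration.

```python
from itertools import combinations

def matched_pairs(assignments):
    """Turn {record_id: entity_id} into the set of record pairs
    that were placed in the same entity."""
    clusters = {}
    for record_id, entity_id in assignments.items():
        clusters.setdefault(entity_id, []).append(record_id)
    pairs = set()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs

def pairwise_audit(truth, candidate):
    """Compare a candidate result against the audit set, pair by pair."""
    truth_pairs = matched_pairs(truth)
    candidate_pairs = matched_pairs(candidate)
    agreed = truth_pairs & candidate_pairs
    return {
        # Share of the candidate's matches that the audit set agrees with.
        "precision": len(agreed) / len(candidate_pairs) if candidate_pairs else 1.0,
        # Share of the audit set's matches that the candidate found.
        "recall": len(agreed) / len(truth_pairs) if truth_pairs else 1.0,
        # Pairs to drill into: why did these not match?
        "missed": sorted(truth_pairs - candidate_pairs),
        # Pairs to drill into: why did these match?
        "extra": sorted(candidate_pairs - truth_pairs),
    }

# Hypothetical audit set and engine output for five records.
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B", "r5": "C"}
candidate = {"r1": "X", "r2": "X", "r3": "Y", "r4": "Y", "r5": "Z"}

report = pairwise_audit(truth, candidate)
```

The `missed` and `extra` lists are the starting point for asking why a record matched or not. The same function can be run once per engine against the same audit set.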
2:22 Reduced Accuracy with Batch Entity Resolution
So, there you go, you’re going to have one aspect of testing accuracy that’s about a snapshot, but the other thing about measuring accuracy is going to be accuracy over time.
[With] batch entity resolution technologies, maybe all the records that accrue over the day or the week get incrementally added end of week or end of day, and maybe once a month you have to reboil the ocean to account for data drift where new records have changed the past. These kinds of systems have an accuracy on day one and then they degrade all month or all quarter depending on how long you go between reloads.
[With] Senzing, whatever accuracy you see at your snapshot is actually maintained. In truth, as you get more data, accuracy really gets better and better.
If you can look at the long haul, with other technologies, you’re going to find that you get your baseline accuracy and then the error rate is going to climb, climb, climb, climb. Then once a quarter, or once a month, it’s going to reset and then your accuracy is back. That over-time accuracy is going to affect you. Even if the error accumulates to just one percent, that’s still 10,000 affected records on a simple, small one-million-record database. It really matters.
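The arithmetic behind that figure, and the climb-then-reset pattern, can be sketched in a few lines of Python. The linear drift between reloads is an assumption made only for illustration, since the talk does not say how quickly the error grows. The function names and the 30-day reload interval are hypothetical.

```python
def affected_records(total_records, error_rate):
    """Number of records touched by a given error rate."""
    return round(total_records * error_rate)

def extra_error_on_day(day, reload_every_days, peak_extra_error):
    """Extra error on a given day, ASSUMING it grows linearly after each
    full reload and drops back to zero when the next reload runs."""
    days_since_reload = day % reload_every_days
    return peak_extra_error * days_since_reload / reload_every_days

# One percent of a one-million-record database.
affected_at_peak = affected_records(1_000_000, 0.01)

# Hypothetical batch system reloaded every 30 days, drifting toward 1% extra error.
drift = [extra_error_on_day(day, 30, 0.01) for day in range(0, 61, 15)]
```

Under these assumptions, `drift` rises from zero to half the peak at day 15, resets at day 30, and repeats the same pattern in the next cycle.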
3:35 Test Accuracy with Entity Resolution Proof of Concept (PoC)
So two things to test on accuracy: point-in-time and evolution or continuous over time. And, always use real data if you want to get a real assessment of what your accuracy is going to be.
We do a lot of work with companies helping them get accuracy. We do one-day [entity resolution] PoCs on up to 10 million records, so you can just, like that on a Tuesday, figure out how Senzing is performing against your own or other technologies. Come try us out.