Entity Resolution Explained Step by Step
By Senzing, published November 4, 2022
Matching data about people and organizations can be complicated. In this step-by-step video, Jeff Jonas reduces entity resolution down to its simplest form and highlights specific examples of what can happen when performing fuzzy matching. See if you can guess the correct outcome of each record before Jeff reveals them.
Learn how records about people are matched, identified as related or determined to not match. Watch as corrections are made as new data reverses earlier assertions. You’ll come away with a much better understanding of the intricacies of entity resolution and the power of entity-centric learning.
After watching, if you want to learn more about how Senzing® entity resolution can help your organization, schedule a call with an entity resolution expert.
Video Transcript
Timestamps
0:00 Intro
0:46 Fuzzy Matching
1:47 Derived Relationships
2:25 Disclosed Relationships
2:50 Mismatched Data
3:28 Ambiguous Matches
4:49 Discoverable, Self-Correcting Matches
5:44 Real-Time, Self-Correcting Matches
6:46 Entity-Centric Learning
7:48 Record Deletion for Privacy Compliance (GDPR, CCPA)
8:15 Entity Resolution for Alphabets and Scripts
8:35 Entity Resolution for Organizations
8:44 Entity Resolution for Vessels
9:16 Bonus Section: Importance of Real-Time Entity Resolution
9:49 Accuracy Drift vs. Self-Correcting
10:51 Periodic Batch Reloads vs. Real-Time Transactional Entity Resolution
12:00 Channel Separation
12:48 Channel Consolidation
13:16 Entity Resolution with Entity-Centric Learning
13:24 Entity Resolution Explained
Hi, I’m Jeff Jonas, the founder and CEO of Senzing. I have spent at least 30 hours creating this visual to explain entity resolution. I think of it as in slow motion because what’s really happening is thousands of records a second, entity resolution decisions are being made. What I’m going to do here is I’m going to explain step by step, call it a deep dive, about what’s happening behind the scenes, what kind of decisions are being made, and I’ll tell you what it’s easier said than done. So many people think it would be so easy to do entity resolution and so many of these moves I’m about to show you are super expensive to make entity resolution do in real-time, but let’s get started.
0:46 Fuzzy Matching
Okay, take a look at these three records. What do you think will become of these three records? Is it going to become three different people, is it one person, does it turn into two people? Now, fewer than one in a thousand people will get this right over these next 10 moves. So, you might want to pause along the way on this journey if you want to give yourself a self-test and see if you’re one of those 1 in 1,000, okay? But take a look there.
Well, I’ll tell you what, it’s going to become one person. It becomes entity E1. Rob, Bob, Robert, all part of the same name family. The dates of birth are different, but only the month and day are transposed. The phone number structure is different, but there’s enough evidence there to have very high confidence that that is one person, and so we assert it’s one.
Now, this is a form of fuzzy matching. Anybody can do exact matching, where everything’s exactly the same, but data, as you know, has got a lot of fuzziness to it. If you want to figure out if things are the same, you’ve got to see through things like some phone numbers having a plus one and some not.
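The three kinds of fuzziness mentioned above (name families, a transposed month and day, and a phone number with or without a plus one) can be illustrated with a minimal Python sketch. This is not how Senzing implements matching; the nickname table, function names, and sample values are all assumptions made up for illustration.

```python
import re
from datetime import date

# Hypothetical, tiny nickname table; a real system would need a far larger name library.
NAME_FAMILIES = {"rob": "robert", "bob": "robert", "robert": "robert"}


def canonical_first_name(name: str) -> str:
    """Map a first name to its name family, falling back to the lowercased name."""
    key = name.strip().lower()
    return NAME_FAMILIES.get(key, key)


def normalize_phone(phone: str) -> str:
    """Keep digits only and drop a leading '1' country code from 11-digit numbers."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def dob_close(a: date, b: date) -> bool:
    """True if the dates are equal or differ only by a swapped month and day."""
    if a == b:
        return True
    return a.year == b.year and a.month == b.day and a.day == b.month
```

For example, `normalize_phone("+1 (702) 555-1212")` and `normalize_phone("702-555-1212")` produce the same string, and `dob_close(date(1978, 3, 12), date(1978, 12, 3))` is true. A real engine would weigh such signals together rather than treat any one of them as decisive.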
1:47 Derived Relationships
Okay, let’s go on to record number four. Okay, the email is the same as record one, you notice that. You probably also notice the last name’s the same but it’s Patricia versus Bob. Some families share email addresses, and so really this is different people.
And of course, this one’s not that hard really. We would call this a derived relationship. If you want to do really good entity resolution, you’ve got to be able to see relationships as well, and this derived relationship is created and persisted. It’s remembered in the index about the entities.
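A derived relationship of this kind could be sketched roughly as follows, assuming a toy rule where a shared email with a different first name means "related, not the same". The field names, the rule itself, the sample records, and the `relationships` store are hypothetical and only illustrate the idea of detecting and persisting a relationship.

```python
# Relationships the process has derived so far, remembered alongside the entities.
relationships = set()


def classify_pair(a: dict, b: dict) -> str:
    """Toy rule: shared email plus same name is a match; shared email alone is a relationship."""
    same_email = a["email"].lower() == b["email"].lower()
    same_first = a["first"].lower() == b["first"].lower()
    same_last = a["last"].lower() == b["last"].lower()
    if same_email and same_first and same_last:
        return "match"
    if same_email:
        return "derived relationship"
    return "no relationship"


def remember(entity_a: str, entity_b: str, reason: str) -> None:
    """Persist a derived relationship between two entities, in a stable order."""
    first, second = sorted((entity_a, entity_b))
    relationships.add((first, second, reason))


# Made-up sample values standing in for records one and four.
record_1 = {"first": "Bob", "last": "Smith", "email": "family@example.com"}
record_4 = {"first": "Patricia", "last": "Smith", "email": "family@example.com"}

if classify_pair(record_1, record_4) == "derived relationship":
    remember("E2", "E1", "shared email")
```

The point of storing the relationship is that later queries about either entity can surface the other without recomputing the comparison.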
2:25 Disclosed Relationships
Let’s move on to record five, and this is a freebie, there’s no work at all. This is what we would call a disclosed relationship. The machine, the entity resolution process is not detecting it, you’re told. This is like corporate hierarchies data that you might get from a data provider or maybe your bank account has a spouse or a co-signer. You’re not guessing that those people are related. This is a disclosed relationship.
2:50 Mismatched Data
Okay, now, it’s going to get a little more tricky. Take a look at record six. To speed it up, compare it mainly to record one. What are you going to do with this? You know what? The names are basically the same, the dates of birth are the same, but boy, name and date of birth is not… it’s a pretty good way to find a match, but if you wanted to favor the false negative (meaning only putting things together when you’re sure), you would really be best off calling that a possible match, and so record six becomes entity E4 on name and date of birth.
3:28 Ambiguous Matches
Okay, moving on to record seven. This is tricky. You know, all the prior generations of entity resolution would have missed this; it’s expensive to do in real time on big data. Record seven has the same email address as records one and four. Which one is it?
Historically, we would have kind of randomly given it to one or the other. That would give you a false positive, to tell you the truth, fifty percent of the time. By the way, if record seven had the word street on it, would you maybe call it record one just because record four was missing the word street, as if Patricia doesn’t live on a street?
The right answer here is really to hold it out. It’s a special form of a possible match, we call it ambiguous. Funny true story, George Foreman names all five of his sons George. If you get George Foreman with a home address and a home phone, it could be any one of six people. If you randomly assign it to one, you’ve probably made the wrong decision.
Now get this though, just track with me on this. If entity E2 didn’t exist, record seven would become part of E1 and you could be pretty confident. If entity E1 didn’t exist, you’d be pretty confident record seven would be part of E2. But because both exist, there are two right answers, and you have to keep it out.
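The hold-it-out rule can be shown with a small sketch: a record that fits exactly one entity resolves to it, while a record that fits two or more equally well is flagged as ambiguous instead of being assigned at random. The `assign` function, the email-only comparison, and the sample address are assumptions for illustration, not the actual decision logic.

```python
def assign(record_email: str, entities: dict) -> tuple:
    """Return ('resolved' | 'ambiguous' | 'new', candidate entity ids).

    `entities` maps an entity id to the set of email addresses seen for it.
    """
    candidates = sorted(
        entity_id for entity_id, emails in entities.items() if record_email in emails
    )
    if len(candidates) == 1:
        return ("resolved", candidates)
    if len(candidates) > 1:
        # Two or more equally good homes: hold the record out rather than guess.
        return ("ambiguous", candidates)
    return ("new", [])


# Made-up data: E1 and E2 share one family email address.
entities = {"E1": {"family@example.com"}, "E2": {"family@example.com"}}
print(assign("family@example.com", entities))
```

With both entities present the call reports the record as ambiguous between E1 and E2; remove either one from `entities` and the same record resolves to the remaining entity, which mirrors the reasoning above.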
4:49 Discoverable, Self-Correcting Matches
Okay, record number eight. Compare it mainly to record number two. And the question is what’s going to become of this? Yeah, yeah it’ll become part of E1, but when record eight combines with E1 and you learn an email address bsmith@work, you’ve just learned about that entity, and when you’ve learned that much about that entity, what else do you think might happen?
This is often missed, but record six is suddenly discoverable: that possible match slides right in, and so the arrival of record eight causes record six to combine. This is self-correcting. You see, in real time, eight fixes the previous possible match. Oh, we also would think of this as re-resolve. It re-resolves six upon the arrival of eight.
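One way to picture this re-resolve step is a toy model in which each record is a set of (feature, value) pairs, a match needs at least three shared features, and held possible matches are retried whenever an entity learns something new. The threshold, the feature values, and the assumption that record six carries the work email are all invented for this sketch; names are assumed to be already normalized.

```python
def score(record: set, entity_features: set) -> int:
    """Count the features a record shares with everything known about an entity."""
    return len(record & entity_features)


def add_record(entity: set, record: set, held: list, threshold: int = 3) -> tuple:
    """Merge `record` into `entity`, then retry each held possible match once."""
    entity |= record
    promoted, still_held = [], []
    for possible in held:
        if score(possible, entity) >= threshold:
            entity |= possible  # the possible match now resolves into the entity
            promoted.append(possible)
        else:
            still_held.append(possible)
    return promoted, still_held


# Made-up features standing in for entity E1 and records six and eight.
e1 = {("name", "robert smith"), ("dob", "1978-03-12"), ("phone", "7025551212")}
record_six = {("name", "robert smith"), ("dob", "1978-03-12"), ("email", "bsmith@work.example")}
record_eight = {("name", "robert smith"), ("phone", "7025551212"), ("email", "bsmith@work.example")}

held = [record_six]  # shares only name and date of birth with E1, so it is held
promoted, held = add_record(e1, record_eight, held)
```

After the call, `promoted` contains record six, because the email learned from record eight gives it a third shared feature. This sketch retries held records only once; a real engine would also need to handle cascades and do this at scale in real time.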
5:44 Real-Time, Self-Correcting Matches
Okay, we’ve been doing this in our prior generation of engines for years, but the move I’m about to show you, oh, we spent probably a year working on this in engineering to do this in real time at scale, I mean at billions of records.
Take a look at record nine. Okay, it’s mainly like record three. Name, ID, but get this, it’s not one and two because the dates of birth are so different. So, I’m going to give you a hint here. Record nine is going to become a new entity out to the right of entity E1 and, when it lands there, something else is going to happen. It’s going to pull record three out. This is another form of self-correction: it unresolves three out of entity E1 and slides it over into entity E6. Doing this in real time at scale is very expensive. I’m going to cover why this is so important in a little bonus section at the end, okay? And so, let’s carry on for now, okay.
6:46 Entity-Centric Learning
Here comes record ten, take a look at that. Most entity resolution engines can’t do this move. That’s because they do record matching. So, take a look at record 10 and you know, what would you do with it? Does it match record one? Not sufficient. Does it match any of the other records? Not sufficient. But in fact, record 10 becomes part of E1.
We call this entity-centric learning. Entity E1 is building up a collection of every name it’s seen including AKAs, every address, every phone number, and it’s that collection that record 10 is compared against. So, we’re not taking record 10 and trying to find a record. We’re taking record 10 and trying to find an entity, and this is super important, especially in messy data and especially in fraud hunting cases where you’re trying to find clever bad people that don’t use the same name, address, and passport number on every record. More about why that’s important in my little bonus section at the end.
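The difference between record matching and entity-centric matching can be sketched with sets of features. In this toy model a match needs two shared features; the new record shares only one feature with each individual record, but two with the entity's accumulated collection. The threshold, function names, and values are assumptions chosen to make the contrast visible, not a description of the real scoring.

```python
THRESHOLD = 2  # hypothetical: shared features needed to call a match


def record_match(new: set, records: list) -> bool:
    """Record matching: the new record must match some single existing record."""
    return any(len(new & existing) >= THRESHOLD for existing in records)


def entity_match(new: set, records: list) -> bool:
    """Entity-centric: compare against everything the entity has accumulated."""
    accumulated = set().union(*records)
    return len(new & accumulated) >= THRESHOLD


# Made-up records already resolved into one entity.
e1_records = [
    {("name", "robert smith"), ("phone", "7025551212")},
    {("name", "bob smith"), ("email", "bsmith@work.example")},
]
# A new record that reuses the phone from one record and the email from the other.
record_10 = {("name", "r. s. smythe"), ("phone", "7025551212"), ("email", "bsmith@work.example")}

print(record_match(record_10, e1_records), entity_match(record_10, e1_records))
```

Here `record_match` fails while `entity_match` succeeds, which is the situation described above: no single record is sufficient, but the entity as a whole is.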
7:48 Record Deletion for Privacy Compliance (GDPR, CCPA)
Okay, let’s take a look at record eight. What happens if you delete record eight? You know, this might happen. Maybe some data needs to be aged out, maybe it’s a privacy law like GDPR or CCPA, the law in California. When record eight is deleted, you better unlearn record ten, the entity-centric learning record, and you better unlearn that record six was a match for sure. It can only be a possible match.
8:15 Entity Resolution for Alphabets and Scripts
Okay, you know, what I’ve just shown you is kind of the basic moves, but to do entity resolution well, you want to be able to do [entity resolution] across scripts. You’d better know the Arabic spelling of Mohammed versus how Muhammad is spelled in English versus Mandarin, and all the different variations in English. There are over a hundred variations of Muhammad, the shortest being Mhd.
8:35 Entity Resolution for Organizations
Entities can be organizations, so instead of names, addresses, phones, and people, it could be names, addresses, fax numbers, corporate URLs and so on.
8:44 Entity Resolution for Vessels
And entities can be vessels. It could be a vessel name and the I.D.s that come with vessels. And these are the kinds of moves, those 10 moves are the basic moves that are required to get really highly accurate entity resolution. And you can get to some of those without too much effort, but to get all of those in real time at scale is very difficult.
It’ll be a miracle if you got them all right, if you quizzed yourself. Like if you were one of those one in a thousand plus, you should just shoot me an email at jeff@jeffjonas.com. I actually want to chat with you, that’s amazing.
9:16 Bonus Section: Importance of Real-Time Entity Resolution
Okay great. So, I want to go back now. This is the bonus section. When we did record nine, take a look back here at this chart. Record nine caused record three to pull out. Now imagine that. You’ve ingested a billion records already and record nine shows up. Not only did you have to figure out it’s a new entity, you have to ask yourself, had I known that in the beginning over the billions of decisions I’ve made, should I have made any of them differently? Doing that in real time: non-trivial.
9:49 Accuracy Drift vs. Self-Correcting
If you can’t do that in real time, take a look at this graph at what happens. Let’s just say for argument’s sake the database has a one percent error rate. If you can’t fix mistakes like that as new records are arriving, they’re kind of invalidating earlier decisions. Your database is slowly drifting from truth.
I heard from a large data aggregator that they drift at one percent a month. Well, let’s use that number. That would mean if in January you’re one percent off, at the end of February your database is two percent off, then three percent off, and then you get to four percent off a quarter later. Imagine that. If you just have a million records, you have 40,000 mistakes.
Whether you’re in the fraud business or you’re in healthcare and you’re trying to make sure you’re giving people the right procedures, this is important. What happens in most batch systems that can’t fix themselves is then they have to do a reload where they reload the data and then it fixes it and then it just starts to slowly drift and four months later at the end of July, you’re four percent off again.
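The drift arithmetic above can be written out as a short sketch, using the same illustrative numbers from the talk: a one percent starting error rate, one percent of drift per month, and a reload that resets the error to its starting point. The function names and the four-month reload cycle are assumptions for illustration; integer percentages are used to keep the arithmetic exact.

```python
def error_pct(months_since_reload: int, base_pct: int = 1, drift_pct_per_month: int = 1) -> int:
    """Error rate in percent after drifting for the given number of months."""
    return base_pct + drift_pct_per_month * months_since_reload


def mistaken_records(total_records: int, pct: int) -> int:
    """Number of records affected at a given error percentage."""
    return total_records * pct // 100


# Three months of drift from a 1% start gives 4%, or 40,000 of a million records.
print(error_pct(3), mistaken_records(1_000_000, error_pct(3)))

# Sawtooth: a batch reload every four months resets the drift each time.
print([error_pct(month % 4) for month in range(8)])
```

The second line prints a repeating 1, 2, 3, 4 pattern, which is the unstable accuracy of a system that can only correct itself through periodic reloads.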
10:51 Periodic Batch Reloads vs. Real-Time Transactional Entity Resolution
These kinds of systems have unstable accuracy and they’re fixed with these periodic batch reloads. Now maybe you’re going to do them weekly so you drift a bit less. But you have the wrong answer over that period of time. In systems like Senzing, with real-time transactional entity resolution, the new observations reverse earlier assertions. They fix the past in real time, they’re self-correcting. You never need to reload. It’s accurate in every second.
You know, it’s funny, so often when people measure accuracy, they’re taking it at a point in time. But if you don’t factor in that accuracy is drifting, you have a much less accurate system.
Okay, so that was a little breakout section there. And now I want to take you back to record 10. Record 10 was entity-centric learning. Again, most entity resolution systems use record matching. Does this record match any existing records? And as I mentioned, you can’t catch clever bad people because they don’t want their records to match so they change them. They use a very different identity or they keep altering the features to prevent it.
12:00 Channel Separation
I’ve more broadly described this as channel separation. It’s not always bad. You and I do channel separation. If I send you an encrypted Excel document and then text you with the password, that’s channel separation.
But clever bad people do this on purpose. It’s a primary deception tradecraft. I first saw this in Vegas in the early 90s where people were coming in trying to take advantage of the casinos. I mean I remember somebody had, they had 32 different names, they had eight different social security numbers, five different dates of birth, and they would use different combinations. And later you know, as I saw more of these use cases, whether it’s, you know, organized crime, whether it’s counterterrorism and protecting countries, this is how bad guys get away with things.
12:48 Channel Consolidation
And so, what you’ll see here in this example is, you know, you have somebody opening an account, you have a known money launderer, you have somebody who’s applied for a job. As records, you really couldn’t see these are all the same. If you want to be able to have really high-quality fraud detection (stop bad things before they happen, “left of boom” as some people say), you want to be able to do channel consolidation. You want to be able to take these channels that were intentionally segregated and bring them together.
13:16 Entity Resolution with Entity-Centric Learning
And so, to do that, you not only just need entity resolution, you need entity resolution that employs this technique called entity-centric learning.
13:24 Entity Resolution Explained
Okay, so with all of that, that’s an explainer. It’s slow motion. I’ve tried to take the many different kinds of things that happen in entity resolution and reduce them down to their simplest form so you can see it in slow motion, explained. I hope you’ve enjoyed it. Thanks.