Senzing EDA Tools - Basic Exploration
Let’s do some basic exploration of the truth set with sz_explorer. You will find it in your Senzing project’s /bin directory.
Type sz_explorer
at the command line to enter it.
Type help
to see all that it can do.
This article will focus on the ad hoc commands highlighted above. You can use these at any time with or without a snapshot. It is a great tool for:
- backend support when a user questions how a match was made or why one was not.
- exports and screenshots when reporting issues or inquiries to Senzing support.
Senzing is an SDK, and sz_explorer is a tool built using it. You will want to incorporate searching and retrieving records into your workflows for making decisions, or into your UI to display entities to your users and allow them to ask why and how. You can type show_last_call
after any command to see what call was made to Senzing, what flags were used, and what Senzing returned.
Using search
Type help search
to see how to use it.
All the commands have help with examples and notes when appropriate.
Type search robert smith
to see if there are any in the database:
Entity IDs in your database may differ from those shown here, as they depend on your data and load order. Use the entity IDs returned by your search commands in subsequent steps.
This tells you there are 3 entities that satisfy the search criteria:
- Entity ID 1 is a Robert Smith with 4 customer records.
- Entity ID 145 is a Robert Smith who is on the watchlist.
- Entity ID 5 is a Robert E Smith Sr who is both a customer and on the watch list.
Your entity ID numbers may be different.
In addition:
- The match key column tells you what attribute(s) matched (+name) and what principle was satisfied. Senzing performs “principle based entity resolution”. You don’t need to focus on the principles just yet; the most important part is the match_key. However, you can read more about Senzing principles in Principle Based Entity Resolution.
- The match score is a very simple scoring algorithm that ensures the strongest matches appear first. It just adds the scores of each attribute searched for, giving more weight to the name. Since we only searched by name, it is just the name score above.
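The additive scoring described above can be sketched roughly as follows. The weights are invented for illustration (the article only says the name counts for more), and the function names are hypothetical rather than part of the Senzing SDK.

```python
# Hypothetical weights: the article does not state the real ones.
# The name keeps full weight so a name-only search returns the name score.
ASSUMED_WEIGHTS = {"NAME": 1.0}
ASSUMED_OTHER_WEIGHT = 0.5


def match_score(feature_scores):
    """Add up the per-feature scores, giving the name more weight."""
    return sum(
        score * ASSUMED_WEIGHTS.get(feature, ASSUMED_OTHER_WEIGHT)
        for feature, score in feature_scores.items()
    )


def rank(results):
    """Return entity IDs ordered from strongest to weakest match."""
    return sorted(results, key=lambda eid: match_score(results[eid]), reverse=True)
```

With a name-only search, `match_score({"NAME": 97})` is simply the name score, which matches what the search screen shows.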
Using compare
Type compare search
to see the entities returned from the prior search side-by-side.
Placing entities side-by-side makes it easy to glean a few insights. The red arrows and lines highlight:
- The two customers have the same mailing address, but the DOBs (dates of birth) are about 24 years apart. This might indicate a father and son relationship.
- The watch list entity on the end does not appear to be related to either customer.
- If you see something like lines 1-46/46 (END) in the lower left corner, it is letting you know you are in a scrolling window. Use the arrow keys to navigate and press Q to quit when done.
Next, let’s drill into one of the entities we searched for. You do that with a “get” command.
Using get
Type help get
to see how to use it.
Type get 1 detail
to get the detail view of entity ID 1.
It starts with a grid of the records that belong to the entity:
- The first column shows you which data source and record IDs were resolved to this entity as well as the match_key and rule that fired when the record was loaded.
- The second column shows the data on each record that was used for resolution. This is the identifying data such as name, date of birth, address, identifiers, etc.
- The third column shows all the other data for each record.
Underneath the records is a tree view of the entities related to this one by match level. You can read more about match levels here: Understanding match levels
But what is a screen like this telling you? Looking at the red boxes and arrows you can see that:
- Robert has one active and three inactive records.
- The earliest record is from 2015, the latest in 2018.
- One might also wonder why he keeps going inactive, then signing up as a new customer the next year changing his identifying information each time.
- Could it be that his father and his spouse already got flagged and put on the watch list?
Next, looking at this record, one might wonder how this entity got resolved. After all, customers 1001 and 1005 have very little in common! So let’s ask Senzing “how” this entity came together.
Type how 1
to see the decision tree that determined all these records belong to the same entity.
The how decision tree view is displayed above. The red arrows are highlighting:
- You read the decision tree from the bottom up.
- Most features have one score, but names can have up to 3: first, last and combined in that order.
- In step 1, you can see that customer 1002 and customer 1004 came together first and created virtual entity V2-S1 which was used in step 2 to match customer 1001 and so on. In each step, new features may be learned for use in the next step. In this case, we learned the email address used in step 2 and the phone number used in step 3.
There is a lot to understanding a how report, which we will go through in the how section below. For now, just press enter
to continue.
Using why
The why command tells you why two entities did not resolve. While the following may seem complex at first, there are really only two reasons:
- Either they did not score high enough to resolve, but they may still be related!
- They couldn’t find each other because they didn’t have any matching candidate keys or they all went generic.
A why result can help you and your users understand why the records should only be related or lead you to possible tuning changes with the help of Senzing support.
Type help why
to see how to use it.
When you searched for Robert Smith above, you found three different entities. You may wonder why the first and the third didn’t resolve to the same entity ID.
Type why 1, 5
to get the answer!
This screen shot of the upper part of the table shows entity 1 on the left and entity 5 on the right.
- The data sources row shows that entity 1 has 4 customer records and entity 5 has 1 customer and 1 watchlist record.
- The why result shows the current match_key and rule between the two entities.
- The match_key shows the list of features that contributed to the match, both positively and negatively. The principle is also displayed and is the actual reason for the match and something that can be adjusted with the help of Senzing support. You can read more about Senzing principles here: Principle Based Entity Resolution
- The cross relation is what is stored in the database and should always equal the why result. Although rare, they can differ, and re-evaluating the entities will correct it. If you ever run across this, report it to Senzing support and we will help you get it re-evaluated.
- Next are the features for each entity with the best scoring pair on top.
- Remember the help on why above? On the name row notice:
- Robert Smith (entity 1) was compared with Robbie Smith (entity 5), with a full name score of 97. The surname scored 100 (exact match), and the given name scored 95 (recognized nickname).
- The [2] in brackets after the Robert Smith on the left indicates that a total of 2 entities have this exact name.
- Bob J Smith on the left is another name for entity 1. The Bob Smith and B Smith names are greyed out and have a # sign in the bracket indicator because they are suppressed: a more complete name is available.
- The DOB row is colored red because it only scored 58 and detracted from the match, which is why it is also red in the why_result above.
- On the address row, the best matching address scored 99 and contributed to the match.
- On the remaining rows, only the entity on the left had a phone and only the entity on the right had a driver’s license, so there was nothing to match.
But before the entities in question can be scored, they must find each other, which is why the why screen also shows the keys that put them on a short list of candidates to be compared.
The lower portion of the why screen shows the candidate keys that were created for each entity:
- Highlighted in blue are the keys that matched.
- To keep the system fast, keys can “go generic” which means they are no longer used to generate candidates.
- See the name key RPRT|SM0 [3]? That’s a metaphone for Robert Smith and also for Robbie Smith and [3] different entities have this key.
- If there is an exclamation point in front of the number like [!120], that key is no longer being used to find candidates.
There is a set of configurable thresholds that dictate when keys “go generic”, meaning they are no longer used for candidates. But rather than continually increasing thresholds and slowing down the system, Senzing creates lots of keys. Since the NAME_KEY for Robert Smith might go generic, we create composite keys like the NAMEADDR_KEY and NAMEDATE_KEY as well. It is far less likely that all of these will go generic.
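The idea of keys going generic can be illustrated with a small sketch. The threshold value, the function names, and the key strings below are all assumptions made for the example; the real thresholds are configurable and are not given in this article.

```python
# Hypothetical threshold: a key shared by more entities than this is "generic".
ASSUMED_GENERIC_THRESHOLD = 100


def usable_keys(key_counts, threshold=ASSUMED_GENERIC_THRESHOLD):
    """Keep only keys shared by few enough entities to still generate candidates."""
    return {key for key, count in key_counts.items() if count <= threshold}


def shared_candidate_keys(keys_a, keys_b, key_counts,
                          threshold=ASSUMED_GENERIC_THRESHOLD):
    """Keys both entities have that have not gone generic.

    Keys with no entry in key_counts are ignored in this sketch.
    """
    return set(keys_a) & set(keys_b) & usable_keys(key_counts, threshold)


# Made-up key strings: the plain name key has gone generic, the composite has not.
counts = {"NAME_KEY example": 120, "NAMEADDR_KEY example": 2}
both = ["NAME_KEY example", "NAMEADDR_KEY example"]
print(shared_candidate_keys(both, both, counts))
```

If the returned set is empty, the two entities never reach the scoring stage, which is the second of the two reasons a why screen can report.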
To learn more about how Senzing Entity Resolution works see Entity Resolution Processes. But don’t worry if it’s not immediately clear what is going on with this screen. Once you see a few examples on your own data, you will quickly get the hang of it.
Using how
Now that you know more about why and its notations, let’s go back to the how command for a full review of it and its different views.
Type help how
to see how to use it.
But first, let’s pick a more interesting entity:
Type search maria sentosa
Your entity ID numbers may be different. Use the one you got in the following commands.
Type how 18
to see how those 5 records resolved to the same entity.
The how decision tree is usually displayed first:
- Remember, you read a how decision tree from the bottom up. Notice that the two interim entities created along the way are combined in the last step to form the final entity.
- You see the scores of all the features compared in each step, along with the match_key and principle they satisfied in combination.
- Notice also each step has a type and there are only 3 different types. It is either:
- “creating a virtual entity” by combining 2 records,
- “adding a record to a virtual entity”, or
- “combining virtual entities”.
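The three step types follow from whether each side of a step is a single record or an already-built virtual entity. A minimal sketch of that classification, using a hypothetical helper that takes the number of records on each side (it does not parse actual SDK output):

```python
def step_type(left_record_count, right_record_count):
    """Classify a how step by the number of records on each side."""
    if left_record_count < 1 or right_record_count < 1:
        raise ValueError("each side of a step needs at least one record")
    single_sides = (left_record_count == 1) + (right_record_count == 1)
    if single_sides == 2:
        return "creating a virtual entity"
    if single_sides == 1:
        return "adding a record to a virtual entity"
    return "combining virtual entities"
```

For example, a step that joins a two-record virtual entity with another two-record virtual entity is classified as "combining virtual entities".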
So what is this screen telling you? Some how results are pretty straightforward in that two records came together to create a virtual entity in step 1 and then additional records are simply added to it. But it can happen that two or more interim entities needed to be created along the way before they accumulated enough attributes to join them together. And that is what happened with Maria above.
When you think of it, how is a series of why screens. But instead of showing why two entities didn’t match, each step shows you how each record entered the entity!
Next press C
at the prompt to see the columnar view.
A columnar why can be very wide! Remember, if it scrolls off the screen, you can use the arrow keys to scroll left and right, up and down, pressing q
to quit when done scrolling.
The first two columns of the screen are displayed below:
- You read this view from left to right.
- The first two columns show step 1.
- Notice how the name is in yellow. It really didn’t score high enough to be a close name match, but the given name scores 100 creating a partial name match. That is why the match_key starts with PNAME. Principle 110 was designed to allow a partial name match because so many other important features match, including DOB, ADDRESS, and EMAIL.
The remaining columns are displayed below:
- In step 1, we learned a more complete name and new address which were used to match records in the remaining two steps.
The columnar view is great because it’s clearer what is learned at each step. However, it only shows how each record enters the entity. The steps that combine virtual entities are not included.
Next press S
at the prompt to display the summary view.
The summary view for how could almost stand on its own as it is a summary of the entity itself.
At the top you see the resolution summary which summarizes the decision tree.
- It starts with how many steps of what type were required.
- It next highlights steps of interest which include any low scoring names and steps that combine virtual entities. On large entities, there can be lots of steps. Which ones are the most important? This section will tell you!
- And finally it shows the principles and match keys that fired along the way.
The next section is the entity summary which shows how many records, how many features of what type etc.
- See how 4 names were used and that 3 are grayed out with a [#1]. Remember from the why documentation that the # indicates a suppressed name. That is because Senzing computes the most complete name and knows which others are a derivative of it.
- This is important to matching! If Barry Smith and Betty Smith both have an aka of B smith, you want to match on the more complete name even if B Smith matches exactly.
- It can also be useful information for your best name calculation.
- Notice that after each feature there is a number in [] and a number in blue (). The number in brackets tells you how many other entities are using that exact value. The number in blue () tells you how many records in that entity reported that value.
- Looking at addresses, you can see from the blue (3) that 9304 W 15th address is the most common which is useful information for your best address calculation. But you also see from the bracketed [2] that somebody else is using the 638 Downey St address.
Type search addr_full = 638 Downey St, Salem, OR
to search and compare who is at the address.
Then type compare search
to see them side-by-side.
The red arrows are highlighting that Maria is on the watch list and is sharing an address with Susan.
- Did Maria steal Susan’s address?
- Is Susan part of the fraud ring as well?
- Or did they just live there at different times?
Senzing counts everything! It not only helps resolution but helps find the needles in the haystack for threat and fraud protection.
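The two counts shown in the summary view can feed your own "best value" logic. The sketch below is one possible approach, not how Senzing computes anything: it picks the value reported by the most records and flags values used by more than one entity. The class and function names are hypothetical, and counts not quoted in the article (marked in comments) are assumed.

```python
from dataclasses import dataclass


@dataclass
class FeatureValue:
    value: str
    record_count: int  # the (n) count: records in this entity reporting the value
    entity_count: int  # the [n] count: entities using this exact value


def best_value(values):
    """Pick the value reported by the most records in the entity."""
    return max(values, key=lambda v: v.record_count).value


def shared_values(values):
    """Values that at least one other entity is also using."""
    return [v.value for v in values if v.entity_count > 1]


addresses = [
    FeatureValue("9304 W 15th", record_count=3, entity_count=1),    # entity_count assumed
    FeatureValue("638 Downey St", record_count=1, entity_count=2),  # record_count assumed
]
print(best_value(addresses), shared_values(addresses))
```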
Using why with search
You can even ask why on a search. Refer to the earlier section or type help search
to review the syntax.
First type search barry smith
to see if there are any matches.
Senzing indicates no entities at all were found. That means no keys matched. Type help search
to see the keys it generated:
The [0] is telling you there are no entities with the name Barry Smith, nor any of its metaphone name keys.
Remember, you may not find your record by name alone with an entity search. So let’s try to search with a date of birth as well.
Type search bubby smith | date_of_birth: 12/11/1978
Still nothing returned, but this time it says “entities were found but did not score high enough”.
- Type why search 1 if you thought it should have found entity ID 1.
- Type why search if you don’t know the entity ID it should find.
Now you can see that Bubby vs. Bob J doesn’t score high enough. Maybe it was really Bobby we should have been searching for.
Let’s try search bobby smith | date_of_birth: 12/11/1978
Now we find two entities and the top one has the same date of birth!
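If you want to run the same kind of search from your own code, one approach is to turn the search text into a JSON document of attributes. The sketch below assumes that bare text means a full name stored under NAME_FULL and that other parts look like `attribute: value` or `attribute = value`. This parsing is an illustration, not sz_explorer’s actual logic; use show_last_call to see what was really sent.

```python
import json
import re

# Matches "attribute: value" or "attribute = value".
_ATTRIBUTE = re.compile(r"^\s*(\w+)\s*[:=]\s*(.+?)\s*$")


def to_search_attributes(query):
    """Convert 'name | attr: value' style text into a JSON attribute string."""
    attributes = {}
    for part in query.split("|"):
        part = part.strip()
        if not part:
            continue
        match = _ATTRIBUTE.match(part)
        if match:
            attributes[match.group(1).upper()] = match.group(2)
        else:
            # Assumption: text without an attribute name is a full name.
            attributes["NAME_FULL"] = part
    return json.dumps(attributes)


print(to_search_attributes("bobby smith | date_of_birth: 12/11/1978"))
```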
Using tree
This next command is the tree view and it is useful for seeing relationships at multiple degrees.
Type help tree
to see how to use it.
To demonstrate that, let’s search for a company.
Type search universal exports
Looks like there are 4. Hopefully, the Worldwide version is at the top of the hierarchy.
Type get 97
to see. Remember, your entity ID may be different.
It does look like it is the global parent of the other 3. We can also see who owns it! A get shows a one degree tree view of the relationships. But what if we wanted to see who is related at two degrees?
Type tree 1 degree 2
to see.
Now we can also see the principals behind Universal Exports USA. The tree command uses just one call to the Senzing SDK.
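What "two degrees" means can be pictured as a breadth-first walk over relationships that stops after a set number of hops. The sketch below only illustrates the concept on an in-memory graph with placeholder entity names; as noted above, sz_explorer gets its result from the SDK in a single call rather than walking the graph itself.

```python
from collections import deque


def entities_within(graph, start, max_degree):
    """Return {entity: degree} for everything reachable within max_degree hops."""
    degrees = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if degrees[current] == max_degree:
            continue  # do not expand past the requested degree
        for neighbor in graph.get(current, ()):
            if neighbor not in degrees:
                degrees[neighbor] = degrees[current] + 1
                queue.append(neighbor)
    return degrees


# Placeholder relationships: A-B, B-C, C-D.
example_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(entities_within(example_graph, "A", 2))
```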
Using show_last_call
Type show_last_call
to see the calls that were made to the SDK for the last command you executed.
It was a find_network_by_entity_id
call and you can find out all about it here: https://www.senzing.com/docs/
Using export
The last command in this series is export. Export provides a way to extract the original JSON messages that make up an entity. If you have a problem with an entity that either did or didn’t resolve to your liking, you can export the JSON to a file to load into a test system for further debugging. Senzing may also ask you to export the JSON of a problem entity for a support ticket.
Type help export
to see how to use it.
Type export 1, 7 to /tmp/entities_1-7.jsonl.
Make sure you specify a directory you have permission to write to!
These are the records for those two entities. They can be loaded into another system for testing and debugging.
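Before loading an export elsewhere, it can help to sanity-check what is in the file. The sketch below assumes the export holds one JSON record per line and that each record carries a DATA_SOURCE key; the function name is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_by_data_source(path):
    """Count the records in a JSONL file, grouped by their DATA_SOURCE value."""
    counts = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: records without the key are grouped as UNKNOWN.
            counts[record.get("DATA_SOURCE", "UNKNOWN")] += 1
    return counts
```

Calling `count_by_data_source("/tmp/entities_1-7.jsonl")` on the file exported above should give you a quick per-source record count to compare with what the get command shows for those entities.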
Another great use for export is to create or add records to your truth set! The best truth sets are based on real data. So when you come across interesting or complex examples of entities that either matched or didn’t, export them for your truth set! See How to create an entity resolution truth set.
That’s it for ad hoc exploration of entities for understanding how and why entities either resolved or didn’t. Be sure to check out the next article in the series, exploring snapshot reports!
Next Steps
Congratulations! You’ve learned how to use sz_explorer
to search, compare, and analyze entities in your Senzing database. To continue:
- Explore the next article on snapshot reports for advanced analysis.
- Experiment with your own data using the commands above.
If you have any questions, contact Senzing Support. Support is 100% FREE!