EDA Tools

Senzing includes three Exploratory Data Analysis (EDA) command-line tools for understanding entity resolution results:

Tool | Purpose | Learn More
--- | --- | ---
sz_explorer | Interactive CLI for ad-hoc entity search, retrieval, explainability, and export | Entity Exploration
sz_snapshot | Takes a snapshot of entity resolution results and generates summary reports | Snapshot Analysis, Running sz_snapshot
sz_audit | Compares snapshot results against truth set keys to measure precision and recall | Auditing, Running sz_audit

These tools are included in the senzingsdk-tools package, which is also installed as part of senzingsdk-poc.

Truth set data

The truth set contains 159 records across three data sources:

Data Source | Description | Records
--- | --- | ---
CUSTOMERS | Primary subjects of interest, such as customers, employees, or vendors. Includes duplicates, name variations, and address changes. | 120
REFERENCE | External data about people (demographics, past addresses, contact methods) or companies (firmographics, corporate structure, ownership) that enriches entity profiles. | 22
WATCHLIST | Entities to screen against, such as known fraud actors or sanctioned parties. | 17

EDA tools in action

The following examples use the truth set demo data to illustrate the kinds of questions EDA tools can answer.

While these examples use customer records, the same analysis applies to other subjects of interest, such as employees for insider threat detection or vendors for supply chain risk management.

Deduplication

The data_source_summary report shows how many records resolved to the same entity within each data source. Here, 86 of the 120 "DATA_SOURCE": "CUSTOMERS" records matched other records, compressing into 71 entities.

[Figure: data_source_summary showing duplicate records]
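
To make the counts in a within-source summary concrete, here is a minimal Python sketch that derives similar figures from a list of (data source, record ID, entity ID) assignments. It is not the sz_snapshot implementation: the function name, the input layout, and the toy record and entity IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def data_source_dedup_stats(assignments, data_source):
    """Summarize within-source duplicates.

    `assignments` is a hypothetical iterable of
    (data_source, record_id, entity_id) tuples.
    """
    records_by_entity = defaultdict(list)
    for source, record_id, entity_id in assignments:
        if source == data_source:
            records_by_entity[entity_id].append(record_id)

    # Entities holding more than one record from this source contain duplicates.
    duplicate_groups = [r for r in records_by_entity.values() if len(r) > 1]
    return {
        "records": sum(len(r) for r in records_by_entity.values()),
        "entities": len(records_by_entity),
        "matched_records": sum(len(r) for r in duplicate_groups),
        "entities_with_duplicates": len(duplicate_groups),
    }


# Toy data (hypothetical IDs, not taken from the truth set).
toy_assignments = [
    ("CUSTOMERS", "1001", 1),
    ("CUSTOMERS", "1002", 1),
    ("CUSTOMERS", "1003", 1),
    ("CUSTOMERS", "1004", 2),
    ("CUSTOMERS", "1005", 3),
    ("CUSTOMERS", "1006", 3),
    ("WATCHLIST", "2001", 3),
]

dedup_stats = data_source_dedup_stats(toy_assignments, "CUSTOMERS")
print(dedup_stats)
```

In this toy data, six CUSTOMERS records resolve to three entities, and five of those records share an entity with at least one other CUSTOMERS record.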

Cross-source screening

The cross_source_summary report shows that 11 "DATA_SOURCE": "CUSTOMERS" records matched "DATA_SOURCE": "WATCHLIST" records across 6 entities, identifying entities that appear in both data sources.

[Figure: cross_source_summary showing watchlist matches]
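
The idea behind a cross-source count can be sketched in a few lines of Python: group records by resolved entity, then keep the entities that hold records from both data sources. This is an illustrative sketch only. The function name, input layout, and toy IDs are assumptions, and the actual report may count matches differently.

```python
from collections import defaultdict


def cross_source_matches(assignments, source_a, source_b):
    """Count entities (and their records) that span two data sources.

    `assignments` is a hypothetical iterable of
    (data_source, record_id, entity_id) tuples.
    """
    sources_by_entity = defaultdict(lambda: defaultdict(set))
    for source, record_id, entity_id in assignments:
        sources_by_entity[entity_id][source].add(record_id)

    # Keep only entities that contain at least one record from each source.
    shared = [
        recs for recs in sources_by_entity.values()
        if source_a in recs and source_b in recs
    ]
    return {
        "entities": len(shared),
        "records_a": sum(len(recs[source_a]) for recs in shared),
        "records_b": sum(len(recs[source_b]) for recs in shared),
    }


# Toy data (hypothetical IDs, not taken from the truth set).
toy_cross = [
    ("CUSTOMERS", "1001", 1),
    ("CUSTOMERS", "1002", 1),
    ("WATCHLIST", "2001", 1),
    ("CUSTOMERS", "1003", 2),
    ("WATCHLIST", "2002", 3),
    ("CUSTOMERS", "1004", 4),
    ("WATCHLIST", "2003", 4),
    ("WATCHLIST", "2004", 4),
]

cross_stats = cross_source_matches(toy_cross, "CUSTOMERS", "WATCHLIST")
print(cross_stats)
```

Here two entities contain both CUSTOMERS and WATCHLIST records, covering three records from each source.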

Ambiguous match investigation

The why command in sz_explorer shows the scoring details behind ambiguous matches, where a record could plausibly belong to more than one entity.

[Figure: why command showing an ambiguous match]

Accuracy measurement

sz_audit compares entity resolution results against a truth set, reporting precision, recall, and F1 scores for measuring accuracy.

[Figure: audit_summary showing precision, recall, and F1 scores]
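
For readers unfamiliar with these metrics, the following Python sketch computes precision, recall, and F1 over record pairs, which is one common way to score entity resolution against a truth set. Whether sz_audit uses this exact pairwise definition is not stated here, so treat it as an assumption. The function names and the toy record and cluster labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def record_pairs(cluster_of):
    """Return the set of record pairs that share a cluster.

    `cluster_of` maps record ID -> cluster (entity) label.
    """
    members = defaultdict(list)
    for record, cluster in cluster_of.items():
        members[cluster].append(record)
    pairs = set()
    for records in members.values():
        pairs.update(frozenset(p) for p in combinations(sorted(records), 2))
    return pairs


def pairwise_scores(truth, resolved):
    """Pairwise precision, recall, and F1 (an assumed scoring method)."""
    truth_pairs = record_pairs(truth)
    resolved_pairs = record_pairs(resolved)
    true_positives = len(truth_pairs & resolved_pairs)

    precision = true_positives / len(resolved_pairs) if resolved_pairs else 1.0
    recall = true_positives / len(truth_pairs) if truth_pairs else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (hypothetical): the truth says r1-r3 are one entity and r4-r5 another.
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B", "r5": "B"}
resolved = {"r1": 1, "r2": 1, "r3": 2, "r4": 2, "r5": 3}

precision, recall, f1 = pairwise_scores(truth, resolved)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case, the resolved result links two pairs, only one of which is correct (precision 0.50), and it finds one of the four true pairs (recall 0.25).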

Get started