Senzing Cookbook · Recipe #INV-001

Vol. II · Investigative & Compliance

Investigative Analysis · Three Variations

Combining Data for Investigative Analysis

One recipe, three ways to cook it — from open public data
to your own POC to a production-grade deployment

* your mileage may vary

About this recipe

Investigative processes benefit from combining diverse data sets. The added context gives alerting functions and analysts what they need to make higher-quality decisions. In this recipe, we demonstrate combining three sources of data — a transactional, a derogatory, and a reference data source — and building a custom interactive web interface to explore the results.

Watch: Clair Sullivan makes this dish Off-the-Shelf with Open Data

Variation A · Off-the-Shelf · Open data only · ~45 min

Variation B · DIY POC · Add your own data · 2–3 hours

Variation C · Production · Production deployment · <40 hours*

📺

Variation A

Off-the-Shelf — Open Data, No Prep Required

Everything is provided: three Las Vegas CORD snapshots (open data), your LLM, the Senzing MCP, and three prompts. The MCP locates each dataset and hands your LLM a download URL, so there is no manual hunting or data wrangling. The only thing to bring is a free evaluation license, which your LLM can request for you via the MCP; it arrives almost immediately.

Ingredients

Everything you need to get started

Tools

A coding LLM — Claude Code, Cursor, Kiro, or VS Code with an AI extension
Senzing MCP server — connected to your LLM. Gives the LLM access to Senzing’s documentation, anti-patterns, datasets for testing, and guides for building dashboards and reports
Senzing SDK — installed locally, automatically via the MCP. No data flow to Senzing, Inc.

Data — located and downloaded automatically via the Senzing MCP

PPP Loans for Las Vegas — 3,488 Paycheck Protection Program recipients, Senzing-mapped and ready to load. The MCP provides the download URL; your LLM fetches it automatically.
US Dept. of Labor Violations for Las Vegas — 1,554 employer compliance actions and citations. Same pattern — MCP locates it, LLM downloads and loads it.
National Provider Index (NPI) for Las Vegas — 71,060 records from the CMS registry of licensed healthcare providers, individual and organizational. MCP-assisted download, no manual wrangling.

Other

A Senzing evaluation license (250k records free) — ask your LLM to request one via the Senzing MCP. It will arrive by email almost immediately. For larger evaluation licenses, contact [email protected].

Prep

Before you start cooking

Set up your coding LLM

Have Claude Code, Cursor, Kiro, or VS Code with an AI assistant open on your machine. This recipe requires an LLM that can write and run code locally, not a web chat window.

Add the Senzing MCP server

In your LLM’s settings, add the Senzing MCP server. In Claude Code it’s under Settings → MCP Servers:

https://mcp.senzing.com/mcp

Confirm it’s working: ask your LLM “What Senzing tools do you have available?” — you should see a list including get_sample_data, mapping_workflow, and others.

Get your evaluation license

Ask your LLM: “Request a Senzing evaluation license for me via the MCP.” The MCP’s submit_feedback tool will ask for your work email and send a 250k-record license to your inbox almost immediately. Save the senzing.lic file somewhere on your machine — you’ll attach it to the first prompt below. For larger volumes, contact [email protected].

III

Building

Loading Data and Exploring Matches

Create an Identity Graph Using PPP and Labor Violations ⏱ 15–20 min

Paste this into your LLM. Attach your senzing.lic file to the same message.

Important: Use the Senzing MCP for this task.

Goal: Stand up Senzing, load two pre-mapped datasets, and produce an entity match report. Size the setup for ~10M records.

Hard rules:
— Use a production-grade loader, not a demo/single-threaded process.
— Before recommending any loader pattern, check the Senzing MCP’s anti-patterns docs.

Preferences:
— Print progress every few seconds as records load.

Steps:
1. Deploy Senzing using the attached eval license file.
2. Load these two pre-mapped Senzing-ready snapshots: PPP loan data and Department of Labor compliance actions.
3. When both are fully loaded, generate a basic summary match report.

Outcomes to expect: When complete, you’ll have a live identity graph containing PPP loan and labor violation records and a summary match report. Before moving on to Part 2, it’s worth spending a few minutes exploring how these two data sets have combined.

Example questions worth asking before moving on:

“Show me a few matches that were made.”
“Explain the last one.”
“What is an ambiguous record?”
“Show me a few ambiguous matches.”
“What would scaling to 300M records on AWS require?”

Build an interactive web UX ⏱ 10–15 min

Important: Use the Senzing MCP for this task.

Goal: Build an interactive web UX to review and explore the entity matching results from the previous step.

Hard rules:
— Keep using the Senzing MCP for everything, paying special attention to the reporting guide for graph, dashboard, and why-match patterns.

Features:
— Search (using the Senzing interface) across resolved entities by name, address, or other attributes.
— Network graph with labels to visualize the identity graph.
— “Why match?” with feature scores for any selected entity pair.

Outcomes to expect: The LLM will likely make the application available via a URL — and may just pop it up. You’ll get a working UX to click around in. If something looks off, just ask.

Try asking for improvements, for example:

“Show the link chart first in the resume view.”
“Add a hover-over entity summary card on each node.”
“Add a record count on each node.”

Add the NPI data into the Identity Graph ⏱ 10–15 min

Important: Use the Senzing MCP for this task.

Goal: Add NPI data to the existing setup and update the user interface to reflect the new source.

Steps:
1. Find and add the Senzing-ready NPI (National Provider Index) snapshot to the identity graph using the same production-grade loader as before.
2. Update the reporting/visualization user interface as needed to include NPI as a source.

Outcomes to expect: The identity graph will now contain the NPI data source with matches, possible matches, and newly discovered relationships across all three data sources.

Example ways to explore the identity graph:

“Show me NPI records that also have Dept. of Labor matches.”
“Which companies with multiple PPP loans appeared in the news for such?”
“Find entities that appear in all three sources.”
“Are there healthcare providers in the DoL violations data?”
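For a sense of what a question like “Find entities that appear in all three sources” reduces to, here is a minimal sketch. It assumes a hypothetical export of (entity ID, data source) pairs from the identity graph; the data source codes and the export shape are illustrative only, and in practice your LLM would pull this through the Senzing MCP and SDK rather than from a hand-built list.

```python
from collections import defaultdict

# Hypothetical export: one (entity_id, data_source) pair per loaded record.
# The data source codes are illustrative, not the snapshots' real codes.
rows = [
    (1, "PPP_LOANS"), (1, "DOL_VIOLATIONS"), (1, "NPI"),
    (2, "PPP_LOANS"), (2, "PPP_LOANS"),
    (3, "NPI"), (3, "DOL_VIOLATIONS"),
]


def entities_in_all_sources(rows, required_sources):
    """Return sorted entity IDs with at least one record in every required source."""
    sources_by_entity = defaultdict(set)
    for entity_id, data_source in rows:
        sources_by_entity[entity_id].add(data_source)
    required = set(required_sources)
    return sorted(e for e, found in sources_by_entity.items() if required <= found)


print(entities_in_all_sources(rows, ["PPP_LOANS", "DOL_VIOLATIONS", "NPI"]))
```

The same function answers the two-source questions as well: pass only the sources you care about.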

🍳

Variation B

Do It Yourself POC — Your Data, Your Environment

Same structure as Off-the-Shelf, but you’re bringing your own ingredients. Swap out one, two or all of the provided snapshots for real datasets from your organization — e.g., a customer extract, a known fraudster list, third party reference data. Still demonstration-level, but the results will actually mean something to you and your stakeholders.

Ingredients

Tools, data, and what to bring

Tools — same as Off-the-Shelf

A coding LLM — Claude Code, Cursor, Kiro, or VS Code with an AI extension
Senzing MCP server — connected to your LLM. Gives the LLM access to Senzing’s documentation, anti-patterns, datasets for testing, and guides for building dashboards and reports
Senzing SDK — installed locally, automatically via the MCP. No data flow to Senzing, Inc.

Data — your own, plus optional sources

Bring Your Own Data (BYOD): A good starting point is to bring a customer file, a known fraudster or watchlist extract, and any third-party reference data you already have access to — e.g., OpenData.org, Dun & Bradstreet, Equifax.
Optionally, free data: for example, opensanctions.org, opendata.org, or other sources, many of which can be found in the Senzing CORD library (Collection of Relatable Data).

Other

A Senzing evaluation license (250k records free) — ask your LLM to request one via the Senzing MCP. It will arrive by email almost immediately. For larger evaluation licenses, contact [email protected].
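Before exporting your own data, it can help to check that the combined record count stays under the 250k evaluation limit. The sketch below is one way to do that; the function names are hypothetical, and the line-based count assumes no CSV field contains an embedded newline. As a reference point, it also totals the three open-data snapshots from Variation A.

```python
import io

EVAL_LICENSE_LIMIT = 250_000  # record limit stated for the evaluation license


def count_records(lines, has_header=True):
    """Count non-blank data rows in an iterable of text lines."""
    count = sum(1 for line in lines if line.strip())
    return max(count - 1, 0) if has_header else count


def fits_eval_license(counts, limit=EVAL_LICENSE_LIMIT):
    """Return (total, True/False) for a {source_name: record_count} mapping."""
    total = sum(counts.values())
    return total, total <= limit


# In practice, pass an open file handle instead of this in-memory sample.
sample = io.StringIO("id,name\n1,Ann Lee\n2,Bob Ray\n\n")
print(count_records(sample))

# Record counts quoted for the Variation A snapshots.
open_data_counts = {"PPP": 3_488, "DOL": 1_554, "NPI": 71_060}
print(fits_eval_license(open_data_counts))
```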

Prep

Get your data sources ready

Export a vertical slice from each of your data sources (e.g., CSV) and place the files somewhere your LLM can access on the local filesystem. Note the file path.
A vertical slice applies the same selection criteria across multiple data sources so the sample properly represents the resolutions and relationships you’d see at full scale. The most common forms are geographic (e.g., all records for a city, state, or postal code) or alpha range (e.g., last names starting with “A*” or “Ly*”).
Apply the same slice criteria to each of your source files — the consistency is what makes the POC results meaningful.
Include attributes that inform entity resolution — names, addresses, phone numbers, emails, IDs, dates of birth, and similar identifying fields. The more of these present, the stronger the resolution.
It doesn’t need to be clean. Inconsistent names, messy addresses, conflicting dates of birth, historical values and missing fields — Senzing specializes in this.
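A geographic vertical slice can be sketched as follows: the same postal-code prefix is applied to every source, even though each file names its column differently. The file contents, column names, and the prefix are made up for illustration; substitute your own.

```python
import csv
import io


def slice_rows(lines, column, prefixes):
    """Yield rows (as dicts) whose `column` value starts with any given prefix."""
    prefixes = tuple(p.upper() for p in prefixes)
    for row in csv.DictReader(lines):
        value = (row.get(column) or "").strip().upper()
        if value.startswith(prefixes):
            yield row


# Two hypothetical sources with different schemas.
customers = "id,name,zip\n1,Ann Lee,89101\n2,Bob Ray,10001\n"
watchlist = "ref,full_name,postal_code\nW1,A. Lee,89101-1234\nW2,Cy Moss,60601\n"

# One slice criterion, applied consistently to both sources.
SLICE_PREFIXES = ["891"]

customer_slice = list(slice_rows(io.StringIO(customers), "zip", SLICE_PREFIXES))
watchlist_slice = list(slice_rows(io.StringIO(watchlist), "postal_code", SLICE_PREFIXES))

print(customer_slice)
print(watchlist_slice)
```

An alpha-range slice works the same way: point `column` at the last-name field and pass prefixes such as "A" or "LY".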

III

Building

Mapping, Loading Data and Exploring Matches

Map Your Data to Senzing JSON ⏱ 5–15 min

Drop this prompt into your LLM, replacing the ‘path/to/your/files’ in Step 1 with your path name:

Important: Use the Senzing MCP for this task.

Goal: Map my data sources to Senzing JSON, ready for ingestion.

Hard rules:
— Use the Senzing MCP’s mapping workflow, not general training.
— Show me the mapping before applying it so I can review it.

Steps:
1. Inspect the files in this directory: [path/to/your/files].
2. Propose a field mapping to Senzing features.
3. Once I approve, produce the Senzing-ready output files.

Outcomes to expect: A clear field mapping for each of your source files, showing how your columns map to Senzing features. Once you approve, your data is ready to load.
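To make the idea of a field mapping concrete, here is a minimal sketch that turns one row of a hypothetical customer CSV into a Senzing-style JSON record. The source column names are invented, and the target attribute names (DATA_SOURCE, RECORD_ID, NAME_FULL, and so on) are commonly documented Senzing attributes used here as assumptions; the mapping the MCP workflow proposes for your files is the authority.

```python
import json


def to_senzing_record(row, data_source):
    """Map one source row (dict of strings) to a flat Senzing-style record."""
    record = {
        "DATA_SOURCE": data_source,
        "RECORD_ID": row["customer_id"],       # hypothetical source column names
        "NAME_FULL": row.get("name"),
        "ADDR_FULL": row.get("address"),
        "PHONE_NUMBER": row.get("phone"),
        "EMAIL_ADDRESS": row.get("email"),
        "DATE_OF_BIRTH": row.get("dob"),
    }
    # Drop empty values rather than sending blanks to the resolver.
    return {k: v.strip() for k, v in record.items() if v and v.strip()}


row = {
    "customer_id": "42",
    "name": " Ann Lee ",
    "address": "1 Main St, Las Vegas, NV 89101",
    "phone": "",
    "email": "ann@example.com",
}

# One JSON object per line is a convenient shape for a load file.
print(json.dumps(to_senzing_record(row, "CUSTOMERS")))
```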

Create an Identity Graph from Your Data ⏱ 5–20 min depending on volume

Important: Use the Senzing MCP for this task.

Goal: Load my mapped files into Senzing and produce a summary match report.

Hard rules:
— Use a production-grade loader, not a demo/single-threaded process.
— Before recommending any loader pattern, check the Senzing MCP’s anti-patterns docs.

Preferences:
— Print progress every few seconds as records load.

Steps:
1. Load the files we just mapped into Senzing.
2. When fully loaded, give me summary matching statistics.

Outcomes to expect: When complete, you’ll have a live identity graph containing your own data — entities resolved within and across your source files, with summary matching statistics. Before going further, it’s worth spending a few minutes exploring how your data has combined.

Example questions worth asking before moving on:

“Show me a few of the strongest matches that were made across my sources.”
“Walk me through why these two records were linked.”
“Show me some ambiguous matches — where Senzing found more than one suitable match.”
“Which records have the most cross-source connections?”

Build an interactive web UX ⏱ 10–15 min

Important: Use the Senzing MCP for this task.

Goal: Build an interactive web UX to review and explore the entity matching results from the previous step.

Hard rules:
— Keep using the Senzing MCP for everything, paying special attention to the reporting guide for graph, dashboard, and why-match patterns.

Features:
— Search (using the Senzing interface) across resolved entities by name, address, or other attributes.
— Network graph with labels to visualize the identity graph.
— “Why match?” with feature scores for any selected entity pair.

Outcomes to expect: The LLM will likely make the application available via a URL — and may just pop it up. You’ll get a working UX to click around in. If something looks off, just ask.

Try asking for improvements, for example:

“Add a hover-over entity summary card on each node.”
“Add a record count on each node.”

What’s next

Now that you have completed your POC and seen the results, it’s time to start planning your production deployment. Proceed to the Production Deployment recipe.

👨‍🍳

Variation C

Production — Deployment Considerations

You’ve run the POC. The results were compelling. Now someone is asking: “What would it take to do this for real?” This variation is not a step-by-step recipe — it’s a set of considerations for the conversation that follows a successful POC.

↳

Before you read this section

Context for a production conversation

A production deployment is an engineering and organizational project. The considerations below are a starting point for scoping, not a complete specification. Every deployment is shaped by data volume, team structure, regulatory context, and the specific questions the system needs to answer.

The Senzing MCP is a useful reference throughout this process — tell your LLM your goals, sizing, and environment, and then ask it for architecture guidance, project planning tips, even a test plan.

Infrastructure & Scale

Volume, database, loader, and cloud

Volume

Record counts and growth

The POC runs comfortably on SQLite at tens of thousands of records. Production deployments of 10M–1B records require a purpose-sized Postgres instance or a distributed Senzing configuration. The Senzing MCP’s architecture docs describe sizing rules by record count — pull them before committing to a database tier.

Database

Engine selection

SQLite is development-only. Postgres is the standard production choice. For very large deployments (>100M records), Senzing supports distributed architectures with multiple database replicas. Database provisioning, tuning, and backup strategy are engineering tasks that belong in a proper infrastructure plan.

Loader

Throughput and concurrency

The POC uses a production-grade loader, but at POC volume the distinction is academic. At scale, loader concurrency, batch size, redo-queue management, and error retry logic become critical. The Senzing MCP anti-patterns documentation covers common failure modes — review it before designing a production pipeline.
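The concerns named above (concurrency, retries, progress reporting) can be illustrated with a small thread-pool loader. This is a sketch, not Senzing's loader: `add_record` is a stand-in callable for whatever SDK call your LLM wires in, the retry policy is deliberately naive, and a real pipeline would bound the number of in-flight records and retry only errors known to be transient.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_records(records, add_record, workers=8, max_retries=3, progress_every=2.0):
    """Feed records to add_record from a thread pool; return (loaded_count, failed)."""

    def attempt(record):
        for try_number in range(1, max_retries + 1):
            try:
                add_record(record)
                return True
            except Exception:
                if try_number == max_retries:
                    return False
                time.sleep(0.01 * try_number)  # small linear backoff
        return False

    loaded, failed = 0, []
    last_report = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Sketch only: submits everything up front, so memory grows with input size.
        futures = {pool.submit(attempt, r): r for r in records}
        for future in as_completed(futures):
            if future.result():
                loaded += 1
            else:
                failed.append(futures[future])
            if time.monotonic() - last_report >= progress_every:
                print(f"loaded={loaded} failed={len(failed)}")
                last_report = time.monotonic()
    return loaded, failed


# Stand-in for the real call: one record fails once, one fails always.
_seen, _lock = set(), threading.Lock()


def fake_add_record(record):
    rid = record["RECORD_ID"]
    if rid == "bad":
        raise ValueError("malformed record")
    with _lock:
        first_time = rid not in _seen
        _seen.add(rid)
    if rid == "flaky" and first_time:
        raise TimeoutError("transient failure")


records = [{"RECORD_ID": str(i)} for i in range(50)]
records += [{"RECORD_ID": "flaky"}, {"RECORD_ID": "bad"}]
loaded, failed = load_records(records, fake_add_record)
print(loaded, failed)
```

Records that exhaust their retries are returned rather than dropped, so they can be written to a reject file and reviewed.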

Cloud

Deployment target

Senzing runs on any infrastructure that can host a Linux process and a supported database. AWS, GCP, and Azure are all viable. Ask your LLM: “What are the considerations for deploying Senzing at 300M records on AWS?” — the MCP will pull the relevant architecture guidance.

Data Quality & Mapping

Field mapping, profiling, and refresh strategy

Mapping

Field mapping at scale

The POC mapping workflow handles one or two sources in a session. A production deployment may involve ten or twenty source systems, each with different schemas, encodings, and quality levels. Mapping becomes a managed artifact — version-controlled, tested, reviewed by data stewards.

Quality

Pre-ingestion profiling

Senzing resolves what it receives. Poor input quality — truncated names, missing address components, inconsistent encoding — produces lower-quality resolution. A production pipeline typically includes a profiling and standardization step before records reach Senzing. This is data engineering work that runs upstream of the resolver.
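A first profiling pass is often just a fill-rate check: for each column, what fraction of rows carry a usable value? The sketch below computes that for a CSV; the sample data and column names are invented, and a real profiling step would also look at formats, lengths, and encodings.

```python
import csv
import io
from collections import Counter


def fill_rates(lines):
    """Return {column: fraction of rows with a non-blank value} for a CSV."""
    reader = csv.DictReader(lines)
    filled = Counter()
    total = 0
    for row in reader:
        total += 1
        for field, value in row.items():
            # Skip missing values and any overflow columns (non-string values).
            if isinstance(value, str) and value.strip():
                filled[field] += 1
    fields = reader.fieldnames or []
    return {f: (filled[f] / total if total else 0.0) for f in fields}


sample = (
    "name,address,dob\n"
    "Ann Lee,1 Main St,1980-01-01\n"
    "Bob Ray,,\n"
    " , 2 Oak Ave,\n"
    "Cy Moss,3 Elm St,\n"
)

for column, rate in fill_rates(io.StringIO(sample)).items():
    print(f"{column}: {rate:.0%} filled")
```

A column with a low fill rate (here, the date of birth) is a signal to check the upstream extract before loading, since fewer identifying attributes mean weaker resolution.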

Refresh

Ongoing data ingestion

The POC is a one-time load. Production deployments need a refresh strategy — incremental loading as source systems update, handling deletes and corrections, managing the redo queue as new records arrive. Senzing supports incremental updates, but the orchestration layer (when to load, how often, how to detect changes) is an engineering decision.
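One common way to detect changes between extracts is to fingerprint each record and compare against the fingerprints from the previous run. The sketch below shows that orchestration step only; the function names are hypothetical, and what you then do with the resulting lists (adds, replacements, deletes against the engine) is left to the loader your LLM builds.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of a record's content, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_refresh(previous, current):
    """Compare the last run's {record_id: fingerprint} with today's {record_id: record}.

    Returns (ids to add or replace, ids to delete, fingerprints to store for next run).
    """
    new_fps = {rid: fingerprint(rec) for rid, rec in current.items()}
    to_upsert = sorted(rid for rid, fp in new_fps.items() if previous.get(rid) != fp)
    to_delete = sorted(rid for rid in previous if rid not in new_fps)
    return to_upsert, to_delete, new_fps


yesterday = {
    "1": {"name": "Ann Lee", "city": "Las Vegas"},
    "2": {"name": "Bob Ray", "city": "Reno"},
    "3": {"name": "Cy Moss", "city": "Henderson"},
}
today = {
    "1": {"city": "Las Vegas", "name": "Ann Lee"},   # unchanged, keys reordered
    "2": {"name": "Bob Ray", "city": "Las Vegas"},   # corrected
    "4": {"name": "Di Voss", "city": "Las Vegas"},   # new
}

previous = {rid: fingerprint(rec) for rid, rec in yesterday.items()}
to_upsert, to_delete, stored = plan_refresh(previous, today)
print(to_upsert, to_delete)
```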

III

Governance & Operations

Thresholds, auditability, access, and licensing

Thresholds

Match confidence tuning

Senzing’s default thresholds work well for most public-record use cases — no training or tuning required out of the box. For production deployment — especially in regulated industries or where false positives carry real consequences — threshold tuning is a deliberate exercise. The Senzing MCP documentation describes how thresholds affect the ambiguous match population and how to evaluate the tradeoffs.

Auditability

Match explainability

One of Senzing’s core properties is that every match decision is explainable — the features that drove resolution are stored and queryable. In a production investigative or compliance context, this auditability is often a requirement. Plan for how match reasoning will be surfaced to end users and how it will be logged for review.
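Logging match reasoning for later review can be as simple as an append-only file of JSON lines. The sketch below assumes a hypothetical `why_result` payload; the field names and the example match key are illustrative, and the real structure is whatever the why-match call your LLM builds actually returns.

```python
import io
import json
from datetime import datetime, timezone


def log_match_explanation(log, entity_id, record_keys, why_result, reviewer=None):
    """Append one audit entry (a single JSON line) to a writable text stream."""
    entry = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "entity_id": entity_id,
        "records": record_keys,
        "why": why_result,
        "reviewer": reviewer,
    }
    log.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


# In production this would be a file opened in append mode or a log service.
audit_log = io.StringIO()

log_match_explanation(
    audit_log,
    entity_id=1001,
    record_keys=[["CUSTOMERS", "42"], ["WATCHLIST", "W1"]],
    why_result={"match_key": "+NAME+ADDRESS", "note": "illustrative payload"},
    reviewer="analyst-1",
)

print(audit_log.getvalue())
```

Storing the explanation alongside who viewed it and when gives auditors a record of what the system knew at decision time, even if the underlying data changes later.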

Access

Who queries the graph

The POC user interface is built by your LLM for your local use. A production system has multiple users with different roles — analysts, investigators, data stewards, auditors — each with different access needs. The query layer, access controls, and UX are separate concerns from the resolution engine itself.

Licensing

Production license

The evaluation license covers 250k records with a 5-day window — sufficient for a POC. A production deployment requires a commercial license sized to your record volume. Contact [email protected] to discuss options — they can also advise on architecture for your specific scale.

A Suggested Next Step

From POC to production deployment

The gap between a POC and a production deployment is mostly engineering and organizational, not technology. The resolution engine you used in the POC is the same one that runs at enterprise scale — the surrounding infrastructure is what grows.

A reasonable path forward:

Step 1

Scope the data

Identify the source systems that matter — not all of them, just the ones where cross-source linking would produce the highest-value results. A scoped first production deployment (2–3 sources, well-mapped) outperforms a broad but shallow one.

Step 2

Size the infrastructure

Use the Senzing MCP architecture docs and your record counts to size a Postgres instance and a loader configuration. Ask your LLM to pull the relevant guidance for your target volume.

Step 3

Engage Senzing

Email [email protected] with your record volume, source count, and deployment target. They can advise on architecture, licensing, and whether professional services would accelerate the project.

What’s next

You have a working POC and a clear picture of what production would involve. The next step is a conversation with the Senzing team about architecture, licensing, and timeline.

—

Why this recipe matters

Combining diverse data has historically been hard — regardless of the datasets involved. For investigative analysis, Senzing’s agentic entity resolution makes it a breeze.

Senzing resolves diverse data without you writing a single matching rule. The identity graph it builds can be used not only for investigative analysis but also for other workflows, including alerting and reporting. For the full range of what’s possible, read this comprehensive guide on Identity Intelligence.

For analysts

Diverse data sources — internal and external — visualized together in a single link chart. Relationships and connections that were previously invisible become immediately apparent.

For FIU (Financial Intelligence Unit) managers

Combining more data sources means higher-quality alerts with richer context. Adding a new internal or external source to the identity graph is straightforward — no custom integration work required.

For data engineers

No ETL pipelines to hand-build. Accurate matching out of the box without model training or tuning. The Senzing MCP guides the LLM, and the result is a fraction of the engineering effort.

For builders

Three prompts take you from zero to a working pipeline: ingest, visualize, extend. The same pattern that runs this POC scales to production. Use agentic entity resolution to add data source N+1 with 100x less effort.

—

Troubleshooting

The LLM says it can't find the Senzing MCP tools.

The LLM asks me to download a file manually.

The data load is taking a long time.

Things don't seem right. Something's wrong.

Having difficulty?

Part of the Senzing Cookbook — practical recipes for putting entity resolution to work, one dish at a time.
