Senzing SDK Linux Quickstart Guide

This article covers installing the Senzing SDK on Linux, loading data and performing entity resolution, analyzing and exploring the entity resolution outcomes, and preparing and loading your own data into Senzing.

Tip

Senzing provides 100k source records for ingestion and evaluation for free. If you require additional records for an evaluation, or any assistance when following this guide, please contact Senzing Support. Support is 100% FREE!

The installation steps add the Senzing software repository to your Linux distribution; these steps only need to be completed once. During installation you will be asked to accept the End User License Agreement (EULA). On Red Hat based distributions you will also be prompted to accept the Senzing public key.

For air-gapped installs, use our air-gapped systems guide to install the packages and then return here to complete the remaining steps.

Tip

In addition to these instructions, the senzingapi and senzingdata packages can be manually downloaded and installed for Debian and RedHat based systems.

Installing Senzing - Debian Based Distributions

Add repository

Add and enable the Senzing repository in the currently configured list managed by apt. This step only needs to be completed once.

sudo apt install apt-transport-https

Info

The new APT senzingrepo v2 repository package works only for Senzing versions >= 3.10.0. It detects architecture and platform. If a prior Senzing version is required, you must install the older senzingrepo v1 repository package: https://senzing-production-apt.s3.amazonaws.com/senzingrepo-1.0.1-1_amd64.deb. Please contact Senzing Support if you have any questions.

wget https://senzing-production-apt.s3.us-east-1.amazonaws.com/senzingrepo_2.0.1-1_all.deb

sudo apt install ./senzingrepo_2.0.1-1_all.deb

sudo apt update

Install package

Warning

The latest version of Senzing can now be installed. As part of the installation you will be asked to accept the End User License Agreement (EULA).

sudo apt install senzingapi

Continue with Creating a Senzing Project…

Installing Senzing - Red Hat Based Distributions

Add repository

Add and enable the Senzing repository in the currently configured list managed by yum. This step only needs to be completed once.

Info

The new YUM senzingrepo v2 repository package works only for Senzing versions >= 3.10.0. It detects architecture and platform. If a prior Senzing version is required, you must install the older senzingrepo v1 repository package: https://senzing-production-yum.s3.amazonaws.com/senzingrepo-1.0.0-2.x86_64.rpm. Please contact Senzing Support if you have any questions.

sudo yum install https://senzing-production-yum.s3.us-east-1.amazonaws.com/senzingrepo-2.0.1-1.noarch.rpm

Install package

The latest version of Senzing can now be installed. As part of the installation you will be asked to accept the End User License Agreement (EULA), which can be viewed at https://senzing.com/end-user-license-agreement/

sudo yum install senzingapi

Tip

During the first installation of Senzing to a system you will also be prompted to accept the Senzing public key. Accepting the prompt imports the public key to verify future installations come from Senzing.

Retrieving key from https://senzing-production-yum.s3.amazonaws.com/senzing-production.key
Importing GPG key 0xD99E309D:
 Userid : "Senzing, Inc. <buildmgr@senzing.com>"
 Fingerprint: e38c a28c f7ab 06d5 120b bda7 4f67 bf4d d99e 309d
 From : https://senzing-production-yum.s3.amazonaws.com/senzing-production.key
Is this ok [y/N]: y

Create a Senzing Project

To begin using Senzing, first create a project. This deploys an instance of Senzing into a specified path. The project folder must not already exist and will be created by the G2CreateProject.py utility.

Creating and using projects provides independent and isolated instances of Senzing. Projects can be upgraded from prior Senzing versions.

This command creates the Senzing project in the current user's home directory, in a new directory named senzing.

python3 /opt/senzing/g2/python/G2CreateProject.py ~/senzing

$ python3 /opt/senzing/g2/python/G2CreateProject.py ~/senzing

Creating Senzing instance at: /home/username/senzing
Senzing version: 3.12.3 - (3.12.3-24323)

Successfully created.
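
Because the project folder must not already exist, it can be useful to check the target path before running G2CreateProject.py. The following is a minimal sketch of such a check using only the Python standard library; the function name project_path_is_available is hypothetical and is not part of Senzing.

```python
from pathlib import Path


def project_path_is_available(project_path: str) -> bool:
    """Return True if nothing exists yet at the intended project path.

    The guide states that the project folder must not already exist,
    so the only check made here is that the path is currently unused.
    """
    return not Path(project_path).expanduser().exists()


# Check the path used in this guide before creating the project.
print(project_path_is_available("~/senzing"))
```

If the check prints False, choose a different path or remove the existing folder before creating the project.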

Info

To help you get started quickly, an embedded SQLite database is configured when a Senzing project is created. SQLite is convenient for evaluation; production systems would use an enterprise-level RDBMS such as Postgres. For additional information see Technical - Database.

Configure Environment

To use your new project, environment variables must be set that indicate where to find the project's resources. The setupEnv script is specific to each project and must be sourced in every new shell session in which you work with the project, for example after logging out and back in. To set up the environment, change to your project directory and source the setupEnv file.

cd <project_path>

source setupEnv

Info

<project_path> refers to the path specified on the G2CreateProject.py command when creating a project.

Updating Database with Senzing Configuration

A Senzing instance is configured with a JSON document. On a fresh installation this document needs to be registered in the Senzing database. This step only needs to be performed once for a new project. From the root of your project directory, run the following command and enter y when prompted:

python3 python/G2SetupConfig.py

Loading the Sample Truth Set Data

You can now load some sample demo data into Senzing using the G2Loader utility. G2Loader is a sample application for loading data that calls the Senzing SDK, the same SDK you would call when building your own applications or embedding Senzing into other systems or processes.

Add Data Source Codes

The three sample files to load represent three different data sources: customers, a watchlist, and reference data. Records loaded into Senzing have an identifier attribute called DATA_SOURCE. This is an arbitrary value that describes and identifies where source records originated, and it is a useful designation when analyzing and reporting on entities.

Each of the records in the three files uses one of the DATA_SOURCE codes: CUSTOMERS, REFERENCE or WATCHLIST. Before data can be loaded using these values, they need to be added to the Senzing configuration. This only needs to be completed once for each DATA_SOURCE value. The G2ConfigTool.py utility performs this configuration change. To start G2ConfigTool.py:

python3 python/G2ConfigTool.py

Once at the (g2cfg) prompt enter the following commands:

addDataSource CUSTOMERS
addDataSource REFERENCE
addDataSource WATCHLIST
save
y
quit

$ python3 python/G2ConfigTool.py

Initializing Senzing engines...

Welcome to G2Config Tool. Type help or ? to list commands.

(g2cfg) addDataSource CUSTOMERS

Successfully added!

(g2cfg) addDataSource REFERENCE

Successfully added!

(g2cfg) addDataSource WATCHLIST

Successfully added!

(g2cfg) save

WARNING: This will immediately update the current configuration in the Senzing repository with the current configuration!

Are you certain you wish to proceed and save changes? (y/n)  y

Configuration saved to Senzing repository.

Initializing Senzing engines...

(g2cfg) quit
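
For files with many records, the DATA_SOURCE codes to add can be collected from the data itself. The sketch below assumes the input holds one JSON record per line and that every record has a DATA_SOURCE key; the helper name data_source_commands is hypothetical. It only prints the addDataSource commands to type at the (g2cfg) prompt and does not call Senzing.

```python
import json
from typing import Iterable, List


def data_source_commands(lines: Iterable[str]) -> List[str]:
    """Build one addDataSource command per distinct DATA_SOURCE value."""
    codes = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        codes.add(record["DATA_SOURCE"])
    return [f"addDataSource {code}" for code in sorted(codes)]


# Illustrative records only; real input would come from an open file.
sample = [
    '{"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "1001"}',
    '{"DATA_SOURCE": "WATCHLIST", "RECORD_ID": "1008"}',
    '{"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "1002"}',
]
for command in data_source_commands(sample):
    print(command)
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines.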

Loading

With the data source codes added, load each file with the following commands:

python3 python/G2Loader.py -f python/demo/truth/customers.json

python3 python/G2Loader.py -f python/demo/truth/reference.json

python3 python/G2Loader.py -f python/demo/truth/watchlist.json
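
When there are more than a few files, the same commands can be generated rather than typed. This is a small sketch that only builds the argument lists shown above; the names TRUTH_SET_FILES and loader_commands are hypothetical, and nothing is executed.

```python
from typing import List

# The three truth set files used in this guide, relative to the project root.
TRUTH_SET_FILES = [
    "python/demo/truth/customers.json",
    "python/demo/truth/reference.json",
    "python/demo/truth/watchlist.json",
]


def loader_commands(files: List[str]) -> List[List[str]]:
    """Return one G2Loader argument list per file, in the order given."""
    return [["python3", "python/G2Loader.py", "-f", path] for path in files]


for argv in loader_commands(TRUTH_SET_FILES):
    print(" ".join(argv))
```

Each argument list could be passed to subprocess.run(argv, check=True), provided the script is started from the project root in a shell where setupEnv has already been sourced.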

Senzing operates in real time: as each record is loaded, it completes the entity resolution process. As a result, every record within and across the files has been resolved against all other data, and the outcomes are persisted in the Senzing database.

Info

To learn more about the entity resolution process, check out these Senzing white papers.

Exploring Entity Resolution Outcomes

Loading data into Senzing completes the entity resolution processing. The results can now be reviewed, explored and evaluated with the Exploratory Data Analysis (EDA) tools. The EDA tools consist of:

G2Explorer.py for understanding how and why entities are resolved and related
G2Snapshot.py for calculating reports to be viewed with G2Explorer
G2Audit.py for comparing results between Senzing and other technologies or comparing Senzing results between configurations

To begin exploring the EDA tools, review Exploratory Data Analysis (EDA) tools. Once you have an overview of EDA tools and their functionality it is recommended to explore G2Explorer and G2Snapshot on the previously loaded truth set data.

Tip

The EDA tools articles outline loading the truth set data. You don't need to do this again because it was completed in the prior step.

G2Explorer

To get started with G2Explorer.py, run the following command:

python3 python/G2Explorer.py

$ python3 python/G2Explorer.py

  ____|  __ \     \
  __|    |   |   _ \   Senzing G2
  |      |   |  ___ \  Exploratory Data Analysis
 _____| ____/ _/    _\


sucessfully loaded snapshottest.json


Type help or ? to list commands.

(g2)

Tip

The EDA tools have built in help!

(g2) help

Adhoc entity commands:
    search - search for entities by name and/or other attributes.
    get - get an entity by entity ID or record_id.
    compare - place two or more entities side by side for easier comparison.
    how - get a step by step replay of how an entity came together.
    why - see why entities or records either did or did not resolve.
    tree - see a tree view of an entity's relationships through 1 or 2 degrees.
    export - export the json records for an entity for debugging or correcting and reloading.

Snapshot reports: (requires a json file created with G2Snapshot)
    dataSourceSummary – shows how many duplicates were detected within each data source, as well as
    the possible matches and relationships that were derived. For example, how many duplicate customers
    there are, and are any of them related to each other.
    crossSourceSummary – shows how many matches were made across data sources.  For example, how many
    employees are related to customers.
    entitySizeBreakdown – shows how many entities of what size were created.  For instance, some entities
    are singletons, some might have connected 2 records, some 3, etc.  This report is primarily used to
    ensure there are no instances of over matching.   For instance, it’s ok for an entity to have hundreds
    of records as long as there are not too many different names, addresses, identifiers, etc.

Audit report: (requires a json file created with G2Audit)
    auditSummary - shows the precision, recall and F1 scores with the ability to browse the entities that
    were split or merged.

Other commands:
    quickLook - show the number of records in the repository by data source without a snapshot.
    load - load a snapshot or audit report json file.
    score - show the scores of any two names, addresses, identifiers, or combination thereof.
    set - various settings affecting how entities are displayed.

Senzing Knowledge Center: https://senzing.zendesk.com/hc/en-us
Senzing Support Request: https://senzing.zendesk.com/hc/en-us/requests/new


(g2) help get

Displays a particular entity by entity_id or by data_source and record_id.

Syntax:
    get <entity_id>               looks up an entity's resume by entity ID
    get <dataSource> <recordID>   looks up an entity's resume by data source and record ID
    get search <search index>     looks up an entity's resume by search index (requires a prior search)
    get detail <entity_id>        adding the "detail" tag displays each record rather than a summary by de
    get features <entity_id>      adding the "features" tag displays the entity features rather than the e

Notes:
    Add the keyword ALL to display all the attributes of the entity if there are more than 50.

(g2)

get

The get command displays details for an entity, in this instance looked up by the data source code and record id:

get customers 1070

(g2) get customers 1070

Entity summary for entity 55: Jie Wang
┼───────────┼───────────────────────────────┼─────────────────┼
│ Record ID │ Entity Data                   │ Additional Data │
┼───────────┼───────────────────────────────┼─────────────────┼
│ CUSTOMERS │ PRIMARY: Wang Jie             │ AMOUNT: 100     │
│ 1069      │ NATIVE: 王杰                  │ AMOUNT: 200     │
│ 1070      │ DOB: 9/14/93                  │ DATE: 1/26/18   │
│           │ GENDER: M                     │ DATE: 1/27/18   │
│           │ GENDER: Male                  │ STATUS: Active  │
│           │ RECORD_TYPE: PERSON           │                 │
│           │ NATIONAL_ID: 832721           │                 │
│           │ NATIONAL_ID: 832721 Hong Kong │                 │
│           │ HOME: 12 Constitution Street  │                 │
┼───────────┼───────────────────────────────┼─────────────────┼
│ REFERENCE │ PRIMARY: Wang Jie             │ CATEGORY: Owner │
│ 2013      │ DOB: 1993-09-14               │ STATUS: Current │
│           │ RECORD_TYPE: PERSON           │                 │
┼───────────┼───────────────────────────────┼─────────────────┼
└── Disclosed relationships (1)
    └── OWNS 60% (1)
        └── 91 CUSTOMERS (1) | REFERENCE (1) Hajah Mamunah Jln Pisang

(g2)

search

Perform a search for an entity:

search {"name_full": "robert smith", "date_of_birth": "11/12/1978"}

(g2) search {"name_full": "robert smith", "date_of_birth": "11/12/1978"}

Searching ...


Search Results
┼───────┼───────────┼──────────────┼──────────────────────┼─────────────────────────────┼─────────────┼──┼
│ Index │ Entity ID │ Entity Name  │ Data Sources         │ Match Key                   │ Match Score │ R│
┼───────┼───────────┼──────────────┼──────────────────────┼─────────────────────────────┼─────────────┼──┼
│   1   │     1     │ Robert Smith │ CUSTOMERS: 4 records │ NAME+DOB                    │     200     │ 3│
│       │           │              │                      │  Principle 180: SNAME_SSTAB │             │  │
┼───────┼───────────┼──────────────┼──────────────────────┼─────────────────────────────┼─────────────┼──┼
│   2   │   100003  │ Robert Smith │ WATCHLIST: 1008      │ NAME                        │     100     │ 2│
│       │           │              │                      │  Principle 206: CNAME       │             │  │
┼───────┼───────────┼──────────────┼──────────────────────┼─────────────────────────────┼─────────────┼──┼


(g2)

You’ll learn about the JSON structure in the next section - Mapping and Loading Your Own Data.
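
If search attributes come from another program, the search command can be assembled with a JSON library so that quoting is always valid. This is a minimal sketch; search_command is a hypothetical helper, and the attribute names are the ones used in the example above.

```python
import json


def search_command(**attributes: str) -> str:
    """Format a G2Explorer search command from keyword attributes."""
    return "search " + json.dumps(attributes)


print(search_command(name_full="robert smith", date_of_birth="11/12/1978"))
```

The printed line can be pasted at the (g2) prompt.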

Tip

Try out the other examples in the G2Explorer.py article and explore the commands and their options using help.

Mapping and Loading Your Own Data

Mapping

At this point you are ready to map and load your own data. Mapping is the process of converting your source data into a structure that Senzing understands and that is ready to load.

Info

To learn more about mapping, review the Senzing Entity Specification. It includes the dictionary of terms and samples to help you prepare your own data sources for loading and entity resolution.

Consider these examples. In your data, an attribute describing a person's full name is in a database table with the column name fullname. In Senzing a full name is represented by the term NAME_FULL. Similarly, for address line 1, your database column is named addressline1. In Senzing this is represented by the term ADDR_LINE1.

Your task in mapping is to determine which attributes in your data source(s) are appropriate for use in entity resolution, extract those attributes and construct the structure describing those attributes to send to Senzing. The following is an example of a Senzing mapped JSON structure for an entry from a data source.

{
  "DATA_SOURCE": "CUSTOMERS",
  "RECORD_ID": "1001",
  "RECORD_TYPE": "PERSON",
  "PRIMARY_NAME_LAST": "Smith",
  "PRIMARY_NAME_FIRST": "Robert",
  "DATE_OF_BIRTH": "12/11/1978",
  "ADDR_TYPE": "MAILING",
  "ADDR_LINE1": "123 Main Street, Las Vegas NV 89132",
  "PHONE_TYPE": "HOME",
  "PHONE_NUMBER": "702-919-1300",
  "EMAIL_ADDRESS": "bsmith@work.com"
}
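
The column-to-term mapping described above can be sketched in a few lines of Python. The source columns fullname and addressline1 and the Senzing terms NAME_FULL and ADDR_LINE1 come from the example in this section; the id column and the function name map_row are hypothetical and would be replaced by whatever your source data provides.

```python
import json
from typing import Dict


def map_row(row: Dict[str, str], data_source: str) -> Dict[str, str]:
    """Convert one source row into a Senzing-style record."""
    record = {
        "DATA_SOURCE": data_source,
        "RECORD_ID": row["id"],  # "id" is a hypothetical source column
        "NAME_FULL": row["fullname"],
        "ADDR_LINE1": row["addressline1"],
    }
    # Drop empty values so only populated attributes are included.
    return {key: value for key, value in record.items() if value}


row = {
    "id": "1001",
    "fullname": "Robert Smith",
    "addressline1": "123 Main Street, Las Vegas NV 89132",
}
print(json.dumps(map_row(row, "CUSTOMERS")))
```

Serializing with json.dumps rather than building strings by hand avoids syntax mistakes such as trailing commas or unescaped quotes.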

Tip

Additionally, you can view the files for the sample truth set data under the python/demo/truth path in your project. Review the customers.json, reference.json, and watchlist.json truth set files.

Loading

Once you have mapped your own data source(s), it’s time to load them. Before loading your own data, you’ll want to purge the Senzing database, which contains the sample truth set data. Purging the Senzing database completely removes all previously loaded data and entity resolution outcomes. Use with caution!

The G2Command utility is one method of purging the Senzing database. To start G2Command:

python3 python/G2Command.py

Once at the (g2cmd) prompt enter the following commands:

purgeRepository
y
quit

$ python3 python/G2Command.py

Welcome to G2Command. Type help or ? for help.

(g2cmd) purgeRepository

********** WARNING **********

This will purge all currently loaded data from the senzing database!
Before proceeding, all instances of senzing (custom code, rest api, redoer, etc.) must be shut down.

********** WARNING **********

Are you sure you want to purge the senzing database? (y/n) y

Purging the Senzing database (and resetting resolver)...

(g2cmd) quit
$

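Next, add the data source code for your own data to the Senzing configuration. Start G2ConfigTool.py as before with python3 python/G2ConfigTool.py. Once at the (g2cfg) prompt, enter the following commands, where datasourcecode is the value you used for DATA_SOURCE during mapping (PROSPECT in the example session that follows):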

addDataSource datasourcecode
save
y
quit

$ python3 python/G2ConfigTool.py

Initializing Senzing engines...

Welcome to G2Config Tool. Type help or ? to list commands.

(g2cfg) addDataSource PROSPECT

Successfully added!

(g2cfg) save

WARNING: This will immediately update the current configuration in the Senzing repository with the current configuration!

Are you certain you wish to proceed and save changes? (y/n)  y

Configuration saved to Senzing repository.

Initializing Senzing engines...

(g2cfg) quit

$

You are now ready to load your data, again using the G2Loader utility that was used to load the sample truth set data. For example, if you have a file containing mapped data describing prospects, the following command would load the file:

python3 python/G2Loader.py -f prospects.json
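
Before running a load like the one above, a quick check of the mapped file can catch common mistakes. The sketch below assumes one JSON record per line. Treating both DATA_SOURCE and RECORD_ID as required is a choice made for this sketch, not a statement of what G2Loader enforces, and the names REQUIRED_KEYS and find_problems are hypothetical.

```python
import json
from typing import Iterable, List

# Keys this sketch expects on every record (an assumption, see above).
REQUIRED_KEYS = ("DATA_SOURCE", "RECORD_ID")


def find_problems(lines: Iterable[str]) -> List[str]:
    """Describe every line that is not valid JSON or lacks an expected key."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        for key in REQUIRED_KEYS:
            if key not in record:
                problems.append(f"line {number}: missing {key}")
    return problems


# Illustrative lines only: the second has a trailing comma, the third no RECORD_ID.
sample_lines = [
    '{"DATA_SOURCE": "PROSPECT", "RECORD_ID": "1", "NAME_FULL": "Ann Lee"}',
    '{"DATA_SOURCE": "PROSPECT", "RECORD_ID": "2",}',
    '{"DATA_SOURCE": "PROSPECT", "NAME_FULL": "Bo Chan"}',
]
for problem in find_problems(sample_lines):
    print(problem)
```

To check a real file, pass an open file object, for example find_problems(open("prospects.json", encoding="utf-8")).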

Once loading completes, revisit using the EDA tools to explore and analyze the outcomes of entity resolution on your data.

Start developing

Members of our team have made some GitHub projects that show more of what you can do quickly:

SDK reference documentation
Senzing in 3 Python Calls
Task-based code-snippets
Python: Streamlined SQS, RabbitMQ, and Redo processing examples
Java: Streamlined RabbitMQ and Redo processing examples

Tip

Don’t forget you can reach out to support if you need any assistance with getting started with Senzing. Support is 100% FREE!