When GraphRAG Goes Bad:
A Study in Why Knowledge Graphs Cannot Afford to Ignore Entity Resolution
Dr. Clair Sullivan
Let's face it. If you have been working with generative AI (GenAI) and large language models (LLMs) in any serious way, you will have had to develop a strategy for minimizing hallucinations. One of the most popular and effective ways of doing so is retrieval-augmented generation (RAG), whereby you supplement what the LLM already knows with up-to-date, detailed information. For many practitioners, this means taking a corpus of text and converting it to a series of word embeddings. However, several people like Juan Sequeda have been pointing out that a more accurate way of doing RAG is with a knowledge graph.
Like pretty much everything in data science, there is no proverbial silver bullet. Knowledge graphs can be hard to assemble. If you have seen some of my pre-GenAI content about this, you have seen that demonstrated in detail (fortunately, it has gotten significantly better since the release of tools such as LangChain). However, as in many areas of data science, the data behind knowledge graphs still suffers from GIGO: "garbage in, garbage out." One of the most common sources of data "garbage," and one that is particularly true for knowledge graphs, is a lack of quality entity resolution.
Entity resolution (ER) means determining whether two (or more) representations of real-world entities, whether rows in a table or nodes in a graph, refer to the same thing. As an easy example, suppose you have two entries for a business. They might have two different names in the database, such as "Joe's Building and Construction Company, LLC" and "Joe's Bldg Inc." Maybe some simple RegEx can detect that. Perhaps they have two slightly different addresses, such as "123 First Street" and "123 1st St." OK, that might be a bit harder with RegEx. But what if it is really ugly, like comparing the following records:
• Elizabeth Hampton MD, 456 Elm Rd., Boston, MA 02108, with a driver's license number of MA987654321 and a date of birth of 03/19/1992
• Beth Hampton, with an address of 456 Elm, Boston, a date of birth of MAR-1992, and a mobile phone number of 202-555-1212
• Dr. Elizbeth Connor, with an address of 123 North First Street, Washington, DC 20001, a driver's license number of DC123456789, a mobile phone number of 202-555-1212, and a DOB of MAR-1992
These are all the same person: Beth is short for Elizabeth, and perhaps she changed her last name while in DC at some point. No RegEx in the world is going to save you in this case! Neither will an edit distance algorithm.
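To make that concrete, here is a minimal sketch (mine, not from any ER product) of why simple normalization plus string similarity gives up on exactly these records:

import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Crude normalization: lowercase, drop honorifics/suffixes and punctuation.
    name = name.lower()
    name = re.sub(r"\b(dr|md|mr|ms|mrs)\b\.?", " ", name)
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())

a = normalize("Elizabeth Hampton MD")  # -> "elizabeth hampton"
b = normalize("Dr. Elizbeth Connor")   # -> "elizbeth connor"

# An edit-distance-style similarity score lands well below any safe
# auto-merge threshold, even though both records describe the same person.
print(a, "|", b, "|", round(SequenceMatcher(None, a, b).ratio(), 2))

Matching these records requires the corroborating evidence in the other fields (the shared phone number and birth month), which is exactly the kind of multi-feature scoring an ER engine does.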
And those are just examples for names and strings. This problem can get really ugly when you start talking about things like business ownership. For example, consider the retail company Kroger. They operate many grocery stores called "Kroger," but they also own a variety of other stores including City Market, Smith's, and Harris Teeter, to name a few. If you wanted to ask your LLM about Kroger, would it understand that these companies all roll up into the same parent company?
ER can be very difficult to do well. Paco Nathan recently published a blog post highlighting the use of the commercial ER product by Senzing to create entity-resolved knowledge graphs (ERKGs). Senzing is industrial-grade ER. I like their description of their product: "We sell transmissions, not cars." In other words, they are not trying to be all things to all people. They have a specific problem that they are trying to solve, entity resolution in real time and at scale, and they are hyper-focused on that goal. I have seen many places do ER poorly, so I was curious to see Senzing's offering at work. They conveniently offer a free tier that will analyze up to 100,000 records. I decided to give it a try and look at what it could do for improving general data science capabilities as well as working with LLMs. As usual, all of the code that will be shown in this post can be found on my GitHub.
The Data and Setup
In this blog post I will be using the same data set regarding businesses in the Las Vegas, NV area as the original article, which comes from a few different sources:
- SafeGraph: a dataset of places of interest such as businesses and organizations
- US Department of Labor Wage and Hour Compliance Action Data (DOL_WHISARD): a publicly-available dataset of labor violations of various companies
- US Small Business Administration PPP Loans over $150K (PPP_LOANS): a publicly-available data set of Paycheck Protection Program (PPP) loans, federal loans awarded to help small businesses keep their employees during the COVID-19 pandemic
There are several different ways that you can run Senzing. In the original post Paco hosted it on a virtual Linux machine and famously ran up a bill for $0.04 to do the full ER. I decided to do it a different way and use Docker. You can find a very useful Senzing Docker Quickstart Guide here.
In the GitHub repo you will see that I actually am using Docker in two different ways. First, there are the tools we are going to use that have nothing to do with Senzing. This includes the graph database we will use (Neo4j) and PostgreSQL, which is required for Senzing. For clarity, I decided to keep those separate from everything else and run them through docker-compose.
version: '3.7'
services:
  portainer:
    image: portainer/portainer-ce:latest
    container_name: portainer
    security_opt:
      - no-new-privileges:true
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./portainer-data:/data
    ports:
      - "9443:9443"
      - "8000:8000"
  neo4j:
    image: neo4j:5.16.0
    container_name: neo4j
    volumes:
      - $HOME/graph_data/testing:/data
      - ./data:/var/lib/neo4j/import
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
      - NEO4J_AUTH=neo4j/graphs_@re_c00l
      - NEO4J_PLUGINS=["apoc", "graph-data-science"]
      - apoc.import.file.enabled=true
      - apoc.export.file.enabled=true
      - NEO4J_server_memory_pagecache_size=4G
      - NEO4J_server_memory_heap_initial__size=4G
      - NEO4J_server_memory_heap_max__size=8G
  postgres:
    image: postgres
    container_name: pg_container
    volumes:
      - $HOME/senzing_pg_data:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: letmein
      POSTGRES_DB: erkg
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    container_name: pgadmin
    ports:
      - "5050:80"
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: letmein
We now start these containers with the commands:
docker-compose build
docker-compose up
Once they have started, you can navigate your browser to Portainer at localhost:9443 and set up a login and password. This is a useful little UI for seeing all of your running containers, and we will use this information when we fire up Senzing. In particular, you should navigate into the container list, which will show you all of the running containers as well as their internal IP addresses.
(You could also get this information from the command line through docker inspect, but I just happen to find Portainer handy.)
We will specifically need to pay attention to the IP address of the PostgreSQL container, called pg_container above with an address of 172.20.0.5. (Be sure to record whatever it is for your computer, noting that it will change if you restart the container.) Using this and following along with the Docker Quickstart Guide, we will need to set the following environment variable:
export SENZING_ENGINE_CONFIGURATION_JSON='{ "PIPELINE" : { "CONFIGPATH" : "/etc/opt/senzing", "RESOURCEPATH" : "/opt/senzing/g2/resources", "SUPPORTPATH" : "/opt/senzing/data" }, "SQL" : { "CONNECTION" : "postgresql://postgres:letmein@172.20.0.5:5432:erkg" }}'
Here you see we have used the IP address for PostgreSQL provided by Portainer along with its password established in the docker-compose.yml. Essentially what we are doing is telling Senzing where to look for PostgreSQL, which it will use through various phases in its ER process.
We are now ready to fire up the Senzing container. To do this, we will initialize the Senzing-PostgreSQL configuration:
docker run --rm -it --network host -e SENZING_ENGINE_CONFIGURATION_JSON senzing/init-postgresql mandatory
Next, we need to configure the Senzing API to do the ER:
docker run --rm -it --network host -e SENZING_ENGINE_CONFIGURATION_JSON senzing/senzingapi-tools
This will take you into the running container. There are several Python scripts within this container, but to get things started you will run G2ConfigTool.py (it is located in /opt/senzing/g2/python but is linked such that you don't have to go into that directory if you don't want to). Once in the configuration tool, you will add some data sources and adjust some templates. We will keep this consistent with the original article, although we are running it manually. To do so, enter the following commands (each on a separate line):
addDataSource DOL_WHISARD
addDataSource PPP_LOANS
addDataSource SAFEGRAPH
templateAdd {"feature": "PLACEKEY", "template": "global_id", "behavior": "F1E", "comparison": "exact_comp"}
addGenericThreshold {"plan": "INGEST", "behavior": "FF", "feature": "ADDRESS", "candidateCap": 10, "scoringCap": 500, "sendToRedo": "Yes"}
save
Excellent! It is now time to get down to business and run the ER! We can exit that API Docker container for now. We will use a different Docker container to load the 3 files we have downloaded.
docker run -it --rm --network host -u $UID -v ${PWD}:/data -e SENZING_ENGINE_CONFIGURATION_JSON senzing/file-loader -f /data/SafeGraph_LasVegas_Partial.json
docker run -it --rm --network host -u $UID -v ${PWD}:/data -e SENZING_ENGINE_CONFIGURATION_JSON senzing/file-loader -f /data/Dept_Labor_Whisard_LasVegas.json
docker run -it --rm --network host -u $UID -v ${PWD}:/data -e SENZING_ENGINE_CONFIGURATION_JSON senzing/file-loader -f /data/PPP_Loans_Over_150k_LasVegas.json
I really like this file loader because it handles all of the parallelization for you without having to do anything fancy. On my very modest Linux machine it took under 3 minutes to run. And then you are done!
Now let's see what we have by hopping back into the API container:
docker run --rm -it --network host -e SENZING_ENGINE_CONFIGURATION_JSON senzing/senzingapi-tools
The next tool we will use is G2Explorer.py. This gives you direct access to the ER results. It is pretty elegant because Senzing is just a JSON-in, JSON-out engine, and G2Explorer.py is a handy tool for interacting with that JSON via the command line. (As you get more proficient with using Senzing for ER you will likely wind up writing your own tooling around that JSON, but G2Explorer.py is an easy way to get some quick views into the results.) I would encourage you to read the docs by typing ? or help at the G2 prompt because there are a ton of things this tool can do. Let's start with a simple search for an entity called "Desert Springs Landscaping." Here is a screen shot (I am using screen shots because the colors of the text, which are hard to maintain on this blogging platform, have meaning):
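For reference, the command behind that screenshot is G2Explorer's search verb, which takes a JSON string of attributes. It looks roughly like this (treat the exact attribute key as an assumption; check the built-in help for your version):

search {"name_org": "Desert Springs Landscaping"}

You can then pull the full detail for any hit with get followed by its entity ID.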
It is great that it found a single entity, which is the whole point! Senzing assigns each resolved entity an Entity ID, but there can be other entities linked to it. For example, let's explore what those relationships are that it found:
We can see in the above that Senzing was able to resolve two records, one from the PPP Loans data and one from SafeGraph, into the single entity at the top. This is great because you can see that there are some slight differences in the names and addresses of the two records. You also see a series of entities that are possibly related to this entity, which can be very helpful. Let's look at that first one, whose name is similar; it sounds like maybe the same company has two different arms: one that does landscaping and the other that does pools and spas. We can get information on that entity using its entity ID as shown:
We can see here another relationship that is potentially interesting: "Disclosed relationships." Here we have a parent that has been identified for this entity. Let's dig into that one:
Just looking at the list of disclosed relationships, I see that there are many unrelated businesses. However, if you do a Google search you will find that Blue Diamond Crossing is a very large strip mall where all of these businesses are located. So far, this is looking pretty promising to me! I would encourage you to poke around in G2Explorer.py because there are a lot of great insights to be had already!
Getting This Data Out For Consumption
What I am going to show now is how we can bring the ER results outside of the container and back to our local machine. This is just for demonstration purposes: if you were running a proper production ER system, you would have far too much data to export this way and would instead interact with the results directly from Python.
We need to export the ER results to a file and then copy that file to our local system. There are a zillion data features (a full list can be found here) we could choose to export, and for demonstration purposes I am going to get most of them. Again, this is not something we would do in production, but here is how you do it:
exportJSONEntityReport entities.json -f G2_EXPORT_INCLUDE_ALL_ENTITIES G2_ENTITY_INCLUDE_ALL_RELATIONS G2_ENTITY_INCLUDE_RECORD_DATA G2_ENTITY_INCLUDE_RELATED_RECORD_DATA G2_ENTITY_INCLUDE_RECORD_MATCHING_INFO G2_ENTITY_INCLUDE_RELATED_MATCHING_INFO
Now we can copy the file entities.json to our local machine:
docker cp e0678f39601f:/opt/senzing/g2/python/entities.json .
(Note that e0678f39601f is the container ID of my senzing/senzingapi-tools container that I obtained via docker ps -a. You will need to use your actual container value.)
Bringing it into a RAG
A RAG based on the tabular data
Before we look at whether the resolved data helps the LLM, we should start by asking questions of the non-resolved data. Remember that this data is tabular. You could try to do things to establish relationships among this data to create some sort of graph, but these relationships will be artificial at best and likely miss many things. Instead, I opted to use a quickie SQLite database that I made from Pandas with the following code to ingest it:
import json
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
from langchain_openai import ChatOpenAI
from langchain_community.agent_toolkits import create_sql_agent
from langchain_community.utilities import SQLDatabase

sg_data = []
with open('./SafeGraph_LasVegas_Partial.json') as f:
    for line in f:
        sg_data.append(json.loads(line))
sg_df = pd.DataFrame(sg_data)

dol_data = []
with open('./Dept_Labor_Whisard_LasVegas.json') as f:
    for line in f:
        dol_data.append(json.loads(line))
dol_df = pd.DataFrame(dol_data)
dol_df = dol_df.iloc[:, :13]  # Each individual infraction count is shown beyond column 13

ppp_data = []
with open('./PPP_Loans_Over_150k_LasVegas.json') as f:
    for line in f:
        ppp_data.append(json.loads(line))
ppp_df = pd.DataFrame(ppp_data)

dol_df.replace('null', np.nan, inplace=True)

engine = create_engine("sqlite:///erkg_demo.db")
sg_df.to_sql("sg", engine, index=False)
ppp_df.to_sql("ppp", engine, index=False)
dol_df.to_sql("dol", engine, index=False)
Note that for brevity I elected not to get every single different violation type from the DOL_WHISARD data set and just am using the total number of violations.
Now it is time to create an LLM and start asking some questions. For the sake of this post I am not going to get into prompt engineering too much and have just elected to do something very basic. If you were truly creating a production LLM system you would spend a lot of time on the prompt and do serious testing of it, such as with LangSmith. For demonstration purposes, I want to make sure that we are not left wondering whether the prompt itself is what is generating good or bad results.
Next we need to establish the SQL agent in LangChain and connect it to the database:
db = SQLDatabase(engine=engine)
print(db.dialect)
print(db.get_usable_table_names())
print(db.run("SELECT count(*) FROM sg;"))
print(db.run("SELECT count(*) FROM ppp;"))
print(db.run("SELECT count(*) FROM dol;"))
>>
sqlite
['dol', 'ppp', 'sg']
[(79946,)]
[(3488,)]
[(1554,)]
Now that we have the connection, we can create an SQL agent for asking questions of the database:
sql_agent_executor = create_sql_agent(
    ChatOpenAI(temperature=0, model="gpt-4o-2024-05-13"),
    db=db,
    agent_type="openai-tools",
    verbose=True
)
sql_agent_executor.invoke('what tables are present?')
>>
> Entering new SQL Agent Executor chain…
Invoking: `sql_db_list_tables` with `{}`
dol, ppp, sg
The database contains the following tables:
1. dol
2. ppp
3. sg
If you have any specific questions about these tables or need information from them, please let me know!
> Finished chain.
{'input': 'what tables are present?',
'output': 'The database contains the following tables:\n1. dol\n2. ppp\n3. sg\n\nIf you have any specific questions about these tables or need information from them, please let me know!'}
Now we are cooking! I may never write SQL again!
sql_agent_executor.invoke('tell me about the tables in the database')
>>
> Entering new SQL Agent Executor chain…
Invoking: `sql_db_list_tables` with `{}`
dol, ppp, sg
Invoking: `sql_db_schema` with `{'table_names': 'dol, ppp, sg'}`
CREATE TABLE dol (
    "RECORD_TYPE" TEXT,
    "DATA_SOURCE" TEXT,
    "RECORD_ID" BIGINT,
    case_id BIGINT,
    "BUSINESS_NAME_ORG" TEXT,
    "LEGAL_NAME_ORG" TEXT,
    "BUSINESS_ADDR_LINE1" TEXT,
    "BUSINESS_ADDR_CITY" TEXT,
    "BUSINESS_ADDR_STATE" TEXT,
    "BUSINESS_ADDR_POSTAL_CODE" TEXT,
    naic_cd TEXT,
    naics_code_description TEXT,
    case_violtn_cnt BIGINT
)
/*
3 rows from dol table:
RECORD_TYPE DATA_SOURCE RECORD_ID case_id BUSINESS_NAME_ORG LEGAL_NAME_ORG BUSINESS_ADDR_LINE1 BUSINESS_ADDR_CITY BUSINESS_ADDR_STATE BUSINESS_ADDR_POSTAL_CODE naic_cd naics_code_description case_violtn_cnt
ORGANIZATION DoL_WHISARD 53 1658108 Fabulous Freddy’s (Trailwood) Fabulous LLC 9611 Trail Wood Drive Las Vegas NV 89134 811192 Car Washes 0
ORGANIZATION DoL_WHISARD 165 1419689 Boulder Station Hotel & Casino Boulder Station, Inc. 4111 Boulder Hwy Las Vegas NV 89121 721120 Casino Hotels 1
ORGANIZATION DoL_WHISARD 178 1424856 MSI Landscaping MIST Systems International, Inc. 4820 Quality Court, #B Las Vegas NV 89103 561730 Landscaping Services 14
*/
CREATE TABLE ppp (
    "RECORD_TYPE" TEXT,
    "DATA_SOURCE" TEXT,
    "RECORD_ID" BIGINT,
    "Loan_Range" TEXT,
    "BUSINESS_NAME_ORG" TEXT,
    "BUSINESS_ADDR_LINE1" TEXT,
    "BUSINESS_ADDR_CITY" TEXT,
    "BUSINESS_ADDR_STATE" TEXT,
    "BUSINESS_ADDR_POSTAL_CODE" TEXT,
    "NAICS_Code" TEXT,
    "Business_Type" TEXT,
    "OwnedByRaceEthnicity" TEXT,
    "OwnedBy" TEXT,
    "OwnedByVeteran" TEXT,
    "NonProfit" TEXT,
    "JobsReported" TEXT,
    "DateApproved" TEXT,
    "Lender" TEXT,
    "CD" TEXT
)
/*
3 rows from ppp table:
RECORD_TYPE DATA_SOURCE RECORD_ID Loan_Range BUSINESS_NAME_ORG BUSINESS_ADDR_LINE1 BUSINESS_ADDR_CITY BUSINESS_ADDR_STATE BUSINESS_ADDR_POSTAL_CODE NAICS_Code Business_Type OwnedByRaceEthnicity OwnedBy OwnedByVeteran NonProfit JobsReported DateApproved Lender CD
ORGANIZATION PPP_LOANS 7017 c $1-2 million INFINITY HOSPICE CARE OF LAS VEGAS, LLC 5110 N 40TH ST STE 107 PHOENIX AZ 85018 623110 Limited Liability Company(LLC) Unanswered Male Owned Unanswered 137 05/01/2020 JPMorgan Chase Bank, National Association AZ-09
ORGANIZATION PPP_LOANS 7018 d $350,000-1 million CLUB TATTOO LAS VEGAS LLC 1839 s. almaschool rd. ste 230 MESA AZ 85210 812199 Limited Liability Company(LLC) Unanswered Male Owned Non-Veteran 15 05/01/2020 JPMorgan Chase Bank, National Association AZ-09
ORGANIZATION PPP_LOANS 7021 d $350,000-1 million LAS VEGAS LABOR, LLC 10265 W Camelback Rd, Ste 104 PHOENIX AZ 85037 111421 Limited Liability Company(LLC) Unanswered Unanswered Unanswered 0 04/07/2020 UMB Bank, National Association AZ-03
*/
CREATE TABLE sg (
    "DATA_SOURCE" TEXT,
    "RECORD_ID" TEXT,
    "RECORD_TYPE" TEXT,
    "PLACEKEY" TEXT,
    "REL_ANCHOR_DOMAIN" TEXT,
    "REL_ANCHOR_KEY" TEXT,
    "LOCATION_NAME_ORG" TEXT,
    "BRANDS" TEXT,
    "TOP_CATEGORY" TEXT,
    "SUB_CATEGORY" TEXT,
    "NAICS_CODE" TEXT,
    "BUSINESS_GEO_LATITUDE" TEXT,
    "BUSINESS_GEO_LONGITUDE" TEXT,
    "CATEGORY_TAGS" TEXT,
    "CLOSED_ON" TEXT,
    "TRACKING_CLOSED_SINCE" TEXT,
    "PHONE_NUMBER" TEXT,
    "BUSINESS_ADDR_COUNTRY" TEXT,
    "BUSINESS_ADDR_FULL" TEXT,
    "MAILING_VERIFIED_STATUS" TEXT,
    "REL_POINTER_DOMAIN" TEXT,
    "REL_POINTER_KEY" TEXT,
    "REL_POINTER_ROLE" TEXT,
    "OPENED_ON" TEXT,
    "IS_INTERSECTION" TEXT
)
/*
3 rows from sg table:
DATA_SOURCE RECORD_ID RECORD_TYPE PLACEKEY REL_ANCHOR_DOMAIN REL_ANCHOR_KEY LOCATION_NAME_ORG BRANDS TOP_CATEGORY SUB_CATEGORY NAICS_CODE BUSINESS_GEO_LATITUDE BUSINESS_GEO_LONGITUDE CATEGORY_TAGS CLOSED_ON TRACKING_CLOSED_SINCE PHONE_NUMBER BUSINESS_ADDR_COUNTRY BUSINESS_ADDR_FULL MAILING_VERIFIED_STATUS REL_POINTER_DOMAIN REL_POINTER_KEY REL_POINTER_ROLE OPENED_ON IS_INTERSECTION
SAFEGRAPH 225-222@5yv-j92-tn5 ORGANIZATION 225-222@5yv-j92-tn5 PLACEKEY 225-222@5yv-j92-tn5 Cantwell Michelle L Atty [] Legal Services Offices of Lawyers 541110 36.145647 -115.186399 [] 2024-02-01 2019-07-01 +17023627800 US 3320 W Sahara Ave Las Vegas NV 89102-3223 VERIFIED_PREMISE None None None None None
SAFEGRAPH 226-222@5yv-hmm-whq ORGANIZATION 226-222@5yv-hmm-whq PLACEKEY 226-222@5yv-hmm-whq Pieology Pizzeria [ { “safegraph_brand_id”: “SG_BRAND_f372f9969f6e1ae119a13ff4fc78fd2d”, “safegraph_brand_name”: “Pieo Restaurants and Other Eating Places Limited-Service Restaurants 722513 36.144906 -115.332644 [ “Counter Service”, “Dinner”, “Fast Food”, “Lunch”, “Pizza” ] None 2019-07-01 +17023314454 US 10965 Lavender Hill Dr Ste 130 Las Vegas NV 89135-2951 VERIFIED_DELIVERY_POINT PLACEKEY zzw-223@5yv-hkm-rc5 PARENT None None
SAFEGRAPH 22s-222@5yv-jbz-jgk ORGANIZATION 22s-222@5yv-jbz-jgk PLACEKEY 22s-222@5yv-jbz-jgk Ellen Peneyra [] Offices of Other Health Practitioners Offices of All Other Miscellaneous Health Practitioners 621399 36.10289 -115.121807 [] None 2019-07-01 +17027397716 US 2275 Renaissance Dr Ste D Las Vegas NV 89119-6797 VERIFIED_DELIVERY_POINT PLACEKEY 22g-222@5yv-jbz-h89 PARENT None None
*/
The database contains the following tables:
1. **dol**: This table appears to contain information about organizations, including their business names, addresses, NAICS codes, and violation counts.
– Example columns: `RECORD_TYPE`, `DATA_SOURCE`, `RECORD_ID`, `BUSINESS_NAME_ORG`, `BUSINESS_ADDR_CITY`, `BUSINESS_ADDR_STATE`, `naics_code_description`, `case_violtn_cnt`.
2. **ppp**: This table seems to store data related to PPP loans, including loan ranges, business names, addresses, NAICS codes, business types, and other related information.
– Example columns: `RECORD_TYPE`, `DATA_SOURCE`, `RECORD_ID`, `Loan_Range`, `BUSINESS_NAME_ORG`, `BUSINESS_ADDR_CITY`, `BUSINESS_ADDR_STATE`, `NAICS_Code`, `Business_Type`, `JobsReported`, `DateApproved`, `Lender`.
3. **sg**: This table contains information about various locations, including their names, brands, categories, NAICS codes, geographical coordinates, and other related details.
– Example columns: `DATA_SOURCE`, `RECORD_ID`, `RECORD_TYPE`, `PLACEKEY`, `LOCATION_NAME_ORG`, `BRANDS`, `TOP_CATEGORY`, `SUB_CATEGORY`, `NAICS_CODE`, `BUSINESS_GEO_LATITUDE`, `BUSINESS_GEO_LONGITUDE`, `PHONE_NUMBER`, `BUSINESS_ADDR_FULL`.
If you have any specific questions or need information from any of these tables, please let me know!
> Finished chain.
{'input': 'tell me about the tables in the database',
'output': 'The database contains the following tables:\n\n1. **dol**: This table appears to contain information about organizations, including their business names, addresses, NAICS codes, and violation counts.\n - Example columns: `RECORD_TYPE`, `DATA_SOURCE`, `RECORD_ID`, `BUSINESS_NAME_ORG`, `BUSINESS_ADDR_CITY`, `BUSINESS_ADDR_STATE`, `naics_code_description`, `case_violtn_cnt`.\n\n2. **ppp**: This table seems to store data related to PPP loans, including loan ranges, business names, addresses, NAICS codes, business types, and other related information.\n - Example columns: `RECORD_TYPE`, `DATA_SOURCE`, `RECORD_ID`, `Loan_Range`, `BUSINESS_NAME_ORG`, `BUSINESS_ADDR_CITY`, `BUSINESS_ADDR_STATE`, `NAICS_Code`, `Business_Type`, `JobsReported`, `DateApproved`, `Lender`.\n\n3. **sg**: This table contains information about various locations, including their names, brands, categories, NAICS codes, geographical coordinates, and other related details.\n - Example columns: `DATA_SOURCE`, `RECORD_ID`, `RECORD_TYPE`, `PLACEKEY`, `LOCATION_NAME_ORG`, `BRANDS`, `TOP_CATEGORY`, `SUB_CATEGORY`, `NAICS_CODE`, `BUSINESS_GEO_LATITUDE`, `BUSINESS_GEO_LONGITUDE`, `PHONE_NUMBER`, `BUSINESS_ADDR_FULL`.\n\nIf you have any specific questions or need information from any of these tables, please let me know!'}
Alright, so let's start asking actual questions of the data. Given this particular data set, I might care whether PPP loans were given to companies with a large number of DOL violations. How might we get at that information? Let's start by picking Union Cabs (I am going to start truncating the LangChain verbose output a bit, but the full output is in the notebooks in the GitHub repo):
sql_agent_executor.invoke('find all references to Union Cabs in all of the tables')
>>
> Finished chain.
{'input': 'find all references to Union Cabs in all of the tables',
'output': 'The only reference to "Union Cabs" found in the database is in the `sg` table under the `LOCATION_NAME_ORG` column.\n\nHere is the result:\n- `sg` table: `Union Cabs`\n\nNo references were found in the `dol` or `ppp` tables.'}
sql_agent_executor.invoke('does Union Cabs have any violations?')
>>
> Finished chain.
{'input': 'does Union Cabs have any violations?',
'output': 'It appears that Union Cabs does not have any violations recorded in the database.'}
Uh oh... Houston, we have a problem! According to these basic SQL queries, Union Cabs (an entry in the SafeGraph data) doesn't have any violations. But if we look at the DOL data, there is an entry called "Union Cab" (singular) that actually DOES have violations. Further, if you go into G2Explorer.py you will see that there are several other related cab companies at the same address, like LV Cabs and Vegas Western Cabs, that have additional violations. (Once we actually assemble the graph of this data below, I will show you a visualization of this and it will make more sense.) So if you want an accurate accounting of those violations (or answers to many other questions you could ask of this data), you will miss it, because you have not resolved the relationships between each of these seemingly unrelated cab companies!
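You can confirm the near-miss yourself with the db connection from earlier. This is my own quick check, not part of the agent run, and I am omitting the output here:

# A fuzzy search the agent never generated on its own: "Union Cab" (singular)
# does appear in the DOL data, violations and all.
print(db.run("SELECT BUSINESS_NAME_ORG, case_violtn_cnt FROM dol "
             "WHERE BUSINESS_NAME_ORG LIKE '%Union Cab%';"))
print(db.run("SELECT LOCATION_NAME_ORG FROM sg "
             "WHERE LOCATION_NAME_ORG LIKE '%Union Cab%';"))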
A RAG based on the entity-resolved data in a graph format
Clearly, entity-resolved data is going to help us get better answers. This is where the entities.json file comes into play. Let's explore what it looks like before we use it to create a graph and ask some questions. There are a ton of things you can have Senzing write out in JSON like this, and they are described in these docs. Here is a sample entry from the file I created via exportJSONEntityReport above:
{'RELATED_ENTITIES': [{'ENTITY_ID': 14802,
    'ERRULE_CODE': 'MFF',
    'IS_AMBIGUOUS': 0,
    'IS_DISCLOSED': 0,
    'MATCH_KEY': '+ADDRESS+GEO_LOC-PLACEKEY',
    'MATCH_LEVEL': 3,
    'MATCH_LEVEL_CODE': 'POSSIBLY_RELATED',
    'RECORDS': [{'DATA_SOURCE': 'SAFEGRAPH',
      'RECORD_ID': '228-223@5yv-j2j-2x5'}]},
   {'ENTITY_ID': 37854,
    'ERRULE_CODE': 'MFF',
    'IS_AMBIGUOUS': 0,
    'IS_DISCLOSED': 0,
    'MATCH_KEY': '+ADDRESS+GEO_LOC-PLACEKEY',
    'MATCH_LEVEL': 3,
    'MATCH_LEVEL_CODE': 'POSSIBLY_RELATED',
    'RECORDS': [{'DATA_SOURCE': 'SAFEGRAPH',
      'RECORD_ID': 'zzw-222@5yv-j2j-2kz'}]},
   {'ENTITY_ID': 52110,
    'ERRULE_CODE': 'CFF',
    'IS_AMBIGUOUS': 0,
    'IS_DISCLOSED': 0,
    'MATCH_KEY': '+GEO_LOC-PLACEKEY',
    'MATCH_LEVEL': 3,
    'MATCH_LEVEL_CODE': 'POSSIBLY_RELATED',
    'RECORDS': [{'DATA_SOURCE': 'SAFEGRAPH',
      'RECORD_ID': 'zzw-22d@5yv-j2j-2kz'}]}],
 'RESOLVED_ENTITY': {'ENTITY_ID': 1,
  'RECORDS': [{'DATA_SOURCE': 'SAFEGRAPH',
    'ENTITY_DESC': "Myle's Nails & Spa",
    'ENTITY_KEY': 'E86C65CF09E319D85BE58B3EF8FCB29525B06122',
    'ENTITY_TYPE': 'GENERIC',
    'ERRULE_CODE': '',
    'INTERNAL_ID': 1,
    'LAST_SEEN_DT': '2024-06-17 14:56:09.956',
    'MATCH_KEY': '',
    'MATCH_LEVEL': 0,
    'MATCH_LEVEL_CODE': '',
    'RECORD_ID': '228-222@5yv-j2j-2x5'}]}}
Here we have two keys for each entry: RESOLVED_ENTITY and RELATED_ENTITIES. The former is our starting point. Basically, these are "anchors" in our data showing what Senzing has identified as the true entity behind one or more records. It is the main result of the ER process and can be used for merging records into entities. Related entities, on the other hand, are entities that have a potentially meaningful connection to the resolved entity but should not be merged with it. For example, two businesses with the same address (think a Starbucks within a Target store) would be related even though they are not the same entity.
If you explore this JSON file you will see that every entry in the JSON has a resolved entity. There may be multiple records that resolve to a single entity. There are also potentially several related entities associated with each resolved entity. Our job is to turn this into a graph that we can use to ask questions of.
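Before we build the graph, it helps to flatten this structure into lists we can batch-load. Here is a minimal sketch of the parsing step, assuming the export is one JSON document per line (JSON Lines):

import json

entities = []
with open('./entities.json') as f:
    for line in f:
        entities.append(json.loads(line))

resolves, related = [], []
for ent in entities:
    resolved = ent['RESOLVED_ENTITY']
    for rec in resolved['RECORDS']:
        # (entity ID, data source, record ID) -> future RESOLVES relationships
        resolves.append((resolved['ENTITY_ID'], rec['DATA_SOURCE'], rec['RECORD_ID']))
    for rel in ent.get('RELATED_ENTITIES', []):
        # (entity ID, related entity ID, match level) -> future RELATED_TO relationships
        related.append((resolved['ENTITY_ID'], rel['ENTITY_ID'], rel['MATCH_LEVEL_CODE']))

print(len(resolves), len(related))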
To create the graph, I used both the original tabular JSON data and the Senzing data, the latter being used to establish relationships among the tabular data. I chose this model because it would contain the most text for the LLM to use in the RAG. I used this data to populate a Neo4j database that is running from the original Docker container. You can access it in your web browser at localhost:7474 with the login neo4j and the password graphs_@re_c00l.
For the sake of brevity I will not include the code to populate the graph here. However, you can see how I populated the graph from the original tabular data here and the Senzing data here.
In this code I have used the Neo4j Python driver to establish the connection to the database and populate the graph in batches from Pandas dataframes like I have done in previous blog posts.
As for the graph model, I have created 3 node types based on the tabular data: SGEntity, DOLEntity, and PPPEntity. Essentially, each row in each data set corresponds to a node. I have also created a node label called Entity, which corresponds to the resolved entities produced by Senzing. Then there are the relationships RESOLVES and RELATED_TO, which correspond to those keys within entities.json.
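The full loading code lives in the notebooks linked above. Purely as an illustration of the pattern, a batched load of the RESOLVES relationships looks something like this (the property names entity_id and RECORD_ID and the relationship direction are my assumptions here, not necessarily what the notebooks use):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "graphs_@re_c00l"))

def load_resolves(tx, batch):
    # MERGE each resolved entity, then link it to the already-loaded record node.
    # Real code would dispatch on DATA_SOURCE to pick the right record label.
    tx.run("""
        UNWIND $batch AS row
        MERGE (e:Entity {entity_id: row.entity_id})
        WITH e, row
        MATCH (r:SGEntity {RECORD_ID: row.record_id})
        MERGE (e)-[:RESOLVES]->(r)
    """, batch=batch)

with driver.session() as session:
    # In practice you would chunk the `resolves` list from the previous sketch.
    session.execute_write(load_resolves,
                          [{'entity_id': 1, 'record_id': '228-222@5yv-j2j-2x5'}])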
Once you run the code in both of these notebooks, you will have a graph of approximately 86,000 nodes and 8,600 relationships. The graph model (provided by CALL db.schema.visualization) should look like this:
At this point, it is important to stop and talk about this a bit. If you have read other posts about Senzing, including Paco Nathan's original article, you will note that this graph model looks a bit different. When we talk about ER, we think of entities as groups of records, so we would expect a more hub-and-spoke model with entities in the center and record nodes pointing at them. That model is beneficial for a variety of reasons; one key benefit is that you create an entity without altering any of the records used to create it, so you can go back and forth easily between unaltered records and resolved entities.
What we have above is more complicated than that. I decided to make the graph model a bit different for the sake of working with LLMs. I was hoping (although you will see later that this was perhaps a proverbial pipe dream!) that having more relationships in the graph would let the LLM move about the graph more easily or more accurately. In reality, there is not necessarily a reason to prefer this approach over the traditional hub-and-spoke model. I would encourage you to try both models and see which performs better for you.
At any rate, let's see what it looks like for Union Cabs:
If you compare this to what we see in G2Explorer.py when searching for Union Cabs, it makes sense. If you explore this subgraph, you will see that the related nodes are all related based on addresses. This is likely due to a corporate office at a single address managing multiple different cab companies.
Similar to how we created a SQL agent with LangChain, we are going to create a Cypher QA Chain for asking questions of our ERKG. We first establish a connection from LangChain to Neo4j like this:
import json
import numpy as np
import os
import pandas as pd
from pprint import pprint
import time
from tqdm import tqdm
from langchain_openai import ChatOpenAI
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain

URI = "bolt://localhost:7687"
USER = "neo4j"
PWD = ""  # Put yours here
AUTH = (USER, PWD)

os.environ["NEO4J_URI"] = URI
os.environ["NEO4J_USERNAME"] = USER
os.environ["NEO4J_PASSWORD"] = PWD

graph = Neo4jGraph()
graph.query("""MATCH (n) RETURN COUNT(n)""")
>>
[{'COUNT(n)': 86374}]
Now, we create the chain:
llm = ChatOpenAI(model="gpt-4o-2024-05-13", temperature=0)
chain = GraphCypherQAChain.from_llm(graph=graph,
                                    llm=llm,
                                    validate_cypher=True,
                                    verbose=True)
Great. Time to start asking questions!
response = chain.invoke({"query": "Tell me about Union Cab Co"})
response
> Entering new GraphCypherQAChain chain…
Generated Cypher:
cypher
MATCH (e:Entity {name: "Union Cab Co"})-[:RELATED_TO|RESOLVES*1..2]-(related)
RETURN e, related
>>
> Finished chain.
{'query': 'Tell me about Union Cab Co',
'result': 'Union Cab Co, also known as ABC Union Cab Co, was located at 5010 S Valley View Blvd,
Las Vegas, NV 89118-1705. It operated in the Taxi Service industry, specifically under
the NAICS code 485310. The company was categorized under "Taxi and Limousine Service"
and had tags such as "Airport Shuttles" and "Taxis." Union Cab Co had a verified mailing
status and was tracked as closed since July 1, 2019, with its official closure on March 1,
2022. The company had one recorded case violation and was related to other entities like
Vegas Western Cab, Inc and A NLV Cab Co.'}
Alright, there are some really interesting things to talk about here!
First, check out the Cypher query that the LLM generated. In all of my experience tinkering with this, I observed that GraphCypherQAChain really wanted to search Entity nodes; it was incredibly rare for it to search anything else. It was also very rare for it to do any sort of string similarity akin to WHERE BUSINESS_NAME_ORG LIKE '%Union Cabs%' in SQL. I found this to be a frustrating limitation of GraphCypherQAChain, but one that could likely be handled with some prompt engineering or by tinkering with some of the hyperparameters of the function (both beyond the scope of this blog post). After experimenting with this a bit, I knew that in order to get the result above, I needed to query an Entity node by its exact name (see the graph visualization above). The performance of assembling Cypher queries will likely get better over time and should not stop us from realizing the full importance of using ER in a knowledge graph.
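As one example of that prompt engineering, though I have not tested this, GraphCypherQAChain.from_llm accepts a custom cypher_prompt, which is the natural place to nudge the model toward fuzzy matching and the other node labels. A sketch:

from langchain_core.prompts import PromptTemplate

CYPHER_GENERATION_TEMPLATE = """Task: Generate a Cypher statement to query a graph database.
Schema:
{schema}
Note: Match name properties case-insensitively with CONTAINS rather than exact equality,
and consider SGEntity, DOLEntity, and PPPEntity nodes as well as Entity nodes.
The question is:
{question}"""

chain = GraphCypherQAChain.from_llm(
    graph=graph,
    llm=llm,
    cypher_prompt=PromptTemplate(input_variables=["schema", "question"],
                                 template=CYPHER_GENERATION_TEMPLATE),
    validate_cypher=True,
    verbose=True,
)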
Next, letโs look at the answer. Here I was a little heartened. It was great that the LLM was able to pick up on Vegas Western Cab and A NLV Cab Co. as being related to Union Cab Co. I also appreciated how much information the LLM brought into the output.
Letโs now take this a step further and ask questions specifically designed to get at nodes that are related to the starting Entity node:
response = chain.invoke({"query": "How many violations have entities Union Cab Co is related to been involved in?"})
response
>>
> Entering new GraphCypherQAChain chain…
Generated Cypher:
cypher
MATCH (e:Entity {name: "Union Cab Co"})-[:RELATED_TO]-(d:DOLEntity)
RETURN SUM(d.case_violtn_cnt) AS total_violations
Full Context:
[{'total_violations': 6}]
> Finished chain.
{'query': 'How many violations have entities Union Cab Co is related to been involved in?',
'result': 'Entities related to Union Cab Co have been involved in a total of 6 violations.'}
Well, that is exciting! According to the graph, there is a single DOL entry called "Union Cab" with a single violation associated with it. However, because we were able to use ER and resolve the entire subgraph, the LLM was able to see that there were 4 entries within DOL_WHISARD with a total of 6 violations associated with this grouping of cab companies!
I do not want to get you too excited though. I can tell you that I ran this particular block of code several times and received a variety of different answers, including the sum total of all violations within the database. Oh, the joys of working with LLMs!
Down the rabbit hole (a bit)
As I was exploring around in the tabular data, I asked a question that I got an interesting answer to:
sql_agent_executor.invoke('of the companies that took on the biggest loans, which had the most violations?')
>>
> Finished chain.
{'input': 'of the companies that took on the biggest loans, which had the most violations?',
'output': 'The company that took on one of the biggest loans and had the most violations is HSG, LLC, with a loan range of $350,000-1 million and 36 violations.'}
So there is this company called HSG, LLC with big loans and a fair few violations. But who are they? As I started probing, I found some interesting things. They are present in both the PPP_LOANS and DOL_WHISARD data with slight variations on the address: 4001 S. Decatur Blvd. #37-376, Las Vegas, NV, 89103 and 4001 South Decatur Blvd, Las Vegas, NV, 89103, respectively. Pulling on that thread a bit, I found that 4001 S. Decatur Blvd is actually a strip mall and Suite #37 is a UPS Store. So this is likely a PO Box in the UPS Store that is used as the registered address for this business.
However, things got somewhat more interesting. I noted that these two entries have totally different NAICS codes. NAICS is the North American Industry Classification System, which assigns an integer code to represent the type of business. In one instance HSG, LLC has an NAICS code of 541620, which corresponds to "Environmental Consulting Services." In the other, it has a code of 621330, which corresponds to "Mental Health Practitioners." Curiouser and curiouser.
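The probing above boiled down to simple LIKE queries against the two tables; here is the general shape of it (a sketch; I am omitting the output):

# Same LIKE pattern against both tables: the two HSG, LLC rows come back
# with slightly different addresses and entirely different NAICS codes.
print(db.run("SELECT BUSINESS_NAME_ORG, BUSINESS_ADDR_LINE1, NAICS_Code "
             "FROM ppp WHERE BUSINESS_NAME_ORG LIKE '%HSG%';"))
print(db.run("SELECT BUSINESS_NAME_ORG, BUSINESS_ADDR_LINE1, naic_cd, case_violtn_cnt "
             "FROM dol WHERE BUSINESS_NAME_ORG LIKE '%HSG%';"))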
So I took this to the knowledge graph. Below is the subgraph I identified for all entities with "HSG" in the name:
Interesting! We have a new node here that I wasn't expecting: a company called Integrated Mission Support Services, LLC. Senzing identified this business as "possibly related" to HSG, LLC based on the address. Maybe that is correct, maybe it isn't. However, I noted that it also has an NAICS code of 541620, "Environmental Consulting Services," and has the same address as the environmental consultant HSG, LLC. Perhaps they are the same entity. But then what is the mental health HSG doing here? Maybe someone has a typo in their entry in the data? Something else? Please add comments to this article if you have theories we might test!
Conclusions
I hope this post has given you a feel for the importance of entity resolution when working with knowledge graphs, particularly as they are used in RAG applications. I have shown an introduction to working with Senzing to resolve some real-world data and asking questions of it in a basic LLM+RAG setting. It is important to realize that many of the shortcomings of the results here were due to the GraphCypherQAChain package making some simplistic assumptions about the graph, like assigning all query entities the node type :Entity (perhaps fixable with hyperparameters) and only looking out 2 hops from the query node. As work progresses on these packages, these results will likely improve. The results would also likely improve with time spent on prompt engineering rather than just asking these basic questions. That can be the subject of a future post.
What I hope you have come away with is the importance of ER to data science, whether it is applied to LLMs with RAG or to data science in general. To anyone working in the field it is clear that better data gives better answers: if I have a table of many rows and can determine that some of those rows are really the same entity, I will get better answers in the long run, whether they involve LLMs or more traditional data science needs.
I was really impressed by how easy it was to do ER with Senzing, as well as by the quality of the results. It is worth noting, though, that this really is a commercial-grade ER engine and we are dealing with an incredibly small data set for this post: less than 100k records. Senzing is designed and optimized to work with much larger data than this, several millions of records and, no joke, billions and billions too. Plus, it is designed to handle all of these records in real time. If you were using it in production, you would not interact with it the way I have shown here; this was just for demonstration and experimentation purposes. However, please reach out if you wind up creating more sophisticated data sets and applications with it, particularly with LLMs!
Acknowledgements
I would like to give special thanks to Paco Nathan for his input on the content of this post, as well as to Brian Macy, Anthony Cunningham, and Ryan Basile for helping me get up and running with Senzing. If you try this and find anything regarding Senzing hard, just shoot them an email at support@senzing.com; their support is free, fast, and awesome!
How Do You Get Started with Smarter Entity Resolution?
If you'd like to know more about Senzing entity resolution:
Consult with an Expert โ Schedule a call with a Senzing entity resolution expert to discuss your requirements.
Try it Yourself โ There are three easy ways to take Senzing entity resolution for a test drive: a simple desktop evaluation tool (for Windows or Mac) and QuickStarts for Linux and Docker. You can install the software, load data and evaluate results in as little as 15 minutes.
When you see how Senzing Entity Resolution AI works, you'll understand why we provide the most accurate results, continually updated in real time.