Understanding Libpostal: How It Works & Why It Matters
In the digital age, understanding and processing addresses accurately is crucial for a wide range of use cases including address matching and entity resolution. Libpostal stands out as the leading open-source library explicitly designed for parsing and normalizing addresses. In this article, we’ll explore what Libpostal is, how it works, the technologies involved, and the benefits it offers. We’ll also take a closer look at how Libpostal benefits entity resolution.
What is Libpostal?
Libpostal is an open-source library for parsing and normalizing postal addresses across different locations and languages.
The libpostal statistical model was originally created by Al Barrentine and Mapzen in 2016. Libpostal is written in C, so it is fast, and client libraries are available for other popular languages.
Libpostal is one of the fundamental tools used to help overcome the biggest challenges around address parsing, including a lack of global address standards and often dirty or incomplete address data.
Problems Libpostal Solves
Global Address [Lack of] Standards
Address parsing is a tough task to accomplish in a single library because addresses vary wildly from country to country. (Check out the varying national standards in the Address page on Wikipedia to get an idea of the challenge involved.)
For example, let’s look at how significantly addresses in the United States vary from addresses in India.
A valid US address might look like:
701 1st Ave
Ste 1A
Sunnyvale, CA 94089
A valid Indian address might look like:
e-506 street number 78
uttam vihar block d
uttam nagar
bindapur, 110059
new delhi
If that looks unfamiliar, it’s because the format of an Indian address is quite different from an American address:
Name Son/Daughter Of (DO/SO) Or Husband/Wife Of (H/O or W/O)
Door number:
Street Number, Street Name
VIA NAME (VIA)
Post Name (PO)
Taluk Name (TK)
Locality or Neighbourhood
CITY – Postal Code (PIN)
District Name
State
Country
There is little correlation between fields across American and Indian addresses, so Libpostal has its work cut out for it in wrangling the data into a common format. It uses a natural language processing technique called conditional random fields (CRF) to make a structured prediction that parses addresses into a standard format. Structured prediction means all fields are parsed at once and each parsing decision influences the others; they aren’t independent. Machine learning is used to write programs too complex to write by hand, which is exactly what address parsing is.
Messy, Incomplete Addresses
Data in the wild is dirty. How do we parse this address? It is incomplete.
Peachtree St
Atlanta, 30308
Libpostal will do as good a job as possible with partial addresses, helping salvage structured data that might otherwise go unused.
For the address above, Libpostal recovers what is available: the street name, city and postal code. This approximate location is still valuable information about the record from which the address comes. In a geospatial analysis, zip code alone can be enough to perform analytics (e.g., GROUP BY zip_code).
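As a rough illustration, here is a minimal Python sketch of handling a partial address with the pypostal client. `parse_address` from `postal.parser` returns a list of (value, label) pairs; the helper names `components_to_dict` and `parse_to_dict` are hypothetical, and the sample pairs are hand-written to show the shape of the data, not captured parser output.

```python
def components_to_dict(parsed):
    """Turn libpostal-style (value, label) pairs into a {label: value} dict.

    If a label appears more than once, its values are joined with a space.
    """
    result = {}
    for value, label in parsed:
        result[label] = f"{result[label]} {value}" if label in result else value
    return result


def parse_to_dict(address):
    """Parse one address string with pypostal (requires libpostal installed)."""
    from postal.parser import parse_address  # imported lazily; needs libpostal
    return components_to_dict(parse_address(address))


# Hand-written illustration of the (value, label) shape, NOT real parser output.
sample = [("peachtree st", "road"), ("atlanta", "city"), ("30308", "postcode")]
record = components_to_dict(sample)

# Fields that are absent in the input are simply absent in the result.
print(record.get("house_number"))  # None: the partial address has no number
print(record["postcode"])
```

Fields missing from the input are left out rather than guessed, so downstream code can still group or filter on whatever was recovered.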
Technologies Utilized by Libpostal
Libpostal incorporates a variety of technologies to achieve its parsing and normalization capabilities:
• Machine Learning: Libpostal utilizes machine learning models trained on large datasets of real-world address examples to improve accuracy and robustness in parsing and normalization.
• Natural Language Processing (NLP): NLP techniques are used to analyze and understand the linguistic structure of address strings, enabling efficient tokenization and parsing.
• Rule-based Systems: In addition to machine learning, Libpostal employs rule-based systems to help generate training examples for address variations that might not be adequately represented in the source data.
• Data: Libpostal analyzes data from OpenAddresses (OA) and OpenStreetMap (OSM), a collaborative mapping project that provides comprehensive geographic information, including street names, place names, and postal code boundaries.
Updated Libpostal Data Model from Senzing
In 2023, Senzing created a new libpostal data model that is more up to date and significantly more accurate than the original model from 2016. Senzing trained this new libpostal data model on 40% more records than the original, with 1.2 billion training records created from addresses in OpenAddresses and OpenStreetMap (OSM). This includes address data from over 230 countries and over 100 different languages.
The model was tested on 12,950 addresses from 89 countries, yielding an average accuracy improvement of over 4% for all countries and over 10% in 27 countries, with improvements in specific countries as high as 87%. This results in better address parsing results and broader coverage for libpostal users. You can read more about the new Senzing libpostal data model here.
Watch as Jeff Jonas describes the new Senzing libpostal data model.
What are the Benefits of Using Libpostal?
Libpostal offers several benefits for developers, businesses, and organizations working with address data:
• Address Parsing Accuracy: By leveraging machine learning and NLP techniques, Libpostal achieves high accuracy in parsing addresses across diverse languages and formats.
• Operational Efficiency: By automating address processing tasks, Libpostal streamlines operations, saving time and resources.
• Open Source: As an open-source library, Libpostal is freely available for use and modification, fostering collaboration and innovation within the developer community.
• Multilingual Support: Libpostal offers multilingual support, enabling address parsing and normalization for diverse languages and scripts. With its adaptable architecture, the library accommodates variations in address formats, enhancing its utility in global applications.
• Cost: Libpostal is free, and using it may remove the need for an expensive address cleansing product.
What are the Most Common Libpostal Use Cases?
Common use cases for address parsing often involve data cleaning before downstream analytics, machine learning, and similar work. A data pipeline for customer data might parse addresses to populate address fields in a database, enabling improved search later on or geospatial analytics (e.g., a GROUP BY query on zip code).
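The pipeline described above could be sketched roughly as follows, using Python's built-in `sqlite3` module. The table layout, the column choice, and the helpers `to_row`, `load_addresses` and `count_by_postcode` are assumptions made for illustration, and the parsed records are hand-written in the (value, label) shape that pypostal's `parse_address` returns rather than real parser output.

```python
import sqlite3

# Columns chosen for this sketch; libpostal emits other labels as well.
COLUMNS = ("house_number", "road", "city", "state", "postcode")


def to_row(parsed):
    """Map (value, label) pairs onto fixed columns; missing fields become None."""
    fields = {}
    for value, label in parsed:
        fields.setdefault(label, value)  # keep the first value seen per label
    return tuple(fields.get(column) for column in COLUMNS)


def load_addresses(conn, parsed_addresses):
    """Create the table if needed and insert one row per parsed address."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS addresses "
        "(house_number TEXT, road TEXT, city TEXT, state TEXT, postcode TEXT)"
    )
    conn.executemany(
        "INSERT INTO addresses VALUES (?, ?, ?, ?, ?)",
        [to_row(parsed) for parsed in parsed_addresses],
    )
    conn.commit()


def count_by_postcode(conn):
    """Run the GROUP BY query mentioned in the text."""
    cursor = conn.execute(
        "SELECT postcode, COUNT(*) FROM addresses "
        "WHERE postcode IS NOT NULL GROUP BY postcode ORDER BY postcode"
    )
    return cursor.fetchall()


# Hand-written parsed records for illustration only.
parsed_addresses = [
    [("701", "house_number"), ("1st ave", "road"), ("sunnyvale", "city"),
     ("ca", "state"), ("94089", "postcode")],
    [("peachtree st", "road"), ("atlanta", "city"), ("30308", "postcode")],
    [("100", "house_number"), ("peachtree st", "road"), ("atlanta", "city"),
     ("30308", "postcode")],
]

conn = sqlite3.connect(":memory:")
load_addresses(conn, parsed_addresses)
print(count_by_postcode(conn))  # [('30308', 2), ('94089', 1)]
```

Incomplete addresses still land in the table with NULLs in the missing columns, so they contribute to the postcode counts.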
Some specific use cases for address parsing include:
• Matching a shipping address with an official address from a postal service.
• Geocoding in Geographic Information Systems (GIS), which rely on structured records to geocode addresses.
• Extract-Transform-Load (ETL) of raw single-line address data into a structured address in a database.
• Entity resolution, where well-parsed addresses 1) improve candidate selection (often called binning) and 2) contribute to higher quality address scoring.
• Knowledge graph construction, where well-parsed addresses can be used to discover connections between nodes that would otherwise be missed.
Libpostal in Action - What Is Possible
Let’s examine some examples of what’s possible with a high-performance address-parsing library. What does Libpostal, used with the pypostal PyPI client library, make of real addresses?
Different Strings - Same Address
Below are a pair of dissimilar address strings that represent the same address. The Levenshtein distance between these two addresses is 13, which is misleading. The two addresses are the same, despite being dissimilar strings.
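For readers unfamiliar with the metric, the following is a minimal sketch of Levenshtein edit distance in plain Python. The two address strings are hypothetical stand-ins, not the pair from this article, so no particular distance is claimed for them.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical strings that describe one location in different word orders.
first = "123 Main Street, Springfield"
second = "Springfield, Main St 123"
print(levenshtein(first.lower(), second.lower()))
```

Because the metric counts character edits across the whole string, reordering the same components inflates the distance even though the location is unchanged.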
The diagram below shows what Libpostal needs to figure out: street names and numbers are in different positions in the string, but the locations represented by these addresses are identical.
Comparing parsed addresses using Libpostal indicates a partial match occurred: the house number, road and postal code matched. For some applications, this is sufficient to match addresses. In others, it is not. You get to decide.
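A field-by-field comparison of this kind might look like the sketch below. It assumes each address has already been parsed into a dict keyed by libpostal labels such as `house_number`, `road`, `city` and `postcode`; the function names, the sample values and the choice of required fields are illustrative assumptions, and the right policy depends on the application.

```python
def compare_components(a, b, labels=("house_number", "road", "city", "postcode")):
    """Compare two parsed addresses label by label.

    Returns {label: "match" | "differ" | "missing"}, where "missing" means
    at least one side has no value for that label.
    """
    result = {}
    for label in labels:
        value_a, value_b = a.get(label), b.get(label)
        if value_a is None or value_b is None:
            result[label] = "missing"
        elif value_a.strip().lower() == value_b.strip().lower():
            result[label] = "match"
        else:
            result[label] = "differ"
    return result


def is_same_address(comparison, required=("house_number", "road", "postcode")):
    """Application-specific policy: every required field must match."""
    return all(comparison.get(label) == "match" for label in required)


# Hand-written parsed addresses for illustration only.
left = {"house_number": "123", "road": "main street",
        "city": "springfield", "postcode": "12345"}
right = {"house_number": "123", "road": "main street", "postcode": "12345"}

comparison = compare_components(left, right)
print(comparison)
print(is_same_address(comparison))  # True under this sketch's policy
```

A stricter application could add `city` to `required`, in which case the same pair would no longer count as a match.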
Similar Strings - Different Address
Sometimes address strings can be similar, but a single character makes them different locations. 32 Orchard Road is not 38 Orchard Row, even if all other fields match. Nor is a matching street address with postal code 238875 the same location as 238874. The similar addresses below are different even with a Levenshtein edit distance of 2.
Address Parsing for Entity Resolution
One very popular use case for Libpostal is entity resolution, where parsed addresses enable semantic comparison at the address element level – which works much better than whole-address string comparison.
Entity resolution, also known as data matching, fuzzy matching, record linkage or deduplication, is the process of identifying and linking records that refer to the same real-world entity across different data sources. It involves comparing attributes such as names, addresses, and identifiers to determine the degree of similarity and establish connections between related records.
While Libpostal parses and expands addresses, it is up to users to implement comparisons suited to their domain. Libpostal gives users the ability to work semantically with each address field and make informed decisions about matching addresses. Beyond entity resolution, this is a critical capability for geospatial analytics and knowledge graphs (e.g., discovering address connections missed in messy address data).
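One simple comparison strategy, sketched here as an assumption rather than a recommended method, is to treat two addresses as candidates for a match when their normalized expansions overlap. `expand_address` from `postal.expand` returns a list of normalized strings; the helper names and the hand-written expansion lists below are illustrative only.

```python
def expansions_overlap(expansions_a, expansions_b):
    """Return the normalized forms two addresses share (empty set if none)."""
    return set(expansions_a) & set(expansions_b)


def same_after_expansion(address_a, address_b):
    """Expand both strings with pypostal and check for a shared form."""
    from postal.expand import expand_address  # imported lazily; needs libpostal
    return bool(expansions_overlap(expand_address(address_a),
                                   expand_address(address_b)))


# Hand-written lists in the shape expand_address returns, NOT real output.
expansions_a = ["30 west 26th street", "30 w 26th street"]
expansions_b = ["30 west 26th street", "30 west 26th saint"]

print(expansions_overlap(expansions_a, expansions_b))
```

Whether a single shared expansion is enough evidence is a domain decision; some applications will want to combine it with field-level checks.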
Libpostal & Entity Resolved Knowledge Graphs
Well-parsed addresses can also be used to create new, reliable address nodes in a knowledge graph, revealing connections even when the raw addresses are structurally quite different (e.g., because of natural variability or incompleteness). Address nodes can be powerful enablers of geospatial knowledge graph queries. Pro tip: keep an eye out for address supernodes, excessively connected nodes that can break graph database queries, which call for a degree of quality assurance.
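The idea of address nodes and supernodes can be sketched in a few lines of Python. This is a simplified illustration: the key fields, the `max_degree` threshold and all function names are assumptions, and the records are invented.

```python
from collections import defaultdict

KEY_LABELS = ("house_number", "road", "city", "postcode")


def address_key(components):
    """Build a node key from the fields chosen to identify an address."""
    return tuple(components.get(label, "").strip().lower() for label in KEY_LABELS)


def build_address_nodes(records):
    """Group entity ids by address key.

    records: iterable of (entity_id, components_dict).
    Returns {address_key: set of entity ids linked to that address node}.
    """
    nodes = defaultdict(set)
    for entity_id, components in records:
        nodes[address_key(components)].add(entity_id)
    return dict(nodes)


def find_supernodes(nodes, max_degree):
    """Flag address nodes connected to more than max_degree entities."""
    return {key: ids for key, ids in nodes.items() if len(ids) > max_degree}


# Invented records: three entities share one address, one stands alone.
shared = {"house_number": "1", "road": "high st", "city": "exampleville", "postcode": "00001"}
records = [
    ("company-a", shared),
    ("company-b", dict(shared, road="High St")),  # same node after lowercasing
    ("person-c", shared),
    ("person-d", {"house_number": "9", "road": "low rd", "city": "exampleville", "postcode": "00002"}),
]

nodes = build_address_nodes(records)
print(find_supernodes(nodes, max_degree=2))
```

In practice a registered-agent or mail-forwarding address can connect thousands of entities, so flagged nodes are usually reviewed or excluded before graph queries run.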
Entity-resolved knowledge graphs are popular in the field of Anti-Money Laundering (AML) because financial criminals often work together across national borders. They work in networks. This makes parsing addresses from all 195 nations on Earth important. The diagram below shows how financial risk in a business graph might spread across an address node formed during entity resolution using Libpostal parsed addresses. Sharing an address with a known money launderer can be a strong signal of risk.
Libpostal Client Libraries
While the core library is written in C for quick performance, you can use Libpostal in your own language via a client library. The official bindings are listed below, but other libraries for additional languages exist.
• Python – pypostal
• Java / JNI – jpostal
• Ruby – ruby_postal
• NodeJS – node_postal
• R – poster
• Go – gopostal
• PHP – php-postal
Libpostal and Geocoding
Libpostal isn’t geocoding software, but address parsing is necessary for geocoding addresses. Once parsed, address matching with a database can determine the latitude and longitude of an address. Absent a match, interpolation can be used to determine approximate coordinates based on similarities to known locations in terms of street and house number. Other parsed addresses provide clues to the location of a new address. Mapzen released Libpostal and also a sister open-source project that geocodes parsed addresses called Pelias.
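To make the interpolation step concrete, here is a deliberately simplified Python sketch that places a house number on a straight line between two known points on the same street. It is not how Pelias or any particular geocoder works; the function name and the coordinates are made up for illustration, and real interpolation has to deal with curved streets and odd/even numbering.

```python
def interpolate_house_number(target, low, high):
    """Linearly interpolate coordinates for a house number.

    low and high are (house_number, latitude, longitude) tuples for two
    known addresses on the same street that bracket the target number.
    """
    (number_0, lat_0, lon_0), (number_1, lat_1, lon_1) = low, high
    if number_0 == number_1:
        raise ValueError("known points must have different house numbers")
    if not min(number_0, number_1) <= target <= max(number_0, number_1):
        raise ValueError("target must lie between the known house numbers")
    fraction = (target - number_0) / (number_1 - number_0)
    return (lat_0 + fraction * (lat_1 - lat_0),
            lon_0 + fraction * (lon_1 - lon_0))


# Made-up coordinates: number 150 sits halfway between numbers 100 and 200.
print(interpolate_house_number(150, (100, 10.0, 20.0), (200, 12.0, 24.0)))
# (11.0, 22.0)
```

The parsed `road` and `house_number` fields are what make this lookup possible: they let the geocoder find bracketing addresses on the same street.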
Scaling Libpostal for Big Data
Libpostal is blazin’ fast, so it scales to billions of addresses on a distributed system like Apache Spark, Amazon Elastic MapReduce, GCP Dataproc, Databricks or Dask. Libpostal isn’t likely to be a significant bottleneck in your data pipeline compared to I/O operations.
You can use Libpostal with PySpark or Dask via the pypostal Python client library. You’ll need to create a build script to install Libpostal and pypostal on each machine as it boots using the install instructions for Linux. In addition to PySpark, jpostal will likely work through Spark on Scala or Java.
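One way this might be wired up is a per-partition function that imports pypostal lazily on each worker. The function below is a sketch under that assumption: `parse_partition` and its `parser` parameter are hypothetical names, and the `parser` argument exists only so the function can be tried without libpostal installed.

```python
def parse_partition(addresses, parser=None):
    """Parse every address string in one partition.

    Yields (original_string, {label: value}) pairs. If a label repeats,
    the last value wins in this simplified sketch.
    """
    if parser is None:
        # Imported on the worker, so the driver does not need libpostal.
        from postal.parser import parse_address as parser
    for address in addresses:
        yield address, {label: value for value, label in parser(address)}


# Stand-in parser so the sketch runs without libpostal; NOT real output.
def fake_parser(address):
    return [(address.lower(), "road")]


print(list(parse_partition(["Main St", "High St"], parser=fake_parser)))

# With PySpark, an RDD of address strings could then be processed with
# something like: rdd.mapPartitions(parse_partition)
```

Working per partition rather than per row keeps the import and any model loading to once per partition instead of once per address.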
Trying Libpostal With Docker
You can try Libpostal via its command line interface (CLI), which parses addresses you type, using Docker in three commands:
docker pull senzing/libpostal-docker:latest
docker run -it senzing/libpostal-docker /bin/bash
../libpostal/src/address_parser
Any addresses you type will be parsed and displayed as JSON.
Building Libpostal with the Senzing Model
Senzing created the first updated libpostal data model since 2016. We trained it on 40% more records (1.2 billion) and measurably increased accuracy.
Now, you can easily build Libpostal with the Senzing model using the following instructions or the README for libpostal. Libpostal has a few dependencies we need to install first.
On Ubuntu Linux, run the following commands:
sudo apt install curl build-essential autoconf automake libtool pkg-config -y
On Redhat / CentOS, run:
sudo yum install curl autoconf automake libtool pkgconfig
If you run into issues with Redhat / CentOS, try installing the Development Tools package:
yum groups mark install "Development Tools"
yum groups mark convert "Development Tools"
yum groupinstall "Development Tools"
On a Mac, you can use Homebrew to install the same stuff:
brew install curl autoconf automake libtool pkg-config
A single flag `MODEL=senzing` is necessary to configure the Senzing model (see the Dockerfile for this post). It takes a few minutes to build. Run the following commands:
git clone https://github.com/openvenues/libpostal.git
cd libpostal
./bootstrap.sh
# Disable SSE2 so it will work on Apple ARM processors (optional)
./configure --datadir=/tmp --disable-sse2 MODEL=senzing
make -j6
make install
On Linux there is one more command:
ldconfig
Libpostal Community
The Libpostal community is constantly growing! If you have an issue to report or a feature to suggest, use GitHub Issues for the openvenues/libpostal project. If you need help using Libpostal or want to discuss it with other users, check out the Libpostal LinkedIn group.