Understanding Libpostal

How It Works & Why It Matters

In the digital age, understanding and processing addresses accurately is crucial for a wide range of use cases including address matching and entity resolution. Libpostal stands out as the leading open-source library explicitly designed for parsing and normalizing addresses. In this article, we’ll explore what Libpostal is, how it works, the technologies involved, and the benefits it offers. We’ll also take a closer look at how libpostal benefits entity resolution.

Libpostal - what it is and why it matters

What is Libpostal?

Libpostal is an open-source library for parsing and normalizing postal addresses across different locations and languages.

The libpostal statistical model was originally created by Al Barrentine and Mapzen in 2016. Libpostal is written in C, so it is fast, and client libraries are available for other popular languages.

Libpostal is one of the fundamental tools used to help overcome the biggest challenges around address parsing, including a lack of global address standards and often dirty or incomplete address data.

Problems Libpostal Solves

Global Address [Lack of] Standards

Address parsing is a tough task to accomplish in a single library because addresses vary wildly from country to country. (Check out the varying national standards on the Address page on Wikipedia to get an idea of the challenge involved.)

For example, let’s look at how significantly addresses in the United States vary from addresses in India.

A valid US address might look like:

701 1st Ave
Ste 1A
Sunnyvale, CA 94089

A valid Indian address might look like:

e-506 street number 78
uttam viharblock d
uttam nagar
bindapur, 110059
new delhi

If that looks unfamiliar, it’s because the format of an Indian address is quite different from an American address:

Name Son/Daughter Of (DO/SO) Or Husband/Wife Of (H/O or W/O)
Door number:
Street Number, Street Name
VIA NAME (VIA)
Post Name (PO)
Taluk Name (TK)
Locality or Neighbourhood
CITY – Postal Code (PIN)
District Name
State
Country

There is little correlation between fields across American and Indian addresses, so Libpostal has its work cut out for it in wrangling the data into a common format. It uses a natural language processing technique called conditional random fields (CRF) to make a structured prediction that parses addresses into a standard set of fields. Structured prediction means all fields are parsed at once and each parsing decision influences the others; they aren’t independent. Machine learning is suited to programs too complex to write by hand, and address parsing is exactly that kind of problem.
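To make "each parsing decision influences the others" concrete, here is a toy sketch of Viterbi decoding, the kind of joint inference a linear-chain CRF performs. The two labels, the tokens, and every score are invented for illustration; they are not libpostal's actual features or weights.

```python
# Toy Viterbi decoding over a linear chain. All scores are made up for
# illustration and are NOT libpostal's real model weights.
LABELS = ["house_number", "road"]

# Hypothetical per-token scores: how much each token looks like each label.
EMISSION = {
    "701": {"house_number": 2.0, "road": 0.0},
    "1st": {"house_number": 1.0, "road": 0.8},  # looks numeric on its own
    "ave": {"house_number": 0.0, "road": 2.0},
}

# Hypothetical label-to-label scores: TRANSITION[previous][current].
TRANSITION = {
    "house_number": {"house_number": -2.0, "road": 1.0},
    "road": {"house_number": -1.0, "road": 1.0},
}


def independent_labels(tokens):
    """Label each token on its own, ignoring its neighbours."""
    return [max(LABELS, key=lambda lab: EMISSION[tok][lab]) for tok in tokens]


def viterbi_labels(tokens):
    """Pick the label sequence with the best total score."""
    # best[i][lab] = (score of the best path ending in lab at token i, previous label)
    best = [{lab: (EMISSION[tokens[0]][lab], None) for lab in LABELS}]
    for tok in tokens[1:]:
        column = {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p][0] + TRANSITION[p][cur])
            score = best[-1][prev][0] + TRANSITION[prev][cur] + EMISSION[tok][cur]
            column[cur] = (score, prev)
        best.append(column)
    # Walk back from the best final label.
    label = max(LABELS, key=lambda lab: best[-1][lab][0])
    path = [label]
    for column in reversed(best[1:]):
        label = column[label][1]
        path.append(label)
    return path[::-1]


tokens = ["701", "1st", "ave"]
print(independent_labels(tokens))  # ['house_number', 'house_number', 'road']
print(viterbi_labels(tokens))      # ['house_number', 'road', 'road']
```

Labelled in isolation, "1st" looks like a second house number. Scored jointly, the transition penalty for two house numbers in a row pushes it into the road name.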

Libpostal address parsing and format example

Messy, Incomplete Addresses

Data in the wild is dirty. How do we parse this address? It is incomplete.

Peachtree St
Atlanta, 30308

Libpostal will do as good a job as possible with partial addresses, helping salvage structured data that might otherwise go unused.

Libpostal address parsing and entity resolution example

We got what was available: the street name, city and postal code. This approximate location is still valuable information about the record from which the address comes. In a geospatial analysis, zip code alone can be enough to perform analytics (e.g. GROUP BY zip_code).
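The GROUP BY idea can be sketched in plain Python. The records below are hand-written in the (value, label) shape that pypostal returns; they are illustrative stand-ins, not captured libpostal output.

```python
from collections import Counter

# Hypothetical parser output in the (value, label) shape pypostal returns.
# These rows are hand-written for illustration, not real libpostal output.
parsed_records = [
    [("peachtree st", "road"), ("atlanta", "city"), ("30308", "postcode")],
    [("10th st ne", "road"), ("atlanta", "city"), ("30308", "postcode")],
    [("atlanta", "city")],  # no postal code could be recovered
]


def postcode_of(parsed):
    """Return the first postcode component, or None if there is none."""
    for value, label in parsed:
        if label == "postcode":
            return value
    return None


# Equivalent of: SELECT zip_code, COUNT(*) ... GROUP BY zip_code
counts = Counter(postcode_of(parsed) for parsed in parsed_records)
print(counts)
```

Records without a postal code land under the key None, so they can be counted and reviewed rather than silently dropped.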

Technologies Utilized by Libpostal

Libpostal incorporates a variety of technologies to achieve its parsing and normalization capabilities:

• Machine Learning: Libpostal utilizes machine learning models trained on large datasets of real-world address examples to improve accuracy and robustness in parsing and normalization.

• Natural Language Processing (NLP): NLP techniques are used to analyze and understand the linguistic structure of address strings, enabling efficient tokenization and parsing.

• Rule-based Systems: In addition to machine learning, libpostal employs rule-based systems to help generate training data that might not be adequately represented in source data.

• Data: Libpostal analyzes data from OpenAddresses (OA) and OpenStreetMap (OSM), a collaborative mapping project that provides comprehensive geographic information, including street names, place names, and postal code boundaries.

Updated Libpostal Data Model from Senzing

In 2023, Senzing created a new libpostal data model that is more up to date and significantly more accurate than the original model from 2016. Senzing trained this new libpostal data model on 40% more records than the original, with 1.2 billion training records created from addresses in OpenAddresses and OpenStreetMap (OSM). This includes address data from over 230 countries and over 100 different languages.

The model was tested on 12,950 addresses from 89 countries, yielding an average accuracy improvement of over 4% for all countries and over 10% in 27 countries, with improvements in specific countries as high as 87%. This results in better address parsing results and broader coverage for libpostal users. You can read more about the new Senzing libpostal data model here.

Watch as Jeff Jonas describes the new Senzing libpostal data model.

What are the Benefits of Using Libpostal?

Libpostal offers several benefits for developers, businesses, and organizations working with address data:

Address Parsing Accuracy: By leveraging machine learning and NLP techniques, libpostal achieves high accuracy in parsing addresses across diverse languages and formats.

Operational Efficiency: By automating address processing tasks, libpostal streamlines operations, saving time and resources.

Open Source: As an open-source library, libpostal is freely available for use and modification, fostering collaboration and innovation within the developer community.

Multilingual Support: Libpostal offers multilingual support, enabling address parsing and normalization for diverse languages and scripts. With its adaptable architecture, the library accommodates variations in address formats, enhancing its utility in global applications.

Cost: Libpostal is not only free, but by using libpostal you may not need an expensive address cleansing product.

What are the Most Common Libpostal Use Cases?

Common use cases for address parsing often involve data cleaning before downstream analytics or machine learning. A data pipeline for customer data might parse addresses so it can populate address fields in a database, enabling improved search later on or geospatial analytics such as a GROUP BY query on zip code.

Some specific use cases for address parsing include:

• Matching a shipping address with an official address from a postal service.

• Geocoding in Geographic Information Systems (GIS), which rely on structured address records

• Extract-Transform-Load (ETL) of raw single-line address data into a structured address in a database

• Entity resolution where well-parsed addresses 1) improve candidate selection (often called binning); and, 2) contribute to higher quality address scoring

• Knowledge graph construction where well-parsed addresses can be used to discover connections between nodes – that would otherwise be missed
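Candidate selection (binning) from the entity resolution item above can be sketched as follows. The choice of postcode plus road as the blocking key is an assumption made for illustration, and the records are hand-written in pypostal's (value, label) shape rather than real parser output.

```python
from collections import defaultdict


def blocking_key(parsed):
    """Build a (postcode, road) key; None when either field is missing."""
    fields = {label: value for value, label in parsed}
    postcode, road = fields.get("postcode"), fields.get("road")
    if postcode is None or road is None:
        return None
    return (postcode, road)


def bin_records(records):
    """Group record ids by blocking key so only bin-mates get compared."""
    bins = defaultdict(list)
    for record_id, parsed in records.items():
        key = blocking_key(parsed)
        if key is not None:
            bins[key].append(record_id)
    return dict(bins)


# Hypothetical parsed records, hand-written for illustration.
records = {
    "r1": [("701", "house_number"), ("1st ave", "road"), ("94089", "postcode")],
    "r2": [("1st ave", "road"), ("sunnyvale", "city"), ("94089", "postcode")],
    "r3": [("peachtree st", "road"), ("30308", "postcode")],
    "r4": [("atlanta", "city")],  # too sparse to bin with this key
}
print(bin_records(records))
```

Only r1 and r2 share a bin, so only that pair would go on to detailed scoring. A real pipeline would likely use several keys so that sparse records such as r4 still find candidates.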

Libpostal in Action - What Is Possible

Let’s examine some examples of what’s possible with a high-performance address-parsing library. What does Libpostal, used with the pypostal PyPI client library, make of real addresses?
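As a starting point, here is a minimal sketch of calling the parser from Python. It assumes libpostal and pypostal are installed; parse_to_dict is a hypothetical helper name, and the parser argument exists only so the helper can be tried with a stand-in function when libpostal is not available.

```python
def parse_to_dict(address, parser=None):
    """Parse an address and return {label: value}.

    `parser` is injectable so the helper can be exercised without
    libpostal installed; by default it uses pypostal's parse_address.
    """
    if parser is None:
        from postal.parser import parse_address  # needs libpostal + pypostal
        parser = parse_address
    fields = {}
    for value, label in parser(address):
        # Keep the first value seen for each label.
        fields.setdefault(label, value)
    return fields


try:
    print(parse_to_dict("701 1st Ave Ste 1A Sunnyvale, CA 94089"))
except ImportError:
    print("pypostal is not installed; pass a parser function to try the helper")
```

pypostal returns a list of (value, label) pairs; turning it into a dictionary makes field-by-field comparison easier, at the cost of keeping only the first value when a label repeats.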

Different Strings - Same Address

Below are a pair of dissimilar address strings that represent the same address. The Levenshtein distance between these two addresses is 13, which is misleading. The two addresses are the same, despite being dissimilar strings.

The diagram below shows what Libpostal needs to figure out: street names and numbers are in different positions in the string, but the locations represented by these addresses are identical.

Libpostal address parsing example

Comparing parsed addresses using Libpostal indicates a partial match occurred: the house number, road and postal code matched. For some applications, this is sufficient to match addresses. In others, it is not. You get to decide.
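A field-level comparison of this kind might look like the sketch below. The two parsed addresses are hand-written in pypostal's (value, label) shape, and exact string equality per field is a simplifying assumption; a real matcher would normalize or score each field.

```python
def compare_fields(parsed_a, parsed_b):
    """Return (matched, differing, only_in_one) lists of labels."""
    a = {label: value for value, label in parsed_a}
    b = {label: value for value, label in parsed_b}
    shared = a.keys() & b.keys()
    matched = sorted(label for label in shared if a[label] == b[label])
    differing = sorted(label for label in shared if a[label] != b[label])
    only_in_one = sorted(a.keys() ^ b.keys())
    return matched, differing, only_in_one


# Hypothetical parses of two strings for the same place.
first = [("701", "house_number"), ("1st ave", "road"),
         ("sunnyvale", "city"), ("94089", "postcode")]
second = [("701", "house_number"), ("1st ave", "road"),
          ("94089", "postcode"), ("ca", "state")]

matched, differing, only_in_one = compare_fields(first, second)
print("matched:", matched)
print("differing:", differing)
print("present in only one:", only_in_one)
```

Whether three matching fields and no conflicts is enough to call the addresses equal is the application's decision, which is why the helper reports the evidence instead of returning a verdict.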

Libpostal address matching

Similar Strings - Different Address

Sometimes address strings can be similar, but a single character makes them different locations. 32 Orchard Road is not 38 Orchard Row, even if all other fields match. Nor is a matching street address with postal code 238875 the same location as 238874. The similar addresses below refer to different locations even though their Levenshtein edit distance is only 2.
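For reference, Levenshtein distance can be computed with a short dynamic program. The two strings below are illustrative examples built from the numbers mentioned above, not the pair shown in the figure.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


# Illustrative strings: house number and postal code each differ by one digit.
first = "32 orchard road 238875"
second = "38 orchard road 238874"
print(levenshtein(first, second))  # 2
```

A distance of 2 on strings of this length would pass most fuzzy-matching thresholds, yet both the house number and the postal code disagree once the fields are compared separately.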

Libpostal different address parsing example

Address Parsing for Entity Resolution

One very popular use case for Libpostal is entity resolution, where parsed addresses enable semantic comparison at the address element level – which works much better than whole-address string comparison.

Entity resolution, also known as data matching, fuzzy matching, record linkage or deduplication, is the process of identifying and linking records that refer to the same real-world entity across different data sources. It involves comparing attributes such as names, addresses, and identifiers to determine the degree of similarity and establish connections between related records.

While Libpostal parses and expands addresses, it is up to users to implement comparisons suited to their domain. Libpostal gives users the ability to work semantically with each address field and make informed decisions about matching addresses. Beyond entity resolution, this is a critical capability for geospatial analytics and knowledge graphs (e.g., discovering address connections missed in messy address data).
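One simple comparison built on expansion is to treat two addresses as candidates for a match when their sets of normalized forms overlap. The sketch below assumes pypostal's expand_address is available; same_after_expansion is a hypothetical helper name, and the expand argument is there so the logic can be tried with a stand-in function.

```python
def same_after_expansion(address_a, address_b, expand=None):
    """True when the two addresses share at least one normalized form."""
    if expand is None:
        from postal.expand import expand_address  # needs libpostal + pypostal
        expand = expand_address
    return bool(set(expand(address_a)) & set(expand(address_b)))


try:
    print(same_after_expansion("30 W 26th St", "30 West 26th Street"))
except ImportError:
    print("pypostal is not installed; pass an expand function to try the helper")
```

An overlap is evidence, not proof. Strict domains may still require house number and postal code to agree after parsing.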

Libpostal address parsing for entity resolution

Libpostal & Entity Resolved Knowledge Graphs

Well-parsed addresses can also be used to create new, reliable address nodes in a knowledge graph. These nodes reveal connections even when the raw addresses are structurally quite different, for example because of natural variability or incompleteness. Address nodes can be powerful enablers of geospatial knowledge graph queries. Pro tip: keep an eye out for address supernodes, excessively connected nodes that can break graph database queries, so a degree of quality assurance is necessary.

Entity-resolved knowledge graphs are popular in the field of Anti-Money Laundering (AML) because financial criminals often work together across national borders. They work in networks. This makes parsing addresses from all 195 nations on Earth important. The diagram below shows how financial risk in a business graph might spread across an address node formed during entity resolution using Libpostal parsed addresses. Sharing an address with a known money launderer can be a strong signal of risk.
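The address-node idea, including the supernode check, can be sketched with plain dictionaries. Entity names, address keys, and the degree threshold are all hypothetical; the address key stands in for whatever normalized form of the parsed address a pipeline settles on.

```python
from collections import defaultdict


def build_address_nodes(records):
    """Map each address key to the set of entities located there."""
    nodes = defaultdict(set)
    for entity_id, address_key in records:
        nodes[address_key].add(entity_id)
    return dict(nodes)


def supernodes(nodes, max_degree):
    """Address keys connected to more entities than max_degree."""
    return {key for key, entities in nodes.items() if len(entities) > max_degree}


def exposed_entities(nodes, flagged):
    """Entities that share an address node with a flagged entity."""
    exposed = set()
    for entities in nodes.values():
        if entities & flagged:
            exposed |= entities - flagged
    return exposed


# Hypothetical (entity, address key) pairs.
records = [
    ("acme ltd", "94089|1st ave|701"),
    ("bolt llc", "94089|1st ave|701"),
    ("crow inc", "30308|peachtree st"),
]
nodes = build_address_nodes(records)
print(exposed_entities(nodes, flagged={"acme ltd"}))  # {'bolt llc'}
print(supernodes(nodes, max_degree=1))
```

In practice the supernode check would run first, so that a mail-forwarding service or a large office tower does not mark every tenant as exposed.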

Libpostal & entity resolved knowledge graphs

Libpostal Client Libraries

While the core library is written in C for quick performance, you can use Libpostal in your own language via a client library. The official bindings are listed below, but other libraries for additional languages exist.

• Python – pypostal

• Java / JNI – jpostal

• Ruby – ruby_postal

• NodeJS – node_postal

• R – poster

• Go – gopostal

• PHP – php-postal

Libpostal and Geocoding?

Libpostal isn’t geocoding software, but address parsing is necessary for geocoding addresses. Once parsed, address matching with a database can determine the latitude and longitude of an address. Absent a match, interpolation can be used to determine approximate coordinates based on similarities to known locations in terms of street and house number. Other parsed addresses provide clues to the location of a new address. Mapzen released Libpostal and also a sister open-source project that geocodes parsed addresses called Pelias.
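The interpolation step can be illustrated with a simple linear sketch. This is not how Pelias or any particular geocoder works; it assumes house numbers are spaced evenly along a straight street segment, and the coordinates are synthetic.

```python
def interpolate_position(house_number, known_points):
    """Estimate (lat, lon) for a house number from known neighbours.

    known_points: list of (house_number, lat, lon) on the same street.
    Returns None when the number falls outside the known range.
    """
    below = [p for p in known_points if p[0] <= house_number]
    above = [p for p in known_points if p[0] >= house_number]
    if not below or not above:
        return None
    lower = max(below, key=lambda p: p[0])
    upper = min(above, key=lambda p: p[0])
    if lower[0] == upper[0]:
        return (lower[1], lower[2])  # exact match on a known address
    fraction = (house_number - lower[0]) / (upper[0] - lower[0])
    return (
        lower[1] + fraction * (upper[1] - lower[1]),
        lower[2] + fraction * (upper[2] - lower[2]),
    )


# Synthetic reference points, not real coordinates.
known = [(100, 10.0, 20.0), (200, 11.0, 22.0)]
print(interpolate_position(150, known))  # (10.5, 21.0)
```

The parsed road and house number are what make the lookup possible: the road selects the candidate street, and the house number positions the estimate along it.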

Scaling Libpostal for Big Data

Libpostal is blazin’ fast, so it scales to billions of addresses on a distributed system like Apache Spark, Amazon Elastic MapReduce, GCP Dataproc, Databricks or Dask. Libpostal isn’t likely to be a significant bottleneck in your data pipeline compared to I/O operations.

You can use Libpostal with PySpark or Dask via the pypostal Python client library. You’ll need to create a build script to install Libpostal and pypostal on each machine as it boots using the install instructions for Linux. In addition to PySpark, jpostal will likely work through Spark on Scala or Java.
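A common pattern on a cluster is to parse per partition and import pypostal inside the worker function, so the library is loaded on each executor rather than on the driver. The sketch below shows only that function; parse_partition is a hypothetical name, and wiring it into Spark (for example through an RDD's mapPartitions) is left as a comment because it depends on your cluster setup.

```python
def parse_partition(addresses, parser=None):
    """Yield (address, parsed components) for every address in a partition.

    The import happens inside the function so it runs on the worker,
    where the build script has installed libpostal and pypostal.
    """
    if parser is None:
        from postal.parser import parse_address  # needs libpostal + pypostal
        parser = parse_address
    for address in addresses:
        yield address, parser(address)


# Assumed usage with PySpark, given an RDD of address strings:
#   parsed_rdd = address_rdd.mapPartitions(parse_partition)

# Local dry run with a stand-in parser, so no cluster or libpostal is needed.
stub = lambda text: [(text.lower(), "road")]
print(list(parse_partition(["Peachtree St"], parser=stub)))
```

Working per partition instead of per row keeps the import out of the per-row path.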

Trying Libpostal With Docker

You can try Libpostal via the command line interface (CLI) that parses addresses you type, using docker in three commands:

docker pull senzing/libpostal-docker:latest
docker run -it senzing/libpostal-docker /bin/bash
../libpostal/src/address_parser

Any addresses you type will be parsed and displayed as JSON.

Building Libpostal with the Senzing Model

Senzing created the first updated libpostal data model since 2016. We trained it on 40% more records (1.2 billion) and measurably increased accuracy.

Now, you can easily build Libpostal with the Senzing model using the following instructions or the README for libpostal. Libpostal has a few dependencies we need to install first. 

On Ubuntu Linux, run the following commands:

sudo apt install curl build-essential autoconf automake libtool pkg-config -y

On Redhat / CentOS, run:

sudo yum install curl autoconf automake libtool pkgconfig

If you run into issues with Redhat / CentOS, try installing the Development Tools package:

yum groups mark install "Development Tools"
yum groups mark convert "Development Tools"
yum groupinstall "Development Tools"

On a Mac, you can use Homebrew to install the same stuff:

brew install curl autoconf automake libtool pkg-config

A single flag `MODEL=senzing` is necessary to configure the Senzing model (see the Dockerfile for this post). It takes a few minutes to build. Run the following commands:

git clone https://github.com/openvenues/libpostal.git
cd libpostal
./bootstrap.sh
# Disable SSE2 so it will work on Apple ARM processors (optional)
./configure --datadir=/tmp --disable-sse2 MODEL=senzing
make -j6
make install

On Linux there is one more command:

ldconfig

Libpostal Community

The Libpostal community is constantly growing! If you have an issue to report or a feature to suggest, use GitHub Issues for the openvenues/libpostal project. If you need help using Libpostal or want to discuss it with other users, check out the Libpostal LinkedIn group.
