Privacy by Design (PbD) History & Features of Senzing
By Jeff Jonas, published December 10, 2018
Inspired by the 70th anniversary of the Universal Declaration of Human Rights, I wanted to blog about the PbD history and features of Senzing.
For over a decade we have been building Privacy by Design into the Senzing technology, and we’re starting to see great examples of it in the wild, such as its use to modernize voter registration in America. This implementation is effective in no small part thanks to one of our PbD features called Selective Field Hashing. More about this system, and this feature, in this IAPP keynote video “21st Century Voting – The Success of the Electronic Registration Information Center” and this NY Times story “Another Use for A.I.: Finding Millions of Unregistered Voters.”
Background
2005: IBM acquired my Las Vegas-based startup Systems Research & Development for its real-time Entity Resolution technology known as Non-Obvious Relationship Awareness (NORA). IBM renamed NORA and now sells this technology under the brand name IBM InfoSphere Identity Insight. It’s a unique product, and one in use around the world – a product the team and I are very proud of.
2008: The team and I quietly embarked on an ambitious project (code named “G2”) to revolutionize Entity Resolution. Among the many aspirations was a goal to create a real-time, self-learning, self-correcting Entity Resolution system while at the same time baking in as many privacy and civil liberties features as we could.
2011: Following two and a half years in full stealth mode, we announced the existence of the G2 technology on Data Privacy Day in January 2011.
2012: Ann Cavoukian, the creator of Privacy by Design (PbD), and I released a joint paper on June 8, 2012, entitled “Privacy by Design in the Era of Big Data.” In this paper, Ann describes her aspirations for PbD and I enumerate the PbD features we had imagined for our next generation Entity Resolution technology.
2016: The G2 technology and the G2 team form Senzing following a one-of-a-kind IBM spinout. Over the next two years, Senzing remains in stealth, quietly focusing its engineering efforts on ease of use, accuracy and performance.
2018: Senzing, Inc. formally launches its Entity Resolution technology of the same name.
More About PbD in Senzing
I spoke above about the paper I co-authored with Ann Cavoukian and indicated there were a number of privacy and civil liberties features we felt ought to be baked into any Entity Resolution technology (or, at minimum, available for the user to turn on). Here are some details about these features:
- Full Attribution: Every record received is stored with a pointer to its source system and record ID. There are no processes (e.g., merge/purge, data survivorship) whereby some data is discarded. That is because if data is discarded, system-to-system reconciliation audits become problematic. Moreover, if a system incorrectly discards information, it becomes difficult or impossible to correct such historical decisions. Another good reason to maintain Full Attribution is found in none other than the Universal Declaration of Human Rights. It has four articles admonishing against arbitrariness: e.g., the word “arbitrary” appears in Article 9, which reads “No one shall be subjected to arbitrary arrest, detention or exile.” Back to Entity Resolution: if you don’t know where the data came from, how can any resulting action be anything but arbitrary? Full Attribution is baked into Senzing.
- Field Hashing: The ability to perform Entity Resolution on hashed data – data cryptographically altered to be unreadable and hard to reverse. Hashed data helps reduce the risk of unintended disclosure. Senzing has baked in the ability to perform Entity Resolution over hashed fields while still maintaining some fuzzy matching qualities, e.g., Bob versus Robert, and dates of birth with the month and day transposed.
- Data Tethering: Adds, changes and deletes occurring in systems of record must be accounted for in as close to real time as possible. Data currentness is significant, especially if one is making important, difficult-to-reverse decisions that affect people’s freedoms or privileges. For example, if someone is removed from a watch list, how long should they have to wait before their name is cleared in downstream systems? Senzing supports adds, changes and deletes in real time. Among other things, this enables compliance with Right to Be Forgotten obligations that come with privacy regulations such as the E.U.’s General Data Protection Regulation (GDPR).
- False Negative Favoring: In many use cases, when it comes to Entity Resolution, it is far preferable from a civil liberties standpoint to miss a few things (false negatives) than to inadvertently make claims that are not true (false positives). This is because false positives can adversely affect people’s lives, e.g., the police find themselves knocking down the wrong door or an innocent passenger is denied the ability to board a plane. Senzing, by design, uses false negative-favoring algorithms (though this can be loosened some, as appropriate, e.g., for use cases in marketing or human-in-the-loop investigations).
- Self-Correcting False Positives: Imagine making an assertion that two people are the same because they share exactly the same name, address and home phone number, only later to learn that these are really two different people (a junior and a senior). Senzing, by design, can self-correct these rare cases in real time.
- Information Transfer Accounting: Record-level information transfers should be recorded at the originating system. This allows stakeholders (consumers, data custodians, oversight bodies, etc.) to determine exactly how data is flowing. If source systems don’t track which records were sent where, they will be unable to ensure future changes and deletes are properly relayed downstream (aka Data Tethering). A good example of this in practice is the Inquiries section on US consumer reports as mandated by the Fair Credit Reporting Act (FCRA). The Inquiries section allows consumers to review how their credit file has been shared. This PbD concept is not built into Senzing because it is best deployed during system integration.
- Tamper-Resistant Audit Logs: Tamper-resistant logs make it possible to audit user behavior with confidence – even the database administrator cannot alter the evidence contained in this audit log. This is particularly important to address search abuse, i.e., privileged users looking up records without a legitimate business purpose, such as an employee taking a peek into their roommate’s file. This PbD concept is not built into Senzing as it is best deployed today via a widely available immutable logging mechanism, e.g., blockchain.
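To make the Field Hashing idea from the list above a bit more concrete, here is a minimal Python sketch of one way fuzzy matching can survive hashing. It is an illustration under stated assumptions, not Senzing's actual implementation: the nickname table, the function names, and the choice of HMAC-SHA256 are all hypothetical. Because hashes of similar inputs are not themselves similar, the sketch normalizes values and generates variants before hashing, so that two records can be compared by checking whether they share a token.

```python
import hashlib
import hmac

# Hypothetical nickname table (an assumption for this sketch);
# a real deployment would need a far larger one.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}


def hash_token(secret_key: bytes, field: str, value: str) -> str:
    """Keyed hash (HMAC-SHA256) of one normalized value, scoped to its field."""
    message = f"{field}:{value}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def name_tokens(secret_key: bytes, first_name: str) -> set:
    """Map nicknames to a canonical form before hashing."""
    cleaned = first_name.strip().lower()
    canonical = NICKNAMES.get(cleaned, cleaned)
    return {hash_token(secret_key, "first_name", canonical)}


def dob_tokens(secret_key: bytes, year: int, month: int, day: int) -> set:
    """Hash the date of birth plus its month/day-swapped variant."""
    variants = {(year, month, day)}
    if day <= 12:  # the swap only yields a plausible date if day can be a month
        variants.add((year, day, month))
    return {
        hash_token(secret_key, "dob", f"{y:04d}-{m:02d}-{d:02d}")
        for y, m, d in variants
    }


def shares_token(tokens_a: set, tokens_b: set) -> bool:
    """Two hashed fields are candidate matches if any token is shared."""
    return bool(tokens_a & tokens_b)
```

For example, with a shared secret key, `name_tokens(key, "Bob")` and `name_tokens(key, "Robert")` share a token, as do `dob_tokens(key, 1980, 3, 7)` and `dob_tokens(key, 1980, 7, 3)`, while the clear-text values never need to leave the source system. Using a keyed hash rather than a plain one is a design choice in this sketch: it makes dictionary attacks on low-entropy fields such as names and birth dates harder for anyone who does not hold the key.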
Proudly, as I write this, I believe Senzing may have more baked-in privacy and civil liberties-enhancing features than any other commercially available Entity Resolution software. I could be wrong; if so, do tell!