Why Federated Search Bites (for Enterprise Discovery)
By Jeff Jonas, published October 1, 2018
By "federated search" I mean individually searching each system in an attempt to locate records meeting certain criteria, e.g., records about "Liz Reston." Federated search is like going to the library for a specific book and, instead of searching for its location in the index, looking for the book in every aisle until you locate it.
Federated search is less accurate and less efficient than using an index, regardless of whether federated search is manually implemented (i.e., a human searches each system) or an automated process (i.e., a machine searches each system).
To illustrate, imagine searching for the information "Liz Reston, 123 E Court Rd, reston@home.com" when these seven records exist, each one in a different source system:
1. Liz Reston, 123 E Court Rd, reston@home.com
2. Elizabeth Reston, reston@home.com, (202) 762-1401
3. Beth Reston, (202) 762-1401, beth@old-email.com
4. Beth Smith-Reston, beth@old-email.com, 444 Fourth St
5. Lizzy Smith, 444 Fourth St, beth@work.com
6. Bob Reston, reston@home.com
7. reston@home.com
First, note two immediate problems:
- Federated search won't find records 3, 4, and 5 because none of their fields match the search criteria.
- Federated search will most likely include record 7, which is risky because this record could just as easily be Bob's (record 6).
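These two problems can be reproduced with a minimal sketch. It assumes each source system matches a record only when a queried field is exactly equal to a stored field. That matching rule, the field names, and the `federated_search` helper are illustrative assumptions, not a description of any particular system.

```python
# Toy model of federated search: each "system" holds one record, and a
# search hits a record only when some queried field matches exactly.
# Field names and the exact-match rule are assumptions for illustration.
RECORDS = {
    1: {"name": "Liz Reston", "address": "123 E Court Rd", "email": "reston@home.com"},
    2: {"name": "Elizabeth Reston", "email": "reston@home.com", "phone": "(202) 762-1401"},
    3: {"name": "Beth Reston", "phone": "(202) 762-1401", "email": "beth@old-email.com"},
    4: {"name": "Beth Smith-Reston", "email": "beth@old-email.com", "address": "444 Fourth St"},
    5: {"name": "Lizzy Smith", "address": "444 Fourth St", "email": "beth@work.com"},
    6: {"name": "Bob Reston", "email": "reston@home.com"},
    7: {"email": "reston@home.com"},
}

QUERY = {"name": "Liz Reston", "address": "123 E Court Rd", "email": "reston@home.com"}


def federated_search(query, records):
    """Return {record number: sorted list of query fields that matched exactly}."""
    hits = {}
    for number, record in records.items():
        matched = sorted(f for f, v in query.items() if record.get(f) == v)
        if matched:
            hits[number] = matched
    return hits


hits = federated_search(QUERY, RECORDS)
missed = sorted(set(RECORDS) - set(hits))
print("hits:", hits)
print("missed:", missed)
```

Under this assumption, records 3, 4, and 5 are never returned, while record 7 comes back on the shared email alone, with nothing to say whether it is Liz's or Bob's.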
Federated search, whether conducted manually or implemented with automation, is deficient.
Deficiencies of Manual Federated Search
- Volume: searching every system, every time, is time-consuming and challenging, especially when there are dozens or hundreds of different systems (e.g., will the person searching remember to search the payroll database?).
- Variation: the person searching is unlikely to remember to search for every possible variation (e.g., Elizabeth, Beth, Liz or the many spellings of Muhammed including Mhd).
- Variability: the person searching is unlikely to try dates of birth with month and day transposed (a common data quality problem) or natural variability in addresses such as 123 E Court Rd vs. 123 East Court Road.
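The variation and variability items above are the kind of thing automation can normalize before comparing values. The sketch below is only a rough illustration: the nickname and address-abbreviation tables are tiny hypothetical stand-ins, and the sample dates in the usage lines are made up.

```python
import re
from datetime import date

# Hypothetical lookup tables; real name and address variation needs far
# larger tables (or a dedicated library) than these few entries.
NICKNAMES = {"liz": "elizabeth", "beth": "elizabeth", "lizzy": "elizabeth"}
ADDRESS_WORDS = {"e": "east", "rd": "road", "st": "street"}


def normalize_given_name(name):
    """Map the first word of a name to a canonical given name."""
    first = name.strip().split()[0].lower()
    return NICKNAMES.get(first, first)


def normalize_address(address):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    words = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ADDRESS_WORDS.get(w, w) for w in words)


def date_variants(d):
    """Return the date as given plus its month/day-transposed form, if valid."""
    variants = {d}
    if d.day <= 12:  # the day can only become a month if it is 1..12
        variants.add(date(d.year, d.day, d.month))
    return variants


print(normalize_given_name("Liz Reston") == normalize_given_name("Elizabeth Reston"))
print(normalize_address("123 E Court Rd") == normalize_address("123 East Court Road"))
print(date_variants(date(1980, 3, 4)))
```

Normalizing both the query and the stored values this way lets "123 E Court Rd" and "123 East Court Road" compare as equal, which a person searching by hand is unlikely to do consistently.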
While, in theory, automated search could remedy the above deficiencies, there are other serious issues with federated search not easily solved even with automation.
Deficiencies of Automated Federated Search
- Constraints: legacy systems often don't provide an efficient means to search by address, phone, email, etc. For example, a payroll system may be optimized to allow searching only by employee number, name, date of birth, and tax ID, but not by email or phone. As a result, automated search may have to scan entire databases record by record, which is exceptionally slow.
- Completeness: if only a name, address, and email are available, how will automated search find records about the same person if records lack those fields (such as in records 4 and 5 above)?
- Commingling: just because records look alike doesn't mean they are alike. What if you find a matching record based on an email address that was periodically shared by a husband and wife (as noted earlier with regard to record 7)? Knowing the email has been used by both is essential to understanding who is who in your data.
- Contamination: many systems write searches to audit logs, which means every search creates more copies of personal data. From a privacy compliance perspective, this is a nightmare. The first time a person asks you for the data you hold on them (e.g., under GDPR), there are only a few records, but the second time they ask, there are hundreds of new instances of their data (due to meticulously logged searches)!
The simple remedy to address these problems is to use entity resolution to create an index that turns the above seven records into the following entity-resolved graph:
Entity #1 contains ...
1. Liz Reston, 123 E Court Rd, reston@home.com
2. Elizabeth Reston, reston@home.com, (202) 762-1401
3. Beth Reston, (202) 762-1401, beth@old-email.com
4. Beth Smith-Reston, beth@old-email.com, 444 Fourth St
5. Lizzy Smith, 444 Fourth St, beth@work.com
And points to Entity #3 (below) as a possible match.
Entity #2 contains ...
6. Bob Reston, reston@home.com
And points to Entity #3 (below) as a possible match.
Entity #3 contains ...
7. reston@home.com
And points to Entity #1 and #2 (above) as possible matches.
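To make the grouping above concrete, here is a toy sketch that reproduces it. The rule is an assumption chosen for illustration only: two records resolve to the same entity when they share an identifier (email, phone, or address) and have compatible names, and a record with no name that shares an identifier becomes a possible match. Real entity resolution engines weigh far more evidence; the nickname table and the `compare` and `resolve` helpers are hypothetical.

```python
from itertools import combinations

# Each record: (name or None, set of identifiers such as email/phone/address).
RECORDS = {
    1: ("Liz Reston", {"123 E Court Rd", "reston@home.com"}),
    2: ("Elizabeth Reston", {"reston@home.com", "(202) 762-1401"}),
    3: ("Beth Reston", {"(202) 762-1401", "beth@old-email.com"}),
    4: ("Beth Smith-Reston", {"beth@old-email.com", "444 Fourth St"}),
    5: ("Lizzy Smith", {"444 Fourth St", "beth@work.com"}),
    6: ("Bob Reston", {"reston@home.com"}),
    7: (None, {"reston@home.com"}),
}

# Hypothetical nickname table (assumption, not from the article).
NICKNAMES = {"liz": "elizabeth", "beth": "elizabeth", "lizzy": "elizabeth"}


def name_key(name):
    """Split a name into (canonical given name, set of surname tokens)."""
    given, _, surname = name.lower().partition(" ")
    return NICKNAMES.get(given, given), set(surname.split("-"))


def compare(a, b):
    """Classify a pair of records as 'same', 'possible', or None."""
    (name_a, ids_a), (name_b, ids_b) = a, b
    if not ids_a & ids_b:
        return None            # no shared identifier at all
    if name_a is None or name_b is None:
        return "possible"      # shared identifier, but no name evidence
    given_a, surnames_a = name_key(name_a)
    given_b, surnames_b = name_key(name_b)
    if given_a == given_b and surnames_a & surnames_b:
        return "same"
    return None                # names conflict (e.g., Liz vs. Bob)


def resolve(records):
    """Group records into entities and collect possible-match links."""
    parent = {n: n for n in records}   # union-find over record numbers

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    pairs = {(a, b): compare(records[a], records[b])
             for a, b in combinations(sorted(records), 2)}
    for (a, b), verdict in pairs.items():
        if verdict == "same":
            parent[find(b)] = find(a)

    entities = {}
    for n in sorted(records):
        entities.setdefault(find(n), []).append(n)
    possible = {tuple(sorted((find(a), find(b))))
                for (a, b), verdict in pairs.items() if verdict == "possible"}
    return entities, possible


entities, possible = resolve(RECORDS)
print(list(entities.values()))
print(sorted(possible))
```

In this sketch, entities are keyed by their lowest record number rather than numbered #1 to #3, and the possible-match links connect record 7's entity to Liz's entity and to Bob's. Note that records 1 and 5 share no field directly; they end up together only because the chain of records between them links them.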
When searching for "Liz Reston, 123 E Court Rd, reston@home.com" against this index, Entity #1 is discovered as the same person, revealing Liz's five records. Entity #3 (record 7) is highlighted as a possible match, allowing the person searching to consider that record more carefully.
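A search against such an index might look like the sketch below. The `INDEX` structure is a hand-written stand-in for the three entities above, and the matching rule (a shared identifier plus a compatible name means "same"; a shared identifier with no name on file means "possible") is an assumption made for illustration, not how any specific product scores matches.

```python
# Hypothetical nickname table (assumption, not from the article).
NICKNAMES = {"liz": "elizabeth", "beth": "elizabeth", "lizzy": "elizabeth"}

# Hand-written stand-in for an entity-resolved index of the seven records.
INDEX = {
    1: {"given": {"elizabeth"}, "surnames": {"reston", "smith"},
        "ids": {"123 E Court Rd", "reston@home.com", "(202) 762-1401",
                "beth@old-email.com", "444 Fourth St", "beth@work.com"},
        "records": [1, 2, 3, 4, 5]},
    2: {"given": {"bob"}, "surnames": {"reston"},
        "ids": {"reston@home.com"}, "records": [6]},
    3: {"given": set(), "surnames": set(),
        "ids": {"reston@home.com"}, "records": [7]},
}


def search(name, ids, index):
    """Return {entity id: 'same' or 'possible'} for one query."""
    given, _, surname = name.lower().partition(" ")
    given = NICKNAMES.get(given, given)
    results = {}
    for entity_id, entity in index.items():
        if not entity["ids"] & ids:
            continue                          # nothing in common
        if not entity["given"]:
            results[entity_id] = "possible"   # shared identifier, no name on file
        elif given in entity["given"] and surname in entity["surnames"]:
            results[entity_id] = "same"
    return results


results = search("Liz Reston", {"123 E Court Rd", "reston@home.com"}, INDEX)
print(results)
print(INDEX[1]["records"])
```

One lookup returns Entity #1 with all five of Liz's records and flags Entity #3 as a possible match, while Entity #2 is left out because the name on file conflicts with the query.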
There are no shortcuts: entity-resolved indexes deliver effective and efficient search, whether an organization is simply trying to improve investigative search (e.g., insider threat, bank fraud, fake identities) or striving to comply with new privacy laws (e.g., GDPR or CCPA).