Structuring Unstructured Data
When asked about unstructured data, this is all I have to say:
“Unstructured data is only useful if structure can be extracted from it.”
Let me explain: A picture taken in pitch black without a flash is useless as it contains no discernible features. The mobile phone call that suddenly goes bonkers and becomes all garbled is equally useless as there is no way to extract meaning from the noise.
On the other hand, a parking garage video has the potential to be much more useful because license plate reading software can extract plate numbers.
The principle that observations are only useful if features can be extracted from them has helped me simplify system architectures:
Observe -> Feature Extract -> Contextualize -> Decide -> Act
When an observation arrives pre-structured (e.g., a database transaction), the Feature Extract step is skipped. Because all inputs to the Contextualize step are structured, contextualization can be streamlined: it is indifferent to the nature of the original observation (structured or unstructured).
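To make the flow concrete, here is a minimal sketch of the five steps in Python. Everything in it is an assumption made for illustration: the `Observation` class, the function names, the `fake_plate_reader` stand-in for real license plate software, and the toy gate-opening decision are not taken from any particular system. The point it shows is that only the Feature Extract step cares whether the input was structured; everything after it sees the same kind of data.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Features = Dict[str, Any]


@dataclass
class Observation:
    payload: Any
    structured: bool  # True when the payload is already a dict of features


def extract_features(obs: Observation, extractor: Callable[[Any], Features]) -> Features:
    """Feature Extract: skipped when the observation arrives pre-structured."""
    if obs.structured:
        return dict(obs.payload)
    return extractor(obs.payload)


def contextualize(features: Features, known_plates: Dict[str, str]) -> Features:
    """Contextualize: attach what is already known about the plate, if anything."""
    owner = known_plates.get(features.get("plate"))
    return {**features, "owner": owner}


def decide(context: Features) -> str:
    """Decide: a toy rule, open the gate only for known owners."""
    return "open_gate" if context.get("owner") else "alert_attendant"


def act(decision: str) -> str:
    """Act: here we only report the decision."""
    return f"action taken: {decision}"


def fake_plate_reader(frame: str) -> Features:
    # Hypothetical stand-in for plate-reading software: the "video frame"
    # is just a string of the form "PLATE:<number>".
    return {"plate": frame.split(":", 1)[1]}


def run_pipeline(obs: Observation,
                 extractor: Callable[[Any], Features],
                 known_plates: Dict[str, str]) -> str:
    features = extract_features(obs, extractor)
    return act(decide(contextualize(features, known_plates)))


known = {"ABC123": "Alice"}
video = Observation(payload="PLATE:ABC123", structured=False)
record = Observation(payload={"plate": "XYZ999"}, structured=True)

print(run_pipeline(video, fake_plate_reader, known))
print(run_pipeline(record, fake_plate_reader, known))
```

Note that `contextualize`, `decide`, and `act` never look at the `structured` flag, which is the simplification the architecture is after.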
Some common feature extraction algorithms you may have heard of:
- Optical character recognition e.g., converting a picture of words into a text document
- Object recognition e.g., detecting pictures of cats
- Facial recognition e.g., unlocking the iPhone X without a passcode
- Acoustic fingerprinting e.g., detecting an artist/song based on a small audio sample
- Named entity recognition e.g., suggesting a new contact based on an email’s contents
Unfortunately, commercially available feature extraction technology has a long way to go. The error rates are often just too high, and downstream processes (e.g., entity resolution) suffer as a consequence. Technology breakthroughs in unstructured feature extraction are much needed. I keep waiting. Come on already.
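A rough back-of-the-envelope sketch shows why small extraction errors hurt so much downstream. The numbers here are assumptions chosen purely for illustration, not measurements of any product: a 7-character license plate, and per-character errors that are independent of each other. Under those assumptions, a reader that gets 98% of characters right gets the whole plate right only about 87% of the time, and a process like entity resolution needs the whole value.

```python
def whole_value_accuracy(per_char_accuracy: float, length: int) -> float:
    """Chance that every character of a value is extracted correctly,
    assuming (for illustration) independent per-character errors."""
    return per_char_accuracy ** length


PLATE_LENGTH = 7  # assumed plate length, illustration only

for acc in (0.99, 0.98, 0.95):
    whole = whole_value_accuracy(acc, PLATE_LENGTH)
    print(f"per-character {acc:.0%} -> whole plate {whole:.1%}")
```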