Fantasy Analytics
By Jeff Jonas, published July 24, 2019
It often amazes me what people think is computable given their actual observation space.
Here’s an example conversation:
Me: “Tell me about your company.”
Customer: “We are in the business of moving things through supply chains.”
Me: “What do you want to achieve with analytics?”
Customer: “We want to find bombs in the supply chain.”
Me: “COOL!”
Me: “Tell me about your available observation space.”
Customer: “We have information on the shipper and receiver. We also know the owner of the plane, train, truck, car, etc. and the people who operate these vehicles.”
Me: “Nice. What else do you have?”
Customer: “We have the manifest — a claim made by the sender about the contents.”
Me: “Excellent. What else?”
Customer: “That’s it.”
Me: “WHAT?!”
Me: “YOU ARE NEVER GONNA FIND A BOMB!”
Me: “NO ONE WRITES ‘BOMB’ ON THE MANIFEST!”
The problem is that business objectives (e.g., finding a bomb) are often impossible to achieve given the proposed observation space (data sources).
Unless, in this case, the bad actor writes the word “BOMB” on the manifest. And only idiots do that. Luckily we don’t have to worry much about people who truly don’t know what they’re doing, as they run out of gas on the way to the operation.
When we software engineering folks get overly excited and run off to build systems with little forethought about the balance between the mission objectives and the observation space, there is a risk the finished system will utterly fail on its business objectives.
As I have no interest in spending intense chunks of my life working on pointless projects, when initially scoping a system, I first qualify the available observation space to determine if it is sufficient to deliver on the mission objectives. If the available observation space is insufficient, then I must first figure out if/how the observation space can be appropriately widened.
Here are a few of my best practices:
How to Qualify Observation Spaces
- Have them name their data sources and the data elements (key features).
- Then, even if they say a data source has certain features, go look for yourself. I can’t tell you how many times I’ve taken a look only to find key columns empty or so dirty that the value of the data is negligible.
- If the data sources share common features (e.g., customer number, address, email, phone number), that is generally good, and the more shared features the better.
- If the data sources have few, if any, shared features (e.g., one data source has name and address and the other has stock symbol and stock price), that is generally not good.
- Ask for real examples from the past of things they would like to detect (opportunity or risk), and then look in the real data to see if, upon inspection, they are discoverable. If real examples from the past cannot be detected in the provided data sources, I tell them “not even a sentient being could discover this.”
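As a rough illustration of the "go look yourself" and "shared features" checks above, here is a minimal Python sketch. Everything in it is an assumption for illustration: the source names (crm, orders, quotes), the column names, the toy rows, and the 50% fill threshold are all made up, and real data would need smarter tests for "dirty" values than a blank check.

```python
from itertools import combinations


def fill_rates(rows):
    """Return, per column, the fraction of rows holding a non-blank value."""
    columns = {col for row in rows for col in row}
    total = len(rows)
    rates = {}
    for col in sorted(columns):
        filled = 0
        for row in rows:
            value = row.get(col)
            if value is not None and str(value).strip() != "":
                filled += 1
        rates[col] = filled / total if total else 0.0
    return rates


def shared_features(sources, min_fill=0.5):
    """For each pair of sources, list the columns that are usably
    populated in both (min_fill is an arbitrary, hypothetical cutoff)."""
    usable = {
        name: {col for col, rate in fill_rates(rows).items() if rate >= min_fill}
        for name, rows in sources.items()
    }
    return {
        (a, b): sorted(usable[a] & usable[b])
        for a, b in combinations(sorted(usable), 2)
    }


# Toy, made-up data: three sources described as lists of records.
sources = {
    "crm": [
        {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
        {"name": "Bob Roy", "email": "bob@example.com", "phone": None},
    ],
    "orders": [
        {"email": "ann@example.com", "phone": "", "sku": "A-1"},
        {"email": "", "phone": "", "sku": "B-2"},
    ],
    "quotes": [
        {"symbol": "XYZ", "price": "10.50"},
    ],
}

rates = fill_rates(sources["crm"])
overlap = shared_features(sources)
```

In this toy data both crm and orders claim a phone column, but it is empty in every row, so email is the only feature that actually connects them, and quotes connects to nothing.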
There will be many cases where it becomes necessary to help the customer think about widening their observation space if they want to make their hopes and dreams (business objectives) a reality.
Conjuring up additional data to expand the observation space is quite an art. It requires a real-world understanding of what data flows inside and outside the walls, and how, as well as the legal and policy ramifications.
How to Widen Observation Spaces
- Generally one starts looking for new data sources in this order: (i) other stuff inside the walls that you already collect (e.g., product returns); (ii) collecting more data (e.g., adding a field to a web page so customers can score feedback); and (iii) external data (e.g., marketing flags like “presence of children” and “income indicators” as routinely sold by data aggregators).
- Beware of social media: there is allure to the idea that one can computationally associate social media posts (e.g., Tweets about your company/brand) with the customer who wrote them. Easier said than done. Different kinds of social sites will yield different results.
- If you are trying to catch bad guys, hope that some of the data sources are unknown or non-intuitive to the adversary (if the bad guys know you have cameras on these four streets, then they will take the fifth street).
Now let’s say one has a list of potential new data sources. The next question is how to prioritize all of these possibilities. There are a lot of ways to think about this, but here are a few of the things I look for:
- Data that improves the ability to perform more entity resolutions (e.g., a source that contains new identifiers like email addresses) so that one can discover that two customers are really the same;
- Data that brings more facts (e.g., what, where, when, how many, how much);
- Diverse data potentially containing identifiers and facts in disagreement (e.g., this fact indicates they are here, but that fact shows they were over there), which is helpful in finding lies like identity theft.
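One way to turn the first two prioritization ideas into something mechanical is sketched below in Python. This is only a sketch under stated assumptions: the feature names, the candidate sources, the set of features treated as identifiers, and above all the ranking order (linkable sources first, then most new identifiers, then most new facts) are my illustrative choices, not a prescribed method. It does not attempt the third idea, detecting facts in disagreement.

```python
def assess(current, candidate, identifiers):
    """Describe what a candidate source would add to the current
    observation space, given which features count as identifiers."""
    current, candidate = set(current), set(candidate)
    shared_ids = candidate & current & identifiers
    new = candidate - current
    return {
        # Without a shared identifier there is nothing to join on.
        "linkable": bool(shared_ids),
        # New identifiers help entity resolution.
        "new_identifiers": sorted(new & identifiers),
        # Everything else new is treated as a fact (what, where, when...).
        "new_facts": sorted(new - identifiers),
    }


def rank(current, candidates, identifiers):
    """Order candidates: linkable first, then by count of new
    identifiers, then by count of new facts (an assumed weighting)."""
    assessed = {
        name: assess(current, features, identifiers)
        for name, features in candidates.items()
    }
    return sorted(
        assessed.items(),
        key=lambda item: (
            not item[1]["linkable"],
            -len(item[1]["new_identifiers"]),
            -len(item[1]["new_facts"]),
            item[0],
        ),
    )


# Hypothetical feature inventory, for illustration only.
IDENTIFIERS = {"customer_number", "email", "phone", "address"}
current = {"customer_number", "name", "address"}
candidates = {
    "product_returns": {"customer_number", "return_date", "return_reason"},
    "web_feedback": {"email", "address", "feedback_score"},
    "stock_quotes": {"symbol", "price"},
}

ranking = rank(current, candidates, IDENTIFIERS)
for name, summary in ranking:
    print(name, summary)
```

With these made-up inputs, web_feedback ranks first because it can be linked by address and brings a new identifier (email), while stock_quotes ranks last because it shares nothing with the existing data.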
Finally, don’t forget there will be plenty of times that the mission objectives cannot be achieved because the necessary observation space is not available.
Please consider the above “how to” sections as starter kits, and hack them any which way you like.