Bad Data Good
By Jeff Jonas, published January 29, 2019
Some organizations are over-cleaning their data, which makes some systems less intelligent.
To be clear: By bad data I am not talking about a date placed in the name field or a phone field containing the phrase “who put the ear muffs on the cookie?”
By bad data I mean natural variability. For example, the month and day in the date are transposed, the street address is missing the word Avenue, or the name Marek is misspelled Mark.
In the case of Marek vs. Mark, keeping both is helpful for learning that Marek may sometimes be misheard and recorded as Mark. Alternatively, maybe Marek has decided to take the nickname Mark. Unless one keeps both values, how would a system come to learn this?
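One way to picture "keeping both values" is a store that records every observed spelling for an entity instead of overwriting the old one. The sketch below is only an illustration under assumptions: the `VariantStore` class, its method names, and the `person-1` / `person-2` identifiers are hypothetical and do not come from any particular product.

```python
from collections import defaultdict


class VariantStore:
    """Keeps every observed value for an entity's field instead of overwriting it."""

    def __init__(self):
        # entity_id -> field -> {normalized value: times seen}
        self._seen = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def observe(self, entity_id, field, value):
        """Record one observation without discarding earlier, different values."""
        self._seen[entity_id][field][value.strip().lower()] += 1

    def variants(self, entity_id, field):
        """Return all values seen for this entity and field, most frequent first."""
        counts = self._seen[entity_id][field]
        return sorted(counts, key=lambda v: (-counts[v], v))

    def candidates(self, field, value):
        """Return the entities for which this value has ever been observed."""
        key = value.strip().lower()
        return sorted(
            eid for eid, fields in self._seen.items() if key in fields.get(field, {})
        )


# Hypothetical records: the same person is recorded as Marek twice and Mark once.
store = VariantStore()
store.observe("person-1", "first_name", "Marek")
store.observe("person-1", "first_name", "Marek")
store.observe("person-1", "first_name", "Mark")
store.observe("person-2", "first_name", "Mark")

marek_variants = store.variants("person-1", "first_name")
mark_candidates = store.candidates("first_name", "Mark")
```

Because the "Mark" observation was kept rather than cleaned away, a later lookup for Mark still surfaces the person usually recorded as Marek. A cleansing step that rewrote every value to a single canonical spelling would have removed that link.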
Benefitting from bad data is something you have likely experienced personally: Every time Google responds to a search with “did you mean this?” you are witnessing a great example of “bad data good.” Google’s suggestions aren’t coming from a dictionary. Rather, Google remembers everyone’s errors and misspellings. If Google didn’t remember this natural variability it would not be so smart.
On a personal note, I have a very funny bad daddy story as my youngest son has two dates of birth. It’s another “bad data good” story. Ask me someday.
More importantly, if you’re trying to catch clever bad guys, don’t polish every piece of data to perfection first (aka data cleansing) or you’ll reduce your chances of catching the clever bastards. More about this here.