By Jeff Jonas, published January 29, 2019

Some organizations are over-cleaning their data, which makes some systems less intelligent.

To be clear: By bad data I am not talking about a date placed in the name field or a phone field containing the phrase “who put the ear muffs on the cookie?”

By bad data I mean natural variability. For example, the month and day in the date are transposed, the street address is missing the word Avenue, or the name Marek is misspelled Mark.

In the case of Marek vs. Mark, keeping both is helpful for learning that Marek may sometimes be misheard and recorded as Mark. Alternatively, maybe Marek has decided to take the nickname Mark. Unless one keeps both values, how would a system come to learn this?
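The idea above can be sketched in a few lines. This is a hypothetical illustration, not any particular product’s implementation: every time two records believed to describe the same person carry different name spellings, remember the pair instead of cleansing one value away.

```python
from collections import defaultdict

class NameVariants:
    """Learns name variants from observed record pairs.

    Illustrative sketch: rather than "fixing" Marek to Mark (or vice
    versa), keep both values and record that they co-occur, so future
    matching can treat the spellings as related.
    """

    def __init__(self):
        self._seen = defaultdict(set)

    def observe(self, name_a, name_b):
        # Keep both values; neither is cleansed away.
        a, b = name_a.lower(), name_b.lower()
        if a != b:
            self._seen[a].add(b)
            self._seen[b].add(a)

    def variants(self, name):
        # All spellings previously seen alongside this one.
        return sorted(self._seen[name.lower()])

nv = NameVariants()
nv.observe("Marek", "Mark")  # misheard and recorded as Mark
print(nv.variants("Mark"))   # ['marek']
```

A system built this way gets smarter with each observed discrepancy; a system that cleanses first never sees the discrepancy at all.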

Benefiting from bad data is something you have likely experienced personally: every time Google responds to a search with “did you mean this?” you are witnessing a great example of “bad data good.” Google’s suggestions aren’t coming from a dictionary. Rather, Google remembers everyone’s errors and misspellings. If Google didn’t remember this natural variability, it would not be so smart.
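The mechanism can be sketched simply. This is an assumed, toy version of “did you mean” (the function names and the retype signal are my invention, not Google’s actual design): remember every time a user retypes a query, then suggest the correction users most often made.

```python
from collections import Counter, defaultdict

# Map each "bad" query to a tally of what users retyped it as.
corrections = defaultdict(Counter)

def observe_retype(original_query, corrected_query):
    # A user typed one thing, then immediately retyped it another way:
    # remember the pair instead of throwing the "bad" query away.
    corrections[original_query][corrected_query] += 1

def did_you_mean(query):
    # Suggest the most common correction observed for this exact query,
    # or None if no one has ever corrected it.
    seen = corrections.get(query)
    return seen.most_common(1)[0][0] if seen else None

observe_retype("restarant", "restaurant")
observe_retype("restarant", "restaurant")
print(did_you_mean("restarant"))  # restaurant
```

Note what powers the suggestion: the stored misspellings themselves. Delete the “bad” queries and the feature disappears.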

On a personal note, I have a very funny bad data story as my youngest son has two dates of birth. It’s another “bad data good” story. Ask me someday.

More importantly, if you’re trying to catch clever bad guys, don’t polish every piece of data to perfection first (aka data cleansing) or you’ll reduce your chances of catching the clever bastards. More about this here.