How To Clean Dirty Data
We Can Clean Your Dirty Data For You
*Data’s ultimate purpose is to drive decisions. But our data isn’t as reliable or accurate as we want to believe. This leads to a most undesirable result: Bad data means bad decisions. As a data professional, part of our mission is to make data “good enough” for use by others. We spend time
scrubbing and cleaning data to make it consumable by other teams. But we didn’t go to school to become a data janitor. And yet, here we are. We understand that all data is dirty, but some data is useful.
The Origins Of Dirty Data
The main reason why data is dirty and often unreliable is simple: human intervention. If not for humans, then data would be clean, perfect, and clear. If you’ve ever played the game “Telephone,” then you understand how humans are awful at relaying even the simplest of data. But it gets worse when you
realize that data today is collected from machines that are programmed by humans. There are assumptions being made by both the programmer and the person using the tool. These assumptions lead to data quality issues. Here are but a few examples to consider:
Your standards change – Standards change all the time. For example, scientific standards have been updated over the years in an effort to improve precision. In November 2018, the standard definition of a kilogram changed. This means that any system using the old standard is
producing wrong calculations. Tools are subject to rounding errors when the standards change. This leads to bad calculations and dirty data.
Your data collections fail – Our collection tools and methods can collect the wrong data, or no data. Or worse, they could have issues with unit conversion. You might see (ms) and assume milliseconds, but it could be microseconds. Just because a collection is automated doesn’t mean
it can be trusted. Humans are still touching the data.
Your data sets are incomplete – You could have a rather large dataset and think, “Jackpot.” But large datasets are often incomplete. They have missing attributes or were put together by someone scraping websites. While the internet provides everyone the ability to access data at
any time and for any need, it does not guarantee that the data is valid.
Time series collections lack context – Time series collections are all the rage these days. In our effort to be DevOps-y, we stream logs to achieve observability and perform analytics for insights. The problem is that this streaming data often lacks context for what is being measured.
Often, the data being measured is changing. The simplest example is retail sales tied to seasons.
You need context with your data. And SysAdmins know that measuring CPU by itself doesn’t have enough context—you need to collect additional metrics to tell the whole story.
All of the above can lead to the following:
- Duplicate data – A single event is recorded and entered into your dataset twice.
- Missing data – Fields that should contain values don’t.
- Invalid data – Information not entered correctly or not maintained.
- Bad data – Typos, transpositions, variations in spelling, or formatting (say hello to my little friend Unicode!)
- Inappropriate data – Data entered in the wrong field. By now you should understand that it is difficult, nay impossible, to determine if data is ever clean.
Methods For Cleaning Dirty Data
Here’s a handful of techniques that you should consider when working with data. Remember, all data is
dirty; you won’t be able to make it perfect. Your focus should be making it “good enough” to pass along
to the next person. The first thing you should do when working with a dataset is to examine the data.
Ask yourself, ‘Does this data make sense?’ Then, before you do anything else, make a copy or backup
of your data before you begin to make the smallest change. I cannot stress this enough. OK, so we’ve
examined the data to see if it makes sense, and we have a copy. Here are a few data cleaning
- Identify and remove duplicate data – Tools such as Excel and PowerBI make this easy. Of
course, you’ll need to know if the data is duplicated, or two independent observations. For
relational databases, we often use primary keys as a way to enforce this uniqueness of records.
But such constraints aren’t available for every system that is logging data.
- Remove data that doesn’t fit – Data entered that doesn’t help you answer the question you are
- Identify and fix issues with spelling, etc. – There are lots of ways to manipulate strings to help
get your data formatted and looking pretty. For example, you could use the TRIM function to
remove spaces from the text in a column, then sort the data and look for things like
capitalization and spelling. There are also regional terms, like calling a sugary beverage “pop” or
- Normalize data – Set a standard for the data. If the data is a number, make sure it is a number.
Often times you will see “three” instead of a 3, or a blank instead of a 0. If the data attribute is
categorical, make sure the entries that apply for that category.
- Remove outliers – But only when it makes sense to do so! If the outlier was due to poo
collection, then it could be safe to remove. Hammond’s Law states that “Ninety percent of the
time, the next measurement will fall outside the 90% confidence interval.” Be mindful that
outliers are innocent until proven guilty.
- Fix missing data – This gets… tricky. You have two options here. Either you remove the record,
or you update the missing value. Yes, this is how we get faux null values. For categorical data, I
suggest you set the data to the word “missing.” For numerical data, set the value to 0, or to the
average of the field. I avoid using faux nulls for any data, unless it makes sense to note the
absence of information collected. Your mileage may vary.
*From an article by Thomas LaRock, in Orangematter, 2019.
Why Choose Northwest Database Services?
Northwest Database Services is a full-spectrum data service who has been performing data migration, data scrubbing, data cleaning, and de-duping data services for databases and mailing lists, for over 34 years.
How To Contact Us
You can contact us by sending an inquiry thought our Contact Us form or call us at +1 360-841-8168.