The Importance of Clean Data – Introducing ‘spotless data’

Data is collected at so many touchpoints in today's digital world that it is no wonder many businesses are drowning under its weight. The problems with Big Data are knowing what you have, understanding how the elements of information link together and, most importantly, working out how the data can help you manage and drive your business or service.

So where to start? Well, it's quite simple in theory: gather all the data together, sort it into formats that can be analysed and pull out the information you want. That's the theory, but the practice is not at all simple. The data comes in many different formats, and the information may be incomplete, inaccurate or, in some instances, just plain wrong. This is where the experts come in: it's time to introduce spotless data.

spotless data provides a web service where dirty data is transformed into clean data by a combination of automatic and manual data cleansing. spotless data runs extensive validation processes, including matching data against a known reference set, full regular expression validation, and validation that CSV files are well formed and UTF-8 encoded. The spotless data systems can be used to check that fields are well formed and can also manage manually entered data.
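To make that last kind of structural check concrete, here is a minimal sketch in Python of the idea: confirm a file is valid UTF-8 and that every CSV row has the same number of fields as the header. The function and its details are illustrative only, not spotless data's actual code.

```python
import csv

def validate_csv(path, expected_columns=None):
    """Return a list of problems found; an empty list means the file passed."""
    problems = []
    try:
        # Opening with an explicit encoding makes invalid UTF-8 raise an error.
        with open(path, encoding="utf-8", newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is None:
                return ["file is empty"]
            width = expected_columns or len(header)
            for line_no, row in enumerate(reader, start=2):
                if len(row) != width:
                    problems.append(f"line {line_no}: expected {width} fields, got {len(row)}")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8: {exc}")
    return problems
```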

At spotless data we are passionate about clean data and we specialise in many areas of data cleansing. Here are some more details of the key methods we use:

1. Data cleansing with regular expressions

One of the most common problems with free-form data entry is that the data is not submitted in a standard form. This makes it hard to identify duplicated records and even harder to integrate data from a number of different sources.

At spotless data, we use regular expressions to validate that a particular record is in the right format; if it isn't, we automatically cleanse the data to put it into that format.
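As an illustration, the sketch below validates a date field against an ISO YYYY-MM-DD pattern and automatically rewrites the common DD/MM/YYYY variant. The formats and names here are invented for the example rather than taken from spotless data's actual rule set.

```python
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
UK_DATE = re.compile(r"^(\d{2})/(\d{2})/(\d{4})$")

def cleanse_date(value):
    """Validate a date string, rewriting known variants into ISO format."""
    value = value.strip()
    if ISO_DATE.match(value):
        return value                    # already in the standard form
    m = UK_DATE.match(value)
    if m:
        day, month, year = m.groups()
        return f"{year}-{month}-{day}"  # rewrite DD/MM/YYYY automatically
    raise ValueError(f"unrecognised date format: {value!r}")

print(cleanse_date("25/12/2023"))  # -> 2023-12-25
```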

2. Data cleansing against reference datasets

Frequently, data can be checked by seeing if it exists in another database. For example, countries, cities, and streets are all well defined, as are car registration plates and popular products. However, when getting people to enter information into online forms, or when combining data from two different databases, different practices can frequently lead to different spellings, and to data which looks different even though it should be the same.

At spotless data, we correct these problems by comparing data to a reference dataset. Incoming values are checked against the reference dataset; anything not present can be set to the closest match (to correct for typos) or even added to the reference dataset (if it is genuinely missing).
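The sketch below shows the closest-match idea using Python's standard difflib; the country list is purely illustrative, and spotless data's own matcher may work differently.

```python
from difflib import get_close_matches

REFERENCE = {"United Kingdom", "United States", "Germany", "France", "Spain"}

def match_to_reference(value, reference=REFERENCE, add_missing=False):
    """Return the reference value for `value`, correcting likely typos."""
    if value in reference:
        return value                                   # exact hit
    close = get_close_matches(value, reference, n=1, cutoff=0.8)
    if close:
        return close[0]                                # closest match fixes a typo
    if add_missing:
        reference.add(value)                           # extend the reference set
        return value
    raise ValueError(f"{value!r} not found in reference dataset")

print(match_to_reference("Untied Kingdom"))  # -> United Kingdom
```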

3. Data integration and natural key validation

When you're integrating multiple datasets you need to be able to match them using a specific field: either a primary key or, more frequently, what we call a natural key. Unfortunately, it is quite common for such natural keys not to match as reliably as we would like, or for database primary keys to mismatch between different systems.

At spotless data, one of our most important types of validation rule is the reference rule, which can be used to validate that a field's value exists in a separate file. For example, if you've got a list of records from one system with unique keys like CPACK222555, PACK32256263, CPACK 2852080 and so on, you can validate that each of these keys exists in a master reference list.
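A simple sketch of that reference rule is shown below: every key in an incoming file is checked against a master list, and the misses are collected for the kind of exception report described next. The key lists are hypothetical.

```python
def missing_keys(incoming, master):
    """Return the keys from `incoming` that are absent from `master`."""
    master_set = set(master)            # set lookup keeps the check fast
    return [key for key in incoming if key not in master_set]

incoming_keys = ["CPACK222555", "PACK32256263", "CPACK 2852080"]
master_keys = ["CPACK222555", "CPACK2852080", "PACK32256263"]

for key in missing_keys(incoming_keys, master_keys):
    print(f"exception: {key!r} not found in master reference list")
# -> exception: 'CPACK 2852080' (a stray space is enough to break the match)
```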

With spotless data you get an exception report every time a job is completed, so you can easily report the problems to the owner of the system providing the bad data and get them fixed.

If your business has data that is going unused because of quality, format or cleanliness issues, and you know that it contains invaluable business development information, get in touch. We can help you sort it out using spotless data.