Data cleaner tool

A recent experience in the workplace has prompted me to look at how we handle data, especially when it’s received from multiple sources and then compiled into one database. My context is English language teaching, and all data pertains to enrolled students in a language school.

It’s worth pointing out that if the data we use needs to be updated and shared with other parties, using a live master document, preferably a spreadsheet, is the best way forward. Making copies every time there is a need to make changes to our data source will lead to confusion, especially when we learn colleagues are not working from the most recent edition.

However, it’s not just keeping to one master data source that matters, it’s also the quality of the data we receive and add to any existing database in a spreadsheet. Once in place, we can manipulate our data to summarise key information which is pertinent to our needs.

In this example, we want to know how many Polish students are enrolled. We can find this out by applying a simple function: =COUNTIF(A2:A10,”Polish”):

There are two Polish students listed but the function only counts one.
As you can see, an extra trailing space means that the first entry in the list is not counted.
Once the trailing space is removed, the function is able to count properly.

In this instance, it’s obvious there’s an inaccuracy with how the spreadsheet function is reading the data, but this wouldn’t be so obvious if the data source was bigger. Data which can’t be read properly will be missed.

A recent experience, where data was provided from multiple sources, underlines the need to have clean usable data before adding it to your existing database. If data is copied and pasted from other sources, it may have leading, trailing, and extra spaces between words (as in the example above), unwanted line breaks, non-printable characters, as well as irregular fonts and capitalisation. If the data derives from another spreadsheet, it may contain formulas and functions, conditional formatting and data validation, comments and notes, and irregular fill colours, borders, and merging.

Although some of the above won’t affect the readability of your data it will look sloppy and disorganised, so it makes sense to get it into a neat and quantifiable format.

We can do it piecemeal by selecting a given range of cells and choosing an appropriate alignment, font, size, and colour, etc. as well as using set functions to clean and trim our data, but this can be slow work. Also, if your data contains dates, you run the risk of resetting the number format in some spreadsheet programs. In this case, any dates will annoying become a string of seemingly random numbers. You can fix this, but again, it takes time.

From my own recent experience, I wasted too much time, being stumped as to why my formulas and functions weren’t working properly not to mention the amount of time spent tidying up data sent through from different sources.

With that in mind, I have built this macro enabled spreadsheet which is designed to remove:

  • Formulas and functions.
  • Leading, trailing, and extra spaces between words.
  • Line breaks.
  • Non-printable characters.
  • Conditional formatting and data validation.
  • Irregular fonts, fill colours and capitalisation.
  • Irregular borders, alignment, and merging.
  • Comments and notes.

Open the spreadsheet and type/paste your data into the grid. When you are ready, click the Clean button and let the program do its work.

As you can see from the example below, there’s a multitude of sins with this data set, however, once the button has been clicked, the program will get to work and clean it.

From this…
To this.

The program does take some time, so even though it may look as though it’s stuck, it’s hard at work doing its job. A message box will prompt you once it’s done.

You can download the Data cleaner tool by clicking here: Data Cleaner tool

The Data cleaner tool was built using Excel and is macro-enabled. If you are using a different spreadsheet program, you might find the tool does not work. Also, as it has been built by me, an English teacher, and not a programmer, there’s probably glitches (let’s face it, there almost certainly will be). So if anyone does find any problems, please let me know and I’ll try and fix them.

Teaching online? Need a fun activity? If yes, you might be interested in this: Board game generator: Virtual teaching (world capital cities)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s