A recent experience in the workplace has prompted me to look at how we handle data, especially when it’s received from multiple sources and then compiled into one database. My context is English language teaching, and all data pertains to enrolled students in a language school.
It’s worth pointing out that if the data we use needs to be updated and shared with other parties, using a live master document, preferably a spreadsheet, is the best way forward. Making copies every time there is a need to make changes to our data source will lead to confusion, especially when we learn colleagues are not working from the most recent edition.
However, it’s not just keeping to one master data source that matters, it’s also the quality of the data we receive and add to any existing database in a spreadsheet. Once in place, we can manipulate our data to summarise key information which is pertinent to our needs.
In this example, we want to know how many Polish students are enrolled. We can find this out by applying a simple function: =COUNTIF(A2:A10,”Polish”):
In this instance, it’s obvious there’s an inaccuracy with how the spreadsheet function is reading the data, but this wouldn’t be so obvious if the data source was bigger. Data which can’t be read properly will be missed.
A recent experience, where data was provided from multiple sources, underlines the need to have clean usable data before adding it to your existing database. If data is copied and pasted from other sources, it may have leading, trailing, and extra spaces between words (as in the example above), unwanted line breaks, non-printable characters, as well as irregular fonts and capitalisation. If the data derives from another spreadsheet, it may contain formulas and functions, conditional formatting and data validation, comments and notes, and irregular fill colours, borders, and merging.
Although some of the above won’t affect the readability of your data it will look sloppy and disorganised, so it makes sense to get it into a neat and quantifiable format.
We can do it piecemeal by selecting a given range of cells and choosing an appropriate alignment, font, size, and colour, etc. as well as using set functions to clean and trim our data, but this can be slow work. Also, if your data contains dates, you run the risk of resetting the number format in some spreadsheet programs. In this case, any dates will annoying become a string of seemingly random numbers. You can fix this, but again, it takes time.
From my own recent experience, I wasted too much time, being stumped as to why my formulas and functions weren’t working properly not to mention the amount of time spent tidying up data sent through from different sources.
With that in mind, I have built this macro enabled spreadsheet which is designed to remove:
- Formulas and functions.
- Leading, trailing, and extra spaces between words.
- Line breaks.
- Non-printable characters.
- Conditional formatting and data validation.
- Irregular fonts, fill colours and capitalisation.
- Irregular borders, alignment, and merging.
- Comments and notes.
Open the spreadsheet and type/paste your data into the grid. When you are ready, click the Clean button and let the program do its work.
As you can see from the example below, there’s a multitude of sins with this data set, however, once the button has been clicked, the program will get to work and clean it.
The program does take some time, so even though it may look as though it’s stuck, it’s hard at work doing its job. A message box will prompt you once it’s done.
You can download the Data cleaner tool by clicking here: Data Cleaner tool
The Data cleaner tool was built using Excel and is macro-enabled. If you are using a different spreadsheet program, you might find the tool does not work. Also, as it has been built by me, an English teacher, and not a programmer, there’s probably glitches (let’s face it, there almost certainly will be). So if anyone does find any problems, please let me know and I’ll try and fix them.
Teaching online? Need a fun activity? If yes, you might be interested in this: Board game generator: Virtual teaching (world capital cities)
[…] I recently wrote a blog post on how we should handle data and why quality matters. My context is academic management and if you would like read more about my data problem, and download the Data cleaner tool I built, click here: Data cleaner tool […]
[…] Data cleaner tool […]
[…] Data cleaner tool […]
[…] Working with data? You might find this handy tool useful: Data cleaner tool […]