Data cleaning

Getting started

In this tutorial, we're going to see how we can use the data-cleaning tools in statscloud to clean up a data set and make it ready to run some analyses.

Let's start by looking at this project. Here, we some responses to a survey; the green columns contain some technical information about the questionnaire, the orange columns contain the survey responses, and the yellow columns at the end show some demographic information.

Task: Start a new project in StatsCloud

Open the sample survey data set by clicking here

Take a quick look at the data set and see if you can spot any issues with the data in there (e.g. missing cells, or any values that look a bit out of place).

This data set is actually quite messy; there are a few empty cells from people who haven't been completed the survey, as shown in the 'Complete (%) column. We need to clean this up before we can do any meaningful analyses on it.

Using the Data Summary tool

Sometimes, it's not that easy to spot issues with a data set, and it's not really practical to go through our whole spreadsheet to see if any values look out of place; we need statscloud to do that for us.

In some statistics packages, you'll need to run a 'Descriptives' analysis on each of your columns and then go through the output for each one to identify if there any issues (for instance, if there are any outliers, or if your data is skewed). However, in StatsCloud, it's much quicker and easier to find out this information.

Take a look at bottom of our spreadsheet. Here, you'll see a row of cells that provide a summary statistic for each column. For columns with numeric (interval/ratio) data, this summary statistic shows the mean and, for all other columns, it shows the n (number of observations).

You'll notice in this data set that an awful lot of these cells have warning flags next to them. This tells us that there is something about this column we should be aware of. To take a look at what it is, just click on one of the cells and a data summary for that column will pop up.

The warning flags will either tell us that we have some missing values, that we have some outliers and/or that the data is skewed (that it is not normally distributed). Either way, we need to fix these issues before we go ahead with any analyses, so we're going to do that now.

Creating a filter

The "Filters" tab was designed to help us filter out problematic data in our data sets. When you click on the Filters tab, you can see we have the option to add a new filter. We can try doing that now. The first thing we want to do here is filter out the responses from all those who didn't actually complete the questionnaire. So, we need to set up a filter criteria based on the Complete (%) column. To do this, we just click on the Add new filter button and fill out the inputs with the values we want. Here, we want to exclude if any responses in the Complete (%) column are less than 100, so let's try that.

Task: Create a filter to remove participants who didn't complete the questionnaire

With your project open, go to the Filters tab and add a new filter.

Give this filter a name: Did not complete

Edit this filter so that it excludes any responses in our data set where any value in the column Complete (%) is less than 100.

When we revisit the Data tab, we can see that four rows now have lines through them. This means statscloud will ignore those rows when doing anything that involves them (e.g. creating charts or running analyses). You may also spot that the number column has been expanded to show us which rows are included and which aren't. If you click on any cross, statscloud will open a pop-up menu that explains why that row has been filtered out, and give us the option to turn off any active filters in place.

Crucially, you'll notice that we no longer have a warning flag in the Complete (%) column, and that fixing this column has also fixed other issues (e.g., missing data) with a lot of the other columns. However, we still have a couple of other warning flags, so let's try to get rid of those.

Removing outliers

The next issue we have is with the 'Duration' column. When scanning through this column, we can see we have some slightly odd values. One person finished it in 19 seconds (they probably weren't taking it very seriously) and another person took 1133 seconds (almost 20 minutes) to answer 10 questions! We're probably not going to get very reliable data from either of these people, and these values are most likely outliers. So, let's filter out their data now.

If we go back to the Filters tab and add a new filter for the Duration (sec) column, we can see we have the option to filter our a row if a value if beyond a certain point. Typically, in statistics, we identify outliers as values that deviate a certain amount from the average.

Note: A lot of people detect outliers using the mean and standard deviation, but this is not a good idea. The reason is that, if your data is skewed, the mean won't provide an accurate measure of central tendency because (ironically) that would have been skewed as well. For the same reasons, the standard deviation (which uses the mean in its calculation) isn't reliable either.

Instead, we should use the median as our measure of central tendency (because it's not susceptible to outliers), and use another measure; the ](https://en.wikipedia.org/wiki/Median_absolute_deviation) for the dispersion. The Median Absolute Deviation is much like the standard deviation, except it finds the average (median) deviation from the median so, again, it's not susceptible to outliers.

statscloud sets the default criteria for measuring outliers as 2.5 MADs from the median, following recommendations by ](https://doi.org/10.1016/j.jesp.2013.03.013).

Task: Create a filter to remove outliers from the Duration (sec) column.

Go to the Filters tab and add a new filter.

Give this filter a name: Duration outliers.

Edit this filter so that it excludes any responses in our data set where any value in the column Duration (sec) is beyond 2.5 MADs from the median

If you go back to the Data tab, you should see that a few more rows have lines through them and, most importantly, that the Duraction (secs) column no longer has a warning flag. You may notice too that the value for the 'mean' has changed so that this calculation no longer includes the rows that have been crossed out.

In our data set now, some of the rows are filtered out by more than one filter. For instance, if you click on the cross in the number column for row 6, you'll see from the pop-up that it has been filtered out by both filters; (1) because it wasn't complete and (2) because they didn't spend very long on it.

Filtering with a QQ plot

The next warning in our data set is in the Age column. We have another clear outlier here. If you scan up through the column, you can probably spot it. We have a stray '345' in there. This is clearly a typo and will cause an issue when we report the mean age to be 46.8 in our writeup if that's not the case. How you handle this may depend on your situation; you could either delete this value or, if age is an important factor in your experiment, filter it out.

Task: Create a filter to remove participants above a certain age

Go back to the Filters tab and add a new filter.

Give this filter a name: Age outliers.

In this filter, state that we want to exclude all those with ages above 120 (or another number you think is appropriate).

Filters with multiple criteria

Sometimes you may want to filter out values using more specific criteria. Let's say that, for some reason, we want to filter out all 'boys'. For this, we need to use both the Gender variable and the Age variable.

When you create a new filter, you may have noticed that, once you've created a rule, you can add another by clicking the Add another rule button below. This will mean that, in order to filter a row, the values in that row need to meet multiple criteria. In this case, the person must be both male and under 18.

Task: Create a filter to remove boys

Go back to the Filters tab and create a new filter

Give this filter a name: Exclude boys

In this filter, state that we want to exclude all those with a gender of male and then Add another rule for any age below 18

Merging labels

The last thing we want to do is correct a typo in the dataset. If you look down the Gender column, in row 5 we have a value of "Malle". For this column, we were only expecting two values, but clicking on the data summary cell at the bottom of this column will reveal that we actually have three. Clearly, this value should say "Male", so we want to correct this.

The first, obvious, way of doing this is to simply delete that cell and retype it. Another method is using the 'Replace' tool under the 'Data' menu in the toolbar. This will bring up a popup that will allow us to enter text we want to replace and what we want to replace it with, and will apply to all cells in the spreadsheet. In this situation, that's fine, but this wouldn't be ideal if this value appears in other columns and we don't want to change those.

Another solution is to merge this label with another in the Variables tab. If you view the Gender variable in the the Variables tab, you'll see there are three entries under Values. We want all "Malle" labels to become "Male" labels. So, to do this, just click on the menu icon (the three dots) to the right of the "Malle" value and select "Merge with...". This will bring up a pop-up window asking you to select which label you would like to merge this one with. In this case, we want to select the "Male" label.

Task: Correct the "Malle" typo

Using either of these methods, change the "Malle" value to "Male".

Recap

This tutorial should have helped show you how easy it is to clean up data sets in StatsCloud, and the importance of doing so. here's what we've covered:

How to spot issues easily by clicking the 'Data summary' cell
How to remove add filters that remove values based on criteria we set manually
How to remove outliers automatically using the 'beyond' rule in a filter
How to set up a filter with multiple criteria
How to merge labels to fix typos.

Sometimes, the data we collect isn't quite as nice as the data presented in statistics textbooks; it can be messy like this one. When you come to do your own research, it's highly likely the data you collect will be messy and you'll need to do some data cleaning like this before you make a start on your analyses, but now you should be prepared to do so!