When we see data visualized in a graph such as a histogram, we tend to draw some conclusions from it. When data is spread out, or concentrated, or observed to change with other data, we often take that to mean relationships between data sets. Statisticians, though, have to be more rigorous in the way they establish their notions of the nature of a data set, or its relationship with other data sets. Statisticians of any merit depend on test statistics, in addition to visualization, to test any theories or hunches they may have about the data. Usually, normality tests fit into this toolbox.

Normality tests help us understand the chance that any data we have with us may have come from a normal or Gaussian distribution. At the outset, that seems simple enough. However, when you look closer at a Gaussian distribution, you can observe how it has certain specific properties. For instance, there are two main parameters – a location parameter, the mean, and the scale parameter, the standard deviation. Different combinations of this can mean different shapes of distributions. You can therefore have thin and tall normal distributions, or have fat and wide normal distributions.

When you’re checking a data set for normality, it helps to visualize the data too.

## Normal Q-Q Plots

#Generating 10k points of data and arranging them into 100 columns x<-rnorm(10000,10,1) dim(x)<-c(100,100) #Generating a simple normal quantile-quantile plot for the first column #Generating a line for the qqplot qqnorm(x[,1]) qqline (x[,1], col=2)

The code above generates data from a normal distribution (command “rnorm”), reshapes it into a series of columns, and runs what is called a normal quantile-quantile plot (QQ Plot, for short) on the first column.

The Q-Q plot tells us what proportion of the data set (in this case, the first column of variable x), compares with the expected proportion (theoretically) of the normal distribution model based on the sample’s mean and standard deviation. We’re able to do this, because of the normal distribution’s properties. The normal distribution is thicker around the mean, and thinner as you move away from it – specifically, around 68% of the points you can expect to see in normally distributed data will only be 1 standard deviation away from the mean. There are similar metrics for normally distributed data, for 2 and 3 standard deviations (95.4% and 99.7% respectively).

However, as you see, testing a large set of data (such as the 100 columns of data we have here) can quickly become tedious, if we’re using a graphical approach. Then there’s the fact that the graphical approach may not be a rigorous enough evaluation for most statistical analysis situations, where you want to compare multiple sets of data easily. Unsurprisingly, we therefore use test statistics, and normality tests, to assess the data’s normality.

You may ask, what does non-normal data look like in this plot? Here’s an example below, from a binomial distribution, plotted on a Q-Q normal plot.

## Anderson Darling Normality Test

As one of the commonly used normality tests, this is very commonly used to tell us whether or not a sample may represent normally distributed data. This is done in R by using the ad.test() command, in the **“nortest“** package. So, if you don’t have the ad.test command popping up on your R-studio auto-complete, you can easily install it via **nortest** on the “install.packages” command. Running the Anderson-Darling test for normality generally returns a bunch of data. Here’s how to make sense of it.

#Running the A-D test for first column library(nortest) ad.test(x[,1])

The data from the A-D test tells us which data has been tested, and two results: A and p-value.

The A result refers to the Anderson-Darling test statistic. The A-D test uses this test statistic to calculate the probability that this sample could have come from a normal distribution. The A-D test tests the default hypothesis that the data (in this case the first column of x), comes from a normal distribution. Assuming that this hypothesis is true, the p-value we see here tells us the probability that we can see the data we see in this sample purely by random chance. That is to say, in this case, we have a 50.57% probability of seeing the same kind of data from this process, assuming that the process in question does represent normally distributed data. Obviously, such a high chance of normality is hard to ignore, which is why we fail to reject this hypothesis we had originally, that the data does come from a normal distribution.

For another sample, if the p-value were around 3%, for instance, that would mean a 3% chance of seeing the same data from a normal distribution – which is obviously a very low chance. Although a 5% chance is still a small chance, we choose that as the very bottom end of our acceptability and should ideally subject such data to scrutiny before we proceed to do draw inferences from it. P-values can be confusing for some and hard to interpret – I usually try to construct a sentence to interpret the p-value’s meaning in the context of the hypothesis that’s being tested. I’ll write more on this in a future post on hypothesis testing.

## Interpreting and Understanding A-D test Results

Naturally, as the p-values from an Anderson-Darling normality test become smaller and smaller, there is a smaller and smaller chance that we are looking at data from a normal distribution. Most statistical studies peg the “significance” level at which we reject the default hypothesis (that this data comes from a normal distribution) outright, at p-values of 0.05 (5%) or lesser. I’ve known teams and individuals that fail to reject this hypothesis, all the way down to p-values of 0.01. In truth, one significance value (often referred to as ) doesn’t fit all applications, and we have to exercise great caution in rejecting null hypotheses.

Let’s go a bit further with our data and the A-D test: we will not perform the A-D test on all the columns of data we’ve got in our variable x. The easiest way to do this in R is to define a function that returns the p-values from each column, and use that in an **“apply”** command.

#Running the A-D test for first column library(nortest) #defining a function called "adt" to run the A-D tests and return p-values adt<-function(x){ test<-ad.test(x); return(test$p.value) } #store the p-values for each column in a separate variable pvalues<-apply(x,MARGIN = 2, adt)

The code above analyzes the samples in x (as columns) and returns their p-values as columns. So, what do you expect to see when you summarize the variable “pvalues” which stores the test results?

When you summarize p-values, you can see how approximately 9 of the 100 don’t pass the significance criteria (of p>=0.05). You can also see that the p-values in this set of randomly generated samples are randomly distributed over the entire range of probabilities from 0 to 1. We can visualize this too, by plotting the variable “pvalues”.

#Plotting the sample p-values and drawing a significance line plot(pvalues, main = "p-values for columns in x (AD-test)", xlab = "Column number in x", ylab = "p-value") abline(h=0.05, col ="red")

## Other tests: Shapiro-Wilk Test

The Anderson-Darling test isn’t the only one available in the **nortest** package for assessing normality. Statisticians and engineers often use the Shapiro-Wilk test of normality also. For similar data used above (generated as random numbers from the normal distribution), the Shapiro-Wilk test can be performed, with only a few changes to the R script (which is another reason R is so time-efficient).

#Generating 10k points of data and arranging them into 100 columns x<-rnorm(10000,10,1) dim(x)<-c(100,100) #Generating a simple normal quantile-quantile plot for the first column #Generating a line for the qqplot qqnorm(x[,1]) qqline (x[,1], col=2) #Running the A-D test for first column library(nortest) #defining a function called "swt" to run the Shapiro-Wilk tests and return p-values swt<-function(x){ test<-shapiro.test(x); return(test$p.value) } #store the p-values for each column in a separate variable swpvalues<-apply(x,MARGIN = 2, swt) #Plotting the sample p-values and drawing a significance line plot(swpvalues, main = "p-values for columns in x (Shapiro-Wilk-test)", xlab = "Column number in x", ylab = "p-value") abline(h=0.05, col ="red")

Observe than in these samples of 100 points each, the Shapiro Wilk tests returns 96 samples as being normally distributed data, while rejecting 4 (the four dots below the red line). We can’t be sure in this example whether the Shapiro Wilk test is better than the A-D test for normality assessments, however, because these are randomly generated data sets. If we run these tests side-by-side, however, we may get to see interesting results. When I ran these tests side by side, I got very similar number of significantly different samples (non-normal samples) – either 4 or 5 out of the total 100 – from both tests.

## Concluding Remarks

So, what does all this mean for day-to-day data analysis?

- Data analysis of continuous (variable, as opposed to yes/no, or other attribute data) data often uses tools that are meant for normally distributed data
- Visualization alone isn’t sufficient to estimate the normality of a given set of data, and test-statistics are very important for this
- The nortest package in R provides a fast and convenient way to assess samples for normality using tools like A-D and S-W tests
- A-D and S-W tests tend to perform similarly for most data sets (however, research is being done on this)
- The re-usability of R code allows you to set up a macro rather quickly for performing normality tests and a range of studies on test data, in a time-efficient manner