A number of inferential statistical tests (including those used in A/B testing and other significance tests) assume that the underlying data we’re comparing come from a normal (Gaussian) distribution. In practice, however, many data sets don’t satisfy this assumption. In order to use the tools that assume normality, we have to transform the data (and any limits or reference values used in the comparisons).
The purpose of transformation, therefore, is to ensure that the data set we have satisfies the minimum assumptions of the hypothesis tests we want to conduct. In frequentist statistics, where we use these statistical distributions to model a process and describe its aggregate behaviour (as opposed to Bayesian approaches), it is useful to keep such transformation tools at hand.
Let’s generate a sample data set, plot it, and assess its normality using a normal QQ plot. We’ll also use the Shapiro-Wilk test, a standard and powerful test for normality.
#Generating a Weibull distribution sample (shape = 2, scale = 5)
x <- rweibull(1000, shape = 2, scale = 5)
#Plotting and visualizing data
hist(x, breaks = 20, col = rgb(0.1, 0.1, 0.9, 0.5))
#Shapiro-Wilk test for normality
shapiro.test(x)

	Shapiro-Wilk normality test

data:  x
W = 0.9634, p-value = 3.802e-15
With a p-value this small, the Shapiro-Wilk test rejects the null hypothesis of normality: the data set is very unlikely to have come from a normal distribution. Now let’s look at a QQ-Norm plot.
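The QQ plot for the raw sample can be drawn with base R’s qqnorm() and qqline(). The snippet below is a minimal sketch that assumes a sample drawn the same way as x; the seed and the name x_demo are illustrative choices and not part of the original analysis.

```r
# Stand-in for the article's x: same Weibull parameters, fixed seed (an assumption)
set.seed(123)
x_demo <- rweibull(1000, shape = 2, scale = 5)

# Normal QQ plot: sample quantiles plotted against theoretical normal quantiles.
# qqnorm() also returns the plotted coordinates, which we keep in qq.
qq <- qqnorm(x_demo, main = "Normal Q-Q Plot (raw Weibull sample)")

# Reference line through the first and third quartiles
qqline(x_demo, col = "red")
```

Points that curve away from the reference line, especially in the tails, suggest a departure from normality.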
Our objective now is to transform this data set into one on which we can perform operations meant for normally distributed data. Being able to transform data has many benefits, but chiefly it allows us to conduct capability analyses and stability analyses, in addition to hypothesis tests like t-tests. Naturally, the reference points used in these tests (such as specification limits or hypothesised values) also have to be transformed with the same function, in order to make meaningful comparisons.
The log and square root functions are commonly used to transform positive data. In our case, since the Weibull sample is strictly positive, we can explore the use of the log function and observe its effectiveness at transforming the data set.
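To make the point about reference values concrete: a limit has to pass through the same function as the data. The sketch below uses the natural log purely as an example of a strictly increasing transformation. The limit of 10 and the names x_demo and usl are hypothetical, chosen only for illustration.

```r
# Stand-in sample with the same Weibull parameters as the article (seed is an assumption)
set.seed(123)
x_demo <- rweibull(1000, shape = 2, scale = 5)

# Hypothetical upper specification limit on the original scale
usl <- 10

# Transform the data AND the limit with the same function
x_demo_log <- log(x_demo)
usl_log <- log(usl)

# A strictly increasing transformation preserves order, so the share of
# observations within the limit is the same on both scales
share_raw <- mean(x_demo <= usl)
share_log <- mean(x_demo_log <= usl_log)
c(raw = share_raw, transformed = share_log)

# Comparing transformed data against the UNtransformed limit answers a
# different (and usually meaningless) question
share_mismatch <- mean(x_demo_log <= usl)
share_mismatch
```

The same reasoning applies to any other monotone transformation: whatever is done to the data must also be done to the value it is compared against.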
#Transforming data using the log function
x_log <- log(x)
#Plotting the transformed data
hist(x_log, breaks = 20, col = rgb(0.1, 0.1, 0.9, 0.5))
#Shapiro-Wilk test for normality of transformed data
shapiro.test(x_log)
#Normal QQ plot of the transformed data (points and line both use x_log)
qqnorm(x_log)
qqline(x_log)

	Shapiro-Wilk normality test

data:  x_log
W = 0.9381, p-value < 2.2e-16
Quite clearly, the results from this aren’t too promising. The data set created as x_log doesn’t exhibit normality, given that the p-value in the normality test is extremely small. The p-value tells us that, if the log-transformed data really had come from a normal distribution (the null hypothesis of the test), a test statistic this extreme would be extremely unlikely. We therefore reject the null hypothesis of normality for x_log.
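The square root, the other common transformation for positive data, can be tried in exactly the same way. The sketch below assumes a sample drawn with the same Weibull parameters; the seed and the names x_demo and x_demo_sqrt are illustrative. One fact worth noting is that the square root of a Weibull variable with shape 2 is itself Weibull with shape 4, which is much closer to symmetric. Even so, with 1000 observations the Shapiro-Wilk test can detect small departures from normality, so the result should be checked rather than assumed.

```r
# Stand-in sample with the same Weibull parameters as the article (seed is an assumption)
set.seed(123)
x_demo <- rweibull(1000, shape = 2, scale = 5)

# Square root transformation (valid because the data are non-negative)
x_demo_sqrt <- sqrt(x_demo)

# Plotting the transformed data
hist(x_demo_sqrt, breaks = 20, col = rgb(0.1, 0.9, 0.1, 0.5))

# Shapiro-Wilk test for normality of the transformed data
sw_sqrt <- shapiro.test(x_demo_sqrt)
sw_sqrt

# Normal QQ plot of the transformed data
qqnorm(x_demo_sqrt)
qqline(x_demo_sqrt, col = "red")
```

Whether a simple power transformation is adequate depends on the data at hand, which is one reason to look at a more flexible family of transformations.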
The Johnson R Package
The Johnson R package gives us access to the Johnson transformation, a tried and tested approach to transforming data towards normality. The package contains a number of useful functions, including a normality test (Anderson-Darling, which is comparable in power to Shapiro-Wilk) and the Johnson transformation function itself. The Johnson package can be installed using the install.packages("Johnson") command in R.
library(Johnson)
#Running the Anderson-Darling Normality Test on x
adx <- RE.ADT(x)
adx
#Running the Johnson Transformation on x
x_johnson <- RE.Johnson(x)
#Plotting transformed data
hist(x_johnson$transformed, breaks = 25, col = rgb(0.9, 0.1, 0.1, 0.5))
qqnorm(x_johnson$transformed)
qqline(x_johnson$transformed, col = "red")
#Assessing normality of transformed data
adx_johnson <- RE.ADT(x_johnson$transformed)
adx_johnson
Running the RE.Johnson() command generates a list, assigned in this case to the variable x_johnson. This list contains a vector of the transformed values, under x_johnson$transformed, which is our new transformed data set. Using the same plots and tests we used earlier for the base data set, we can assess the effectiveness of the Johnson transformation.
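As a minimal sketch of working with that list, the snippet below inspects what RE.Johnson() returns and cross-checks the transformed values with the Shapiro-Wilk test used earlier. It assumes the Johnson package is installed and skips the transformation otherwise. The seed and the names x_demo, x_demo_johnson and sw_johnson are illustrative. Only the $transformed component is taken from the text above; the other components are listed by name rather than assumed.

```r
# Stand-in sample with the same Weibull parameters as the article (seed is an assumption)
set.seed(123)
x_demo <- rweibull(1000, shape = 2, scale = 5)

# Stays NULL if the Johnson package is not installed
sw_johnson <- NULL

if (requireNamespace("Johnson", quietly = TRUE)) {
  x_demo_johnson <- Johnson::RE.Johnson(x_demo)

  # List the components of the returned list without assuming their names
  print(names(x_demo_johnson))

  # Pull out the transformed values, dropping any non-finite entries defensively
  z <- x_demo_johnson$transformed
  z <- z[is.finite(z)]

  # Cross-check normality with the Shapiro-Wilk test from base R
  sw_johnson <- shapiro.test(z)
  print(sw_johnson)
}
```

Using a second normality test here is only a sanity check; the two tests weigh departures from normality differently and need not agree exactly.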
> adx_johnson
"Anderson-Darling Test"
$p
0.3212095
A p-value of 0.32 from the Anderson-Darling test on the transformed data set means that we fail to reject the null hypothesis of the A-D test at conventional significance levels (such as 0.05), and cannot rule out the possibility that the transformed data is normally distributed.
We’ve seen in this post how the Johnson R package can be used to transform data that is non-normal (as a lot of real data sets are) into a data set that can be passed to a hypothesis test or other function that assumes normality in the data. Transformations have wide applications, and can be used to extract meaningful information about the dynamics of a distribution, process or data set. Transformation also allows us to present data better. When data is skewed, extreme values tend to dominate a plot, at the cost of obscuring the pattern that’s present in most of the data set. In such situations, transformations can come in handy. Transformations can also be used to convey the scale of phenomena, by plotting the transformed data in graphs.