Outliers are points in a data set that lie far from the estimated centre of the data. That estimated centre could be the mean or the median, depending on the kind of point or interval estimate you’re using. Outliers tend to represent something different from “the usual” that you might observe in a data set, and are therefore important. Outlier detection is an important part of machine learning work of any sophistication. Because outliers can throw off a learning algorithm or undermine an assumption about the data set, we have to be able to identify and explain them when the need arises. I’ll only cover the basic R commands for outlier detection here, so it would be good to look up a more comprehensive resource. A first primer by Sanjay Chawla and Pei Sun (University of Sydney) is here: Outlier detection (PDF slides).
Graphical Approaches to Outlier Detection
Boxplots and histograms are useful to get an idea of the distribution that could be used to model the data, and could also provide insights into whether outliers exist or not in our data set.
y <- read.csv("y.csv")
ylarge <- read.csv("ylarge.csv")

#summarizing and plotting y
summary(y)
hist(y[,2], breaks = 20, col = rgb(0,0,1,0.5))
boxplot(y[,2], col = rgb(0,0,1,0.5), main = "Boxplot of y[,2]")
shapiro.test(y[,2])
qqnorm(y[,2], main = "Normal QQ Plot - y")
qqline(y[,2], col = "red")

#summarizing and plotting ylarge
summary(ylarge)
hist(ylarge[,2], breaks = 20, col = rgb(0,1,0,0.5))
boxplot(ylarge[,2], col = rgb(0,1,0,0.5), main = "Boxplot of ylarge[,2]")
shapiro.test(ylarge[,2])
qqnorm(ylarge[,2], main = "Normal QQ Plot - ylarge")
qqline(ylarge[,2], col = "red")
The Shapiro-Wilk test used above checks a data set for normality. Normality assumptions underlie the outlier detection hypothesis tests. In this case, with p-values of 0.365 and 0.399 respectively and sample sizes of 30 and 1000, we fail to reject normality for either sample, so both y and ylarge can reasonably be treated as normally distributed.
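If you don’t have the CSV files to hand, here is a rough sketch of the same check on simulated data. The sample sizes mirror the ones above, but the data and the names sim_small, sim_large and sim_skewed are made up for illustration. Note that shapiro.test() only accepts between 3 and 5000 observations.

```r
# Simulated stand-ins for y and ylarge (assumed, not the original data)
set.seed(42)
sim_small  <- rnorm(30,   mean = 10, sd = 1)
sim_large  <- rnorm(1000, mean = 10, sd = 1)
sim_skewed <- rexp(1000, rate = 1)   # clearly non-normal, for contrast

# shapiro.test() returns an "htest" object; the p-value is stored in $p.value
p_small  <- shapiro.test(sim_small)$p.value
p_large  <- shapiro.test(sim_large)$p.value
p_skewed <- shapiro.test(sim_skewed)$p.value

# A small p-value is evidence against normality;
# a large one only means normality was not rejected
round(c(small = p_small, large = p_large, skewed = p_skewed), 4)
```

The skewed sample should come back with a p-value very close to zero, while the two normal samples will usually (though not always) give p-values above 0.05.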
The graphical analysis tells us that there could possibly be outliers in our data set ylarge, which is the larger data set out of the two. The normal probability plots also seem to indicate that these data sets (as different in sample size as they are) can be modeled using normal distributions.
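The points a boxplot draws beyond its whiskers can also be listed directly. Here is a minimal sketch, assuming simulated data with two planted extreme values (the vector sim is hypothetical, not one of the data sets above):

```r
# Simulated data with two planted extreme values (assumed for illustration)
set.seed(1)
sim <- c(rnorm(200, mean = 10, sd = 1), 16, 3)

# boxplot.stats() applies the same rule boxplot() uses to draw points:
# values more than 1.5 * IQR beyond the hinges are returned in $out
flagged <- boxplot.stats(sim)$out
flagged

# Positions of the flagged values in the original vector
which(sim %in% flagged)
```

This is only a screening rule. With enough data it will flag a few points even in a perfectly normal sample, which is why the hypothesis tests in the next section are worth running as well.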
Dixon and Chi Squared Tests for Outliers
The Dixon test and the Chi-squared test for outliers (PDF) are statistical hypothesis tests used to detect outliers in a given sample. Bear in mind, though, that this Chi-squared test for outliers is very different from the better-known Chi-squared test used for comparing multiple proportions. The Dixon test makes a normality assumption about the data, and is generally used for 30 points or fewer. The Chi-squared test, on the other hand, makes assumptions about the variance, and is not sensitive to mild outliers if variance isn’t specified as an argument. Let’s see how these tests can be used for outlier detection.
library(outliers)

#Dixon Tests for Outliers for y
dixon.test(y[,2],opposite = TRUE)
dixon.test(y[,2],opposite = FALSE)

#Dixon Tests for Outliers for ylarge
#dixon.test() only accepts 3 to 30 observations, so these calls stop with
#an error for ylarge; try() lets the rest of the script carry on
try(dixon.test(ylarge[,2],opposite = TRUE))
try(dixon.test(ylarge[,2],opposite = FALSE))

#Chi-Sq Tests for Outliers for y
chisq.out.test(y[,2],variance = var(y[,2]),opposite = TRUE)
chisq.out.test(y[,2],variance = var(y[,2]),opposite = FALSE)

#Chi-Sq Tests for Outliers for ylarge
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = TRUE)
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = FALSE)
In each of the Dixon and Chi-squared tests for outliers above, we’ve set the argument opposite to TRUE and FALSE in turn. By default (opposite = FALSE), each test examines the extreme value that lies farthest from the sample mean; with opposite = TRUE it examines the value at the other end instead. Running both therefore covers the lowest and the highest extreme values, since outliers can lie on either side of the data set.
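To make the role of opposite more concrete, here is a rough base-R sketch of a chi-squared style outlier check. It assumes the statistic is the squared distance of the tested value from the sample mean, divided by the variance, and compared against a chi-squared distribution with one degree of freedom; that is my reading of the test, not a copy of the package code. The helper chisq_outlier_sketch and the simulated vector sim are hypothetical names used only here.

```r
# Illustrative re-implementation only; chisq_outlier_sketch is a hypothetical
# helper, not part of the outliers package. Use chisq.out.test() in practice.
chisq_outlier_sketch <- function(x, variance = var(x), opposite = FALSE) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  m <- mean(x)
  # By default, test whichever end of the sample lies farther from the mean
  lowest_is_farther <- (m - x[1]) > (x[n] - m)
  # opposite = TRUE flips the choice to the other end
  test_lowest <- xor(lowest_is_farther, opposite)
  value <- if (test_lowest) x[1] else x[n]
  # Squared distance from the mean, scaled by the (assumed known) variance
  statistic <- (value - m)^2 / variance
  p_value <- pchisq(statistic, df = 1, lower.tail = FALSE)
  list(value = value,
       side = if (test_lowest) "lowest" else "highest",
       statistic = statistic,
       p.value = p_value)
}

# Simulated data with one planted high value (assumed for illustration)
set.seed(7)
sim <- c(rnorm(50, mean = 10, sd = 1), 15)

default_end  <- chisq_outlier_sketch(sim)                   # farthest end: the planted 15
opposite_end <- chisq_outlier_sketch(sim, opposite = TRUE)  # the other end of the sample
c(default = default_end$side, opposite = opposite_end$side)
```

In this sketch the default call picks the planted high value because it is farther from the mean, and opposite = TRUE switches to the lowest value, which is why the tests above are run twice.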
Sample output is below, from one of the tests.
> #Dixon Tests for Outliers for y
> dixon.test(y[,2],opposite = TRUE)

	Dixon test for outliers

data:  y[, 2]
Q = 0.0466, p-value = 0.114
alternative hypothesis: highest value 11.7079474800368 is an outlier
When you closely observe the p-values of these tests alone, you can see the following results:
P-values for outlier tests:
Dixon test (y, upper): 0.114 ; Dixon test (y, lower): 0.3543
Dixon test not executed for ylarge
Chi-sq test (y, upper): 0.1047 ; Chi-sq test (y, lower): 0.0715
Chi-sq test (ylarge, upper): 0.0012 ; Chi-sq test (ylarge, lower): 4e-04
The p-values here (taken with an indicative 5% significance level) suggest that the extreme values in ylarge may be outliers. This may or may not be true, of course, since in inferential statistics we always state the chance of error. In this case, the small p-values say that extreme values like the ones we see in ylarge would be very unlikely if they were typical of that data set.
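The comparison against the 5% level can be written out in a few lines. This sketch simply copies the p-values listed above into a named vector; the element names are my own labels.

```r
# P-values copied from the results listed above (names are illustrative labels)
p_values <- c(
  dixon_y_upper      = 0.114,
  dixon_y_lower      = 0.3543,
  chisq_y_upper      = 0.1047,
  chisq_y_lower      = 0.0715,
  chisq_ylarge_upper = 0.0012,
  chisq_ylarge_lower = 4e-04
)

alpha <- 0.05   # indicative significance level used in the text

# Names of the tests whose p-value falls below alpha
flagged <- names(p_values)[p_values < alpha]
flagged
```

Only the two Chi-squared tests on ylarge fall below the threshold, which matches the reading of the results above.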
We’ve seen the graphical outlier detection approaches and also the Dixon and Chi-squared tests. The Dixon test is newer, but isn’t applicable to large data sets, for which we need to use the Chi-squared test for outliers and other tests. In machine learning problems, we often have to be able to explain some of the values, from a training perspective for neural networks, or be able to deal with lower resolution models such as least squares regression, used in simpler forecasting and estimation problems. Approaches like regression depend heavily on the central tendency of the data, and we can build better models if we’re able to explain outliers and understand their underlying causes.
Continual improvement professionals generally treat outliers as important. Statistically, the chance of getting extreme results (extremely good ones and extremely poor ones) is exciting in process excellence and continuous improvement, because they could represent benchmark cases or worst-case scenarios. Either way, outlier detection is an immensely useful activity, applicable to many different statistical situations in business.
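As a small illustration of how much a single extreme value can move a least squares fit, here is a sketch on simulated data. The data, the true slope of 2 and the size of the planted outlier are all assumptions chosen for the example.

```r
# Simulated linear data with a true slope of 2 (assumed for illustration)
set.seed(123)
x <- 1:20
y_clean <- 2 * x + 1 + rnorm(20, sd = 1)

# Copy the data and plant one extreme value at the edge of the x range
y_outlier <- y_clean
y_outlier[20] <- y_outlier[20] + 100

# Ordinary least squares fits with and without the planted outlier
fit_clean   <- lm(y_clean ~ x)
fit_outlier <- lm(y_outlier ~ x)

slope_clean   <- unname(coef(fit_clean)["x"])
slope_outlier <- unname(coef(fit_outlier)["x"])
c(clean = slope_clean, with_outlier = slope_outlier)
```

One point out of twenty is enough to pull the estimated slope well away from 2 here, which is the practical reason for identifying and explaining outliers before fitting.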