Simple Outlier Detection in R

Outliers are points in a data set that lie far from the estimated centre of the data. This estimated centre could be either the mean or the median, depending on the kind of point or interval estimate you’re using. Outliers tend to represent something different from “the usual” that you might observe in a data set, and are therefore important. Outlier detection is an important part of preparing data for machine learning algorithms of any sophistication. Because outliers can throw off a learning algorithm or invalidate an assumption about the data set, we have to be able to identify and explain them when the need arises. I’ll only cover the basic R commands for outlier detection here, but it would be good to look up a more comprehensive resource. A first primer by Sanjay Chawla and Pei Sun (University of Sydney) is here: Outlier detection (PDF slides).
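To make the idea of "far from the centre" concrete, here is a minimal sketch in base R that scores each point by its distance from the centre, once using the mean and standard deviation and once using the median and MAD. The simulated vector x, the planted value 25 and the cut-off of 3 are assumptions chosen for illustration; they are not the data used later in this post.

```r
# Simulated data (assumption): 29 roughly normal points plus one planted extreme value
set.seed(42)
x <- c(rnorm(29, mean = 10, sd = 1), 25)

# Distance from the centre in units of spread, using mean and standard deviation
z_mean <- (x - mean(x)) / sd(x)

# Robust version: distance from the median in units of the MAD
z_median <- (x - median(x)) / mad(x)

# Flag points more than 3 units from the centre (3 is a common but arbitrary cut-off)
which(abs(z_mean) > 3)
which(abs(z_median) > 3)
```

The median-based score is less affected by the extreme point itself, which is why the two choices of centre can disagree on borderline points.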

Graphical Approaches to Outlier Detection

Boxplots and histograms are useful to get an idea of the distribution that could be used to model the data, and could also provide insights into whether outliers exist or not in our data set.

y <- read.csv("y.csv")
ylarge <- read.csv("ylarge.csv")

#summarizing and plotting y
summary(y)
hist(y[,2], breaks = 20, col = rgb(0,0,1,0.5))
boxplot(y[,2], col = rgb(0,0,1,0.5), main = "Boxplot of y[,2]")
shapiro.test(y[,2])
qqnorm(y[,2], main = "Normal QQ Plot - y")
qqline(y[,2], col = "red")

#summarizing and plotting ylarge
summary(ylarge)
hist(ylarge[,2], breaks = 20, col = rgb(0,1,0,0.5))
boxplot(ylarge[,2], col =  rgb(0,1,0,0.5), main = "Boxplot of ylarge[,2]")
shapiro.test(ylarge[,2])
qqnorm(ylarge[,2], main = "Normal QQ Plot - ylarge")
qqline(ylarge[,2], col = "red")

The Shapiro-Wilk test used above checks whether a data set is consistent with a normal distribution. Normality assumptions underlie the outlier detection hypothesis tests used below. In this case, with p-values of 0.365 and 0.399 and sample sizes of 30 and 1000 respectively, the test does not reject normality for either sample, so both y and ylarge can reasonably be treated as normally distributed.
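As a small self-contained illustration of reading a Shapiro-Wilk result, the sketch below runs the test on simulated data rather than on y.csv. The vectors normal_sample and skewed_sample and the 5% threshold are assumptions for illustration.

```r
# Simulated samples (assumption): one normal, one strongly right-skewed
set.seed(1)
normal_sample <- rnorm(30, mean = 10, sd = 1)
skewed_sample <- rexp(30, rate = 1)^2

# shapiro.test() returns an "htest" object; the p-value is stored in $p.value
sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)

# A p-value above 0.05 means we fail to reject normality at the 5% level;
# it does not prove that the sample is normal
sw_normal$p.value > 0.05
sw_skewed$p.value > 0.05
```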

 

Box plot of y (no real outliers observed as per graph)
Box plot of ylarge (a few outlier points seem to be present in graph)
Histogram of y
Histogram of ylarge
Normal QQ plot of y
Normal QQ plot of ylarge

 

The graphical analysis tells us that there could possibly be outliers in our data set ylarge, which is the larger data set out of the two. The normal probability plots also seem to indicate that these data sets (as different in sample size as they are) can be modeled using normal distributions.
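The points that a box plot draws individually can also be listed without reading them off the graph. This is a minimal sketch using base R's boxplot.stats(), which flags points lying more than 1.5 times the length of the box beyond its edges. The simulated vector x stands in for ylarge[,2] and is an assumption.

```r
# Simulated stand-in for ylarge[,2] (assumption): 1000 normal points
set.seed(7)
x <- rnorm(1000, mean = 10, sd = 1)

# boxplot.stats() returns the whisker ends, hinges and median in $stats
# and the points beyond the whiskers in $out
bs <- boxplot.stats(x, coef = 1.5)
bs$stats
bs$out

# Indices of the flagged points in the original vector
which(x %in% bs$out)
```

With 1000 points from a normal distribution, a handful usually fall beyond the whiskers by chance alone, so flagged points in a large sample are candidates for a closer look rather than confirmed outliers.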

Dixon and Chi Squared Tests for Outliers

The Dixon test and the Chi-squared test for outliers (PDF) are statistical hypothesis tests used to detect outliers in a given sample. Bear in mind, though, that this Chi-squared test for outliers is very different from the better-known Chi-squared test used for comparing multiple proportions. The Dixon test makes a normality assumption about the data, and is generally used for 30 points or fewer. The Chi-squared test, on the other hand, makes variance assumptions, and is not sensitive to mild outliers if the variance isn’t specified as an argument. Let’s see how these tests can be used for outlier detection.

library(outliers)
#Dixon Tests for Outliers for y
dixon.test(y[,2],opposite = TRUE)
dixon.test(y[,2],opposite = FALSE)

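#Dixon Tests for Outliers for ylarge
#dixon.test() only supports 3 to 30 observations, so these calls stop with an
#error for ylarge (n = 1000); try() lets the rest of the script continue
try(dixon.test(ylarge[,2],opposite = TRUE))
try(dixon.test(ylarge[,2],opposite = FALSE))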


#Chi-Sq Tests for Outliers for y
chisq.out.test(y[,2],variance = var(y[,2]),opposite = TRUE)
chisq.out.test(y[,2],variance = var(y[,2]),opposite = FALSE)

#Chi-Sq Tests for Outliers for ylarge
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = TRUE)
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = FALSE)

In each of the Dixon and Chi-squared tests for outliers above, we’ve set the argument opposite to both TRUE and FALSE in turn. By default (opposite = FALSE), the test is applied to the extreme value that lies farthest from the sample mean; setting opposite = TRUE applies it to the extreme value at the other end instead. Running both lets us test the lowest and the highest extreme value, since outliers can lie on either side of the data set.
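To show what testing the highest and the lowest value separately looks like, here is a hand-rolled sketch in base R. It assumes the statistic is the squared distance of the extreme value from the sample mean divided by the variance, compared against a chi-squared distribution with one degree of freedom. The helper chisq_outlier_p() and its side argument are hypothetical names used for illustration; they are not part of the outliers package.

```r
# Hypothetical helper: p-value for the highest or lowest value being an outlier,
# assuming (extreme - mean)^2 / variance follows a chi-squared distribution with 1 df
chisq_outlier_p <- function(x, side = c("highest", "lowest"), variance = var(x)) {
  side <- match.arg(side)
  x <- x[!is.na(x)]
  extreme <- if (side == "highest") max(x) else min(x)
  statistic <- (extreme - mean(x))^2 / variance
  pchisq(statistic, df = 1, lower.tail = FALSE)
}

# Simulated data (assumption): 29 typical points plus one planted high value
set.seed(3)
x <- c(rnorm(29, mean = 10, sd = 1), 18)

p_high <- chisq_outlier_p(x, "highest")  # tests the planted value 18
p_low  <- chisq_outlier_p(x, "lowest")   # tests the smallest of the typical values
c(highest = p_high, lowest = p_low)
```

Naming the side explicitly, rather than using a TRUE/FALSE switch, is a design choice made here only to keep the sketch easy to read.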

Sample output is below, from one of the tests.

> #Dixon Tests for Outliers for y
> dixon.test(y[,2],opposite = TRUE)

	Dixon test for outliers

data:  y[, 2]
Q = 0.0466, p-value = 0.114
alternative hypothesis: highest value 11.7079474800368 is an outlier


When you closely observe the p-values of these tests alone, you can see the following results:

P-values for outlier tests:

Dixon test (y, upper):  0.114 ; Dixon test (y, lower):  0.3543
Dixon test not executed for ylarge
Chi-sq test (y, upper):  0.1047 ; Chi-sq test (y, lower):  0.0715
Chi-sq test (ylarge, upper):  0.0012 ; Chi-sq test (ylarge, lower):  4e-04

The p-values here (taken at an indicative 5% significance level) suggest that the extreme values in ylarge may be outliers, while none of the tests on y is significant at that level. This may or may not be true, of course, since in inferential statistics we always state the chance of error. In this case, if the extreme values in ylarge were typical of that data set, we would see results this extreme only very rarely.
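The comparison against the 5% level can be written out in a few lines of base R. The sketch below uses the p-values reported above; the labels in the named vector and the variable alpha are assumptions introduced for illustration.

```r
# P-values reported above, collected in a named vector (labels are illustrative)
p_values <- c(
  dixon_y_upper      = 0.114,
  dixon_y_lower      = 0.3543,
  chisq_y_upper      = 0.1047,
  chisq_y_lower      = 0.0715,
  chisq_ylarge_upper = 0.0012,
  chisq_ylarge_lower = 4e-04
)

alpha <- 0.05  # indicative significance level used in the text

# TRUE where the extreme value is flagged as a possible outlier at the 5% level
flagged <- p_values < alpha
names(p_values)[flagged]
```

Only the two tests on ylarge fall below the threshold.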

Concluding Remarks

We’ve seen the graphical outlier detection approaches, and also the Dixon and Chi-squared tests. The Dixon test is newer, but isn’t applicable to large data sets, for which we need to use the Chi-squared test for outliers and other tests. In machine learning problems, we often have to be able to explain some of the values, from a training perspective for neural networks, or be able to deal with lower resolution models such as least squares regression, used in simpler forecasting and estimation problems. Approaches like regression depend heavily on the central tendency of the data, and we can build better models if we’re able to explain outliers and understand their underlying causes.

Continual improvement professionals generally regard outliers as important. Statistically, the chance of getting extreme results (extremely good ones and extremely poor ones) is exciting in process excellence and continuous improvement, because they could represent benchmark cases or worst-case scenarios. Either way, outlier detection is an immensely useful activity, applicable to many different statistical situations in business.
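As a small illustration of how strongly least squares regression can react to a single extreme value, here is a minimal sketch in base R. The data are made up for the purpose (an exact straight line with one planted extreme value) and are an assumption, not taken from y.csv or ylarge.csv.

```r
# Made-up data (assumption): y is exactly 2 * x, then one value is replaced by an extreme one
x <- 1:10
y_clean <- 2 * x
y_outlier <- y_clean
y_outlier[10] <- 60  # planted extreme value, in place of 20

# Least squares fits without and with the extreme value
fit_clean <- lm(y_clean ~ x)
fit_outlier <- lm(y_outlier ~ x)

# Compare the fitted slopes
coef(fit_clean)["x"]
coef(fit_outlier)["x"]
```

A single point out of ten roughly doubles the fitted slope here, which is why it pays to explain such points before modelling.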

4 thoughts on “Simple Outlier Detection in R”

  1. Hi,

    Could you please share the format of “y.csv” and “ylarge.csv”? What does the input CSV look like?

    I ask because I have a data set which looks as follows (tab separated):
    s1 s2 s3 s4 s5 s6
    chr:122424-132342 4.888 3.232 7.23423 4.2343 3.92343 1.345345

    1. Hi, the input CSV comes in as a data frame with one index column and one data column, and that is all. The input is comma separated. In your case, you may have to import the file as text and specify the separator type. Hope this is helpful! Thanks for visiting and commenting on the blog.

      1. Hi,
        Thank you very much. I had to transpose my file data as below.
        34488 3.828557
        41745 4.742792
        41843 1.071726
        41832 3.868687
        42006 3.893922
        42026 3.737675
        42040 4.411521

        One query: how do I set the X axis to display Col1 of the input file?
