Six Sigma and R

As software primarily meant for statistical analysis, R obviously has uses in statistical and data driven problem solving, especially the kind seen in Six Sigma programs. A product of large manufacturing firms that aspired to attain high capability processes with low defects, Six Sigma started out as a child of the Total Quality Management movement and became infused with the statistical methods that were being developed for process analysis and system analysis in the 1980s.


Drawing upon an impressive range of statistical tools, early Six Sigma programs in various companies trained C-suite executives in data-driven problem solving, enabling them to use control charts, capability analysis, and other tools from the TQM movement. While Six Sigma attracted a lot of attention in its heyday, it has its fair share of critics to this day, who object to the institutionalization of knowledge and the hierarchical approach to problem solving that it seemed to promote. It was often dismissed as a flavour-of-the-month program in many companies (and that may indeed have been a fair criticism at times). Negatives aside, Six Sigma still attracts engineers, technologists and managers from various industries, even outside manufacturing, which has traditionally been its bastion. This is because of the rigour that has been developed over the years in the definition, measurement and analysis of problems, and the notion that being data driven is a better way to run a business than relying on pure experience (or, on a lighter note, chance, luck or favourable risks).

The R language has a number of packages suited to the specific kinds of analysis that Six Sigma professionals require. Even if you aren’t a Six Sigma professional (or want nothing to do with it), you can benefit from the many statistical tools and methods that it has made popular. The large installed “user base” of practitioners means that these tools tend to have standard reporting formats, so it helps to have a package that offers a range of functionality and standard output.

R Packages

Two packages in R which offer tools and methods for the Six Sigma professional are qcc and SixSigma. Some Six Sigma projects that use time series analysis may benefit from native packages in R, and forecasting packages such as the one developed by Rob Hyndman (forecast).

In this post, I’ll walk through how to use qcc for a simple process capability analysis.

Capability Analysis: 101

A process capability analysis is, in the simplest of terms, a comparison between how your process performs and the specification limits that have been agreed upon for it. These specification limits indicate the acceptable range over which the process should perform. The traditional view is that any process output between these specification limits is considered good, and anything outside them isn’t.

[Figure: Innate process variation results in defects]

Any process that has variation (and lots of processes, in nature and in industry, do) occasionally produces results that don’t meet the requirements or specifications intended for it. Process capability methods allow us to measure this variation against those requirements. The process in question may be one we run ourselves, or it could be someone else’s process whose output we depend on.
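To make the link between variation and defects concrete, here is a minimal sketch in R. It assumes a normally distributed process with a made-up mean and standard deviation, and borrows the specification limits (9.0 and 10.5) used later in this post; none of these numbers come from real data.

```r
# Hypothetical process: normally distributed output (mean and sd are assumptions)
process_mean <- 9.8
process_sd   <- 0.25
lsl <- 9.0    # lower specification limit
usl <- 10.5   # upper specification limit

# Probability of an output falling below the LSL or above the USL
p_below  <- pnorm(lsl, mean = process_mean, sd = process_sd)
p_above  <- pnorm(usl, mean = process_mean, sd = process_sd, lower.tail = FALSE)
p_defect <- p_below + p_above

cat("Expected fraction outside the limits:", signif(p_defect, 3), "\n")
```

Even a process whose mean sits comfortably inside the limits produces a small share of out-of-specification results, and that share grows quickly as the standard deviation increases or the mean drifts towards a limit.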

In general, a careful, step-wise approach is required in order to evaluate capability correctly. Here’s a chart that sums up how to go about it.

[Figure: Capability analysis - step by step]

If that looks complicated, let me break it down: it consists of only three important steps, namely assessing normality, assessing stability, and assessing capability. Normality tests tell us whether or not our data may have come from a normal distribution (also known as a Gaussian distribution) of some kind. It turns out that a lot of processes in nature and industry can be modeled using the normal distribution. (Unsurprising, then, that the analyses we are doing here are contingent upon it.) By some kind of normal distribution, I mean one with some mean and some standard deviation (since these are the two parameters that make the normal distribution look lean and tall, or fat and spread out). The stability tests are essentially control charts, which tell us if we can expect to see consistent results in our process. Control charts were invented in the 1920s by Walter Shewhart, and they’re still in use. The final test is the capability analysis itself. Put simply, this compares the specification width to the process variation.
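As a formula, and assuming the usual textbook definition (the post doesn't spell it out), the simplest capability index divides the specification width by six standard deviations of the process:

```latex
C_p = \frac{\mathrm{USL} - \mathrm{LSL}}{6\sigma}
```

Here USL and LSL are the upper and lower specification limits and \sigma is the process standard deviation, so a C_p of 1 means the specification width exactly matches the ±3\sigma spread of the process.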

Process Capability in R

For those who are used to the point-and-click interface of user-friendly software like Minitab and MATLAB, R may seem like an unlikely candidate for doing all this. But the flexibility and modularity of R shine through, especially in a complicated multi-step analysis like this one.

The approach I’ll describe here simply replicates what you see in the flow chart above, for a sample of data. For the purpose of this analysis, I’ll describe how it is done in RStudio, with the qcc and nortest packages installed. You can install them by firing up RStudio and running install.packages(c("qcc", "nortest")).

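# Load the packages used below
library(qcc)      # control charts and process capability
library(nortest)  # Anderson-Darling normality test (ad.test)

# Import process measurements from a file
# Data are in 99 samples of 100 measurements each (assumed here: one sample per column)
x <- read.csv("Process capability data.csv", header = TRUE)

# Assess normality of each sample using the Anderson-Darling normality test
p_values <- sapply(x, function(sample) ad.test(sample)$p.value)

# qcc expects one sample (subgroup) per row, so transpose the data
samples <- t(as.matrix(x))

# Draw the control charts only if at least 95% of the samples pass at the 5% level
if (sum(p_values > 0.05) >= 0.95 * length(p_values)) {
  qcc(samples, type = "xbar", plot = TRUE)
  qcc(samples, type = "S", plot = TRUE)
}

# The above control charts seem to be "in control", using only the +/- 3 sigma test as the criterion
# We can now proceed to perform a capability analysis
xqcc <- qcc(samples, type = "xbar", nsigma = 3, plot = FALSE)
process.capability(xqcc, spec.limits = c(9, 10.5))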

In the first section of the code, we’re pulling data from a CSV file, which R reads in as a data frame. At this stage, we haven’t used any of the visualization available within R, such as boxplots or histograms for the different data columns. Once you’ve pulled in the data, though, it is perfectly possible to plot a histogram:


hist(x[,1], xlab = "Sample 1", col = "blue",main = "Histogram of Sample 1" )

[Figure: Histogram of Sample 1, as rendered in RStudio]


Graphical examination reveals a structure similar to the familiar bell curve of the normal distribution. However, for statistical purposes, we generally use normality tests such as the Anderson-Darling or the Shapiro-Wilk tests. Think of these tests as a way of checking whether our data is thick around the middle and thin at the sides, just like the normal distribution. The Anderson-Darling test (ad.test) is in the package “nortest”, which has to be loaded with the “library()” command; the Shapiro-Wilk test (shapiro.test) ships with base R.
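Here is a minimal sketch of what a normality test returns. It uses simulated data rather than the post's data set, and it uses shapiro.test() so that it runs without any extra packages; ad.test() from nortest is called the same way and also returns a p.value element.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Simulated stand-ins for one sample of 100 measurements (not the post's data)
normal_sample <- rnorm(100, mean = 9.8, sd = 0.2)
skewed_sample <- rexp(100, rate = 1)   # clearly non-normal, for contrast

# shapiro.test() returns a list; p.value is the element of interest
p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

# A small p-value (below 0.05, say) is evidence against normality
cat("Normal sample p-value:", p_normal, "\n")
cat("Skewed sample p-value:", p_skewed, "\n")
```

The skewed sample should produce a very small p-value, while the normal sample will usually (though not always) produce one above 0.05.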

Now let’s talk through the rest of the code. Since there are 99 subgroups of data here, it makes more sense to use the Anderson-Darling normality test to examine the normality of the data numerically than to inspect 99 histograms. At a 5% significance level, we’d expect around 95% of these 99 samples to pass the test if they really do come from a normal distribution. (I say this with confidence, as I put together the sample data set from the normal distribution. In a real-life scenario, of course, you’d have to evaluate the normality of each sample on its own merits.) If you are wondering what that 5% is all about, I’ll talk about significance (and confidence levels) in a later post.
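The "around 95% should pass" expectation can be checked with a short simulation. This is only a sketch: the data are simulated, the mean and standard deviation are made up, and shapiro.test() stands in for the Anderson-Darling test so that no extra package is needed.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Simulate 99 samples of 100 normally distributed measurements, one sample per column
n_samples    <- 99
n_per_sample <- 100
sim <- matrix(rnorm(n_samples * n_per_sample, mean = 9.8, sd = 0.2),
              nrow = n_per_sample, ncol = n_samples)

# One p-value per sample (column)
p_values <- apply(sim, 2, function(col) shapiro.test(col)$p.value)

# Share of samples that are not rejected at the 5% significance level
pass_rate <- mean(p_values > 0.05)
cat("Share of samples passing:", round(pass_rate, 3), "\n")
```

Because every simulated sample really is normal, the handful of rejections are false alarms, which is exactly what a 5% significance level permits.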

Once we have our normality assessed (and okay), we can put together a pair of control charts. We use the X-bar and S charts to assess process stability. X-bar is the notation for the average value of each sample (which we should hope won’t change a great deal between samples in the same process), and S is the standard deviation of each sample (which we hope stays as consistent and low as possible). Now, let’s look at the X-bar and S charts we generated. Observe how they have black dots representing each sample’s statistic (X-bar or S) and three lines – two dotted lines, called the “control limits” and the centre line.

[Figure: X-bar chart for the sample]

[Figure: S chart for the sample]

There are a number of rules for evaluating process stability in control charts, but the main and most widely followed one is that none of the points in your samples should lie outside the control limits. There is a further series of rules, based on the properties of the normal distribution (on which the control charts are constructed), that look at patterns of points, such as runs of points that indicate a constant rise or constant reduction in output (a process trend). However, for the purposes of demonstration, let’s agree that the process results in this case represent a stable process. The S chart here shows a consistent standard deviation, and the X-bar chart here shows a consistent mean. The centre line in each chart is simply the mean of means for the X-bar chart, and the mean of standard deviations for the S chart.
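For readers who want to see where the centre lines and control limits come from, here is a sketch that computes them by hand in base R. The data are simulated, and the limits use the textbook 3-sigma formulas with the c4 bias correction; qcc may estimate sigma differently by default, so treat this as an illustration rather than a reproduction of its output.

```r
set.seed(7)  # arbitrary seed so the sketch is repeatable

# Simulated stand-in data: 99 samples (rows) of 100 measurements (columns)
n <- 100
samples <- matrix(rnorm(99 * n, mean = 9.8, sd = 0.2), nrow = 99, ncol = n)

xbar <- rowMeans(samples)       # one mean per sample
s    <- apply(samples, 1, sd)   # one standard deviation per sample

# Centre lines: mean of means and mean of standard deviations
xbar_centre <- mean(xbar)
s_centre    <- mean(s)

# c4 corrects the bias of the sample standard deviation for samples of size n
c4 <- sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))
sigma_hat <- s_centre / c4

# 3-sigma control limits for the X-bar chart
xbar_lcl <- xbar_centre - 3 * sigma_hat / sqrt(n)
xbar_ucl <- xbar_centre + 3 * sigma_hat / sqrt(n)

# 3-sigma control limits for the S chart
s_lcl <- max(0, s_centre - 3 * sigma_hat * sqrt(1 - c4^2))
s_ucl <- s_centre + 3 * sigma_hat * sqrt(1 - c4^2)

# Count points outside the limits (the basic stability rule)
xbar_out <- sum(xbar < xbar_lcl | xbar > xbar_ucl)
s_out    <- sum(s < s_lcl | s > s_ucl)
cat("X-bar points outside limits:", xbar_out, "\n")
cat("S points outside limits:", s_out, "\n")
```

With a stable simulated process, the counts of points outside the limits should be at or near zero.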


Now that the stability assessment is also complete, the next command in the sample code helps us assess the capability of this process. Usually, in a manufacturing organization, there are extensive quality plans that help you understand what the specification limits are.

Naturally, capability assessment shouldn’t be done lightly by (goodness forbid) assuming specification limits. Design engineers and teams sometimes use the tolerance limits (or limits tighter than them) to assess the manufacturing process in question. Overall, though, this is what a capability analysis looks like. In our example, the data is being compared to the USL (upper spec limit) of 10.5 and the LSL (lower spec limit) of 9.0. Presumably these are based on the drawings and documents alluded to earlier. The values in the third column (C_p, C_pl, C_pu, C_pk, and C_pm) are different capability indices. Observe how they’re all different, and how C_pl is the highest. Now, if you observe the data in the capability plot, you’ll note that the peak is a little to the right of the centre line. There is therefore an asymmetry in the way the process behaves with respect to the centre of the specification limits. The higher the capability indices are, the better. If you were to look up the Automotive Industry Action Group manual on Statistical Process Control, you may find the appropriate values for the automotive industry. Required process capability indices may differ between industries, manufacturing processes, parts and so on.
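The five indices can be sketched with their standard textbook formulas. The function name below is hypothetical, the mean and standard deviation are made-up values, and the target defaults to the midpoint of the specification limits; only the limits 9.0 and 10.5 come from the example above.

```r
# capability_indices() is an illustrative helper, not part of qcc
capability_indices <- function(mu, sigma, lsl, usl, target = (lsl + usl) / 2) {
  cp  <- (usl - lsl) / (6 * sigma)   # spec width vs. process spread
  cpl <- (mu - lsl) / (3 * sigma)    # capability against the lower limit
  cpu <- (usl - mu) / (3 * sigma)    # capability against the upper limit
  cpk <- min(cpl, cpu)               # the worse of the two sides
  cpm <- (usl - lsl) / (6 * sqrt(sigma^2 + (mu - target)^2))  # penalises off-target mean
  c(Cp = cp, Cpl = cpl, Cpu = cpu, Cpk = cpk, Cpm = cpm)
}

# Assumed process mean and sd; spec limits as in the example
idx <- capability_indices(mu = 9.8, sigma = 0.2, lsl = 9.0, usl = 10.5)
print(round(idx, 3))
```

With the mean sitting to the right of the midpoint (9.75), C_pl comes out higher than C_pu, which is the same kind of asymmetry described above.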

Interpreting Process Capability

Process capability is generally a very important part of Six Sigma projects, and many managers in manufacturing organizations use the capability metrics alluded to (and others such as P_p, P_pk and Z-scores) as measures of process quality. Naturally, a higher process capability is better, because it means that our process will more often produce results between the agreed-upon specification limits. Why do we have so many indices? And why are they all different? They’re different because they measure different things. Processes should not only have small variation, but should also be centred on their specifications. And processes should be consistent in their results in both the short and the long term. Since each of these criteria has its own capability index, the number of indices swells.
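The short-term versus long-term distinction can be sketched as follows, assuming the common convention that C_p uses the within-sample standard deviation while P_p uses the overall standard deviation of all measurements. The data are simulated with a deliberately drifting mean; every number here is an assumption.

```r
set.seed(3)  # arbitrary seed so the sketch is repeatable

# Simulated process whose mean drifts between samples
n_samples <- 50
n <- 20
sample_means <- rnorm(n_samples, mean = 9.8, sd = 0.1)   # long-term drift
samples <- t(sapply(sample_means, function(m) rnorm(n, mean = m, sd = 0.2)))

lsl <- 9.0
usl <- 10.5

# Short-term (within-sample) sigma: pooled standard deviation across samples
sigma_within <- sqrt(mean(apply(samples, 1, var)))

# Long-term (overall) sigma: standard deviation of all measurements together
sigma_overall <- sd(as.vector(samples))

cp <- (usl - lsl) / (6 * sigma_within)    # short-term capability
pp <- (usl - lsl) / (6 * sigma_overall)   # long-term performance
cat("Cp:", round(cp, 3), " Pp:", round(pp, 3), "\n")
```

When the mean wanders from sample to sample, the overall spread is wider than the within-sample spread, so P_p comes out lower than C_p.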

Process capability is an important concept to know for service organizations too. Your process may not follow the normal distribution like manufacturing or assembly processes may, and you may deal with boolean (1-0) process results or ordinal results (such as the performance of a multi-modal switch, or a processor) – and even for these situations, there are other capability analysis metrics and methods.
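For pass/fail results, one commonly used approach (an assumption here, since the post doesn't name the metrics) is to express the defect rate as defects per million opportunities (DPMO) and as an equivalent Z-score. The counts below are invented for illustration, and the Z-score omits the 1.5-sigma shift that some Six Sigma conventions add.

```r
# Hypothetical service process: 10,000 transactions, 37 of them defective
opportunities <- 10000
defects <- 37

p_defect <- defects / opportunities
dpmo <- p_defect * 1e6   # defects per million opportunities

# Equivalent Z-score: the standard normal quantile that leaves p_defect in the upper tail
z_score <- qnorm(1 - p_defect)

cat("DPMO:", dpmo, " Z:", round(z_score, 2), "\n")
```

A higher Z-score corresponds to a lower defect rate, so it plays much the same role for attribute data that the capability indices play for measurements.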

Concluding Remarks

We’ve seen how R’s power can be leveraged to conduct process capability analysis. R’s benefit over commercial software seems to be its modularity, which could bring you a new way to do these charts, perhaps with better fonts, formatting or functionality, along with its redoubtable repeatability and speed, like a well-oiled machine. As a caveat, it is important to evaluate normality and stability prior to capability. If you have data that isn’t normally distributed, or that couldn’t be considered stable from its control charts, you may have to take more fundamental actions to improve your process’s results and consistency. The Six Sigma tool set also includes a number of other tools, among them inferential statistics and, sometimes, time series methods. I’ll cover how these may be carried out in R in the next post, with the SixSigma package and its commands.

