Basic QA Statistics Series (Part 6) How to Read a Box-and-Whisker Diagram

REFLECTION: FOR STUDENTS:
“Images are the most powerful communicator we have.”
– John Berger, 1926
FOR ACADEMICS:
“We live in a visual intensive society.”
– Paul Martin Lester, 2006
FOR PROFESSIONALS/PRACTITIONERS:
“There is more to visual communications therefore than simply making an image for the eyes to perceive, it has to accommodate the mind of the person being communicated to. That is to say you are not merely making something to be perceived when visually communicating, you are fundamentally making something to be thought about.”
– Aldous Huxley, 1894–1963
Foundation
The last post promised a discussion of the Box-and-Whisker Diagram (or box plot). Like the histogram, a box plot is a graphic tool that displays the distribution of the data, but with a critical difference: the box plot shows the quartile points (minimum, 1st quartile, median, 3rd quartile, and maximum) and uses a box to mark the Interquartile Range, or IQR (see Part 4 of this series). The “whiskers” of the plot extend from the box out to the minimum and maximum.
Box plots are very useful for quick comparison of multiple sets of data, especially when the focus is on the most critical aspects of the data, or when there are not enough data points in each data set to create a reliable histogram. Different software packages will also provide descriptive statistics, such as the mean value.

As you can see, box plots are excellent tools for quickly depicting variation between shifts or machines. Minitab is very effective, but most stat software (even Excel) can easily create a box plot for a presentation with minimal extra effort.
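To make the comparison concrete, here is a minimal Python sketch that computes the five numbers a box plot draws for two shifts. The shift measurements and the helper name `five_number_summary` are made up for illustration, and the quartiles use the simple median-of-halves method from Part 4 of this series; your stat software may use a slightly different quartile rule.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # values below the median position
    upper = data[half + n % 2:]   # values above it (skips the middle value when n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical measurements from two shifts (made up for illustration).
shifts = {
    "Shift A": [12, 14, 15, 15, 16, 17, 18, 19, 21],
    "Shift B": [8, 11, 15, 16, 19, 22, 24, 27, 31],
}

for name, values in shifts.items():
    lo, q1, med, q3, hi = five_number_summary(values)
    print(f"{name}: min={lo} Q1={q1} median={med} Q3={q3} max={hi} IQR={q3 - q1}")
```

In this made-up example, Shift B has the larger IQR (a wider box) and longer whiskers, which is exactly the kind of shift-to-shift variation a box plot lets you spot at a glance.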
Conclusion
I will be signing off on this short intro to QA statistics, though I will eventually return with a more in-depth series after I have turned my attention back to the broader world of Quality Culture for a while. Thank you all for your enthusiasm for this subject! If you have any subject suggestions, please send them to me!
Basic QA Statistics Series(Part 5)- Basic Histogram

REFLECTION: FOR STUDENTS: When a graph pops up showing you data in a histogram, pay closer attention to everything the graph is conveying, because effective conveyance of data is the future.
FOR ACADEMICS: Teach your students how to use visual data graphics, and correct them when they slip up. From teachers to the boardroom, being able to construct a histogram for a presentation is a vital skill for information conveyance.
FOR PROFESSIONALS/PRACTITIONERS: Excel and Minitab do the job, but always remember the underlying theory behind the graphs for when the software goes down, or you need to do it quickly without a computer.
Foundation
As promised in the last post of this series, we will now delve a little into histograms. The primary purpose of a histogram is to provide a straightforward graphic representation of the distribution of data. I’m sure everyone has heard the saying “a picture is worth a thousand words.” To demonstrate this, I will show you three histograms, and I think you will see, before you read any caption, which histogram looks like useful data. Sample data should look pretty much like a bell curve to be declared “normal.”

When the “tail” is to the left, the data is left skewed (and look at that clear outlier bin)

A histogram with an almost perfectly normal distribution

When the “tail” is to the right, the data is right skewed
The histogram is a quick communication of the state of the data. When you see the strong left or right skew, you must investigate the outliers and determine why you have so many.
Constructing a Histogram from your Data
To construct a histogram from a continuous variable, first decide how much data to use; fifty data points should be your minimum. If you were researching problems with a production line, cost would go on your horizontal axis, split into bins with set boundaries, and the recommended number of bins is √n (n being the number of samples). Each bin groups the data into a class of values, and the height of its bar shows how many data points fall into that class. The histogram will not show you the raw data; it only represents the frequency distribution. I would suggest familiarizing yourself with your company’s statistical software so that everyone performs the analysis uniformly. Having the statistical guidelines per the software will save you in some auditing situations. Minitab, Excel, and many others provide straightforward access to histogram construction. (Kubiak, 2017) Most software equalizes the width of the bars, but the way I have most often seen the width determined by hand is:
- Determine # of Bars to use based upon the sample size:
  - Sample size of 100 or less: 7-10 Bars
  - Sample size of 101-200: 11-15 Bars
  - Sample size of 201 or more: 13-20 Bars
- Choose # of Bars to use
- Width (W) = Overall Range of Data (R) / # of Bars (B), or W = R/B
- Keep adding W to the previous bar to find the lower edge of the next bar, starting from the lower edge of the first bar (0 if your data start at zero, otherwise a convenient value at or just below your smallest data point)
(Tague, 2005)
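The hand method above can be sketched in a few lines of Python. This is only an illustration: the data values and the function names `bin_edges` and `count_per_bin` are hypothetical, the sample is far smaller than the fifty-point minimum so it fits on the screen, the number of bars comes from the √n rule of thumb, and the first bar is assumed to start at the smallest data value.

```python
import math

def bin_edges(values, bars):
    """Return bars + 1 equally spaced edges covering the data (W = R / B)."""
    low, high = min(values), max(values)
    width = (high - low) / bars
    return [low + i * width for i in range(bars + 1)]

def count_per_bin(values, edges):
    """Count the values in each bin; the last bin also includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    width = edges[1] - edges[0]
    for v in values:
        i = min(int((v - edges[0]) / width), len(counts) - 1)
        counts[i] += 1
    return counts

# Hypothetical measurements (made up for illustration).
values = [2, 3, 5, 5, 6, 7, 7, 8, 8, 8, 9, 10, 11, 12, 12, 13, 14, 15, 17, 22]

bars = math.ceil(math.sqrt(len(values)))  # square-root rule of thumb for the bin count
edges = bin_edges(values, bars)
counts = count_per_bin(values, edges)

# Print a rough text histogram: one '#' per data point in the bin.
for lower, upper, count in zip(edges, edges[1:], counts):
    print(f"{lower:>5.1f} to {upper:>5.1f} | {'#' * count}")
```

The printout never shows the raw values, only how many landed in each bin, which is the point made above about a histogram representing the frequency distribution rather than the data itself.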
Conclusion
A histogram is a way to count how often your data occur within set boundaries, and then show graphically how your data is distributed. Always remember that a histogram constructed with too many or too few bins can be misleading. Always check the numbers yourself! This tool is one of the Seven Basic Quality Tools and is meant to help flag issues like outliers or non-normal data. It is not something that can solve a problem on its own, but a tool that enables you to understand what the data is telling you. The next post will cover another visual stat tool: the box and whisker diagram (for any cat lovers 😊).
Bibliography
Kubiak, T. M., & Benbow, D. W. (2017). The Certified Six Sigma Black Belt Handbook, Third Edition. Milwaukee: ASQ Quality Press.
Tague, N. R. (2005). The Quality Toolbox. Milwaukee: ASQ Quality Press.
Basic QA Statistics Series(Part 4)- Interquartile Range-IQR

REFLECTION: FOR STUDENTS: A good rule in organizational analysis is that no meeting of the minds is really reached until we talk of specific actions or decisions. We can talk of who is responsible for budgets, or inventory, or quality, but little is settled. It is only when we get down to the action words-measure, compute, prepare, check, endorse, recommend, approve-that we can make clear who is to do what. -Joseph M. Juran
FOR ACADEMICS: Without a standard there is no logical basis for making a decision or taking action. -Joseph M. Juran
FOR PROFESSIONALS/PRACTITIONERS: Both pure and applied science have gradually pushed further and further the requirements for accuracy and precision. However, applied science, particularly in the mass production of interchangeable parts, is even more exacting than pure science in certain matters of accuracy and precision. -Walter A. Shewhart
Foundation
When we left this small series on basic QA statistics, we had just discussed basic measures of Dispersion- Range, Variance, and Standard Deviation. As promised, we are now covering the basics of Interquartile Range (IQR for short). IQR is also a measure of dispersion, but as I’m sure you will be exposed to IQR in the future, I thought it best to give it a separate post.
The IQR, like the other measures of dispersion, is used to measure the spread of the data points in a data set. IQR is best used alongside other measurements, such as the median and total range, to build a complete picture of a data set’s tendency to cluster around its mean. IQR is also a very useful tool for identifying outliers (values abnormally far from the mean of a data set), but do not worry about the more in-depth math for now.
First, let’s define all of the pieces behind IQR:
-First Quartile (Q1)- The value at which 25% of the data are less than or equal to this value (does not have to be a value in the data set).
-Second Quartile (Q2)- The value at which 50% of the data are less than or equal to this value. It is also known as the median. The second quartile or median does not have to be a value in the data set.
-Third Quartile (Q3)- This is the point at which 75% of the data are less than or equal to this value. It also does not have to be in the data set.
-Fourth Quartile (Q4)- This value is the maximum value in the data set (100% of the data are less than or equal to this value).
-Interquartile Range (IQR)- IQR is the Third Quartile minus the First Quartile (Q3 – Q1) and is considered a measure of dispersion.
(Kubiak, 2017)
Calculating Quartiles
There are several methods for calculating quartiles, so the technique I am going to use is just what I consider the most basic without delving into any more in-depth math.
Steps:
- Order the data set from smallest to largest.
- Determine the median (reference my post: Basic QA Statistics Series(Part 2)- Basic Measures of Central Tendency and Measurement Scales).
- This determination separates the data into two sets (an upper half and a lower half). This median is Q2.
- The First Quartile (Q1) is found by determining the median of the lower half of the data (not including the median from the previous step in the lower half).
- Q3 is the median of the upper half of the data set, likewise not including the median (Q2) in the upper half.
- Q4 is the maximum in the data set.
(Kubiak, 2017)

Data Set: 22,26,24,29,25,24, 23,26,28,30,35,40,56,56,65,57,57,75,76,77,74,74,76,75,72,71,70,79,78, 1000,10,12,13,15,16,12,11,64, 65,35, 25,28, 21,44,46,55,77, 79,85,84,86,15,25,35, 101,12,25,35,65,75
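The steps above can be sketched in Python using the data set shown here. This is a minimal sketch of the median-of-halves method only; the function name `quartiles` is my own, and stat packages often use slightly different quartile rules, so their answers may not match exactly.

```python
from statistics import median

data = [22, 26, 24, 29, 25, 24, 23, 26, 28, 30, 35, 40, 56, 56, 65, 57, 57,
        75, 76, 77, 74, 74, 76, 75, 72, 71, 70, 79, 78, 1000, 10, 12, 13, 15,
        16, 12, 11, 64, 65, 35, 25, 28, 21, 44, 46, 55, 77, 79, 85, 84, 86,
        15, 25, 35, 101, 12, 25, 35, 65, 75]

def quartiles(values):
    """Return (Q1, Q2, Q3, Q4) using the median-of-halves method."""
    ordered = sorted(values)              # Step 1: order smallest to largest
    n = len(ordered)
    half = n // 2
    q2 = median(ordered)                  # Step 2: the median is Q2
    q1 = median(ordered[:half])           # median of the lower half
    q3 = median(ordered[half + n % 2:])   # median of the upper half (skips Q2 when n is odd)
    q4 = ordered[-1]                      # the maximum
    return q1, q2, q3, q4

q1, q2, q3, q4 = quartiles(data)
print(f"Q1={q1} Q2={q2} Q3={q3} Q4={q4} IQR={q3 - q1}")
```

For this data set the sketch gives Q1 = 25, Q2 = 45, Q3 = 74.5, and an IQR of 49.5, with 1000 sitting at Q4.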
Conclusion
As you can see, I stacked the data deck with a massive outlier in the data set. 1000 is far from the mean, but the IQR is not affected by this enormous outlier, as it only takes into account Q1 and Q3.
This property of IQR helps prevent outliers from convincing you the mean is just fine, when in fact, the entire system may be out of whack but compensated for by outliers in your data. The little chart you see is called a Box and Whisker plot, and we will give it a separate post later, after we discuss Histograms in the next post.
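If you want to see the outlier effect for yourself, the sketch below recomputes the spread of the data set from this post with and without the 1000 value. It is a rough illustration, with helper names of my own choosing and the same median-of-halves quartiles described above.

```python
from statistics import mean, median, stdev

data = [22, 26, 24, 29, 25, 24, 23, 26, 28, 30, 35, 40, 56, 56, 65, 57, 57,
        75, 76, 77, 74, 74, 76, 75, 72, 71, 70, 79, 78, 1000, 10, 12, 13, 15,
        16, 12, 11, 64, 65, 35, 25, 28, 21, 44, 46, 55, 77, 79, 85, 84, 86,
        15, 25, 35, 101, 12, 25, 35, 65, 75]

def iqr(values):
    """Q3 minus Q1, with quartiles found by the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return median(ordered[half + n % 2:]) - median(ordered[:half])

def spread_report(values):
    """Collect the mean and three measures of dispersion for one data set."""
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "sd": stdev(values),   # sample standard deviation
        "iqr": iqr(values),
    }

with_outlier = spread_report(data)
without_outlier = spread_report([x for x in data if x != 1000])

for label, report in (("With 1000", with_outlier), ("Without 1000", without_outlier)):
    print(f"{label:>12}: mean={report['mean']:.1f} range={report['range']} "
          f"SD={report['sd']:.1f} IQR={report['iqr']}")
```

On this data the range drops from 990 to 91 once the outlier is removed, while the IQR only shifts from 49.5 to 49.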
Bibliography
Kubiak, T. M., & Benbow, D. W. (2017). The Certified Six Sigma Black Belt Handbook, Third Edition. Milwaukee: ASQ Quality Press.
Basic QA Statistics Series(Part 3)- Basic Measures of Dispersion and Statistical Notation

REFLECTION: FOR STUDENTS: “It is not possible to know what you need to learn.” -Philip Crosby
FOR ACADEMICS: “Quality is the result of a carefully constructed cultural environment. It has to be the fabric of the organization, not part of the fabric.”-Philip Crosby
FOR PROFESSIONALS/PRACTITIONERS: “Quality has to be caused, not controlled.”-Philip Crosby
Foundation
Before we go further, this post will give you the basic notation for simple statistics so we can communicate more efficiently. It will also make understanding instructions from textbooks much less challenging. Please don’t give up here. These notations are just a secret code mathematicians use. If you learn it, you will begin to see that statistics is quite accessible. After the code is passed on, we will move on to the Measures of Dispersion.
Review: Part 1 and 2 covered the definition of Population, Sample, and how the terms Parameter and Statistic relate to Population and Sample, respectively. Also, we covered the concept of what data is, as well as the different kinds of data that exist, and the measurement scales used to analyze measurement data.
STATISTICAL NOTATION
Typically, capital letters and Greek letters are used to refer to population parameters, and lower-case or Roman letters are used to note sample statistics.
I will be providing information in the table below specifically for this post. As posts are added in the series, more tables will be added to address any other notations referenced in the future. This post will become the notation reference page to allow any who are new to statistical notation an easy reference.

(Kubiak, 2017)
MEASURES OF DISPERSION
There are three primary Measures of Dispersion- Range, Variance, and Standard Deviation. I will address each and explain them plainly. If you are new to statistics, I will avoid mathematics as much as possible, but alas, you will find it inescapable.
First comes RANGE. Range is probably the most well known and most easily understood. Range is simply the difference between the largest (Maximum or MAX) value and the smallest (Minimum or MIN) value in a data set.
Example: 24, 36, 54, 89, 12, 14, 44, 55, 75, 86
Min = 12, Max = 89
Range (R)= Max-Min = 77
Though Range is easy to use, it is not always as useful as the other measures of dispersion, because sometimes two separate data sets can have very similar ranges, with the other measures looking nothing alike. On that note, comes something a bit more complicated.
At first, it sounds pretty simple:
VARIANCE- This is the measure of how far off the data values are from the mean overall. Obtaining this measurement by hand can be painful. You have to find the difference between the mean and each data point in the population or sample, square the differences, and then find the average of those squared differences (dividing by n-1 rather than n when the data is a sample).
Variance RoadMap
- Calculate the mean of all the data points
- Calculate the difference between the mean and each data point (Xi – μ for a population, or xi – x̄ for a sample), with Xi representing the ith value of variable X
- Square the calculated differences for all data points
- Add these Squared values together
- Divide that number by N if the data set is a population (N), or divide by n-1 if the data is a sample
Follow the steps above, and you arrive at the formula for Variance below. Most stat software will calculate Variance with minimal effort.
Sample
s² = Σ(xi – x̄)² / (n – 1)
Population
σ² = Σ(Xi – μ)² / N
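The Variance roadmap translates almost line for line into Python. The sketch below is illustrative only: the function name `variance_by_hand` is hypothetical, the data are borrowed from the Range example earlier in this post, and the result is cross-checked against Python's built-in `statistics` module.

```python
import math
import statistics

def variance_by_hand(values, is_population=False):
    """Follow the roadmap: mean, differences, squares, sum, divide."""
    mean = sum(values) / len(values)                    # mean of all data points
    squared_diffs = [(x - mean) ** 2 for x in values]   # squared difference for each point
    # Divide by N for a population, or by n - 1 for a sample.
    divisor = len(values) if is_population else len(values) - 1
    return sum(squared_diffs) / divisor

data = [24, 36, 54, 89, 12, 14, 44, 55, 75, 86]

sample_var = variance_by_hand(data)
population_var = variance_by_hand(data, is_population=True)

print(f"Sample variance:     {sample_var:.2f}")
print(f"Population variance: {population_var:.2f}")

# Cross-check against the standard library.
assert math.isclose(sample_var, statistics.variance(data))
assert math.isclose(population_var, statistics.pvariance(data))
```

The sample figure always comes out a little larger than the population figure for the same numbers, because the same sum of squares is divided by n-1 instead of N.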

Standard Deviation (SD)
A drawback of Variance is that, although it measures the relative spread of the data, it is not on the same scale as the data because the differences have been squared. For example, data collected in inches or seconds and then checked for variance ends up in square inches or seconds squared.
Standard Deviation is more useful because its units end up on the same scale as the data and are directly comparable to the mean of the population or sample. Standard Deviation is the square root of the Variance and can be loosely described as the typical distance from each data point to the mean. The lower the SD, the less spread out the data is. The larger the spread of data, the higher the SD. Once again, most stats programs and calculators will provide SD with no problem. The SD helps you understand how much your data is varying from the mean.
Two Examples: (using sample sets)
Set 1: 35, 61, 15, 14, 1
Mean(Set 1): 25.2
S=√(((35-25.2)²+(61-25.2)²+(15-25.2)²+(14-25.2)²+(1-25.2)²)/4)=23.41
Set 2: 45, 48, 50, 43, 40
Mean(Set 2): 45.2
S=√(((45-45.2)²+(48-45.2)²+(50-45.2)²+(43-45.2)²+(40-45.2)²)/4)=3.96
When you first glance at the small sample of data, set one looks like it has a much larger spread from the average than set two. When you run the numbers, the SD results back up your “gut feeling.” An analysis is always better than a “gut feeling,” no matter how intuitive you are. The larger the sample set you are looking at, the more the initial appearance of the data can mislead you, so always run those numbers!
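If you would rather let a computer run those numbers, here is a minimal Python sketch of the sample Standard Deviation for the two example sets. The function name `sample_sd` is my own; Python's built-in `statistics.stdev` would do the same job.

```python
import math

def sample_sd(values):
    """Sample standard deviation: square root of the sample variance."""
    mean = sum(values) / len(values)
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (len(values) - 1))

set_1 = [35, 61, 15, 14, 1]
set_2 = [45, 48, 50, 43, 40]

for name, values in (("Set 1", set_1), ("Set 2", set_2)):
    print(f"{name}: mean={sum(values) / len(values):.1f} SD={sample_sd(values):.2f}")
```

This prints an SD of 23.41 for Set 1 and 3.96 for Set 2, confirming that Set 1 is far more spread out around its mean.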
Conclusion
To recap, Range is the most well-known and straightforward Measure of Dispersion, but it only describes the dispersion of the extremes of the data, and therefore may not always provide much new information. Range is usually most useful with smaller data sets. I should also mention a term known as the Interquartile Range (IQR); I will be dedicating the next post to it.
Variance is an overall measure of the variation occurring around the mean using the Sum of Squares methodology. Remember, variance is expressed in squared units, so you cannot compare a variance number directly to the mean; use it instead to see how the values within a data set relate to each other. Outliers (data points far from the mean) gain added significance with variance as well, because their distances from the mean are squared. Standard Deviation tends to be the most useful Measure of Dispersion, as it relates directly to the mean, and can be used to compare the spreads of various data sets. Remember, your stats programs will help you, and many online resources will walk you through any calculation. If you have any questions, shoot me a comment, and I will answer it for you. See you next time as we dig a bit deeper into IQR. 😊
Bibliography
Kubiak, T. M., & Benbow, D. W. (2017). The Certified Six Sigma Black Belt Handbook, Third Edition. Milwaukee: ASQ Quality Press.