Testing for normality using Excel is a routine procedure for Quality Engineers. In this post, I’ll show you how to test for normality using the Anderson-Darling procedure. We will learn how to compute the Anderson-Darling test statistic and p-value for a normal distribution, compute an Anderson-Darling critical value, and compare it to the test statistic. We will do so by performing these calculations with the cumulative normal distribution function in Excel and testing the normality of a sample set of data.
WHY ARE ENGINEERS INTERESTED IN TESTING FOR NORMALITY USING EXCEL?
First of all, Excel is everywhere and not everyone has a statistical package available to them. So, testing for normality using Excel is an easy way to go. Excel holds the kind of data Engineers work with on a daily basis. For example, Engineers are often interested in knowing how a specific feature of a product conforms to a desired specification. So, they will sample a specific number of units, measure a feature of interest, and apply statistical procedures to that data to compute a non-conformance rate.
WHAT IS A NON-CONFORMANCE RATE?
A non-conformance rate is often reported in parts-per-million (PPM). This rate is the number of units expected to fall outside the specification limits, assuming the data follows a Normal Distribution. Such data should be free of intermittent shifts, meaning it is identically distributed. This is often evaluated by a visual inspection of a histogram to see whether the data is bell-shaped with a single peak. Last, the data should be free of autocorrelation – meaning it is independently distributed. We can plot the data, in the order collected, and look for trends up or down. Once we check these assumptions, we can be confident the data is random and follows a fixed normal distribution having a constant mean and standard deviation. Under these conditions we can estimate our PPM non-conformance rate.
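To make the PPM idea concrete, here is a minimal Python sketch of the calculation (it is not part of the Excel workbook). The specification limits of 42 and 58 are hypothetical values chosen only for illustration; the mean and standard deviation are the sample values used later in this post.

```python
import math

def normal_cdf(x, mean, sd):
    # Cumulative area under the normal curve up to x
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Hypothetical specification limits, for illustration only
lsl, usl = 42.0, 58.0
mean, sd = 49.82, 2.639                  # sample average and standard deviation from this post

p_low = normal_cdf(lsl, mean, sd)        # fraction expected below the lower spec limit
p_high = 1 - normal_cdf(usl, mean, sd)   # fraction expected above the upper spec limit
ppm = (p_low + p_high) * 1_000_000       # non-conformance rate in parts-per-million
print(round(ppm))
```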
In this post, I’ll discuss a statistical procedure called the Anderson-Darling test for Normality. I will show how this procedure is used for testing normality using Excel. In another post, I’ll show a Runs Test procedure for Detecting Non-Randomness. Together these procedures help test the assumptions for a distribution of interest.
ANDERSON DARLING TEST HYPOTHESES FOR A NORMAL DISTRIBUTION
There are several statistical procedures to test Normality. One is the Anderson-Darling test, named after Theodore Wilbur Anderson and Donald A. Darling, who derived this procedure in 1952. Their test requires that we compute the Anderson-Darling statistic (A). By comparing this test statistic to a critical value, we can decide whether the data follows a normal distribution. That is the question we will address while testing normality using Excel.
Shown below are the null and alternative hypotheses for this test:
HNULL: The data follows the normal distribution
HALTERNATIVE: The data does not follow the normal distribution
The null and alternative hypotheses are statements that the data are normally versus non-normally distributed.
SAMPLE DATA SET
Table 1 shows a data set of 50 observations. We will use this data to illustrate the step-by-step Anderson-Darling calculations for testing normality using Excel.
THE ANDERSON-DARLING TEST
The Anderson-Darling test will verify if a data set comes from a specific distribution. In our case, we will test if it came from a normal distribution. This test makes use of the cumulative distribution function F(X). We compute the Anderson-Darling statistic using the following equations.
First, we compute the value, Si, for each observation using the expression below.
Si = (2i – 1)[lnF(Xi) + ln(1 – F(Xn-i+1))]
In the expression for Si, i is the ith sample when we sort the data in ascending order. F(Xi) is the cumulative distribution function for a distribution of interest; in our case, we will use the normal distribution. F(Xn-i+1) is the cumulative distribution function evaluated at the data sorted in descending order.
Once we compute Si for each observation we then sum all Si values from 1 to n as shown in the expression below.
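S = S1 + S2 + … + Sn (that is, the sum of the Si values for i = 1 to n)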
Once we compute S, we can calculate the Anderson-Darling statistic, A, using the expression below.
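A = -n - (S/n)

If you would like to verify the spreadsheet result outside of Excel, here is a minimal Python sketch of the same calculation. It is only an illustration: the function name is mine, it uses nothing beyond the standard library, and, like the spreadsheet, it substitutes the sample mean and standard deviation for µ and σ.

```python
import math

def anderson_darling_normal(data):
    """Anderson-Darling statistic A for a normal distribution, estimating
    the parameters (mu, sigma) with the sample mean and standard deviation."""
    n = len(data)
    x = sorted(data)                                  # X1..Xn in ascending order
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))

    def F(v):                                         # cumulative normal distribution F(X)
        return 0.5 * (1 + math.erf((v - mean) / (sd * math.sqrt(2))))

    # Si = (2i - 1)[ln F(Xi) + ln(1 - F(Xn-i+1))]
    S = sum((2 * i - 1) * (math.log(F(x[i - 1])) + math.log(1 - F(x[n - i])))
            for i in range(1, n + 1))
    return -n - S / n                                 # A = -n - S/n
```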
THE ANDERSON-DARLING TEST USING EXCEL
To demonstrate the calculation for testing normality using Excel, we will use the data in Table 1. Figure 1 shows the calculations.
STEP BY STEP CALCULATIONS
I will now walk through the step-by-step calculations for testing normality in Excel shown in Figure 1.
STEP 1.
Label cell A1 as shown in Figure 2 and enter/copy the data in cells A2 through A51.
STEP 2.
Label cell B1 as shown in Figure 3. Then enter 1, 2, 3 into cells B2, B3, and B4, highlight cells B2 through B4, and drag the fill handle down to cell B51. Doing so will produce a sequential list of values from 1 to 50.
STEP 3.
In Figure 4, shown in cell C2, enter the formula: =SMALL($A$2:$A$51,B2). This formula has the following format: =SMALL(array, k). It finds the kth smallest value in an array (in our example, k is the rank i in column B). So, the formula used in cell C2 returns the first (k=1) smallest value in the array A2 through A51. Highlight and copy C2 all the way down to cell C51. Note that the dollar signs ($) keep the cell references for the array constant when you copy the formula. Rather than typing the dollar signs ($), you can use F4 on your keyboard: just highlight the array reference with your mouse and press F4, and the dollar signs ($) are inserted automatically. Notice that in cell C3 the formula returns the second smallest value, because k=2.
STEP 4.
In Figure 5, in cells D1, D2, and D3, type the labels Average, StDev, and n. Next, in cells E1 and E2, use the following Excel formulas for the average and standard deviation. For the average use: =AVERAGE(A2:A51). For the standard deviation use: =STDEV(A2:A51). Then, in cell E3, enter the value 50 for the sample size n.
STEP 5.
In cells F1, G1, H1, I1, and J1, type the following labels: 2i – 1, LN(F(Xi)), Descending Data Xn-i+1, LN(1-F(Xn-i+1)), and Si, as shown in Figure 6. Then adjust the ROW HEIGHT and COLUMN WIDTH as you like. Also, feel free to WRAP TEXT in any cell that has a large column width.
STEP 6.
In column F, we need to perform the calculation shown in Figure 7 below. In cell F2, enter the following formula: =2*B2-1, and drag this cell calculation to cell F51.
STEP 7.
As shown in Figure 8, enter this formula into cell G2: =LN(NORM.DIST(C2,$E$1,$E$2,TRUE)). This cell contains two Excel functions. Nested in the natural log (LN) is the NORM.DIST function. Please note, the normal distribution function has the following form: NORM.DIST(x, mean, standard_dev, cumulative). In this expression x=C2, mean=E1, standard_dev=E2, and cumulative=TRUE. In cell G2, we are computing the cumulative area under the normal curve, as a probability from 0 to 1, at Xi = 44. To compute the cumulative area at Xi = 44, we use the sample average (49.82) and standard deviation (2.639) as substitutes for the normal distribution parameters (µ and σ). Once we have the area under the normal curve, we then compute its natural log. Don’t forget to place $ signs around the references to cells E1 and E2, and then copy the contents of cell G2 down to cell G51.
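If you want a quick, independent check of the value in cell G2, this one-line Python equivalent (using the sample statistics quoted above) computes the same quantity:

```python
import math
# ln of the cumulative normal probability at Xi = 44, with mean 49.82 and standard deviation 2.639
print(math.log(0.5 * (1 + math.erf((44 - 49.82) / (2.639 * math.sqrt(2))))))
```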
STEP 8.
In column H, reorder the data in descending order. To do so, use the Excel function =LARGE(array, k). Type the following formula in cell H2: =LARGE($A$2:$A$51,B2). This formula looks at the data in cells A2 through A51 and returns the first largest value. Remember to place the $ signs around A2 and A51 before you copy the contents of this cell down to cell H51.
STEP 9.
Enter the following formula in cell I2: =LN(1-NORM.DIST(H2,$E$1,$E$2,TRUE)), as shown in Figure 10. This cell contains two Excel functions. Nested within the natural log (LN) is NORM.DIST, the normal distribution function. Please note, this function has the following form: NORM.DIST(x, mean, standard_dev, cumulative). In this expression x=H2, mean=E1, standard_dev=E2, and cumulative=TRUE. We are computing the cumulative area under the normal curve, as a probability from 0 to 1, with the data in descending order. Here, we are computing the cumulative probability at Xn-i+1 = 56 using the sample average (49.82) and standard deviation (2.639) as substitutes for the normal distribution parameters (µ and σ). Once we have the cumulative area under the normal curve, we then subtract it from 1 and compute the natural log. Don’t forget to place $ signs around the references to cells E1 and E2, and then copy the contents of cell I2 down to cell I51.
EXPLAINING THE ANDERSON DARLING PROCEDURE
At this point we need to pause and take the time to understand what the Anderson-Darling procedure is trying to do. Such an explanation will make our effort in testing for normality using Excel more meaningful.
Recall, in column G (Figure 8) we computed the cumulative probability with the data in ascending order. Then, in column I (Figure 10), we computed the cumulative probability with the data in descending order and subtracted each value from 1. Because the normal distribution is symmetrical, we expect the probability at i=1 and i=50 to be the same. This is also true for the cumulative probability at i=2 and i=49, and so on. If the data were perfectly normal, the difference in probabilities at i=1 and i=50 would be zero; the same is true for i=2 and i=49. Summing these differences from i=1 to n=50 would then result in a value of zero – meaning the data is normally distributed. Any value other than zero is compared to a critical value associated with a specific risk. If the statistic exceeds that critical value, we say the data is not normally distributed at a specific level of confidence.
STEP 10.
We now need to compute Si for each cell in column J. As shown in Figure 11, enter the following formula in cell J2: =F2*(G2+I2). Then copy this cell all the way down to cell J51. Notice in column J, we have now computed the values using the formula Si = (2i-1)[lnF(Xi) + ln(1-F(Xn-i+1))].
STEP 11.
We are almost done testing for normality in Excel! In cell J52, as shown in Figure 12, we need to sum the values in column J between cells J2 and J51. To do so, enter the following formula in cell J52: =SUM(J2:J51). As a result, we have now computed the value of S using the expression below.
STEP 12.
Once we compute S, we can calculate the Anderson-Darling Statistic, A. This is shown in Figure 13 using the expression A = -n - (S/n). This is the value in cell J53. Enter the following formula into cell J53: =-E3-1/E3*J52.
ADJUSTED ANDERSON-DARLING TEST STATISTIC (A*) FOR SAMPLES OF SIZE n FOR A NORMAL DISTRIBUTION
When the population mean (µ) and standard deviation (σ) are unknown, but estimated using the sample mean (x̄) and standard deviation (s), we must compute an Adjusted Anderson-Darling statistic, A*. I show the expression for the adjusted value below.
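A* = A(1 + 0.75/n + 2.25/n²)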
Notice the n in the denominators of the terms inside the brackets of A*. As the sample size, n, increases, the value contained within the brackets approaches 1, so A* approaches A when n is large. For n = 10 the value contained within the brackets equals 1.0975, and it decreases as n becomes larger.
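For our example, n = 50, so the bracketed term equals 1 + 0.75/50 + 2.25/50² = 1.0159, and A* = 1.0159 × A.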
THE CRITICAL VALUE FOR THE ADJUSTED ANDERSON-DARLING TEST STATISTIC
We will use the values in Table 2 to compute a critical value for the Anderson-Darling statistic. Notice in this Table, I show the values for a, b, and d for an alpha risk of 0.01 (α = 0.01) and 0.05 (α = 0.05).
A statistically significant difference between the sample data and the theoretical normal distribution exists when the adjusted statistic is equal to or exceeds the critical value (A* ≥ critical value). So, let’s compute a critical value for a normal distribution, for a given risk (α), using the expression below.
In our example n = 50. Let’s assume an alpha risk of 0.05 and use a = 0.7514, b = 0.795, and d = 0.89. Substituting these values into the equation then yields a critical value of 0.7392, as shown below.
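Critical Value = a(1 - b/n - d/n²) = 0.7514 × (1 - 0.795/50 - 0.89/50²) = 0.7392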
THE P-VALUE FOR THE ADJUSTED ANDERSON-DARLING TEST STATISTIC
Let’s now use the expressions in Table 3 to estimate the p-value for A* = 0.450. Since A* falls between 0.34 and 0.60, we will use exp(0.9177 - 4.279A* - 1.38(A*)²) to compute the p-value. The equations in Table 3 are accurate to five decimal places.
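p-value = exp(0.9177 - 4.279(0.450) - 1.38(0.450)²) = exp(-1.2873) ≈ 0.276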
As shown, the p-value = 0.276. So, if we assume a critical alpha risk of 0.05 (α = 0.05), we see that 0.276 > 0.05 and conclude there is no evidence to suggest the data is non-normal.
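For readers who would rather script these last few steps, here is a minimal Python sketch. It assumes the adjustment A* = A(1 + 0.75/n + 2.25/n²), uses the constants a, b, and d quoted above for α = 0.05, and implements only the p-value expression for 0.34 < A* ≤ 0.60; other ranges of A* require the other expressions in Table 3.

```python
import math

def adjusted_statistic(A, n):
    # A* = A(1 + 0.75/n + 2.25/n^2); A is the value computed in cell J53
    return A * (1 + 0.75 / n + 2.25 / n ** 2)

def critical_value(n, a=0.7514, b=0.795, d=0.89):
    # Critical value for alpha = 0.05: a(1 - b/n - d/n^2)
    return a * (1 - b / n - d / n ** 2)

def p_value(A_star):
    # Valid only for 0.34 < A* <= 0.60 (see Table 3 for the other ranges)
    return math.exp(0.9177 - 4.279 * A_star - 1.38 * A_star ** 2)

n = 50
A_star = 0.450
print(critical_value(n))      # approximately 0.7392
print(p_value(A_star))        # approximately 0.276
```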
SUMMARY
I trust you found this Anderson-Darling procedure for testing normality in Excel useful. Since many statistical procedures rely on distributional assumptions, knowing how to test for normality is critical. By testing for normality using Excel, we can check whether a sample set of data came from a population with a specific distribution. In our case, we applied the Anderson-Darling procedure to compute a test statistic and a critical value. We also estimated a p-value for our Anderson-Darling test statistic. We then used these to determine if the sample data came from a Normal population.
There are non-parametric techniques that don’t make distributional assumptions, but they are not as powerful as techniques that use a specific distribution. Thus, if we can confirm the distributional assumptions, the distribution-based techniques are preferred.
POPULAR RESOURCES
Please visit these popular blog post resources:
- Xbar and R Chart Formula and Constants – The Definitive Guide.
- Estimating the d2 and d3 Constants Using Minitab.
- D2 Values for the Distribution of the Average Range.
- How to Calculate Gage Repeatability Using the Average Range.
- Control Chart Constant – How to Derive A2 and E2.
- Range Statistics – How to Calculate Standard Deviation.
COMMENTS
Hello,
In step 7, I believe the formula is incorrect (throws an error in Excel): =LN(NORM.DIST(C2,$E$1,$E$52,TRUE)
Should be =LN(NORM.DIST(C2,$E$1,$E$2,TRUE)
Thank you so much for finding and letting me know about this error!
Dear Mr Milivojevich,
I think the critical value is: 0.7392 instead of 0.9837
Thanks for the article!
Thank you for catching the error! It has since been corrected.
Take care,
Andrew
Dear Mr Milivojevich,
I found your step by step guide for calculating the Anderson-Darling statistic in Excel very instructive. But most likely there is an error in the p-value equations in Table 3: the equations for when 0.34 < A* ≤ 0.6 should not start with ‘1 – ‘. See for example https://www.statisticshowto.com/anderson-darling-test/.
Kind regards,
Inge Uljee
Hello Inge.
Thank you for pointing this out. I very much appreciate this.
Identifying opportunities to improve serves others that benefit from our collective knowledge – thank you.
Best Regards,
Andrew
Andrew,
Very interesting article for those of us trying to understand further, many thanks.
I have a silly question. In your last paragraph before the summary “As shown, the p-value = 0.724. So, lets assume a critical alpha risk of 0.05 (α = 0.05) we then see that 0.276 > 0.05 and conclude there is no evidence to suggest the data is non-normal”, I am struggling with 0.276, is this a typo and should be 0.724?
Thanks for clarifying.
Hello Andres.
Thank you for identifying the error in my calculation.
When solving the expression (1 – e^-1.2873) you get 1 – 0.276 = 0.724. I copied the value of the exponent (0.276) instead of value 0.724.
Thank you for reading this post and seeing this error – I corrected the post.
Best regards,
Andrew
Hi Andrew,
I would like to understand the P-value stated in ASQ/ANSI Z1.9-2003;
At first I would like to know is the P-Value stated in Above spec. is same as Normality P-Value which we calculate via Minitab(Anderson darling test/Shapiro-wilk/Kolmogorov)?
I am struggling to consider sampling Inspection criteria for variable sampling plan with reference to ASQ/ANSI Z1.9-2003 spec.
Could you please help me to understand this better?
Hello Sriram.
To answer your question, I would need to review the standard you referenced above. Unfortunately, I do not have this document and may not be much help to you. But, from what I can determine on the ASQ website, the variable sampling plans mentioned apply to a specific characteristic. Also, which sampling plan is used depends on whether or not there is a variability estimate. It also appears the sampling plans assume a Normal Distribution. As such, I believe the p-values used in these sampling plans refer to a probability under the normal curve. This is different than the p-value reported by the Anderson-Darling and Shapiro-Wilk Normality Tests. These tests evaluate if the data can be represented by a Normal Distribution. Whereas, the probability used in the variable sampling plans corresponds to a defect rate (an area under the Normal Curve). Based on that defect rate and the appropriate number of samples drawn, you can determine whether the defect rate exceeds what you and your customer deem acceptable.
If anyone else reading this thread has more knowledge in these matters I trust they will add their comments.
Best Regards,
Andrew
Hi Andrew,
I am so glad that I got the perfect answer for my query, well thank you so much for the clarification. I understood the P represented in the ASQ/ANSI spec is based on the defect rate and the appropriate number of samples drawn to determine the defect. And the P-Value which we use to check Normality is different.
Thank you once again.
Now, in order to proceed for sampling/reduced inspection: as a first step before considering the variable sampling method as per ASQ/ANSI Z1.9, I understand it is necessary to ensure the continuous inspected parts (eg. 50 – 300 nos) shall be analysed for a Normal distribution (P-Value which we get from Minitab).
I would like to know what method is best way to check the normality?
If Using Minitab, which test is more suitable?
If any other method, please can you suggest?
Apologies if it is a layman question, but your help would be much appreciated.
I would glad to follow your posts in future to gain some knowledge.
Your work is more inspiring and much appreciated.
Thank you,
Regards,
Sriram.V
Hello Sriram.
The Shapiro-Wilk Test tends to be more powerful, followed by the Anderson-Darling Test. Of course the power decreases for small sample sizes. However, for the samples you would be drawing based on your variable sampling plan I would suggest Shapiro-Wilk followed by Anderson-Darling. I suspect both Shapiro-Wilk and Anderson-Darling would be comparable for large sample sizes.
Best Regards,
Andrew Milivojevich
Hi Andrew,
Thankyou for your clarification. I tried to follow both the methods (Anderson-darling & Shapiro-wilk). Well, for some of the dimensions I get P-Value <0.005 & 0.05 (Correct me if am wrong?).
Based on the ASQ/ANSI Z1.9 requirement, it is assumed that before proceeding to the sampling consideration the data should be normal. Well, in my case the data is non-normal, so I tried to do a Johnson transformation to see the P-Value >0.05. But I really couldn’t understand what is the intention of such a transformation (Johnson / Box-Cox). Ideally the data is going to be the same, right?
I am also considering another SPEC AS13002 for sampling inspection. Which explains Normality test is the first step to consider prior to ensuring Process capability. With/without any transformation my data Cpk/Ppk is going to reflect same value right? provided only P-Value is going to be >0.05.Then why should I consider Normality test at first, and how significantly it is going to affect the Process capability results?
Could you please explain in detail if possible?
I am sorry, if am asking more queries.
Hello Sriram.
Let’s deal with your first question…
If your test for normality suggests the data are non-normal then investigate why. Perhaps only a few values are causing the problem?
As for your second question…
You transform the original data to achieve normality because the properties of the normal distribution are known. If your original data is not normally distributed and you wish to compute the number of non-conforming units versus a specification, you could not do so because you do not know the function that explains that distribution. So when you transform the original data to achieve normality, you need to transform the specification as well. Once you compute the proportion of non-conforming units, you would apply the inverse transform to compute the actual number of non-conforming units.
As for your third question…
Capability (Cp, Cpk) and Performance (Pp, Ppk) indices are a means of comparing (among other things) whether the data is made up of common versus special causes of variation. When a system behaves under common causes, the data will be normally distributed. If this is the case, then Cp=Pp and Cpk=Ppk (they won’t be exact, but very close). If the data contains both common and special causes of variation, then Cp > Pp and Cpk > Ppk. This just means that you need to investigate and assign a cause so you can eliminate or reduce the sources of special cause variation. So, I really don’t care if the data is normally distributed – if it were, then there would be no special cause to investigate! What I care about is identifying opportunities to improve the process. In this case, I compute the indices to tell me how good or bad things are with a process of interest.