Testing for normality using Excel is a routine procedure for Quality Engineers. In this post, I’ll show you how to test for normality using the Anderson Darling procedure. Using this procedure we will learn how to compute the Anderson Darling test statistic and p-value for a normal distribution. We will also compute an Anderson Darling critical value and compare this to the Anderson Darling test statistic. We will do so by learning how to perform these calculations using the cumulative normal distribution function within Excel and test the normality of a sample set of data.
WHY ARE ENGINEERS INTERESTED IN TESTING FOR NORMALITY USING EXCEL?
First of all, Excel is everywhere and not everyone has a statistical package available to them. So, testing for normality using Excel is an easy way to go. Excel is hold the kind of data the Engineers work with on a daily basis. For example, Engineers are often interested in knowing how a specific feature of a product conforms to a desired specification. So, they will sample a specific number of units, measure a feature of interest, and apply statistical procedures to that data to compute a non-conformance rate.
WHAT IS A NON-CONFORMANCE RATE?
A non-conformance rate is often reported in parts-per-million (PPM). This rate, is a calculation that computes the number of units that fall outside the specification limits assuming the data follows a Normal Distribution. Such data should be free of intermittent shifts meaning it is identically distributed. This is often evaluated by a visual inspection of a histogram to see the data is bell-shaped with a single peak. Last, the data should be free of autocorrelation – meaning it is independently distributed. We can plot the data, in the order collected, and look for trends up or down. Once we check these assumptions, we can be confident the data is random and follows a fixed normal distribution having a constant mean and standard deviation. Under these conditions we can estimate our PPM non-conformance rate.
In this post, I’ll discuss a statistical procedure called the Anderson-Darling test for Normality. I will show how this procedure is used for testing normality using Excel. In another post, I’ll show a Runs Test procedure for Detecting Non-Randomness. Together these procedures help test the assumptions for a distribution of interest.
ANDERSON DARLING TEST HYPOTHESES FOR A NORMAL DISTRIBUTION
There are several statistical procedures to test Normality. One type of test is the Anderson-Darling test. This test is named after Theodore Wilbur Anderson and Donald A. Darling who derived this procedure in 1952. Their test requires that we compute the Anderson-Darling statistic (A). Comparing this test statistic to a critical value we can see if the data follows or does not follow a normal distribution. Testing Normality using Excel we will address if the data follows or does not follow a Normal Distribution.
Shown below are the null and alternative hypotheses for this test:
HNULL: The data follows the normal distribution
HALTERNATIVE: The data does not follow the normal distribution
The null and alternative hypotheses are statements that the data are normally versus non-normally distributed.
SAMPLE DATA SET
In Table 1, shown is a data set of 50 observations. We will use this data to show the Anderson Darling step-by-step calculations for testing normality using Excel.
THE ANDERSON-DARLING TEST
The Anderson-Darling test will verify if a data set comes from a specific distribution. In our case, we will test if it came from a normal distribution. This test makes use of the cumulative distribution function F(X). We compute the Anderson-Darling statistic using the following equations.
First, we compute the value, Si, for each observation using the expression below.
Si = (2i – 1)[lnF(Xi) + ln(1-F(Xn-i+1)]
In the expression for Si, i is the ith sample when we sort the data in ascending order. F(Xi) is the cumulative distribution function for a distribution of interest. In our case, will use the normal distribution. F(Xn-i+1) is the cumulative distribution function for the data in descending order.
Once we compute, S, we can calculate the Anderson-Darling Statistic, A, using the expression below.
THE ANDERSON-DARLING TEST USING EXCEL
To demonstrate the calculation for testing normality using Excel we will use the data in Table 1. In Figure 1, shown are the calculations.
STEP BY STEP CALCULATIONS
I will illustrate the step by step calculations for testing normality in Excel shown in Figure 1.
Label cell A1 as shown in Figure 2 and enter/copy the data in cells A2 through A51.
Label cell B1 as shown in Figure 3. Then enter 1, 2, 3 into cells B2, B3, and B4 and highlight cells B2 through B4 and drag the highlighted cells to cell B51. Doing so will produce a sequential list of values from 1 to 50.
In Figure 4, shown in cell C4, enter the formula: =SMALL($A$2:$A$51,B2). This formula has the following format: =small(array, k). It finds the smallest value in an array corresponding to its kth order (in our example we use i). So, the formula used in cell C2 gets the k=first smallest value in the array from A2 through A52. Highlight and copy C2 all the way down to cell C52. Note that the dollar signs ($) assures the cell values for the array remain constant when you copy the formula. Rather then typing the dollar signs ($) you can use F4 on your keyboard. Just highlight the array using your mouse and select F4. This will automatically insert the dollar signs ($). Notice cell C3, the array found the second smallest value when k=2.
In Figure 5 and in cells D1, D2, and D3 type the labels Average, StDev, and n. Next, in cells E1 and E2 use the following excel formulas for the Average and Standard deviation. For the average use: =average(A2:A52). For the standard deviation use: =stdev(A2:A52). Then, in cell E3 enter the value 50 for the sample size n.
In cells F1, G1, H1, I1, and J1 type the following labels: 2i – 1, LN(F(Xi)), Descending Data, Xn-i+1, LN(1-F(Xn-i+1)), and Si as shown in Figure 6. Then change the ROW HEIGHT and COLUMN WIDTH to your desire. Also feel free to WRAP TEXT in any cell that has a large column width.
In column F, we need to perform the calculation shown in Figure 7 below. In cell F2, enter the following formula: =2*B2-1, and drag this cell calculation to cell F51.
As shown in Figure 8, enter this formula into cell G2: =LN(NORM.DIST(C2,$E$1,$E$52,TRUE). This cell contains two Excel formulas. Nested in the natural log (LN) is the NORM.DIST formula. Please note, the normal distribution function has the following form, NORM.DIST(x, mean, standard_dev, cumulative). In this expression x=C2, mean=E1, standard_dev = E2, TRUE = cumulative. In cell G2, we are computing the cumulative area under the normal curve, as a probability from 0 to 1 at Xi = 44. To compute the cumulative area at Xi =44, we use the sample average (49.82) and standard deviation (2.639) as substitutes for the normal distribution parameters (µ and σ). Once we have the area under the normal curve, we then compute the natural log. Don’t forget to place $ sign about the cells E1 and E2 and then copy the contents of cell G2 through to G51.
In column H, reorder the data in descending order. To do so use this excel formula, =LARGE(array,k). Type the following formula in cell H2:=LARGE($A$2:$A$51,B2). This formula selects the data in cells A2 through A51 and returns the first largest value. Remember to place the $ signs about A2 and A51 before you copy the contents of this cell through to cell H51.
Enter the following formula in cell I2: =LN(1-NORM.DIST(H2,$E$1,$E$2,TRUE)) as shown in Figure 10. This cell contains two Excel formulas. Nested within the Natural Log (LN) is NORM.DIST, the Normal Distribution function. Please note, this function has the following form, NORM.DIST(x, mean, standard_dev, cumulative). In this expression: x=H2, mean=E1, standard_dev = E2, TRUE = cumulative. We are computing the cumulative area under the normal curve, as a probability from 0 to 1, in reverse order. Here, we are computing the cumulative probability at Xi = 56 using the sample average (49.82) and standard deviation (2.639) as substitutes for the normal distribution parameters (µ and σ). Once we have the cumulative area under the normal curve, we then subtract this from 1 and compute the natural log. Don’t forget to place $ signs about the cells E1 and E2 and then copy the contents of cell I2 through to I51.
EXPLAINING THE ANDERSON DARLING PROCEDURE
At this point we need to pause and take the time to understand what the Anderson-Darling procedure is trying to do. Such an explanation will make our effort testing for normality using Excel meaningful.
Recall, in Figure 10 – column H, we computed the cumulative probability in ascending order. We then computed the cumulative probability in descending order and subtracted this value from 1 in each cell. Because the normal distribution is symmetrical, we expect the probability at i=1 and i=50 to be the same. This is also true for the cumulative probability at i=2 and i=49 and so on. Assuming the data is perfectly normal, the difference in probabilities at i=1 and i=50 would be zero. This is also true for i=2 and i=49. Summing these differences from i=1 to n=50 would result in a value of zero – meaning the data must be normally distributed. Any value other then zero we compare to a critical value associated to a specific risk. If that value exceeds a critical value, we would say the data is not normally distributed at a specific level of confidence.
We now need to compute Si for each cell in column J. As shown in Figure 11, enter the following formula in cell J2: =F2*(G2+I2). Then copy this cell all the way down to cell J51. Notice in column J, we have now computed the values using the formula, Si=(2i-1)[lnF(Xi)+ln(1-F(Xn-i+1)).
We are almost done testing for normality in Excel! In cell J52, as shown in Figure 12, we need to sum the values in column J between cells J2 and J51. To do so, enter the following formula in cell J52: =SUM(J2:J51). As a result, we have now computed the value of S using the expression below.
Once we compute, S, we can calculate the Anderson-Darling Statistic, A. This is shown in Figure 13 using the expression A=-n-(S/n). This is the value in cell J53. Enter the following formula into cell J53: =-E3-1/E3*J52.
Once we compute, S, we can calculate the Anderson-Darling Statistic, A, using the expression below.
ADJUSTED ANDERSON-DARLING TEST STATISTIC (A*) FOR SAMPLES OF SIZE n FOR A NORMAL DISTRIBUTION
When the population mean (µ) and standard deviation (σ) are unknown, but estimated using the sample mean () and standard deviation () we must compute an Adjusted Anderson-Darling statistic, A*. I show the expression for the adjusted value below.
Notice n in the denominator of A*. As the sample size, n, increases the value contained within the brackets approaches 1. As such, A* approaches A when n is large. For n = 10 the value contained within the brackets equals 1.0975 and decreases as n becomes larger. As such, A* and A will be similar when the sample size (n) increases.
THE CRITICAL VALUE FOR THE ADJUSTED ANDERSON-DARLING TEST STATISTIC
We will use the values in Table 2 to compute a critical value for the Anderson-Darling statistic. Notice in this Table, I show the values for a, b, and d for an alpha risk of 0.01 (α = 0.01) and 0.05 (α = 0.05).
A statistical difference exists when the sample data and the theoretical distribution provided by A is equal to or exceeds the critical value (A*≥ critical). So, let’s compute a critical value, for a normal distribution. We will do so for a given risk (α). using the expression below.
In our example n = 50. Let’s assume an alpha risk of 0.05 and use a=0.7514, b = 0.795, and d = 0.89. Thus, substituting these values into the equation below then yields a critical value of 0.9837.
THE CRITICAL P-VALUE FOR THE ADJUSTED ANDERSON-DARLING TEST STATISTIC
Let’s now use the expressions in Table 3 to estimate the p-value for A*=0.450. Since A* falls between 0.34 and 0.60 we will use, exp(0.9177-4.279A*-1.38(A*)2 to compute the p-value. The equations in Table 3 are accurate to within 5 decimals places.
As shown, the p-value = 0.276. So, lets assume a critical alpha risk of 0.05 (α = 0.05) we then see that 0.276 > 0.05 and conclude there is no evidence to suggest the data is non-normal.
In trust you found this Anderson Darling procedure for testing normality in Excel useful. Since many statistical procedures rely on distributional assumptions, knowing how to test for normality is critical. So, testing for normality using Excel we can check if a sample set of data and know if it came from a population with a specific distribution. In our case, we applied the Anderson-Darling procedure to compute a statistic and a critical value. We also estimated a p-value for our Anderson Darling test statistic. We then used this to determine if sample data came from a Normal population.
There are non-parametric techniques that don’t make distributional assumptions, but they are not as powerful as those techniques that use a specific distribution. Thus, if we can confirm the distributional assumptions they are preferred.
- Please visit these popular blog post resources:
- Xbar and R Chart Formula and Constants – The Definitive Guide.
- Estimating the d2 and d3 Constants Using Minitab.
- D2 Values for the Distribution of the Average Range.
- How to Calculate Gage Repeatability Using the Average Range.
- Control Chart Constant – How to Drive A2 and E2.
- Range Statistics – How to Calculate Standard Deviation.