When your data come in different groups or sub-populations, you may want to know whether they arise from the same population or represent different populations (with different parameters).

Let’s consider the following picture:

As you can see, there are three different sets of footprints. Now the question is: do those three sets of footprints belong to the same species of animal? Well, we can guess that two of them (orange and black) actually do, since they belong to birds, while the third one (in blue) is more likely to belong to a dog.

Answering this kind of question is what the (One-way) ANalysis Of VAriance is meant for. It decomposes the total variability into the variance within each group and the variance among groups, relying on the Null Hypothesis that the groups/samples have been drawn from the same population, hence their means are equal (you can read more about Hypothesis tests here). The ANOVA produces an F-statistic (explained later on): for high values of F, above a critical value that depends on the degrees of freedom and on the significance level (3.68 at the 5% level for the example in this article), we can reject the Null and conclude that the groups do not arise from the same population.
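
The critical value changes with the number of groups and observations. As a minimal sketch, assuming SciPy is installed, it can be looked up with the quantile function of the F distribution; the helper name `f_critical` is made up for illustration and is not part of any library.

```python
from scipy.stats import f

def f_critical(k, n_total, alpha=0.05):
    """Upper-tail critical value of F for k groups and n_total observations."""
    dfn = k - 1          # between-group degrees of freedom
    dfd = n_total - k    # within-group degrees of freedom
    return f.ppf(1 - alpha, dfn, dfd)

# 3 groups of 6 observations each, as in the example of this article
print(round(f_critical(3, 18), 2))  # 3.68
# Same number of groups with 10 observations each gives a lower threshold
print(round(f_critical(3, 30), 2))
```

With more observations per group the denominator degrees of freedom grow, so a smaller F is enough to reject the Null at the same significance level.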

In this article, I’m going to dwell on a practical application, so that we can see step by step how to get our F-statistic. For this purpose, I’m going to use Python, first by manually computing each value, then by showing you a useful tool in the package scipy.

As a case study, imagine a bunch of students from different colleges taking the same exam. You want to see if one college outperforms the others, hence your null hypothesis is that the mean marks of all the groups are equal. To keep it simple, we will consider 3 groups (college ‘A’, ‘B’, ‘C’) with 6 students each.

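``````import pandas as pd

# Exam marks of the six students of each college
a=[25,25,27,30,23,20]
b=[30,30,21,24,26,28]
c=[18,30,29,29,24,26]
# One row per student, one column per college
list_of_tuples = list(zip(a, b,c))
df = pd.DataFrame(list_of_tuples, columns = ['A', 'B', 'C'])
df``````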

We can look at this table as a matrix where the index i refers to the student within a college, while the index j refers to the group/college. Hence, the entry Yij refers to the i-th student of the j-th college.
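
In this notation, with k groups of n observations each (k=3 and n=6 here), the quantities computed in the rest of the article can be written compactly. The bar symbols are notation introduced here for illustration: the first line defines the group means and the overall mean, the second shows how the total sum of squares splits into a between-group and a within-group part.

$$
\bar{Y}_{j} = \frac{1}{n}\sum_{i=1}^{n} Y_{ij}, \qquad \bar{Y} = \frac{1}{kn}\sum_{j=1}^{k}\sum_{i=1}^{n} Y_{ij}
$$

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n}\left(Y_{ij}-\bar{Y}\right)^2}_{SS_{total}} = \underbrace{n\sum_{j=1}^{k}\left(\bar{Y}_{j}-\bar{Y}\right)^2}_{SS_b} + \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n}\left(Y_{ij}-\bar{Y}_{j}\right)^2}_{SS_w}
$$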

The analysis relies on the following assumptions:

• Response variable residuals are normally distributed (or approximately normally distributed)
• Variances of populations are equal
• Responses for a given group are independent and identically distributed normal random variables
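
These assumptions can be checked, at least roughly, before running the analysis. The sketch below is one possible approach and not part of the original workflow: it applies the Shapiro-Wilk test to the residuals and Levene's test to the groups, both from `scipy.stats`. With only 6 observations per group such tests have little power, so treat the p-values as indicative.

```python
import numpy as np
from scipy import stats

a = [25, 25, 27, 30, 23, 20]
b = [30, 30, 21, 24, 26, 28]
c = [18, 30, 29, 29, 24, 26]
groups = [np.array(g, dtype=float) for g in (a, b, c)]

# Residuals: each observation minus the mean of its own group
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: the null hypothesis is that the residuals are normally distributed
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: the null hypothesis is that all groups share the same variance
levene_stat, levene_p = stats.levene(*groups)

print('Shapiro-Wilk p-value: {:.3f}'.format(shapiro_p))
print('Levene p-value: {:.3f}'.format(levene_p))
```

A small p-value in either test would suggest that the corresponding assumption is questionable for these data.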

With these assumptions in place, we can proceed step by step:

• Compute the average mark for each college:
``````import numpy as np
m1=np.mean(a)
m2=np.mean(b)
m3=np.mean(c)

print('Average mark for college A: {}'.format(m1))
print('Average mark for college B: {}'.format(m2))
print('Average mark for college C: {}'.format(m3))

Output:

Average mark for college A: 25.0
Average mark for college B: 26.5
Average mark for college C: 26.0``````
• Compute the overall mean as the average of the group means, which is valid here because all groups have the same size (k is the number of groups, hence in our case k=3):
``````m=(m1+m2+m3)/3
print('Overall mean: {}'.format(m))

Output:
Overall mean: 25.83``````
• Compute the ‘between-group’ sum of squared differences (where n is the number of observations per group/college, hence in our case n=6):
``````SSb=6*((m1-m)**2+(m2-m)**2+(m3-m)**2)
print('Between-groups Sum of Squared Differences: {}'.format(SSb))

Output:
Between-groups Sum of Squared Differences: 7.0``````

With those results, we can already compute one of the components of our F-score, which is the between-group mean square value (MSb). Indeed, knowing that the between-group degrees of freedom are k-1 (that is, one less than the number of groups), we can compute the MSb as:

$$
MS_b = \frac{SS_b}{df_b} = \frac{SS_b}{k-1}
$$

where df_b are the between-group degrees of freedom.

``````MSb=SSb/2
print('Between-groups Mean Square value: {}'.format(MSb))

Output:

Between-groups Mean Square value: 3.5``````
• Calculate the “within-group” sum of squares.
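``````# Convert each list to a NumPy array so the group mean can be subtracted element-wise
err_a=list(np.array(a)-m1)
err_b=list(np.array(b)-m2)
err_c=list(np.array(c)-m3)
# Pool the deviations of all groups and square them
err=err_a+err_b+err_c
ssw=[]
for i in err:
    ssw.append(i**2)

SSw=np.sum(ssw)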

print('Within-group Sum of Squared Differences: {}'.format(SSw))

Output:

Within-group Sum of Squared Differences: 223.5
``````

Again, knowing that there are k(n-1) within-group degrees of freedom (hence in our case 15), we can compute the within-group mean square value:

``````MSw=SSw/15
print('Within-group Mean Square value: {}'.format(MSw))

Output:
Within-group Mean Square value: 14.9``````
• We can finally compute the F-score, given by:
``````F=MSb/MSw
print('F-score: {}'.format(F))

Output:

F-score: 0.2349``````
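
The steps above can also be condensed into a single function. This is a minimal sketch rather than a reference implementation: the name `one_way_anova` is hypothetical, and unlike the manual steps it weights each group by its own size, so it also works when groups have different numbers of observations. It additionally checks that the between-group and within-group sums of squares add up to the total.

```python
import numpy as np

def one_way_anova(*groups):
    """Return (SSb, SSw, F) for any number of groups, of any sizes."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_obs = np.concatenate(groups)
    grand_mean = all_obs.mean()
    k, n_total = len(groups), all_obs.size

    # Between-group: squared distance of each group mean from the grand mean,
    # weighted by the group size
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group: squared distance of each observation from its group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    # The two parts must add up to the total sum of squares
    ss_total = ((all_obs - grand_mean) ** 2).sum()
    assert np.isclose(ss_total, ss_between + ss_within)

    f_score = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return ss_between, ss_within, f_score

a = [25, 25, 27, 30, 23, 20]
b = [30, 30, 21, 24, 26, 28]
c = [18, 30, 29, 29, 24, 26]
ss_b, ss_w, f_score = one_way_anova(a, b, c)
print('SSb={:.1f}, SSw={:.1f}, F={:.4f}'.format(ss_b, ss_w, f_score))
# SSb=7.0, SSw=223.5, F=0.2349
```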

Let’s double-check this value with scipy:

``````import scipy.stats as stats
stats.f_oneway(a,b,c)

Output:
F_onewayResult(statistic=0.2348993288590604, pvalue=0.793504662732833)``````

Nice, the two results coincide. If the assumptions above hold and the Null Hypothesis is true, the ratio MSb/MSw follows an F (Fisher) distribution with (2,15) degrees of freedom:

``````from scipy.stats import f
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
dfn, dfd = 2,15
x = np.linspace(f.ppf(0.01, dfn, dfd),f.ppf(0.99, dfn, dfd), 100)
ax.plot(x, f.pdf(x, dfn, dfd),'r-', lw=5, alpha=0.6, label='f pdf')``````
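
The p-value reported by scipy is the area under this density to the right of the observed F-score. As a small sketch, it can be recovered with the survival function of the F distribution; the value of `F` is copied here from the result above so that the snippet runs on its own.

```python
from scipy.stats import f

F = 0.2348993288590604   # F-score obtained above
dfn, dfd = 2, 15         # k-1 and k(n-1) degrees of freedom

# Survival function: probability of an F value at least this large under the Null
p_value = f.sf(F, dfn, dfd)
print('p-value: {:.4f}'.format(p_value))  # p-value: 0.7935
```

A p-value this large means that an F-score of this size or bigger is very common when all group means are equal.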

Let’s say we set alpha, which is the level of significance, equal to 5%. The corresponding F-critical value for (2,15) degrees of freedom is 3.68. Hence:

``````from scipy.stats import f
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
dfn, dfd = 2,15
x = np.linspace(f.ppf(0.01, dfn, dfd),f.ppf(0.99, dfn, dfd), 100)
ax.plot(x, f.pdf(x, dfn, dfd),'r-', lw=5, alpha=0.6, label='f pdf')
plt.axvline(x=3.68, label='Critical value for alpha=0.05', color='g')
plt.axvline(x=F, label='F-score')
plt.legend()``````

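Since our F-score (about 0.23) lies well below the critical value of 3.68, we do not reject the Null hypothesis of equality among means. At the 5% significance level, the data provide no evidence of a significant difference between our three colleges A, B and C. Note that failing to reject the Null is not the same as proving that the means are equal.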