Analysis of variance (ANOVA) is a statistical test for detecting differences in group means when there is one parametric dependent variable and one or more independent variables. This article summarizes the fundamentals of ANOVA for the benefit of clinician readers of the scientific literature who do not possess expertise in statistics. The emphasis is on conceptual perspectives regarding the use and interpretation of ANOVA, with minimal coverage of the mathematical foundations. Computational examples are provided. Assumptions underlying ANOVA include parametric data measures, normally distributed data, similar group variances, and independence of subjects. However, normality and variance assumptions can often be violated with impunity if sample sizes are sufficiently large and there are equal numbers of subjects in each group. A statistically significant ANOVA is typically followed up with a multiple comparison procedure to identify which group means differ from each other. The article concludes with a discussion of effect size and the important distinction between statistical significance and clinical significance.

Calculations of three different measures of effect size for a two-factor (Treatment and Gender) ANOVA of the data set shown in Figure 2. The effect sizes shown are all based on proportions of sum of squares: eta squared (η²), partial η², and omega squared (ω²). Note the following: (i) The denominator sum of squares term will be larger for η² than for partial η² in a factorial ANOVA, so η² will be smaller than partial η². (ii) Omega squared (ω²) is a population estimate, whereas η² and partial η² are sample estimates, so ω² will be smaller than both η² and partial η². (iii) The sum of all η² equals 1, whereas the sum of all partial η² does not equal 1 (it can be less than or greater than 1). Refer to text for further explanation of these attributes.

Department of Rehabilitation Sciences, School of Allied Health Sciences,

Texas Tech University Health Sciences Center, Lubbock, TX

Address all correspondence and requests for reprints to: Steven F. Sawyer, PT, PhD,

Assumptions of ANOVA

Assumptions for ANOVA pertain to the underlying mathematics of general linear models. Specifically, a data set should meet the following criteria before being subjected to ANOVA:

Parametric data: A parametric ANOVA, the topic of this article, requires parametric data (ratio or interval measures). There are non-parametric, one-factor versions of ANOVA for non-parametric ordinal (ranked) data, specifically the Kruskal-Wallis test for independent groups and the Friedman test for repeated measures analysis.

Normally distributed data within each group: ANOVA can be thought of as a way to infer whether the normal distribution curves of different data sets are best thought of as being from the same population or different populations (Figure 1). It follows that a fundamental assumption of parametric ANOVA is that each group of data (each level) be normally distributed. The Shapiro-Wilk test2 is commonly used to test for normality for group sample sizes (N) less than 50; D'Agostino's modification3 is useful for larger samplings (N>50).

FIGURE 1. Graphical representation of statistical Null and Alternative hypotheses for ANOVA in the case of one dependent variable (change in ankle ROM pre/post manual therapy treatment, in units of degrees), and one independent variable with three levels (three different types of manual therapy treatments). For this fictitious data, the group (sample) means are 13, 14 and 18 degrees of increased ankle ROM for treatment type groups 1, 2 and 3, respectively (raw data are presented in Figure 2). The Null hypothesis is represented in the left graph, in which the population means for all three groups are assumed to be identical to each other (in spite of differences in sample means calculated from the experimental data). Since in the Null hypothesis the subjects in the three groups are considered to compose a single population, by definition the population means of each group are equal to each other, and are equal to the Grand mean (mean for all data scores in the three groups). The corresponding normal distribution curves are identical and precisely overlap along the X-axis. The Alternative hypothesis is shown in the right graph, in which differences in group sample means are inferred to represent true differences in group population means. These normal distribution curves do not overlap along the X-axis because each group of subjects is considered to be a distinct population with respect to ankle ROM, created from the original single population that experienced different efficacies of the three treatments. Graph is patterned after Wilkinson et al11.

[Figure 1 graphs. Left panel: Null hypothesis, identical normal distribution curves. Right panel: Alternative hypothesis, different normal distribution curves. X-axes: increased ankle ROM (degrees); Y-axes: probability density function.]

Analysis of Variance: The Fundamental Concepts

A normal distribution curve can be described by whether it has symmetry about the mean and the appropriate width and height (peakedness). These attributes are defined statistically by "skewness" and "kurtosis", respectively. A normal distribution curve will have skewness = 0 and kurtosis = 3. (Note that an alternative definition of kurtosis subtracts 3 from the final value so that a normal distribution will have kurtosis = 0. This "minus 3" kurtosis value is sometimes referred to as "excess kurtosis" to distinguish it from the value obtained with the standard kurtosis function. The kurtosis value calculated by many statistical programs is the "minus 3" variant but is referred to, somewhat misleadingly, as "kurtosis.") Normality of a data set can be assessed with a z-test in reference to the standard error of skewness (estimated as √[6 / N]) and the standard error of kurtosis (estimated as √[24 / N])4. A conservative alpha of 0.01 (z ≥ 2.56) is appropriate, due to the overly sensitive nature of these tests, especially for large sample sizes (>100)4. As a computational example, for N = 20, the estimated standard error of skewness = √[6 / 20] = 0.55, and any skewness value beyond ±2.56 x 0.55 = ±1.41 would indicate non-normality. Perhaps the best "test" is what always should be done: examine a histogram of the distribution of the data. In practice, any distribution that resembles a bell-shaped curve will be "normal enough" to pass normality tests, especially if the sample size is adequate.
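The skewness and kurtosis z-tests described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the skewness and excess kurtosis are computed from simple population moments (statistical packages often apply small-sample corrections), and the standard errors use the approximations √(6/N) and √(24/N) quoted in the text.

```python
import math

def normality_z_scores(values):
    """Return (z_skewness, z_excess_kurtosis) for a list of numbers.

    Assumption: moment-based (uncorrected) skewness and excess kurtosis,
    divided by the approximate standard errors sqrt(6/N) and sqrt(24/N).
    """
    n = len(values)
    mean = sum(values) / n
    # Second, third and fourth central moments.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # the "minus 3" variant
    se_skewness = math.sqrt(6.0 / n)
    se_kurtosis = math.sqrt(24.0 / n)
    return skewness / se_skewness, excess_kurtosis / se_kurtosis

def skewness_cutoff(n, z_critical=2.56):
    """Largest absolute skewness still compatible with normality at z_critical."""
    return z_critical * math.sqrt(6.0 / n)

if __name__ == "__main__":
    z_skew, z_kurt = normality_z_scores([1, 2, 3, 4, 5])
    print(z_skew, z_kurt)
    print(skewness_cutoff(20))
```

For N = 20 the unrounded cutoff is about 1.40; the text obtains 1.41 because it rounds the standard error to 0.55 before multiplying.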

Homogeneity of variance within each group: Referring again to the notion that ANOVA compares normal distribution curves of data sets, these curves need to be similar to each other in shape and width for the comparison to be valid. In other words, the amount of data dispersion (variance) needs to be similar between groups. Two commonly invoked tests of homogeneity of variance are by Levene5 and Brown & Forsythe6.

Independent Observations: A general assumption of parametric analysis is that the value of each observation for each subject is independent of (i.e., not related to or influenced by) the value of any other observation. For independent groups designs, this issue is addressed with random sampling, random assignment to groups, and experimental control of extraneous variables. This assumption is an inherent concern for repeated measures designs, in which an assumption of sphericity comes into play. When subjects are exposed to all levels of an independent variable (e.g., all treatments), it is conceivable that the effects of a treatment can persist and affect the response to subsequent treatments. For example, if a treatment effect for one level has a long half-time (analogous to a drug effect) and there is inadequate "wash out" time between exposure to different levels (treatments), there will be a carryover effect. A well designed and executed cross-over experimental design can mitigate carryover effects. Mauchly's test of sphericity is commonly employed to test the assumption of independence in repeated measures designs. If the Mauchly test is statistically significant, corrections to the F score calculation are warranted. The two most commonly used correction methods are the Greenhouse-Geisser and Huynh-Feldt, which calculate a descriptive statistic called epsilon, a measure of the extent to which sphericity has been violated. The range of values for epsilon is 1 (no sphericity violation) to a lower boundary of 1 / (m - 1), where m = number of levels. For example, with three groups, the range would be 1 to 0.50. The closer epsilon is to the lower boundary, the greater the degree of violation. There are three options for adjusting the ANOVA to account for the sphericity violation, all of which involve modifying degrees of freedom: use the lower boundary epsilon, which is the most conservative approach (least powerful) and will generate the largest p value, or use either the Greenhouse-Geisser epsilon or the Huynh-Feldt epsilon (most powerful) [statistical power is the ability of an inferential test to detect a difference that actually exists, i.e., a true positive].
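As a rough illustration of how an epsilon correction modifies degrees of freedom, the sketch below computes the lower boundary of epsilon and scales the degrees of freedom of a one-factor repeated measures ANOVA. The function names are hypothetical, and the unadjusted degrees of freedom, (m - 1) for the effect and (m - 1)(n - 1) for the error term, are the standard textbook values rather than something stated in this article.

```python
def lower_bound_epsilon(m):
    """Lower boundary of epsilon for a factor with m levels: 1 / (m - 1)."""
    return 1.0 / (m - 1)

def adjusted_degrees_of_freedom(m, n_subjects, epsilon):
    """Scale both degrees of freedom by epsilon.

    Assumption: one-factor repeated measures design, where the
    unadjusted values are (m - 1) and (m - 1) * (n_subjects - 1).
    """
    df_effect = (m - 1) * epsilon
    df_error = (m - 1) * (n_subjects - 1) * epsilon
    return df_effect, df_error

if __name__ == "__main__":
    eps = lower_bound_epsilon(3)                 # three levels
    print(eps)
    print(adjusted_degrees_of_freedom(3, 10, eps))
```

Smaller degrees of freedom require a larger F score for significance, which is why the lower boundary epsilon is the most conservative choice.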

Most commercially available statistics programs perform normality, homogeneity of variance and sphericity tests. Determination of the parametric nature of the data and soundness of the experimental design is the responsibility of the investigator, reviewers and critical readers of the literature.

Robustness of ANOVA to Violations of

Normality and Variance Assumptions

ANOVA tests can handle moderate violations of normality and equal variance if there is a large enough sample size and a balanced design7. As per the central limit theorem, the distribution of sample means approximates normality even with population distributions that are grossly skewed and non-normal, so long as the sample size of each group is large enough. There is no fixed definition of "large enough", but a rule of thumb8 is N ≥ 30. Thus, the mathematical validity of ANOVA is said to be "robust" in the face of violations of normality assumptions if there is an adequate sample size. ANOVA is more sensitive to violations of the homogeneity of variance assumption, but this is mitigated if sample sizes of factors and levels are equal or nearly so9,10. If normality and homogeneity of variance violations are problematic, there are three options: (i) Mathematically transform (log, arcsin, etc.) the data to best mitigate the violation, at the cost of cognitive fog in understanding the meaning of the ANOVA results (e.g., "A statistically significant main effect was obtained for the arcsin transformation of degrees of ankle range of motion"). (ii) Use one of the non-parametric ANOVAs mentioned above, but at the cost of reduced power and being limited to one-factor analysis. (iii) Identify outliers in the data set using formal statistical criteria (not discussed here). Use caution in deleting outliers from the data set; such decisions need to be justified and explained in research reports. Removal of outliers will reduce deviations from normality and homogeneity of variance.

If You Understand t-Tests, You Already

Know A Lot About ANOVA

As a starting point, the reader should understand that the familiar t-test is an ANOVA in abbreviated form. A t-test is used to infer on statistical grounds whether there are differences between group means for an experimental design with (i) one parametric dependent variable and (ii) one independent variable with two levels, i.e., there is one outcome measure and two groups. In clinical research, levels often correspond to different treatment groups; the term "level" does not imply any ordering of the groups.


The Null statistical hypothesis for a t-test is H0: μ1 = μ2, that is, the population means of the two groups are the same. Note that we are dealing with population means, which are almost always unknown and unknowable in clinical research. If the Null hypothesis involved sample means, there would be nothing to infer, since descriptive analysis provides this information. However, with inferential analysis using t-tests and ANOVA, the aim is to infer, without access to "the truth", if the group population means differ from each other.

The Alternative hypothesis, which comes into play if the Null hypothesis is rejected, asserts that the group population means differ. The Null hypothesis is rejected when the p value yielded by the t-test is less than alpha. Alpha is the predetermined upper limit risk for committing a Type 1 error, which is the statistical false positive of incorrectly rejecting the Null hypothesis and inferring the group means differ when in fact the groups are from a single population. By convention, alpha is typically set to 0.05. The p value generated by the t-test statistic is based on numerical analysis of the experimental data, and represents the probability of obtaining a difference at least as large as the one observed if the Null hypothesis were true. When p is less than alpha, there is a statistically significant result, i.e., the values in the two groups are inferred to differ from each other and to represent separate populations. The logic of statistical inference is analogous to a jury trial: at the outset of the trial (inferential analysis), the group data are presumed to be innocent of having different population means (Null hypothesis) unless the differences in group means in the sampled data are sufficiently compelling to meet the standard of "beyond a reasonable doubt" (p less than alpha), in which case a guilty verdict is rendered (reject Null hypothesis and accept Alternative hypothesis = statistical significance).

The test statistic for a t-test is the t score. In conceptual terms, the calculation of a t score for independent groups (i.e., not repeated measures) is as follows:

t = statistical signal / statistical noise

t = treatment effect / unexplained variance ("error variance")

t = difference between sample means of the two groups / within-group variance
e dierence in group means repre-

sents the statistical signal since it is pre-

sumed to result from treatment eects of

the dierent levels of the independent

variable. e within-group variance is

considered to be statistical noise and an

"error" term because it is not explained

by the inuence of the independent vari-

able on the dependent variable. e par-

ticulars of how the t score is calculated

depends on the experimental design (in-

dependent groups vs repeated measures)

and whether variance between groups is

equivalent; the reader is to referred to

any number of statistics books for details

about the formulae. e t score is con-

verted into a p value based on the magni-

tude of the t score (larger t scores lead to

smaller p values) and the sample size

(which relates to degrees of freedom).

ANOVA Null Hypothesis

and Alternative Hypothesis

ANOVA is applicable when the aim is to infer differences in group values when there is one dependent variable and more than two groups, such as one independent variable with three or more levels, or when there are two or more independent variables. Since an independent variable is called a "factor", ANOVAs are described in terms of the number of factors; if there are two independent variables, it is a two-factor ANOVA. In the simpler case of a one-factor ANOVA, the Null hypothesis asserts that the population means for each level (group) of the independent variable are equal. Let's use as an example a fictitious experiment with one dependent variable (pre/post changes in ankle range of motion in subjects who received one of three types of manual therapy treatment after surgical repair of a talus fracture). This constitutes a one-factor ANOVA with three levels (the three different types of treatment). The Null hypothesis is: H0: μ1 = μ2 = μ3. The Alternative hypothesis is that at least two of the group means differ. Figure 1 provides a graphical presentation of these ANOVA statistical hypotheses: (i) the Null hypothesis (left graph) asserts that the normal distribution curves of data for the three groups are identical in shape and position and therefore precisely overlap, whereas (ii) the Alternative hypothesis (right graph) asserts that these normal distribution curves are best described by the distribution indicated by the sample means, which represent an experimentally-derived estimate of the population means11.

The Mechanics of Calculating a

One-factor ANOVA

ANOVA evaluates dierences in group

means in a round-about fashion, and

involves the "partitioning of variance"

from calculations of "Sum of Squares"

and "Mean Squares." ree metrics are

used in calculating the ANOVA test sta-

tistic, which is called the F score (named

aer R.A. Fisher, the developer of

ANOVA): (i) Grand Mean, which is the

mean of all scores in all groups; (ii) Sum

of Squares, which are of two kinds, the

sum of all squared dierences between

group means and the Grand Mean (be-

tween-groups Sum of Squares) and the

sum of squared dierences between in-

dividual data scores and their respective

group mean (within-groups Sum of

Squares), and (iii) Mean Squares, also of

two kinds (between-groups Mean

Squares, within-groups Mean Squares),

which are the average deviations of indi-

vidual scores from their respective

mean, calculated by dividing Sum of

Squares by their appropriate degrees of


A key point to appreciate about ANOVA is that the data set variance is partitioned into statistical signal and statistical noise components to generate the F score. The F score for independent groups is calculated as:

F = statistical signal / statistical noise

F = treatment effect / unexplained variance ("error variance")

F = Mean Squares_{Between Groups} / Mean Squares_{Within Groups (Error)}

Note that the statistical signal, the MS_{Between Groups} term, is an indirect measure of differences in group means. The MS_{Within Groups (Error)} term is considered to represent statistical noise/error since this variance is not explained by the effect of the independent variable on the dependent variable. Here is the gist of the issue: as group means increasingly diverge from each other, there is increasingly more variance for between-group scores in relation to the Grand Mean, quantified as Sum of Squares_{Between Groups}, leading to a larger MS_{Between Groups} term and a larger F score. Conversely, as there is more variance in within-group scores, quantified as Sum of Squares_{Within Groups (Error)}, the MS_{Within Groups (Error)} term will increase, leading to a smaller F score. Thus, for independent groups, large F scores arise from large differences between group means and/or small variances within groups. Larger F scores equate to lower p values, with the p value also influenced by the sample size and number of groups, each of which constitutes separate types of "degrees of freedom."

ANOVA calculations are now the domain of computer software, but there is illustrative and heuristic value in manually performing the arithmetic calculation of the F score to garner insight into how analysis of data set variance generates a statistical inference about differences in group means. A numerical example is provided in Figure 2, in which the data set graphed in Figure 1 is listed and subjected to ANOVA, yielding a calculated F score and corresponding p value.
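The arithmetic of the Figure 2 example can also be retraced in code. The sketch below is a minimal illustration, not a substitute for statistical software: the function and variable names are hypothetical, and the p value uses a closed-form expression for the F distribution that holds only when the between-groups degrees of freedom equal 2, as they do for this three-group data set.

```python
# Fictitious ankle ROM data from Figure 2 (degrees of increase).
GROUPS = {
    "Type 1": [14, 14, 11, 13],
    "Type 2": [16, 14, 13, 13],
    "Type 3": [20, 18, 17, 17],
}

def one_factor_anova(groups):
    """Partition variance into Sum of Squares and Mean Squares; return the pieces."""
    all_scores = [x for scores in groups.values() for x in scores]
    n_total = len(all_scores)
    grand_mean = sum(all_scores) / n_total
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_within = 0.0
    for scores in groups.values():
        group_mean = sum(scores) / len(scores)
        ss_within += sum((x - group_mean) ** 2 for x in scores)
    ss_between = ss_total - ss_within          # by subtraction, as in Figure 2
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_total": ss_total, "ss_between": ss_between, "ss_within": ss_within,
        "df_between": df_between, "df_within": df_within,
        "f": ms_between / ms_within,
    }

def p_value_for_two_numerator_df(f_score, df_within):
    """Upper-tail F probability; valid ONLY when df_between == 2."""
    return (1.0 + 2.0 * f_score / df_within) ** (-df_within / 2.0)

if __name__ == "__main__":
    result = one_factor_anova(GROUPS)
    print(result)
    print(p_value_for_two_numerator_df(result["f"], result["df_within"]))
```

For designs with a different number of groups, the p value would come from a general F distribution routine in a statistics package rather than from this shortcut.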


Mathematical Equivalence of t-tests and ANOVA: t-tests are a Special Case of ANOVA


Let's briey return to the notion that a

t-test is a simplied version of ANOVA

that is specic to the case of one inde-

pendent variable with two groups. If we

analyze the data in Figure 2 for the Type

1 treatment vs. Type 3 treatment group

data (disregarding the Type 2 treatment

group data to reduce the analysis to two

groups), the t score for independent

groups is 5.0 with a p value of 0.0025

(calculations not shown). For the same

data assessed with ANOVA, the F score

is 25.0 with a p value of 0.0025. e t-test

and ANOVA generate identical p values.

e mathematical relation between the

two test statistics is: t 2 = F.
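The calculations that the text omits can be reproduced with a short sketch. It assumes the equal-variance (pooled) form of the independent groups t-test, which the article does not specify, and the function names are hypothetical. Only the test statistics are computed here, not the p values.

```python
import math

# Type 1 and Type 3 treatment groups from Figure 2.
TYPE_1 = [14, 14, 11, 13]
TYPE_3 = [20, 18, 17, 17]

def _mean(xs):
    return sum(xs) / len(xs)

def _ss(xs):
    """Sum of squared deviations from the group's own mean."""
    m = _mean(xs)
    return sum((x - m) ** 2 for x in xs)

def pooled_t_score(a, b):
    """Independent groups t score, assuming equal variances (pooled)."""
    df = len(a) + len(b) - 2
    pooled_variance = (_ss(a) + _ss(b)) / df
    standard_error = math.sqrt(pooled_variance * (1 / len(a) + 1 / len(b)))
    return (_mean(b) - _mean(a)) / standard_error

def two_group_f_score(a, b):
    """One-factor ANOVA F score for the same two groups."""
    grand_mean = (sum(a) + sum(b)) / (len(a) + len(b))
    ss_between = (len(a) * (_mean(a) - grand_mean) ** 2
                  + len(b) * (_mean(b) - grand_mean) ** 2)
    ms_between = ss_between / 1                      # df = 2 groups - 1
    ms_within = (_ss(a) + _ss(b)) / (len(a) + len(b) - 2)
    return ms_between / ms_within

if __name__ == "__main__":
    t = pooled_t_score(TYPE_1, TYPE_3)
    f = two_group_f_score(TYPE_1, TYPE_3)
    print(t, f, t ** 2)
```

Both statistics are built from the same pooled within-group variance, which is why squaring the t score recovers the F score.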

Repeated Measures ANOVA: Dierent

Error Term, Greater Statistical Power

e experimental designs emphasized

thus far entail independent groups, in

which each subject is "exposed" to only

one level of an independent variable. In

FIG URE 2. e mechanics of calculating a F score for a one-factor ANOVA with independent groups by partitioning the data

set variance as Sum of Squares and Mean Squares are shown below. is ctitious data set lists increased ankle range of motion

pre/post for three dierent types of manual therapy treatments. For the sake of clarity and ease of calculation, a data set with an

inappropriately small sample size is used.

Subject Manual Therapy Manual Therapy Manual Therapy

Gender treatment Type 1 treatment Type 2 treatment Type 3

Male 14 16 20

Male 14 14 18

Female 11 13 17

Female 13 13 17

Group Means 13 14 18

Grand Mean 15

In the following, SS = Sum of Squares; MS = Mean Squares; df = degrees of freedom.

SSTotal = SS Between Groups + SSWithin Groups (Error), and is calculated by summing the squares of dierences between each data value vs.

the Grand Mean. For this data set with a Grand Mean of 15:

SSTotal = (14-15 )2 + (14-15 )2 + (11-15 )2 + (13-15 )2 + (16-15 )2 + (14-15 )2 + (13-15 )2 + (13-15 )2 + (20-15 )2 + (18-15 )2 + (17-15 )2

+ (17-15 )2 = 74

SSWithin Groups (Error) = SSMT treatment Type 1 (Error) + SSMT treatment Type 2 (Error) + SSMT treatment Type 3 (Error), in which the sum of squares within each

group is calculated in reference to the group's mean:

SSMT treatment Type 1 (Error) = (14-13)2 + (14-13 )2 + (11-13 )2 + (13-13 )2 = 6

SSMT treatment Type 2 (Error) = (16-14)2 + (14-14 )2 + (13-14 )2 + (13-14 )2 = 6

SSMT treatment Type 3 (Error) = (20-18)2 + (18-18 )2 + (17-18 )2 + (17-18 )2 = 6

SSWithin Groups (Error) = 6 + 6 + 6 = 18. By subtraction, SSBetween Groups = 74 - 18 = 56

df refers to the number of independent measurements used in calculating a Sum of Squares.

dfBetween Groups = (# of groups—1) = (3—1) = 2

dfWithin Groups (Error) = (N —# of groups) = (12—1) = 9

ANOVA test statistic, the F score, is calculated from Mean Squares (SS/df ):

F = Mean SquaresBetween Groups / Mean SquaresWithin Groups (Error)

Mean SquaresBetween Groups = SSBetween Groups / df Between Groups = 56 / 2 = 28

Mean SquaresWithin Groups (Error) = SSWithin Groups (Error) / dfWithin Groups (Error) = 18 / 9 = 2

So, F = 28 / 2 = 14

With df Between Groups = 2 and df Within Groups (Error) = 9, this F score translates into p = 0.0017 , a statistically signicant result for

alpha = 0.05.


An Al ys is o F VAr iAn Ce : Th e Fu nd Amen TAl Co n Cep Ts

the data set of Figure 2, this would in-

volve each subject receiving only one of

the three dierent treatments. If a sub-

ject is exposed to all levels of an inde-

pendent variable, the mechanics of the

ANOVA are altered to take into account

that each subject serves as their own ex-

perimental control. Whereas the term

for statistical signal, MSBetween Groups, is un-

changed, there is a new statistical noise

term called MSWithin Subjects (Error) that per-

tains to variance within each subject

across all levels of the independent vari-

able instead of between all subjects

within one level. Since there is typically

less variation within subjects than be-

tween subjects, the statistical error term

is typically smaller in repeated measures

designs. A smaller MSWithin Subjects (Error)

value leads to a larger F value and a

smaller p value. As a result, repeated

measures ANOVA typically have greater

statistical power than independent

groups ANOVA.

Factorial ANOVA: Main Eects and


An advantage of ANOVA is its ability to analyze an experimental design with multiple independent variables. When an ANOVA has two or more independent variables it is referred to as a factorial ANOVA, in contrast to the one-factor ANOVAs discussed thus far. This is efficient experimentally, because the effects of multiple independent variables on a dependent variable are tested on one cohort of subjects. Furthermore, factorial ANOVA permits, and requires, an evaluation of whether there is an interplay between different levels of the independent variables, which is called an interaction.

Denitions of terminology that is

unique to factorial ANOVA are war-

ranted: (i) Main eect is the eect of an

independent variable (a factor) on a de-

pendent variable, determined separate

from of the eects of other independent

variables. A main eect is a one-factor

ANOVA that is performed on a factor

that disregards the eects of other fac-

tors. In a two factor ANOVA, there are

two main eects, one for each indepen-

dent variable; a three-factor ANOVA

has three main eects, and so on. (ii) In-

teraction describes an interplay between

independent variables such that dier-

ent levels of the independent variables

have non-additive eects on the depen-

dent variable. In formal terms, there is

an interaction between two factors

when the dependent variable response

at levels of one factor dier from those

produced at levels of the other factor(s).

Interactions can be easily identied in

graphs of group means. For example,

again referring to the data set from Fig-

ure 2, let us now consider the eect of

subject gender as a second independent

variable. is would be a two factor

ANOVA: one factor is the sex of sub-

jects, called G, with two levels;

the second factor is the type of manual

therapy treatment, called T,

with three levels. A shorthand descrip-

tion of this design is 2x3 ANOVA (two

factors with two and three levels, re-

spectively). For this two-factor ANOVA,

there are three Null hypotheses: (i)

Main Eect for the G factor: Are

there dierences in the response (ankle

range of motion) for males vs. females

to manual therapy treatment (combin-

ing data for the three levels of the

T factor with respect to the

two G factor levels)? (ii) Main Ef-

fect for the T factor: Are

there dierences in the response for

subjects in the three levels of the T-

 factor (combining data for males

and females in the G factor with

respect to the three T factor

levels)? (iii) Interaction: Are there dif-

ferences due to neither the G or

T factors alone but to the

combination of these factors? With re-

spect to analysis of interactions, Figure

3 shows a table of group means for all

levels the two independent variables,

based on data from Figure 2. Note that

the two independent variables are

graphed in relation to the dependent

variable. e two lines in the le graph

are parallel, indicating the absence of an

interaction between the levels of the two

factors. An interaction would exist if the

graphs were not parallel, such as in the

right graph in which group means for

males and females on the Type 2 treat-

ment were switched for illustrative pur-

poses. If the lines deviate from parallel

to a sucient degree, the interaction

will be statistically signicant. In this

case with two factors, there is only one

interaction to be evaluated. With three

or more independent variables, there

are multiple interactions that need to be

considered. A statistically signicant in-

teraction complicates the interpretation

of the Main Eects, since the factors are

not independent of each other in their

eects on the dependent variable. Inter-

actions should be examined before

Main Eects. If interactions are not sta-

tistically sig nicant, then Main Eects

can be easily evaluated as a series of

one-factor ANOVAs.
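The "parallel lines" idea can be checked numerically from the group means. The sketch below is purely descriptive and uses hypothetical function names: it compares the male minus female difference at each treatment level, using the cell means derived from the Figure 2 data, and does not test whether a departure from parallel is statistically significant.

```python
# Cell means derived from the Figure 2 data (two subjects per cell).
CELL_MEANS = {
    "Male":   {"Type 1": 14, "Type 2": 15, "Type 3": 19},
    "Female": {"Type 1": 12, "Type 2": 13, "Type 3": 17},
}

def gender_differences(cell_means):
    """Male minus female mean at each treatment level."""
    return {
        level: cell_means["Male"][level] - cell_means["Female"][level]
        for level in cell_means["Male"]
    }

def lines_are_parallel(cell_means, tolerance=1e-9):
    """True when the gender difference is the same at every treatment level."""
    diffs = list(gender_differences(cell_means).values())
    return max(diffs) - min(diffs) <= tolerance

def swap_type_2(cell_means):
    """Copy of the table with male and female Type 2 means reversed."""
    swapped = {gender: dict(levels) for gender, levels in cell_means.items()}
    swapped["Male"]["Type 2"] = cell_means["Female"]["Type 2"]
    swapped["Female"]["Type 2"] = cell_means["Male"]["Type 2"]
    return swapped

if __name__ == "__main__":
    print(gender_differences(CELL_MEANS), lines_are_parallel(CELL_MEANS))
    altered = swap_type_2(CELL_MEANS)
    print(gender_differences(altered), lines_are_parallel(altered))
```

Equal differences at every level correspond to parallel lines; unequal differences are what the interaction term of the factorial ANOVA evaluates formally.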

So There is a Statistically Signicant

ANOVA—Now What? Multiple

Comparison Procedures

If an ANOVA does not yield statistical significance on any main effects or interactions, the Null hypothesis (hypotheses) is (are) not rejected, meaning that the different levels of the independent variables were not shown to have differential effects on the dependent variable. The inferential statistical work is done (but see next section), unless confounding covariates are suspected, possibly warranting analysis of covariance (ANCOVA), which is beyond the scope of this article.

When statistical signicance is ob-

tained in an ANOVA, additional statisti-

cal tests are necessary to determine

which of the group means dier from

each other. ese follow-up tests are re-

ferred to as multiple comparison proce-

dures (MCPs) or post hoc tests. MCPs

involve multiple pair-wise comparisons

(or contrasts) in a fashion designed to

maintain alpha for the family of com-

parisons to a specied level, typically

0.05. is is referred to as the familywise

alpha. ere are two general options for

MCP tests: either perform multiple t-

tests that require "manual" adjustment

of the alpha for each pairwise test to

maintain a familywise alpha of 0.05, or

use a test such as the Tukey HSD (see

below) that has built-in protection from

alpha ination. Multiple t-tests have

their place, especially when only a sub-

set of all possible pairwise comparisons

are to be performed, but the special pur-

pose MCPs are preferable when all pair-

wise comparisons are assessed.



Using the simple case of a statistically significant one-factor ANOVA, t-tests can be used for post hoc evaluation with the aim of identifying which levels differ from each other. However, with multiple t-tests there is a need to adjust alpha for each t-test in such a way as to maintain the familywise alpha at 0.05. If all possible pairwise comparisons are performed, the number of t-tests grows quadratically as the number of levels increases, as defined by C = m (m - 1) / 2, where C = number of pairwise comparisons, and m = number of levels in a factor. For example, there are three pairwise comparisons for three levels; six comparisons for four levels; ten comparisons for five levels, and so forth. There is a need to hold familywise alpha at 0.05 in these multiple comparisons to limit the risk of Type 1 errors to no more than 5%. This is commonly accomplished with the Bonferroni (or Dunn) adjustment, in which alpha for each post hoc t-test is adjusted by dividing the familywise alpha (0.05) by the number of pairwise comparisons:

α_{Multiple t-tests} = α_{Familywise} / C
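The comparison count and the Bonferroni adjustment above amount to two one-line formulas, sketched here with hypothetical function names. The sketch assumes that all possible pairwise comparisons among the m levels are performed.

```python
def pairwise_comparisons(m):
    """Number of pairwise comparisons among m levels: C = m (m - 1) / 2."""
    return m * (m - 1) // 2

def bonferroni_alpha(familywise_alpha, m):
    """Per-test alpha when all pairwise comparisons among m levels are run."""
    return familywise_alpha / pairwise_comparisons(m)

if __name__ == "__main__":
    for levels in (3, 4, 5):
        print(levels, pairwise_comparisons(levels), bonferroni_alpha(0.05, levels))
```

When only a planned subset of comparisons is run, the divisor would be the number of tests actually performed rather than C.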

FIGURE 3. Factorial ANOVA interactions, which are assessed with a table and a graph of group means. Group means are based on data presented in Figure 2, and represent a 3x2 two-factor (T x G) ANOVA with independent groups. In reference to the j columns and k rows indicated in the table below, the Null hypothesis for this interaction is:

j1,k1 - j1,k2 = j2,k1 - j2,k2 = j3,k1 - j3,k2

The graph below left shows the group means of the two independent variables in relation to the dependent variable. The parallel lines indicate that males and females displayed similar changes in ankle ROM for the three types of treatment, so there was no interaction between the different levels of the independent variables. Consider the situation in which the group means for males and females on treatment type 2 are reversed. These altered group means are shown in the graph below right. The graphed lines are not parallel, indicating the presence of an interaction. In other words, the relative efficacies of the three treatments are different for males and females; whether this meets the statistical level of an interaction is determined by ANOVA (p less than alpha).

FACTOR B: G            Treatment       Treatment       Treatment       Factor B Main Effect
                       Type 1          Type 2          Type 3          (row means)
                       (Level j = 1)   (Level j = 2)   (Level j = 3)
(Level k = 1)          14              15              19              16
(Level k = 2)          12              13              17              14
Factor A Main Effect   13              14              18
(column means)

[Graphs: group means for the original data (left) and the altered data (right). Horizontal axis: Manual Therapy Treatment (Type 1, Type 2, Type 3); vertical axis: Increase in ankle ROM.]
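The Null hypothesis stated in the Figure 3 caption can be checked informally against the tabulated cell means. The following Python lines are a minimal sketch and not part of the original analysis; the function names `row_differences` and `lines_are_parallel` are illustrative assumptions. The sketch only inspects the pattern of group means and does not replace the ANOVA test of the interaction.

```python
# Cell means from the Figure 3 table: rows are levels k = 1, 2 of factor G,
# columns are treatment types j = 1, 2, 3.
cell_means = [
    [14, 15, 19],  # level k = 1
    [12, 13, 17],  # level k = 2
]

# Altered scenario from the caption: the two means for treatment type 2 are swapped.
altered_means = [
    [14, 13, 19],
    [12, 15, 17],
]


def row_differences(means):
    """Return the (k = 1) minus (k = 2) difference for each treatment column."""
    return [top - bottom for top, bottom in zip(means[0], means[1])]


def lines_are_parallel(means):
    """True when every column shows the same difference (no interaction pattern)."""
    diffs = row_differences(means)
    return all(d == diffs[0] for d in diffs)


print(row_differences(cell_means), lines_are_parallel(cell_means))
print(row_differences(altered_means), lines_are_parallel(altered_means))
```

Equal differences across all three columns correspond to parallel lines in the graph; unequal differences correspond to non-parallel lines.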



If there are two pairwise comparisons, alpha for each t-test is set to 0.05/2 = 0.025; for three comparisons, alpha is 0.05/3 = 0.0167, and so on. Any pairwise t-test with a p value less than the adjusted alpha would be considered statistically significant. The trade-off for preventing familywise alpha inflation is that as the number of comparisons increases, it becomes incrementally more difficult to attain statistical significance due to the lower alpha. Furthermore, the inflation of familywise alpha with multiple t-tests is not additive. As a result, Bonferroni adjustments overcompensate the alpha adjustment, making this the most conservative (least powerful) of all MCPs. For example, running two t-tests, each with alpha set to 0.05, does not double familywise alpha to 0.10; it increases it to only 0.0975. The effect of multiple t-tests on familywise alpha and the Type 1 error rate is defined by the following formula:

$$\alpha_{\text{Familywise}} = 1 - (1 - \alpha_{\text{Multiple t-tests}})^{C}$$
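The two formulas discussed so far, the count of pairwise comparisons and the inflation of familywise alpha, can be sketched in a few lines of Python. This is an illustrative sketch only; the function names are assumptions introduced here, and the inflation formula treats the tests as independent, as the formula in the text does.

```python
def pairwise_comparisons(m):
    """Number of pairwise comparisons among m levels: C = m(m - 1) / 2."""
    return m * (m - 1) // 2


def familywise_alpha(alpha_per_test, c):
    """Familywise Type 1 error rate for c tests: 1 - (1 - alpha)^c."""
    return 1 - (1 - alpha_per_test) ** c


for m in (3, 4, 5):
    print(m, "levels ->", pairwise_comparisons(m), "comparisons")

# Unadjusted t-tests, each run at alpha = 0.05
print(round(familywise_alpha(0.05, 2), 4))   # two tests
print(round(familywise_alpha(0.05, 20), 2))  # twenty tests
```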

The overcorrection by the Bonferroni technique becomes more prominent with many pairwise comparisons: executing 20 t-tests, each with an alpha of 0.05, does not yield a familywise alpha of 20 x 0.05 = 1.00 (i.e., 100% chance of Type 1 error); the value is actually 0.64. There are modifications of the Bonferroni adjustment developed by Šidák¹² that more accurately reflect the inflation of familywise alpha. They result in larger adjusted alpha levels and therefore increased statistical power, but the effects are slight and rarely convert a marginally non-significant pairwise comparison into a statistically significant one. For example, with three pairwise comparisons, the Bonferroni adjusted alpha of 0.0167 is increased by only 0.0003 to 0.0170 with the Šidák adjustment.
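The size of the difference between the two adjustments can be verified numerically. The sketch below assumes the standard form of the Šidák adjustment, obtained by solving the familywise alpha formula for the per-test alpha; the function names are illustrative.

```python
def bonferroni_alpha(familywise, c):
    """Per-test alpha under the Bonferroni adjustment: familywise / C."""
    return familywise / c


def sidak_alpha(familywise, c):
    """Per-test alpha under the Sidak adjustment: 1 - (1 - familywise)^(1/C)."""
    return 1 - (1 - familywise) ** (1 / c)


for c in (3, 6, 10):
    b = bonferroni_alpha(0.05, c)
    s = sidak_alpha(0.05, c)
    print(f"C = {c:2d}  Bonferroni = {b:.4f}  Sidak = {s:.4f}  gain = {s - b:.4f}")
```

For any number of comparisons greater than one, the Šidák value is slightly larger than the Bonferroni value, which is the source of its small gain in power.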

The sequential alpha adjustment methods for multiple post hoc t-tests by Holm¹³ and Hochberg¹⁴ provide increased power while still maintaining control of the familywise alpha. These techniques permit the assignment of statistical significance in certain situations for which p values are less than 0.05 but do not meet the Bonferroni criterion for significance. The sequential approaches by Holm¹³ and Hochberg¹⁴ are called step-down and step-up procedures, respectively. In Hochberg's step-up procedure with C pairwise comparisons, the t-test p values are evaluated sequentially in descending order, with p_1 the lowest value and p_C the highest. If p_C is less than 0.05, all the p values are statistically significant. If p_C is greater than 0.05, that evaluation is non-significant, and the next largest p value, p_(C-1), is evaluated with a Bonferroni adjusted alpha of 0.05/2 = 0.025. If p_(C-1) is significant, then all remaining p values are significant. Each sequential evaluation leads to an alpha adjustment based on the number of previous evaluations, not on the entire set of possible evaluations, thereby yielding increased statistical power compared to the Bonferroni method. For example, if three p values are 0.07, 0.02 and 0.015, Hochberg's method evaluates p_3 = 0.07 vs. alpha = 0.05/1 = 0.05 (non-significant); then p_2 = 0.02 vs. alpha = 0.05/2 = 0.025 (significant); and then p_1 = 0.015 vs. alpha = 0.05/3 = 0.0167 (significant). Holm's method performs the inverse sequence and alpha adjustments, such that the lowest p value is evaluated first with a fully adjusted alpha. In this case: p_1 = 0.015 vs. alpha = 0.05/3 = 0.0167 (significant); then p_2 = 0.020 vs. alpha = 0.05/2 = 0.025 (significant); and then p_3 = 0.070 vs. alpha = 0.05/1 = 0.05 (non-significant). Once Holm's method encounters non-significance, sequential evaluations end, whereas Hochberg's method continues testing after a non-significant result. For these three p values, the Bonferroni adjustment would find p = 0.015 significant but p = 0.02 to be non-significant. As can be seen, the methods of Hochberg and Holm are less conservative and more powerful than Bonferroni's adjustment. Further, Hochberg's method is uniformly more powerful than Holm's method¹⁵. For example, if there are three pairwise comparisons with p = 0.045, 0.04 and 0.03, all would be significant with Hochberg's method but none would be with Holm's method (or Bonferroni's).
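The step-down and step-up logic can be made explicit in code. The following Python sketch is an illustration of the procedures as described above, not a reference implementation; the function names are assumptions, and the strict "less than" comparison follows the wording of the text (some presentations use "less than or equal to").

```python
def bonferroni(p_values, alpha=0.05):
    """Compare every p value against alpha / C."""
    c = len(p_values)
    return [p < alpha / c for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down: smallest p first against alpha / C; stop at the first failure."""
    c = len(p_values)
    order = sorted(range(c), key=lambda i: p_values[i])
    significant = [False] * c
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (c - rank):
            significant[i] = True
        else:
            break  # all remaining (larger) p values are non-significant
    return significant


def hochberg(p_values, alpha=0.05):
    """Step-up: largest p first against alpha / 1; the first success also
    declares every smaller p value significant."""
    c = len(p_values)
    order = sorted(range(c), key=lambda i: p_values[i], reverse=True)
    significant = [False] * c
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (rank + 1):
            for j in order[rank:]:
                significant[j] = True
            break
    return significant


for p_set in ([0.07, 0.02, 0.015], [0.045, 0.04, 0.03]):
    print(p_set)
    print("  Bonferroni:", bonferroni(p_set))
    print("  Holm:      ", holm(p_set))
    print("  Hochberg:  ", hochberg(p_set))
```

Each function returns one True/False flag per p value, in the order the p values were supplied, so the two worked examples from the text can be compared side by side.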

There are many types of MCPs distinct from the t-test approaches described above¹⁶. These tests have "built-in" familywise alpha protection that does not require "manual" adjustment of alpha. Most of these MCPs calculate a so-called q value for each comparison that takes into account group mean differences, group variances, and group sample sizes in a fashion similar but not identical to the calculation of t. This q value is compared to a critical value generated from a q distribution (a distribution of differences in sample means). Protection from familywise alpha inflation is in the form of a multiplier applied to the critical value. The multiplier increases as the number of comparisons increases, thereby requiring greater differences between group means to attain statistical significance as the number of comparisons is increased. Some MCPs are better than others at balancing statistical power and Type 1 errors. By general consensus amongst statisticians, the Fisher Least Significant Difference (LSD) test and Duncan's Multiple Range Test are considered to be overly powerful, with too high a likelihood of Type 1 errors (false positives). The Scheffé test is considered to be overly conservative, with too high a likelihood of Type 2 errors (false negatives), but is applicable when group sample sizes are markedly unequal¹⁷. The Tukey Honestly Significant Difference (HSD) test is favored by many statisticians for its balance of statistical power and protection from Type 1 errors. It is worth noting that the power advantage of the Tukey HSD test holds only when all possible pairwise comparisons are performed. The Student-Newman-Keuls (SNK) test statistic is computed identically to the Tukey HSD; however, the critical value is determined differently using a step-wise approach, somewhat like the Holm method described above for t-tests. This makes the SNK test slightly more powerful than the Tukey HSD test. However, an advantage of the Tukey HSD test is that a variant called the Tukey-Kramer HSD test can be used with unbalanced sample size designs, unlike the SNK test. The Dunnett test is useful when planned pairwise tests are restricted to one group (e.g., a control group) being compared to all other groups (e.g., treatment groups).

In summary, (i) the Tukey HSD and Student-Newman-Keuls tests are recommended when performing all pairwise tests; (ii) the Hochberg or Holm sequential alpha adjustments enhance the power of multiple post hoc t-tests while maintaining control of familywise alpha; and (iii) the Dunnett test is preferred when comparing one group to all other groups.
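To make the q value more concrete, the sketch below computes it for the three treatment (column) means in Figure 3. Several things here are assumptions rather than statements from the text: the equal-sample-size form q = (difference in means) / sqrt(MS_Error / n), a group size of n = 4 per treatment level, and MS_Error = 1 (values inferred from the sums of squares and degrees of freedom reported for this data set). The critical value is supplied by the caller and must be looked up in a studentized range table; 4.34 is used below as the commonly tabulated value for alpha = 0.05, three means, and 6 error degrees of freedom, and should be verified against a published table before any real use.

```python
from itertools import combinations
from math import sqrt


def q_statistic(mean_a, mean_b, ms_error, n_per_group):
    """Studentized range statistic for two group means (equal group sizes)."""
    return abs(mean_a - mean_b) / sqrt(ms_error / n_per_group)


def tukey_pairs(means, ms_error, n_per_group, q_critical):
    """Return {(group_a, group_b): (q, exceeds_critical)} for all pairs."""
    results = {}
    for a, b in combinations(means, 2):
        q = q_statistic(means[a], means[b], ms_error, n_per_group)
        results[(a, b)] = (q, q > q_critical)
    return results


# Column means from Figure 3; n and MS_Error are assumptions (see lead-in).
treatment_means = {"Type 1": 13, "Type 2": 14, "Type 3": 18}
results = tukey_pairs(treatment_means, ms_error=1.0, n_per_group=4, q_critical=4.34)

for pair, (q, flagged) in results.items():
    print(pair, round(q, 2), "significant" if flagged else "non-significant")
```

Unlike the Bonferroni approach, alpha itself is left untouched here; the protection against familywise alpha inflation is carried entirely by the size of the critical value.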


Break from Tradition: Skip One-Factor ANOVA and Proceed Directly to a MCP

Thus far, the conventional approach to ANOVA and MCPs has been presented, namely, run an ANOVA and if it is not significant, proceed no further; if the ANOVA is significant, then run MCPs to determine which group means differ. However, it has long been held by some statisticians that in certain circumstances ANOVA can be skipped and that an appropriate MCP is the only necessary inferential test. To quote an influential paper by Wilkinson et al.¹⁸, the ANOVA-followed-by-MCP approach "is usually wrong for several reasons. First, pairwise methods such as Tukey's honestly significant difference procedure were designed to control a familywise error rate based on the sample size and number of comparisons. Preceding them with an omnibus F test in a stagewise testing procedure defeats this design, making it unnecessarily conservative." Related to this perspective is the fact that inferential discrepancies are possible between ANOVA and MCPs, in which one is statistically significant and the other is not. This can occur when p values are near the boundary of alpha. Each MCP has slightly different criteria for statistical significance (based on either the t or q distribution), and all differ slightly from the criteria of F scores (based on the F distribution). An argument has also been put forth with respect to performing pre-planned MCPs without the need for a statistically significant ANOVA in clinical trials¹⁹. Nonetheless, the convention remains to perform ANOVA and then MCPs, but MCPs alone are a statistically valid option. ANOVA is especially warranted when there are multiple factors, due to the ability of ANOVA to detect interactions.

Wilkinson et al.¹⁸ also remind researchers that it is rarely necessary to perform all pairwise comparisons. Selected pre-planned comparisons that are driven by the research hypothesis, and not a subcortical reflex to perform every conceivable pairwise comparison, will reduce the number of extraneous pairwise comparisons and false positives, and have the added benefit of increasing statistical power.

ANOVA Effect Size

Effect size is a unitless measure of the magnitude of treatment effects²⁰. For ANOVA, there are two categories of effect size indices: (i) those based on proportions of sum of squares (η², partial η², ω²), and (ii) those based on a standardized difference between group means (such as Cohen's d)²¹,²². The latter type of effect size index is useful for power analysis, and will be discussed briefly in the next section. To an ever increasing degree, peer review journals are requiring the presentation of effect sizes with descriptive summaries of data.

There are three commonly used effect size indices that are based on proportions of the familiar sum of squares values that form the foundation of ANOVA computations. The three indices are called eta squared (η²), partial eta squared (partial η²), and omega squared (ω²). These indices range in value from 0 (no effect) to 1 (maximal effect) because they are proportions of variance. These indices typically yield different values for effect size.

Eta squared (η²) is calculated as:

$$\eta^2 = SS_{\text{Between Groups}} / SS_{\text{Total}}$$

The $SS_{\text{Between Groups}}$ term pertains to the independent variable of interest, whereas $SS_{\text{Total}}$ is based on the entire data set. Specifically, for a factorial ANOVA, $SS_{\text{Total}}$ = [$SS_{\text{Between Groups}}$ for all factors + $SS_{\text{Error}}$ + all $SS_{\text{Interactions}}$]. As such, the magnitude of η² for a given factor will be influenced by the number of other independent variables. For example, η² will tend to be larger in a one-factor design than in a two-factor design because in the latter the $SS_{\text{Total}}$ term will be inflated to include sum of squares arising from the second factor.


Partial eta squared (partial η²) is calculated with respect to the sum of squares associated with the factor of interest, not the total sum of squares:

$$\text{partial } \eta^2 = SS_{\text{Between Groups}} / (SS_{\text{Between Groups}} + SS_{\text{Error}})$$

As with the η² calculation, the $SS_{\text{Between Groups}}$ numerator term for partial η² pertains to the independent variable of interest. However, the denominator differs from that of η². The denominator for partial η² is not based on the entire data set ($SS_{\text{Total}}$) but instead on only $SS_{\text{Between Groups}}$ and $SS_{\text{Error}}$ for the factor being evaluated. For a one-factor ANOVA, the sum of squares terms are identical for η² and partial η², so the values are identical; however, with factorial ANOVA the denominator for partial η² will always be smaller. For this reason, partial η² is always larger than η² with factorial ANOVA (unless a factor or interaction has absolutely no effect, as in the case of the interaction in Figure 4, for which both η² and partial η² equal 0).

Omega squared (ω²) is based on an estimation of the proportion of variance in the underlying population, in contrast to the η² and partial η² indices that are based on proportions of variance in the sample. For this reason, ω² will always be a smaller value than η² and partial η². Application of ω² is limited to between-subjects designs (i.e., not repeated measures) with equal sample sizes in all groups. Omega squared is calculated as follows:

$$\omega^2 = \frac{SS_{\text{Between Groups}} - (df_{\text{Between Groups}})(MS_{\text{Error}})}{SS_{\text{Total}} + MS_{\text{Error}}}$$

In contrast to η², which provides an upwardly biased estimate of effect size when the sample size is small, ω² calculates an unbiased estimate²³.
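The three formulas translate directly into code. The Python sketch below is illustrative only (the function names are assumptions); it applies the formulas to the sums of squares and degrees of freedom reported in Figure 4 for the two main effects.

```python
def eta_squared(ss_effect, ss_total):
    """eta^2 = SS_Between Groups / SS_Total."""
    return ss_effect / ss_total


def partial_eta_squared(ss_effect, ss_error):
    """partial eta^2 = SS_Between Groups / (SS_Between Groups + SS_Error)."""
    return ss_effect / (ss_effect + ss_error)


def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    """omega^2 = [SS_Between Groups - df * MS_Error] / (SS_Total + MS_Error)."""
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)


# Values from Figure 4: (sum of squares, degrees of freedom) per main effect
effects = {"T": (56, 2), "G": (12, 1)}
ss_error, df_error, ss_total = 6, 6, 74
ms_error = ss_error / df_error  # 1.0

for name, (ss, df) in effects.items():
    print(
        name,
        round(eta_squared(ss, ss_total), 2),
        round(partial_eta_squared(ss, ss_error), 2),
        round(omega_squared(ss, df, ms_error, ss_total), 2),
    )
```

For both main effects the ordering ω² < η² < partial η² described in the text is reproduced.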

The reader is cautioned that η² and partial η² are often misreported in the literature (e.g., η² incorrectly reported as partial η²)²⁴,²⁵. It is advisable to calculate these values by hand using the formulae shown above as a confirmation of the output of statistical software programs, to ensure accurate reporting. Refer to Figure 4 for sample calculations of these three effect size indices for a two-factor ANOVA.


The η² and partial η² indices have distinctly different attributes. Whether a given attribute is considered to be an advantage or disadvantage is a matter of perspective and context. Some authors²⁴ argue the merits of eta squared, whereas others⁴ prefer partial eta squared. Notable issues pertaining to these indices include:


(i) Proportion of variance: When there is a statistically significant main effect or interaction, both η² and partial η² (and ω²) can be interpreted in terms of the percentage of variance accounted for by the corresponding independent variable, even though they will often yield different values for factorial ANOVAs. So if η² = 0.20 and partial η² = 0.25 for a given factor, these two effect size indices indicate that the factor accounts for 20% vs. 25%, respectively, of the total variability in the dependent variable scores.

(ii) Relative values: Since η² is either equal to (one-factor ANOVA) or less than (factorial ANOVA) partial η², the η² index is the more conservative measure of effect size. This can be viewed as a positive or negative attribute.

(iii) Additivity: η² is additive, but partial η² is not. Since η² for each factor is calculated in terms of the total sum of squares, all the η² for an ANOVA are additive and sum to 1 (i.e., they sum to equal the amount of variance in the dependent variable that arises from the effects of all the independent variables). In contrast, a factor's partial η² is calculated in terms of that factor's sum of squares (not the total sum of squares), so on mathematical grounds the individual partial η² from an ANOVA are not additive and do not necessarily sum to 1.

(iv) Effects of multiple factors: As the number of factors increases, the proportion of variance accounted for by each factor will necessarily decrease. Accordingly, η² decreases in an associated way. In contrast, partial η² for each factor is calculated within the sum of squares variance metrics of that particular factor, and is not influenced by the number of other factors.
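The additivity property in item (iii) can be confirmed numerically with the sums of squares from Figure 4. This is a minimal illustrative sketch; the dictionary layout and variable names are assumptions, and the η² sum includes the error term, as in the figure.

```python
from math import isclose

# Sums of squares from Figure 4
ss = {"T": 56, "G": 12, "T x G": 0}
ss_error = 6
ss_total = sum(ss.values()) + ss_error  # 74

# eta squared for every source of variance, including error
eta_sq = {name: value / ss_total for name, value in ss.items()}
eta_sq["Error"] = ss_error / ss_total

# partial eta squared is defined only for the effects, not for error
partial_eta_sq = {name: value / (value + ss_error) for name, value in ss.items()}

sum_eta = sum(eta_sq.values())
sum_partial = sum(partial_eta_sq.values())

print("Sum of eta squared:        ", round(sum_eta, 2))
print("Sum of partial eta squared:", round(sum_partial, 2))
```

Because every η² shares the same denominator, the values partition the total variability; each partial η² has its own denominator, so their sum has no such constraint.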

FIGURE 4. Calculations of three different measures of effect size for a two-factor (T and G) ANOVA of the data set shown in Figure 2. The effect sizes shown are all based on proportions of sum of squares: eta squared (η²), partial η², and omega squared (ω²). Note the following: (i) The denominator sum of squares term will be larger for η² than for partial η² in a factorial ANOVA, so η² will be smaller than partial η². (ii) Omega squared (ω²) is a population estimate, whereas η² and partial η² are sample estimates, so ω² will be smaller than both η² and partial η². (iii) The sum of all η² equals 1, whereas the sum of all partial η² does not equal 1 (can be less than or greater than). Refer to text for further explanation of these attributes.

Effect    Sum of Squares   Degrees of freedom   Mean Squares   η²     partial η²   ω²
T         56               2                    28             0.76   0.90         0.72
G         12               1                    12             0.16   0.67         0.15
T x G     0                2                    0              0.00   0.00         0.00
Error     6                6                    1              0.08   ----         ----
Total     74               11                                  1.00   1.57

Sample calculations:

$$\eta^2 = SS_{\text{Between Groups}} / SS_{\text{Total}}$$

η² for T = 56 / 74 = 0.76 = accounts for 76% of total variability in DV scores.
η² for G = 12 / 74 = 0.16 = accounts for 16% of total variability in DV scores.
η² for T*G interaction = 0 / 74 = 0.00 = accounts for 0% of total variability in DV scores.
η² for Error = 6 / 74 = 0.08 = accounts for 8% of total variability in DV scores.
Sum of all η² = 100%

$$\text{partial } \eta^2 = SS_{\text{Between Groups}} / (SS_{\text{Between Groups}} + SS_{\text{Error}})$$

partial η² for T = 56 / (56 + 6) = 0.90 = accounts for 90% of total variability in DV scores.
partial η² for G = 12 / (12 + 6) = 0.67 = accounts for 67% of total variability in DV scores.
partial η² for T*G interaction = 0 / (0 + 6) = 0.00 = accounts for 0% of total variability in DV scores.
Sum of all partial η² ≠ 100%

$$\omega^2 = \frac{SS_{\text{Between Groups}} - (df_{\text{Between Groups}})(MS_{\text{Error}})}{SS_{\text{Total}} + MS_{\text{Error}}}$$

ω² for T = [56 - (2)(1)] / [74 + 1] = 54 / 75 = 0.72
ω² for G = [12 - (1)(1)] / [74 + 1] = 11 / 75 = 0.15
ω² for T*G interaction = [0 - (2)(1)] / [74 + 1] = -2 / 75 ≈ -0.03 (a negative estimate, reported as 0.00)

How Many Subjects?

The aim of any experimental design is to have adequate statistical power to detect differences between groups that truly exist. There is no simple answer to the question of how many subjects are needed for statistical validity using ANOVA. The typical standard is to design a study with an alpha of 0.05 and statistical power of at least 0.80 (i.e., an 80% chance of detecting differences between group means that truly exist; alternatively, a 20% chance of committing a Type 2 error). Statistical power is a function of effect size, sample size, and the number of independent variables and levels, among other things. Adequate sample size is a critical design consideration, and a prospective (a priori) power analysis is performed to estimate the required sample size that will yield the desired level of power in the inferential analysis after data are collected. This entails a prediction of group mean differences and group standard deviations in the yet-to-be collected data. Specifically, the effect size index used for prospective power analysis is based on a standardized measure such as Cohen's d, which is based on predicted differences in group means (statistical signal) divided by standard deviation (statistical noise). Being based on differences instead of proportions, the d effect size index is scaled differently than the η², partial η² and ω² described above, and can exceed a value of 1.

The prediction of an experiment's effect size that is part of a prospective power analysis is nothing more than an estimate. This estimate can be based on pilot study data, previously published findings, intuition or best guesses. A guiding principle should be to select an effect size that is deemed to be clinically meaningful.


The approach used in a prospective power analysis is outlined below for the simple case of a t-test with independent groups and equal variance, in which the effect size index is defined as:

$$d = \frac{\text{difference in group means}}{\text{standard deviation of both groups}}$$

The estimate of the appropriate number of subjects in each group for the specified alpha and power is given by the following equation²⁶:

$$N_{\text{Estimated}} = 2 \times \left[ \frac{z_\alpha + z_\beta}{d} \right]^2$$

in which:

$z_\alpha$ is the z value for the specified alpha. With alpha = 0.05, $z_\alpha$ = 1.96 (2 tail).

$z_\beta$ is the z value for the specified beta (risk of Type 2 error). Power = 1 - β. For β = 0.20 (power = 0.80), $z_\beta$ = 0.84 (1 tail).

As a computational example, if the effect size d is predicted to be 1.0 (which equates to a difference between group means of one standard deviation), then for alpha = 0.05 and power = 0.80 the appropriate sample size for each group would be:

$$N_{\text{Estimated}} = 2 \times \left[ \frac{1.96 + 0.84}{1} \right]^2 = 2 \times \left[ \frac{2.80}{1} \right]^2 = 2 \times 2.80^2 = 15.68 \approx 16$$

For a smaller effect size, a larger sample size is needed, e.g., N = 63 for an effect size of 0.5. The reader is cautioned that these sample sizes are estimates based on guesses about the predicted effect size; they do not guarantee statistical significance.
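The sample size equation above is simple enough to script. The following Python sketch uses the standard library's normal distribution to obtain the z values rather than the rounded table values 1.96 and 0.84; the function name and the choice to round up to the next whole subject are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Estimated subjects per group for a two-group comparison:
    N = 2 * [(z_alpha + z_beta) / d]^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed alpha
    z_beta = NormalDist().inv_cdf(power)           # one-tailed beta
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


for d in (1.0, 0.5):
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```

Because d appears squared in the denominator, halving the predicted effect size roughly quadruples the required sample size.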


Prospective power analysis for ANOVA is more complex than outlined above for a simple t-test. ANOVAs can have numerous levels within a factor, multiple factors, and interactions, all of which need to be accounted for in a comprehensive power analysis. These complications raise the following cautionary note: ANOVA power analysis quickly devolves into a series of progressively more wild guesses (instead of "estimates") of effect sizes as the number of independent variables and possible interactions increases²⁶. It is often advisable to focus a prospective power analysis for ANOVA on one factor that is of primary interest, so as to simplify the power analysis and reduce the number of unjustifiable guesses. The reader is referred to statistical textbooks (such as references 22, 26, 27) for different approaches that can be used for prospective power analysis for ANOVA designs. As a general guideline, it is desirable for group sample sizes to be large enough to invoke the central limit theorem in the statistical analysis (>30 or so) and for there to be a balanced design (equal sample sizes in each group).

Finally, a retrospective (post hoc) power analysis is warranted after data are collected. The aim is to determine the statistical power of the study, based on the effect size (not estimated, but calculated directly from the data) and sample size. This is particularly relevant for statistically non-significant findings, since the non-significance may have been the result of inadequate statistical power. The textbooks cited above, as well as many others, also discuss the mechanics of how to perform retrospective power analyses.
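For the simple two-group case, a rough retrospective power figure can be obtained by inverting the normal-approximation sample size equation given earlier. The sketch below is an approximation under stated assumptions, not the method of the cited textbooks: it assumes two independent groups of equal size, uses the normal rather than the noncentral t distribution, and ignores the negligible contribution of the opposite tail. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-group comparison:
    power = Phi(|d| * sqrt(n / 2) - z_alpha)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed alpha
    return NormalDist().cdf(abs(d) * sqrt(n_per_group / 2) - z_alpha)


# Observed effect sizes of 1.0 and 0.5 with 16 subjects per group
for d in (1.0, 0.5):
    print(f"d = {d}, n = 16 per group: power ~ {approximate_power(d, 16):.2f}")
```

With 16 subjects per group, an observed effect size of 1.0 corresponds to power close to the 0.80 target, whereas an effect size of 0.5 leaves the study substantially underpowered.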

Conclusion: Statistical Significance Should not be Confused with Clinical Significance

ANOVA is a useful statistical tool for drawing inferential conclusions about how one or more independent variables influence a parametric dependent variable (outcome measure). It is imperative to keep in mind that statistical significance does not necessarily correspond to clinical significance. The much sought after statistically significant ANOVA p value has only two purposes: to play a role in the inferential decision as to whether group means differ from each other (rejection of the Null hypothesis), and to assign a probability of the risk of committing a Type 1 error if the Null hypothesis is rejected. Statistically significant ANOVA and MCPs say nothing about the magnitude of group mean differences, other than that a difference exists. A large sample size can produce statistical significance with small differences in group means; depending on the outcome measure, these small differences may have little clinical significance. Assigning clinical significance is a judgment call that needs to take into account the magnitude of the differences between groups, which is best assessed by examination of effect sizes. Statistical significance plays the role of a searchlight to detect group differences, whereas effect size is useful for judging the clinical significance of these differences.



