Comparing Group Means: One Sample, Independent, and Dependent t Tests
Refresher - Steps for conducting a test of statistical significance:
- State
the null and research hypotheses.
- Establish the level of statistical significance (alpha level,
level of risk for committing a Type I error).
- Select
the appropriate test statistic (see the decision tree on the back
inside cover of the text
or view https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html
for an interactive version).
- Check the test's assumptions and then compute
the test statistic based on the sample data (obtained value).
- *Determine
the critical value for the test statistic.
- Compare
the obtained value with the critical value.
- Either
reject or retain the null hypothesis based on the following.
- If obtained
value > critical value, then reject the null hypothesis -
evidence supports the research hypothesis.
- If
obtained value <= critical value, then retain the null
hypothesis - evidence does not support the research hypothesis.
*As an alternative to Steps 5-7, just compare the reported significance
level (p-value) with the preset alpha level (usually .05).
If Sig. < .05, then reject the null hypothesis -
you found a difference.
If Sig. >= .05, then retain the null hypothesis -
you didn't find a difference.
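If it helps to see that decision rule spelled out, here is a minimal sketch in Python; the alpha and p-value below are made-up numbers standing in for your own output.

alpha = .05       # preset level of risk for a Type I error
p_value = .032    # significance (Sig.) reported by the statistical software (hypothetical)

if p_value < alpha:
    print("Reject the null hypothesis - evidence supports the research hypothesis.")
else:
    print("Retain the null hypothesis - evidence does not support the research hypothesis.")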
Comparing Means: Types of t Tests
There are three types of t
tests that will be introduced in this section: one sample t tests, independent samples t tests, and dependent samples t tests. A one sample t test compares the mean of one
group against a known, predetermined value - for example, a cut point
for a test score. An independent t
test compares the means of two independent groups - for example, boys'
scores with girls' scores. A dependent t test compares two scores that are
related somehow, usually two scores belonging to the same person - for
example, comparing a pre-test mean with a post-test mean.
Comparing a Group Mean with a Benchmark Score: One Sample t Test
The first example of a t test
compares a group's mean against a known score - either one representing
a predetermined benchmark or one representing a historical performance
level, for example. Consider the situation of a teacher who is meeting
a new class of students and wants to compare their initial knowledge of
history with that of her previous students. Historically, she has
observed a mean score of 45% on a comprehensive test of world history.
She gives the test to her incoming students, and they have a mean score
of 40%. Her question is whether these students are significantly less
prepared than the previous students or whether their lower test score
is essentially equal to the historical trend. A one sample t test will answer this question.
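If you prefer to see the test in code rather than in SPSS, here is a minimal sketch in Python using scipy (not part of the course materials). The ten scores are invented to match the scenario - their mean is 40% - and 45 is the historical mean the teacher is comparing against.

from scipy import stats

# Hypothetical scores (percent correct) for the incoming class; their mean is 40
scores = [38, 42, 35, 44, 41, 37, 40, 43, 39, 41]

# One sample t test against the historical mean of 45%
t_value, p_value = stats.ttest_1samp(scores, popmean=45)
print(t_value, p_value)   # compare p_value with the preset alpha level (.05)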
Let's consider another example. The state sets an intended, ultimate
goal of 800 points on the API for each school in the state. The scores
for Alameda county schools have just been reported. Here are three one
sample t tests comparing the
elementary, middle, and high school means for public schools in Alameda
county with the benchmark score of 800. There is one assumption that
must be checked in order to conduct this test - the underlying
distribution needs to be normally distributed or the sample size needs
to be greater than 30. This assumption is satisfied by the number of
schools at each level - there are 229 elementary schools, 62 middle
schools, and 82 high schools. Here are the steps to follow for
comparing the elementary schools with the benchmark score of 800.
- State
the null and research hypotheses.
Null hypothesis: There is no
difference between mean API scores for Alameda elementary schools and
the benchmark score of 800
Research hypothesis: There is a
difference between mean API scores for Alameda elementary schools and
the benchmark score of 800
- Establish the alpha level.
We'll set a two-tailed alpha level
at α = .05
- Select
the appropriate test statistic (see the decision tree on the back
inside cover
or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
The appropriate test to use is a one sample t test. The test statistic will be t.
- Check the test's assumptions and then compute
the test statistic based on the sample data (obtained value).
The assumption of normality is satisfied by the sample size - there
are 229 elementary schools, well over 30.
- Skip to #8. Determine
the critical value for the test statistic.
The
critical value for the test statistic could be looked up in a table of
t values, but SPSS gives the p-value associated with the observed t value instead.
- Skip to #8. Compare
the obtained value with the critical value.
Again, because SPSS reports the
p-value - Sig. (2-tailed) - associated with the observed t value, we can directly compare the
associated p-value (.054) to the predetermined alpha level of .05.
- Skip to #8. Either
reject or retain the null hypothesis based on the following.
- If obtained
value > critical value, then reject the null hypothesis -
evidence supports the research hypothesis.
- If
obtained value <= critical value, then retain the null
hypothesis - evidence does not support the research hypothesis.
- Alternative to #5-7 - for use with SPSS output:
Compare the reported p-value (Sig.) with the preset alpha level.
If
p-value < alpha level, then reject the null hypothesis - evidence
supports the research hypothesis. There is a small chance of committing
a Type I error.
If p-value >= alpha level, then retain the null
hypothesis - evidence does not support the research hypothesis. The
chance of committing a Type I error is too large.
In
this example, .054 > .05, so we retain the null hypothesis and
conclude that there is no statistically significant difference between
the API scores of the Alameda elementary schools and the benchmark
level of 800 points.
Here is the output:
Because the risk of committing a Type I error (Sig.) is greater than
.05, the null hypothesis is retained and the observed mean of 787
(rounded) is considered to be close enough to 800 to conclude that
there is no statistically significant difference between the two
numbers.
Following the same procedure, here are the results for middle and high
schools in Alameda county.
Because the risk of committing a Type I error (Sig.) is less than
.05, the null hypothesis is rejected and the observed mean of 741
(rounded) is considered to be far enough away from 800 to conclude that
there
is a statistically significant difference between the two numbers.
Because the risk of committing a Type I error (Sig.) is less than
.05, the null hypothesis is rejected and the observed mean of 624
(rounded) is considered to be far enough away from 800 to conclude that
there
is a statistically significant difference between the two numbers.
In the results for the middle schools and high schools where the null
hypothesis was rejected, the size of the difference between the
observed mean and the benchmark score of 800 can be considered. These
size differences are called effect sizes and are measured using a
statistic called Cohen's d, which is calculated in a manner similar to
a z score.
For the middle schools, d = (741.44 - 800)/122.803 = -58.56/122.803 =
-.4768, which is rounded to -.48. This effect size represents just less
than a one-half standard deviation difference between the observed mean
of 741 and the benchmark of 800. By using the normal distribution
calculator here (http://davidmlane.com/hyperstat/z_table.html),
the associated percentile for a z score of -.48 is approximately the
31st percentile.
For the high schools, d = (624.34 - 800)/142.192 = -175.66/142.192 =
-1.235, which is rounded to -1.24. This effect size represents just
less than a
one and a quarter standard deviation difference between the observed
mean of 624 and the benchmark of 800. In this case, using the normal
distribution calculator places the result at the 11th percentile.
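If you want to check this arithmetic yourself, here is a short Python sketch that reproduces the effect sizes and the percentiles (scipy's normal distribution stands in for the online calculator):

from scipy.stats import norm

# Cohen's d for the two one sample comparisons (values from the output above)
d_middle = (741.44 - 800) / 122.803   # about -.48
d_high   = (624.34 - 800) / 142.192   # about -1.24

# Percentile associated with each effect size, treating d like a z score
print(norm.cdf(d_middle) * 100)   # roughly the 31st percentile
print(norm.cdf(d_high) * 100)     # roughly the 11th percentile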
Comparing the Means of Two Groups - Independent t Tests
Many
research projects in education include a comparison of two groups.
Examples include comparing two instructional methods, comparing
achievement between girls and boys, comparing performance levels of
English language learners and non-English language learners, or
comparing outcomes for students from high SES schools with those from
low SES schools. These types of comparisons involve stating a null
hypothesis - there is no difference between the groups - and an
alternative (research) hypothesis - there is a difference, which can be
non-directional or directional. An example of a directional hypothesis
is that girls will score higher than boys on a listening comprehension
test. An example of a non-directional hypothesis is that girls will
score differently (i.e., either higher or lower) than boys on the test.
Before comparing the two means, we need to determine if the two
means are comparable. If you will recall, when the standard deviation
was introduced earlier, it was described as a quality-control measure
for the mean. As you know, the standard deviation indicates the amount
of spread around the mean. Larger standard deviations indicate more
spread, smaller standard deviations indicate less spread. Likewise,
larger standard deviations indicate a less representative mean and
smaller standard deviations indicate a more representative mean. So,
before directly comparing the mean, we need to compare the standard
deviations. Are the two standard deviations similar enough? As we
will see shortly, SPSS includes this step in its output labeled as
Levene's Test, which tests a prerequisite of the statistical comparison
of the means called the assumption of homogeneity of variance (think of
equality of spread). Other assumptions that must be met
are independence between the two groups and having an underlying
normally distributed population. Group independence is a result of the
research design. The normality assumption can be checked by inspecting
the histogram for symmetry; the assumption can be relaxed if there are
enough (usually more than 30) participants in each group.
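Outside of SPSS, the homogeneity of variance check can also be run with Levene's test directly. Here is a minimal sketch in Python with scipy; the two sets of scores are invented for illustration.

from scipy import stats

# Hypothetical API scores for two independent groups of schools
group_a = [812, 790, 845, 760, 801, 778, 830, 795]
group_b = [701, 742, 688, 755, 720, 710, 735, 698]

# Levene's test for homogeneity of variance
levene_stat, levene_p = stats.levene(group_a, group_b)
print(levene_p)   # if this value is greater than .05, the spreads are similar enough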
Let's see what testing two means looks like in practice by conducting
an independent t
test with the 2006 API data for San Francisco elementary schools. In
this test, we are comparing API scores for elementary schools with a
significant number of English learners (EL) with API scores for
elementary
schools without a significant number of ELs. When using SPSS to conduct
a t
test, the steps are easier than those listed in the text. The result is
equivalent, but instead of relying on a table of critical values
(Appendix C in the text), SPSS incorporates the values from the table
into the
statistical output. Instead of comparing the observed t value with the critical t value from the table, you just
need to compare the observed p-value with the predetermined alpha
level. Here are the steps:
- State
the null and research hypotheses.
Null hypothesis: There is no
difference between mean API scores for EL elementary schools and non-EL
elementary schools
Research hypothesis: There is a
difference between mean API scores for EL elementary schools and non-EL
elementary schools
- Establish the alpha level.
We'll set a two-tailed alpha level
at α = .05
- Select
the appropriate test statistic (see the decision tree on the back
inside cover
or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
The appropriate test to use is an independent t test. The test statistic will be t.
- Check the test's assumptions and then compute
the test statistic based on the sample data (obtained value).
The assumption of independence is
met because we are comparing different schools.
The assumption of normality is met
by checking the histograms for each group.
The assumption of homogeneity of variance is tested with Levene's test
below. Notice that the reported significance level (Sig.) is .86.
Because .86 is greater than .05, we retain the assumption: the two
standard deviations (83.454 and 96.063) are similar enough for the
t test, which allows us to compare the two means (782.81 and 735.55).
- Skip to #8. Determine
the critical value for the test statistic.
The
critical value for the test statistic could be looked up in a table of
t values, but SPSS gives the p-value associated with the observed t value instead.
- Skip to #8. Compare
the obtained value with the critical value.
Again, because SPSS reports the
p-value - Sig. (2-tailed) - associated with the observed t value, we can directly compare the
associated p-value (.046) to the predetermined alpha level of .05.
- Skip to #8. Either
reject or retain the null hypothesis based on the following.
- If obtained
value > critical value, then reject the null hypothesis -
evidence supports the research hypothesis.
- If
obtained value <= critical value, then retain the null
hypothesis - evidence does not support the research hypothesis.
- Alternative to #5-7 - for use with SPSS output:
Compare the reported p-value (Sig.) with the preset alpha level.
If
p-value < alpha level, then reject the null hypothesis - evidence
supports the research hypothesis. There is a small chance of committing
a Type I error.
If p-value >= alpha level, then retain the null
hypothesis - evidence does not support the research hypothesis. The
chance of committing a Type I error is too large.
In
this example, .046 < .05, so we reject the null hypothesis and
conclude that there is a statistically significant difference between
the API scores of the two groups of schools.
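For readers who want to reproduce this kind of analysis outside of SPSS, here is a hedged sketch in Python with scipy. The school-level scores are not listed in this lesson, so the two lists below are invented stand-ins; with real data you would substitute the actual API scores for each group.

from scipy import stats

# Hypothetical stand-ins for the two groups of elementary schools
non_el = [812, 790, 845, 760, 801, 778, 830, 795, 788, 820]
el     = [701, 742, 688, 755, 720, 710, 735, 698, 745, 730]

# Independent samples t test; equal_var=True mirrors the "equal variances assumed"
# row of the SPSS output (use equal_var=False when Levene's test is significant)
t_value, p_value = stats.ttest_ind(non_el, el, equal_var=True)
print(t_value, p_value)   # reject the null hypothesis if p_value < .05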
Confidence Intervals and the Standard Error of the Mean or Difference
Hypothesis
testing involves a point estimate and results in a decision about
rejecting or retaining the null hypothesis. An equivalent alternative
to this approach involves an interval instead of a point. The interval
is called a confidence interval and has a researcher-determined
percentage associated with it. The percentage is the level of
confidence that the true difference is within the interval. This is not
a probability that the true difference is within the interval - the
true difference either is or isn't within the interval and because
we'll never be absolutely certain what the true difference is, we'll
never know if it is within the interval. The 95% indicates that if you
repeated this test many times, 95% of the intervals would contain the
true difference and 5% of the intervals would not.
Note in the table above that the 95%
confidence interval for the t test
is reported as ranging from a low of .889 to a high of 93.636. These
numbers indicate that the difference between the two mean API scores
could be as
little as .889 or as large as 93.636, but notice that the difference
cannot be 0 because 0 is less than .889.
You
might wonder what these interval endpoints (.889 and 93.636) are based
on. First, the center of these numbers is 47.263, which is the observed
difference between the two mean API scores. Try subtracting these
numbers (93.636 - .889 = 92.747), dividing the result by 2 (92.747 / 2
= 46.3735), and then adding that result to .889 (46.3735 + .889 =
47.2625, which rounds to 47.263 -- the observed difference). So, the
midpoint of the confidence interval is the observed difference between
the two means that we are comparing.
How is the length of the
interval determined? The answer to this question is based on a
theoretical probability distribution, similar to the standard normal
distribution, called Student's t
distribution. Recall that we found z scores associated with certain
probabilities. We can also find t values associated with
probabilities. The t value associated with 95% confidence (for the
degrees of freedom in this comparison) is just under 2 - actually
1.9966. So, we should
construct an interval that is approximately two "standard deviations"
above and below the observed difference of 47.263.
What is the
standard deviation of the difference? This number is shown in the table
above - it is labeled Std. Error Difference. Technically, it is based
on a weighted (pooled) average of the two sample standard deviations,
divided by a factor that depends on the two sample sizes - the same
idea as dividing a standard deviation by the square root of its sample
size. More about this in the next
paragraph. To determine the endpoints, start with the observed
difference (47.263), subtract from it 1.9966 * Std. Error (47.263 -
(1.9966 * 23.227) = .889) -- this is the left endpoint of the
confidence interval, and finally repeat the same calculation only add
instead of subtract (47.263 + (1.9966 * 23.227) = 93.636) -- this is
the right endpoint of the confidence interval. If this comparison of
the two mean API scores were repeated many times, 95% of the intervals
constructed this way would contain the true difference - here, the
interval runs from .889 to 93.636 points.
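The endpoint arithmetic is easy to reproduce. The sketch below uses the difference, standard error, and critical t value quoted above; in general, the critical value comes from the t distribution (for example, scipy's stats.t.ppf) for the degrees of freedom of the test.

diff = 47.263     # observed difference between the two mean API scores
se = 23.227       # Std. Error Difference from the SPSS table
t_crit = 1.9966   # critical t for 95% confidence at this test's degrees of freedom

lower = diff - t_crit * se   # about .889
upper = diff + t_crit * se   # about 93.636
print(lower, upper)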
More about
the standard error. Notice in the first table above, after the standard
deviations, there is a column labeled Std. Error Mean. The numbers in
this column are calculated by dividing the standard deviation by the
square root of the sample size. For example, 12.046 = 83.454 / 6.928,
where 6.928 is the square root of 48. What does this number represent?
Recall the simulation shown in the last module of drawing samples from
a population and then calculating and plotting the mean of each sample.
The result of this process is a distribution of sample means. The
standard deviation of the distribution of sample means is the standard
error of the mean, which is denoted SEM. Think of it the same way that
you do the standard deviation for a regular sample - the only
difference is that instead of a sample of observations, we are referring
to a sample of means derived from multiple samples, all of the same
size. You might revisit this site (http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html)
and generate some more distributions of sample means.
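That division is simple enough to verify directly:

import math

sd = 83.454               # standard deviation for the first group
n = 48                    # number of schools in that group
print(sd / math.sqrt(n))  # about 12.046, matching the Std. Error Mean column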
One
limitation of confidence intervals is that they are used only
with two-tailed tests. For a one-tailed test, where the alternative
hypothesis is directional, you need to compare the observed p-value to
the predetermined significance (alpha) level. To learn more about
confidence intervals,
see http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html.
Effect Size
All
this work and we really only determined that the observed difference
between the two mean API scores is probably not due to chance. In other
words, we found a statistically significant difference between the two
mean API scores -- meaning that 782.81 is statistically significantly
larger than 735.55. But, we have not determined how important this
difference should be for educators. Is this 47-point difference really
meaningful? To answer that question, we need to calculate another
statistic, called Cohen's d, which is a measure of effect size. Based
on Jacob Cohen's work, the following strengths of effect sizes have
been determined for educational research:
small effect | .00 to .20
medium effect | .20 to .50
large effect | .50 and higher
Cohen's d is calculated using the following formula:
d = (Mean1 - Mean2) / standard deviation
It doesn't really matter which mean is Mean1 and which is Mean2; call
the larger mean Mean1 to work with positive effect sizes. Which
standard deviation to use is
a matter of debate. In some situations, you should use the control
group's standard deviation. In other cases, you should use a weighted
average of the two sample standard deviations. This weighted sample is
called the pooled standard deviation - you can see an example formula
on page 364.
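As a rough check, here is a short Python sketch using the means and standard deviations from the SPSS table above. Because the two group sizes are not repeated here, this version pools the two variances with a simple unweighted average, which is close to the weighted (pooled) version described in the text when the groups are of similar size.

import math

mean1, mean2 = 782.81, 735.55   # group means from the SPSS table
sd1, sd2 = 83.454, 96.063       # group standard deviations

pooled_sd = math.sqrt((sd1**2 + sd2**2) / 2)   # unweighted average of the variances
d = (mean1 - mean2) / pooled_sd
print(d)   # roughly one-half, in line with the effect size reported below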
To calculate the effect size, you can also use the effect size
calculator mentioned in the text at http://web.uccs.edu/lbecker/Psy590/escalc3.htm.
Doing so gives the following result.
So,
the effect size is approximately one-half of a standard deviation,
which is a large effect based on Salkind's criteria. If we compute the
percentile associated with a z score of .5, we would get about the 70th
percentile, which can be interpreted to indicate that the mean API
score of the EL elementary schools is higher than 70% of the
non-EL schools.
Excel has a TTEST function that has the following syntax:
=TTEST(array1,array2,tails,type).
Here is an example TTEST formula for comparing two sets of scores,
provided in cells A2:A21 and B2:B31, respectively.
=TTEST(A2:A21,B2:B31,2,3)
The
2 indicates a two-tailed test and the 3 indicates a test that doesn't
assume that the two samples have equal variances. See Excel's help
information for other options for these parameters. (In newer versions
of Excel, the equivalent function is named T.TEST.)
The result of the formula is a
p-value which can be compared to the significance level (α) to
determine the chance of committing a Type I error. Here is what the
spreadsheet looks like.
Dependent t Tests - Comparing Two Means for Related Groups
In the previous discussion, one
of the assumptions was that the two groups are unrelated or
independent. What happens if they are related? What if there is
some type of link between pairs of observations? The existence of this
link allows us to calculate the difference between the paired
observations and test if the mean difference is equal to 0. This
process should sound similar to the independent t test. Before
conducting one of these tests, let's consider a little background
information.
First of all, what are related groups?
Consider a study that involves twins. Obviously, twins can share many
characteristics. Let's say we are interested in comparing the GPAs of
the twins. Instead of dividing the sets of twins into two groups and
then conducting an independent t
test, we pair the twins and conduct a dependent t test, which
involves the difference between their two GPAs. If this difference is
found to be statistically significantly different than 0, then we can
calculate an effect size to measure the magnitude of the difference.
The
previous scenario may seem a bit contrived. Who conducts twin studies
in education? These types of studies exist, but they don't represent
the most frequent use of the dependent t test in
education. Instead of related groups, consider the comparison
of related scores. Many education studies compare related scores. The
most common comparison is between pretest scores and posttest scores.
This comparison is used to assess the effect of an instructional
intervention, for example. We might compare last year's score with this
year's score, or the score at the beginning of the semester with that
at the end. There are many versions of this type of scenario. The link
between the pairs of scores is the fact that each pair belongs to a
single individual. Each individual's posttest score can be compared
with a previously observed pretest score. In fact, a difference score
is calculated from the pair of scores and tested statistically.
Assumptions for dependent t tests
The
main assumption for the dependent t
test is that the difference scores are normally distributed, or that
there is a sufficiently large sample size. Notice that we no longer
have an assumption about the homogeneity of the variances, because we
are comparing each score with its pair. This is a benefit of the
dependent t
test.
Effect size for dependent t tests
If
a statistically significant result is found, an effect size can be
calculated, similar to the process with independent t tests, only
here, instead of Mean1 - Mean2, the numerator is D, the mean of the
individual difference scores. As in the previous situation, there are
a number of ways to
choose which standard deviation (sd) to use. A commonly accepted
choice is to use the standard deviation for the set of difference
scores.
In the
following example, students' scores on a midterm and final will be
compared. The data are listed below. Let's review the steps for
comparing the scores.
Here
are the steps:
- State
the null and research hypotheses.
Null hypothesis: There
is no difference between midterm scores and final scores
Research hypothesis: Final scores
are higher than midterm scores
- Establish
the alpha level.
We'll
set a one-tailed alpha level at α = .05
- Select
the appropriate test statistic (see the decision tree on the back
inside cover of the text
or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
The appropriate test to use is a
dependent t
test. The test statistic
will be t.
- Check
the test's assumptions and then compute
the test statistic based on the sample data (obtained value).
The assumption of normality is
met by checking the histogram or the skewness ratio for the differences.
- Skip
to #8. Determine
the critical value for the test statistic.
The
critical value for the test statistic could be looked up in a table of
t values, but SPSS gives the p-value associated with the observed t value instead.
- Skip
to #8. Compare
the obtained value with the critical value.
Again, because SPSS reports the
p-value - Sig. (2-tailed) - associated with the observed t value, we can directly compare
the associated p-value (.000) to the predetermined alpha level of .05.
Note that SPSS reports the two-tailed p-value; the appropriate p-value
for a one-tailed test is one half of the p-value for a two-tailed test.
The obtained p-value is less than .0005 for a two-tailed test, so the
corresponding p-value for the one-tailed test is less than .00025.
- Skip to #8. Either
reject or retain the null hypothesis based on the following.
- If obtained
value > critical value, then reject the null hypothesis -
evidence supports the research hypothesis.
- If
obtained value <= critical value, then retain the null
hypothesis - evidence does not support the research hypothesis.
- Alternative
to #5-7 - for use with SPSS output:
Compare the reported
p-value (Sig.) with the preset alpha level.
If
p-value < alpha level, then reject the null hypothesis -
evidence
supports the research hypothesis. There is a small chance of committing
a Type I error.
If p-value >= alpha level, then retain
the null
hypothesis - evidence does not support the research hypothesis. The
chance of committing a Type I error is too large.
Here
is the output from SPSS:
To conduct the
same analysis in Excel, you first need to ensure that
the Analysis ToolPak is installed for your version of Excel.
Specific instructions for installing the ToolPak may vary for different
versions of Excel. Check the appropriate Help information by searching
for "Analysis ToolPak." When the Analysis ToolPak is installed, a Data
Analysis option appears on the Tools menu (or the Data tab in newer
versions). Choosing t-Test: Paired Two Sample for Means from the Data
Analysis dialog and completing the corresponding dialog box by filling
in the data ranges produces the following output:
In either the SPSS
analysis or the Excel analysis, we see that the mean difference is
statistically significantly different from
0, which indicates that the final scores and the midterm scores are not
the same. By dividing the observed mean difference by the standard
deviation of the difference scores (3.77), we obtain a large effect
size of approximately .92.
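Finally, for anyone working outside of SPSS and Excel, here is a hedged sketch of the same kind of paired comparison in Python with scipy; the midterm and final scores below are invented, not the ones analyzed above.

from scipy import stats
import statistics

# Hypothetical paired scores for ten students
midterm = [72, 80, 65, 90, 78, 85, 70, 88, 75, 82]
final   = [78, 84, 70, 93, 80, 90, 74, 91, 80, 85]

# Dependent (paired) t test on the difference scores
t_value, p_value = stats.ttest_rel(final, midterm)
print(t_value, p_value / 2)   # halve the two-tailed p-value for the one-tailed test

# Effect size: mean of the difference scores divided by their standard deviation
diffs = [f - m for f, m in zip(final, midterm)]
print(statistics.mean(diffs) / statistics.stdev(diffs))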