Comparing Group Means: One Sample, Independent, and Dependent t Tests
Refresher - Steps for conducting a test of statistical significance:
- State
the null and research hypotheses.
- Establish the level of statistical significance (alpha level,
level of risk for committing a Type I error).
- Select
the appropriate test statistic (see the decision tree on the back
inside cover of the text
or view https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html
for an interactive version).
- Check the test's assumptions and then compute
the test statistic based on the sample data (obtained value).
- *Determine
the critical value for the test statistic.
- Compare
the obtained value with the critical value.
- Either
reject or retain the null hypothesis based on the following.
- If obtained
value > critical value, then reject the null hypothesis -
evidence supports the research hypothesis.
- If
obtained value <= critical value, then retain the null
hypothesis - evidence does not support the research hypothesis.
*As an alternative to Steps 5-7, just compare the reported significance
level (p-value) with the preset alpha level (usually .05).
If Sig. < .05, then reject the null hypothesis -
you found a difference.
If Sig. >= .05, then retain the null hypothesis -
you didn't find a difference.
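If it helps to see that decision rule spelled out, here is a minimal sketch in Python; the alpha and p-value below are made-up numbers standing in for your own output.

alpha = .05       # preset level of risk for a Type I error
p_value = .032    # significance (Sig.) reported by the statistical software (hypothetical)

if p_value < alpha:
    print("Reject the null hypothesis - evidence supports the research hypothesis.")
else:
    print("Retain the null hypothesis - evidence does not support the research hypothesis.")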
Comparing Means: Types of t Tests
There are three types of t
tests that will be introduced in this section: one sample t tests, independent samples t tests, and dependent samples t tests. A one sample t test compares the mean of one
group against a known, predetermined value - for example, a cut point
for a test score. An independent t
test compares the means of two independent groups - for example, boys'
scores with girls' scores. A dependent t test compares two scores that are
related somehow, usually two scores belonging to the same person - for
example, comparing a pre-test mean with a post-test mean.
Comparing a Group Mean with a Benchmark Score: One Sample t Test
The first example of a t test
compares a group's mean against a known score - either one representing
a predetermined benchmark or one representing a historical performance
level, for example. Consider the situation of a teacher who is meeting
a new class of students and wants to compare their initial knowledge of
history with that of her previous students. Historically, she has
observed a mean score of 45% on a comprehensive test of world history.
She gives the test to her incoming students, and they have a mean score
of 40%. Her question is whether these students are significantly less
prepared than the previous students or whether their lower test score
is essentially equal to the historical trend. A one sample t test will answer this question.
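If you prefer to see the test in code rather than in SPSS, here is a minimal sketch in Python using scipy (not part of the course materials). The ten scores are invented to match the scenario - their mean is 40% - and 45 is the historical mean the teacher is comparing against.

from scipy import stats

# Hypothetical scores (percent correct) for the incoming class; their mean is 40
scores = [38, 42, 35, 44, 41, 37, 40, 43, 39, 41]

# One sample t test against the historical mean of 45%
t_value, p_value = stats.ttest_1samp(scores, popmean=45)
print(t_value, p_value)   # compare p_value with the preset alpha level (.05)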
Let's consider another example. The state sets an intended, ultimate
goal of 800 points on the API for each school in the state. The scores
for Alameda county schools have just been reported. Here are three one
sample t tests comparing the
elementary, middle, and high school means for public schools in Alameda
county with the benchmark score of 800. There is one assumption that
must be checked in order to conduct this test - the underlying
distribution needs to be normally distributed or the sample size needs
to be greater than 30. This assumption is satisfied by the number of
schools at each level - there are 229 elementary schools, 62 middle
schools, and 82 high schools. Here are the steps to follow for
comparing the elementary schools with the benchmark score of 800.
- State
the null and research hypotheses.
Null hypothesis: There is no
difference between mean API scores for Alameda elementary schools and
the benchmark score of 800
Research hypothesis: There is a
difference between mean API scores for Alameda elementary schools and
the benchmark score of 800
- Establish the alpha level.
We'll set a two-tailed alpha level
at α = .05
- Select
the appropriate test statistic (see the decision tree on the back
inside cover
or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
The appropriate test to use is a one sample t test. The test statistic will be t.
- Check the test's assumptions and then compute
the test statistic based on the sample data (obtained value).
The assumption of normality is satisfied by the sample size - there
are 229 elementary schools, well over 30.
- Skip to #8. Determine
the critical value for the test statistic.
The
critical value for the test statistic could be looked up in a table of
t values, but SPSS gives the p-value associated with the observed t value instead.
- Skip to #8. Compare
the obtained value with the critical value.
Again, because SPSS reports the
p-value - Sig. (2-tailed) - associated with the observed t value, we can directly compare the
associated p-value (.054) to the predetermined alpha level of .05.
- Skip to #8. Either
reject or retain the null hypothesis based on the following.
- If obtained
value > critical value, then reject the null hypothesis -
evidence supports the research hypothesis.
- If
obtained value <= critical value, then retain the null
hypothesis - evidence does not support the research hypothesis.
- Alternative to #5-7 - for use with SPSS output:
Compare the reported p-value (Sig.) with the preset alpha level.
If
p-value < alpha level, then reject the null hypothesis - evidence
supports the research hypothesis. There is a small chance of committing
a Type I error.
If p-value >= alpha level, then retain the null
hypothesis - evidence does not support the research hypothesis. The
chance of committing a Type I error is too large.
In
this example, .054 > .05, so we retain the null hypothesis and
conclude that there is no statistically significant difference between
the API scores of the Alameda elementary schools and the benchmark
level of 800 points.
Here is the output:
Because the risk of committing a Type I error (Sig.) is greater than
.05, the null hypothesis is retained and the observed mean of 787
(rounded) is considered to be close enough to 800 to conclude that
there is no statistically significant difference between the two
numbers.
Following the same procedure, here are the results for middle and high
schools in Alameda county.
Because the risk of committing a Type I error (Sig.) is less than
.05, the null hypothesis is rejected and the observed mean of 741
(rounded) is considered to be far enough away from 800 to conclude that
there
is a statistically significant difference between the two numbers.
Because the risk of committing a Type I error (Sig.) is less than
.05, the null hypothesis is rejected and the observed mean of 624
(rounded) is considered to be far enough away from 800 to conclude that
there
is a statistically significant difference between the two numbers.
In the results for the middle schools and high schools where the null
hypothesis was rejected, the size of the difference between the
observed mean and the benchmark score of 800 can be considered. These
size differences are called effect sizes and are measured using a
statistic called Cohen's d, which is calculated in a manner similar to
a z score.
For the middle schools, d = (741.44 - 800)/122.803 = -58.56/122.803 =
-.4768, which is rounded to -.48. This effect size represents just less
than a one-half standard deviation difference between the observed mean
of 741 and the benchmark of 800. By using the normal distribution
calculator here (http://davidmlane.com/hyperstat/z_table.html),
the associated percentile for a z score of -.48 is approximately the
31st percentile.
For the high schools, d = (624.34 - 800)/142.192 = -175.66/142.192 =
-1.235, which is rounded to -1.24. This effect size represents just
less than a
one and a quarter standard deviation difference between the observed
mean of 624 and the benchmark of 800. In this case, using the normal
distribution calculator places the result at the 11th percentile.
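If you want to check this arithmetic yourself, here is a short Python sketch that reproduces the effect sizes and the percentiles (scipy's normal distribution stands in for the online calculator):

from scipy.stats import norm

# Cohen's d for the two one sample comparisons (values from the output above)
d_middle = (741.44 - 800) / 122.803   # about -.48
d_high   = (624.34 - 800) / 142.192   # about -1.24

# Percentile associated with each effect size, treating d like a z score
print(norm.cdf(d_middle) * 100)   # roughly the 31st percentile
print(norm.cdf(d_high) * 100)     # roughly the 11th percentile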
Comparing the Means of Two Groups - Independent t Tests
Many
research projects in education include a comparison of two groups.
Examples include comparing two instructional methods, comparing
achievement between girls and boys, comparing performance levels of
English language learners and non-English language learners, or
comparing outcomes for students from high SES schools with those from
low SES schools. These types of comparisons involve stating a null
hypothesis - there is no difference between the groups - and an
alternative (research) hypothesis - there is a difference, which can be
non-directional or directional. An example of a directional hypothesis
is that girls will score higher than boys on a listening comprehension
test. An example of a non-directional hypothesis is that girls will
score differently (i.e., either higher or lower) than boys on the test.
Before comparing the two means, we need to determine if the two
means are comparable. If you will recall, when the standard deviation
was introduced earlier, it was described as a quality-control measure
for the mean. As you know, the standard deviation indicates the amount
of spread around the mean. Larger standard deviations indicate more
spread, smaller standard deviations indicate less spread. Likewise,
larger standard deviations indicate a less representative mean and
smaller standard deviations indicate a more representative mean. So,
before directly comparing the mean, we need to compare the standard
deviations. Are the two standard deviations similar enough? As we
will see shortly, SPSS includes this step in its output labeled as
Levene's Test, which tests a prerequisite of the statistical comparison
of the means called the assumption of homogeneity of variance (think of
equality of spread). Other assumptions that must be met
are independence between the two groups and having an underlying
normally distributed population. Group independence is a result of the
research design. The normality assumption can be checked by inspecting
the histogram for symmetry; the assumption can be relaxed if there are
enough (usually more than 30) participants in each group.
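Outside of SPSS, the homogeneity of variance check can also be run with Levene's test directly. Here is a minimal sketch in Python with scipy; the two sets of scores are invented for illustration.

from scipy import stats

# Hypothetical API scores for two independent groups of schools
group_a = [812, 790, 845, 760, 801, 778, 830, 795]
group_b = [701, 742, 688, 755, 720, 710, 735, 698]

# Levene's test for homogeneity of variance
levene_stat, levene_p = stats.levene(group_a, group_b)
print(levene_p)   # if this value is greater than .05, the spreads are similar enough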
Let's see what testing two means looks like in practice by conducting
an independent t
test with the 2006 API data for San Francisco elementary schools. In
this test, we are comparing API scores for elementary schools with a
significant number of English learners (EL) with API scores for
elementary
schools without a significant number of ELs. When using SPSS to conduct
a t
test, the steps are easier than those listed in the text. The result is
equivalent, but instead of relying on a table of critical values
(Appendix C in the text), SPSS incorporates the values from the table
into the
statistical output. Instead of comparing the observed t value with the critical t value from the table, you just
need to compare the observed p-value with the predetermined alpha
level. Here are the steps:
- State
the null and research hypotheses.
Null hypothesis: There is no
difference between mean API scores for EL elementary schools and non-EL
elementary schools
Research hypothesis: There is a
difference between mean API scores for EL elementary schools and non-EL
elementary schools
- Establish the alpha level.
We'll set a two-tailed alpha level
at α = .05
- Select
the appropriate test statistic (see the decision tree on the back
inside cover
or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
The appropriate test to use is an independent t test. The test statistic will be t.
- Check the test's assumptions and then compute
the test statistic based on the sample data (obtained value).
The assumption of independence is
met because we are comparing different schools.
The assumption of normality is met
by checking the histograms for each group.
The assumption of homogeneity of variance is tested with Levene's test
below. Notice that the reported significance level (Sig.) is .86.
Because .86 is greater than .05, we retain the assumption: the two
standard deviations (83.454 and 96.063) are similar enough for the
t test, which allows us to compare the two means (782.81 and 735.55).
- Skip to #8. Determine
the critical value for the test statistic.
The
critical value for the test statistic could be looked up in a table of
t values, but SPSS gives the p-value associated with the observed t value instead.
- Skip to #8. Compare
the obtained value with the critical value.
Again, because SPSS reports the
p-value - Sig. (2-tailed) - associated with the observed t value, we can directly compare the
associated p-value (.046) to the predetermined alpha level of .05.
- Skip to #8. Either
reject or retain the null hypothesis based on the following.
- If obtained
value > critical value, then reject the null hypothesis -
evidence supports the research hypothesis.
- If
obtained value <= critical value, then retain the null
hypothesis - evidence does not support the research hypothesis.
- Alternative to #5-7 - for use with SPSS output:
Compare the reported p-value (Sig.) with the preset alpha level.
If
p-value < alpha level, then reject the null hypothesis - evidence
supports the research hypothesis. There is a small chance of committing
a Type I error.
If p-value >= alpha level, then retain the null
hypothesis - evidence does not support the research hypothesis. The
chance of committing a Type I error is too large.
In
this example, .046 < .05, so we reject the null hypothesis and
conclude that there is a statistically significant difference between
the API scores of the two groups of schools.
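For readers who want to reproduce this kind of analysis outside of SPSS, here is a hedged sketch in Python with scipy. The school-level scores are not listed in this lesson, so the two lists below are invented stand-ins; with real data you would substitute the actual API scores for each group.

from scipy import stats

# Hypothetical stand-ins for the two groups of elementary schools
non_el = [812, 790, 845, 760, 801, 778, 830, 795, 788, 820]
el     = [701, 742, 688, 755, 720, 710, 735, 698, 745, 730]

# Independent samples t test; equal_var=True mirrors the "equal variances assumed"
# row of the SPSS output (use equal_var=False when Levene's test is significant)
t_value, p_value = stats.ttest_ind(non_el, el, equal_var=True)
print(t_value, p_value)   # reject the null hypothesis if p_value < .05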
Confidence Intervals and the Standard Error of the Mean or Difference
Hypothesis
testing involves a point estimate and results in a decision about
rejecting or retaining the null hypothesis. An equivalent alternative
to this approach involves an interval instead of a point. The interval
is called a confidence interval and has a researcher-determined
percentage associated with it. The percentage is the level of
confidence that the true difference is within the interval. This is not
a probability that the true difference is within the interval - the
true difference either is or isn't within the interval and because
we'll never be absolutely certain what the true difference is, we'll
never know if it is within the interval. The 95% indicates that if you
repeated this test many times, 95% of the intervals would contain the
true difference and 5% of the intervals would not.
Note in the table above that the 95%
confidence interval for the t test
is reported as ranging from a low of .889 to a high of 93.636. These
numbers indicate that the difference between the two mean API scores
could be as
little as .889 or as large as 93.636, but notice that the difference
cannot be 0 because 0 is less than .889.
You
might wonder what these interval endpoints (.889 and 93.636) are based
on. First, the center of these numbers is 47.263, which is the observed
difference between the two mean API scores. Try subtracting these
numbers (93.636 - .889 = 92.747), dividing the result by 2 (92.747 / 2
= 46.3735), and then adding that result to .889 (46.3735 + .889 =
47.2625, which rounds to 47.263 -- the observed difference). So, the
midpoint of the confidence interval is the observed difference between
the two means that we are comparing.
How is the length of the
interval determined? The answer to this question is based on a
theoretical probability distribution, similar to the standard normal
distribution, called Student's t
distribution. Recall that we found z scores associated with certain
probabilities. We can also find t values associated with
probabilities. The t value associated with 95% confidence (for the
degrees of freedom in this comparison) is just under 2 - actually
1.9966. So, we should
construct an interval that is approximately two "standard deviations"
above and below the observed difference of 47.263.
What is the
standard deviation of the difference? This number is shown in the table
above - it is labeled Std. Error Difference. Technically, it is based
on a weighted (pooled) average of the two sample standard deviations,
divided by a factor that depends on the two sample sizes - the same
idea as dividing a standard deviation by the square root of its sample
size. More about this in the next
paragraph. To determine the endpoints, start with the observed
difference (47.263), subtract from it 1.9966 * Std. Error (47.263 -
(1.9966 * 23.227) = .889) -- this is the left endpoint of the
confidence interval, and finally repeat the same calculation only add
instead of subtract (47.263 + (1.9966 * 23.227) = 93.636) -- this is
the right endpoint of the confidence interval. If this comparison of
the two mean API scores were repeated many times, 95% of the intervals
constructed this way would contain the true difference - here, the
interval runs from .889 to 93.636 points.
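The endpoint arithmetic is easy to reproduce. The sketch below uses the difference, standard error, and critical t value quoted above; in general, the critical value comes from the t distribution (for example, scipy's stats.t.ppf) for the degrees of freedom of the test.

diff = 47.263     # observed difference between the two mean API scores
se = 23.227       # Std. Error Difference from the SPSS table
t_crit = 1.9966   # critical t for 95% confidence at this test's degrees of freedom

lower = diff - t_crit * se   # about .889
upper = diff + t_crit * se   # about 93.636
print(lower, upper)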
More about
the standard error. Notice in the first table above, after the standard
deviations, there is a column labeled Std. Error Mean. The numbers in
this column are calculated by dividing the standard deviation by the
square root of the sample size. For example, 12.046 = 83.454 / 6.928,
where 6.928 is the square root of 48. What does this number represent?
Recall the simulation shown in the last module of drawing samples from
a population and then calculating and plotting the mean of each sample.
The result of this process is a distribution of sample means. The
standard deviation of the distribution of sample means is the standard
error of the mean, which is denoted SEM. Think of it the same way that
you do the standard deviation for a regular sample - the only
difference is that instead of a sample of observations, we are referring
to a sample of means derived from multiple samples, all of the same
size. You might revisit this site (http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html)
and generate some more distributions of sample means.
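That division is simple enough to verify directly:

import math

sd = 83.454               # standard deviation for the first group
n = 48                    # number of schools in that group
print(sd / math.sqrt(n))  # about 12.046, matching the Std. Error Mean column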
One
limitation of confidence intervals is that they are used only
with two-tailed tests. For a one-tailed test, where the alternative
hypothesis is directional, you need to compare the observed p-value to
the predetermined significance (alpha) level. To learn more about
confidence intervals,
see http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html.
Effect Size
All
this work and we really only determined that the observed difference
between the two mean API scores is probably not due to chance. In other
words, we found a statistically significant difference between the two
mean API scores -- meaning that 782.81 is statistically significantly
larger than 735.55. But, we have not determined how important this
difference should be for educators. Is this 47-point difference really
meaningful? To answer that question, we need to calculate another
statistic, called Cohen's d, which is a measure of effect size. Based
on Jacob Cohen's work, the following strengths of effect sizes have
been determined for educational research:
small effect | .00 to .20
medium effect | .20 to .50
large effect | .50 and higher
Cohen's d is calculated using the following formula:
d = (Mean1 - Mean2) / standard deviation
It doesn't really matter which mean is Mean1 and which is Mean2; call
the larger mean Mean1 to work with positive effect sizes. Which
standard deviation to use is
a matter of debate. In some situations, you should use the control
group's standard deviation. In other cases, you should use a weighted
average of the two sample standard deviations. This weighted sample is
called the pooled standard deviation - you can see an example formula
on page 364.
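As a rough check, here is a short Python sketch using the means and standard deviations from the SPSS table above. Because the two group sizes are not repeated here, this version pools the two variances with a simple unweighted average, which is close to the weighted (pooled) version described in the text when the groups are of similar size.

import math

mean1, mean2 = 782.81, 735.55   # group means from the SPSS table
sd1, sd2 = 83.454, 96.063       # group standard deviations

pooled_sd = math.sqrt((sd1**2 + sd2**2) / 2)   # unweighted average of the variances
d = (mean1 - mean2) / pooled_sd
print(d)   # roughly one-half, in line with the effect size reported below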
To calculate the effect size, you can also use the effect size
calculator mentioned in the text at http://web.uccs.edu/lbecker/Psy590/escalc3.htm.
Doing so gives the following result.
So,
the effect size is approximately one-half of a standard deviation,
which is a large effect based on Salkind's criteria. If we compute the
percentile associated with a z score of .5, we would get about the 70th
percentile, which can be interpreted to indicate that the mean API
score of the EL elementary schools is higher than 70% of the
non-EL schools.
Excel has a TTEST function that has the following syntax:
=TTEST(array1,array2,tails,type).
Here is an example TTEST formula for comparing two sets of scores,
provided in cells A2:A21 and B2:B31, respectively.
=TTEST(A2:A21,B2:B31,2,3)
The
2 indicates a two-tailed test and the 3 indicates a test that doesn't
assume that the two samples have equal variances. See Excel's help
information for other options for these parameters. (In newer versions
of Excel, the equivalent function is named T.TEST.)
The result of the formula is a
p-value which can be compared to the significance level (α) to
determine the chance of committing a Type I error. Here is what the
spreadsheet looks like.
Dependent t Tests - Comparing Two Means for Related Groups
In the previous discussion, one
of the assumptions was that the two groups are unrelated or
independent. What happens if they are related? What if there is
some type of link between pairs of observations? The existence of this
link allows us to calculate the difference between the paired
observations and test if the mean difference is equal to 0. This
process should sound similar to the independent t test. Before
conducting one of these tests, let's consider a little background
information.
First of all, what are related groups?
Consider a study that involves twins. Obviously, twins can share many
characteristics. Let's say we are interested in comparing the GPAs of
the twins. Instead of dividing the sets of twins into two groups and
then conducting an independent t
test, we pair the twins and conduct a dependent t test, which
involves the difference between their two GPAs. If this difference is
found to be statistically significantly different than 0, then we can
calculate an effect size to measure the magnitude of the difference.
The
previous scenario may seem a bit contrived. Who conducts twin studies
in education? These types of studies exist, but they don't represent
the most frequent use of the dependent t test in
education. Instead of related groups, consider the comparison
of related scores. Many education studies compare related scores. The
most common comparison is between pretest scores and posttest scores.
This comparison is used to assess the effect of an instructional
intervention, for example. We might compare last year's score with this
year's score, or the score at the beginning of the semester with that
at the end. There are many versions of this type of scenario. The link
between the pairs of scores is the fact that each pair belongs to a
single individual. Each individual's posttest score can be compared
with a previously observed pretest score. In fact, a difference score
is calculated from the pair of scores and tested statistically.
Assumptions for dependent t tests
The
main assumption for the dependent t
test is that the difference scores are normally distributed, or that
there is a sufficiently large sample size. Notice that we no longer
have an assumption about the homogeneity of the variances, because we
are comparing each score with its pair. This is a benefit of the
dependent t
test.
Effect size for dependent t tests
If
a statistically significant result is found, an effect size can be
calculated, similar to the process with independent t tests, only
here, instead of Mean1 - Mean2, the numerator is D, the mean of the
individual difference scores. As in the previous situation, there are
a number of ways to
choose which standard deviation (sd) to use. A commonly accepted
choice is to use the standard deviation for the set of difference
scores.
In the
following example, students' scores on a midterm and final will be
compared. The data are listed below. Let's review the steps for
comparing the scores.
Here
are the steps:
- State
the null and research hypotheses.
Null hypothesis: There
is no difference between midterm scores and final scores
Research hypothesis: Final scores
are higher than midterm scores
- Establish
the alpha level.
We'll
set a one-tailed alpha level at α = .05
- Select
the appropriate test statistic (see the decision tree on the back
inside cover of the text
or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
The appropriate test to use is a
dependent t
test. The test statistic
will be t.
- Check
the test's assumptions and then compute
the test statistic based on the sample data (obtained value).
The assumption of normality is
met by checking the histogram or the skewness ratio for the differences.
- Skip
to #8. Determine
the critical value for the test statistic.
The
critical value for the test statistic could be looked up in a table of
t values, but SPSS gives the p-value associated with the observed t value instead.
- Skip
to #8. Compare
the obtained value with the critical value.
Again, because SPSS reports the
p-value - Sig. (2-tailed) - associated with the observed t value, we can directly compare
the associated p-value (.000) to the predetermined alpha level of .05.
Note that SPSS reports the two-tailed p-value; the appropriate p-value
for a one-tailed test is one half of the p-value for a two-tailed test.
The obtained p-value is less than .0005 for a two-tailed test, so the
corresponding p-value for the one-tailed test is less than .00025.
- Skip to #8. Either
reject or retain the null hypothesis based on the following.
- If obtained
value > critical value, then reject the null hypothesis -
evidence supports the research hypothesis.
- If
obtained value <= critical value, then retain the null
hypothesis - evidence does not support the research hypothesis.
- Alternative
to #5-7 - for use with SPSS output:
Compare the reported
p-value (Sig.) with the preset alpha level.
If
p-value < alpha level, then reject the null hypothesis -
evidence
supports the research hypothesis. There is a small chance of committing
a Type I error.
If p-value >= alpha level, then retain
the null
hypothesis - evidence does not support the research hypothesis. The
chance of committing a Type I error is too large.
Here
is the output from SPSS:
To conduct the
same analysis in Excel, you first need to ensure that
the Analysis ToolPak is installed for your version of Excel.
Specific instructions for installing the ToolPak may vary for different
versions of Excel. Check the appropriate Help information by searching
for "Analysis ToolPak." When the Analysis ToolPak is installed, a Data
Analysis option appears on the Tools menu (or the Data tab in newer
versions). Choosing t-Test: Paired Two Sample for Means from the Data
Analysis dialog and completing the corresponding dialog box by filling
in the data ranges produces the following output:
In either the SPSS
analysis or the Excel analysis, we see that the mean difference is
statistically significantly different from
0, which indicates that the final scores and the midterm scores are not
the same. By dividing the observed mean difference by the standard
deviation of the difference scores (3.77), we obtain a large effect
size of approximately .92.
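Finally, for anyone working outside of SPSS and Excel, here is a hedged sketch of the same kind of paired comparison in Python with scipy; the midterm and final scores below are invented, not the ones analyzed above.

from scipy import stats
import statistics

# Hypothetical paired scores for ten students
midterm = [72, 80, 65, 90, 78, 85, 70, 88, 75, 82]
final   = [78, 84, 70, 93, 80, 90, 74, 91, 80, 85]

# Dependent (paired) t test on the difference scores
t_value, p_value = stats.ttest_rel(final, midterm)
print(t_value, p_value / 2)   # halve the two-tailed p-value for the one-tailed test

# Effect size: mean of the difference scores divided by their standard deviation
diffs = [f - m for f, m in zip(final, midterm)]
print(statistics.mean(diffs) / statistics.stdev(diffs))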