On Friday, I began talking about what counts as a big effect. Turns out I’m reinventing the wheel, as there is an excellent paper on this topic by Carolyn Hill and her colleagues at MDRC (the Manpower Demonstration Research Corporation), entitled “Empirical Benchmarks for Interpreting Effect Sizes in Research.” But I’ll press onward nevertheless.
Last month, the federal Institute of Education Sciences released the third-year report on the evaluation of the DC Opportunity Scholarship Program, which provides vouchers for K-12 children and youth in the DC Public Schools who win a lottery to attend a private school. The key outcomes in the study were scale scores on the Stanford Achievement Test, Ninth Edition (SAT-9) in reading and mathematics. (Scale scores are converted from “raw” scores based on the number of correct responses to the test.) The evaluators found that, after three years, students who were offered a voucher scored 4.46 points higher on the SAT-9 reading test, which represented an effect size of .13. This effect was statistically different from zero. Interestingly, the impact of being offered a voucher on reading scores was not reliably different from zero for male students. In mathematics, there was no evidence of a positive effect of being offered a voucher: after three years, students offered vouchers scored .81 points higher on the SAT-9 math test, an effect that was not statistically different from zero, and which corresponded to an effect size of .03.
Based on how these effect sizes equate with percentile changes, these are pretty small effects. An asterisk denoting statistical significance for the effect of being offered a voucher on reading scores, and for girls alone, with no effects on math scores for either boys or girls, doesn’t justify the political spectacle that surrounds the program. After three years, the net movement in reading for voucher students, who started at around the 34th percentile nationally, is about five percentiles; in math, it’s about one percentile. Anyone who thinks that effects of this size are altering the life trajectories of DC children is kidding himself.
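The percentile arithmetic here is easy to check with a quick sketch, assuming that nationally normed scores are roughly normally distributed (a standard convention for this kind of conversion; the function name below is mine, not the report’s):

```python
from statistics import NormalDist

def percentile_shift(start_percentile, effect_size):
    """Percentile movement implied by an effect size, assuming
    normally distributed scores: convert the starting percentile
    to a z-score, add the effect size (in standard deviation
    units), and convert back to a percentile."""
    z = NormalDist().inv_cdf(start_percentile / 100)
    return NormalDist().cdf(z + effect_size) * 100 - start_percentile

# Voucher students starting near the 34th percentile nationally:
print(round(percentile_shift(34, 0.13)))  # reading, effect size .13 -> about 5 percentiles
print(round(percentile_shift(34, 0.03)))  # math, effect size .03 -> about 1 percentile
```

Run this and the movements come out to about five percentiles in reading and one in math, matching the figures above.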
Part of the hoopla stems from another way in which the size of the voucher effect is being reported: months and years of additional learning. The overall effect of 4.5 scale score points in reading is reported as equivalent to 3.1 months of additional learning for members of the treatment group, and the 5.3 point scale score gain for those who actually used the voucher is reported as 3.7 additional months of learning. The Wall Street Journal’s op-ed page, always good with math, rounded this up to “Children attending private schools with the aid of the scholarships are reading nearly a half-grade ahead of their peers who did not receive vouchers.”
Where do numbers like this come from? They hinge on the fact that the SAT-9 is vertically-equated across grades K-12, which means that a common scale is used for the forms of the test that are administered at different grades. Using the same scale across grades facilitates the measurement of growth over time. Although a given scale score is supposed to represent the same level of proficiency regardless of what grade a student is in, the reality is that the skills tested at widely-differing grade levels don’t overlap much, so that a given scale score in the third grade may represent a different set of content skills than that same scale score in the seventh grade. (It’s for this reason that the oft-cited claim that, based on the National Assessment of Educational Progress, white students in the 12th grade are, on average, four years ahead of their African American peers is unsupportable. Although there is a single NAEP scale, a given score represents different competencies in eighth grade than that same score does in the 12th grade.) Vertically-equated scale scores in adjacent grades are much more credible than scores in grades that are far apart.
The DC evaluation report states that the conversion to months of learning is based on dividing the impact effect size by the effect size of the weighted average annual increase in scale scores for the control group. In other words, if control group students gain 10 points a year, on average, on the SAT-9 reading test, and the group using a voucher scored 5 points higher than the control group, then the voucher group is 5/10 = .5 school years, or 4.5 months of a nine-month school year, ahead of the control group.
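The conversion is simple enough to sketch directly. This is a minimal version of the arithmetic just described, not the report’s actual code, and the nine-month school year is the assumption implicit in the .5 years = 4.5 months equivalence:

```python
def months_of_learning(impact_points, annual_gain_points, school_year_months=9):
    """Convert a test-score impact into 'months of learning' by
    dividing the impact by the control group's average annual gain
    (both on the same scale, so the ratio equals the ratio of
    effect sizes), then scaling by the length of a school year."""
    return impact_points / annual_gain_points * school_year_months

# The illustration from the text: a 5-point impact against a
# 10-point annual control-group gain.
print(months_of_learning(5, 10))  # -> 4.5 months
```

Notice that the denominator does all the work: the smaller the control group’s annual gain, the more “months of learning” any fixed impact appears to buy.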
What this implies is that if a test shows relatively small gains in performance from one year to the next, then a given effect will look like a larger difference, in terms of months or years of learning gains, than if that test shows relatively large changes over time. Hill and her colleagues show that, for most nationally-normed tests, the largest changes over time occur in the earliest elementary grades, and get progressively smaller as students move into secondary school. This could mean that students simply learn less in high school than they do in elementary school. But it might also mean that tests with a common scale aren’t very good at picking up changes over time in the content of what is taught or learned. The reason that the effects of using the voucher in the DC study appear relatively large in terms of months or years of learning is that there wasn’t much evidence of learning in the control group population—much less learning than is implied by the national norms on the SAT-9 test or students’ scores on DC’s own Comprehensive Assessment System (DC-CAS).
The moral to the story: when the effects of an intervention are reported in terms of months or years of learning gains, treat the numbers with a healthy dose of skepticism. The magnitude of an effect size has to be placed into a meaningful context, which includes knowledge of what Hill et al. refer to as the “natural growth for its target population.”
Tomorrow I’ll have a few more things to say about the DC study, and some anomalies in the scores that I find troubling.