Looking for the culprits behind tests' dropping standards

What does it mean for tests to get easier? And is that really what happened to New York’s tests?

The analysis that has spurred that idea in the last few weeks actually found something slightly different. The tests aren’t necessarily easier, in the way that a kindergarten spelling bee is easier than the SAT. Instead, between 2007 and 2009, students who hadn’t learned much came out looking like they had.

This is an important distinction because it points to culprits beyond the individual test items themselves. Harvard professor Daniel Koretz, the lead author of the analysis commissioned by the state education department, names two possible causes: a phenomenon called “score inflation” and a possible psychometric error tied to an obscure state law.

The actual questions on the test play a role in both, but just as important is the practice of prepping students extensively for tests. Another key is a state law that forces New York to release all test items publicly, making it easier for teachers to practice test prep and making it harder for officials to keep tests consistent over time.

What Koretz found: A dropping hurdle

The question that motivated this week’s scrutiny of the state tests was: Is the increasing number of New York students passing the tests a sign that they know more — or is it a mirage?

In other words, imagine that the passing score of Level 3 out of 4 is a hurdle. Koretz wanted to figure out if more students were leaping over it because more of them could actually jump higher or, alternatively, because the bar had somehow been tugged down.

Maintaining a “proficiency” bar at the same height over time is harder than you might think, because unlike physical height, academic performance is abstract. An entire field of statistics called “psychometrics” exists just to keep the bars at the same height over time.

Likewise, it was a challenge for Koretz to test whether the Level 3 bar for the exam he studied first — the eighth-grade math test — stood at the same height in 2009 as it did in 2007. To compare two abstract things, Koretz needed a stable measurement of students’ raw competence. How much competence did it take to score a Level 3 in 2009 versus 2007? If the hurdle had stuck at the same height, the knowledge needed to clear it would be exactly the same.

To approximate raw competency, Koretz used the NAEP exam, the most respected national test, on which (conveniently for Koretz’s purposes) performance was relatively stable overall between 2007 and 2009. Using a mix of national and state test results, he could estimate the rough percentile rank on the NAEP that students had to reach to achieve a Level 3 on the New York State test.

The move was like asking: If you gave the New York test to students nationally, what percentage would fail? (Scoring at the 80th percentile on a test means that you have reached a level that 80 percent of people couldn’t.) The national failure rate that matched New York’s Level 3 was a rough measure of the raw competency of New York’s “proficient” students.

If the number stayed the same between 2007 and 2009, then the bar must have stayed put. If raw competency dropped, Level 3 must have sunk, too.
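The logic of that comparison can be sketched in a few lines of Python. This is only an illustration of the general percentile-matching idea described above, not Koretz’s actual procedure, which the memo does not spell out here. The function names and every score in the example are hypothetical, invented so the arithmetic is easy to follow.

```python
from bisect import bisect_left


def percentile_score(sorted_scores, pct):
    """Return the score that roughly pct percent of sorted_scores fall below."""
    if not sorted_scores:
        raise ValueError("need at least one score")
    idx = min(len(sorted_scores) - 1, int(len(sorted_scores) * pct / 100))
    return sorted_scores[idx]


def percent_below(sorted_scores, cutoff):
    """Return the percentage of sorted_scores strictly below cutoff."""
    return 100.0 * bisect_left(sorted_scores, cutoff) / len(sorted_scores)


def national_failure_rate(state_fail_pct, ny_naep_scores, national_naep_scores):
    """Estimate the share of students nationally who would miss a state cut score.

    Assumption (ours, for illustration): if state_fail_pct percent of New York
    students miss the state cut score, the NAEP score at that same percentile
    of New York's NAEP results is treated as the cut score's NAEP equivalent.
    """
    ny = sorted(ny_naep_scores)
    national = sorted(national_naep_scores)
    naep_equivalent = percentile_score(ny, state_fail_pct)
    return percent_below(national, naep_equivalent)


# Hypothetical, made-up NAEP-style scores that stay the same in both years,
# mirroring the article's point that NAEP results were roughly stable.
national_scores = list(range(200, 300))
ny_scores = list(range(210, 310))

# Hypothetical state failure rates: 40 percent in one year, 20 percent later.
hurdle_before = national_failure_rate(40, ny_scores, national_scores)
hurdle_after = national_failure_rate(20, ny_scores, national_scores)
print(hurdle_before, hurdle_after)
```

With these invented numbers, the share of students nationally who would miss the cut falls from 50 percent to 30 percent even though the underlying NAEP scores never change. That is the signature of a hurdle that has been lowered rather than of students jumping higher.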

As Koretz put it, “If people have to jump over a similar hurdle, the proportion failing to get there shouldn’t have changed dramatically – because NAEP scores didn’t change very much.”

But this is not what he found. “In fact,” he said this week, “the hurdle had been dropped so much that almost no kids would have failed to jump it.”

In 2007, 12 percent of students nationwide failed to reach the NAEP level equivalent to a Level 2 on New York’s eighth-grade math exam. In 2009, that figure had dropped to 2 percent. For Level 3, it dropped from 36 percent in 2007 to 19 percent in 2009.

Why?

Koretz says he can’t yet be certain why the Level 3 hurdle dropped over time, but he has two guesses. The first — and the one he suspects most strongly — is a phenomenon called “score inflation.”

Score inflation’s primary cause, Koretz told me, is what he calls “inappropriate test prep”: coaching students on material that teachers know will be on the test, to the exclusion of other material that state standards cover but that, for a range of reasons, doesn’t get tested. It can also be caused by deliberate attempts to game the tests, such as barring certain students from taking them.

The result is that students get better at scoring high on tests over time, but they don’t learn more.

The other possible explanation Koretz cites has to do with the test’s makers, who are charged with “linking” tests from one year to another so that a Level 3 holds the same meaning over time.

In New York, linking is especially challenging because of a law we first wrote about last year that requires the state to release all its test items publicly. That prevents the state from following the industry-standard method of linking, which is to keep some test questions secret from one year to the next and use them as benchmarks that stay constant between years. New York instead has to use a less-reliable method called field testing, in which the state gives separate tests each year that carry no high stakes.

“The problem,” Koretz explained, “is kids know it’s a field test.” They don’t take it as seriously as they take the state test, and the results, therefore, are compromised.

A failure to “link” properly doesn’t mean that McGraw-Hill, the company that makes the state tests, broke rules. But, said Koretz, “Even though the process that the contractor used was kosher, it doesn’t mean it worked.”

Moving forward

How do you fix score inflation and bad linking? Koretz said it’s not enough simply to raise the score that equates to “proficient.” But he said that, so far, state education officials are taking the right steps to do more.

Though they haven’t yet decreed a ban on test prep (something that would be hard to do), they have asked McGraw-Hill to redesign the tests so that they are less easily gamed. That includes trying to test a broader set of subjects within math and reading, as happened with this year’s math (but not English Language Arts) tests. It also includes making the test less predictable from year to year. (See a story we ran last year showing how the annual math tests repeat themselves.)

Koretz also said that McGraw-Hill has performed “complicated psychometric work to reduce the effect score inflation might have on the linking.”

And the tests will be entirely rewritten once the national Common Core standards effort to remake assessments is completed in the next few years.

It’s all a big departure from what New York State was saying just two years ago, when Koretz first requested permission to analyze the state’s tests. Then, a spokesman for the State Education Department told me:

“All of New York’s tests are checked many times to be sure that a score this year means the same next year… The only way for a student to improve performance is by learning the curriculum — reading, writing, and math.”

The full Koretz five-page memo summarizing his findings so far:

Memo: Evidence about the leniency of 8th-grade standards