When scores seem too good to believe

This story is by Greg Toppo, Denise Amos, Jack Gillum and Jodi Upton of USA Today

MILFORD, Ohio — Scott Mueller seemed to have an uncanny sense about what his students should study to prepare for upcoming state skills tests.

By 2010, the teacher had spent his 16-year career entirely at Charles Seipelt Elementary School. Like other Seipelt teachers, Mueller regularly wrote study guides for his classes ahead of state tests.

On test day last April, several fifth-grade students immediately recognized some of the questions on their math tests. They were the same as those on the study guide Mueller had given them the day before. Some numbers on the actual tests were identical to those in the study guide and the questions were in the same order, the kids told other Seipelt teachers.

The report of possible cheating quickly reached district officials, who put Mueller on paid leave. He initially denied doing anything wrong. Ultimately, investigators concluded that Mueller had looked at questions for both fifth-grade math and science tests in advance — a violation of testing rules — and then copied them, sometimes word for word, into a school computer to develop his study guides.

The 50-year-old teacher resigned. He signed a consent agreement with the Ohio State Board of Education admitting that, by looking at the 2010 tests in advance to prepare study guides, he had “engaged in conduct unbecoming a licensed educator.” His teaching license was suspended for three months.

At Seipelt, as in other schools nationwide, young students tipped off officials that something was amiss. Yet if anyone had taken a closer look at the past few years’ scores, they might have noticed other testing irregularities at Seipelt.

In several grades, standardized test scores at Seipelt fluctuated year to year, sometimes rising sharply, then falling just as fast, according to data USA Today obtained from the Ohio Department of Education.

In 2005, for example, the school’s third-graders tested in the 67th percentile statewide in math. As fourth-graders a year later, when Mueller was one of their teachers, their scores jumped to the 97th percentile, among the best in the state. As fifth-graders in 2007, the scores plunged to the 49th percentile. Then, in 2008, when they were in sixth grade, their scores climbed again to the 90th percentile.

Seipelt’s gains and losses are typical of a pattern uncovered by a USA Today investigation of the standardized tests of tens of millions of students in six states and the District of Columbia. The newspaper identified 1,610 examples of anomalies in which public school classes — a school’s entire fifth grade, for example — boasted what analysts regard as statistically rare, perhaps suspect, gains on state tests.

Such anomalies surfaced in Washington, D.C., and all the states — Arizona, California, Colorado, Florida, Michigan and Ohio — where USA Today analyzed test scores. For each state, the newspaper obtained three to seven years’ worth of scores.

There were another 317 examples of equally large, year-to-year declines in an entire grade’s scores.

How the analysis was done

USA Today used a methodology widely recognized by mathematicians, psychometricians and testing companies. It compared year-to-year changes in test scores and singled out grades within schools for which gains were 3 standard deviations or more from the average statewide gain on that test. In layman’s language, that means the students in that grade showed greater improvement than about 99.9% of their peers statewide.

The more standard deviations a gain sits from the average, the rarer that improvement is. In dozens of cases, USA Today found gains of 5, 6 and even 7 standard deviations, making them even more exceptional.
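The screening described above can be sketched in a few lines of Python. This is a minimal illustration, not the newspaper's actual code: the function name, the 30 made-up school gains and the choice of a population standard deviation are all assumptions for the example.

```python
from statistics import mean, pstdev

def flag_unusual_gains(gains, threshold=3.0):
    """Return (group, z) pairs whose gain is at least `threshold`
    standard deviations above the average gain across all groups."""
    values = list(gains.values())
    avg = mean(values)
    sd = pstdev(values)  # assumption: population SD over all groups
    if sd == 0:
        return []  # every group changed by the same amount
    flagged = []
    for group, gain in gains.items():
        z = (gain - avg) / sd
        if z >= threshold:
            flagged.append((group, round(z, 2)))
    return flagged

# Hypothetical year-to-year gains for one grade at 30 schools:
# 29 schools move by a point or two, one school jumps by 12.
gains = {f"School {i}": float((i % 5) - 2) for i in range(29)}
gains["School 29"] = 12.0

for group, z in flag_unusual_gains(gains):
    print(group, "gain is", z, "standard deviations above average")
```

With these invented numbers, only School 29 clears the 3-standard-deviation bar. As the experts quoted here note, a flag of this kind marks a grade for a closer look and does not by itself prove cheating.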

Large year-to-year jumps in test scores by an entire grade should raise red flags, especially if scores drop in later grades, says Brian Jacob, director of the Center on Local, State and Urban Policy at the University of Michigan. Such fluctuations by themselves do not prove there was cheating, but Jacob says they offer “a reasonable way to identify suspicious things” that should be investigated.

Education reformers say a surge in scores is possible without cheating. Mike Feinberg, the Houston-based founder of KIPP, a 99-school chain of charter schools widely recognized for raising test scores, says “remarkable growth” is possible with “great teaching and more of it.” Where you have “just an amazing teacher who can motivate his kids to really work hard,” classes can see gains that might seem “unbelievable” at first glance.

“You know something is profoundly different” at schools with spikes in scores, says John Tanner of Test Sense, a San Antonio consulting firm that works with schools nationwide. But, if an investigation shows the school is making “profound changes” commensurate with the gains, “I would give them the benefit of the doubt,” he adds.

Others are more skeptical. “An individual student can exceed beyond their wildest dreams in any given year, but when a whole group shifts its position dramatically, you have to worry,” says John Fremer, president of Caveon Test Security, a Utah company hired by states and school districts to investigate test irregularities.

Thomas Haladyna, a professor emeritus at Arizona State University who investigates cheating, says test gains of 3 standard deviations or more for an entire grade are “so incredible that you have to ask yourself, ‘How can this be real?’” Haladyna says such a spike in scores is so rare that it would be like finding “a weight-loss clinic where you lose 100 pounds a day.”

Data show that two of Seipelt Elementary’s score fluctuations — the fifth-graders’ 48-percentile decline in 2007 and the 41-percentile climb in sixth grade a year later — registered more than 3 standard deviations. School Principal Melissa Borger attributed the fifth-grade decline to inexperienced teachers and the sixth-grade jump to a teacher with 25 years’ experience teaching math.

School and district officials said they saw no reason to suspect that any of their teachers cheated before 2010. Robert Farrell, superintendent of the Milford school district, said in an interview with The Cincinnati Enquirer, a partner of USA Today on this project, that he considered Mueller’s transgression a one-time event. Mueller did not respond to requests for comment from the Enquirer.

The 66 fifth-graders who used Mueller’s study guides had to retake their math and science tests, at a total cost of $3,300. They passed with high marks.

Cause for celebration

Dramatic test improvements are usually causes for celebration.

That’s because of the increasingly high stakes attached to the tests required under the federal No Child Left Behind (NCLB) law. Although most school districts retain the power to hire and fire teachers, 10 states now require that student scores be the main criterion in teacher evaluations. Some states and districts reward educators for raising scores; a teacher may earn a bonus of as much as $25,000 in Washington, D.C., if his or her students’ scores climb.

NCLB also puts principals’ jobs on the line if students’ scores don’t improve. Most of the 130 Detroit public schools closed since 2005 were cited for having low test scores.

The Obama administration has begun doling out extra money to the states that tie teacher evaluations to test scores. At the same time, NCLB’s harshest penalties for underachieving schools and teachers are about to kick in: By 2014, the law dictates, 100% of public school students must be “proficient” in math and reading. If not, a school can face replacement of its entire staff.

Given the mounting pressure on teachers, principals and superintendents to produce high scores, “no one has incentives to vigorously pursue” testing irregularities, says Gregory Cizek, a professor at the University of North Carolina-Chapel Hill who studies cheating. “In fact, there’s a strong disincentive.”

Fremer, the Caveon executive, says the idea of investigating high scores seems silly to most educators. “Scores are going up? Good. Scores are going down? Bad. And you want me to investigate when scores are going up? What’s the matter with you?”

Investigations can be time-consuming and expensive. In Michigan, the average cost for an investigation ranges from “several hundred to a few thousand dollars,” says Joseph Martineau, director of the Office of Educational Assessment and Accountability. He told The Detroit Free Press, another partner of USA Today, that the costs include hiring “independent investigators who are former school administrators and have been trained in police investigation methods.”

John Boivin, administrator of California’s standardized testing program, says his state once conducted random test audits at 150 to 200 schools a year. California dropped the audits two years ago because of record budget deficits. And the state no longer collects data on which schools show unusually high rates of erasures on answer sheets, sometimes a clue, experts say, that either students or school officials might be cheating. Total savings: $105,000.

Even when suspicious scores are investigated, it can be hard to identify a culprit unless someone confesses. Ed Roeber, head of Michigan’s testing program for nearly 20 years, says evidence of cheating is almost always circumstantial. “It’s very difficult to prove,” Roeber says. “Often what you end up with is the feeling that something might have happened, but I don’t know for sure and, even if I did, I couldn’t prove it in a court of law.”

That result frustrated him, Roeber says. “It made me angry because you were cheating kids. You’re not finding out if they need help. You’re painting this picture that is incorrect.”

‘We earned it’

School officials often attribute improvements in scores to inspired teaching, curriculum changes tailored to the tests, more emphasis on basics such as math and reading or other innovations. For example:

• In Pico Rivera, Calif., a Los Angeles suburb, students at Montebello Gardens Elementary School jumped from the 17th percentile statewide in second-grade math in 2005 to the 85th percentile in third grade a year later. Similarly, second-graders scored in the 40th percentile in math in 2007, then jumped as third-graders to the 93rd percentile in 2008. In both cases, the gains were lost the following year.

“I can look at my faculty and I know unequivocally they are following the directions in a moral, ethical manner,” says Norma Perez, the school principal. “We will never, ever have anyone from the state question, ‘How is it that you did well?’ We did it because we earned it.” She attributes the school’s success to “optimism and hard work” and a team of “incredibly dedicated teachers.”

She adds, “It’s never worth even attempting to do something that’s not fair because it’s not fair to the children.”

Montebello Unified School District Superintendent Art Revueltas chuckles at the suggestion that the school’s gains could be too good to be true. “Technically, I’ve been told, that’s impossible, but they did it,” he says. “They got everybody to learn.” He adds that Perez’s students, like others in the district, take many practice tests.

“Yeah, we do that. Everyone does that in America,” he says. “There’s no secret about the test, it’s a standards-based test. We teach to the test.”

• In Colorado, third-graders at Aragon Elementary School in the Fountain-Fort Carson School District scored in the 31st percentile on the state math exam in 2005. A year later, Aragon’s fourth-graders achieved at the 97th percentile in math, showing greater improvement than any other group of students in the state. The next year, Aragon fifth-graders dropped to the 47th percentile in math.

David Roudebush, the district’s assistant superintendent, says the district recognized the spike as “a statistical anomaly” but adds that no “real conclusions can be drawn due to multiple factors that can affect results.”

• In central Phoenix, seventh-graders at Friendly House Academia del Pueblo, a K-8 charter school where nearly all students qualify for federally subsidized lunches, showed remarkable gains in math in 2009. Their scores soared from the 15th percentile in sixth grade to the 92nd percentile in seventh.

Principal Ximena Doyle attributes the seventh-graders’ gains mainly to a math teacher who has since moved to another school. The teacher had a good understanding of the academic standards that are tested by the state, Doyle says, and focused instruction on those areas. “It’s my belief those are legitimate scores,” she says.

The achievements at Friendly House didn’t entirely hold up the next year. The students’ mean-scale scores dropped from 586 to 419 in 2010, when they were eighth-graders. The percentage who passed the state math test also fell, from 88% to 43%.

Friendly House’s seventh-grade math gains were 5.77 standard deviations, a change so rare that it could happen by chance less than once in 100 million times.
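How rare is rare? A rough check is possible if one assumes, purely for illustration, that score gains follow a normal (bell-curve) distribution. The article does not say which distribution the analysts used, and the helper name below is invented for this sketch.

```python
import math

def upper_tail(z):
    """Probability that a standard normal value is at least z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Compare the 3-standard-deviation screening threshold with the
# 5.77-standard-deviation gain reported for Friendly House.
for z in (3.0, 5.77):
    p = upper_tail(z)
    print(f"{z} standard deviations: about 1 chance in {1 / p:,.0f}")
```

Under that assumption, a gain of 3 standard deviations has a chance probability of a little more than 1 in 1,000, and a gain of 5.77 standard deviations falls below 1 in 100 million.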

Given that most students have fairly stable scores from one year to the next, gains of 3 standard deviations or more are so unusual that Cizek, the testing specialist in North Carolina, likens them to “a miracle in a bottle.”

Cizek says even the most powerful curriculum changes or educational interventions usually produce gains of only a quarter or half a standard deviation. “None of our educational interventions — none — produce 3 standard deviation gains,” he says. “We’ve never seen that.”

If some educators believe they can produce such dramatic gains, “they should be selling that, they should be taking it on the road,” he adds. “They should be doing anything they can to get the word out, because we’ve never seen anything like that.”

Zeroing in on students

Because year-to-year swings in a classroom’s test scores reveal only part of a complicated picture, state or district investigators usually look for other data, especially the scores of individual students, to make sure that the students tested in one grade were the same ones tested the year before. An influx of new students, whether caused by changing school boundary lines or the construction of a new housing development, could cause some changes in scores.

For that reason, USA Today used open-records laws to secure the test scores of millions of individual students from four states — Ohio, Michigan, Florida and Arizona — and Washington, D.C., then analyzed them to determine long-term trends for those individuals. Because of privacy laws, the states encrypted the information, assigning to each student an ID number that could not be traced to the actual student but allowed the newspaper to track the student’s test scores through the years.
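Tracking the same students across years amounts to matching records on the anonymized ID. The sketch below shows one simple way that could be done. The IDs, percentiles and function names are hypothetical and do not reflect the states' actual data formats.

```python
def matched_students(earlier, later):
    """Pair each student's scores from two years, keyed by anonymized ID.
    Students tested in only one of the two years are dropped."""
    shared_ids = earlier.keys() & later.keys()
    return {sid: (earlier[sid], later[sid]) for sid in sorted(shared_ids)}

def count_above(pairs, cutoff, year_index):
    """Count matched students above a percentile cutoff in one year
    (year_index 0 is the earlier year, 1 is the later year)."""
    return sum(1 for scores in pairs.values() if scores[year_index] > cutoff)

# Hypothetical statewide percentiles keyed by made-up anonymized IDs.
grade3 = {"A17": 8, "B42": 17, "C03": 55, "D88": 91, "E51": 30}
grade4 = {"A17": 90, "B42": 96, "C03": 60, "D88": 93, "F09": 99}

pairs = matched_students(grade3, grade4)
print(len(pairs), "students tested in both years")
print(count_above(pairs, 90, 0), "above the 90th percentile in third grade")
print(count_above(pairs, 90, 1), "above the 90th percentile in fourth grade")
```

Restricting the comparison to students present in both years keeps turnover, such as the new arrival "F09" in this invented data, from explaining a jump in a grade's scores.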

In some cases, where there was little if any turnover at the school, the individual student data reinforced the notion that exceptionally large score gains are statistically rare.

That was the case at Portsmouth West Elementary School in the Appalachian foothills of southern Ohio, a school that had long ranked near the bottom of the state in test scores. Then a few years ago, its fourth-graders began scoring with the state’s elite.

In 2007, third-grade math scores at the school ranked in the 21st percentile among the state’s 1,961 schools. In 2008, when those students were in fourth grade, scores soared to the 94th percentile statewide — only to drop again in the fifth grade a year later.

The fourth-grade scores were triggered by unusual jumps in the scores of dozens of students. One Portsmouth West student went from the 8th percentile in the third grade to the 90th in the fourth. Another jumped from the 17th percentile to the 96th. Still others went from the 12th percentile to the 90th, from the 10th to the 88th, from the 17th to the 93rd, from the 24th to the 95th and so on.

In all, 40 of the 115 students who took the state math test both years scored above the 90th percentile in the fourth grade. The year before, in third grade, only one student had. These big individual gains evaporated in the fifth grade along with the classwide improvement.

That pattern — low third-grade scores, a sudden jump in fourth grade, then a dramatic decline in fifth — has been repeated every year since.

Patricia Ciraso, superintendent of Ohio’s Washington-Nile School District, attributes the scores at Portsmouth West, the district’s only elementary school, to “a combination of things.” She says the school started grouping fourth-graders based on ability — top students together, for example — and also added an intervention specialist to address academic problems early.

“We did not cheat,” Ciraso says.

In 2008, Portsmouth West’s fourth-graders outperformed their counterparts at every other school in the county and nearly every affluent district in Ohio. They did better, for example, than fourth-graders at the ultrawealthy suburban Cincinnati school district of Indian Hill, which had an average family income of $278,432 in 2008, according to tax records, and perpetually ranks near the top in both income and academic performance in Ohio.

By contrast, the average income in Portsmouth West’s school district was $39,820 in 2008, reports the Ohio Department of Taxation.

Ciraso and her staff examined individual test scores after being contacted by USA Today, and she acknowledges some oddities that she said need to be reviewed further. So far, there has been no outside investigation.

Investigating irregularities

From Arizona, California and Florida, USA Today obtained testing “incident reports” about irregular occurrences on test day that violated security protocols. In California alone, about 112 such incidents were investigated by school districts and the results were reported to the state during the past two years.

In one California case, the cheating was traced to administrators at a group of charter schools renowned for their high scores, documents show. An investigation by the Los Angeles Unified School District concluded that John Allen, the founder and executive director of the six-school Crescendo charter system, had orchestrated widespread cheating on 2010 tests.

According to documents from the State Superintendent of Public Instruction, Allen gave copies of upcoming state tests to his principals and ordered them to prepare students by using the actual test questions. Teachers blew the whistle, and the charters’ own board put Allen on a six-month unpaid leave. Each principal was given a 10-day suspension. All 600 tests taken by second- through fifth-graders were invalidated.

Jose Cole-Gutierrez, director of the Los Angeles district’s charter school division, said it was the most ambitious cheating investigation he had ever undertaken, affecting the largest number of students. The case boiled over last week after inquiries from USA Today and the Los Angeles Times. The Los Angeles Board of Education overruled staff recommendations and voted to shut down all six schools at the end of the school year. Monica Garcia, the board president, said the schools’ high scores had been called into question by the cheating incident.

USA Today’s analysis showed that three of the schools had shown extreme fluctuations in scores. Scores for one group of students at Crescendo Charter Academy, in Gardena, jumped from the 40th percentile to the 94th statewide between third and fourth grades, a gain of 3.4 standard deviations.

Not all investigations end so conclusively.

In Washington, D.C., 52 schools, a little under a third of schools in the school district, have been flagged at least once in the past three years because of unusually high erasure rates on standardized tests. At one school, Stanton Elementary, erasure rates in fourth grade in 2009 were at least 10 times the district average; on a math test administered to 20 students, 345 answers were changed — 97% of them to the correct answer. That was the same year the fourth-graders’ achievement scores skyrocketed from the bottom to near the top of D.C.’s fourth-graders. Those scores were out of line with those for the rest of the school; Stanton did not make “adequate yearly progress” in 2009.

Documents obtained through a Freedom of Information Act request show that Stanton was one of eight public schools investigated in D.C. in 2009. The investigation found no testing violation at Stanton but concluded that one teacher — whom officials would not identify — should no longer proctor tests. Documents did not reveal why.

Stanton no longer is classified as a traditional D.C. campus. Since last fall, it has been operated as a charter school by Philadelphia-based Scholar Academies, and its faculty has mostly turned over.

A lack of accountability

The reluctance to investigate spikes in scores is especially marked at schools that have been struggling to meet the requirements of NCLB.

That was the case at Charles W. Duval Elementary School in Gainesville, Fla., where math scores in the fifth grade rose sharply year after year. In 2005, the school’s fourth-graders finished near the bottom of the state, in the 5th percentile. The next year, as fifth-graders, they scored in the 79th percentile. Duval fifth-grade classes repeated the feat the next two years and, by 2008, they were testing in the 91st percentile.

Then, in 2009, Duval’s test scores took a nose dive. Math scores as well as reading scores crashed, and the fifth-graders finished in the 1st percentile, at the very bottom. The school overall dropped in state rankings from an “A” school — among the best in the state — to an “F” school.

Only then did the state step in, but not to investigate the high scores or how they were achieved. Instead, the state sent education specialists to help the school get back on track.

Sandy Hollinger, a deputy superintendent of the Alachua County school district, attributes the drop in Duval’s scores to the sudden death of a beloved math teacher and the district’s failure to screen adequately for kids who needed extra help in reading.

After nearly two years of state intervention, the school has improved its grade from an “F” to a “D.”

When cheating is proven, often no one is held accountable. At Mount Pleasant Standard Based Middle School, a Tampa charter school, state statisticians in 2006 invalidated virtually all of the school’s reading tests — about 100 — after an analysis showed a suspicious number of erasures. Principal Yolonda Waitress says investigators interviewed a few teachers — mostly to make sure that everyone had signed a security agreement.

The district initially had ruled there was no cheating, but reversed itself after the state concluded that “evidence indicates that the test books were most likely tampered with following student testing.”

That finding would seem to suggest school officials, either teachers or administrators, might be responsible. Both the state and district told the school to turn over teachers’ names for possible discipline, but no one was disciplined, Waitress says.

On its website, the school now boasts that in 2010 it made adequate yearly progress and was given an “A” grade. The district, however, administers tests for the school.

As for the events of 2006, district investigator John Hilderbrand concluded in an e-mail obtained by USA Today that “something stranged (sic) happened” then. He did not say what or who did it.

Toppo reported from New York; Gillum from Washington, D.C.; Upton from Gainesville, Fla.; and Amos, who also reports for The Cincinnati Enquirer, from Ohio. Contributing: USA TODAY reporters Dennis Cauchon in Ohio and Marisol Bello in Washington, D.C.; Detroit Free Press reporters Chastity Pratt Dawsey and Peggy Walsh-Sarnecki and database editor Kristi Tanner-White in Michigan; The Arizona Republic reporter Anne Ryman in Phoenix; reporter Nancy Mitchell of ednewscolorado.org in Colorado; and Jennifer Oldham and April Dembosky of the Hechinger Institute at Columbia University in California.