New York City’s teacher evaluation system has serious weaknesses, according to new research that raises questions about the reliability of classroom observations and test scores as measures of high-quality teaching. And at least one of the researchers’ recommendations suggests that the city’s plan is headed in the wrong direction.

In the first study, released this week, researchers at the Brown Center on Education Policy looked at observation data from four school districts between 2009 and 2012. They affirmed that observations help teachers improve their practice, but their data showed that teachers of students with higher incoming achievement levels were typically rated higher than those with lower-performing students. They write:

Put simply, the current observation systems are patently unfair to teachers who are assigned less able and prepared students. When this bias is combined with a system design in which school-wide value-added contributes significantly to individual teachers’ overall evaluation score, the result is an unintended but strong incentive for good teachers to avoid teaching low-performing students and to avoid teaching in low-performing schools.

Researchers Grover Whitehurst, Matthew Chingos, and Katharine Lindquist recommend that districts adjust observation scores for differences in student demographics. They also suggest that, in some instances, outsiders with a more independent perspective be brought in to replace the principals and assistant principals who typically conduct observations.

If anything, New York City is moving away from the outsider model. A new evaluation deal eliminates the role of “independent validators,” who observe teachers after they receive “ineffective” ratings, and replaces them with teachers who are currently working in the city school system.

The second study, a peer-reviewed analysis released Tuesday by the American Educational Research Association, reviewed the value-added scores for 327 fourth- and eighth-grade mathematics and English teachers from six school districts, including New York City. The authors found that some highly regarded teachers’ students scored poorly on tests, and some poorly regarded teachers’ students scored especially well. The Washington Post writes:

“The concern is that these state tests and these measures of evaluating teachers don’t really seem to be associated with the things we think of as defining good teaching,” said Morgan S. Polikoff, an assistant professor of education at the Rossier School of Education at the University of Southern California. He worked on the analysis with Andrew C. Porter, dean and professor of education at the Graduate School of Education at the University of Pennsylvania.

“We need to slow down or ease off completely for the stakes for teachers, at least in the first few years, so we can get a sense of what do these things measure, what does it mean,” Polikoff said. “We’re moving these systems forward way ahead of the science in terms of the quality of the measures.”

This isn’t a surprise to anyone familiar with New York City’s use of value-added growth scores from 2007 to 2010. A New York University researcher studied the scores and found widespread errors in the data and instability among individual teachers’ scores from one year to the next.