Growing pains

Teachers aren’t the only ones facing a new evaluation system

PHOTO: Tajuana Cheshier/ Chalkbeat TN
Kenneth Woods and his daughters Breanna Rosser and Taylor Woods reviewed 12 powerful words with sixth-grade language arts teacher Patricia Hervey.

Since a landmark piece of school reform legislation passed in 2010, teacher evaluations have become a hot topic in Colorado education circles. What’s lesser-known is that the system ushered in by Senate Bill 10-191 extends to thousands of school employees who don’t work in classrooms, but rather counseling offices, health rooms and other school spaces.

Starting this year, the law requires annual evaluations of counselors, psychologists, therapists, nurses and other staff labeled “Specialized Service Professionals” or SSPs. These employees fall into nine categories, number about 4,700 and make up around 9 percent of the licensed school workforce.

As with the introduction of statewide teacher evaluations last year, the evaluation process for SSPs has brought some predictable bumps in the road: anxiety, confusion and a steep learning curve.

“The details of it are huge,” said Anne Hilleman, director of exceptional student services for Montrose and Olathe Schools. “It’s a bear.”

It’s a sentiment familiar to Katy Anthes, executive director of educator effectiveness at the Colorado Department of Education.

“There was a big fear factor with Senate Bill 191,” she said.

Despite concerns, there’s a sense among some SSP staff that the new system offers meaningful professional feedback and concrete avenues for improvement.

“I was kind of excited to have a useful evaluation tool,” said school nurse Jackie Vialpando, who came to the Widefield School District last year after working in other districts. “Before, we were kind of evaluated like a teacher because they didn’t know what to do with us… Most of the time I wasn’t evaluated.”

Widefield is one of 19 sites—mostly school districts and BOCES—that piloted SSP evaluations last year. So was the Montrose district. Previously, evaluations there were conducted using generic rubrics that didn’t always fit well with the employees’ responsibilities.

Under the state model system, which districts can use to guide the evaluation process, there are rubrics defining high-quality practice for all nine SSP categories.

“These rubrics [are] very detailed, very specific to each specialty area, which in terms of morale had to feel good to folks,” said Hilleman.

“Before last year, we were like, ‘Which rubric do you put an audiologist on?’”

New model includes student outcomes

While many states have implemented evaluation systems for teachers and principals, mandating evaluations for other licensed school personnel is less common.

SSP Numbers in Colorado

    • School counselors: 1,617
    • Speech language pathologists: 1,065
    • Psychologists: 738
    • Social workers: 461
    • School nurses: 357
    • Occupational therapists: 325
    • Physical therapists: 79
    • Audiologists: 67
    • Orientation and mobility specialists: 12
*Numbers are from the 2012-13 school year

“We’re probably in a small grouping [of states] that includes the specialized services professionals,” said Anthes. “Our state was pretty all-encompassing and comprehensive when the law said all licensed personnel must be evaluated.”

By the end of this school year, Colorado’s SSP staff will earn one of four final ratings: ineffective, partially effective, effective or highly effective. Eventually, the ratings will be posted publicly in the aggregate, but individual employee ratings will not be available.

While all Colorado SSPs must be evaluated this year, districts do have some leeway in how they come up with the final rating. They can choose to weigh only professional practice scores—those based on the SSP rubrics—in the final rating. That will change next year when 50 percent of the final rating must include “measures of student outcomes.”

Those outcomes, usually three to four different measures, are defined by each district and will vary by SSP category. For example, nurses may be asked to ensure that a certain percentage of asthmatic students can demonstrate the proper use of inhalers. Meanwhile, a counselor may be judged on students’ acquisition of knowledge after a social skills program.

In some districts, student outcomes measures may include things like state test scores. That may sound counterintuitive since SSP staff don’t provide academic instruction, but the rationale is that all school staff have ownership of student achievement.

“Interestingly, a lot of the SSPs are including some portion of student growth in the collective measure,” said Anthes. “Kind of as a nod to saying, ‘We’re all supporting students. We’re all contributing to the environment that helps them learn.’”

In Widefield, state test scores will count for 5 percent of SSP evaluations.

“They want us to have buy-in and I agree with that 100 percent,” said Vialpando. “We need to make sure the kids are successful too.”

She added, “I’m glad it’s 5 percent and not 50 percent.”

Adjusting to a new system

While district administrators have always had some role in evaluating SSP staff, most agree that the new system is far more time-consuming. Hilleman, who evaluates SSP staff as well as other employees, said the new system has tripled her evaluation workload.

“You are more frequently engaged in coaching and evaluative conversations with people,” she said.

Overall, she believes the process is valuable, but given the time commitment wonders if “rock star” employees truly need annual evaluations.

James McGhee, assistant director of special education in Widefield, said the district’s old process, which entailed a written narrative about the employee’s strengths and weaknesses, took about an hour to complete. Not only do the new write-ups take 1.5 to 2 hours each, but the district also opted to move from one formal evaluation a year to two, though that’s not required by the state.

“It’s a big shift,” he said, one that was rough at first but ultimately more informative for staff.

“The feedback is more specific in helping them grow as professionals.”

SSP staff have noticed the increased time commitment too, but some say the close examination of their day-to-day work is welcome.

“It’s a chance to be acknowledged and validated for what we do as special service providers,” said Christine Gray, a counselor at Aspen Elementary School.

Working outside the classroom sometimes gives SSPs the sense, “You’re an ‘other,’ a little out of the mainstream,” she said.

The evaluation process, time-consuming though it is, helps remedy that feeling. For Gray, the new system has also meant more ongoing reflection. Under the previous system, she’d usually turn her attention to her evaluation for a day, maybe two.

Now, she says she can’t quantify the minutes and hours she spends preparing for, having, or reflecting on her evaluation because it’s woven throughout her job.

“It’s not something you put to bed anytime,” she said. “Hopefully it’s something you carry with you and it guides your practice.”

Moderating expectations

Aside from the extra time investment, many SSP employees find the new system challenging because earning top ratings on the professional practice half of the evaluation is tougher than under most previous evaluation systems.

Under the state model system, SSP staff can earn one of five ratings for professional practice: exemplary, accomplished, proficient, partially proficient and basic. While “proficient” meets state standards, it can seem like a mediocre rating to employees who are used to superlatives.

Vialpando said she earned “exemplary” on a few standards last year, but overall would have fallen into the proficient category.

“I’ve always thought of myself as better than proficient… so that was hard for me to take,” she said.

One of the criteria that distinguish “proficient” from “accomplished” or “exemplary” for all types of SSP staff is whether they move from carrying out required duties to empowering students, parents or teachers around certain professional goals. For example, a proficient employee might make a recommendation to a student, whereas an exemplary employee prompts the student to act on the recommendation.

“That is a really unique piece of all of our rubrics…the same things happen with principal and assistant principal rubrics,” said Anthes. “When you move to accomplished or exemplary it’s what has the work you’ve done enabled others to do?”

Hilleman said while her SSP staff all scored well into proficiency based on the rubric, few were exemplary.

“I did really have to frontload, especially with my overachievers… ‘Don’t feel like this is a ding.’”

Impacting personnel decisions

With many SSP staff employed on single-year contracts, their employment status may depend more on student enrollment and district needs than evaluation ratings. Still, those not on single-year contracts who score below effective for two years in a row can lose non-probationary status. Technically, this could make it easier for districts to dismiss them.

“It is easier to fire you if you don’t have non-probationary status,” said Anthes. “Whereas if you had non-probationary status… it might take a district longer to remove you.”

No SSPs will lose non-probationary status until the end of the 2016-17 school year at the earliest, since this year is considered a hold-harmless year. Even then, districts will not be required to dismiss partially effective or ineffective employees, though administrators will have that option.

Despite the potential influence of SSP evaluations on job security, Anthes said, “That’s really not the main point of the law…We really try to emphasize…it’s about professional growth.”

As always, she said, districts should use evaluation ratings for personnel decisions, such as determining what professional development to offer, how to draft professional growth plans or where to place staff.

“Every professional in public schools deserves meaningful practice.”

a high-stakes evaluation

The Gates Foundation bet big on teacher evaluation. The report it commissioned explains how those efforts fell short.

PHOTO: Brandon Dill/The Commercial Appeal
Sixth-grade teacher James Johnson leads his students in a gameshow-style lesson on energy at Chickasaw Middle School in 2014 in Shelby County. The district was one of three that received a grant from the Gates Foundation to overhaul teacher evaluation.

Barack Obama’s 2012 State of the Union address reflected the heady moment in education. “We know a good teacher can increase the lifetime income of a classroom by over $250,000,” he said. “A great teacher can offer an escape from poverty to the child who dreams beyond his circumstance.”

Bad teachers were the problem; good teachers were the solution. It was a simplified binary, but the idea and the research it drew on had spurred policy changes across the country, including a spate of laws establishing new evaluation systems designed to reward top teachers and help weed out low performers.

Behind that effort was the Bill and Melinda Gates Foundation, which backed research and advocacy that ultimately shaped these changes.

It also funded the efforts themselves, specifically in several large school districts and charter networks open to changing how teachers were hired, trained, evaluated, and paid. Now, new research commissioned by the Gates Foundation finds scant evidence that those changes accomplished what they were meant to: improve teacher quality or boost student learning.  

The 500-plus page report by the Rand Corporation, released Thursday, details the political and technical challenges of putting complex new systems in place and the steep cost — $575 million — of doing so.

The post-mortem will likely serve as validation to the foundation’s critics, who have long complained about Gates’ heavy influence on education policy and what they call its top-down approach.

The report also comes as the foundation has shifted its priorities away from teacher evaluation and toward other issues, including improving curriculum.

“We have taken these lessons to heart, and they are reflected in the work that we’re doing moving forward,” the Gates Foundation’s Allan Golston said in a statement.

The initiative did not lead to clear gains in student learning.

At the three districts and four California-based charter school networks that took part in the Gates initiative — Pittsburgh; Shelby County (Memphis), Tennessee; Hillsborough County, Florida; and the Alliance College-Ready, Aspire, Green Dot, and Partnerships to Uplift Communities networks — results were spotty. The trends over time didn’t look much better than at similar schools in the same state.

Several years into the initiative, there was evidence that it was helping high school reading in Pittsburgh and at the charter networks, but hurting elementary and middle school math in Memphis and among the charters. In most cases there were no clear effects, good or bad. There was also no consistent pattern of results over time.

A complicating factor here is that the comparison schools may also have been changing their teacher evaluations, as the study spanned from 2010 to 2015, when many states passed laws putting in place tougher evaluations and weakening tenure.

There were also lots of other changes going on in the districts and states — like the adoption of Common Core standards, changes in state tests, the expansion of school choice — making it hard to isolate cause and effect. Studies in Chicago, Cincinnati, and Washington D.C. have found that evaluation changes had more positive effects.

Matt Kraft, a professor at Brown who has extensively studied teacher evaluation efforts, said the disappointing results in the latest research couldn’t simply be chalked up to a messy rollout.

These “districts were very well poised to have high-quality implementation,” he said. “That speaks to the actual package of reforms being limited in its potential.”

Principals were generally positive about the changes, but teachers had more complicated views.

From Pittsburgh to Tampa, Florida, the vast majority of principals agreed at least somewhat that “in the long run, students will benefit from the teacher-evaluation system.”

Source: RAND Corporation

Teachers in district schools were far less confident.

When the initiative started, a majority of teachers in all three districts tended to agree with the sentiment. But several years later, support had dipped substantially. This may have reflected dissatisfaction with the previous system — the researchers note that “many veteran [Pittsburgh] teachers we interviewed reported that their principals had never observed them” — and growing disillusionment with the new one.

Majorities of teachers in all locations reported that they had received useful feedback from their classroom observations and changed their habits as a result.

At the same time, teachers in the three districts were highly skeptical that the evaluation system was fair — or that it made sense to attach high-stakes consequences to the results.

The initiative didn’t help ensure that poor students of color had more access to effective teachers.

Part of the impetus for evaluation reform was the idea, backed by some research, that black and Hispanic students from low-income families were more likely to have lower-quality teachers.  

But the initiative didn’t seem to make a difference. In Hillsborough County, inequity expanded. (Surprisingly, before the changes began, the study found that low-income kids of color actually had similar or slightly more effective teachers than other students in Pittsburgh, Hillsborough County, and Shelby County.)

Districts put in place modest bonuses to get top teachers to switch schools, but the evaluation system itself may have been a deterrent.

“Central-office staff in [Hillsborough County] reported that teachers were reluctant to transfer to high-need schools despite the cash incentive and extra support because they believed that obtaining a good VAM score would be difficult at a high-need school,” the report says.

Evaluation was costly — both in terms of time and money.

The total direct cost of all aspects of the program, across several years in the three districts and four charter networks, was $575 million.

That amounts to between 1.5 and 6.5 percent of district or network budgets, or a few hundred dollars per student per year. Over a third of that money came from the Gates Foundation.

The study also quantifies the strain of the new evaluations on school leaders’ and teachers’ time as costing upwards of $200 per student, nearly doubling the price tag in some districts.

Teachers tended to get high marks on the evaluation system.

Before the new evaluation systems were put in place, the vast majority of teachers got high ratings. That hasn’t changed much, according to this study, which is consistent with national research.

In Pittsburgh, in the initial two years, when evaluations had low stakes, a substantial number of teachers got low marks. That drew objections from the union.

“According to central-office staff, the district adjusted the proposed performance ranges (i.e., lowered the ranges so fewer teachers would be at risk of receiving a low rating) at least once during the negotiations to accommodate union concerns,” the report says.

Morgaen Donaldson, a professor at the University of Connecticut, said the initial buy-in followed by pushback isn’t surprising, pointing to her own research in New Haven.

To some, aspects of the initiative “might be worth endorsing at an abstract level,” she said. “But then when the rubber hit the road … people started to resist.”

More effective teachers weren’t more likely to stay teaching, but less effective teachers were more likely to leave.

The basic theory of action behind evaluation changes is to get more effective teachers into the classroom and keep them there, while getting less effective ones out or helping them improve.

The Gates research found that the new initiatives didn’t get top teachers to stick around any longer. But there was some evidence that the changes made lower-rated teachers more likely to leave. Less than 1 percent of teachers were formally dismissed from the places where data was available.

After the grants ran out, districts scrapped some of the changes but kept a few others.

One key test of success for any foundation initiative is whether it is politically and financially sustainable after the external funds run out. Here, the results are mixed.

Both Pittsburgh and Hillsborough have ended high-profile aspects of their programs: the merit pay system and the use of peer evaluators, respectively.

But other aspects of the initiative have been maintained, according to the study, including the use of classroom observation rubrics, evaluations that use multiple metrics, and certain career-ladder opportunities.

Donaldson said she was surprised that the peer evaluators didn’t go over well in Hillsborough. Teachers unions have long promoted peer-based evaluation, but district officials said that a few evaluators who were rude or hostile soured many teachers on the concept.

“It just underscores that any reform relies on people — no matter how well it’s structured, no matter how well it’s designed,” she said.

Correction: A previous version of this story stated that about half of the money for the initiative came from the Gates Foundation; in fact, the foundation’s share was 37 percent or about a third of the total.

evaluating evaluation

Teaching more black or Hispanic students can hurt observation scores, study finds

Thomas Barwick | Getty Images

A teacher is observed in her first period class and gets a low rating; in her second period class she gets higher marks. She’s teaching the same material in the same way — why are the results different?

A new study points to an answer: the types of students teachers instruct may influence how administrators evaluate their performance. More low-achieving, black, Hispanic, and male students lead to lower scores. And that phenomenon hurts some teachers more than others: Black teachers are more likely to teach low-performing students and students of color.

Separately, the study finds that male teachers tend to get lower ratings, though it’s not clear if that’s due to differences in actual performance or bias.

The results suggest that evaluations are one reason teachers may be deterred from working in classrooms where students lag farthest behind.

The study, conducted by Shanyce Campbell at the University of California, Irvine, analyzed teacher ratings compiled by the Measures of Effective Teaching Project, an effort funded by the Bill and Melinda Gates Foundation. (Gates is also a supporter of Chalkbeat.)

The paper finds that for every 25 percent increase in black or Hispanic students taught, there was a dip in a teacher’s rating similar to the difference in performance between a first- and second-year teacher. (Having more low-performing or male students had a slightly smaller effect.)

That’s troubling, Campbell said, because it means that teachers of color — who most frequently work with students of color — may not be getting a fair shot.

“If evaluations are inequitable, then this further pushes them out,” Campbell said.

The findings are consistent with previous research that shows how classroom evaluations can be biased by the students teachers serve.

Cory Cain, an assistant principal and teacher at the Urban Prep charter network in Chicago, said he and his school often grapple with questions of bias when trying to evaluate teachers fairly. His school serves only boys, and its students are predominantly black.

“We’re very clear that everyone is susceptible to bias. It doesn’t matter what’s your race or ethnicity,” he said.

While Cain is black, that doesn’t mean he doesn’t see how black boys are portrayed in the media, he said. He also knows that teachers are often nervous they will do poorly on their evaluations if students are misbehaving or struggling with the content on a given day, since it can be difficult for observers to fully assess their teaching in short sessions.

The study can’t show why evaluation scores are skewed, but one potential explanation is that classrooms appear higher-functioning when students are higher-achieving, even if that’s not because of the teacher. In that sense, the results might not be due to bias itself, but to conflating student success with teacher performance.

Campbell said she hopes her findings will add nuance to the debate over the best ways to judge teachers.

One idea that the study floats to address the issue is an adjustment of evaluation scores based on the composition of the classroom, similar to what is done for value-added scores, though the idea has received some pushback, Campbell said.

“I’m not saying we throw them both out,” Campbell said of classroom observations and value-added scores. “I’m saying we need to be mindful.”