First Person

Student growth percentiles are problematic too

Editor’s note: This piece is cross-posted from Bruce D. Baker’s School Finance 101 blog. Baker is a professor at the Rutgers University Graduate School of Education. He recently testified for the plaintiffs in the Lobato v. Colorado school funding trial.

In the face of all of the public criticism over the imprecision of value-added estimates of teacher effectiveness, and debates over whether newspapers or school districts should publish VAM estimates of teacher effectiveness, policymakers in several states have come up with a clever shell game. Their argument?

We don’t use VAM… ‘cuz we know it has lots of problems, we use Student Growth Percentiles instead. They don’t have those problems.

WRONG! WRONG! WRONG! Put really simply, as a tool for inferring which teacher is “better” than another, or which school outperforms another, SGP is worse, not better, than VAM. This is largely because SGP is simply not designed for this purpose. And those who are now suggesting that it is are simply wrong. Further, those who actually support using tools like VAM to infer differences in teacher quality or school quality should be most nervous about the newly found popularity of SGP as an evaluation tool.

To a large extent, the confusion over these issues was created by Mike Johnston, a Colorado State Senator who went on a road tour last year pitching the Colorado teacher evaluation bill and explaining that the bill was based on the Colorado Student Growth Percentile Model, not that problematic VAM stuff. Johnston naively pitched to legislators and policymakers throughout the country the claim that SGP is simply not like VAM (True) and that, therefore, SGP is not susceptible to all of the concerns that have been raised based on rigorous statistical research on VAM (Patently FALSE!). Since that time, Johnston’s rhetoric that SGP gets around the perils of VAM has been widely adopted by state policymakers in states including New Jersey, and these state policymakers’ understanding of SGP and VAM is hardly any stronger than Johnston’s.

This brings me back to my exploding car analogy. I’ve pointed out previously that if we lived in a society where pretty much everyone still walked everywhere, and then someone came along with this new automotive invention that was really fast and convenient, but had the tendency to explode on every third start, I think I’d walk. I use this analogy to explain why I’m unwilling to jump on the VAM bandwagon, given the very high likelihood of falsely classifying a good teacher as bad and putting their job on the line – a likelihood of misfire that has been validated by research. Well, if some other slick-talking salesperson then showed up at my door with something that looked a lot like that automobile and had simply never been tested for similar failures, leading the salesperson to claim that this one doesn’t explode (for lack of evidence either way), I’d still freakin’ walk! I’d probably laugh in his face first. Then I’d walk.

Origins of the misinformation aside, let’s do a quick walk-through of how and why, when it comes to estimating teacher effectiveness, SGP is NOT immune to the various concerns that plague value-added modeling. In fact, it is potentially far more susceptible to specific concerns such as the non-random assignment of students and the influence of various student, peer, and school-level factors that may ultimately bias ratings of teacher effectiveness.

What is a value-added estimate?

A value-added estimate uses assessment data in the context of a statistical model, where the objective is quite specifically to estimate the extent to which having a specific teacher or attending a specific school influences a student’s change in score from the beginning of the year to the end of the year – or period of treatment (in school or with the teacher). The best VAMs attempt to account for several prior years of test scores (to capture the extent to which having a certain teacher alters a child’s trajectory), the classroom-level mix of students, individual student background characteristics, and possibly school characteristics. The goal is to identify as accurately as possible the share of the student’s value-added that should be attributed to the teacher, as opposed to all that other stuff (a nearly impossible task).
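
To make that concrete, here is a minimal sketch in Python (with simulated data and the statsmodels library) of the kind of regression that sits underneath a simple VAM. The variable names, the single prior score, and the lone background covariate are illustrative assumptions of mine, not any state’s actual specification.

    # A minimal value-added sketch: current score regressed on prior score,
    # a student characteristic, and teacher indicators. The coefficients on
    # the teacher indicators are the "value-added" estimates. All data are
    # simulated; real VAMs use richer controls (multiple prior scores,
    # peer/classroom mix, school measures).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n_teachers, class_size = 20, 25
    teacher = np.repeat(np.arange(n_teachers), class_size)
    teacher_effect = rng.normal(0, 3, n_teachers)[teacher]
    prior_score = rng.normal(500, 50, teacher.size)
    low_income = rng.binomial(1, 0.4, teacher.size)      # illustrative covariate
    current_score = (0.8 * prior_score - 10 * low_income
                     + teacher_effect + rng.normal(0, 20, teacher.size))

    df = pd.DataFrame(dict(current=current_score, prior=prior_score,
                           low_income=low_income, teacher=teacher))

    # Teacher indicators enter as fixed effects; their estimated coefficients
    # (relative to the omitted teacher) are the value-added estimates.
    vam = smf.ols("current ~ prior + low_income + C(teacher)", data=df).fit()
    teacher_va = vam.params.filter(like="C(teacher)")
    print(teacher_va.sort_values().head())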

What is a Student Growth Percentile?

To oversimplify a bit, a student growth percentile is a measure of the change in a student’s performance relative to that of all students taking a given underlying test or set of tests. That is, the individual scores obtained on these underlying tests are used to construct an index of student growth, where the median student, for example, may serve as a baseline for comparison. Some students have achievement growth on the underlying tests that is greater than the median student’s, while others have growth from one test to the next that is less. The measure captures not how much the underlying scores changed, but how far the student moved within the mix of other students taking the same assessments, using a method called quantile regression to estimate how unusual it is for a child to fall in her current position in the distribution, given her past position in the distribution. For more precise explanations, see here.
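
For the statistically curious, here is a rough Python sketch of that idea using simulated scores and statsmodels’ quantile regression: fit a family of conditional quantiles of the current score given the prior score, then report which percentile band each student’s actual score lands in. This is a simplified illustration, not Betebenner’s actual implementation, which among other things uses spline bases and multiple prior years of scores.

    # A rough SGP illustration: fit conditional quantiles of the current score
    # given the prior score, then report the highest quantile each student's
    # observed score exceeds. Simulated data; the production SGP methodology
    # is considerably more elaborate.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 2000
    prior = rng.normal(500, 50, n)
    current = 0.85 * prior + rng.normal(0, 25, n)
    df = pd.DataFrame(dict(prior=prior, current=current))

    quantiles = np.arange(0.05, 1.0, 0.05)   # 5th, 10th, ..., 95th percentiles
    model = smf.quantreg("current ~ prior", df)
    preds = pd.DataFrame({q: model.fit(q=q).predict(df) for q in quantiles})

    # A student's growth percentile is (roughly) the highest conditional
    # quantile her observed current score exceeds, given her prior score.
    df["sgp"] = preds.lt(df["current"], axis=0).sum(axis=1) * 5
    print(df.head())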

So, on the one hand, we’ve got value-added models, or VAMs, which attempt to construct a model of student achievement and to estimate specific factors that may affect student achievement growth, including teachers and schools, ideally controlling for the prior scores of the same students, the characteristics of other students in the same classroom, and school characteristics. The richness of these various additional controls plays a significant role in limiting the extent to which one incorrectly assigns either positive or negative effects to teachers. Briggs and Domingue run various alternative scenarios to this effect here: http://nepc.colorado.edu/publication/due-diligence

On the other hand, we have a seemingly creative alternative for descriptively evaluating how one student’s performance over time compares to that of the larger group of students taking the same assessments. These growth measures can be aggregated to the classroom or school level to provide descriptive information on how a group of students grew in performance over time, on average, as a subset of a larger group. But these measures include no attempt at all to attribute that growth, or any portion of it, to individual teachers or schools – that is, to sort out the extent to which the growth is a function of the teacher, as opposed to a function of the mix of peers in the classroom.
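
And the aggregation step is nothing more than a descriptive summary. The toy Python sketch below (with made-up growth percentiles and hypothetical school labels of my own) shows the point: no student, peer, or school characteristics enter the calculation, so nothing in it attributes the growth to the school.

    # Aggregating student growth percentiles to a school-level median is purely
    # descriptive: no covariates, no model, no attribution. Toy data throughout.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(4)
    toy = pd.DataFrame({
        "school": rng.integers(0, 10, 1000),    # 10 hypothetical schools
        "sgp": rng.integers(1, 100, 1000),      # per-student growth percentiles
    })
    print(toy.groupby("school")["sgp"].median())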

What do we know about Value-added Estimates?

  • They are susceptible to non-random student sorting, even though they attempt to control for it by including a variety of measures of student-level characteristics, classroom-level and peer characteristics, and school characteristics. That is, teachers who persistently serve students who are more difficult to teach in unmeasured ways may be systematically disadvantaged.
  • They produce different results with different tests or different scaling of different tests. That is, a teacher’s rating based on her students’ performance on one test is likely to be very different from the same teacher’s rating based on her students’ performance on a different test, even in the same subject.
  • The resulting ratings have high rates of error for classifying teacher effectiveness, likely in large part due to error or noise in underlying assessment data and conditions under which students take those tests.
  • They are particularly problematic if based on annual assessment data, because these data fail to account for differences in summer learning, which vary widely by student backgrounds (where those students are non-randomly assigned across teachers).

What do we know, and what don’t we know, about SGP?

  • They rely on the same underlying assessment data as VAMs, but simply re-express performance in terms of relative growth rather than the underlying scores (or rescaled scores).
    • They are therefore susceptible to at least the same classification-error concerns
    • Therefore, it is reasonable to assume that using different underlying tests may result in different normative comparisons of one student to another
    • Therefore, they are equally problematic if based on annual assessment data
  • They do not even attempt (because it’s not their purpose) to address non-random sorting concerns or other student and peer level factors that may affect “growth.”
    • Therefore, we don’t even know how badly these measures are biased by these omissions. Researchers have not tested this, because it is presumed that these measures don’t attempt such causal inference.

Unfortunately, while SGPs are becoming quite popular in states including Massachusetts, Colorado, and New Jersey, and are quickly becoming the basis for teacher effectiveness ratings, there doesn’t appear to be a whole lot of research specifically addressing these potential shortcomings of SGPs. Actually, there’s little or none! This dearth of information may exist because researchers exploring these issues assume it to be a no-brainer: if VAMs suffer classification problems due to random error, then so too would SGPs based on the same data, and if VAMs suffer from omitted-variables bias, then SGPs would be even more problematic, since they include no other variables at all. Complete omission is certainly more problematic than partial omission, so why even bother testing it?

In fact, Derek Briggs, in a recent analysis comparing the attributes of VAMs and SGPs, explains:

We do not refer to school-level SGPs as value-added estimates for two reasons. First, no residual has been computed (though this could be done easily enough by subtracting the 50th percentile), and second, we wish to avoid the causal inference that high or low SGPs can be explained by high or low school quality (for details, see Betebenner, 2008).

As Briggs explains and as Betebenner originally proposed, SGP is essentially a descriptive tool for evaluating and comparing student growth, including descriptively evaluating growth in the aggregate. But, it is not by any stretch of the imagination designed to estimate the effect of the school or the teacher on that growth.

Again, Briggs, in the conclusion of his analysis of relative and absolute measures of student growth, explains:

However, there is an important philosophical difference between the two modeling approaches in that Betebenner (2008) has focused upon the use of SGPs as a descriptive tool to characterize growth at the student-level, while the LM (layered model) is typically the engine behind the teacher or school effects that get produced for inferential purposes in the EVAAS (value-added assessment system).

To clarify for non-researchers and non-statisticians: what Briggs means by “inferential purposes” is that SGPs, unlike VAMs, are not even intended to “infer” that the growth was caused by differences in teacher or school quality. Briggs goes further, explaining that overall, based on Colorado data, SGPs tend to be higher in schools with higher average achievement. Briggs explains:

These results suggest that schools serving higher achieving students tend to, on average, show higher normative rates of growth than schools serving lower achieving students. Making the inferential leap that student growth is solely caused by the school and sources of influence therein, the results translate to saying that schools serving higher achieving students tend to, on average, be more effective than schools serving lower achieving students. The correlations between median SGP and current achievement are (tautologically) higher, reflecting the fact that students growing faster show higher rates of achievement that is reflected in higher average rates of achievement at the school level.

Again, the whole point here is that it would be a leap, a massive freakin’ unwarranted leap, to assume a causal relationship between SGP and school quality without building the SGP into a model that more precisely attempts to distill that causal relationship (if any).

It’s a fun and interesting paper, and one of the few that addresses SGP and VAM together, but it intentionally does not explore the questions and concerns I pose here regarding how the descriptive results of SGP would compare to a complete value-added model at the teacher level, one intended for estimating teacher effects. Rather, Briggs compares the SGP findings only to a simple value-added model of school effects with no background covariates,[1] and finds the two to be highly correlated. Even then, Briggs finds that the school-level VAM is less correlated with initial performance level than is the SGP (a correlation discussed above).

So then, where does all of this techno-babble bring us? It brings us to three key points.

  1. First, there appears to be no analysis of whether SGP is susceptible to the various problems faced by value-added models, largely because credible researchers (those not directly involved in selling SGP to state agencies or districts) consider it a non-issue. SGPs were never meant, nor are they designed, to measure the causal effect of teachers or schools on student achievement growth. They are merely descriptive measures of relative growth and include no attempt to control for the plethora of factors one would need to control for when inferring causal effects.
  2. Second, and following from the first, it is certainly likely that if one did conduct these analyses, one would find that SGPs produce results that are much more severely biased than more comprehensive VAMs, and that SGPs are at least equally susceptible to problems of random error and other issues associated with test administration (summer learning, etc.).
  3. Third, and most importantly, policymakers are far too easily duped into making really bad decisions with serious consequences when it comes to complex matters of statistics and measurement. While SGPs are, in some ways, substantively different from VAMs, they sure as heck aren’t better or more appropriate for determining teacher effectiveness. That’s just wrong!

And this is only an abbreviated list of the problems that affect both VAM and SGP and that more severely compromise SGP. Others include spillover effects (the fact that one teacher’s scores are potentially affected by the other teachers on his or her team serving the same students in the same year) and the fact that only a handful of teachers (10 to 20 percent) could be assigned SGP scores at all, requiring differential contracts for those teachers and creating a disincentive to teach core content in elementary and middle grades. Bad policy is bad policy. And this shift in the conversation from VAM to SGP is little more than a smokescreen intended to substitute a potentially worse, but entirely untested, method for one whose serious flaws are now well known.

Note: To those vendors of SGP (selling this stuff to state agencies and districts) who might claim my critique above to be unfair, I ask you to show me the technical analyses, conducted by a qualified, fully independent third party, showing that SGPs are not susceptible to non-random assignment problems; that they miraculously negate bias resulting from differences in summer learning even when using annual test data; that they have much lower classification error rates when assigning teacher effectiveness ratings; that teachers receive the same ratings regardless of which underlying tests are used; and that one teacher’s ratings are not influenced by the other teachers of the same students. Until you can show me a vast body of literature on these issues specifically applied to SGP (or even using SGP as a measure within a VAM), comparable to that already in existence on more complete VAM models, don’t waste my time.


[1] Noting: “while the model above can be easily extended to allow for multivariate test outcomes (typical of applications of the EVAAS by Sanders), background covariates, and a term that links school effects to specific students in the event that students attend more than one school in a given year (c.f., Lockwood et al., 2007, p. 127-128), we have chosen this simpler specification in order to focus attention on the relationship between differences in our choice of the underlying scale and the resulting schools effect estimates.”

First Person

I’m a principal who thinks personalized learning shouldn’t be a debate.

PHOTO: Lisa Epstein
Lisa Epstein, principal of Richard H. Lee Elementary, supports personalized learning

This is the first in what we hope will be a tradition of thoughtful opinion pieces—of all viewpoints—published by Chalkbeat Chicago. Have an idea? Send it to cburke@chalkbeat.org

As personalized learning takes hold throughout the city, Chicago teachers are wondering why a term so appealing has drawn so much criticism.

Until a few years ago, the school that I lead, Richard H. Lee Elementary on the Southwest Side, was on a path toward failing far too many of our students. We crafted curriculum and identified interventions to address gaps in achievement and the shifting sands of accountability. Our teachers were hardworking and committed. But our work seemed woefully disconnected from the demands we knew our students would face once they made the leap to postsecondary education.

We worried that our students were ill-equipped for today’s world of work and tomorrow’s jobs. Yet, we taught using the same model through which we’d been taught: textbook-based direct instruction.

How could we expect our learners to apply new knowledge to evolving facts, without creating opportunities for exploration? Where would they learn to chart their own paths, if we didn’t allow for agency at school? Why should our students engage with content that was disconnected from their experiences, values, and community?

We’ve read articles about a debate over personalized learning centered on Silicon Valley’s “takeover” of our schools. We hear that Trojan horse technologies are coming for our jobs. But in our school, personalized learning has meant developing lessons informed by the cultural heritage and interests of our students. It has meant providing opportunities to pursue independent projects, and differentiating curriculum, instruction, and assessment to enable our students to progress at their own pace. It has reflected a paradigm shift that is bottom-up and teacher-led.

And in a move that might have once seemed incomprehensible, it has meant getting rid of textbooks altogether. We’re not alone.

We are among hundreds of Chicago educators who would welcome critics to visit one of the 120 city schools implementing new models for learning – with and without technology. Because, as it turns out, Chicago is fast becoming a hub for personalized learning. And, it is no coincidence that our academic growth rates are also among the highest in the nation.

Before personalized learning, we designed our classrooms around the educator. Decisions were made based on how educators preferred to teach, where they wanted students to sit, and what subjects they wanted to cover.

Personalized learning looks different in every classroom, but the common thread is that we now make decisions looking at the student. We ask them how they learn best and what subjects strike their passions. We use small group instruction and individual coaching sessions to provide each student with lesson plans tailored to their needs and strengths. We’re reimagining how we use physical space, and the layout of our classrooms. We worry less about students talking with their friends; instead, we ask whether collaboration and socialization will help them learn.

Our emphasis on growth shows in the way students approach each school day. I have, for example, developed a mentorship relationship with one of our middle school students who, despite being diligent and bright, always ended the year with average grades. Last year, when she entered our personalized learning program for eighth grade, I saw her outlook change. She was determined to finish the year with all As.

More than that, she was determined to show that she could master anything her teachers put in front of her. She started coming to me with graded assignments. We’d talk about where she could improve and what skills she should focus on. She was pragmatic about challenges and so proud of her successes. At the end of the year she finished with straight As—and she still wanted more. She wanted to get A-pluses next year. Her outlook had changed from one of complacence to one oriented towards growth.

Rather than undermining the potential of great teachers, personalized learning is creating opportunities for collaboration as teachers band together to leverage team-teaching and capitalize on their strengths and passions. For some classrooms, this means offering units and lessons based on the interests and backgrounds of the class. For a couple of classrooms, it meant literally knocking down walls to combine classes from multiple grade-levels into a single room that offers each student maximum choice over how they learn. For every classroom, it means allowing students to work at their own pace, because teaching to the middle will always fail to push some while leaving others behind.

For many teachers, this change sounded daunting at first. For years, I watched one of my teachers – a woman who thrives on structure and runs a tight ship – become less and less engaged in her profession. By the time we made the switch to personalized learning, I thought she might be done. We were both worried about whether she would be able to adjust to the flexibility of the new model. But she devised a way to maintain order in her classroom while still providing autonomy. She’s found that trusting students with the responsibility to be engaged and efficient is both more effective and far more rewarding than trying to force them into their roles. She now says that she would never go back to the traditional classroom structure, and has rediscovered her love for teaching. The difference is night and day.

The biggest change, though, is in the relationships between students and teachers. Gone is the traditional, authority-to-subordinate dynamic; instead, students see their teachers as mentors with whom they have a unique and individual connection, separate from the rest of the class. Students are actively involved in designing their learning plans, and are constantly challenged to articulate the skills they want to build and the steps that they must take to get there. They look up to their teachers, they respect their teachers, and, perhaps most important, they know their teachers respect them.

Along the way, we’ve found that students respond favorably when adults treat them as individuals. When teachers make important decisions for them, they see learning as a passive exercise. But, when you make it clear that their needs and opinions will shape each school day, they become invested in the outcome.

As our students take ownership over their learning, they earn autonomy, which means they know their teachers trust them. They see growth as the goal, so they no longer finish assignments just to be done; they finish assignments to get better. And it shows in their attendance rates – and test scores.

Lisa Epstein is the principal of Richard H. Lee Elementary School, a public school in Chicago’s West Lawn neighborhood serving 860 students from pre-kindergarten through eighth grade.

Editor’s note: This story has been updated to reflect that Richard H. Lee Elementary School serves 860 students, not 760 students.

First Person

I’ve spent years studying the link between SHSAT scores and student success. The test doesn’t tell you as much as you might think.

PHOTO: Robert Nickelsberg/Getty Images

Proponents of New York City’s specialized high school exam, the test the mayor wants to scrap in favor of a new admissions system, defend it as meritocratic. Opponents contend that when used without consideration of school grades or other factors, it’s an inappropriate metric.

One thing that’s been clear for decades about the exam, now used to admit students to eight top high schools, is that it matters a great deal.

Students admitted may receive not only a superior education, but also access to elite colleges and, eventually, better employment. That system has also led to an under-representation of Hispanic students, black students, and girls.

Beginning in 2015 as a doctoral student at The Graduate Center of the City University of New York, and continuing in the years since I received my Ph.D., I have tried to understand how meritocratic the process really is.

First, that requires defining merit. Only New York City defines it as the score on a single test — other cities’ selective high schools use multiple measures, as do top colleges. There are certainly other potential criteria, such as artistic achievement or citizenship.

However, when merit is defined as achievement in school, the question of whether the test is meritocratic is an empirical question that can be answered with data.

To do that, I used SHSAT scores for nearly 28,000 students and school grades for all public school students in the city. (To be clear, the city changed the SHSAT itself somewhat last year; my analysis used scores on the earlier version.)

My analysis makes clear that the SHSAT does measure an ability that contributes to some extent to success in high school. Specifically, a SHSAT score predicts 20 percent of the variability in freshman grade-point average among all public school students who took the exam. Students with extremely high SHSAT scores (greater than 650) generally also had high grades when they reached a specialized school.

However, for the vast majority of students who were admitted with lower SHSAT scores, from 486 to 600, freshman grade point averages ranged widely — from around 50 to 100. That indicates that the SHSAT was a very imprecise predictor of future success for students who scored near the cutoffs.

Course grades earned in the seventh grade, in contrast, predicted 44 percent of the variability in freshman year grades, making them a far better admissions criterion than the SHSAT score, at least for students near the score cutoffs.
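
For readers who want to see what “predicts X percent of the variability” means in practice, here is a small Python sketch using simulated data as a stand-in for the real records (which are not public). The R-squared values it prints are artifacts of the simulation, not the 20 and 44 percent figures from my analysis.

    # "Variance explained" is the R-squared from a simple regression of freshman
    # GPA on each predictor. Simulated stand-in data; the printed values will
    # not match the figures reported in the text.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 5000
    ability = rng.normal(0, 1, n)                 # unobserved common factor
    shsat = 500 + 60 * ability + rng.normal(0, 80, n)
    grade7_gpa = 85 + 6 * ability + rng.normal(0, 5, n)
    freshman_gpa = 80 + 5 * ability + rng.normal(0, 6, n)
    df = pd.DataFrame(dict(shsat=shsat, grade7_gpa=grade7_gpa,
                           freshman_gpa=freshman_gpa))

    r2_shsat = smf.ols("freshman_gpa ~ shsat", df).fit().rsquared
    r2_grades = smf.ols("freshman_gpa ~ grade7_gpa", df).fit().rsquared
    print(f"variance explained by SHSAT: {r2_shsat:.0%}")
    print(f"variance explained by 7th-grade GPA: {r2_grades:.0%}")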

It’s not surprising that a standardized test does not predict as well as past school performance. The SHSAT represents a two-and-a-half-hour sample of a limited range of skills and knowledge. In contrast, middle-school grades reflect a full year of student performance across the full range of academic subjects.

Furthermore, an exam that relies almost exclusively on one method of assessment, multiple-choice questions, may fail to measure abilities that are revealed by the variety of assessment methods that go into course grades. Additionally, middle school grades may capture something important that the SHSAT fails to capture: long-term motivation.

Based on his current plan, Mayor de Blasio seems to be pointed in the right direction. His focus on middle school grades and the Discovery Program, which admits students with scores below the cutoff, is well supported by the data.

In the cohort I looked at, five of the eight schools admitted some students with scores below the cutoff. The sample sizes were too small at four of them to make meaningful comparisons with regularly admitted students. But at Brooklyn Technical High School, the performance of the 35 Discovery Program students was equal to that of other students. Freshman year grade point averages for the two groups were essentially identical: 86.6 versus 86.7.

My research leads me to believe that it might be reasonable to admit a certain percentage of the students with extremely high SHSAT scores — over 600, where the exam is a good predictor — and admit the remainder using a combined index of seventh-grade GPA and SHSAT scores.

When I used that formula to simulate admissions, diversity increased somewhat. An additional 40 black students, 209 Hispanic students, and 205 white students would have been admitted, as well as an additional 716 girls. It’s worth pointing out that in my simulation, Asian students would still constitute the largest segment of students (49 percent) and would be admitted in numbers far exceeding their proportion of applicants.
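
For those curious about the mechanics, the short Python sketch below shows one way such a combined-index rule can be simulated: admit very high scorers outright, then rank the remaining applicants on an equally weighted blend of standardized seventh-grade GPA and SHSAT score. The cutoff, weights, seat count, and applicant data here are illustrative assumptions, not the exact formula or records behind the numbers above.

    # A combined-index admissions rule: admit applicants with very high SHSAT
    # scores outright, then fill the remaining seats by ranking a z-score blend
    # of 7th-grade GPA and SHSAT. All data and parameters are illustrative.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    n_applicants, n_seats = 28000, 5000
    df = pd.DataFrame({
        "shsat": rng.normal(480, 70, n_applicants),
        "grade7_gpa": np.clip(rng.normal(88, 6, n_applicants), 50, 100),
    })

    def zscore(s):
        return (s - s.mean()) / s.std()

    df["combined"] = 0.5 * zscore(df["grade7_gpa"]) + 0.5 * zscore(df["shsat"])

    auto_admit = df["shsat"] > 600                 # high scorers admitted outright
    remaining_seats = n_seats - auto_admit.sum()
    rank_admit = df.loc[~auto_admit, "combined"].nlargest(remaining_seats).index
    df["admitted"] = auto_admit
    df.loc[rank_admit, "admitted"] = True
    print(df["admitted"].sum(), "students admitted")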

Because middle school grades are better than test scores at predicting high school achievement, their use in the admissions process should not in any way dilute the quality of the admitted class, and could not be seen as discriminating against Asian students.

The success of the Discovery students should allay some of the concerns about the ability of students with SHSAT scores below the cutoffs. There is no guarantee that similar results would be achieved in an expanded Discovery Program. But this finding certainly warrants larger-scale trials.

With consideration of additional criteria, it may be possible to select a group of students who will be more representative of the community the school system serves — and the pool of students who apply — without sacrificing the quality for which New York City’s specialized high schools are so justifiably famous.

Jon Taylor is a research analyst at Hunter College analyzing student success and retention.