Why Teacher Evaluation Shouldn’t Rest on Student Test Scores

To win federal Race to the Top grants or waivers from No Child Left Behind, most states have adopted teacher and principal evaluation systems based largely on student test scores. Many educators have resisted these unproven policies. Researchers from Massachusetts and Chicago-area universities and more than 1,500 New York state principals signed statements against such practices. Chicago teachers even struck over this issue, among others. Here’s why these systems -- including “value added” (VAM) or “growth” measures -- are not effective or fair.

  1. Basing teacher evaluations on inadequate standardized tests is a recipe for flawed evaluations. Value-added and growth measures are only as good as the exams on which they are based. They are simply a different way to use the same data. Unfortunately, standardized tests are narrow and limited indicators of student learning. They leave out a wide range of important knowledge and skills. Most states assess only the easier-to-measure parts of math and English curricula (Guisbond, et al., 2012; IES, 2009).
  2. Test-based teacher evaluation methods too often reflect the students teachers have, not how well they teach. Researchers calculate that teachers’ influence on student test scores ranges from as little as 7.5% to 20% (Education Week, 2011). Out-of-school factors are the most important. As a result, test scores are greatly dependent on a student’s class, race, disability status and knowledge of English. Some value-added measures claim to take account of students’ backgrounds through statistical techniques. But the techniques do not adequately adjust for different populations or for the impact of things like grouping and tracking students. So the measures remain inaccurate (Darling-Hammond, et al., 2012; Baker, 2013).
  3. Basing teacher evaluations on VAM or growth harms educational quality. Since educators’ careers would depend on their students’ scores, these measures will intensify incentives to further narrow the curriculum and teach to the test (Guisbond, et al., 2012). More students will lose access to untested subjects, such as history, science, art, music, and physical education. Schools are likely to give less attention to teaching cooperation, communication, creativity and other essential skills. Teachers may try to avoid students whose gains are harder to demonstrate (Mass. Working Group, 2012).
  4. Because of unreliable and erratic results, many teachers are incorrectly labeled “effective” or “ineffective.” On the surface, it makes sense to look at student gains, rather than students’ one-time scores. Measuring progress is important. However, VAM and growth measures are not accurate enough to use for important decisions. One study found that among teachers ranked in the top 20 percent of effectiveness in the first year, fewer than a third were in that top group the next year. Another third moved all the way down to the bottom 40 percent (Newton, et al., 2010). A RAND study found that using different math subtests resulted in large variations in teachers’ ratings, suggesting the measure, not the teacher, was the cause of the differences (Lockwood, et al., 2007). In some states currently using these methods, the results are no more accurate than flipping a coin (Baker, 2012).  
  5. It is difficult if not impossible to isolate the impact of a single individual on a student because teaching is a collaborative and developmental process. A recent review of the available evidence concluded that teacher evaluation has not been shown to produce positive results in learning outcomes or school improvement (Murphy, et al., 2013). Teams of teachers and others like social workers, guidance counselors and school nurses all work together to educate students. Individual classroom teachers also build on the efforts of a student’s previous teachers. If a student has a breakthrough in 5th grade, it could be largely due to groundwork laid in 3rd and 4th grade. Who should get the credit? And if teacher evaluation is not helpful, why spend scarce resources in this area?
  6. Use of VAM/growth models drives good teachers away from needy students or out of the profession. Excellent teachers are already being judged as inadequate; some are leaving the profession (Winerip, 2011). Teachers working with the most needy students are put at risk because of their students’ background characteristics (Burris, 2012; Mass. Working Group, 2012).  Ironically, students who score highest on state tests also are likely to show little “growth,” endangering their teachers (Pallas, 2012).
  7. Many independent researchers conclude these methods are inadequate and will cause harm. VAM defenders claim the current teacher evaluation system is weak and must be changed. At a minimum, they say VAM will be better than what now exists. However, the Board on Testing and Assessment of the National Research Council concluded, “VAM estimates of teacher effectiveness should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable” (BOTA, 2009). Bruce Baker (2011) summarized the research evidence: Value-added “just doesn’t work, at least not well enough to even begin considering using it for making high-stakes decisions about teacher tenure, dismissal or compensation… In fact, it will likely make things much worse.” Edward Haertel (2013) concluded, “Teacher VAM scores should emphatically not be… [given] a fixed weight in consequential teacher personnel decisions.” Most states do put high, fixed weights on this data. As Baker (2012) says, the statistical models actually used by states “increasingly appear to be complete junk!” (emphasis in original).
  8. Two high-profile studies often cited to support VAM provide insufficient justification for its use. A Gates Foundation study argued that teachers who scored high on VAM tend to do well on other measures (Kane, et al., 2010). Another study found that teachers whose students had high value-added scores also had students with better long-term outcomes such as higher incomes (Chetty, et al., 2012). But independent reviews found that neither study provided strong evidence that VAM’s benefits outweigh the potential damage it can cause (Rothstein, 2012). Rothstein (2011) also concluded the Gates report provided more reasons to not use VAM than to use it. Adler (2013) discovered that the Chetty claims are false because “they are contradicted by the findings of the study itself.” Both Gates and Chetty used data from teachers who did not face high-stakes consequences. Pressure to teach to the test to boost scores would likely corrupt the results, undermining the findings of these studies.
  9. To measure teacher and principal quality and effectiveness, use multiple measures based on school and classroom evidence. To the limited extent that scarce resources should be spent on teacher evaluation (Murphy, et al., 2013), the fair and accurate way to determine a teacher’s quality is with an array of different measures (Mathis, 2012). These would include observations by principals or other skilled educators and reviews of students’ classroom work. States and districts should use techniques that do not rely on student test scores, such as the Peer Assistance and Review Model (Darling-Hammond, et al., 2012; SRI, 2011). Evidence from districts such as Montgomery County, Maryland, and Toledo, Ohio, shows that peer review systems (which focus mainly on professional learning) can be fair and accepted by educators (Winerip, 2011; SRI, 2011). They also can improve the quality of teaching and counsel out teachers who should be in a different profession. Using VAM or growth as part of measuring and blaming will undermine, not improve, educational quality.


Adler, M. 2013. “Findings vs. Interpretation in ‘The Long-Term Impacts of Teachers’ by Chetty et al.” Education Policy Analysis Archives, V. 21, N. 10, February. http://epaa.asu.edu/ojs/article/view/1264/1033

Baker, B. 2013. “The Value Added & Growth Score Train Wreck is Here.” Oct 13.  http://schoolfinance101.wordpress.com/2013/10/16/the-value-added-growth-...

Baker, B. 2011. “Opinion: 7 reasons why teacher evaluations won't work,” NorthJersey.com. http://www.northjersey.com/news/education/evaluation_031311.html?page=al...

Board on Testing and Assessment. 2009. “Letter Report to the U.S. Department of Education on the Race to the Top Fund,” The National Academies. http://www.nap.edu/openbook.php?record_id=12780&page=1

Burris, C. 2012. “New teacher evaluations start to hurt students.” The Answer Sheet. The Washington Post. September 30. http://www.washingtonpost.com/blogs/answer-sheet/post/new-teacher-evalua...

Chetty, R., Friedman, J.N. and Rockoff, J.E. 2011. The Long-Term Impact of Teachers: Teacher Value-Added and Student Outcomes in Adulthood. National Bureau of Economic Research. http://www.nber.org/papers/w17699

Guisbond, L., Neill, M., and Schaeffer, B. 2012. NCLB’s Lost Decade for Educational Progress: What Can We Learn from this Policy Failure?

Lockwood, J.R., McCaffrey, D.F., Hamilton, L.S., Stecher, B.M., Le, V. and Martinez, F. 2007.  "The Sensitivity of Value-Added Teacher Effect Estimates to Different Mathematics Achievement Measures," Journal of Educational Measurement.

Mathis, W. 2012. Research-Based Options for Education Policy Making: Teacher Evaluation. National Education Policy Center.

Darling-Hammond, L. et al. 2012. “Evaluating Teacher Evaluation,” Phi Delta Kappan. March. http://www.kappanmagazine.org/content/93/6/8.short

Education Week. 2011. “Teacher Quality.” July 8. http://www.edweek.org/ew/issues/teacher-quality/

Massachusetts Working Group on Teacher Evaluation. 2012. Flawed Massachusetts Teacher Evaluation Proposal Risks Further Damage to Teaching and Learning. http://www.fairtest.org/sites/default/files/fairtest%20report%206611.ind...

Murphy, J., Hallinger, P., and Heck, R.H. 2013. “Leading via Teacher Evaluation: The Case of the Missing Clothes?” Educational Researcher, V. 42, N. 6.

Institute of Education Sciences. 2009. Appendix A: State Testing Programs Under NCLB. http://ies.ed.gov/ncee/pubs/2009013/appendix_a.asp

Montgomery County Public Schools. (N.d.) Professional Growth System. http://www.montgomeryschoolsmd.org/departments/development/teams/admin/a...

Newton, X., Darling-Hammond, L., Haertel, E., & Thomas, E. 2010. “Value-Added Modeling of Teacher Effectiveness: An exploration of stability across models and contexts.” Educational Policy Analysis Archives, 18 (23). http://epaa.asu.edu/ojs/article/view/810

Pallas, A. 2012. “Meet the ‘worst’ 8th grade math teacher in NYC,” The Answer Sheet, The Washington Post. May 16.  http://www.washingtonpost.com/blogs/answer-sheet/post/meet-the-worst-8th...

Rothstein, J. 2011. Review of “Learning About Teaching: Initial Findings from the Measures of Effective Teaching Project.” Boulder, CO: National Education Policy Center. http://nepc.colorado.edu/thinktank/review-learning-about-teaching.

Rothstein, J. 2012. “Let’s Not Rush Into Value-Added Evaluations,” Room for Debate, The New York Times. http://www.nytimes.com/roomfordebate/2012/01/16/can-a-few-years-data-rev...

SRI International. 2011. “The Search for Teacher Effectiveness: A Study of Exemplary Peer Review Programs.” http://policyweb.sri.com/cep/projects/displayProject.jsp?Nick=PARPeer

Winerip, M. 2011a. “Evaluating New York Teachers, Perhaps the Numbers Do Lie,” The New York Times. March 6. http://www.nytimes.com/2011/03/07/education/07winerip.html?pagewanted=all

Winerip, M. 2011b. “Helping Teachers Help Themselves,” The New York Times. June 6.
