School Rankings Resemble Lottery

K-12 Testing

New research has found that test scores, particularly year-to-year score gains, are too erratic to use fairly for holding schools accountable. Average school scores and short-term gains or losses typically fluctuate widely due to factors outside a school’s control. Thus, rewards or sanctions based on test scores are likely to be inaccurate.


Proposed revisions to the federal Elementary and Secondary Education Act call for major decisions about schools to be made solely on the basis of annual changes in test scores (see article, p. 1). While that legislation specifies that tests must be “valid and reliable,” researchers Thomas Kane and Douglas Staiger found that year-to-year gains and losses on state tests are in fact quite unreliable if used for decision-making. About half the states now use test scores as the key measure for rewarding or punishing schools.


In two papers, Kane and Staiger examined data, primarily from North Carolina, to determine the precision of school scores. They attributed variation in test scores to three sources: sampling changes (e.g., a different group of students each year in a tested grade), a particularly severe problem in small schools; “one-time factors” such as a “barking dog” that distracts a group of test-takers; and persistent differences in actual performance among schools. The researchers found that fifty to eighty percent of the year-to-year fluctuation observed in a typical North Carolina school’s average score is due to the first two factors, not to differences in tested achievement.
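Why sampling variation hits small schools hardest can be illustrated with a short calculation. The sketch below is not Kane and Staiger's method. It assumes student scores are standardized (standard deviation 1.0) and that true school averages differ with a standard deviation of 0.2; both figures, and the function names, are hypothetical and chosen only for illustration.

```python
import math


def sampling_sd_of_school_mean(student_sd: float, n_students: int) -> float:
    """Standard deviation of a school's average score caused by sampling alone.

    The average of n independent student scores has standard deviation
    student_sd / sqrt(n), so it shrinks as the tested group grows.
    """
    return student_sd / math.sqrt(n_students)


def noise_share(student_sd: float, n_students: int, between_school_sd: float) -> float:
    """Fraction of the variance in observed school averages due to sampling noise."""
    noise_var = sampling_sd_of_school_mean(student_sd, n_students) ** 2
    signal_var = between_school_sd ** 2  # assumed spread of true school averages
    return noise_var / (noise_var + signal_var)


if __name__ == "__main__":
    # Hypothetical inputs: student_sd=1.0, between_school_sd=0.2
    for n in (25, 100, 400):
        print(n, round(noise_share(1.0, n, 0.2), 3))
```

Under these assumed figures, sampling noise accounts for half of the variance among schools testing 25 students and one-fifth among schools testing 100.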


As a result, the researchers explain, school rankings based largely on score increases “generally resemble a lottery.” Only one percent of the state’s schools ranked in the top 10 percent in math for all six of the years studied (1993-1998). In reading, which is more volatile than math, more than one-third of all schools ranked in the top 10 percent at some point.


In states that use racial subgroup test performance to determine ratings, Kane and Staiger found that racially segregated schools are less likely to suffer the consequences of score variability. This is largely because the number of students in any racial group within an integrated school is likely to be so small as to make scores for the subgroup more volatile than for the school as a whole. In California, more diverse schools were substantially less likely to be rewarded for their test score gains than were more homogeneous schools, even though the more diverse schools actually had “greater improvements in overall test scores.” Thus, use of test score gains to reward or punish “can generate perverse incentives for districts to segregate their students.”
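The arithmetic behind the subgroup penalty can be sketched as follows. This is an illustration only, not California's actual rule or the researchers' model. It assumes a hypothetical rule under which a school is rewarded only if every subgroup shows a positive measured gain, that every group has the same true gain (0.1), that measurement noise is normal and independent across groups, and that the per-student noise has a standard deviation of 1.0. All names and numbers are made up for the example.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def chance_group_shows_gain(true_gain: float, student_sd: float, n_students: int) -> float:
    """Probability that a group's measured average gain is above zero."""
    standard_error = student_sd / math.sqrt(n_students)  # noise in the group average
    return normal_cdf(true_gain / standard_error)


def chance_all_groups_show_gain(true_gain: float, student_sd: float, group_sizes) -> float:
    """Probability that every group clears the bar, assuming independent noise."""
    probability = 1.0
    for n in group_sizes:
        probability *= chance_group_shows_gain(true_gain, student_sd, n)
    return probability


if __name__ == "__main__":
    # Two hypothetical schools, each testing 120 students in total.
    homogeneous = chance_all_groups_show_gain(0.1, 1.0, [120])
    diverse = chance_all_groups_show_gain(0.1, 1.0, [30, 30, 30, 30])
    print(round(homogeneous, 2), round(diverse, 2))
```

With these assumed inputs, the single-group school clears the bar roughly 86 percent of the time and the four-group school roughly 25 percent of the time, even though both improved by the same amount.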


Selecting “good” programs that other schools should emulate, a common goal of test-based accountability programs, is also a matter of luck. Since a large percentage of schools will at some point be labeled “best practice” schools if test scores are the determining factor, the result would be an ever expanding menu of “best practices” from schools whose scores often decline the next year or two.


In the hopes of making more accurate judgements, some states have taken to considering multi-year averages rather than just the previous year. However, the authors report, a North Carolina school that wanted to predict its current-year reading score gain would do better simply to pick the state’s average score increase than to use its own previous four years of score changes. Thus, averaging a few years’ scores to smooth out random fluctuations, as several states now do, appears not to work.
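A simple model shows how a state average can out-predict a school's own history. The sketch below is an assumption-laden illustration, not the authors' analysis: it supposes each school's measured gain equals the state average, plus a fixed school-specific component, plus noise that is independent from year to year. The variance figures and function names are hypothetical.

```python
def mse_using_state_average(persistent_var: float, noise_var: float) -> float:
    """Expected squared error when predicting a school's gain with the state average.

    The error is the school's persistent difference from the state
    plus this year's noise.
    """
    return persistent_var + noise_var


def mse_using_own_history(noise_var: float, n_years: int) -> float:
    """Expected squared error when predicting with the school's own n-year average.

    The persistent component cancels, but the noise in the past average
    (noise_var / n_years) adds to this year's noise.
    """
    return noise_var / n_years + noise_var


if __name__ == "__main__":
    # Hypothetical case: persistent differences are small relative to noise.
    persistent_var, noise_var, n_years = 0.01, 0.09, 4
    print(mse_using_state_average(persistent_var, noise_var))
    print(mse_using_own_history(noise_var, n_years))
```

In this model the school's own average wins only when the noise variance divided by the number of years is smaller than the persistent variance. With the assumed figures above it is not, so the state average is the better guess.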


The authors also asked whether score gains are due to such factors as teaching to the test rather than to real improvements in learning. Lacking a direct measure, they examined several characteristics of “student engagement”: absenteeism, time doing homework, and time watching television. Those measures did not improve in schools in which scores rose substantially. The authors note that this lack of improvement “would be consistent with the hypothesis that schools began tailoring their curricula to improve performance on the tests, without generating similar improvements on other measures.”


Recently, the researchers studied North Carolina and Texas in light of the proposed federal testing requirements (see story, p. 1). They conclude that almost all schools in those states will be labeled as failing to make adequate progress. A somewhat similar study by the Congressional Research Service found that the reason more than 95 percent of the schools in Texas, North Carolina and Maryland would fail to make “adequate yearly progress” for two consecutive years was the “variation in school-level scores from year to year, and among different pupil groups in the same year.”


• Volatility in School Test Scores: Implications for Test-Based Accountability Systems (Kane and Staiger, April 2001)
• Improving School Accountability Measures (Kane and Staiger, March 2001)
• A FairTest study on score changes in “good” schools in Massachusetts is available at