Test Scores Unreliable Means of Assessing School Quality

A review by the National Center for Fair and
Open Testing (FairTest) of the test score patterns of Massachusetts
schools recently singled out for special recognition on the basis
of gains in MCAS scores highlights a variety of ways in which test
score gains are an unreliable means of assessing school quality. 
A variety of factors can contribute to improvements in scores
without any appreciable improvement in student learning. 
MCAS gains may be a matter of luck, especially when fewer than
60 students are tested. Under these circumstances in particular,
gains are fragile; few schools sustain score increases over time.
Rises may also reflect changes in the population of students tested
more than particular school improvement strategies.  Overall
gains may also mask widening score gaps and students “stuck”
at the low-scoring levels.

Along with other national testing experts,
FairTest advises policy makers and educational practitioners that
using standardized test scores alone as the basis for describing
school performance or improvement is both unreliable and unfair. 
Research and experience underscore that test score gains provide
an inadequate and sometimes misleading picture of school “improvement.” 

Nevertheless, such misuse of standardized
test scores has become a hallmark of education policy in
Massachusetts.  This week, the business-supported, pro-MCAS
group MassInsight names ten schools and districts selected as
Vanguard schools for others to emulate.    The State
Education Department last year released a school performance rating
report, assigning all schools across the state a rating based
on the average of 1999 and 2000 MCAS results compared to 1998.

The MassInsight and DOE school recognition
procedures follow the lead of William Edgerly, chairman emeritus
of State Street Corporation, who annually identifies schools whose
principals receive awards of $10,000 based on MCAS score gains. 
Edgerly awards are based on largely meaningless single-year score
gains. FairTest research shows that two-year averages are likewise
inadequate for determining rewards and sanctions. 

FairTest’s review highlights the following:

• In many award-winning schools, the
numbers of students tested are so small that MCAS score gains
and claims for “most improved” status are misleading,
having more to do with “luck” than with authentic improvement.

• MCAS score gains in recognized schools
are not necessarily sustained over time.

• Students “lost” from the
testing pool may create the appearance of improved schools when
in fact what has changed is the composition of students tested.  

• Overall score gains may mask wider
achievement gaps between high-scoring and low-scoring students,
with percentages of students scoring in the lower ranges stuck
or increasing.

Small numbers of students tested
mean MCAS score gains in award-winning
schools may represent little more than good luck.

A recent report delivered to the Brookings
Institution clearly warns of the danger of misidentifying “good”
and “bad” schools on the basis of short-term test-score
gains (Lynn Olson,  “Study Questions Reliability of
Single-Year Test Score Gains,” Education Week, 23 May 2001:
9).   Commenting on the report’s implications for
school accountability and recognition programs, David Grissmer
of RAND Corporation notes, “The question is, are we picking
out lucky schools or good schools, and unlucky schools or bad
schools? The answer is, we’re picking out lucky and unlucky
schools” (Olson, p. 9).   In small schools, the
presence of even a few students who score especially high or low
can skew average scores dramatically from one year to the next.

Small numbers of students tested in several
MassInsight award-winning schools suggest that MCAS score gains
may reflect little more than variations in the testing pool.  
In these schools, overall school scores may fluctuate widely from
one year to the next depending on the scores of individual students
who, for better or worse, happen to be enrolled in a particular
grade during a particular year.

For many award-winning elementary schools,
the number of fourth graders taking MCAS is frequently
fewer than 60, the threshold researchers say can lead to
“considerable volatility” of results.  Under these
conditions, MCAS gains in a number of recognized schools should
be considered unreliable. For example:

• One award-winning school (Sunderland)
tested only 36 students in 1998, 20 students in 1999, and 41
students in 2000.
• In one district where two schools were honored, only 26
students were tested in 1998, 35 in 1999, and 25 in 2000 in one
school (Albert Lewis, Everett); in the second (Devens), only
50 were tested in 1998, 42 in 1999, and 46 in 2000.
• In another district where three schools were honored,
only 34 students were tested in 1998, 34 in 1999, and 32 in 2000
in one school (Altavista in Woburn); in another (Goodyear in
Woburn), 41 students were tested in 1998, 33 in 1999, and 53
in 2000.
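The role of luck at these cohort sizes can be illustrated with a short simulation (the score distribution below is purely hypothetical, not actual MCAS data): every year's cohort is drawn from one unchanging population, so any movement in the school average is sampling noise, not improvement or decline.

```python
import random
import statistics

# Hypothetical illustration: each year's cohort is drawn from the SAME
# underlying population (mean 230, s.d. 20 -- invented numbers, not MCAS
# data), so any year-to-year change in the school average is pure luck.
random.seed(1)

def yearly_averages(cohort_size, years=10):
    """Average score of each year's cohort, sampled from one fixed population."""
    return [statistics.mean(random.gauss(230, 20) for _ in range(cohort_size))
            for _ in range(years)]

swings = {}
for n in (30, 500):  # a small school's grade vs. a large one's
    means = yearly_averages(n)
    swings[n] = max(means) - min(means)
    print(f"cohort of {n:3d}: averages span {swings[n]:.1f} points over 10 years")
```

Even with nothing changing in the population, the small cohort's average swings several times as widely as the large cohort's: the standard error of a mean shrinks as 1/√n, so a 30-student grade is roughly four times as volatile as a 500-student one.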

Recognized schools do not all
sustain MCAS gains over time. 

Observers from across the country have reported
that test score gains are often short-lived. Commenting on the
“ping-pong” pattern of scores examined over a number
of years, Stanford University professor Linda Darling-Hammond
notes that score gains made over one or two years are often followed
by sharp drops in scores in subsequent years (Darling-Hammond,
The Right to Learn, 1997).  Indeed, as the Brookings report
indicates, the “considerable volatility” in scores is
to be expected when low numbers are tested (T.J. Kane and D.O.
Staiger, School Accountability Measures, 2000).

An examination of the scores of previous award-winning
schools based on MCAS gains from 1998 to 1999 shows mixed results
in 2000.   The mixed results indicate that MCAS gains
over one or two years may not signal lasting improvements. 

Of the five schools winning Edgerly awards in 1999, only one elementary
school (Riverside School, Danvers, testing 53 students in 1998
and 66 students in 1999) continued to show MCAS score gains in
2000 (when 61 students were tested).  At the only recognized
high school (Swampscott), gains in math were sustained, but the
percentage of students failing MCAS English, which had dropped from
24% in 1998 to 14% in 1999, returned to 24% in 2000.

At other schools recognized for “improvements”
from 1998 to 1999, the percentage of students scoring at higher
levels dropped from 1999 to 2000, while the percentage of students
scoring at lower levels, including “failing,” increased. 

• In one school (Abraham Lincoln, Revere)
26% of the students were deemed advanced or proficient in English/
language arts in 1999; in 2000, only 14% scored at the advanced
or proficient levels.  In 1999, 1% of students scored “failing”
in English; 6% in math.  In 2000, 5% scored “failing”
in English; 6% in math.

• At another school (Franklin D. Roosevelt,
Boston, testing 55, 51, and 54 students in 1998, 1999, and 2000
respectively), results were not sustained.   In 1999,
20% of fourth graders scored at advanced and proficient levels
in English; in 2000, only 9% reached this level.   
In 1999, 53% of the students scored at advanced or proficient
levels in math; in 2000, only 22% reached those levels. 
In 1999, 2% failed math; in 2000, 30% failed.

• At a third school (Kensington Avenue,
Springfield, where only 48, 42, and 43 fourth graders were tested
in 1998, 1999, and 2000), gains made from 1998 to 1999 were not
sustained in 2000.
> After the school moved 40% of its students into the proficient
English category from 1998 to 1999, the percentage scoring proficient
dropped back to 0% in 2000, below even the 2% of students scoring
proficient in 1998.  In 2000, all of the students scored in the needs
improvement and failing categories in English. 
> Math scores have also bounced around at Kensington. 
In 1999, 21% of the students scored in the advanced category
in math; in 2000, 0% scored advanced.  In 1999, 30% of the
students scored in the needs improvement and failing categories
in math; in 2000, 62% of the students fell into those categories.

Students “lost” from the testing
pool, especially in 10th grade, may create an illusion of improved
learning because of a reduction in the number of lower-scoring
students tested. 

In some schools and districts, test score
increases may reflect less an improvement in teaching and learning
than a loss of weaker students from the test-taking pool.  
In particular, high schools may “lose” students from
the test-taking pool during the school year. This “loss”
may contribute to boosting school scores overall or may reduce
the percentage of students scoring in the “failing”
or “needs improvement” category, but such “improvements”
will not necessarily represent authentic learning gains.

Several recognized schools and districts show
students enrolled in October but “missing” from the
numbers taking MCAS the following May.  Over the past three
years, depending on the district, this October-to-May loss in
MassInsight Vanguard schools ranges from 1.6% to 30%.

• In two award-winning schools (Hudson
and Nauset Regional), the percentage of tenth graders “missing”
from the test-taking rolls has increased steadily from the 1998 MCAS
administration to the 2000 administration.

> In Hudson, 18.2% of 10th graders were
“lost” between October and May in 1998, climbing to
24.1% in 1999, and 29.9% in 2000.  (For example, although
157 students were enrolled in Hudson’s 10th grade in October
1999, only 110 10th graders took MCAS in 2000.)

> In Nauset, only 1.6% of 10th graders were “lost”
between October and May in 1998; but 6.5% were “missing”
in 1999 and 8.5% were “missing” in 2000.  
(For example, although 248 students were enrolled in Nauset’s
10th grade in October 1999, only 227 took MCAS in 2000.)

• In a third recognized district (Boston),
the percentage of students “lost” during their October-
to-May tenth-grade year increased from 23.5% in 1998 to 30.0%
in 1999.  The rate of “lost” students dropped
back to 21.1% in 2000.  (For example, although 4,340 students
were enrolled in Boston’s 10th grade in October 1999, only
3,432 students took MCAS in May 2000.)  The lower rate
of “lost” students in 2000 may not be as encouraging as it
seems; with more of Boston’s ninth graders retained in grade,
fewer students are entering 10th grade with their peers.
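The “lost” percentages above follow directly from the October enrollment and May test-taking counts. A minimal sketch of the arithmetic, using the Hudson and Nauset figures cited in the examples:

```python
# Share of students "lost" between October enrollment and the May MCAS
# administration, computed from the counts cited in the report.
def pct_lost(enrolled_october, tested_may):
    return round(100 * (enrolled_october - tested_may) / enrolled_october, 1)

print(pct_lost(157, 110))  # Hudson: 157 enrolled Oct. 1999, 110 tested -> 29.9
print(pct_lost(248, 227))  # Nauset: 248 enrolled Oct. 1999, 227 tested -> 8.5
```

These reproduce the 29.9% (Hudson) and 8.5% (Nauset) loss rates reported above.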

“Missing students” are a concern in a high stakes testing
context.  High stakes testing can pressure schools to push
out low-scoring students or transfer them to vocational schools,
pressure low-scoring students to leave school, or lead parents to
remove vulnerable students from the testing pool and “protect”
them in private, parochial, or home schools.  Whatever the
reason, a change in the population of students tested can have
an impact on school test score patterns and signal a weakening
in schools’ “holding power,” especially for the
school’s or district’s most vulnerable students.

Overall score gains may mask
increasing or unchanging percentages of students scoring in the
lower ranges and wider achievement gaps between high- and low-scoring
students.

Sometimes, test score patterns can shift more students into top
MCAS levels, creating the illusion of schoolwide improvement. 
But schools must educate more than the “top” students.
If scores aren’t improving at the other levels, the school
may be “triaging” students — selecting the “most
promising” or “most deserving,” and leaving the
others behind.  The result is that scores become polarized,
with more students scoring at the top and more at the bottom.

In one award-winning Massachusetts district (Nauset), 2000 test
scores indeed showed more students scoring in the advanced category
on MCAS in all subjects than in 1998.  From 1998 to 2000,
the percentages of students in the advanced category went from
9% in 1998 and 1999 to 19% in 2000 in English, and from 12% to
19% to 28% in math.  However, while the percentage of students
scoring at the “top” is growing, the “middle”
is either shrinking or remaining unchanged, and the “bottom”
is increasing.

• Students in the “failing”
category:  Despite the drop in the percentage of failing students
from 1999 to 2000, the percentage of Nauset students “failing”
went up from 1998 to 1999, and more 10th graders fell into
the “failing” category in 2000 than in 1998.

> In 1998, 10% failed English; 19% failed math.
> In 1999, 16% failed English; 27% failed math.
> In 2000, 11% failed English; 22% failed math.

• In the combined “failing”
and “needs improvement” categories: 

> In 1998, 39% were either failing or needs
improvement in English; 48% were either failing or needs improvement
in math.
> In 1999, 45% were either failing or needs improvement in
English; 56% were either failing or needs improvement in math.
> In 2000, 40% were either failing or needs improvement in
English; 49% were either failing or needs improvement in math.

Despite increases in the percentage of Nauset
students in the “advanced” category, the percentages
of students in the failing or needs improvement categories remain
virtually unchanged. The best that can be said is that after an
increase in the lower-scoring categories from 1998 to 1999, the
percentage of the school’s students scoring in the failing
or needs improvement category has almost returned to the 1998
levels.  When data is disaggregated by special education
status, Nauset now has a lower percentage of students with disabilities
scoring at the advanced level and a higher percentage of students
with disabilities scoring at the failing or needs improvement
level than in either 1998 or 1999.

School improvement should be for all students. 
The bifurcation of test scores, like the loss of students between
October and May, may signal that the most vulnerable students
do not have equal access to the benefits of school reform, or
even that whatever “reforms” the school is undertaking
may be putting the most vulnerable students at a disadvantage. 



MCAS scores are an inadequate means of assessing
school improvement (or lack of improvement) or of engaging schools
in an authentic accountability process.  FairTest’s
review of the data highlights the ways in which MCAS score gains
may owe more to luck than quality and may fail to reflect genuine
achievement for all students.   Despite research that warns
of the inaccuracies of such exercises, the Massachusetts Department
of Education and pro-MCAS business groups continue to cite schools
and districts as “exemplary” on the slimmest possible
evidence.  In doing so, they do a disservice to the schools
recognized and the fine schools that go without recognition, and
to parents, students, and communities.

Neither the Department of Education nor private groups
should base rewards solely on test scores or
changes in the scores. School improvement strategies and curriculum
changes should not be made on the basis of score changes that
may be caused by factors other than educational improvements.
Schools that show increasing numbers of “lost” students
or widening gaps between high-scoring and low-scoring students
should not become “models” for other schools.

To avoid the pitfalls of test-based school
evaluations, policy makers must shift their focus from short-term
test-score changes to developing an accountability system that
utilizes multiple measures and includes a range of indicators
to determine school progress.  FairTest supports the authentic
accountability plan proposed by the Coalition for Authentic Reform
in Education (CARE): a comprehensive accountability system
that uses a balance of local and state assessments to describe
school improvement and student progress.  The system consists
of four integrated components: local assessments developed by
local schools based on state curriculum frameworks; a school quality
review process involving periodic intensive onsite visits by teams
of external reviewers; limited standardized testing in literacy
and numeracy; and, annual reporting by schools to their communities.