Part Six: Internal documents show the makers of the new SAT knew the test was overloaded with wordy math problems – a hurdle that could reinforce race and income disparities. The College Board went ahead with the exam anyway.
Despite warnings, College Board redesigned SAT in way that may hurt neediest students
NEW YORK – In the days after the redesigned SAT college entrance exam was given for the first time in March, some test-takers headed to the popular website Reddit to share a frustration.
They had trouble getting through the exam’s new mathematics sections. “I didn’t have nearly enough time to finish,” wrote a commenter who goes by MathM. “Other people I asked had similar impressions.”
The math itself wasn’t the problem, said Vicki Wood, who develops courses for PowerScore, a South Carolina-based test preparation company. The issue was the wordy setups that precede many of the questions.
“The math section is text heavy,” said Wood, a tutor who took the SAT in May. “And I ran out of time.”
The College Board, the maker of the exam, had reason to expect just such an outcome for many test-takers.
When it decided to redesign the SAT, the New York-based not-for-profit sought to build an exam with what it describes as more “real world” applications than past incarnations of the test. Students wouldn’t simply need to be good at algebra, for instance. The new SAT would require them to “solve problems in rich and varied contexts.”
But in evaluating that approach, the College Board’s own research turned up problems that troubled even the exam makers.
About half the test-takers were unable to finish the math sections on a prototype exam given in 2014, internal documents reviewed by Reuters show. The problem was especially pronounced among students that the College Board classified as low scorers on the old SAT.
A difference in completion rates between low scorers and high scorers is to be expected, but the gap on the math sections was much larger than the disparities in the reading and writing sections. The study Reuters reviewed didn’t address the demographics of that performance gap, but poor, black and Latino students have tended to score lower on the SAT than wealthy, white and Asian students.
In light of the results, officials concluded that the math sections should have far fewer long questions, documents show. But the College Board never made that adjustment and instead launched the new SAT with a large proportion of wordy questions, a Reuters analysis of new versions of the test shows.
The redesigned SAT is described in the College Board’s own test specifications as an “appropriate and fair assessment” to promote “equity and opportunity.” But some education and testing specialists say the text-heavy new math sections may be creating greater challenges for kids who perform well in math but poorly in reading, reinforcing race and income disparities.
Among those especially disadvantaged by the number of long word problems, they say, are recent immigrants and American citizens who aren’t native English speakers; international students; and test-takers whose dyslexia or other learning disabilities have gone undiagnosed.
“It’s outrageous. Just outrageous,” said Anita Bright, a professor in the Graduate School of Education at Portland State University in Oregon. “The students that are in the most academically vulnerable position when it comes to high-stakes testing are being particularly marginalized,” she said.
College Board CEO David Coleman, the chief architect of the redesign, declined to be interviewed, as did other College Board officials named in this article.
But earlier this year, Coleman talked about the importance of creating a test that was fair to students for whom English is not their native language. In a speech in March about ensuring “equal access” to higher education, Coleman talked about the SAT’s past emphasis on obscure vocabulary words and “what that meant for English-language learners in this country. It would determine who went to college.”
Now, the new exam’s wordy math questions – not the obscure vocabulary – have created a similar and unfair barrier for the very group Coleman identified, some academics say.
“The problem is going to mostly affect English-language learners,” said Jamal Abedi, a University of California, Davis professor who specializes in educational assessments. He said such students could perform poorly “not because of their lack of math content knowledge but by the language burden.” Abedi said he was among a group of scholars who recently consulted with the SAT’s archrival, the ACT, about whether English-language learners deserve special accommodations on test day.
English-language learners account for a substantial share of public school students in the United States – 9.3 percent in 2013-2014, or 4.5 million kids, according to the National Center for Education Statistics.
Putting math questions in context makes sense to some testing specialists. Daniel Koretz, a professor of education at the Harvard Graduate School of Education, said some professors would want to know whether students can follow university-level math lectures. “Some students will be disadvantaged, yes, but that’s not necessarily the wrong thing to do,” Koretz said.
But Portland State’s Bright said the College Board went too far. Vocabulary and reading “are tested elsewhere on the exam,” she said. The word problems in the math sections, Bright said, ought to be more succinct.
Admissions officers at many U.S. colleges and universities, especially elite institutions, use SAT scores as a means of weeding out candidates. A low math score may cripple an applicant’s chances of getting into a selective school.
College Board spokeswoman Sandra Riley declined to provide statistics indicating how many students have been able to fully answer the math sections on the redesigned test. A College Board analysis of 10 tests given this spring, she said, found that completion rates for English-language learners “are similar to what we see for all students.”
Riley also said the rate of completion of the math section for the SAT given in March “met our goals and on average is equal to or higher than the completion rate for the math section of the old SAT.”
All things being equal, the completion rate across every section of the exam should be higher than it was for the old SAT, testing specialists say.
That’s because the old exam penalized students for wrong answers, encouraging them to leave questions blank rather than hazard a guess. The new test contains no penalty for guessing wrong, and the College Board encourages students to answer every question. As it says on its website, “On the new SAT, you simply earn points for the questions you answer correctly. So go ahead and give your best answer to every question — there’s no advantage to leaving them blank.”
To understand how the College Board remade the all-important exam, Reuters reviewed hundreds of pages of internal documents, including emails between officials formulating the new SAT; private comments from academics hired to vet potential test questions; items from exams that have already been administered; and hundreds of questions developed for upcoming SATs.
The not-for-profit organization has posted a 210-page document online that explains the approach to remaking its signature product. But the vast majority of the documents Reuters examined have never been made public. They offer an unprecedented look at how a standardized college entrance exam comes together.
Among the documents examined was the College Board timing study done two years before the release of the new test. That study describes, section by section, the percentage of students who were able to finish a prototype of the redesigned exam.
Perhaps most revealing is the confidential feedback of academics who, for a modest stipend, help the College Board review proposed test questions for fairness and accuracy. Some were scathing in their criticism of the new test. They warned early and often that many questions were long and convoluted, and distracted from the skills the math sections should be assessing, documents examined by Reuters show.
In a nearly 5,000-word letter from August 2014, one reviewer told College Board officials that he had “never encountered so many seriously flawed items” in the 20-plus years he had been screening math material for the organization.
That reviewer – a lecturer at the University of Wisconsin-Milwaukee named Dan Lotesto – wrote of the math questions: “Why so many items with vocabulary issues, especially for ELL students?” ELL stands for English-language learners.
Lotesto proved prescient about the situation the College Board now faces. It would be difficult, he wrote, to create questions that fairly assessed students and were “mathematically and linguistically ‘tight’ as well as doable.”
Contacted by Reuters, Lotesto said he stands behind the 2014 letter but declined to comment further. He said he still reviews potential material for the SAT.
Concerns that the College Board has created an SAT with flawed math sections come as the organization struggles on another front: its inability to safeguard the content of the exam.
Last month, federal authorities launched a criminal investigation after Reuters obtained about 400 unpublished questions from the redesigned SAT exam. Some testing specialists called the security lapse one of the most serious in the history of college-admissions testing.
On August 26, federal agents raided the home of Manuel Alfaro, a former top College Board executive who has become an outspoken critic of the organization. Alfaro, who helped remake the math sections, was fired by the College Board in early 2015, almost two years after he says he was wooed to the organization by CEO Coleman.
The new SAT was created on a tight deadline.
The exam’s competitor, the Iowa-based ACT, had surpassed the SAT in popularity in the United States in 2012 and was winning state contracts.
Soon after, in October 2012, Coleman took over the College Board. He quickly set a demanding goal for his team: roll out a completely redesigned test in about two years, by March 2015. The challenge was enormous. Developing a single exam’s worth of questions for an existing SAT format can itself take two years.
Internal documents show the growing competitive threat posed by the ACT was one reason for his urgency. In an email to senior College Board executives in August 2013, Coleman wrote that “we must deliver the revised SAT by 2015 … We cannot afford to let ACT continue to shape the state market.”
Although the launch was eventually delayed by a year, the push was on.
The original formula for the new SAT took shape in early 2013. That’s when the College Board created draft “content specifications” for the exam, according to an internal document dated April 30. The document’s author is Cyndie Schmeiser, the organization’s chief of assessment and top deputy to CEO Coleman.
The document shows that test developers planned to put a “heavy emphasis on students’ ability to apply math to solve problems in rich and varied contexts.” That meant there would be a greater focus on word problems in the math sections. The shift was intended to align the new test with the Common Core, a set of learning standards that many states have adopted and that Coleman helped create.
More detailed specifications were circulated within the College Board the next month. They called for half of all the math items – about 30 questions – to be “contextualized.” These would be the kind of word problems Schmeiser described in the earlier document. Of those 30 questions, the specifications called for 14 – nearly half – to be “heavy,” having more than 60 words. The remaining 16 would be “medium” – between 40 and 60 words – or “light,” shorter than 40 words.
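The word-count bands in those specifications can be illustrated with a minimal sketch. It assumes words are counted by splitting text on whitespace, a counting rule the documents do not spell out, and the function names are hypothetical rather than anything the College Board used.

```python
# Illustrative sketch only; not the College Board's actual tooling.
# Assumption: a "word" is any whitespace-separated token.


def count_words(item_text: str) -> int:
    """Count whitespace-separated tokens in a question's text."""
    return len(item_text.split())


def classify_item(item_text: str) -> str:
    """Return 'heavy', 'medium' or 'light' for one contextualized item."""
    words = count_words(item_text)
    if words > 60:       # "heavy": more than 60 words
        return "heavy"
    if words >= 40:      # "medium": between 40 and 60 words
        return "medium"
    return "light"       # "light": shorter than 40 words


# Planned mix in the specifications: 14 heavy items out of 30 in context.
planned_heavy_share = 14 / 30
print(f"{planned_heavy_share:.0%}")  # prints 47%
```

Under these bands, a question of exactly 60 words counts as medium, and the planned 14-of-30 mix works out to roughly 47 percent heavy.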
Problems with those plans became evident in February 2014. That month, the College Board conducted the national study that tested student volunteers at select schools, using a prototype version of the new SAT. The study, marked “confidential and restricted,” has never been made public.
Known as the “rSAT Prototype Form Study,” the paper includes the proportion of students able to reach, within the test’s time constraints, the end of each section of the reconfigured exam. About 5,600 students were given the prototype SAT – which included the reading, writing-and-language, and math sections.
The study indicates that the students were clustered into groups – high, medium and low scorers – based not on their individual scores on the old SAT but rather on the overall scoring average of the school where they took the test. That means that, if the average score of students at a school participating in the study was 1350 or lower on the old SAT, all the test-takers at that school were classified as low scorers. The maximum score on the old test was 2400.
Only 47 percent of all students finished one part of the math section – the 20-question portion during which calculators aren’t allowed. And just 50 percent completed the longer 38-question math section, which allows the use of a calculator. That compared with 83 percent of students who were able to complete the reading section and 80 percent able to finish the writing-and-language section.
According to the study, the inability to finish the math sections disproportionately affected particular students: those from schools where average scores on the old SAT were categorized as “low” or “medium.” On the calculator section of the prototype test, for example, 78 percent of students from high-scoring schools reached the end. Only 41 percent of students from low-scoring schools finished. The gap was smaller on the reading test: 95 percent of students from high-scoring schools finished, versus 73 percent from schools where scores were low.
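The size of those gaps can be worked out directly from the reported figures. The short sketch below only subtracts the percentages cited above; the section labels and variable names are chosen for illustration.

```python
# Completion rates (percent) reported in the prototype study, as pairs of
# (students from high-scoring schools, students from low-scoring schools).
completion = {
    "math, calculator": (78, 41),
    "reading": (95, 73),
}

# Gap in percentage points between high- and low-scoring schools.
gaps = {section: high - low for section, (high, low) in completion.items()}

for section, gap in gaps.items():
    print(f"{section}: {gap} percentage points")
# prints:
# math, calculator: 37 percentage points
# reading: 22 percentage points
```

By this arithmetic, the gap on the calculator math section was 37 percentage points, compared with 22 on the reading test.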
The study’s results were reviewed by the test development staff in May 2014. By August 2014, the group decided to reduce the number of long-worded math problems to help more students finish the exam, documents show.
THE WEARY ALBATROSS
The College Board’s timing study wasn’t the only sign of trouble with the word problems in the math sections.
External, independent reviewers who evaluate potential SAT questions for accuracy and bias were spotting potential problems as early as September 2014, around the time College Board officials had decided to reconfigure the math sections.
During a meeting of the committee that assessed the math questions, one reviewer tore into a problem that asked students to calculate the time it took a car driver to apply the brakes. It began with a table listing nine speeds, followed by a lengthy description of that table. In all, the item exceeded 200 words.
The lead-in, the reviewer wrote, “requires A LOT of reading. Do we want that?”
According to one confidential roundup of reviewer comments, the academics also seized on word problems that had little to do with the math skills supposedly being tested. One question involved a bird on a long flight.
“I am distracted by the issue of the albatross sleeping,” wrote one reviewer, who wasn’t named in the summary of comments. “If it doesn’t sleep at all on the journey, this should be stated. Otherwise the question is unanswerable because it asks for the average speed that the bird ‘flew,’ which implies only time in motion and not time resting.”
Academic reviewers were complaining about other kinds of quality problems, too.
In a conference call with College Board staff on March 13, 2014, Deborah Hughes Hallett of the Harvard Kennedy School and Roxy Peck of California Polytechnic State University spoke out.
“Both of our committees have expressed concern about the quality of the items that are being brought to the SAT Committee for review,” they said during the call, according to a transcript. “These items have included many that are not mathematically sound, that have incorrect answers, or that are not accurate or realistic in terms of the given context.”
Hallett did not respond to inquiries for this article. Peck, in a written response, called the redesign “a huge undertaking” that “required the authoring of a large number of mathematics items in a short period of time.”
Bill Trapp, a senior director in the design effort, was among those who received the detailed critique written by Dan Lotesto, the reviewer in Wisconsin. Lotesto wasn’t an outlier, Trapp wrote to fellow College Board executives.
“Many of these comments reflect what we have seen from current committee members,” Trapp wrote.
“NEW SAT, NEW PROBLEMS”
In early 2015, about a year before the test would debut, the College Board offered the public a preview of the types of questions the new exam would feature.
The sneak peek prompted a column on The Atlantic magazine’s website by James Murphy, a tutoring manager for the Princeton Review.
In “New SAT, New Problems,” Murphy focused on a set of sample math questions. His conclusion was summed up by the article’s secondary headline: “The questions, particularly those in the math sections, could put certain students at a disadvantage.”
Murphy wrote that “the most vulnerable students are those who live in low-income areas or don’t speak English as a first language.” They would struggle, he reasoned, with the math section’s new emphasis on word problems.
The day after Murphy’s piece ran, the College Board set out to look for data to “refute the author’s claims,” according to a Jan. 21, 2015, email reviewed by Reuters. The email was sent to members of the math development team by Laurie Moore, a College Board executive director.
Over the next six hours, College Board officials exchanged emails about the Murphy piece. By then, they appear to have set aside their initial concern that Murphy had misrepresented the new exam. Instead, they worried he may have understated the problem.
At issue: the length of the math word problems. In her email, College Board executive Moore asked about the length of the word problems in the math sections of four new SAT practice tests the College Board planned to release soon to the public.
Alfaro, still employed by the College Board at the time, responded that, of the questions in context, “about 45% are heavy and the rest are light/medium.” Heavy questions exceeded 60 words in length.
Alfaro’s assessment appeared to surprise his boss, Sherral Miller.
“Wow,” Miller wrote in reply. “We had changed that to 10% heavy in the specs given the timing studies. How did we get 45% of them being heavy?”
Miller was referring to the planned revisions to the word-count mix that the College Board had resolved to make the previous summer. But the College Board never followed through on its plan to reconfigure the exam, despite the timing study’s findings and the reviewer complaints. Instead of 10 percent of the math questions being “heavy,” greater than 60 words, almost half remained that long, according to the January 2015 emails.
It’s unclear why the College Board failed to address the issue. The organization wouldn’t make the project’s top leader, Schmeiser, available for an interview.
In a written statement to Reuters, Alfaro – one of the key executives involved in remaking the math sections – said the College Board didn’t have the time or resources to rectify the mix of long, medium and short questions.
The College Board had already delayed the SAT’s launch by a year, and had promised universities and school districts the new exam would debut in March 2016. In early 2015, just 14 months remained until launch. The pool of questions for the new SAT wasn’t large enough and could not “be reconfigured overnight to support new specifications that call for significantly different item distributions,” Alfaro said.
College Board spokeswoman Riley said Alfaro’s “claims about our internal processes for developing the math section of the SAT are inaccurate.” She referred to Alfaro as “a former short-term employee, who after being dismissed from his job demanded millions of dollars in payment and threatened to do damage to the College Board.”
Alfaro was involved in the development of the math sections for 21 months, or more than half the period from the start of the redesign until the new test’s debut.
“HEAVY” WORD COUNTS
To determine whether the new math test was modified to include fewer lengthy word problems, Reuters analyzed the six practice SAT exams posted online by Khan Academy, the official practice-testing partner of the College Board. The College Board says on its website that questions on those tests are “reviewed and approved by the people who develop the SAT,” and Khan calls them “real, full-length SAT practice tests.” Two of the exams, in fact, were administered as actual tests earlier this year before being released as practice material.
Reuters counted the number of words in each of the contextualized math questions on the six exams. At least 45 percent of the math word problems in each exam exceeded the “heavy” threshold of 60 words. That means none of the tests met the College Board’s revised target specifying that just 10 percent of the questions be heavy – the “specs” cited by Miller in her January 2015 email. In fact, the math sections of all six exams were consistent with the prototype – the test that, according to the 2014 study, only about half of students could finish.
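A check of this kind can be sketched in a few lines. The sketch is not the code Reuters used: the word counts in it are made-up placeholder values rather than figures from any actual exam, and the function names are hypothetical. Only the 60-word threshold and the 10 percent target come from the documents described above.

```python
HEAVY_THRESHOLD = 60   # words; items longer than this count as "heavy"
REVISED_TARGET = 0.10  # revised specification: 10 percent of items heavy


def heavy_share(word_counts: list[int]) -> float:
    """Fraction of contextualized items whose word count exceeds 60."""
    if not word_counts:
        raise ValueError("need at least one item")
    heavy = sum(1 for n in word_counts if n > HEAVY_THRESHOLD)
    return heavy / len(word_counts)


def meets_revised_target(word_counts: list[int]) -> bool:
    """True if the exam's heavy share is at or below the 10 percent target."""
    return heavy_share(word_counts) <= REVISED_TARGET


# Hypothetical word counts for one exam's word problems (placeholder data).
example_counts = [72, 35, 88, 41, 64, 58, 95, 30, 67, 49]

share = heavy_share(example_counts)
print(f"heavy share: {share:.0%}")                       # prints heavy share: 50%
print("meets target:", meets_revised_target(example_counts))  # prints meets target: False
```

Running the same two functions over each exam's real word counts would show whether any exam fell at or below the 10 percent target.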
Alfaro, who was fired shortly after the January 2015 email exchange, wrote about the issue on the social network LinkedIn on August 27, a day after the FBI raided his apartment. He alleged the College Board failed to follow its own specifications for word counts in the new math sections.
The formulation of the new exam leaves the College Board with a dilemma, testing specialists say.
The new math sections could be reconfigured to contain fewer long questions, in line with the revised specs. But that would create a new problem. Future scores from revised exams may not be comparable to scores from the earlier versions with the lengthier questions, said Koretz, the Harvard professor.
“There’s no good solution when you come up with a problem like this,” he said.
By Renee Dudley
Photo editing: Barbara Adhiya
Edited by Blake Morrison