The dirty secret of the standardized testing industry is the breathtakingly low quality of the tests themselves. I worked in the educational publishing industry at very high levels for more than twenty years. I have produced materials for all the major textbook publishers and most of the standardized test publishers, and I know from experience that quality control processes in the standardized testing industry have dropped to such low levels that the tests, these days, are typically extraordinarily sloppy and neither reliable nor valid. They typically have not been subjected to anything like the validation and standardization procedures used, in the past, with intelligence tests, the Iowa Test of Basic Skills, and so on. The mathematics tests are marginally better than are the tests in ELA, US History, and Science, but they are not great. The tests in English Language Arts are truly appalling. A few comments about those:
The new state and national standardized tests in ELA are invalid.
First, much of attainment in ELA consists of world knowledge–knowledge of what–the stuff of declarative memories of subject matter. What are fables and parables, in what ways are they similar, and in what ways do they differ? What are the similarities and differences between science fiction and fantasy? What are the parts of a metaphor? How does a metaphor work? What is American Gothic? What are its standard motifs? How is it related to European Romanticism? How does it differ? Who are its practitioners? Who were Henry David Thoreau and Mary Shelley, what major work did each write, and why is that work significant? What is a couplet? terza rima? a sonnet? What is dactylic hexameter? What is deconstruction? What is reader response? the New Criticism? What does it mean to begin in medias res? What is a dialectical organizational scheme? a reductio ad absurdum? an archetype? a Bildungsroman? a correlative conjunction? a kenning? How are Realism and Naturalism related, and how do they differ? Who the heck was Samuel Johnson, and why did he suggest kicking that rock? What does K, in The Trial, know of the charges against him? What do the cast shadows in The Allegory of the Cave represent? Why did the Party, in Orwell’s 1984, introduce Newspeak? And so on. The so-called “standards” being tested cover ALMOST NO declarative knowledge and so miss much of what constitutes attainment in this subject. Imagine a test of biology that left out almost all world knowledge and covered only biology “skills” like–I don’t know–slide-staining ability–and you’ll get what I mean here. This has been a MAJOR problem with all of these summative standardized tests in ELA since their inception. They are almost entirely content free. They don’t assess what students ought to know. Instead, they test, supposedly, a lot of abstract “skills”–the stuff on the Gates/Coleman Common [sic] Core [sic] bullet list, but as we shall see below, they don’t even do that.
Second, much of attainment in ELA involves mastery of procedural knowledge–knowledge of what to do. E.g.: How do you format a Works Cited page? How do you plan the plot of a standard short story? What step-by-step procedure could you follow to do that? How do you create melody in your speaking voice? How do you revise to create sentence variety or to emphasize a particular point? What specific procedures can you carry out to accomplish these things? But when making their list of “skills” kids need to have, the authors of these “standards” didn’t think that concretely, in terms of specific, concrete, step-by-step procedural knowledge. Instead, in imitation of the lowest-common-denominator-group-think state “standards” that preceded theirs, they chose to deal in vague, poorly conceived abstractions. The “standards” being tested define skills so vaguely and so generally that they cannot, as written, be sufficiently operationalized to be VALIDLY tested. They literally CANNOT be, as in, this is an impossibility on the level of building a perpetual motion machine or squaring the circle. Given, for example, the extraordinarily wide variety of types of narratives (jokes, news stories, oral histories, tall tales, etc.) and the enormous toolkit of procedural knowledge that it takes to be able to produce narratives of various kinds (writing believable dialogue, developing a conflict, characterization via action, characterization via foils, showing not telling, establishing a point of view, using speaker’s tags properly, creating suspense, etc.), there can be no single question or prompt that tests for narrative writing ability IN GENERAL. This is a broad problem with the standardized ELA tests.
Typically, they ask one or two multiple-choice questions per “standard.” But what one or two multiple-choice questions could you ask to find out if a student is able, IN GENERAL, to “make inferences from text” (the first of the many literature “standards” at each grade level in the Gates/Coleman bullet list)? Obviously, you can’t. There are three very different kinds of inference–induction, deduction, and abduction–and whole sciences devoted to problems in each, and texts vary so considerably, and types of inferences from texts do as well, that no such testing of GENERAL “inferring from texts” ability via one or two questions is even remotely possible.
Let’s consider another example of this sort of invalidity, just so you can be sure that you clearly understand the issue. At Grades 9 and 10, the Gates/Coleman bullet list of skills for ELA contains 80 standards (including progressive language standards from other grade levels), as well as breakdowns of some of the writing standards into sub-standards. A typical standardized bubble test in ELA will consist of about 30 multiple-choice questions and one or two short writing prompts. So, there are fewer test questions than there are standards, meaning that some of the standards are not tested, and those that are, are tested by 1 question. (When testing companies assign to writers the job of creating questions for these tests, the assignment is typically to create one or two questions for each in a list of standards). Now, here’s a problem: each of the “standards” is a statement of a very broad skill. Here’s an example (I’ve chosen, here, from the Gates/Coleman bullet list one of the few “standards” that isn’t utterly vague and abstract):
Consult general and specialized reference materials (e.g., dictionaries, glossaries, thesauruses), both print and digital, to find the pronunciation of a word or determine or clarify its precise meaning, its part of speech, or its etymology.
Now, imagine that you are a question writer for one of these tests. You get a list of 30 standards (a small portion of the standards for the grade level) and are told to write one question for each standard to test whether the student is GENERALLY PROFICIENT in the standard. Try this. I’m serious; literally, try doing this. Try writing ONE multiple-choice question that will validly test whether a student is proficient in this standard IN GENERAL. The ONE multiple-choice question will have to test the student’s ability IN GENERAL, to find word pronunciations, to determine word meanings, to clarify word meanings, to find the part of speech of words, to find the etymologies of words, and to do this in both digital and print media and in both general and specialized reference materials, including dictionaries, glossaries, and thesauruses. Obviously, no one multiple-choice question would be able to test for ability to do all this IN GENERAL. The question would be an invalid, woefully incomplete and random measure of one instance of one part of the general “skill.” And so it would be for each question for each “standard.” Of course, all this invalidity in the testing for proficiency in each “standard” can’t add up to overall validity, so, the tests do not even validly test for what they purport to test for. It’s as though I claimed to be able to test whether someone has the knowledge of French, of French culture, and of international law and diplomacy to be an excellent U.S. ambassador to France by asking if he or she had ever eaten gougères.
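To put rough numbers on the thought experiment above, here is a minimal sketch in Python. It treats the facets named in the quoted standard as independent dimensions and counts the combinations. That framing is an assumption made for illustration, not something the standard or any test publisher specifies, and a few of the combinations (finding a pronunciation in a thesaurus, say) would rarely come up, so read the count as an upper bound. The figures of 80 standards and about 30 questions plus one or two prompts come from the text.

```python
from itertools import product

# Facets named in the quoted standard. Treating them as independent
# dimensions is an illustrative assumption, not part of the standard.
tasks = ["find pronunciation", "determine meaning", "clarify meaning",
         "find part of speech", "find etymology"]
media = ["print", "digital"]
scopes = ["general", "specialized"]
references = ["dictionary", "glossary", "thesaurus"]

combinations = list(product(tasks, media, scopes, references))
print(len(combinations))  # 5 * 2 * 2 * 3 = 60

# One multiple-choice item samples a single combination.
share_sampled = 1 / len(combinations)
print(f"{share_sampled:.1%}")  # 1.7%

# Coverage at the test level, using the figures given in the text:
# 80 standards, about 30 multiple-choice items plus at most 2 prompts.
standards = 80
items = 30 + 2
max_share_of_standards = items / standards
print(f"{max_share_of_standards:.0%}")  # 40%
```

Even on the generous assumption that every item maps cleanly to one standard, at most 40 percent of the standards get a question at all, and each question touches one corner of the skill it is supposed to measure.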
Third, nothing that students do on these exams even remotely resembles what real readers and writers do with real texts in the real world. Ipso facto, the tests cannot be valid tests of actual reading and writing. People read for one or both of two reasons—to find out what an author thinks or knows about a subject and/or to have an interesting, engaging, significant vicarious experience. The tests, and the curricula based on them, don’t serve the instructional purpose of helping students to do either. Imagine, for example, that you wish to respond to this post, but instead of summarizing what I have said and then agreeing or disagreeing with it and explaining why, you are limited to explaining how my use of figurative language (the tests are a miasma) affected the tone and mood of my post. See what I mean? But that’s precisely the kind of thing that the writing prompts on the Common [sic] Core [sic] ELA tests do and the kind of thing that one finds, now, in ELA courseware. This whole testing enterprise has trivialized responding to texts and therefore education in the English language arts generally by pushing out the important stuff and concentrating on a few of the random, minor, incidental aspects of literary works and literary experiences. It’s as though I offered a course in how to sail that consisted ENTIRELY of learning how to polish the brightwork and memorizing the names and functions of parts of the rigging. The modeling of curricula on the all-important tests has replaced normal interaction with texts with such random, freakish, contorted, minor, incidental, scholastic fiddle-faddle. English teachers should long ago have called BS on this.
Fourth, a lot of attainment in ELA is not about explicit learning, at all, but, rather, about acquisition via automatic processes. So, for example, your knowledge (or lack thereof) of explicit models of the grammar of your native tongue (from the primitive folk grammar enshrined in the Gates/Coleman “standards” to a sophisticated contemporary scientific model like that of minimalist program syntax) has almost nothing to do with your internalized grammar of the language or with the acquisition of that grammar, which was acquired unconsciously. For example, you have intuited, have acquired elaborate and specific rules governing the order of precedence of adjectives in English, but unless you learned English as a second language in second language classes, you did not acquire these rules through direct, explicit instruction. And the same is true for ALL BUT AN INSIGNIFICANT AMOUNT of the actual grammar that you employ when you speak, listen, read, and write. But the ELA standardized tests and the “standards” on which they are based were conceived in blissful ignorance of this (and of much else that is now known about language acquisition). Imagine new “standards” for naval operations that warned ship captains not to sail off the edge of the world. Well, the Gates/Coleman bullet list is like that; the language “standards,” in particular, are prescientific.
Fifth, standard standardized test development procedures require that the testing instrument be validated. Such validation requires that the test maker show that results for the test and for particular test items or test item types correlate strongly with other accepted measures of what is being tested. No such validation has been done for any of the new generation of state and national standardized ELA tests. None. And, given the vagueness of the “standards,” none could be. Where is the independent measure of proficiency on Common Core State Standard ELA.11-12.4b against which the items on the state and national measures have been validated? Answer: There is no such measure. None. So, the tests fail to meet a minimal standard for high-stakes standardized assessments–that they have been independently validated.
The test formats are inappropriate.
The state and national tests consist largely of objective-format items (multiple-choice and so-called evidence-based selected response items, or EBSR). On these tests, such item formats are pressed into a kind of service for which they are, generally, not appropriate. They are used to test what in EdSpeak is called “higher-order thinking.” The test questions therefore tend to be tricky and convoluted. The test makers, these days, all insist on all the multiple-choice distracters, or incorrect responses, being “plausible.” The student is to choose the “best” answer from among a list of plausible answers. Well, what does plausible mean? It means “reasonable.” In other words, on these tests, many reasonable answers are, BY DESIGN, wrong answers! So, the test questions are often, or even typically, extraordinarily complex and confusing and tricky–extraordinarily difficult for kids to answer, because the “experts” who designed these tests didn’t understand the most basic stuff about creating assessments, for example, that objective question formats are generally not great for testing so-called “higher-order thinking” and are best reserved for testing straight recall. Given the complexities of interpretation (consider, for example, the arguments over the meanings of Biblical passages), use of these inappropriate formats, coupled with the sloppiness of the test-creation procedures, results in question after question where there is, arguably, no correct answer among the answer choices given or more than one choice that is arguably correct. Often, the question is written so badly that it is arguably not answerable given the actual question stem and text provided, as opposed to what the writer of the question thought that he or she was asking. I did an analysis of the sample released questions from a recent FSA ELA practice exam and demonstrated that such was the case for almost all the questions on the exam, so sloppily had it been prepared.
But I can’t release that for fear of being sued by the scam artists who peddle these tests to people who aren’t even allowed to see them. Hey, I’ve got some great land in Flor-uh-duh. Take my word for it. Available cheap, really cheap, but not available for inspection.
The tests are diagnostically and instructionally useless.
Many kinds of assessment—diagnostic assessment, formative assessment, performative assessment, some classroom summative assessment—have instructional value. They can be used to inform instruction and/or are themselves instructive. The results of the high-stakes standardized tests are not broken down in any way that is of diagnostic or instructional use. Teachers and students cannot even see the tests to find out what students got wrong on them and why. The results always come too late to be of any use, anyway. So, the tests are of no diagnostic or instructional value. None. None whatsoever.
The arbitrary setting of cut scores invalidates use of the tests for comparison across student populations.
States typically post the percentage of students achieving at various levels, but those levels are set by arbitrary choices of cut scores. I once graphed the cut scores for New York ELA and Mathematics standardized tests over several decades. The cut scores jumped around DRAMATICALLY–imagine gerbils on methamphetamines. In some years, the cut-off for passing in mathematics was only barely above what students would have got if they had chosen their answers RANDOMLY. The only possible conclusion: The state and its co-conspirators, the testing companies, chose their cut scores according to the results that they want to see in a given year. If this were a year in which the state department was making a pitch for some new Magic Elixir in education, the state education authorities would want low scores and set their cut scores high. See, we really need this “Reform” because the scores are so low. If this were a year in which the state department wanted to show that its Magic Elixir was working, they would set their cut scores low. Better stay the course with that Magic Elixir! And the geniuses at the U.S. Department of Education Deformation, formerly the USDE, would accept this con prima facie.
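To see what “only barely above” random guessing means in practice, here is a minimal sketch in Python. The test length, the number of answer choices, and the cut scores are all hypothetical, chosen only to illustrate the arithmetic; they are not the New York figures. The function uses the standard binomial formula for the chance that pure guessing reaches a given cut score.

```python
from math import comb

def chance_of_passing(num_items: int, num_options: int, cut_score: int) -> float:
    """Probability that pure random guessing yields at least cut_score correct."""
    p = 1 / num_options  # chance of guessing any one item correctly
    return sum(
        comb(num_items, k) * p**k * (1 - p) ** (num_items - k)
        for k in range(cut_score, num_items + 1)
    )

# Hypothetical test: 40 four-option items, so guessing averages 10 correct.
num_items, num_options = 40, 4
expected_by_chance = num_items / num_options
print(expected_by_chance)  # 10.0

# Compare a cut score just above chance with two higher ones.
for cut in (11, 14, 20):
    print(cut, round(chance_of_passing(num_items, num_options, cut), 3))
```

The closer the cut score sits to the chance average, the larger the share of students who could “pass” without reading a single question, which is why a cut score near that line says nothing about proficiency.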
The tests have enormous opportunity costs.
I conservatively estimate that, nationwide, schools are now spending a third of the school year on state standardized tests. That time includes the actual time spent taking the tests, the time spent taking pretests and benchmark tests and other practice tests, the time spent doing test prep materials, the time spent doing exercises and activities in textbooks and online materials that have been modeled on the test questions in order to prepare kids to answer questions of those kinds, and the time spent on reporting, data analysis, data chats, data walls, proctoring, and other test housekeeping. That’s lost instructional time–all of it.
The tests have enormous direct, incurred costs.
Typically, the US spends about two billion per year under direct contracts for state standardized testing. The PARCC contract by itself was worth over a billion dollars to Pear$on in the first three years, and you have to add the cost of SBAC, the tests produced by AIR, and the other state tests to that. But that’s the least item on the cost ledger for standardized testing. No one, to my knowledge, has accurately estimated the cost of the computer upgrades that were (and continue to be) necessary for online testing of every child, but those costs vastly exceed the amount spent on the tests themselves. (In many schools, media centers aren’t available for much of the year for kids to do research or other work on the computers because they are taken over for testing or practice testing.) Then add the costs of test prep materials and staff doing proctoring and data chats and so on. Then add the costs of new curricula that have been dumbed down to be test preppy. Billions and billions and billions. This is money that could be spent on stuff that matters—on instructional materials and supplies, on classroom libraries, on making sure that poor kids have eye exams and warm clothes and notebooks and food in their bellies, on making sure that libraries are open and that schools have nurses on duty to keep kids from dying. How many dead kids is all this testing worth, given that it is, again, invalid as assessment and of no diagnostic or instructional value?
Note to those who work in state government: want to save many millions, annually, in your education budget–money that can be put to constructive use? Then kill the standardized testing.
The tests dramatically distort curricula and pedagogy.
The tests drive how and what people teach and much of what is created by curriculum developers. These distortions are grave. In U.S. curriculum development today, the tail is wagging the dog. To an enormous extent, we’ve basically replaced traditional English curricula with test prep. Where before, a student might open a literature textbook and study a coherent unit on The Elements of the Short Story or on The Transcendentalists, he or she now does random exercises, modeled on the standardized test questions, in which he or she “practices” random “skills” from the Gates/Coleman bullet list on random snippets of text. There’s enormous pressure on schools to do all test prep all the time because school and student and teacher and administrator evaluations depend upon the test results. Every courseware producer in the U.S. now begins every ELA or math project by making a spreadsheet with a list of the “standards” in the first column and the place where the “standard” will be “covered” in the other columns. And since the standards are a random list of vague skills, the courseware becomes random as well. I call this the “Monty Python and Now for Something Completely Different” approach to curriculum development. The era of coherent, well-planned ELA curricula is gone. I won’t go into detail about this, here, but this is an ENORMOUS problem. As a textbook developer for many decades, I watched this devolution occur. Many of the best courseware writers and editors I know have quit in disgust at this. The testing mania has brought about devolution and trivialization of our methods and materials. (My advice: throw out your appalling Common Cored literature and composition textbooks–crap like the Pear$on My Perspectives program–and replace them with ones written twenty years ago. The older ones are FAR more coherent and substantive.)
The tests are abusive and demotivating.
Our prime directive as educators should be to nurture intrinsic motivation in order to create independent, life-long learners. The tests create climates of anxiety and fear. Ask any teacher what it’s like for his or her students come standardized test time, from the elementary school students actually throwing up before or during the test to the high-school students threatening suicide because if they don’t pass this test, they won’t graduate. And, of course, science and common sense teach that extrinsic punishment and reward systems like this testing system are highly DEMOTIVATING for cognitive tasks.
The summative standardized testing system is a backward extrinsic punishment and reward approach to motivation. It reminds me of the line from the alphabet in the Puritan New England Primer, the first textbook published on these shores:
The idle Fool
Is whip’t in school
The tests have shown no positive results; they have not improved outcomes, and they have not reduced achievement gaps.
We have been doing this standards-and-standardized-testing stuff for more than two decades now. Richard Rothstein, the economist and statistician, has shown that turning our nation’s schools into test prep outfits has resulted in very minor increases in overall mathematics outcomes (increases of less than 2 percent on independent tests of mathematical ability) and NO IMPROVEMENT WHATSOEVER in ELA. Simply from the Hawthorne Effect, we should have seen some improvement. Rothstein also showed that even if you accept as valid the results from international comparative tests, if you correct for the socioeconomic level of the students taking those tests, US students are NOT behind those in other advanced, industrialized nations. So, the rationale for the testing madness was false from the start. The issue is not “failing schools” and “failing teachers” but POVERTY. We have a lot of poor kids in the US, and those kids take the tests in higher numbers than elsewhere. Arguably, all the testing we’ve been doing has actually decreased outcomes, which is consistent with what we know about the demotivational effects, for cognitive tasks, of extrinsic punishment and reward systems. Years ago, I watched a seagull repeatedly striking at his own reflection in a plate glass window, until I finally drove him away to keep him from killing himself. Whatever that seagull did, the one in the reflection kept coming back for more. It’s the depth of stupidity to look at a clearly failed approach and to say, “Gee, we should do a lot more of that.” But that’s just what the Gates-funded disrupters of U.S. education–those paid cheerleaders for the Common [sic] Core [sic] and testing and depersonalized education software based on the Core [sic] and the tests–are asking us to do. Enough.
In state after state in which the new generation of standardized tests has been given, we have seen enormous failure rates. In the first year, fewer than half the students at New Trier, Adlai Stevenson, and Hinsdale Central–often touted as the best public schools in Illinois–passed the new PARCC math tests. In New York, in the first year of PARCC, 70% of the students failed the ELA exams and 69% the math exams. In New Jersey, 55% of students in grades 3-8 failed the new state reading test, and 56% the new math test. The year after, Florida delayed and delayed releasing the scores for its new ELA and math exams. Then they announced that they were going to release only T-scores and percentiles but were still working on setting cut scores for proficiency. LOL. Criterion-referenced testing, as opposed to norm-referenced testing, is supposed to set absolute standards that students must meet in order to demonstrate proficiency. I suspect that what happened that year in Florida–the reason for the resounding silence from the state–is that the scores were so low that they couldn’t set cut scores at any reasonable level without having everyone fail.
Decades of federally mandated high-stakes testing haven’t improved outcomes and haven’t reduced achievement gaps. NAEP results improved a tiny, tiny bit in the first years of the testing because when you teach kids the formats of test questions, their scores will improve slightly. Then, after that, NAEP results went FLAT. No improvement, whatsoever, for a decade and a half. But the testing has had results: it has trivialized ELA curricula and pedagogy and wasted enormous resources that could have been used productively elsewhere.
Why the Scam ELA Tests Persist
So, if the federally mandated ELA tests are as bad as all this, why do they persist? Why do so many journalists, politicians, and bureaucrats still insist upon the annual administration of the snake oil? Worse yet, how is it possible that even some English teachers (some; I describe these as the few, the proud, the incompetent) still take the tests at all seriously? Well, even though the tests don’t actually validly measure attainment of the so-called “standards,” and even though they don’t cover much of what reading and writing in English is about, they are written in English, and kids who can read at all do somewhat better on them than those who can’t. So, the tests act as a crude, but very, very, very crude sorter, well aligned with ZIP Codes and therefore gratifying to upper-middle-class parents with influence. Of course, ANY test written in English that you gave to all kids–one on dirigible driving or growing hydroponic basil, for example–would be just as good as a crude sorter. But given how bad the tests are at sorting, and given that we already have much better–more accurate and more informative–means for doing that, such as grades in English classes, assigned by teachers, based on lots and lots of far more directed and specific measures, and given the terrible costs of the standardized tests and of the stakes attached to them, why persist, at all, in the folly of giving them? By the way, grades given in classes by high-school teachers have always been better predictors of college success than is the SAT in its various incarnations, including the new Common Cored one, which Coleman should have named the SCAT (the Scholastic Common-Cored Achievement Test).
The test makers are not held accountable.
All students taking these tests and all teachers administering them have to sign forms stating that they will not reveal anything about the test items, and the items are no longer released, later, for public scrutiny, and so there is no check whatsoever on the test makers. They can publish any sloppy crap with complete impunity. I would love to see the tests outlawed and a national truth and reconciliation effort put into place to hold the test makers accountable, financially, for the scam they have been perpetrating. I would love to testify about the test items before such a committee. Revelation to the public of what the test makers have been up to would be, should be, a national scandal.
Anyone who supports or participates in this testing is committing child abuse. Have you proctored these tests and seen the kids squirming and crying and throwing up? Have you seen them FURIOUS afterward because of the trickiness of the tests? I have.
Standardized testing is a vampire. It sucks the lifeblood from our schools. Put a stake in it.
NB: I would love to be able to post, here, analyses of the sample release questions from ELA tests by the major companies, but I can’t because I would be sued. However, it’s easy enough to show that most of the questions are so badly written that AS WRITTEN, they don’t have a single correct answer, have more than one arguably correct answer, or are unanswerable.
It’s time to make the testing companies answerable for their rapacious duplicity and for stealing from an entire generation of kids the opportunity for humane education in the English language arts.
For more of my writing about Education “Reform,” go here: https://bobshepherdonline.wordpress.com/category/ed-reform/
I’ve written a lot about Ed Reform over the years, and especially about its effects on U.S. pedagogy and curricula. If you don’t want to wade through all that, good places to start are here: