Chapter 8
From Testing Malaise and School Accountability to Neo-Vygotskian Approaches (1981-2002).
Paul F. Ballantyne
Starting in 1981 a considerable professional malaise regarding the content and procedural aspects of ability testing was in evidence. [172] The content aspects (such as how to define human intelligence, and how to guide the design of better ability tests) had long been under debate and were now regarded by some as unresolvable. The procedural aspects, however (such as how to best apply or interpret the results of already existing standardized tests in accordance with recent legal and professional guidelines), were under renewed disciplinary, legal, and media scrutiny. This testing malaise became particularly acute with regard to higher-profile issues such as: (1) addressing the professional implications of ongoing class and racial bias of existing tests, and (2) obtaining expert opinion on the newer, largely extradisciplinary, debate over the fairness of existing higher educational admissions tests.
Between 1981 and 1983 it appeared that the use of standardized testing (at least in the arena of higher education) would be severely curtailed due to the efforts of testing critics who had already exposed ETS tests (like the SAT and GRE) as a specialized form of fraud (Nairn, 1980). Had the expected downsizing of test use by universities actually occurred (voluntarily or by way of legal action), it would have necessitated a brief uncomfortable period of disciplinary adjustment and theoretical housekeeping in psychology (Mayr, 1991). This adjustment would then have been written up in the history books as a logical outcome of hard-won battles for social justice during the previous era of equality (1964-1980), in which all of the necessary federal protections, funding arrangements, and legal precedents had been set down to promote both fair treatment in the workplace and equality of opportunity to education. [173]
What was supposed to be a short period of apprehension, however, turned out to be a protracted period of increasingly general discomfort. The initial malaise over testing, that is, did not dissipate but grew steadily worse as the federal government (under Reagan and Bush) ushered in the so-called era of public school accountability (1981-2002). Despite both the veracity of past testing critiques and costly Equality era litigation (which stunted the growth of vocational and educational testing during the mid-late 1970s) an unanticipated expansion of standardized tests in public schools went forward unabated by debate over the ethics of applying a free market model of competition to the nation's schools. ETS testing for higher education was also thereby granted a new lease on life.
The predominant mid-1980s disciplinary reaction to this politically mandated Accountability era was what Paul Kline's Psychology Exposed (1988) has called tactical eclecticism. Comments on the significance of ongoing "race differences" in test performance or on the origins of human intellect itself, and especially on the ethics of expanding testing in public education became intentionally tentative, subdued, ambiguous, and noncommittal. The logical and disciplinary contradictions between the content aspects of testing history and the ongoing procedural-administrative use of tests (to assess students, schools, and teachers) were disturbing to be sure, but what to do about them in the wider politically correct climate of accountability was treated as an entirely different matter. [174]
While grappling with these disciplinary dichotomies, Linn (1986) concluded:
"Although testing has served...important educational functions, we are still a long way from reaching Scarr's (1981) goal of ensuring that testing is 'always...used in the interests of the children tested'...This goal implies the simultaneous pursuit of excellence and equity. Achieving this goal will require a better scientific understanding of [human intellectual development] and the construct validity of our measures of those processes" (p. 1159).
Another important analytical bridge between the two sets of testing discourses was provided in Thorndike & Lohman's A Century of Ability Testing (1990). As is usual with such insider histories, that book (for the most part) highlighted the procedural aims and accomplishments of past testing traditions. [175] Yet Thorndike & Lohman also ended their celebratory account by recognizing the potential of Neo-Vygotskian approaches for expanding the content and procedural aspects of ability testing.
While such attempts at inclusion from within the testing subdiscipline were the exception up to that time, by the late 1990s legal and legislative events again intervened to create a crisis which impacted upon both aspects of testing. That is, when various states began to outlaw Affirmative Action (the primary institutional means of ensuring ethnic diversity on college campuses), originally adopted as a corrective device for the bias of standardized admissions testing, the existing testing malaise became further exacerbated and generalized. As Richards (1997) put it, the sustained disciplinary cult of impotence had finally caught up with us.
With a few notable exceptions (e.g., the interactionist accounts in Sternberg & Detterman, 1986; and various under-recognized Neo-Vygotskian approaches), the expected serious and communal act of theoretical housekeeping (regarding the content and procedural aspects of testing) was tactfully deferred until after the Accountability era played itself out. [176] Having now arrived at that historical juncture, we must ask: Were the sporadic revised interactionist accounts of human intellect (in 1980s-90s book-length anthologies) and the new consensus on testing procedures (as outlined in special issues of American Psychologist) sufficient to address the current disciplinary crisis of relevance? The answer, I think, is no.
The saving grace is that alternative solutions have been put forward. The best of these have not only criticized the interactionist attempt at historical synthesis as unconvincing and inadequate to the requirements of a truly democratic society but have also put forward prescriptions for improving the state-of-the-art of testing itself. The classical Vygotskian approach to human intellect, in particular, is utilized here to indicate where the next wave of assessment techniques is likely to come from and to what theoretical or practical aims they might be addressed.
In contradistinction to insider aims of testing for testing's sake, or of limiting discussions to administrative utility (rather than content), these aims will surely include: (1) outlining a truly developmental approach to intellectual assessment; (2) establishing a lawful approach to vocational selection or training policies; (3) setting up a Constitutional approach to ethnic diversity in higher education; and (4) providing defensible means of curricular assessment in public education. Interactionism and its long-standing system of government-sanctioned psychometric testing procedures have failed us on all four of these fronts.
Chapter Overview
Section one details how the proposal and implementation of the so-called era of public school accountability were driven by the free market corporate-oriented ideology of the federal government under presidents Reagan and Bush (1981-1991) and not by the historical facts of psychometric testing outlined in prior chapters. The impacts of A Nation at Risk (1983) and of the Draconian 1988 amendments to the Elementary and Secondary Education Act (1965) are both outlined. Other state and federal developments including: (1) funding for Charter school programs during the
Section two highlights more subdisciplinary and disciplinary concerns. The strengths and weaknesses of revised (a.k.a. dynamic) interactionism as a description of human intellect are outlined. It is pointed out that under revised interactionism, so-called expert opinion (as evidenced in the content of 1980s-90s anthologies) had hardly changed from the traditional (yet highly problematic) interactionist position of forty years earlier. In other words, the genes plus environment fallacy was still held and was reflected in both: (1) individualist (rather than social or societal) portrayals of human intellect; and (2) the additive (rather than transformative) empirical testing procedures being utilized. The approaches of Zigler (1986), Snyderman & Rothman (1988), Flynn (1984-94), and Gould (1996) are all mentioned in this regard.
Finally, in section three, the utility of Vygotsky & Luria's cultural-historical account of mentality is highlighted not only as (1) a means to better understand the origins of the "Flynn effect" of rising average test scores; but also as (2) an indication that we must explicitly adopt a heretofore under-utilized transformative (Neo-Vygotskian) account of human mentality if we are ever to improve upon contemporary standards of ability testing.
Section One:
Cultural and Testing Malaise during the 1980s & 1990s
This section covers a period in which: (1) the public schools were used as a political scapegoat for an ailing American society; and (2) the progressive spirit of the Elementary and Secondary Education Act (1965) was compromised along the regressive lines of conservative corporate-oriented ideology in order to mandate an unjust and expensive era of high-stakes testing. Given that the so-called era of accountability was dictated by political ideology and not by the proven empirical worth of existing standardized testing procedures, we should know something about the federal administrations that brought it into being before outlining the means by which it was implemented.
From Carter to Reagan: malaise to the self-deceit of easy illusion
As the 1980 presidential election approached, the country was in a state of profound gloom. [177] This national malaise played into the hands of the right-of-center Republican Ronald Reagan who campaigned against both Jimmy Carter and against the Democrats as the party of failed solutions (Neustadt, 1990). Reagan's skill at confidently proposing simple-minded solutions to complex societal and economic problems was first learned during a long career in the entertainment industry and as a public relations officer for General Electric. This glowing public persona had served him well during two terms as the Governor of California and (in 1980) it was just what a sizable majority of the voting public wanted in their next president. To this group of voters, the substance of Reagan's campaign (aside from his promise of no new taxes) didn't matter. Given the choice between ineffectual leadership in
Reagan's conservative convictions, however, and his so-called grassroots citizen politics were particularly appealing to the religious right. Reagan, who was both divorced and only nominally religious, garnered considerable support among a movement of well-organized fundamentalist Christians who believed the nation to be in a state of moral degeneration. [178] In August, 1980, Reagan appeared in
Reagan promised to restore American government to its pre-depression era of simplicity (i.e., when individualism, self-reliance, and a free market economy were the American way of life). [179] The two preceding decades, in particular, Reagan argued, were a period of egalitarian excess and of excessive cuts to national defense industries (Amaker, 1988; Johnson, 1991). According to the gospel of supply-side (a.k.a. trickle-down) economics, tax cuts and government downsizing would promote increased investment which would in turn stimulate economic growth and higher tax revenues (Canto et al., 1983). An anti-tax message and a plan to roll back welfare were at the core of his successful 1980 campaign. [180]
Reagan on education
In the area of education Reagan was an opponent of the so-called public school monopoly. On the occasion of the release of his policy statement on education, A Nation at Risk (1983), Reagan said: "Our educational system is in the grip of a crisis caused by low standards, lack of purpose, and a failure to strive for excellence. Our agenda is to restore quality to education by increasing competition and by strengthening parental choice and local control" (April 26, 1983).
This so-called Excellence in Education movement stressed the use of high-stakes testing in order to assess the performance of schools along the corporate model (Cuban, 1993). Much like a failing business, the poorly performing schools would be shut down and reorganized. The particulars of how the Accountability era was implemented will be covered shortly, but it should be pointed out up front that the rhetoric of excellence and the succeeding reality of testing in the schools were two different things. That is, during a period of wider fiscal belt-tightening and social program cutbacks, State and local tax revenue fell dramatically and federal help for education was not forthcoming. School building maintenance, curriculum modernization, and serious funding for addressing the rise of heavy drug use (and gang violence) in schools were all neglected (Wilson, 1985; Fukuyama, 1999).
Proposal and implementation of high-stakes testing (1983-2002)
Traditionally, American public schools had educated citizens to live in a democracy. They were the melting pot in which immigrants embraced the American dream and they were at the forefront of the struggle for equality (Spring, 1994). But Reagan now blamed Civil Rights enforcement for hurting basic education over the prior twenty years: "The schools were charged by the Federal Courts with leading in the correcting of long-standing injustices in our society: Racial segregation, sex discrimination, lack of opportunity for the handicapped. Perhaps there was simply too much to do in too little time..." (
Reagan's argument at this time was that American schools do not need vast new sums of money as much as they need a few fundamental reforms. In accordance with this rationale, the federal government scaled back its funding role in education, shifting the burden of reform to state and local authorities. Thus, in the mid-1980s public schools were asked to compete in a business-driven world where their corporate bottom line was performance on standardized tests. This policy shift was proposed and implemented in three steps: (1) publication of A Nation at Risk (1983), which announced an educational crisis; (2) Draconian 1988 amendments to the Elementary and Secondary Education Act of 1965 (which gutted the spirit of this Equality era legislation); and (3) implementation of standardized ETS tests in the schools. Reagan was long gone from the political scene by the time the pedagogical results of this testing boom were finally available.
A Nation at Risk
Near the end of Reagan's first term, a landmark government report called A Nation at Risk: The Imperative for Educational Reform (1983) was widely distributed and covered extensively in the press. It was authored by the specially formed National Commission on Excellence in Education and funded under the auspices of Education Secretary Terrel Bell. This report was a marvel of alarmist propaganda and mobilized both military and corporate analogies throughout:
"Our Nation is at risk....while we can take justifiable pride in what our schools and colleges have historically accomplished...the educational foundations of our society are presently being eroded by a rising tide of mediocrity that threatens our very future as a Nation.... If an unfriendly foreign power had attempted to impose on America the mediocre educational performance that exists today, we might well have viewed it as an act of war.... We have, in effect, been committing an act of unthinking, unilateral educational disarmament... Knowledge, learning, information, and skilled intelligence are the new raw materials of international commerce and are today spreading throughout the world as vigorously as miracle drugs, synthetic fertilizers, and blue jeans did earlier.... Learning is the indispensable investment required for success in the information age we are entering..." (COEE, 1983).
By referring to a supposed "unbroken decline" in average 1963-1980 era SAT scores (Wirtz & Howe, 1977), Risk claimed that the Equality era schools, by adopting "minimum requirements" (see Pipho, 1978) and by making gratuitous course choice options available to students (instead of sticking to the traditional "standards" and "basic curricula" of the past), had wasted away the so-called "competitive edge" of American schools achieved during the prior Sputnik era. [181] This appeal to a long decline in SAT scores was somewhat disingenuous because it flew in the face of the 1980 response of NEA to the College Board's original (1977) claim about the span and reasons for the decline. It was also eventually pointed out that the de facto SAT score decline was concentrated primarily in the 1970s, a time of tremendous cultural turmoil in America, and had begun to subside near the end of that decade (Stedman & Kaestler, 1991; Fukuyama, 1999).
When one considers both the political contingencies behind the formation of the COEE and the timing of the report, it is not surprising that the real culprits of educational erosion (including a crumbling school infrastructure and increased drug use among students) would not be addressed by the new federal mandates. For instance, while the issue of poor textbooks was mentioned in Risk, it was paired with a claim that the fault for this lay in the liberal dumbing-down of the curriculum (rather than the lack of funding for modernizing such teaching resources). With regard to the timing of the report, the committee had clearly been given an 18-month period (from August 26, 1981 to April 26, 1983) to produce a report that would be in the households of America just in time for Reagan's upcoming 1984 re-election bid.
Despite proclamations in the opening lines of the report to the contrary, the schools would indeed now be used as a political "scapegoat" upon which to blame the ills of the American economy. The advocacy in Risk of a back-to-basics curriculum emphasis was also very much in keeping with the Reagan administration's nostalgic and overly simplistic approach to genuinely complex educational issues. Finally, the single mention in the report of the "twin goals of equity and high-quality schooling" must be viewed as merely rhetorical because it was a wholly anti-equality, regressive reform mandate that was being set into motion. While this report brought the issue of education to the forefront of political debate in the 1984 election, it also gave political impetus to the adoption of standardized testing technologies even as those tests were becoming known to be of dubious utility.
According to the logic of the report, higher test scores translated into smarter workers, a growing economy, and superior international competitiveness in the global economy. Thus Risk recommended that: "Standardized tests of achievement...should be administered at major transition points from one level of schooling to another and particularly from high school to college or work....and administered as part of a nationwide (but not federal) system of State and local standardized tests" (COEE, 1983).
Both educators and elected State officials would be held responsible for accomplishing this federal reform agenda. This policy statement became a veritable New Testament for the modern accountability movement both during and after Reagan's second term (see fig. 56).

Figure 56 Reagan meets with the COEE in 1984 after his re-election. Members of the committee included: 4 University or College Presidents; 1 School Board President; 1 Bell Telephone Chairman; 1 Commissioner of Education; 3 High School Principals; 1 President of the Foundation for Teaching Economics; 1 President of the National School Boards Association; 3 University Professors (Physics, Mathematics, Languages); 1 private consultant; 1 member of Virginia State Board of Education; 1 Former Governor; and 1 Superintendent of Schools for the State of Minnesota (photo from Tyack et al., 2001).
Title 1 Amendments (1988):
The era of high-stakes testing was taken one step further in 1988 when the accountability section of the Elementary and Secondary Education Act (1965), called Title 1, was re-written to impel States to adopt standardized tests as a means of measuring the results of school reform (Madaus, 1994). This was a fundamental corruption of the progressive intent of the Act and of Title 1, so a brief elaboration is necessary.
The 1965 Act was originally used as a financial 'carrot' to the 'stick' of the Civil Rights Act (1964) which had impelled integration of the public school system (Heubert, 1999). Title 1, in particular, was written up to help financially strapped schools (especially those with high concentrations of poor and minority children) to show their need by way of norm-referenced monitoring (such as average socio-economic status of parents in their school district; percentages of visible minority students; or
In 1988, the loose Title 1 accountability provisions were sharpened to require standardized testing. Local schools were now required to develop desired outcomes for Title 1 funds with results to be measured by standardized test scores. Schools failing to meet test score objectives (set by the state) were required to submit a program improvement plan to federal authorities. Thus, the progressive funding incentive of Title 1 had become a Draconian barrier to funding access particularly for those needy and failing schools for which it had originally been written.
Implementation and costs
Standardized testing in schools has assumed an even greater dominance since the time of the Nader report (1980). The adoption of educational competency tests, in particular, however, was partly a continuation of a pre-1981 trend. From 1976 to 1980, the number of states requiring some form of minimum competency testing (MCT) increased from 8 to 38 (Learner, 1981). This rise was merely the first sign of what was to come later (Linn, 1986). By the end of 1981, just about half the states had adopted mandatory public education testing programs, and by 1998 all but two did (see Sacks, 1999; McGinn, 1999).
In the new Accountability era, each local school and each district would have to prove to the taxpayer that its schools deserved the state and federal funding they were receiving. The implications of implementing higher standards were first felt at the local school level and then at the district and state levels as a formalized state-by-state test comparison system was established. In turn, when the results of these comparisons became known in each school district and in each state, calls for further school reform and for further school choice were forthcoming.
In the years immediately following A Nation at Risk (1983), 35 states adopted highly political school discipline and austerity mandates. Tougher standards on the local school and district levels initially translated into get-tough grade policies setting minimum GPA requirements for participation in so-called extracurricular activities such as music and sports programs. This emphasis on the narrow band of academic education was a highly characteristic public relations bracing-tactic on the part of public school administrations in preparation for the eventual coming of age of standardized school-by-school testing.
After 1988, test scores from nearly every school and district in the country were collected and compared annually under the new Title 1 Evaluation and Reporting System (TIERS), which permits comparison of results by State, region (urban, suburban, rural), type of school (private, public, charter), and level (elementary, middle school, high school). By 1997, $200 million was being spent annually on these public school testing programs alone (Sacks, 1999, p. 12) and by 2001, this monetary cost had risen to $500 million.
Aside from the monetary testing costs, there have also been curricular, social, and individual costs. For one thing, test preparation and test taking began monopolizing about one month of each school year (Frederiksen, 1984). With test scores being published in the newspapers and school budgets contingent on those scores, property values also began to be linked to the results of testing. Further, the individual costs for those who failed to meet the new test standards for graduation from one grade to the next were considerable. While individual school averages would be bumped up by such retentions (due to practice effect and extra test coaching for retained students), it is notable that Sacks (1999) suggests that many retentions are still made on the basis of district or state test scores (typically in the third, eighth, and twelfth years) rather than actual attained school course work grades. Such practices have led to numerous court challenges (see Sacks, 1999).
Early challenges to the 1979 Florida State (MCT) test determining the award of high school diplomas (e.g., Debra P. v. Turlington, 1981; 1983) provide an example. The 1981 ruling required that a massive study of the "instructional validity" of the State Student Assessment Test-Part II (SSAT-II) be conducted. On the basis of the resulting evidence, the Court of Appeals in 1983 accepted the SSAT-II as a fair test of what was actually being taught in Florida classrooms (Linn, 1986). The score cut-offs for the Florida test were also under a highly politicized revision during this period (resulting in a decrease in test "failures" on their MCT from 6% in 1979 down to 1.4% in the graduating class of 1983). The ethics of such educational testing were also under some consideration by Messick (1980, 1981, 1984). This pattern of legal, empirical, political, and ethical consideration of MCT test implementation would soon be repeated in other states too (see Sacks, 1999).
Despite these procedural advances in psychometric test validity assessment and the related politically guided damage control criterion adjustments, both the narrowing effect on the curriculum (in educational districts which teach to the test) and the negative psychological impact of grade retentions themselves came to be increasingly criticized in the media by both educational theorists and student activists themselves (see Linn, 1994; Kreitzer & Madaus, 1995; McGinn, 1999).
Ongoing Politics of School reform
During the mid-to-late 1990s there was further pressure put on the public school system by way of calls for expanded private school voucher systems (Doerr et al., 1996; Dwyer, 2002); from highly publicized so-called competition from charter schools (Nathan, 1996; Finn et al., 2000); and from calls for public support of religiously based home schooling. The main consideration here is whether these new school choice options actually posed a serious threat to the public school system. The short of the story is that they did not. Instead, these choice options tended to be adopted on a limited term basis and applied topically in areas where (or toward student populations for which) the public school system had already failed. [182]
By the 2000-2001 school year, for instance, 90 percent of school-age children still attended traditional (although now highly test oriented) public school institutions. Students using publicly funded vouchers (in Milwaukee, Cleveland, and Florida) constituted only 0.03 percent of the national school-age population and 2.5 percent were schooled at home. Similarly, public schools numbered over 90,000 and charter schools numbered 2100, with only 173 of these being run by for-profit companies (Tyack & Anderson, 2001).
In the 2000 election, both George W. Bush (Republican) and Al Gore (Democrat) advocated a continuance of the standardized testing movement in public schools. Most notably, however, the political minefield of higher education admissions policy was downplayed by both candidates because it was hard to gauge how voters felt on affirmative action.
Upon winning the election, Bush then announced on C-SPAN that: "Educational excellence for all is a national issue and of this moment is a presidential priority... Children must be tested every year in reading and math... Not just in the third grade and the eighth grade [as was done under the Clinton administration], but in the third, fourth, fifth, sixth, seventh, and eighth grade..." (January 23, 2001).
By January of 2002, Bush had signed into law the reauthorized Elementary and Secondary Education Act (or as he prefers to call it the "No Child Left Behind Act" of 2001). The act, while increasing federal funding to education by some 40 percent, also mandates the testing of every school child in America (in reading and math) from the third to the eighth grades. Since that time, the New York Times Magazine (April 7) has run a front-page cover story on the growing "Class War Over School Testing" which, among other things, indicates that a middle-class backlash against testing may soon be underway on the grounds that generalized school testing is actually pulling down the educational standards of better-funded school districts which will now be forced to teach to the test (Traub, 2002).
While the recent increase in federal funding is welcomed by all, the debate over school testing itself will surely continue as the breadth of its application expands yet again. Educators and psychologists now have an unavoidable obligation to provide the public with informed opinions regarding the past results and likely impact of continued testing on the quality of public education.
Analyzing the Results and Prescriptions of the Testing Boom
The assumptions and prescriptions of A Nation at Risk (1983) have had a free hand in the ensuing years. Has this unexpected era of testing boom actually raised the pre-Risk level of academic achievement in public schools while ensuring equality of opportunity as promised? The results have clearly been mixed and any analysis of these results requires bearing in mind the underlying issue of the mission of public education (and public higher education in particular). To this end, after looking at Sacks's (1999) critical analysis of the results of Risk, the recent tenuous status of SAT, GRE, and affirmative action based admissions policies for higher education will be outlined.
It will be argued that the past costs of testing (described above) have by far outweighed the meager observed benefits (described below). Further, it is noted that the assumptions of the post-Risk school testing initiatives and those of the longer-standing (but currently ailing) higher educational admittance tests are virtually identical. This fact is used as an indication that both of these forms of testing will eventually peter out over the next decade or so. That is, even under the new conditions of improved federal funding, it may be that the considerable state-funded efforts currently going into the blanket application of standardized testing initiatives would be better spent on ensuring the equitable allocation of the kinds of basic (infrastructure, curricular, teaching ratio, library, and technological) resources that have historically been shown to be required for actually promoting quality public education.
Did testing produce improved performance?
As noted at the outset of this chapter, groups that had historically lagged behind in access to quality public education had already been afforded specific legislative attention during the Equality era of public schools (1964-1980). So, by 1983 the public school system was clearly doing a better job in this respect than at any time in the past (Heubert, 1999). But in that year, the past emphasis on equality of access was swept away in favor of concerns over the lagging economic competitiveness of
Peter Sacks, in Standardized Minds (1999), has critiqued the assumptions and actual pedagogical outcome of the Risk agenda. If there were deep problems with American educational attainment and skills in the early 1980s, one might expect that by 1994 (during a period of economic boom), the reforms would have brought about higher levels of educational attainment. But the numbers do not bear this expectation out.
First of all, according to the National Education Goals Panel: In March 1979, 85.6% of Americans (age 25-29) had completed high school. By the mid-1990s, this number had risen only 1% (see NAEP, 1996). Similarly for higher education, in 1979, 3 in 10 Americans (age 25-29) had obtained a bachelor's degree and this percentage had not changed by the mid-1990s (Sacks, 1999). [183] As Sacks points out, therefore, contrary to the logic of the American school bashers, "the U.S. economy [was] hardly at the brink of ruination because of a dysfunctional education system" (p. 86) and in any case the schools certainly were not subsequently credited with the improved economy of the mid-1990s.
Secondly, given the costly imposition of accountability testing programs under Title I amendments, one would expect to see an improvement in measured performance if the implied equation in Risk of smarter students and measurable testing results is to hold up. According to the periodically gathered National Assessment of Educational Progress (NAEP, 1996) data, however, such an equation doesn't hold up. [184] Math proficiency and science achievement comparisons from 1973, 1982 and 1992 remained steady, suggesting that: (1) prior to Risk, there was not a "crisis of mediocrity;" and (2) the imposition of expensive testing programs did not improve overall public school academic performance in any case (Sacks, p. 84).
Further, NAEP data also indicate that up to 1995, the four lone states which had not yet adopted a high-stakes accountability testing program still met or surpassed the national 1994-95 average requirements for math and science achievement. Hence, the politically charged accountability rhetoric in Risk may have been a complete red herring in the first place, and adopting high-stakes tests alone does not necessarily lead to higher overall educational quality for the nation (Sacks, 1999, pp. 88-93; Lemann, 1999a).
At the state level, one valuable lesson learned over the past few years is that the use of off-the-shelf standardized tests (e.g., the Stanford 9; Iowa Test of Basic Skills) as a means of assessing, accrediting, and monetarily rewarding public schools has not produced the desired gains. The main advantage of these commercially available tests is that they are cheap to administer and score. Harcourt Brace's Stanford 9, for instance, costs only $6 per student to administer. The disadvantages are that they: (1) were originally constructed (and have been successively revised) so as to produce results which fall along an ideologically loaded and discriminatory "normal" statistical curve (with non-discriminating items being routinely thrown out); and (2) being intended for a national marketplace, do not necessarily reflect the content of the courses taught in the state schools.
This situation may change as high profile states such as California (which has the nation's largest public school system and which tied Louisiana for last on the 1995 NAEP) switch over to state specific (ostensibly instructionally valid) tests, but only time will tell if the $20 million per year (per state) price tag for doing so will produce the desired ends (see Merrow, 2002). I suspect, however, that (despite the current administration's support of a further expansion in testing) there will eventually be a political backlash against it on the grounds that the costs outweigh the observed gains.
A clear indication of the likelihood of a backlash against public school testing is to be found in the renewed controversy surrounding the issue of standardized testing for higher educational access. Indeed the ongoing Accountability era testing boom in public schools has modeled itself on a system of higher educational testing that was on the verge of failure prior to being given a short-term lease on life by way of Risk. The politically charged cycle of events there will likely be repeated in the arena of public school accountability so some degree of elaboration is necessary.
Tenuous status of the SAT, GRE, and Affirmative Action
The case for abolishing the SAT and GRE tests as techniques for higher educational admittance (e.g., Crouse & Trusheim, 1988) is finally starting to become well known. James Conant (one of the architects of the modern American educational system) had considered the adoption of the SAT and GRE for college admissions to be a great equalizer. He viewed the tests as an objective, uncontaminated measure of merit; that is, as a fair and reliable measure of ability to succeed in higher education (Lemann, 1999b). They are not (e.g., see Steele & Aronson, 1995; Steele, 1997; Spencer et al., 1999 on stereotype threat in intellectual test performance).
Conant also believed that standardized tests would function to cancel out the economic advantages that parents traditionally pass on to their children by sending them to better schools (which had been favored by pre-W.W.II higher institutions of learning). They don't. Instead, incredible energies and money are now annually put into preparing well off students for these tests so that the haves in
Ninety-seven percent of students writing the SAT now use some form of test preparation. But hidden within that universe of testing preparation is a world of socio-economic difference. Those with sufficient economic means pay 500 dollars an hour for private tutors, those of moderate means pay for group prep courses or purchase CD ROMs (from Princeton Review or Kaplan), and those without means attend self-help study groups or free courses at their local high school (which themselves vary in quality from region to region).
While it is still claimed that the SAT can predict 15 percent of the variance in grades in the first year of college (and similarly that the GRE can predict 11 percent of the variance in first year graduate course grades), it is no longer claimed by anyone that these tests have any predictive validity for selecting those students who will complete a degree or go on to make real world contributions in any given area of expertise. Instead, it is now realized that these standardized admissions tests function to reproduce the class system (from generation to generation) rather than turn it on its head (Lemann, 1999a). Better educated and prosperous parents produce children who score better on standardized tests.
Ironically, these relatively recent realizations are in part due to the political fallout surrounding a conservative backlash (from 1985 onward) that has successively struck down former state-run Affirmative Action initiatives (which had tended to utilize different test score cut-offs for students of different ethnic backgrounds). These realizations also result from the ongoing institutional search for other means of promoting ethnic (and class) diversity on college campuses. As Sacks (1999) put it: "In one of the great ironies of recent American history, the very existence of affirmative action itself -and the recent legal and popular attacks on it- will force the hands of educational decision makers regarding the real utility of gate keeping tests. Only then will the prevailing merit system be officially rendered obsolete" (p. 284).
In July of 1995, public acrimony over a two-tiered gatekeeping system moved the California Board of Regents to vote to phase out (over three years) all consideration of ethnicity and gender in admissions. A year later California voters passed Proposition 209, which officially ended Affirmative Action in public agencies and higher education. By 1998, at UC-Berkeley, the number of black students admitted to the freshman class had dropped 60 percent. Hispanic admissions were off 40 percent. At UCLA, UC-San Diego, Davis, Irvine, and Santa Barbara, black admissions had dropped anywhere between 14 and 46 percent. Hispanic admissions had also declined by 9 to 33 percent (Sacks, 1999).
Ward Connerly and Terence Pell (of the Center for Individual Rights based in Washington, DC) have sued public universities in the states of Texas, Michigan, and Washington, claiming that affirmative action programs discriminate by applying different test score standards to different races. [185] Their fundamental argument is threefold: (1) that these universities are faced with real disparities in standardized test scores (along racial lines); (2) that any institution using standardized tests in its admissions process has got to find a way around those disparities if it desires ethnic diversity; and (3) that the typical way to do so is by the illegal use of racial preferences (Connerly, 2000).
In Hopwood v. State of Texas, plaintiffs including Cheryl Hopwood sued because the University of Texas School of Law systematically passed White students over in favor of ethnic minorities with lower LSAT scores. The court ruled (in March 1996) that the School of Law's use of race as a factor in its admissions equation was prohibited by the U.S. Constitution. The weight of the Hopwood decision was then felt in other legal challenges. The University of Michigan's law school (Grutter v. Bollinger, et al.) and then its undergraduate College of Literature, Sciences, and the Arts (Gratz v. Bollinger, et al.) soon faced lawsuits that essentially duplicated the one brought in Texas. Similarly, in November 1998, Washington State voters approved a ballot measure banning consideration of race in hiring or college admissions decisions. Using racial classifications to achieve diversity or proportional representation no longer passes constitutional muster, nor does it draw clear voter support.
New attempts at diversity
The hunt is now on to find ways that will ensure ethnic diversity on campus without breaking the recent anti-affirmative action rulings. Some of the solutions have been half-steps (combining both standardized test scores and non-test criteria) and others have forgone the standardized testing dilemma altogether.
Berkeley, for instance, now admits half of the freshman class on pure academic criteria (grades, course difficulty, and SAT scores). But in choosing the other half of the class, admissions officers consider other factors (a student's activities, community service, and past ability to overcome life obstacles). For this reason, they require each applicant to submit an essay concerning their personal background and planned academic intent. While the ethnicity and gender blanks of the applications are occluded, there are many ways in which this information is revealed in the essays, and it remains to be seen whether this new half-step procedure will pass legal muster. Other half-step programs which both predate and follow the Hopwood decision (including that of Texas) are covered in Sacks (1999) -who argues that in most cases they are not particularly effective in ensuring proportional representation of students from Black and Mexican ethnic backgrounds.
As for the issue of why universities should bother to strive for ethnic diversity, Bowen & Bok's The Shape of the River (1998) studied the impact of affirmative action at 28 selective universities and found evidence indicating the dramatic success of minority students. The leadership role that black matriculants are playing in civic and community life, in particular, is quite telling. The ratio of black matriculants leading civic organizations outstrips that of Whites by 2 to 1 in some areas. Bowen & Bok argue, therefore, that choices during the freshman admissions process have to be based not on who has achieved a given test score result at this point in their life but on which set of applicants will contribute most to the quality of education at the institution and to the larger purposes of society (i.e., to the need of society for diverse leadership). As Nairn (1980) and many other sources since that time have indicated, neither the SAT nor the GRE provides any such predictive validity for future success.
Sacks (1999) has argued that the small percentage of institutions that have forgone the traditional test based entrance exams have been successful in promoting diversity on campus and decidedly not at the expense of high academic standards. The overall degree completion rate, for instance, tends to remain comparable after the abandonment of standardized entrance tests. While the percentage of test-optional four-year institutions was still small (about 12% in 1997), he correctly predicted the trend toward broader admissions criteria would continue into the near future.
Other higher profile attempts have also been made to circumvent the testing bias dilemma by allowing high school grade ranking to be the predominant criterion for student acceptance. In Texas, a Uniform Admissions Policy (SB 588) was signed in 1997 which required public universities to automatically accept any Texas student ranked in the top 10% of their graduating class. The bill meant that the highest performing students from struggling urban schools in San Antonio would be on an equal footing with those from elite schools in Dallas. The bill also left room for each campus to extend its admissions to the top 25 percent of each high school class.
Sacks (1999) notes that the plan seems to be working: At the University of Texas, for instance, of the 12,000 new students in the 1998 class, 4,000 were ranked in the top 10 percent. Admissions for black students who ranked in the top 10 percent of their class rose from 87 percent in 1995 (prior to the effects of Hopwood) to 97 percent in 1998. In other words, the traditionally low SAT scores for this group of applicants no longer posed a barrier (p. 295).
Similarly, the One Florida Program, adopted by executive order in November 1999 (partly as a means to pre-empt planned legal action by Connerly), outlawed race, ethnic and gender preferences in state contracting, college admissions, and some state hiring. Governor Jeb Bush thereby eliminated racial quotas in admissions and guaranteed college placement for the "top 20 percent" of Florida high school seniors. Initial Fall 2000 enrollment figures provided by Florida's 10 public universities and the Board of Regents indicated that an additional 1,234 African-American, Hispanic, Asian and Native American students had entered the university system as compared with the Fall 1999 numbers. [186]
It remains to be seen whether legal cases will be forthcoming against these new high school percentile-based approaches, but initial indications are that they have certainly surpassed the pure standardized test based admissions policies of the past in terms of promoting both ethnic diversity and equality of access for female versus male students. This latter point is especially important because it solves a well-known, long-standing gender bias weakness in test based admissions policies. That is, while males tend to outperform females on the SAT (by about 40 points on average), the actual attained freshman (and end of degree) GPAs of females who are accepted tend to be higher than those of males (Lemann, 1999a).
Most recently, the ETS has responded to the current 20% drop in their college admittance market share by revising the structure of the SAT exam. In June of 2002, they proposed dropping the analogy section altogether and introducing a 25-minute written essay section (so as to better assess critical reading skills). The reply from Bob Schaeffer (of FairTest) was immediate: these cosmetic changes (proposed to come on stream by 2005) will likely exacerbate the already existing biases of the SAT. In other words, the use of the "revised" SAT will continue to discriminate against the poor, against minorities, and against both older applicants and women.
Striking at the very heart of the role of higher education and of educational access itself, Schaeffer summed up:
"The issue isn't the old SAT versus the new SAT or even the alternative, the ACT. The real question is why any college needs to use a test. And there are already 391 colleges and universities in this country that don't require test scores to admit substantial numbers of their applicants. Some of the most competitive schools....believe that every child can learn and...that every child can show it in real academic work, [and] not largely [by] filling in bubbles and writing one formulaic essay in three hours on a Saturday morning. The high school record is much, much richer. It includes lots of tests, lots of essays, and all kinds of other information" (Schaeffer, The News Hour with Jim Lehrer, July 2, 2002).
Before returning to these rather weighty and politicized issues, we must first address the more disciplinary consideration of how well the 1980s-1990s interactionist consensus on testing (within psychology) did at: (1) informing our understanding of what human intellect is and how it develops; and (2) providing specific provisions or alternatives for concrete professional issues (such as the effectiveness of school reform or adjustments to admissions policies).
Section Two:
Strengths and Weaknesses of Revised Interactionism
During the early 1980s, critiques of the older IQ testing tradition such as Gould's Mismeasure of Man (1981) began to achieve the status of conventional wisdom within academic circles. The main point of Gould's book was that brain primacy theory and assumptions about a racial hierarchy of mentality had historically been brought directly into the intelligence testing tradition of the past. In an era in which the intellectual descendants of these tests were making a resurgence, it was incumbent upon middle-of-the-road psychologists to delicately grapple with the content aspects of human intellect in a manner that would not rip the discipline apart at the professional seams between supporters and detractors of testing.
Occasionally, the reticence to make pronouncements went far beyond political correctness to become either: (1) an incredulous denial of past testing endeavors as reflective of anything of ontological reality (e.g., Mensh & Mensh, The IQ Mythology, 1991); or (2) a social constructivist account of the historicity of past views on human intellect as somehow insoluble in principle (e.g., Richards, Race, Racism and Psychology, 1997). While the vast majority of professionals fell into neither of these two radical anti-ontological camps, it had become politically incorrect in academic circles to acknowledge or to emphasize past racial differences in test performance, or to postulate what they might actually mean (see E. Hunt, 1995a).
I say "academic" circles here because there had now emerged a professional schism between the professorial/clinical specialties of the APA on the one hand and the experimental/psychometric specialties of the APS on this and other matters. In 1988, a group of psychometric and research-oriented psychologists founded their own organization, the American Psychological Society (APS). This was the professional refuge toward which proponents of psychometric tests and other reductive accounts of animal and human intellect gravitated. [187]
Also in that year, a considerable divergence between the professional persona of the so-called anti-testing consensus and the private beliefs of psychologists was noted. Snyderman & Rothman's The IQ Controversy, the Media and Public Policy (1988), referring to the results of their questionnaire, indicated that "whatever the conventional wisdom holds," the vast majority of survey respondents continued to believe that: (1) intelligence can be measured; (2) genetic endowment plays an important role in individual differences in IQ; and (3) IQ was an important determinant of general success in American society. The latter two of those beliefs have already been brought into considerable doubt. My point in mentioning them here is that under such a professional climate of tactical eclecticism (and de facto fractionation of affiliation), no clear or sustained effort at theoretical housekeeping on the issue of defining and measuring the content aspects of distinctly human intellect was likely to occur.
There were, however, sporadic attempts to reach consensus on human intellect; both within psychology (e.g., Sternberg & Detterman's important anthology What is Intelligence?, 1986) and outside it (e.g., two interdisciplinary anthologies intended to counteract psychometric racism by Fraser, 1995; and Kincheloe, et al., 1996). By considering the content of such attempts, in combination with more individual extradisciplinary efforts (e.g., Flynn, 1984-1990; Gould, 1996) we can tease out the strengths and limitations of the new so-called revised (a.k.a. dynamic or social) interactionism of the Accountability era.
I will argue that the resulting professional consensus on testing procedures remained vulnerable to the biogenic arguments (e.g., Anderson, 1992; Rushton, 1990, 1995; Herrnstein & Murray, 1994) for precisely the reasons mentioned in the Snyderman & Rothman survey conclusions. In other words, while both the historicity of human intellect and its extra-individual (socio-societal) origin were beginning to be recognized from 1981 onwards, the methodological and professional implications of these realizations had not yet been fully recognized, elaborated, disseminated, or implemented.
Weakness of Accountability era anthologies
It is now clear that the error of Darwin's continuity view of mental evolution was repeated by the 1950s-70s interactionist account of human intellect. Interactionism (like racism itself) is a fundamentally conservative position. As such, its more liberal proponents unintentionally continued throughout the 1980s & 1990s to buffer scientific racism and other forms of biopsychological reductionism from direct attack. That is, in both dynamic interactionist views on intelligence (circa 1980s) and subsequent movements toward recognizing an historically situated social intelligence (circa 1990s), the postulation of nature plus nurture was retained, thereby allowing (by far) too much room for biogenic proponents to maneuver.
Defining intelligence: The new interactionism
The issue of how far psychology had come in understanding human intellect was faced head-on by an anthology published in the form of Sternberg & Detterman's What is Intelligence? (1986). The editors opened the anthology with the statement that: "theories in this volume identify three main loci of intelligence -intelligence within the individual, intelligence within the environment, and intelligence within the interaction between the individual and the environment" (p. 3). The brief contributions of the volume, whether advocating psychometric "g" (e.g., Eysenck, Jensen); multiple "types" of intellect (e.g., Gardner, Schank, Sternberg, Scarr); or refusing to make pronouncements on such ontological questions (e.g., Horn, Humphreys, Hunt, Zigler), were therefore unified in their thoroughgoing interactionist rather than transformative approaches to human intellect (see also Gardner, 1983, 1985, 1987, 1993).
Edward Zigler's (1986) contribution is especially important because, while attempting to outline a rather progressive developmental stage approach to human intellect, he also exhibits an untenable ontological agnosticism by: (1) emphasizing the arbitrary nature of definitions (as being not 'right or wrong' but merely more or less 'useful'); and (2) retaining the historically problematic IQ measure in his diagram of differential intellectual development. Zigler's diagram on intellectual levels (p. 86) is very important because it graphically portrays the hidden additive (and individualistic) assumptions of so-called dynamic interactionism (see Figure 57).

Figure 57 Zigler's depiction of intellectual development. The vertical arrow represents time's passage. Horizontal arrows represent "events which effect the individual" (indicated by a pair of vertical lines). "Cognitive" development is depicted as an ascending spiral, in which the numbered loops represent successive stages of "intellectual growth" (from Zigler, 1986; see also Kimble et al., 1980/1984). Despite his commendable attempt to depict a stage approach, Zigler's retention of the IQ measure (at the base of the diagram) is highly problematic because it had already been abandoned by the Consortium of Longitudinal Studies (1983) in their analysis of the success of Head Start and seems to indicate (though unintentionally) that a fixed mental capacity is being implied. Although the conflation of IQ scores with both mental capacity and the number of achieved levels of mental development was also very common between 1965 and the mid-1980s, Zigler, an important player in Head Start for decades, would later openly lament his own part in such lapses (see Zigler & Muenchow, 1992).
The retention of the IQ measure is stated here as problematic because it had already been abandoned by the Consortium of Longitudinal Studies (1983) in their analysis of the success of Head Start. This having been said, however, Zigler's accompanying textual qualifications on the above diagram come closest of all the 1986 anthology contributions to elaborating what a valid assessment of an individual's intellect might entail:
"A valid assessment of an individual's functioning would consist of a variety of measures, including a test of formal cognitive ability (such as the standard IQ test or Piagetian model of cognitive functioning), and achievement measure (e.g., the PIAT) and some indicator of motivational and emotional variables (such as self-image or locus of control)" (p. 151).
When I first saw Zigler's diagram, I was both appalled and elated. The inclusion of an apparently stable IQ measure as correlated with the number of developmental levels achieved by an individual, for instance, was most alarming. According to the diagram, the 150 IQ child had completed 9 levels of development by age 20 whereas the 66 IQ child had only completed 4. What is implied in this depiction of a stable IQ attribution? No answer is given by Zigler, but I suggest that what is implied is the very argument which Head Start was set up to counter (i.e., fixed mental capacity). Even a comparison of Zigler's diagram with the former additive mental ladder of the 1920s and 1930s shows that the older diagrams at least allowed for some variance of IQ scores to be attained by individuals across the life-course.
Similarly disturbing is the equal width of the various (undefined) developmental levels depicted. In my opinion, other older diagrams of developmental stages (e.g., Arnold Gesell's spiral mental growth cycle) had it more right than Zigler by depicting each new developmental stage (or rather mental mile-stone) as expanding the horizons of children as they pass through them. Thus, the diagrams used in educational texts for many years had correctly depicted successive stages of mental "growth" as wider than the previous ones (e.g., Lindgren, 1956, p. 49). We will return to this issue of depiction shortly.
Despite the limitations of Zigler's diagram, however, I was also elated by his explicit mention of motivational, emotional, and social assessment as being necessary because this meant that I could eventually include such concerns in my own developmental hierarchy of intellect without raising too many eyebrows. Indeed, it is necessary to do so in order to demonstrate the difference between the individual, social, and societal mechanisms of rising from one intellectual level to another. Before moving on to that account, however, we must first return to the issue of how far the so-called dynamic and social interactionism went in describing these causal mechanisms of mental transformation.
A little while after the Sternberg & Detterman anthology, the third chapter of Snyderman & Rothman (1988) nicely outlined the logic of the interactionist account as it is played out in 1980s empirical terms -including contemporaneous arguments over the assumption of covariance of human intellect with IQ score (pp. 80-81). That chapter also provides forward-looking summaries of so-called dynamic meta-analysis evidence (from longitudinal studies, twin studies, semi-historical comparisons, and cross-race studies) which would become predominant in the testing subdiscipline thereafter. All of these points intentionally work toward their argument that empirical rigor of such studies had increased (p. 92), but also (unintentionally) toward demonstrating the ongoing hegemony of the continuity view of mind in even the best proponents of mainstream interactionist psychology.
In short, what has always been missing in these interactionist accounts is a serious account of the role of societal-historical existence in not only providing a context for higher mental processes to be expressed in human activity but also in the very formation and divergence of our higher mental processes (a.k.a. intellectual stages) from the sort of lower mental processes we share with animal mentality. All the same, by at least considering the empirical tools and theory production methods of the past, and by expanding them to include an emphasis on longitudinal (a.k.a. life-span developmental) and motivational approaches, the dynamic interactionist account has clearly made an important half-step beyond the unreflective (ahistorical) traditional interactionist account.
It came to be recognized that what was needed was not merely further recitation of data, nor further bows to measurement technologies produced in related disciplines, but rather a comparative historicized re-analysis of the practical applications of the past knowledge products of the testing industry. Yet a firm argument regarding the theoretical and methodological basis upon which to guide this re-analysis was not forthcoming from within the loosely affiliated dynamic/social interactionist camp. Thus, 1980s-1990s interactionist interpretations of the available longitudinal data (including those appealing to social influences, motivation, and social-historical context for intellectual growth) did not surmount but rather merely postponed resolution of the nature and nurture debate.
Revised interactionism versus scientific racism
How well did the de jure revised interactionism deal with blatant forms of scientific racism and statistical methodolatry during the Accountability era? The answer is not all that well. That is, while the best interactionist theorists and extradisciplinary commentaries of the era were successful in exposing the duplicitous procedural tactics used by Philip Rushton and by Charles Murray in their accounts of ongoing race differences in IQ, great difficulties were encountered when attempting to provide viable theoretical counter-arguments and methodological options on the content end of the testing debate. [188] The new claims that the updated biogenic hypothesis was somehow data driven are as disingenuous as those of Jensen, so we will dispense with a description of those claims. The main issue here is: What kinds of counter-arguments and methodological alternatives were put forward in their place?
Specific contributions to two of the most recent anthologies intended to counteract psychometric racism (Fraser, 1995; Kincheloe, et al., 1996) are highly enlightening on this issue. First of all, Stephen J. Gould's comments in The Bell Curve Wars (Fraser, 1995) are indicative of the fact that he had not yet worked out the difference between social interaction and a truly transformative approach to mental evolution. In both that volume and in the second edition of Mismeasure of Man (1996) Gould appropriately criticizes Herrnstein & Murray's The Bell Curve (1994) for failing to distinguish between statistical and cultural bias in intelligence test measurements. [189] But in the anthology, he elaborates by stating that we "do not yet know the answer" to the question as to "whether blacks average 85 and Whites 100 because society treats blacks unfairly -that is, whether lower black scores record [social biases]" (Gould in Fraser, 1995, p. 18). This flaccid statement in the realm of mentality is hardly the kind of stand one would expect from an outspoken proponent of punctuated equilibria (and its implications) in the realm of organic evolution.
Having become embroiled in the interactionist's false subdisciplinary dichotomy between the "g vs. specific abilities" debate, Gould (1995; 1996) then throws in his chips against Spearman's g factor and for Guilford's (1947, 1952, 1959, 1966a&b, 1967, 1971) multiple intelligence model (and hence with the thoroughgoing interactionism of Guilford and Howard Gardner). Yet this fifty-year-old "one intellect vs. many" debate was far from the most central or timely methodological issue at stake in contemporaneous test interpretation.
In other words, given the increasing engagement of Gould's "punctuated equilibria" account (Eldredge & Gould, 1972; Gould, 1980a&b) in anthropological circles around the issue of the organic growth of hominid brain size (see Falk, 1992; Lewin, 1993) -an account which postulates a transformative (i.e., episodic step-wise) rather than strictly gradual growth role of tool use and language along the lines of Russel Wallace- it is highly ironic that Gould's revised Mismeasure of Man (1996) does not explicitly extend the concept of punctuated equilibria from the organic to the mental realm. In failing to do so clearly, he has apparently repeated Darwin's (1872) mental continuity error. No account of gradual changes in mentality leading to qualitative shifts in mental kinds is to be found in either work and Gould (1996) makes only one highly tangential reference to punctuated equilibria itself.
Similarly, Kincheloe & Steinberg's commendable effort to characterize the "hopeful" socio-cultural counter-argument (in their introduction to Measured Lies: The bell curve examined, 1996) also falls flat because they seem to lump into one (albeit heterogeneous) camp the proponents of both interactionism and socio-cultural (transformative) analysis:
"We argue in this book that there is reason for hope. Ignoring literally scores of studies that document the benefits of educational intervention, Herrnstein and Murray would rob the poor and non-white of future promise. An entire school of psychological analysis has emerged over the last two decades that views the development of higher orders of thinking around sociocultural interaction (Bohm and Edwards, 1991; Gardner, 1983, 1991; Hultgren, 1987; Kincheloe, 1993; Lave, 1988, Raizen, 1989; Vygotsky, 1978; Walkerdine, 1984, 1988; Wertsch, 1991; Wexler, 1992)" (p. 36).
While the above reference to Vygotsky (1978), and to sociocultural analysis itself is highly encouraging, the label of "sociocultural interaction" is not because Vygotsky (as described below) belongs to a group of thinkers who provided a truly transformative (rather than additive or merely social interactionist) approach to human mentality. Examples of why this kind of loose appeal to sociocultural analysis of individual differences (and to mental levels) is problematic have already been given but one final example may be useful.
Despite its claims to be dynamic and developmental, the new revised interactionism is ironically left completely vulnerable to the outright separation of higher intelligence from development contained in the work of Mike Anderson (1992) who suggests (along the lines of Jensen) that an innately given "speed of processing" (operationalized as reaction time differences) is the basis of all observed individual differences. [190] Why are the sociocultural interactionists vulnerable to this speed of processing argument? Because their accounts are consistent with (or complacent toward) Anderson's main premise: That "lower level theories" of intellect requiring little or no appeal to knowledge (e.g., reaction time tasks in Jensen, 1982; Eysenck, 1986) explain the regularities in the test score data (i.e., no racial differences on these lower tasks) and that the "high level theories" of intellect, which make direct appeal to "cognitive" processes (e.g., Hunt, 1980; Sternberg, 1985) explain exceptions to those regularities (i.e., that blacks consistently score lower than Whites and that Asian immigrants have recently out-performed Whites on those same tests or that individuals may perform better on the mathematical vs. verbal sections of such tests).
The traditional interactionist approach to human intellect (by assuming an unreflective genes plus environment stand) treated mere empirical descriptions as if they were explanatory. This approach to data was subsequently brought directly into the so-called cognitive science variants of interactionism (of the 1980s and 1990s) which tended to be unapologetically reductionist. [191] Anderson (1992), as one of these proponents, argues that (apparently inborn) speed of processing underlies all subsequent "development" of higher intellectual functions and he then mobilizes a convoluted array of cognitive-science-based experimental data to purportedly support his views.
The one kernel of truth in Anderson's argument is that both the "lower" and "higher" theoretical camps regarding human intellect have (historically and at least implicitly) assumed that genetic endowment is (to some degree) responsible for providing a basis (and upper limit) upon which both the growth of intellect and effectiveness of educational interventions occur. This shared additive assumption was the platform upon which both Rushton and Murray made their mark in the 1980s and 1990s and no unequivocal reply to this shared view appeared in the resulting anti-racist anthologies.
It is crucial that we now explicitly escape the conceptual and practical bounds of such additive analysis (i.e., nature "plus" nurture). Only then can we investigate the developmental aspects of intellectual growth by way of concretely outlining the typical patterns of transformation of lower mental processes into higher mental processes. That is, only then will we have an explanatory understanding of human intellect and be able to produce tests which measure those processes. The great difficulty in this, of course, is to find a way of doing so that is not associationist, reductive, or even interactionist (see fig. 58).

Figure 58 Bell Curve Controversy. Replying to Herrnstein & Murray's (1994) statistically veiled Mental Darwinist book, eminent figures from various fields mobilized divergent sets of interactionist concepts with no underlying unified conclusions being reached as to what should be done about improving ongoing testing technologies and interpretation (photo from Fraser, 1995). Unless a serious and sustained effort at theoretical housekeeping is carried out to promote transformative (rather than additive) approaches to ability testing, further resurgences of the biogenic account (disingenuous or otherwise) are sure to follow.
By way of demonstrating its utility in explaining the so-called Flynn effect of rising average intelligence test scores, I will argue shortly that only the transformative account of mentality explicitly abandons this long-held associationist/additive methodology for an emergent evolutionary mental ladder approach to intellectual capacity. It does this by explicitly requiring careful reference to the typical three-fold (phylogenetic, ontogenetic, and socio-historical) pattern of quantitative expansion and qualitative transformations in human mentality. This methodology guides research by viewing the higher rungs of the mental ladder not as merely added to fundamentally unchanging lower rungs but rather as transforming (both in the process of their development and afterwards) those lower rungs into something qualitatively different.
The proximate historiographic lesson for the present subsection, however, is simply that while successful (i.e., explicit and convincing) counter-arguments to biogenic accounts were not forthcoming from within the diverse revised interactionist camp, important methodological half-steps were at least hinted at by testing outsiders James Flynn (political scientist) and Stephen J. Gould (evolutionary biologist). These included: (1) a recognition of the historicity of change in human intellectual performance across generations; and (2) an appeal to the concept of punctuated equilibria to support a theory of mental levels. We will now concentrate on elaborating the first of these half-steps (namely historicity of human intellect) primarily because it deals specifically with the subdiscipline of psychometric ability testing; but both are important.
The Flynn Effect (Historicity of IQ scores recognized)
One part of the sociocultural message (i.e., the historical origin of human intellect) was hinted at by the work of Flynn (1984, 1987, 1990). Indeed the "Flynn Effect," the recognition that average IQ test score performance is rapidly rising relative to the standardization samples periodically used to update the tests -both in the United States (Flynn, 1984) and other technologically advanced nations (Flynn, 1987)- has obtained considerable disciplinary cachet. For instance, the theme of the historicity of test performance and the societal origins of the Flynn effect was touched on in the APA Task Force report "Intelligence: Knowns and Unknowns" (Neisser et al., 1996) and was then the subject of both a well-known follow-up article by Neisser (1997) appearing in American Scientist (see fig. 59) and an interdisciplinary anthology called The Rising Curve (Neisser, 1998).


Figure 59 Rising scores on the IQ tests (from Neisser, 1997). The upper panel indicates that if children of 1997 were to take the 1932 Wechsler test, their average would be somewhere around 120. Alternately, if the generation of 1932 took the present test, their average IQ would have been 80 (with up to one quarter of them being assessed as mentally deficient). The lower panel indicates that the largest Flynn effects appear on tests of visual reasoning (such as the Raven's Progressive Matrices). The rate of increase on these so-called culturally reduced tests is roughly twice that on broad-spectrum tests like the WISC or WAIS. The Dutch data, depicted here, shows a 21 point difference between 1952 and 1982 (which extrapolated back to the early 1930s produces a 35 point increase). That is, the average 19 year old in the Netherlands is now producing scores that would have been 2 standard deviations above the mean for their grandfathers. Moving beyond the considerable limits of Flynn's initial accounts, Neisser (1997) puts forward an important visual stimulation hypothesis to explain the cultural-technological origins of these differential changes in average test performance gains.
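The arithmetic behind the Dutch extrapolation in the caption can be checked with a minimal sketch. It assumes a strictly linear rate of gain, a starting year of 1932 for "the early 1930s," and the conventional IQ standard deviation of 15 points; none of these assumptions is spelled out in the caption itself, and the variable names are purely illustrative.

```python
# Illustrative check of the Dutch Raven's extrapolation (assumptions labeled).
SD = 15.0                      # assumed conventional IQ standard deviation

observed_gain = 21.0           # points gained between 1952 and 1982 (from the caption)
observed_years = 1982 - 1952   # 30 years

# Assumption: the gain is linear, so one annual rate applies to the whole span.
rate_per_year = observed_gain / observed_years

# Assumption: "the early 1930s" is taken as 1932.
extrapolated_years = 1982 - 1932
extrapolated_gain = rate_per_year * extrapolated_years

# Express the extrapolated gain in standard deviation units.
gain_in_sd = extrapolated_gain / SD

print(f"rate per year: {rate_per_year:.2f} points")
print(f"extrapolated gain: {extrapolated_gain:.1f} points")
print(f"gain in SD units: {gain_in_sd:.2f}")
```

Under these assumptions the linear rate reproduces the 35 point figure, which amounts to somewhat more than the two standard deviations (30 points) mentioned in the caption.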
The considerable theoretical and procedural limitations of Flynn's 1980s research have tended to be glossed over in subsequent psychology textbook accounts, and the 1998 anthology (despite its sincere efforts) has left many issues to be resolved clearly if a viable future course for the testing subdiscipline is to be mapped out. Ongoing uncertainty about the subdisciplinary implications of Flynn's work, for instance, led Deary (2001) to conclude that: "If there was a prize to be offered in the field of human intelligence research, it would be for the person who can explain the Flynn effect of rising IQ" (p. 112).
As for one of these theoretical limitations, Neisser (1997) has correctly fingered Flynn's initial belief in psychometric "g" as highly problematic and (as shown below) this belief clearly pervades the basic assumptions, data handling procedures, and conclusions within Flynn's (1984 and 1987) accounts. It must also be said up front, however, that the still more prevalent theoretical assumption of the individualistic nature of human intellect -as something inside the test subject's head- has gone virtually unquestioned both within Flynn's work and in the subsequent reviews. This omission itself has profound importance for the future course of intellectual assessment. Indeed, no true explanation of the Flynn effect is possible without throwing out this obsolete assumption explicitly. I will argue, therefore, that by combining both the visual stimulation hypothesis of Neisser (1997) and the cultural evolutionary argument of Greenfield (1998) with the (only recently made available) socio-cultural methodology of Vygotsky & Luria (1930/1993), we can get fairly close to explaining the Flynn effect in its fuller significance.
Flynn's account
It is likely the more conservative psychometric aspects of Flynn's approach that have been instrumental in its ready inclusion into the psychological literature. Flynn (1984) started with the seemingly mundane assumption that any statistically reliable method of assessing IQ test performance over time requires that the initial standardization samples (used to establish test norms) be to some degree representative of the American test taking population (as they were in the years sampled). In his meta-analysis of past test score comparison studies, he carefully selected out the best 73 studies (representing 7,500 subjects ranging in age from 2-48 years) in which two or more Stanford Binet and/or Wechsler tests were given to the same group of subjects. In order to investigate whether a pattern of test performance gains could be ascertained between the years 1932-1978, Flynn reasoned that improved performance over time would be reflected in subjects finding earlier test norms easier to exceed than later ones (which is, in part, the rationale for periodic restandardization of test norms in the first place).
To control for the varying de facto quality of particular test versions and for the degree to which the tests were in fact representative, however, Flynn provides a brief consideration of the history of successive standardizations of SB and Wechsler tests noting (among other things) that the SB-(1932) was standardized on Whites only and that Wechsler-Bellevue Form 1 (normed 1935-38) had drawn on a sample of New York area residents (pp. 30-32). Following from this, Flynn's survey compared the selected testing data against two sets of norms: (1) against the norms on which the tests taken were originally standardized; and (2) against a "uniform scoring convention" which he worked out to control for various forms of statistical bias and "confounding variables" in past tests or standardization samples. The latter entailed statistical transposition of "mixed-race" or "minority" data into so many standard deviations above or below the "White" mean.
The data and conclusions of Flynn's 1984 research must therefore be recognized as relating to a fictitious ethnically amorphous generalized American mind (i.e., a hypothetical statistical conglomerate rather than any actually ontologically existing category of mentality or actually existing test population). This having been said, interesting results were obtained and Flynn attempted to move beyond both contemporaneous subdisciplinary (operationist) agnosticism and the primarily correlational concerns of past IQ test follow-up accounts (e.g., Owens, 1953; Campbell, 1965) by actually teasing out a descriptive account of: (1) to what degree; and (2) which parts of the tests were now being performed better. He also attempted (somewhat less successfully) to link this pattern of changing test scores to observable patterns of changes in contemporaneous modern society. [192]
The overall finding that Americans did a "better and better job" on IQ tests over a period of 46 years (amounting to "an American IQ gain" of 13-15 points between the period 1932 to 1978), with a consistent linear "year by year" performance gain of between .25-.440 points (p. 32), is the one picked up in subsequent textbooks. It is arguable, however, that the most important results of Flynn's research are in the finer details regarding which aspects of the tests were being performed better.
Firstly, Flynn (1984) indicates that the highest gains over time on the Wechsler scales were concentrated on the symbol coding subtests (which went unchanged between the WISC-1947-48 and WISC-R -1971-73). Flynn (1987) also indicates that even for post-1950 adult data, Wechsler performance subtest gains were greater than verbal subtest gains in various nations, "sometimes by as much as 16 points" (p. 186).
No clear statement is made with regard to SB tests in isolation but Flynn (1984 and 1987) comments on the "divergence" between overall full IQ gains (including SB tests) and contemporaneous declines on SAT-Verbal test scores, denoting to the charitable reader that his initial, rather historically naive, belief in a unitary "g" is beginning to waver. For instance, noting that overall IQ score increases occurred during a period of lowering of SAT-Verbal scores (mid-1960s and late-1970s), Flynn (1984) asks: "[H]ow can school children gain so much in overall intelligence [i.e., full Wechsler or Binet IQ scores] and make so little progress in terms of enhanced vocabulary? What causal factors could increase intelligence and somehow withhold their potency from the world of words [as indicated by SAT-V scores]?" (p. 46).
Flynn (1984) eventually turns to "environmental factors" which differed between generations (including socio-economic status, increased test-wiseness, and educational improvements) as a starting point for further analysis, concluding that while a case for the "malleability of IQ performance within times of normal environmental change" has been made, the reference to such variables only takes us about "half-way" in accounting for the magnitude of observed overall gains (p. 48). Flynn (1987) then elaborates on this point claiming that "higher levels of education contribute 1 point, SES may contribute 3 points, and ...test sophistication perhaps 2 points" (p. 188) to these overall increases.
Importantly, his "Massive IQ gains in 14 nations: What IQ tests measure" (1987) attempts to further tease out the differential magnitude of score gains on various kinds of IQ tests, eventually arguing that existing IQ tests do not test human intelligence per se but rather a little understood "correlate" with a weak causal link to intelligence called "abstract problem-solving ability" (p. 188). He then elaborates on this point: learned strategies of problem solving picked up at home and in school play a critical role in how well one does on IQ tests, but the effect of each is actually differentially reflected in test gains for different kinds of tests.
That is, respectively, the relevant required "learned content" for various tests are as follows: (1) Wechsler tests -elementary academic skills; (2) military test batteries (such as the ASVAB) -simple arithmetic, word knowledge and paragraph comprehension picked up at the elementary or middle school level; and (3) SAT-Verbal -advanced academic skills gained from high school English courses (1987, p. 189). Thus as the assessment of young adults moves from tests of problem-solving ability with a "moderate reliance on elementary academic skills" (Wechsler tests), to one with "heavy reliance" on at least elementary academic skills (ASVAB), and then to one with heavy reliance on "advanced academic skills" (SAT), the overall cohort test scores (with respect to earlier norms) changed "from gains to no gains and from no gains to losses" (p. 189). That is, claims Flynn, high school students in 1981 "did not have higher [general] intelligence than their counterparts in 1963, they merely had higher APSA" (p. 189).
The above is hardly earth-shattering news for those acquainted with the historical details of both past psychometric test development and the actual utilization of the Wechsler and military tests in American society, yet it is indeed a testament to the pervasive reification of the unitary view of human intellect that Flynn even felt compelled to elaborate (at length) upon those points. His (1987) disclaimer regarding the non-necessity of a "commitment to the unitary theory of intelligence" (p. 188) further indicates his demur from that problematic approach.
Once again, it must be emphasized that the most important aspects of Flynn's 1987 meta-research survey lie elsewhere. That is, in the issues surrounding the observation that the largest test score gains were not found in the forms of "crystallized" formal knowledge demanded by the so-called learned content aspects of the above named tests (i.e., the sorts of things that people accumulate throughout their lives: general knowledge, vocabulary, mathematical skill) but rather in other aspects of testing which have been characterized as "fluid intelligence" (i.e., which make the subjects demonstrate decontextualized problem-solving ability on the spot). These latter kinds of tasks include both the performance subtests of the Wechsler scales (which were initially constructed so as to minimize the influence of differential educational attainment in test subjects), and also specially constructed so-called "culturally reduced" tests such as the Raven's Progressive Matrices, Norwegian matrices, Belgian Shapes test, Jenkins and Horn tests (all covered in Flynn, 1987).
In particular, the strongest data presented in Flynn's 1987 investigation concerns the historical rise in relative performance on the Raven's Progressive Matrices, a test of abstract visual reasoning (first published in 1938 by Spearman's student John C. Raven). Formerly thought to be a good indication of psychometric "g," Raven's matrices data was carefully collected in the
It is here that Flynn exhibits a problematic confusion between mental evolutionary assessment (which entails comparisons of individual or group intellectual abilities -including generational differences therein); and cultural evolutionary assessment (which entails comparisons between human preliterate, illiterate, and literate individuals, groups, or societies -including the generational differences therein). Put quite simply, the two sorts of analyses are nested but not identical. Further, measurable progress in one (especially by way of the traditional individualistic psychometric tools) is not necessarily reflective or indicative of progress in the other.
The confusion becomes particularly manifest while he discusses the Raven test results, where Flynn seems to indicate that the intellectual abilities of Dutch inductees could not have been raised as much as the observed test scores suggest. He points out, for instance, that test scores alone would suggest that 25% of the Dutch teenagers "qualify as gifted" because those with estimated IQs of 150 and above have increased by a factor of almost 60. For him this meant that if these tests were truly measuring general intelligence the "result should be a cultural renaissance too great to be overlooked" (p. 187). But his search of Dutch education journals from the 1960s onward indicated no mention of a great increase in intellectual achievements by newer generations. For Flynn, this lack of predictive validity of test scores in the cultural realm of life-accomplishment indicated that the Raven (and probably other IQ tests including the Wechsler scales) do not measure intelligence but "merely" abstract problem solving ability.
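The order of magnitude of these proportions can be illustrated with a rough sketch. It assumes that scores are normally distributed with a standard deviation of 15, that the cohort mean moved from 100 to 120 on the older norms, and that "gifted" means an IQ of 130 or above; these are illustrative assumptions and not a reconstruction of Flynn's own calculation, and the function name is hypothetical.

```python
import math


def upper_tail(threshold, mean, sd=15.0):
    """Share of a normal(mean, sd) population scoring at or above threshold."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


OLD_MEAN = 100.0   # mean of the earlier norming cohort
NEW_MEAN = 120.0   # assumed mean of the later cohort when scored on the old norms

# Assumption: "gifted" is taken as IQ 130 and above.
gifted_share_new = upper_tail(130.0, NEW_MEAN)

# Ratio of the later to the earlier cohort's share at IQ 150 and above.
ratio_150 = upper_tail(150.0, NEW_MEAN) / upper_tail(150.0, OLD_MEAN)

print(f"share at 130+ in the later cohort: {gifted_share_new:.1%}")
print(f"multiplication of the 150+ group: about {ratio_150:.0f}-fold")
```

Under these assumptions roughly a quarter of the later cohort falls at 130 or above, and the 150-and-above group grows by a factor in the tens. The exact multiple is highly sensitive to the assumed size of the mean shift, since it depends on the extreme tail of the distribution.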
As indicated earlier, lack of predictive validity for life-achievements has also been shown in the American context for the SAT, GRE, and other forms of generalized ability tests (including various vocational aptitude batteries). But does this mean that great leaps in mental and cultural evolution have not occurred in the 20th century? Or is it simply evidence of a weakness in the kinds of tests that traditional psychometrics has produced? The answer to that question depends upon one's definition of human intellect itself. More specifically, I would argue that the false disciplinary conundrum which Flynn's research has drawn out into the open is one produced not only by his initial generalized view of human intellect, but also by his ongoing unequivocal commitment to the individualistic nature of human intellect (i.e., as something inside the head of an individual).
While Neisser (1997) would later apply metaphoric ice to the tender Achilles heel of Flynn's argumentation (i.e., psychometric "g"), the faltering limb within which that argumentation is located (i.e., the individualist definition of human intellect) remained untreated. That is, in an attempt to surmount the logical contradictions between overall psychometric test increases and the real world contingencies of reasonable intellectual assessment, Neisser suggests the following "many" intelligences versus the "one" intellect argument:
"Flynn's argument that real intelligence cannot have gone up as much as scores on the Raven assumes that there is a...unitary quality of mind not unlike Spearman's g. Abandoning that assumption, we may think instead that different forms of intelligence are developed by different kinds of experience. The [differential test gain] paradox then disappears: We are indeed very much smarter than our grandparents where visual analysis is concerned, but not with respect to other aspects of intelligence" (Neisser, 1997, p. 447).
My point here is that without a careful, explicit, retirement of the individualist approach to intellectual assessment itself, the differential test gain paradox will not in fact "disappear" and ensuing disciplinary progress on testing issues is likely to be carried out through mere limps instead of leaps. This having been said, it must also be emphasized that the most historically important aspect of Neisser's (1997) article itself is not his stand on the 'one versus many debate,' but rather how Neisser actually pushed causal analysis of the Flynn effect one step further by proposing a visual stimulation hypothesis as a means of linking the observed pattern of test score rises with 20th century technological advances.
Explaining the Flynn effect (visual stimulation and cultural evolution)
Starting with a description of Flynn's observations on differential test gains, Neisser (1997