Standardized tests are tests administered under controlled (or “standardized”) conditions that specify where, when, how, and for how long test takers may respond to questions. The test questions provide a way to gather, describe, and quantify information that assesses performance on particular tasks to demonstrate knowledge of specific topics or processes. Standardization is important for comparing individuals or groups and involves a consistent set of procedures for designing, administering, and scoring the test. The aim of standardization is to ensure that test takers are assessed under the same conditions, so that their test scores have the same meaning and are not influenced by differing conditions. Such standardized tests occur over the life course, with a range of uses including determination of school readiness, measurement of achievement throughout students’ schooling, accountability for districts, schools, teachers, and students, assessment of capabilities for college, and evaluation of achievement as employees in the workforce.
Standardized tests, as a part of the wider educational, psychological, and sociological testing and assessments, have a long history within the United States. They represent one of the most important contributions of behavioral and social science to society, even though tests have been used in a myriad of proper and improper ways (AERA 1999). Their history is deeply rooted in a United States culture that is empirically oriented and data driven; focused on change, which is assumed to be progress; committed to a belief that evidence can provide general guidance for efficient action; and torn between choices that give individuals certain advantages and choices that serve the larger society (Baker 2001).
As described by Standards for Educational and Psychological Testing (1999) – an authoritative document on standards for measurement – there are four important facets of testing standards: (1) technical standards for test construction and evaluation; (2) professional standards for test use; (3) standards for particular applications; and (4) standards for administrative procedures. For a standardized test to be technically adequate, it should meet standards of validity and reliability, whether the test is norm referenced or criterion referenced.
Reliability is the degree to which the results of an assessment are dependable and consistently measure particular student knowledge and/or skills. Reliability also refers to the consistency of scores over time, across different performance tasks or items intended to measure the same thing, or across different raters. That is, reliability statistics can be computed to measure (1) item reliability – the relationship between individual test items intended to measure the same knowledge and skills; (2) test/retest reliability – the relationship between two administrations of the same test to the same student or students; or (3) rater reliability – the extent of agreement between two or more raters. If assessments are not reliable, they cannot be valid.
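Two of these reliability statistics can be sketched with a few lines of code. The snippet below is an illustrative sketch only: the scores are hypothetical, and it uses Pearson correlation for test/retest reliability and Cronbach's alpha as one common statistic for item (internal-consistency) reliability.

```python
# Illustrative sketch of two reliability statistics; all score data
# below is hypothetical, not drawn from any test described in the text.
import statistics

def pearson_r(x, y):
    """Pearson correlation: a common test/retest reliability statistic."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cronbach_alpha(items):
    """Cronbach's alpha for item reliability; `items` is a list of
    per-item score lists, one score per student within each item."""
    k = len(items)
    item_vars = sum(statistics.pvariance(i) for i in items)
    totals = [sum(scores) for scores in zip(*items)]
    return (k / (k - 1)) * (1 - item_vars / statistics.pvariance(totals))

# Two administrations of the same test to the same five students:
first = [70, 85, 90, 60, 75]
second = [72, 83, 91, 58, 78]
print(round(pearson_r(first, second), 2))  # close to 1.0 = highly consistent
```

A value near 1.0 on either statistic indicates consistent measurement; values well below that would signal that scores depend heavily on the particular items, occasion, or rater.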
Validity refers to both the extent to which a test measures what it is intended to measure and the appropriate inferences and actions taken based on the test scores. If a math test can only measure a subset of the domain of math skills, how confident are we that students are good at math if they perform well on a math test? How confident are we that the proficiency level accurately portrays proficiency in mathematics? Within the current policy environment of the United States, if an assessment is to be valid, it should be aligned with the standards it is intended to measure and it should provide an accurate and reliable estimate of the students’ performance relative to the standard.
In addition to the importance of standardized tests being reliable and valid, they can be either norm referenced or criterion referenced. A criterion referenced test is linked to specific performance standards or learning objectives. One interprets scores on criterion referenced tests based on the degree to which students demonstrate achievement of the specific learning standards and not how students perform compared to other students. On a criterion referenced test, it is possible that all students (or no students) will perform well on the specific learning objectives or standards. Of course, the percentage of students who will perform well on specific learning objectives depends on how ambitious those performance standards are (Linn 2003).
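A criterion-referenced interpretation can be sketched simply: each student's score is compared against a fixed cut score tied to the learning standard, not against other students. The scores and cut score below are hypothetical.

```python
# Illustrative sketch of a criterion-referenced interpretation: report
# the share of students meeting a fixed cut score, independent of how
# classmates perform. All numbers here are hypothetical.
def percent_meeting_standard(scores, cut_score):
    met = sum(1 for s in scores if s >= cut_score)
    return 100.0 * met / len(scores)

class_scores = [62, 71, 85, 90, 55, 78]  # hypothetical class
print(round(percent_meeting_standard(class_scores, 70), 1))
```

Note that nothing in this calculation prevents the result from being 100 percent or 0 percent: whether all or no students meet the standard depends only on the cut score, as the text describes.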
In contrast to criterion referenced tests, norm referenced tests compare student performance to a larger group. Typically, this larger group, or norm group, is a national sample representing a large and diverse cross section of students that allows comparison of a particular student’s performance to the performance of others. The scores on norm referenced tests allow comparisons between the norm group and particular students, schools, districts, and states. All of these tested groups can be rank ordered in relation to the norm group. Thus, norm referenced tests are typically used to sort students rather than measure proficiency of specific learning objectives or standards.
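The norm-referenced interpretation is commonly reported as a percentile rank: the percent of the norm group a given student outscored. The sketch below uses one common definition (counting half of any tied scores); the norm group scores are hypothetical.

```python
# Illustrative sketch of a percentile rank against a norm group.
# The norm group below is hypothetical, not a real national sample.
from bisect import bisect_left, bisect_right

def percentile_rank(norm_scores, score):
    """Percent of the norm group scoring below `score`, counting half
    of any ties (one common definition of percentile rank)."""
    s = sorted(norm_scores)
    below = bisect_left(s, score)
    ties = bisect_right(s, score) - below
    return 100.0 * (below + 0.5 * ties) / len(s)

norm_group = [52, 58, 61, 64, 67, 70, 73, 76, 80, 88]
print(percentile_rank(norm_group, 73))  # rank relative to the norm group
```

Unlike the criterion-referenced case, this number is purely relative: a percentile rank says nothing about whether the student has mastered any specific learning objective, only where the student stands in the distribution.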
Standardized testing has played a number of important roles in educational settings. These tests have been used for placement in instructional groups (e.g., ability groups or tracks), measuring achievement, assisting in making career and postsecondary educational choices, determining acceptance of applicants to colleges and universities, and monitoring the performance of educational systems.
Intelligence testing to guide ability grouping was one of the early uses of standardized testing (Cronbach 1975). The perceived need for ability grouping arose as two factors – students staying in school longer and the large waves of immigration to the US at the turn of the twentieth century – created a wider range of student ability in high school classrooms. These changes had an impact on college bound students whose progress was held back, according to some ability grouping proponents, by the large number of students who did not seem to be academically gifted.
Following from the work of Binet in development of what were called “mental tests,” Terman developed a screening tool to identify students who were viewed as not prepared for the intellectual challenges of typical schooling with such labels as “feebleminded” or “retarded” (Resnick 1982). These early tests were administered individually to students to determine whether they should be removed from normal instruction. Wholesale use of intelligence testing was introduced by the military during World War I, when tests were developed to identify potential officers. The successful use of standardized testing by the military encouraged further development of tests and non-military use, such as determining placement of students in homogeneous instructional ability groups (Resnick 1982). In the 1950s, there was a resurgence of intelligence testing for the purpose of grouping as a result of implementing the comprehensive high school with differentiated tracks (Linn 2000).
A second use of standardized testing has been the measurement of student achievement levels in a variety of academic domains. Examinations had long been used to determine student progress and set standards for high school graduation, but as the number of students increased it became necessary to establish standardized criteria. The National Education Association adopted recommendations to standardize evaluation in 1914. By the time of World War I, there had been a rapid increase in the number of achievement tests, with more than 200 available for use in the primary and secondary schools (Resnick 1982). Later, a related use of achievement testing was the implementation of minimum competency testing for high school graduation in the 1970s and early 1980s (Linn 2000, 2001).
Standardized testing played a third role as it was used by school guidance departments to assist students in job or career selection and in making decisions about attending postsecondary institutions. Testing in this area included assessing student aptitudes, interests, and skills to guide decision making between career and educational options. One aspect of this innovation was the move to keeping cumulative student records to document continued individual development (Resnick 1982; Linn 2001).
Determining whether students were academically prepared for college and university entrance is a fourth use of standardized tests. In 1899, the College Entrance Examination Board was created to “establish, administer, and evaluate examinations, in defined subject areas for entrance to participating colleges” (Resnick 1982: 187). After World War I, the Scholastic Aptitude Test (SAT) was developed to provide a standardized test that was not based on a specified curriculum, such as one from a college preparatory school. The focus on aptitude rather than curriculum was seen as being more equitable. In addition, the SAT introduced the use of multiple choice rather than essay type questions. Performance on the SAT and the American College Test (ACT), which was introduced in 1957, became a major component in the decision to accept students into most postsecondary institutions in the US.
A final important role of standardized testing, and possibly one of the earliest, was to compare schools and monitor their performance. As early as the 1840s a set of common questions was used in Boston to determine student progress. The result of this testing had little effect on students or teachers, but provided the Superintendent with a way to hold schools within the district accountable to common standards of student and teacher performance (Resnick 1982). This practice of using student achievement tests to hold schools accountable grew and continued through the rest of the nineteenth century, and is certainly prevalent today (Linn 2001).
New interest in the use of standardized testing occurred as a result of the 1965 federal Elementary and Secondary Education Act (ESEA). Standardized achievement tests became the means of monitoring and evaluating the use of funds provided under the Act (Linn 2000; Koretz 2002). The 1983 A Nation at Risk report on the state of American education added a new impetus for the use of standardized testing in evaluating the performance of schools. While the testing arising from the ESEA focused on educational equity, the new emphasis after A Nation at Risk was the overall performance of the American educational system relative to international education systems.
Most recently, the 2001 ESEA reauthorization, the No Child Left Behind Act (NCLB), increased the importance of standardized testing to new levels in the US. This wave of standardized testing has moved the focus to establishing content standards, the setting of performance (or proficiency) standards for all students, and the addition of high stakes assessments for schools, educators, and, in some jurisdictions, students (Linn 2000, 2003; Linn et al. 2002).
Several avenues of research are currently underway or are likely to be carried out in the foreseeable future.
First, research should continue to examine reasonable projections for schools making adequate yearly progress toward learning objectives. The current federal law of NCLB increases testing requirements and establishes accountability standards for states, districts, and schools, requiring them to make measurable adequate yearly progress (AYP) for all students and for subgroups of students defined by socioeconomic background, race/ethnicity, English language proficiency, and disability. There is currently wide variation in the rigor of both standards and tests, so that the share of students measured to be proficient varies widely from state to state. Over the next few years, researchers could continue to analyze data from different states to examine which schools make large gains on state assessments, in order to understand what ambitious, yet reasonable, goals might be established for AYP (see Koretz 2002; Linn et al. 2002; Linn 2003).
Second, research needs to pay greater attention to the tradeoffs that schools and teachers face under NCLB by examining how instructional resources are devoted to students at different points in the achievement distribution. For example, by focusing educators on the task of bringing all students to a minimum level of proficiency, it is possible under NCLB that schools will divert attention and resources from students who already meet this standard. In addition, schools may divert resources away from students who are so far below the standard that schools perceive little chance of bringing them to the proficient level. However, such consequences are not inevitable. It may be possible to avoid negative distributional effects if schools instead make more efficient use of their resources, but additional research is needed to address this important issue.
Third, researchers should continue to examine how school principals and teachers actually use test score results for improvement (Goldring & Berends 2006). Schools are typically inundated with data and many teachers and principals are not trained in statistics and measurement to thoroughly understand how to use test score results for improving the conditions of schools and classrooms. Further research into the capabilities and capacity of schools to use data in effective ways for improving students’ test scores would be beneficial for accountability systems that require shared responsibility (Linn 2003).
Finally, researchers should explore different ways to use tests to hold schools accountable. The current research suggests that test based accountability does not always work as intended, but there is no adequate research base to offer a compelling alternative to policymakers and educators. Koretz (2002: 774) describes the current situation as one in which “the role of researchers is like that of the proverbial custodian walking behind the elephant with a broom. The policies are implemented, and after the fact a few researchers are allowed to examine the effects and offer yet more bad news.” Alternative accountability approaches would expand beyond tests alone to examine a mix of incentives for teachers, changes in instructional practice, examination of students’ standardized test score gains and growth in addition to proficiency levels, and alignment of instruction with standards and tests (see Porter 2002). Together, empirical analyses of these elements incorporated into various programs, policies, and interventions may provide not only alternatives, but also better information about the system of student learning.
- American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (AERA) (1999) Standards for Educational and Psychological Testing. American Educational Research Association, Washington, DC.
- Baker, E. L. (2001) Testing and Assessment: A Progress Report. Educational Assessment 7(1): 1-12.
- Cronbach, L. (1975) Five Decades of Public Controversy Over Mental Testing. American Psychologist 30: 1-14.
- Goldring, E. & Berends, M. (2006) Leading with Data: A Path to School Improvement. Corwin Press, Thousand Oaks, CA.
- Koretz, D. (2002) Limitations in the Use of Achievement Tests as Measures of Educators’ Productivity. Journal of Human Resources 37(4): 752-77.
- Linn, R. L. (2000) Assessments and Accountability. Educational Researcher 29(2): 4-16.
- Linn, R. L. (2001) A Century of Standardized Testing. Educational Assessment 7(1): 29-38.
- Linn, R. L. (2003) Accountability: Responsibility and Reasonable Actions. Educational Researcher 32(7): 3-13.
- Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002) Accountability Systems: Implications of Requirements of the No Child Left Behind Act of 2001. Educational Researcher 31(6): 3-16.
- Porter, A. C. (2002) Measuring the Content of Instruction: Uses in Research and Practice. Educational Researcher 31(7): 3-14.
- Resnick, D. (1982) History of Educational Testing. In: Wigdor, A. & Garner, W. (Eds.), Ability Testing: Uses, Consequences, and Controversies, Part II: Documentation Section. National Academy Press, Washington, DC, pp. 173-94.