The past few weeks have brought several more reminders of why it’s critical to develop fair and defensible ways to evaluate teachers and the programs that prepare them. The first was a ruling by a California superior court judge in the Vergara case that the state’s teacher tenure system discriminates against minority and low-income students. (I spoke about the case on The Diane Rehm Show, and may return to the topic in a future blog post.) The second reminder came in the form of the latest release of so-called data on the quality of teacher preparation programs from the National Council on Teacher Quality, a project in which GW and many of the nation’s leading teacher preparation programs chose not to participate. (The flaws of the NCTQ data and rating methods are described in my blog post of June 18, 2013.) The third reminder concerns the Department of Education’s plan to press ahead with a system to rank teacher education programs in institutions of higher education, which has already engendered considerable debate.
Without question, teachers and teacher preparation programs must be held accountable for performance, just as physicians, engineers, and members of other professions with life-changing impacts are. Our society, which so deeply believes in the promise of education, deserves measurable evidence of how teachers and teacher educators are doing. Both the Vergara ruling and the NCTQ ratings highlight why it’s so important to find the right measures: to get the right data and get the data right.
This point was emphasized in Evaluating Teacher Preparation Programs, a 2013 report of the National Academy of Education for which I had the honor of being the lead author. I referenced this report in my thank-you letter to Secretary Duncan, who had so generously spent an afternoon with us on campus in March to encourage students to go into teaching. As I said in my note to the Secretary, the NAEd report argues for a balanced approach to evaluating teacher education programs that includes measures of inputs (such as faculty qualifications, quality of courses and student teaching, and selectivity in admissions) and outcomes (such as job placement, retention rates, and teaching performance of program graduates as measured by the learning in their classrooms). The report suggests that designers of accountability systems for teacher preparation programs should consider the strengths and limitations of various measures, their intended and unintended incentives, and the anticipated benefits and potential risks to the teaching profession and to the education community generally.
One measure of performance has received considerable attention of late—the impact of individual teachers on the achievement of their students, often gauged through so-called “value-added” models, or VAMs. The research community has weighed in on the plusses and minuses of VAMs (here, here, and here). And this spring, Secretary Duncan announced plans to move ahead with new rules that would require states to develop rating systems for teacher preparation programs that emphasize outcome measures, including VAM measures of graduates’ impacts on student achievement. These ratings, along with other data, would be used to determine which programs are eligible for federal TEACH grants.
In preparing the NAEd teacher preparation report, our committee reviewed evidence on the effectiveness of VAMs in estimating teachers’ impact on student achievement and found it to be mixed. Questions remain about the validity of VAMs as indicators of program quality, and we therefore recommended against placing too much weight on them when making critical decisions about program accountability.
Recent research raises new questions about the validity of VAMs for decisions about teacher performance or the quality of the programs that prepared them. A study by Polikoff and Porter finds weak to nonexistent relationships between state-administered VAM measures and the content or quality of teachers’ instruction, leading the researchers to question the usefulness of VAM data in evaluating teacher performance. Condie, Lefgren, and Sims find that a large number of teachers are misranked by the typical VAM; they estimate that using VAMs in teacher retention policies will improve student outcomes, but not by as much as a policy of promoting teacher specialization across subjects. Bitler and colleagues applied a VAM to assess the effects of teachers on something they can’t change: students’ height. Though this may seem flippant, preliminary results presented at the AERA conference indicate that the magnitude of the teachers’ “effects” on height is nearly as large as their effect on math and reading achievement, which led the authors to question “the extent to which VAMs cleanly distinguish between effective and ineffective teachers.”
So while the Administration is right to focus on strengthening the capacity of teacher preparation programs to supply high-quality teachers, the evidence continues to suggest caution about placing too much stock in VAMs within systems for evaluating these programs. But a missing piece in many of the critiques of VAMs is the “counterfactual”: how good or bad are other measures of teaching quality? For example, conventional wisdom and the increasingly strident rhetoric of some educators who are wary of standardized testing tend to favor classroom observations. In fact, though, observational methods have their own potential pitfalls. As shown in research on workplace discrimination generally, extraneous factors and unconscious biases can undermine the reliability and validity of judgments about performance. In the specific case of classroom assessments of teaching, Pianta and colleagues caution that “to draw any conclusions from observational data, the instruments we are using must be subjected to extensive testing and evaluation” and note that “the research community is just beginning to subject classroom assessment tools to that type of useful scrutiny.”
A Brookings study released in May shows that compromises to the validity and reliability of classroom observations can stem from the distribution of students of varying academic proficiency. The evidence suggests that teachers in classrooms with relatively higher-performing students tend to fare better in observational assessments than teachers working with lower-performing students, independent of the effects the teachers actually have on the rate or magnitude of gains in their students’ learning. A legitimate question that arises from this research is whether an over-reliance on classroom observations might have the unintended effect of perpetuating disparities in the allocation of teaching resources and putting unfair negative pressure on teachers who are working in the toughest environments. As Whitehurst et al. note, “We should not tolerate a system that makes it hard for a teacher who doesn’t have top students to get a top rating.”
The bottom line, then, is that any measure of teaching quality is bound to be imperfect, and no single measure should ever be the sole basis for important decisions such as hiring, retention, promotion, or firing of teachers. The NAEd report reiterates that important principle, echoing decades of research and advocacy by the professional measurement and assessment community.
Accountability is a perfectly American concept—and what better time to remember that than on the eve of Independence Day! But finding approaches to what might be called sensible accountability (as wisely suggested in the recent work of Greg Duncan and Richard Murnane) is one of our greatest contemporary challenges.
July 3, 2014