
Challenges in Identifying High-Leverage Practices
by Julie Cohen
2015

Background: The current accountability climate prioritizes identifying features of “effective teaching.” One approach has been to outline a set of “high-leverage practices,” defined as teaching moves that are research-based, have the potential to improve student achievement, and support students in learning central academic concepts. But which practices qualify as “high leverage,” and based on what criteria? This article raises several issues involved in identifying “high-leverage” teaching practices based on their relationships with different types of student outcome measures. Purpose: The study addresses the following research questions: What practices are associated with student achievement gains on high- and low-stakes assessments? How do teachers use these teaching practices in their classrooms? Population: Participants in this study were 103 fourth-grade teachers from a single district who volunteered to have their classroom instruction recorded as part of the Bill and Melinda Gates Foundation’s Measures of Effective Teaching (MET) project. Research Design: The research involved analysis of multiple data sources: video records of practice, qualitative observation notes, quantitative measures of math teaching, and student outcomes on two math assessments, one high stakes and comprised entirely of multiple-choice questions, the other low stakes and focused on open-ended problem solving. Data Analysis: A standardized observation tool designed to code language arts instruction, PLATO, was modified for reliable scoring of math teaching. More than 300 math lessons were scored using four PLATO scales. Logistic regressions were used to examine relationships between PLATO scores and teacher value-added measures computed from high- and low-stakes student assessments. A stratified, purposive sample of lessons was analyzed qualitatively.
Findings/Results: Scores on two practices, modeling and procedural strategy instruction, predicted value-added based on the high-stakes state test, but had no relationship with value-added on the low-stakes test. Qualitative analyses demonstrate that instruction was explicitly oriented toward success on the state test. Teachers taught test-taking strategies and modeled how to eliminate “silly” answers listed in multiple-choice format. Two other practices, orchestrating classroom discourse and conceptual strategy instruction, had no relationship with value-added on either test. Scores on these two scales were positively skewed, with very few instances of high-scoring instruction. Conclusions/Recommendations: Discussion focuses on potential limitations of labeling teaching practices as “high leverage” based on their relationship with high-stakes standardized assessments and the importance of sampling teachers with a full range of enactment of high-leverage practices.

Although there is agreement that teachers are one of the most important within-school factors for student learning (Chetty, Friedman, & Rockoff, 2011; Glazerman et al., 2010), there is little consensus about what evidence is needed to identify more or less “effective” teachers (Ball & Rowan, 2004; Bell et al., 2012; Hill, Kapitula, & Umland, 2011). In particular, there is heated debate about the degree to which teaching quality should be assessed primarily based on process criteria, such as measures of teaching practice, or outcome criteria, such as student achievement gains. Understanding teaching quality raises numerous conceptual and empirical questions about the relationship between these multiple measures, and how best to delineate what constitutes high-quality instruction.
One approach to parsing the specific facets of quality teaching is to identify a set of “core” or “high-leverage practices” (Ball & Forzani, 2009; Grossman & McDonald, 2008).^{1} These have been defined as teaching moves that are research-based, have the potential to improve student achievement, and support students in learning central academic concepts. But which practices qualify as high leverage, and based on what criteria? As McDonald, Kazemi, and Kavanagh (2013) argue, we still need to develop a clear process for determining what counts as a high-leverage practice. If quality teaching is in fact our collective goal, then we need to build a more robust literature that addresses the degree to which potential candidates for high-leverage practices are associated with a wide range of student outcomes and illustrates how these practices are used in classrooms (Valli, Croninger, & Buese, 2012). Using data from the Bill and Melinda Gates Foundation’s Measures of Effective Teaching (MET) project, this study raises several key issues involved in identifying high-leverage teaching practices. It highlights potential limitations of labeling teaching practices as high leverage based on their relationship with high-stakes standardized assessments. It also underscores the importance of sampling teachers who demonstrate the full range of practice enactment, both high- and low-quality instantiation. We may not be able to determine if a practice is in fact high leverage in a sample with little that qualifies as high-quality instruction. The study addresses the following research questions:

o What is the relationship between a set of potentially “high-leverage practices” and teacher value-added models (VAMs) computed with high- and low-stakes student assessments?
o How do teachers use these teaching practices in their classrooms?
BACKGROUND AND FRAMEWORK

RELATIONSHIP BETWEEN TEACHING PRACTICES AND STUDENT ACHIEVEMENT GAINS

Though the term “high-leverage practice” is relatively new, the search for teaching variables that impact student achievement is conceptually similar to the process-product model prevalent in the 1960s–1980s (cf. Berliner, 1986; Brophy & Good, 1986; Dunkin & Biddle, 1974). Studies in this empirical tradition identified a number of important teaching behaviors associated with higher levels of student performance on standardized assessments, including “time on task” (Stallings, 1980), “wait time” (Rowe, 1986; Tobin, 1987), and “stating the lesson’s objective.” Brophy and Good (1986) found that teachers with higher levels of student achievement connected new academic material to students’ prior knowledge and consistently assessed and provided feedback during students’ independent practice activities. Good and Grouws’ seminal study of the Missouri Mathematics Program (1975, 1979) identified and then tested the causal impact of several teaching practices on student achievement in mathematics. They found that effective mathematics teachers focused on the meaning of the mathematics and promoted student understanding through discussion. Although process-product research contributed a great deal to our understanding of effective teaching, it was critiqued for assuming perfect alignment between the test used to measure learning (the “product”) and the teaching (the “process”), and for the incompleteness of a single assessment as the measure of learning (Shavelson, Webb, & Burstein, 1986; Good & Grouws, 1979, is a notable exception to the tradition of using a single assessment measure).
Though research on teaching moved away from the process-product model to focus on studies of teacher motivation and cognition, the current accountability-focused policy climate, in which the only goals targeted are those that can be measured, has fueled a return to studies of the relationship between standardized teaching variables and standardized student outcome variables (Mehta, 2013). Though quantification models and methods have become more sophisticated or “highly elaborated” (Espeland & Stevens, 1998), and measures of student achievement growth are now captured in statistically sophisticated value-added models (VAMs),^{2} many of the original limitations remain. The standardized assessment still functions as the measure of learning, albeit with numerous statistical controls, and the teaching is still assumed to be aligned with that measure. The notion that effective teaching practices can be isolated based on their relationship with student outcomes is clear in articles with titles such as “Identifying effective classroom practices using student achievement data” (Kane, Taylor, Tyler, & Wooten, 2010). Large-scale research projects such as the MET project similarly favor teacher value-added measures as the dependent variable of interest, suggesting that all measures of teaching are validated through their ability to predict student achievement gains. However, any single specific outcome is an inherently incomplete measure of student learning. Moreover, critics assert that high-stakes tests might assess a narrow range of learning outcomes, and gains on such outcomes could stem from teaching that reduces the complexity of learning tasks to align with tested content and format (i.e., multiple-choice questions). Such teaching practices, referred to as “teaching to the test,” do not comport with most definitions of “good teaching” (Koretz, 2002; Luna & Turner, 2001; Popham, 2003).
A number of studies demonstrate that teacher value-added estimates fluctuate substantially depending on the student outcome measure, problematizing the notion that effective teaching is a uniform construct readily quantifiable with a single statistic (Brophy & Good, 1986; Corcoran, Jennings, & Beveridge, 2011; Konstantopoulos, 2014; Lockwood et al., 2007; Papay, 2011; Sass, 2008). It makes logical sense that the selection of an outcome measure would influence the measurement of a teacher effect. Different tests focus on different content, use different formats, and place differing levels of cognitive demand on students. While none of the studies that highlight variability in teacher effects by student outcome measure focus on classroom instruction, a natural extension of the argument that teacher effects vary by assessment type is that the teaching practices associated with student achievement gains would similarly vary based on the content, format, and cognitive demand of the test (Good & Grouws, 1979). In other words, teacher effects may respond differently to various measures of student achievement because different tests require skills that stem from distinct types of teaching.

MEASURES OF TEACHING PRACTICE

Value-added models (VAMs) tell us nothing about how to improve classroom teaching or student learning; they are a statistical tool designed to look at outcomes rather than mechanisms (instructional practices, a teacher’s interactions with students). As a result, districts are increasingly using multiple measures, including classroom observations, for assessing teaching (Gordon, Kane, & Staiger, 2006). These standardized observation protocols all have an underlying theory of quality instruction, including elements such as responsiveness, clarity, precision, and developmental appropriateness. For example, Charlotte Danielson’s widely used Framework for Teaching (2007) was designed based on constructivist principles.
The Classroom Assessment Scoring System (CLASS) (cf. Pianta et al., 2008) grew out of developmental theory suggesting that warm, positive, supportive interactions between children and adults are the primary mechanism for student development and learning. Hill and colleagues (2008) developed the Mathematical Quality of Instruction (MQI) scoring system to correspond with earlier work on the importance of teachers’ mathematical knowledge for teaching (cf. Hill, Rowan, & Ball, 2005), and the need for clarity and precision in instructional explanations. This study focuses on several teaching practices featured in the PLATO protocol, a standardized classroom observation system originally developed to assess instructional quality in English language arts (ELA) but modified for this research to reliably score math instruction as well. The full PLATO protocol includes 13 instructional features organized around four factors: instructional scaffolding, the representation and contextualization of content, disciplinary demand, and classroom environment (for more details on PLATO’s theory of instruction, see Grossman, Loeb, Cohen, & Wyckoff, 2013). The practices selected for this study were chosen based on conceptual linkages detailed below and because research suggests these teaching moves are also used to support student achievement in mathematics. A separate portion of this research project (Cohen, 2013) focused on elementary teachers’ use of these practices in both math and language arts instruction. The first practice is teacher modeling, or the visible enactment of student activity: the teacher engages in a high-quality example of what students will be asked to do. The second practice is strategy instruction, which involves an explicit discussion of how to approach and engage in academic tasks. A teacher could explain how to check one’s work in a division problem or how to determine relevant data in a word problem.
Finally, this study analyzes how teachers orchestrate classroom discussions by assessing the degree to which students have opportunities to publicly share their thoughts, opinions, and reasoning, as well as the ways in which the teacher picks up or elaborates on student contributions by asking questions such as: “Can you tell more about that?” “What made you think that?” and “Where did that number come from and what does it tell you?” The conceptualization of high-quality enactment of these practices can be traced to two theoretical models of quality teaching, “cognitive apprenticeship” (Collins, Brown, & Holum, 1991; Collins, Brown, & Newman, 1989) and “authentic pedagogy” (Newmann, Marks, & Gamoran, 1996). Cognitive apprenticeship involves a teacher acting as an expert guide who visibly engages in the same activities and processes as students and makes his or her thinking explicit. Collins, Brown, and Newman (1989) described how instruction can communicate “the culture of expert practice” by teaching and modeling the behaviors, strategies, and dispositions of “real” practitioners. When teachers model or provide explicit, strategic instruction about how to approach academic tasks, they make internal metacognitive processes external and accessible to students. In doing so, theoretically, students develop a greater sense of metacognitive awareness, a capacity for reflecting on their approach to academic tasks. The way in which teachers orchestrate classroom discourse can also highlight cognitive processes in a publicly accessible way. In a high-quality “instructional conversation” about academic material, the teacher promotes deep, visible engagement with academic content and processes by connecting disparate approaches or ideas, asking clarifying questions, and pushing students to elaborate (Goldenberg, 1992; Tharp & Gallimore, 1988). Teachers can create discourse norms in which students are encouraged to publicly display their thinking around academic tasks.
High-quality enactment of these practices also reflects what Newmann, Marks, and Gamoran (1996) term “authentic pedagogy,” which orients students to learning goals that have “value beyond school” (p. 284). Conceived through this lens of “authenticity,” the goal of these teaching practices is to make visible the kinds of thinking processes that help students develop “the habits and dispositions of interpretation and sense-making” more than “acquiring any particular set of skills, strategies, or knowledge” (Resnick, 1989, p. 58). Thus, high-quality instantiation of these practices might support different types of learning than those detected on multiple-choice assessments. While cognitive apprenticeship and authentic pedagogy have a great deal of theoretical and empirical merit, large-scale observational studies, including the Study of Instructional Improvement, the Trends in International Math and Science Study, and the MET project from which these data are drawn, have documented that instruction in American classrooms tends to be quite different from these ideals (Kane & Staiger, 2012; Rowan, Correnti, & Miller, 2002; Stigler & Hiebert, 1999). There is limited focus on mental processes, and teachers seldom press for explanations or probe student thinking (Nystrand & Gamoran, 1991; Valli et al., 2012; Webb et al., 2009). Students are rarely engaged in authentic activities similar to those of active practitioners. Instead, teachers tend to prioritize activity completion over student sense-making because, as Sykes, Bird, and Kennedy (2010) argue, “many of the tools, norms, rituals, and resources in schools are aimed at maintaining teacher-dominated discourse, textbook-based lessons, and coverage as the main curricular principle” (p. 465). This suggests the potential challenges of finding a sample of teachers in which we might see the full range of enactment of the focal teaching practices.
RELATIONSHIP AMONG MEASURES

Research looking at the relationships among these multiple measures of teaching quality has come to radically different conclusions. Pianta et al. (2008) found only modest correlations between observation scores and student growth trajectories from first to fifth grade. In contrast, Kane, Taylor, Tyler, and Wooten (2010) found a significant positive relationship between teaching observation scores and student achievement growth. PLATO, the observation protocol used in this study, has been used in several studies of middle school language arts teachers. Some practices, such as strategy instruction and modeling, were effective at differentiating teachers in higher value-added quartiles, while others, such as the way in which teachers orchestrated discussions, did little to differentiate among teachers in different value-added quartiles (Cohen & Grossman, 2011; Grossman et al., 2013). Hill and colleagues (2011) conducted one of the few studies that use cases to illuminate the differences among the practices of mathematics teachers with similar value-added. They show that while many teachers with high value-added estimates also scored well on the MQI, other high value-added teachers taught lessons marred by content errors and imprecise instructional explanations. In other words, some teachers who are successful at raising student achievement scores also demonstrate high-quality teaching practices, while others do not. The mixed nature of these findings demonstrates a clear need for more studies exploring the relationship between measures of teaching quality.

DATA AND METHODS

SAMPLE

This study draws on data from the MET project, which includes digital video of 13,000 lessons from seven districts nationwide (for an overview of the study, see Gates Foundation, 2010).
The sample for this study, a secondary analysis of the MET data, includes videos of math lessons from all the participating fourth-grade teachers in one large, urban district: 103 teachers and 309 math lessons.^{3} All the lessons were videotaped from March to May of 2010, prior to the administration of the state assessment. A single district was selected to eliminate additional sources of variation in teaching practices based on district factors, such as curricular mandates, textbooks, and state and district content standards. Using data from the MET study also allowed for a rare opportunity to analyze achievement on multiple assessments: the state’s high-stakes math assessment as well as the supplemental assessment given across the MET districts, the Balanced Assessment of Mathematics (BAM), which assesses conceptual understanding of mathematics through open-ended problem solving. The BAM was developed at the Harvard Graduate School of Education between 1993 and 2003 and consists entirely of open-ended questions that press students to justify their solution methods and explain their reasoning (for more details on the version of the BAM administered in the MET project, see Bill and Melinda Gates Foundation, 2010). To better understand whether some of PLATO’s practices are differentially associated with achievement based on assessment format, a district was selected whose assessments were seemingly most different from the supplemental assessments, based on preliminary analysis of the fourth-grade math assessments for the six MET districts. The fourth-grade state math test for the district selected does not require any constructed responses. In math, there were few word problems, and many equations were set up for students (e.g., 180 − 90 = ? [a] 9, [b] 90, [c] 110, [d] 270). Thus this district was selected in part to maximize variation in the student achievement measure.
The student and teacher population in this large urban district is predominantly non-White, and the majority of students qualify for subsidized lunch programs.^{4} More than 85% of the students in the district are African-American, and approximately 10% of the students are White. The current teacher evaluation system in this district is based on both classroom observations using a standardized observation protocol and teacher value-added on the state test. Though teachers in the MET study are all volunteers, they represent a large percentage of fourth-grade teachers in the focal district, and statistics from the Gates Foundation suggest that the sample is representative of teachers in the district as a whole in terms of demographics, preparation, advanced degrees or National Board for Professional Teaching Standards (NBPTS) certification, and years of experience (see Table 1; for more details, see Kane & Staiger, 2012). In terms of teacher effectiveness, the research team from the district has indicated that the fourth-grade teachers in the MET sample represent a normal distribution of teachers, according to the method by which teachers were evaluated at that time (District Liaison, personal communication).

Table 1. Characteristics of MET Volunteers vs. Other Teachers in Focal District

This was a district that exerted a great deal of control over curriculum and instruction. Teachers across the district used a common textbook for math instruction, enVision Mathematics, were required to teach content and skills in the district-mandated sequence, and were subject to periodic curricular “audits” from members of the curriculum and instruction office. The content in the textbook was well-aligned with the state math test. However, teachers in this study presented mathematical content to students in a notably different way than content was presented in the teachers’ guide that accompanied the textbook.
Thus, the written curriculum and enacted curriculum were distinct in ways oft noted in the literature (cf. Remillard, 2005). This is a particular place with a particular population of teachers and students using a particular set of curricular materials. Thus, this study is not necessarily reflective of effective teaching in other contexts, but is designed to examine issues that are likely pertinent across different districts.

CODING INSTRUCTIONAL PRACTICES

PLATO was first modified so that it could be used to reliably score math instruction (Cohen, 2013).^{5} All scales were finalized when a group of five raters achieved 80% interrater agreement. One particularly important modification was that the scale for strategy instruction was divided into two separate scales for procedural and conceptual strategy instruction. Procedural strategies taught students rules, algorithms, or formulae to use in approaching academic tasks. Conceptual strategies taught students to systematically reason and make sense of academic tasks in more flexible and adaptive ways. Thus, the three focal practices became four practices during the course of modification. Raters were told that a strategy was conceptual when the goal was focused on systematically reasoning and making sense of problems. In math this might include a discussion about the relationship among multiple solution approaches, such as linking a visual representation of a fraction to the numerical representation. There might also be some discussion of why the procedure works, how to analyze the structure of a problem, how to compare the problem to similar problems, and how to check one’s answer in the context of a problem. In contrast, purely procedural strategy instruction was focused on following a strict set of steps to complete a task. Raters were given an extensive list of procedural strategies and conceptual strategies for a set of varied mathematical topics common to upper elementary classrooms.
For example, in a lesson focused on converting improper fractions to mixed numbers, a procedural strategy lesson might include a definition (i.e., “An improper fraction has a numerator that is larger than the denominator”) and some discussion of how to convert to a mixed number by dividing the numerator by the denominator. The same lesson could also include conceptual strategy instruction if the teacher asked students why division was the appropriate operation and/or oriented them to a visual representation of an improper fraction, such as 5/4. The teacher might ask students, “How many wholes are represented here? How many parts are left over? How do we know the denominator from this visual? Why does the denominator stay the same when you convert an improper fraction to a mixed number?” Although attempts were made to clarify the distinction between procedural and conceptual strategy instruction, these are complex constructs, and the extent to which they are mutually exclusive is a contested issue (Baroody, Feil, & Johnson, 2007; Hiebert & Grouws, 2007; Star, 2005). The Common Core State Standards (CCSS) (2010) and other reports by the National Research Council (Kilpatrick, Swafford, & Findell, 2001) and the National Council of Teachers of Mathematics (2008) highlight the necessity of developing these dual strands of procedural fluency and conceptual understanding in conjunction. Thus, lessons could be scored as having high levels of both procedural and conceptual strategy instruction. Moreover, one of the affordances of having a four-point scale instead of an indicator variable is that raters could assess the degree to which a strategy was conceptual or procedural. Raters also took qualitative notes that explained their reasons for identifying a strategy as procedural, conceptual, or both.
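The procedural strategy described above, dividing the numerator by the denominator and keeping the remainder over the same denominator, can be sketched in code. This is purely illustrative; the function name is invented, not drawn from any curriculum in the study.

```python
def to_mixed_number(numerator, denominator):
    """Procedural conversion of an improper fraction to a mixed number:
    the quotient is the whole-number part, and the remainder stays
    over the original denominator (which is why it does not change)."""
    whole, remainder = divmod(numerator, denominator)
    return whole, remainder, denominator

# The 5/4 example from the text: one whole, with one fourth left over
print(to_mixed_number(5, 4))  # → (1, 1, 4), i.e., 1 1/4
```

The conceptual questions the teacher asks (“How many wholes are represented here?”) correspond to interpreting the quotient and remainder, rather than merely executing the steps.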
A team of raters was then trained to code the teachers’ practices using the PLATO protocol, where scores range from 1 (low) to 4 (high), and to take qualitative notes about the observed instruction.^{6} Raters were certified to begin scoring MET videos when they reached at least 80% agreement with master scores on a set of five videos of instruction. This certification procedure is commonly used in other classroom observation systems (Danielson, 2007; Pianta et al., 2006). Fifteen percent of the lessons were double-scored, and interrater reliability was reasonably high (κ > .6). During these reliability checks, the rater’s supporting notes for a lesson (i.e., time stamps, descriptions of strategies taught, content modeling, examples of teacher uptake) were also checked for interrater consistency. Because it is well documented that raters vary in their severity or leniency, no single rater scored more than one lesson per teacher (Bell et al., 2013; Hill, Charalambous, & Kraft, 2012).

RELATIONSHIP WITH VALUE-ADDED MEASURES

As part of the MET study, a team calculated teachers’ value-added coefficients. The model used isolates a teacher effect coefficient based on the standardized test score gains of a teacher’s students, conditional on the composition of her classroom (prior achievement, student demographics, etc.). For the state assessments, teacher value-added scores were standardized across all teachers within that grade level and district. Because only the students of the MET teachers in a district took the supplemental assessments, teacher value-added was standardized across the MET teachers in all six districts in a given grade level (for the specifics of model specification, see Kane & Staiger, 2012, p. 41). The scores and qualitative codes for the math lessons in this sample were merged with the MET data, which included these teacher value-added measures.
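The interrater reliability statistic reported above, Cohen’s kappa, corrects raw percent agreement for the agreement two raters would reach by chance given their marginal score distributions. A minimal sketch, using invented double-scored ratings on the 1–4 PLATO scale (the data are hypothetical, not from the study):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters: (p_o - p_e) / (1 - p_e),
    where p_o is observed agreement and p_e is agreement expected by chance."""
    n = len(rater_a)
    # Observed proportion of exact score agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement from each rater's marginal score frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[s] * freq_b.get(s, 0) for s in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical double-scored PLATO ratings for ten lesson segments
a = [1, 2, 2, 3, 4, 2, 1, 3, 2, 4]
b = [1, 2, 3, 3, 4, 2, 1, 3, 2, 3]
print(round(cohens_kappa(a, b), 2))  # → 0.73, above the .6 threshold noted
```

Because kappa discounts chance agreement, it is a stricter standard than the raw 80% agreement used for rater certification.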
Value-added estimates for individual teachers tend to be particularly noisy and unstable for those in the middle of the distribution of value-added scores (Aaronson, Barrow, & Sander, 2007; Ballou, 2005; Koedel & Betts, 2007; McCaffrey et al., 2009). Perhaps as a result of this noisiness, the literature suggests that it is difficult to find linear relationships between teacher value-added measures and scores on classroom observation protocols (Hill, Umland, Litke, & Kapitula, 2012; Ruzek, Hafen, Hamre, & Pianta, 2014). Blunter categories, such as value-added quartiles or quintiles, used in a logistic regression, may make more sense in designating teachers more and less successful. Moreover, these distinctions may be more meaningful in a policy context because personnel decisions (e.g., termination, tenure, or merit pay) tend to be most consequential for teachers who fall at the more extreme ends of the distribution. Therefore, the general analytic strategy used in this study was to run logistic models that looked at the extent to which scores on instructional practices predicted the likelihood of being in one value-added quintile (20% bands of the total distribution) versus another.^{7} I analyzed the degree to which practice scores predicted a teacher’s being in the bottom 20% of the value-added distribution versus being anywhere else in the distribution. In other words, I explored whether practice scores signaled an increased likelihood of being designated a “least effective” teacher. I then compared teachers in the top value-added quintile to everyone else in the distribution to determine if higher practice scores signaled an increased likelihood of being designated a “most effective” teacher. Finally, I analyzed whether practice scores predicted an increased likelihood of being in the top 20% of the distribution versus being in the bottom 20% of the distribution.
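The three quintile contrasts described above can be sketched as follows. The data are simulated, and the bare one-predictor gradient-descent fit stands in for the full logistic models with covariate controls used in the study; nothing here reproduces the actual MET estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated value-added estimates and standardized practice scores
# for 100 hypothetical teachers (assumed positively related)
vam = rng.normal(size=100)
practice = 0.5 * vam + rng.normal(size=100)

# Assign value-added quintiles: 0 = bottom 20%, ..., 4 = top 20%
cuts = np.quantile(vam, [0.2, 0.4, 0.6, 0.8])
quintile = np.searchsorted(cuts, vam)

# Binary outcomes for the three contrasts described in the text
bottom_vs_rest = (quintile == 0).astype(float)  # "least effective" vs. all others
top_vs_rest = (quintile == 4).astype(float)     # "most effective" vs. all others
extremes = (quintile == 0) | (quintile == 4)    # top 20% vs. bottom 20% only

def fit_logistic(x, y, lr=0.5, steps=5000):
    """Minimal one-predictor logistic regression fit by gradient descent
    (no student-composition covariates, unlike the models in the study)."""
    b0 = b1 = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(b0 + b1 * x)))
        b0 -= lr * np.mean(p - y)
        b1 -= lr * np.mean((p - y) * x)
    return b0, b1

b0, b1 = fit_logistic(practice, top_vs_rest)
print(f"change in log-odds of top-quintile membership per SD of practice: {b1:.2f}")
```

Because practice scores are standardized, the slope is read as a change in log-odds per standard deviation, which is what makes coefficients comparable across the four PLATO scales.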
This is clearly the most extreme kind of comparison, but it is often used in the value-added literature and featured in the MET reports (Kane & Staiger, 2012). All logistic regression models included a vector of controls for average student characteristics in that teacher’s section during the 2009–2010 school year (percent of special education students, percent ELLs, percent of students in the predominant minority group in this district, percent qualified for free and reduced-price lunch). The student characteristics control variables were selected based on literature suggesting these factors might influence the nature of the instruction in the classroom and its relationship with student achievement (Goldenberg, 2008; Ladson-Billings, 1995; Morrison et al., 2008). Although these variables are controlled for in the value-added models, they are not controlled for in the practice scores. All models were run with standardized practice scores because value-added is similarly standardized vis-à-vis the population against which teachers would typically be compared, and because the standard deviation units are more easily interpretable and comparable across the practice scales. The benefits and drawbacks of standardizing practice scores are discussed in the implications.

ASSESSMENT ANALYSES

The MET dataset also includes data from the Survey of Enacted Curriculum (SEC) conducted by Polikoff and Porter (2012). The SEC uses a “content language” to describe the material covered on a given assessment, which better enables comparative analysis of the specific features of the two math assessments used to measure student achievement. This language includes at least two dimensions: one focused on topics (e.g., geometry or number sense), and the other focused on cognitive demand (e.g., recall, making generalizations).
For each subject, topics are divided into general areas (e.g., measurement, number sense) that are then organized into subcategories or more specific topics (e.g., creating bar charts, multiplying fractions). Generalizability studies of the SEC instrument demonstrate high levels of reliability (0.86 for math assessments), indicating that raters tend to characterize test items similarly (for a more detailed description of the SEC, see Porter, Polikoff, Zeidner, & Smithson, 2008). All of the assessments in the MET study were coded by at least four raters.

QUALITATIVE ANALYSES

While quantitative analyses are useful for understanding broad patterns in the data, research on teaching also needs a clearer understanding of how high-leverage practices are used and why we might see differential relationships with student outcomes when value-added is computed with different assessments. To answer these questions, I used Erickson’s (1985) funneling method of qualitative case sampling. Before selecting specific lessons to watch again, I first tried to gauge the full range of the enactment of practice. I began by reading all the notes and time stamps associated with the four focal practices to provide a “wide-angle view” of practice use among teachers in the sample (Erickson, 1985, p. 143). These notes provided documentation of more concrete details about the enactment of practice than the scores alone would provide. I focused on one practice at a time, and organized these data by PLATO score from 1 to 4, so that I had a sense of what was being taught and when it was taught at each of the score points. I took notes and kept track of patterns as I read, beginning with general themes based on the literature, my theoretical framework, the PLATO rubrics, and my own observations from having watched hundreds of the videos (Ryan & Bernard, 2000).
For example, I noted when lessons might be cases of “cognitive apprenticeship” or “authentic pedagogy” because these were central components of my conceptual framework. Because Kazemi and Stipek’s (2001) notion of “conceptual press” is a helpful way of describing discourse in math classrooms, I noted when the raters provided examples of teacher uptake that seemed like “conceptual press.” The PLATO rubric highlights “recitation formats” as characteristic of instruction at the 2 level for orchestrating classroom discussions. Thus, I noted when raters provided excerpts of discourse that involved recitation or choral response. Finally, having watched so many videos, I knew that many lessons focused explicitly on preparation for the upcoming standardized test. Therefore, I noted lessons that raters characterized as focused on test-taking strategies or where material was presented in multiple-choice format. Based on the first pass at these notes, I created analytic memos documenting the patterns across the observations both within and across score points. These memos were instrumental in identifying the variation in the ways in which these practices were used in classrooms and for collating the recurrent instances of enactment of these practices (i.e., the common ways that teachers evaluated student responses across a wide range of classroom discussions). In particular, I highlighted patterns reflected across the data within the score point, as well as identified more unique or rare cases, based on the notes, to illustrate the ways in which the typical patterns did not necessarily extend across the entire sample of lessons at a score point (Charmaz, 2006; Merriam, 1998; Small, 2009). I then used a stratified purposive sampling plan to select segments for reanalysis (Kemper & Teddlie, 2000; Patton, 2002). I had two distinct purposes in selecting segments as cases.
I wanted to represent some range of the practices, so I sampled at least five instructional segments for each score point (1–4). Because I also wanted to make more general claims about the sample as a whole, I selected these segments to maximize variation in teachers and content coverage (i.e., geometry, word problems, fractions, and multiplication). For each of the four practices, I then watched and transcribed the focal portion of the lesson (i.e., the modeling, discourse, or strategy instruction) for all the lessons before generating another set of analytic memos (Miles & Huberman, 1994). Across the four practices—modeling, conceptual strategy instruction, procedural strategy instruction, and orchestrating discussions—I sampled lessons from 78 of the 103 math teachers (i.e., 75.73%). I then went back to the analytic memos written based on the entire dataset to assess the degree to which the segments selected represented the common patterns and exceptional or interesting cases originally noted. I continued watching and transcribing segments at a particular score point until I felt I had captured both the more consistent patterns and the variation within a score point and subject area. This resulted in watching a disproportionate number of segments at the high ends of the PLATO scales because there was more variation in practice at the higher ends of the scale than at the lower ends of the scale. Thus, it took more cases to “saturate” cells for the higher levels of practices even though most of the scores were clustered at the lower ends of the four scales, with the notable exception of teacher modeling (for a discussion of saturation in qualitative case sampling, see Strauss & Corbin, 1998). This sampling is an example of what Teddlie and Yu (2007) term the “representativeness/saturation trade-off” (p. 86). I am attempting to represent the variability in practices, although not the frequency with which the various practices occurred.
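The sampling constraints described above can be sketched in code. Only the constraints themselves come from the text — at least five segments per score point (1–4), spread across teachers and content areas; the field names and the tie-breaking rule are hypothetical.

```python
import random

# Hypothetical sketch of stratified, purposive sampling: at least
# `per_score` segments per PLATO score point, preferring segments that
# introduce a not-yet-sampled teacher or topic ("maximize variation").
def sample_segments(segments, per_score=5, seed=0):
    rng = random.Random(seed)
    chosen, seen_teachers, seen_topics = [], set(), set()
    for score in (1, 2, 3, 4):
        pool = [s for s in segments if s["score"] == score]
        rng.shuffle(pool)
        # Stable sort keeps the shuffle as a random tie-breaker; segments
        # from unseen teachers/topics sort first.
        pool.sort(key=lambda s: (s["teacher"] in seen_teachers,
                                 s["topic"] in seen_topics))
        for seg in pool[:per_score]:
            chosen.append(seg)
            seen_teachers.add(seg["teacher"])
            seen_topics.add(seg["topic"])
    return chosen
```

In the study itself, sampling continued past this minimum until score points felt saturated; the fixed `per_score` cap here is a simplification.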
I am not making claims about the degree to which the “cases” of specific practices are in fact representative, in any statistical sense, of the population of instructional segments. Instead, these video cases are theoretically interesting illustrations of practice, culled in a systematic way from a large dataset. To illustrate the range of practices, I draw on the method of “vignette analysis” in which specific short descriptions are intended to illustrate specific features of the activities and interactions (Erickson, 1985; LeCompte & Schensul, 1999; Van Maanen, 1995). Though these vignettes are brief, they are designed to authentically represent the broader context from which they are drawn, and I characterize teacher and student speech exactly as it was transcribed from the videos.

LIMITATIONS

This was a volunteer sample of only 103 teachers. Although the MET reports highlight the representativeness of the sample by focusing on characteristics like years of experience, gender, race, and value-added scores, there is no way to know if the teaching practices exhibited by these teachers are in fact representative of elementary school teachers in this district. The fact that teachers volunteered to participate in this study might suggest they are more confident in their teaching practices than nonvolunteers. Moreover, it is not clear to what degree these findings would generalize to teachers outside this district. As noted, there were many curricular mandates and instructional structures in place in this district that, while not the focus of this analysis, undoubtedly influenced teaching practices and student achievement gains. That said, many districts use test-based accountability systems for evaluating teachers, and many state assessments are entirely multiple choice and focused on performing mathematical procedures (Polikoff & Porter, 2012), suggesting these findings have broader relevance.
There are multiple educational outcomes, including students’ social and emotional outcomes, which are important indicators of a teacher’s effectiveness. In fact, raters consistently noted that these classrooms featured joyful students, few behavioral disruptions, and largely positive and warm interactions among teachers and students. These are indeed critical measures of these teachers’ impact on students. However, in this study, as in many teacher evaluation systems, effectiveness is defined by student performance on standardized assessments. Just as VAMs are inherently limited and incomplete measures of effective teaching, so too are the four teaching moves analyzed here limited and incomplete measures of effective teaching. These practices by no means cover the full range of instructional moves employed by high-quality teachers. In fact, the entire PLATO rubric includes 13 elements that research suggests are helpful in supporting student learning. The practices selected may not have been those that either mattered most for student achievement gains or best reflected high-quality teaching. However, the focus on a few instructional elements allowed for more extensive analysis of how teachers used these potentially high-leverage practices in their classrooms. It is also uncertain to what extent these 103 teachers demonstrated these profiles of practice across the school year or when they were not being videotaped. Teachers may employ different practices when teaching fractions than when teaching geometry. Although generalizability studies of PLATO (Cor, 2011) suggest that three lessons represent a stable estimate of a teaching practice, it is not entirely clear that teachers would have used similar practices when teaching different material, when video cameras were not in their classrooms, or if they were observed over a longer period of time.
It is beyond the scope of this study to investigate in what ways teaching might change when teachers’ instruction was not being captured for research purposes. It is also noteworthy that all of the lessons were videotaped from March to May, and mid-May was when the state assessments were administered. This likely shaped the extent to which teachers focused on test preparation in the lessons recorded and may be particularly noteworthy given that VAMs, the metric for effective teaching, are based entirely on student performance on such assessments (Floden, 2001; Valli et al., 2012). Finally, these data do not allow claims about the reasons why teachers use the practices they do, or specific assertions about the ways in which schools, students, or curricular materials act as mediating variables in the relationship between observation protocol scores and VAMs. It is also unclear to what degree instructional supports or teacher content knowledge influenced the differences in practice enactment, though these are important areas for future research. Despite these limitations, the following findings provide a set of descriptive analyses, both quantitative and qualitative, to enhance our understanding of the relationship between measures of teaching practices and teacher value-added.

FINDINGS

STUDENT OUTCOMES ON DIFFERENT ASSESSMENTS

Assessment Analyses

Figures 1 and 2 represent the SEC data comparing the cognitive demand and content coverage of the two math assessments used in this study: the high-stakes assessment (labeled “State Test”) and the BAM administered by the Gates Foundation. The analyses of the assessments in math reveal differences in both the topics covered and the cognitive demand of the items on the two tests. As one might expect with a longer, more extensive assessment, the state math test covers more topics, with a particular focus on number sense, operations, measurement, and geometric concepts.
The vast majority of items (98%) asked students to memorize facts, definitions, or formulae, or to perform procedures.

Figure 1. Cognitive Demand of Math Assessments
Figure 2. Topic Coverage on Math Assessments

The BAM involves a series of questions around one mathematical situation, such as maximizing the area and perimeter of an imagined garden based on various constraints. As a result of this different format, the test covers fewer mathematical topics, but in greater depth. The version of the BAM administered to fourth-grade students in the MET study focused on operations, data displays, geometric concepts, and number sense. Though there is clearly some overlap in content coverage on the two assessments, there are also notable differences. Although the majority of the BAM asked students to perform procedures or recall facts or formulae (67%), about a third of the test items targeted “higher order” thinking skills including demonstrating understanding of mathematical ideas and making conjectures, generalizations, or proofs. The two tests thus differ in important ways in terms of the mathematical content students are expected to know and the ways in which they are asked to demonstrate that knowledge. The correlation among teacher value-added estimates across student outcomes indicates the stability of these assessments of “teacher effects” based on the student achievement measure. In this subsample of MET teachers, the correlation between teacher value-added on the high-stakes state math assessment and the low-stakes BAM was relatively low (r = 0.39). Only 15.7% of the variability on the state math test can be accounted for by knowing a teacher’s value-added on the BAM. The low within-subject cross-test correlation suggests that the two value-added measures are not measuring the same underlying construct of “teacher quality in mathematics,” and clearly demonstrates the importance of how we operationalize learning in assessing effective teaching.
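As a reminder of the arithmetic behind the variance-explained figure: the share of variability in one value-added measure accounted for by the other is the squared correlation, r². With the rounded r = 0.39 this comes to about 15.2%; the 15.7% reported above presumably reflects the unrounded correlation.

```python
# Shared variance between two value-added measures is the squared
# correlation coefficient. Using the rounded r reported in the text:
r = 0.39
variance_explained = r ** 2  # roughly 0.15, i.e., about 15% of variability
```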
Relationships Between Measures of Teaching Practice and Teacher Value-Added

Results from logistic regressions suggest differences by assessment type in the relationship between scores on teaching practices and teacher value-added measures. There are more systematic relationships between measures of teaching practice and value-added quintile when the state test is used to compute value-added. When value-added is computed using the state math test, scores on modeling and procedural strategy instruction differentiate teachers at the top of the value-added distribution from those at the bottom (see Model 1 in Table 2). A teacher with a one standard deviation higher score in procedural strategy instruction has odds 6.38 times higher of being in the top rather than the bottom quintile of value-added based on the state math test (p < .01). A teacher with a one standard deviation higher score in modeling has odds 3.10 times higher of being in the top rather than the bottom quintile of value-added based on the state math test (p < .05).

Table 2. Logistic Regressions: State Math Test

The procedural strategy instruction finding seems to be driven by particularly low scores among teachers in the bottom value-added quintile (see Model 2 of Table 2). A teacher with one standard deviation higher than average procedural strategy instruction scores has odds 42% as great of being in the bottom quintile as a teacher with average strategy scores (p < .05). In other words, teachers with higher procedural strategy instruction scores are, on average, significantly less likely to be in the lowest value-added quintile. The modeling findings seem to be driven by teachers at the top of the value-added distribution (see Model 3 of Table 2). A teacher with a one standard deviation higher score in modeling has odds 1.99 times higher of being in the top value-added quintile than anywhere else in the distribution (p < .05).
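For readers unfamiliar with the metric, an odds ratio in a logistic regression with standardized predictors is exp(β) for a one-standard-deviation increase in the practice score. The sketch below back-derives coefficients from the odds ratios reported above purely for illustration; the actual model estimates are not reproduced here.

```python
import math

def odds_ratio(beta):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)

# Coefficients reverse-engineered from the reported odds ratios
# (illustrative only, not the study's fitted parameters).
beta_procedural = math.log(6.38)  # OR = 6.38: higher odds of top quintile
beta_modeling = math.log(3.10)    # OR = 3.10
beta_bottom = math.log(0.42)      # OR = 0.42: lower odds of bottom quintile
```

An odds ratio above 1 raises the odds of the outcome and one below 1 lowers them, which is why the 0.42 figure corresponds to teachers being less likely to land in the bottom quintile.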
In stark contrast, when using value-added quintiles based on the low-stakes supplemental assessment, which targets conceptual reasoning and open-ended problem solving, there are no statistically significant relationships between scores on teaching practices and student achievement gains (see Table 3). These differential relationships suggest that modeling and procedural strategy instruction are candidates for effective teaching practices in terms of their impact on student achievement on the high-stakes state exam, but not on the BAM. Classroom discourse and conceptual strategy instruction do not have discernible relationships with student achievement gains on either assessment.

Table 3. Logistic Regressions: Supplemental Math Test—BAM

MEASURES OF INSTRUCTIONAL PRACTICE

In all the PLATO scales used in this study, a score of 1 indicates there was no or almost no evidence of the focal practice. A score of 2 indicates “limited evidence” of the practice. A score of 3 specifies “evidence with some weaknesses.” Instructional segments that score 4 provide “consistent, strong evidence of that practice.” On average, most of the teachers score in the 1–2 range (out of 4) on all four of these practices, but the distributions of scores do differ by practice. There are more normal distributions for the two practices that have a relationship with the state math test, i.e., procedural strategy instruction and modeling, and more positively skewed distributions for conceptual strategy instruction and orchestrating discussion. There is very little instruction scoring at the highest level, a 4, for any of the practices. Figures 3–6 illustrate the distribution of PLATO scores on the four measures of teaching practices. There was almost no conceptual strategy instruction (Figure 3); far and away the modal score was a 1, and the distribution of scores is very positively skewed (skewness = 1.42).
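The skewness statistics reported for these distributions can be computed with the standard Fisher-Pearson formula. The score distribution below is made up to mimic a scale piled up at the low end; only the formula, not the data, comes from the text.

```python
def skewness(xs):
    """Fisher-Pearson coefficient of skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical PLATO-style scores clustered at the low end of the 1-4
# scale, like the conceptual strategy distribution: skewness is positive.
scores = [1] * 70 + [2] * 20 + [3] * 9 + [4] * 1
```

A perfectly symmetric distribution has skewness 0; a long right tail, like scores bunched at 1 with a few 3s and 4s, yields a positive value.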
Approximately 20% of segments scored at the 2 level, and approximately 10% of segments scored at the 3 level. Less than 1% of math lessons scored at the highest level.^{8} Figure 4 shows the procedural strategy score distribution. The modal score is a 2, and the distribution of procedural strategy scores is more normal (skewness = 0.191). Many instructional segments score at the 2 level for procedural strategy instruction, but a fair number score at the 1 and 3 level as well (i.e., 27.08% and 28.11%, respectively). Only 3.41% of segments scored at the 4, or highest, level of procedural strategy instruction. The modal score for teacher modeling in segments was a 3 (see Figure 5), the highest modal score of the PLATO practices, and the distribution is fairly normal (skewness = 0.149). Despite the fact that many of the teachers were modeling at relatively high levels, only 9% of instructional segments scored at the 4, or highest, level. In terms of classroom discussion (Figure 6), more than 75% of segments scored at the 1 or 2 level (skewness = 0.429). Only 13% scored at the 3 level, and only two of the almost 600 segments in math scored at the 4 level.

Figure 3. Conceptual Strategy Instruction Score Distribution
Figure 4. Procedural Strategy Instruction Score Distribution
Figure 5. Modeling Score Distribution
Figure 6. Orchestrating Discussion Score Distribution

To provide a clearer sense of what these descriptive statistics mean in terms of the kind of instruction available to students, I draw on the qualitative data and vignettes of classroom practice. In some cases, these vignettes are tied to various PLATO score points in order to make a broader point about the measurement of potentially high-leverage practices.

“I do, you do”: Strategies for Replication and Standardized Test Performance

Almost half of the math segments, 49.66%, were coded as including some procedural strategy instruction but not any conceptual strategy instruction.
Modeling and procedural strategy instruction typically went hand in hand (r = 0.7) (for more information on the correlation among different practices, see Table 4). In lessons that raters coded as purely procedural, students were taught how to approach specific kinds of problems by following an algorithm or a series of steps. Teachers modeled, almost always with the help of students, solving one or more problems by plugging numbers into a formula with little discussion of why they were engaging in specific steps.

Table 4. Correlation Among Teaching Practices

In one example, Miss Jacey^{9} instructed students how to find a percentage from a fraction using a calculator in the following way: “You type in numerator, press divide button, punch in the bottom number, and then you use only the first two numbers of the answer as your percent.” Miss Tyrell reviewed the formula for finding the area and perimeter of rectangles using the respective formulae Area = Length * Width and Perimeter = 2 * Length + 2 * Width, before solving two area and perimeter problems on the board while eliciting student participation. She told students, “I am doing a few problems up here, so you will know what to do when you are at your desks. It’s real simple. You’ll be adding up for your perimeter and multiplying for your area, just like I do.”^{10} Many of the strategies taught and modeled were presented as helpful tools for performance on the state multiple-choice test. The motivation for using the strategies was, as Miss Williamson said, “to help you answer questions faster on the [state test].” Mr. Johnson framed a lesson: “We are going to review finding a common denominator ‘cause we all know that will be on the [state test].”^{11} Another teacher, Miss Green, began a lesson by telling students, Today we’re going to master test-taking strategies, so that we get at least 95% accuracy on standardized tests.
We know that the tests are timed, so we need to be able to figure out what the problem is saying quickly. And you should always use your calculator to help you figure it out quickly. Material was often presented in multiple-choice format, an approximation of the state test, which was entirely comprised of multiple-choice questions. For example, Mr. Terry presented the lesson as “understanding how a word problem operates.” He provided strategy instruction about how to “attack problems quickly for the [state test]” by isolating numbers, identifying key words for operations, and selecting the right answer choice. Mr. Terry modeled the “problem solving process” using the problem, “Jaylan has 12 stickers. Daphne has eight. How many in all do Jaylan and Daphne have? (a) 8, (b) 12, (c) 4, (d) 20.” Students were told several times to “read the whole problem” but were never told why this would be helpful in understanding or solving the problem. Mr. Terry had students come up to the board and underline the “eight” and “12” in the problem, “the data, the numbers in your problem” and circle “the clue words,” “in all.” He emphasized that the “clue words tell us what you need to do to solve the problem.” After students noted their clue words suggested addition, he said, “Exactly! Our strategy is addition, so we do 8 + 12, real quick. Then we scan our answer choices. Is 20 there? Yes it is! So we go ahead and circle that one.” In this lesson, problem solving was modeled as a quick series of steps for solving multiple-choice problems on standardized assessments rather than as a tool for sense making (Schoenfeld, 1992).

Conceptual Strategies, Procedures With Connections, and Cognitive Modeling

The kind of instruction described above is quite distinct from the teaching practices used in lessons that featured high-scoring conceptual strategy instruction (i.e., > 3).
In these lessons, teachers highlighted the meaning of processes and helped students determine whether their answers made sense or were reasonable, given the context of problems. For example, Miss Marcus taught students how to use relative size of numbers to “ballpark” or estimate the product of two numbers: 73 * 11. She discussed why it was helpful to mentally multiply 70 by 10 to check one’s solution using the multiplication algorithm. In another lesson, Miss Marcus taught students how to write a fraction as a decimal, in this case 7/10 as .7, underscoring the meaning of place value. As students modeled converting several fractions to decimals, Miss Marcus asked several key conceptual questions: “How do I know that .7 represents seven tenths, since I don’t see a 10 anywhere?”; “Why is this easier to do when you have a 10 in the denominator?”; “Why does the decimal point come before the seven?” Teachers focused on conceptual understanding coupled with enactment of procedures. In a lesson about adding numbers with decimals, Miss Owens had students model solving problems using the addition algorithm, but also pushed students to “explain to the class what [they] [we]re doing while [they] [we]re doing it and why they were doing it.” While a student modeled solving a perimeter problem on the board, Miss Owens asked him, “Can you tell us about your sides—why did you put 9 there? 
How did you know that went there?” Similarly, in a lesson on adding fractions with unlike denominators, Miss Sullivan taught a procedure for finding the sum of fractions with different denominators: “To find a common denominator, we go through the multiples of 2 and 7 until we come to one that they have in common.” She also pushed students to explain their reasoning throughout the modeling: “Why do we count by 2?”; “Where did the numbers 7, 14, 21, 28 come from?”; “How do we know when to stop counting by 7?”; “What does the word ‘common’ mean?”; “Why do we need to find a common denominator?” In all these cases, the foci of the modeling and strategy instruction were both procedural fluency and conceptual understanding. These teachers helped explain why mathematical procedures worked and pushed students to justify their mathematical solution methods.

Orchestrating Discussion: Looking for the Right Answer

Many of the segments that scored at the 1 level for orchestrating discussion included students working independently, or teacher lecture, which in this district often involved showing an instructional video or reading slides from a PowerPoint. In one lesson that demonstrates how little talk there was in math classrooms that scored at the 1 level, Ms. Walters asked students to line up and show their written work to her one at a time. She evaluated each student’s answer individually—“Good,” “Nope. Try again”—without a single student speaking a word. The general talk pattern in lessons that scored at the 2 level was what has been termed “IRE” or initiate, respond, evaluate (Cazden, 2001; Mehan, 1979). The teacher would ask a question, typically one where there was a “right” answer expected,^{12} students would respond briefly, and the teacher would quickly evaluate their responses (e.g., “Good!” “Nice work!” “OK”).
Students were frequently asked to evaluate one another’s responses by clapping, snapping their fingers, or signaling a “thumbs up” or down depending on whether they agreed or disagreed with the speaker. Several teachers had students chorally chant “Bingo!” when a student got the correct answer, and “Try again!” when the student shared an incorrect answer. Students frequently voted on whether a fellow classmate had the correct or incorrect answer. This focus on correctness and finding the right answer was also a common feature of class talk that scored at the 2 level. Students gave answers to problems or questions, which the teacher evaluated, largely without explanation of why answers were correct or not. Incorrect answers were often met with teacher responses such as, “Can we try that one again?”; “Almost”; “Come on now, start thinking before you come out with answers”; or “Are you sure?” One teacher, Miss Jones, frequently told students, “I’m waiting for the right answer.” In these lessons, teachers largely did not press students to explain why or how they came to an incorrect answer.

Higher-Scoring Talk

In math lessons that scored at the 3 level for orchestrating discourse, there were longer discussions and teachers more frequently pushed for student responses or contributions with questions such as, “What do I need to do here?”; “What did you do next?”; or “Where do we put that number?” Rather than press students to explain why or how their solution worked or explain how they made sense of mathematical concepts, the kind of “conceptual press” highlighted in the literature (cf. Kazemi & Stipek, 2001), teachers pressed students about the procedural aspects of their work. I term this “procedural press.” For example, when a student, Tre, shared his solution to finding the perimeter of a rectangle that measured 8 inches by 5 inches, he wrote on the board, 8 * 5 * 8 * 5 = 1600 cm.
The teacher, Miss Morris, then asked him in a series of turns, “What is the formula for finding the perimeter of a flat surface such as a square or a rectangle?”; “What did you do instead?”; and “Where was your mistake?” Miss Morris then asked the class to describe what the student should do to correct his solution method. This discussion lasted for almost 20 minutes, and there were numerous instances of procedural press. At no point, however, did Miss Morris ask Tre or any of the other students why they chose specific solution methods, a central feature of conceptual press. Only two math segments out of 588 scored at the 4 or highest level. While it is impossible to highlight patterns in such a small sample, a common feature of these two segments was that both teachers picked up on student ideas with extended instructional explanations. What seemed to differentiate these lessons from those that scored at the 3 level was that the teachers did, in the course of discussion, highlight potentially important conceptual information rather than press students on the procedural steps employed to “find an answer.” Both teachers did this by noticing student misunderstandings and providing elaborated, conceptually oriented explanations to clarify central instructional content.

DISCUSSION AND IMPLICATIONS

This study highlights the importance of careful analysis of the measures used to assess both teaching and learning in building a more robust understanding of effective teaching. Specifically, these data demonstrate differences in teacher value-added designation depending on the assessment used to measure student learning, and different relationships between teaching practices and teacher value-added on different student outcome measures. Effective teaching is not a uniform construct, even when narrowly defined by an individual’s impact on student performance on standardized assessments.
There is a significant positive relationship between teacher modeling and procedural strategy instruction and teacher value-added on the state math assessment. However, there is no relationship between those practices and teacher value-added on the BAM, a test that covered different topics and required different levels of cognitive demand. There are several possible explanations for these differential effects depending on the assessment: the instructional sensitivity of the assessment, the degree to which the four practices assessed here are “high leverage” across learning outcomes, the distributions of the teaching variables, or district alignment between teaching practices and the high-stakes state assessment. Each is discussed, in turn, below. Different assessments are differentially sensitive to classroom instruction, and the lack of relationship between the BAM and these four teaching practices may result from the inability of the test to detect the results of high-quality use of these teaching practices. Alternatively, it may be that the BAM is in fact instructionally sensitive, but not to these four instructional practices. Put another way, it is possible that no matter how these four practices were enacted in classrooms, they would not be particularly helpful in raising achievement on the BAM. These are possible, though not very plausible, explanations given that the BAM explicitly focuses on conceptual understanding, and conceptual strategy instruction and conceptual press during discussions have been shown to be associated with higher-order understanding of content in the research on teaching in mathematics (Boaler & Staples, 2008; Good & Grouws, 1979; Webb et al., 2009).

RANGE OF TEACHING PRACTICE

The lack of relationship between the BAM and the four practices may also be an artifact of the sample and the constrained range of some of the teaching practices.
Because statistical relationships are predicated on adequate variance, the lack of a relationship with value-added may be an artifact of limited variation in the enactment of the practice in the sample. The mean scores for conceptual strategy instruction and orchestrating discourse in mathematics instruction were particularly low, and very few teachers demonstrated consistent, high-quality use of these teaching practices. It might be that teachers need to provide more consistent, high-quality conceptual instruction in order to actually detect leverage points with teacher value-added. There was also very little high-quality discourse in these math classrooms, and only two segments out of 588 scored at the highest level. Many lessons featured teacher lecture or students responding chorally to teacher questions. When students were allowed to talk, teachers often responded to their contributions with a heavy focus on evaluation rather than press for justification or elaboration. The literature suggests that “evaluative talk moves” are less helpful in developing students’ own explanations (Michaels & O’Connor, 2012; Nystrand & Gamoran, 1991; Webb et al., 2009; Wood, Cobb, & Yackel, 1991). Incorrect answers or misunderstandings were rarely pursued as valuable potential learning opportunities in the manner suggested in the literature (Chazan & Ball, 1999; Kazemi & Stipek, 2001; Rohrkemper & Corno, 1988). There may be thresholds in the PLATO scales such that teachers would need to enact the practices at a sufficiently high level to get traction on achievement. For example, teachers might need to consistently provide conceptual strategy instruction or orchestrate discussions at the 3 level to support student achievement on the kinds of tasks included on the BAM.
More research is needed in a sample in which one would anticipate seeing very rigorous classroom discussions or more consistently conceptual instruction to actually discern the relationships between the highest-quality use of these practices and teacher value-added.^{13} While this study does not focus on identifying thresholds in the practice scales, future research could explore this issue more systematically in a sample with more variation in practice scores.

TEACHING TO THE TEST

Evidence from this study suggests that the scales for procedural strategy instruction and modeling are detecting meaningful differences among teachers at the bottom and the top of the value-added distribution, when value-added is calculated using the state test. Thus we might conclude, based only on the quantitative relationships, that these are indeed high-leverage practices. Indeed they are successful in predicting student achievement gains, and stem from an observation protocol designed to capture high-quality, research-based teaching practices. That said, the qualitative descriptions make clear an aspect of modeling and procedural strategy instruction invisible in these quantitative findings: The modeling and strategy instruction featured in many of these classrooms was explicitly oriented toward success on the state test. The teachers taught test-taking strategies and modeled how to eliminate “silly” answers listed in multiple-choice format. They presented potentially complex information in a series of clear-cut, linear steps, acronyms, and mnemonics for the express purpose of helping students “answer questions quickly” rather than helping them make sense of academic material. In other words, they were teaching to the test. This helps explain why there are relationships between scores on these teaching practices and teacher value-added only when value-added is computed based on the state assessment but not based on the supplemental assessment administered by the Gates Foundation, the BAM.
Teaching to the test is frequently discussed in terms of topic coverage, but it likely also involves specific teaching practices. We are just beginning to understand the ways in which different assessments may demand different skills and hence different teaching practices. The material that was modeled, the strategies that were taught, and the extent to which they aligned with material covered on a given assessment likely influenced the relationship between scores on these teaching practices and student achievement gains. The SEC analyses make clear that the vast majority of this test required lower levels of cognitive demand (i.e., memorizing and recalling information, and performing procedures). Math teachers may not teach conceptual strategies because conceptual reasoning does not seem to be targeted on the state math test. If conceptual instruction is not reflected in student test scores, which in turn feed into teacher evaluation systems, then teachers are less likely to prioritize conceptual instruction. As Good, Biddle, and Brophy (1975) warned many years ago, "teacher behavior that leads to the accomplishment of one set of goals may impede the attainment of others" (p. 63). Aligning instruction to a heavily procedural state math test likely constrained the opportunities students had to engage in more open, conceptually rigorous mathematical problem solving. Thus, effective teaching for the high-stakes state test may in fact look quite different from effective teaching for a test like the BAM. When assessments are used for high-stakes purposes such as personnel decisions, districts can assume they have created an incentive system privileging these outcomes, likely making them the focus of daily instruction (Koretz, 2002; Shepard, Hannaway, & Baker, 2009; Stecher, Chun, & Barron, 2004; Valli et al., 2012).
This may be particularly true in this district, selected in part to maximize variation in student outcomes; the state assessment consisted entirely of multiple-choice questions, while the supplemental assessments required student-generated responses. As Lockwood et al. (2007) suggest, the low correlations among teacher value-added coefficients based on different student outcomes established in the literature would likely be even lower if some of those outcomes were open-ended measures on which students construct their own answers without the guidance of multiple-choice options. The influence of standardized test performance on teacher practice is raised not as a condemnation but as a caution. Nearly 30 years of standards-based reform provide ample proof that in education, as in other fields, what is measured matters (e.g., Espeland & Stevens, 1998; Mehta, 2013; Porter, 1995). It would be naïve for policymakers to assume that their definitions and measures will ever be inert forms, there to assess but not interact. Rather, we must acknowledge the very real possibility that when we start defining high-leverage practices based on the degree to which they predict teacher value-added, we may inadvertently end up promoting the kind of instruction reflected in the test-oriented vignettes. If standardized test performance is always the "gold standard" by which other measures are validated, we will likely end up with a system even more tightly aligned to performance on those tests.

MOVING BEYOND TEACHING TO THE TEST

The use of the practices to promote success on the high-stakes state test does not discredit the characterization of modeling and strategy instruction as high-leverage practices or features of good teaching. Indeed, these data suggest that modeling and strategy instruction are helpful for achieving outcomes that are quite consequential in the lives of students and teachers.
However, these teachers largely enacted the practices in ways that diverge from the conceptual basis for the scales: instructional scaffolds that make expert thinking visible in order to support authentic, conceptually rigorous learning goals. Ball and Forzani (2009) similarly define high-leverage practices as those that support students' conceptual development. Thus, modeling and strategy instruction as they were generally enacted in this study might not fit this definition of "high-leverage practices." The fact that many teachers used these practices in pursuit of achievement on the state test does not mean the practices could not serve broader goals. Teachers could use them to promote more authentic aims beyond standardized test performance. Indeed, Collins, Brown, and Newman's (1989) theory of cognitive apprenticeship, one of the central theoretical underpinnings of this study, provides examples of how these practices can be used to support what Newmann, Marks, and Gamoran (1996) term "authentic achievement." Because most states have adopted the Common Core State Standards (CCSS), and different consortia are developing assessments aligned with those standards, this is an important time to focus on our desired learning goals for students and carefully consider how accountability systems do and do not support those goals. Despite these general cautions about test-based accountability systems, some might interpret these findings as evidence of a tightly aligned instructional system working as it should: teachers orienting their instructional practices to the incentive system in place. The problem, according to this interpretation, is the procedural test. Fix the test so that it targets cognitively demanding outcomes, evaluate teachers based on that test, and more cognitively demanding instruction will naturally follow. To a limited extent, this may prove true.
How learning is defined and assessed on the new assessments will likely influence the kind of teaching we can expect to see in classrooms in the years to come. This logic, however, is flawed on multiple counts. First, even as states rush to develop assessments that align with the ambitious new CCSS, we know little about which of the numerous standards will be targeted on which assessments. For instance, the first CCSS standard for mathematical practice asserts that students should be able to "make sense of problems and persevere in solving them." How will a standardized assessment measure perseverance? If these standards are not reflected in paper-and-pencil assessments, to what degree will teachers be motivated to teach to them? We also know little about the specific teaching practices needed to support students in achieving these new outcomes. While we have begun to develop a literature on the relationship between teaching practices and teacher value-added on existing tests, this study suggests that we should expect those relationships to change as we change the student outcome measures. Second, even if we had a clear sense of exactly which standards would be captured on the assessments and what teachers needed to do to meet those standards, there is no reason to assume that teachers will be able to transform their instructional repertoires in the absence of intensive and sustained support. Teaching to the test may be easier when assessments feature primarily procedural skills, and teachers likely need different types of content knowledge and pedagogical content knowledge to support students in meeting more ambitious outcomes. We have little reason to expect that they can develop such knowledge without outside support. For these reasons, the test alone will not drive the kind of instructional improvement and student learning to which we aspire.
Expecting the tests to ameliorate the heavily procedural instruction demonstrated in this study is expecting the tail to wag the dog. Indeed, the most pressing implication of this study is the need for instructional capacity building. Teachers need support in providing quality instruction, including orchestrating rigorous classroom discussions, providing students with flexible, adaptive conceptual strategies, and engaging in cognitive modeling that makes expert thinking visible. If we consider these practices to be important components of quality teaching, then the distribution of scores and much of the descriptive evidence provided in the vignettes suggest that teaching would benefit from additional support.

THRESHOLDS OF PRACTICE AND STANDARDIZING QUALITY

An important part of the work may be to identify both conceptual and empirical thresholds in the teaching practices captured in standardized protocols such as PLATO. Quality of enactment is likely not linear in the way that a scale might suggest. Instrument developers, districts, measurement experts, and researchers will need to work in tandem to discuss what level of enactment is "good enough" for different purposes, and to consider how to support teachers in achieving those levels of practice. Observation protocols and value-added measures are inherently different. Value-added models are norm-referenced tools used to compare teachers to one another; by definition, some teachers have to be at the top and some at the bottom. In contrast, observation protocols like PLATO were developed as criterion-referenced measures. Theoretically, all teachers in a study could receive the highest or lowest score on an observation protocol.
In an ideal district where teaching quality steadily improved and became less variable, classroom observations would capture this improvement across all teachers, while the coefficients resulting from value-added models would still generate a group of teachers defined as relatively ineffective. When we begin standardizing these practices, we run the risk of losing the original meaning of the scale (Espeland & Sauder, 2007; Timmermans & Epstein, 2010).^{14} When observation protocols are used for high-stakes summative evaluation purposes, they functionally become normative standards, where "goodness" is a relative term. For example, a teacher in this study could be two standard deviations above the sample mean for conceptual strategy instruction and still have a teacher-level average of only 2.2. While this might place the teacher comparatively high relative to the population of teachers in the study, the raw score would still be below the midpoint of the four-point PLATO scale. Thus, we may be better served by continuing to use these scales as criterion-referenced tools, where "good enough" is based on a conceptual point on the scale rather than on where a teacher is situated vis-à-vis his or her colleagues. Many important questions remain about how to isolate, measure, and promote quality teaching. This study suggests the importance of carefully analyzing student assessments in order to understand the meaning of successful teaching. It also suggests the importance of sampling teachers who reflect the full range of practice in order to truly understand the degree to which teaching moves are high leverage. Finally, as Bell and colleagues (2012) note, a key element of research on teaching quality concerns values. If our collective definition of "effectiveness" or "success" so heavily emphasizes raising achievement on certain standardized assessments, we risk losing sight of other important elements of what happens in classrooms.

Notes

1.
There continues to be debate about nuanced, definitional differences between the terms "high-leverage practice" and "core practice." For the purpose of consistency, and because I am focused on the leverage of practices for different kinds of student outcomes, I use the term "high-leverage practice" throughout this article.
2. VAMs isolate a teacher effect coefficient by controlling for student prior achievement and characteristics—variables potentially confounded with such an effect. The teacher effect indicates how a teacher's students performed on an assessment vis-à-vis some comparison group, generally the students of an average teacher. Many argue that, while imperfect, VAMs actually differentiate among teachers, unlike other measures of teaching in which the vast majority of teachers receive the highest ratings (Glazerman et al., 2010; Harris, 2009). Despite their differentiating potential, there have been numerous critiques of the precision and stability of VAMs as well as of the assumptions underlying the models (Baker et al., 2010; Reardon & Raudenbush, 2009; Rothstein, 2009).
3. Teachers submitted approximately four math lessons, but prior generalizability studies of PLATO (Cor, 2011) suggest that only three lessons per teacher are needed to obtain a stable estimate of teaching practice. As detailed below, PLATO's scoring procedures are such that each lesson includes multiple 15-minute segments. Thus, the final sample included 588 scored instructional segments.
4. Confidentiality agreements between the Gates Foundation and the districts participating in the MET study limit the reporting of more specific demographic information about the teachers or students.
5. While there are several established math-specific protocols, including MQI, PLATO focuses on a distinct set of practices not assessed in any of the existing observation systems.
6.
Although PLATO was used in the broader MET study, all the videos used in this study were rewatched and recoded, both because math lessons were not originally coded using PLATO and to allow for coding of additional instructional features not included in the MET data. For example, all instructional segments were also coded for the specific content taught (e.g., adding fractions or understanding word problems) and the curriculum used (if visible or mentioned verbally). Raters noted the individual strategies referenced, prompted, or instructed; the time stamp of the strategy instruction; whether there were content errors or inaccuracies in the presentation of the strategy; and a description of the error, if relevant. In addition to a score for modeling, raters detailed the nature of the task the teacher was modeling, a time stamp of the modeling, and whether the process being modeled was conceptual, procedural, or both. To add detail to the quantitative codes around orchestrating discussion, raters listed the topic (e.g., students' solution to a math problem), duration, and structure of discourse (e.g., whole group, small group with teacher present), as well as specific examples of teacher uptake and student contributions to discussions.
7. OLS models were also run but revealed few linear relationships between scores on teaching practices and teacher value-added. The one exception was that procedural strategy instruction was marginally significantly predictive of teacher value-added on the state math assessment (p < .1).
8. In all the figures, the total number of instructional segments equals 588.
9. All names used are pseudonyms. Teachers with different pseudonyms are in fact different teachers.
10. It is important to note that this focus on rote instruction should not be attributed to the curricular materials. While not the focus of this study, I have engaged in some preliminary analysis of the textbook lessons referenced by teachers.
The math text includes many "open" problems and highlights conceptual understanding of mathematical concepts. For example, a lesson on area and perimeter suggests that students "explore the way the relationship between area and perimeter is different for a hexagon than it is for a square." The enacted curriculum, which is what was visible and analyzed in this study, seems to have differed significantly from the written curriculum. Unfortunately, the reduction of the cognitive complexity of curricular material is well established in the literature (Brown, 2012; Remillard, 2005; Stein, Grover, & Henningsen, 1996). Future studies could examine the differences between the written and enacted curriculum among these teachers in a more focused and systematic way.
11. This may have been an artifact of the timing of observations: March through May. As noted, I cannot be certain of the extent to which this heavy emphasis on test preparation would have been as pronounced earlier in the school year.
12. These types of "known answer" questions have also been termed "inauthentic" (Nystrand & Gamoran, 1991).
13. Earlier research using PLATO in middle school ELA classrooms in New York City did find some relationship between the quantity and quality of classroom discourse and teacher value-added designation when there was a more normal distribution of classroom discourse scores and a different measure of student achievement (Grossman & Cohen, 2011). The different assessment used in New York City is crucial given this study's focus on student achievement measures.
14. I acknowledge that I too standardized practices in the quantitative analyses. However, I use the vignettes to show the potential risks in this methodology. Teachers with higher standardized scores on the PLATO practices could be providing instruction quite different from the "high quality" practice reflected in high nonstandardized scores.
The direction and statistical significance of relationships between PLATO scores and teacher value-added scores are consistent when models are run with nonstandardized scores.

Acknowledgment

Thanks to Pam Grossman, Hilda Borko, Rich Shavelson, Susanna Loeb, Matt Kloser, and Judy Hicks for their helpful comments and suggestions on earlier versions of this article, as well as to Lyn Corno and two anonymous reviewers for their thoughtful feedback. I also thank Kerri Kerr and Steve Cantrell at the Bill and Melinda Gates Foundation for early access to these data. Financial support for this study was provided through the National Academy of Education and Spencer Foundation Dissertation Fellowship and the Stanford University Gerald Lieberman Fellowship.

References

Aaronson, D., Barrow, L., & Sander, W. (2007). Teachers and student achievement in the Chicago public high schools. Journal of Labor Economics, 25(1), 95–135.
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., . . . Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers (Vol. 278). Washington, DC: Economic Policy Institute.
Ball, D. L., & Forzani, F. (2009). The work of teaching and the challenge for teacher education. Journal of Teacher Education, 60(5), 497–511.
Ball, D. L., & Rowan, B. (2004). Introduction: Measuring instruction. The Elementary School Journal, 105(1), 3–10.
Ballou, D. (2005). Value-added assessment: Lessons from Tennessee. In R. Lissitz (Ed.), Value added models in education: Theory and applications (pp. 272–297). Maple Grove, MN: JAM Press.
Baroody, A. J., Feil, Y., & Johnson, A. R. (2007). An alternative reconceptualization of procedural and conceptual knowledge. Journal for Research in Mathematics Education, 38(2), 115–131.
Bell, C. A., Gitomer, D. H., McCaffrey, D. F., Hamre, B. K., Pianta, R. C., & Qi, Y. (2012). An argument approach to observation protocol validity.
Educational Assessment, 17(2–3), 62–87.
Berliner, D. C. (1986). In pursuit of the expert pedagogue. Educational Researcher, 15(7), 5–13.
Bill and Melinda Gates Foundation. (2010). Learning about teaching: Initial findings from the Measures of Effective Teaching project. Washington, DC: Gates Foundation.
Boaler, J., & Staples, M. (2008). Creating mathematical futures through an equitable teaching approach: The case of Railside School. Teachers College Record, 110(3), 608–645.
Brophy, J. (1986). Teacher influences on student achievement. American Psychologist, 41(10), 1069.
Brown, M. (2012). Instruction for struggling adolescent readers and the limited influence of curriculum materials (Unpublished doctoral dissertation). Stanford University, Stanford, CA.
Cazden, C. (2001). Classroom discourse: The language of teaching. New York, NY: Heinemann.
CIERA Update. Retrieved March 23, 2009, from http://www.ciera.org/index.html
Charmaz, K. (2006). Constructing grounded theory: A practical guide through qualitative analysis. London, England: Sage.
Chazan, D., & Ball, D. (1999). Beyond being told not to tell. For the Learning of Mathematics, 19(2), 2–10.
Chetty, R., Friedman, J. N., & Rockoff, J. E. (2011). The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood (Working Paper No. 17699). Cambridge, MA: National Bureau of Economic Research.
Cohen, J. (2013). Practices that cross disciplines?: A closer look at instruction in elementary math and English language arts (Unpublished doctoral dissertation). Stanford University, Stanford, CA.
Cohen, J., & Grossman, P. (2011). Of cabbages and kings: Classroom observations and value-added measures. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Collins, A., Brown, J. S., & Holum, A. (1991). Cognitive apprenticeship: Making thinking visible. American Educator, 6(11), 38–46.
Collins, A., Brown, J. S., & Newman, S. E. (1989).
Cognitive apprenticeship: Teaching the crafts of reading, writing, and mathematics. In L. Resnick (Ed.), Knowing, learning, and instruction: Essays in honor of Robert Glaser (pp. 454–493). Hillsdale, NJ: Lawrence Erlbaum.
Cor, K. (2011, April). The measurement properties of the PLATO rubric. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Corcoran, S., Jennings, J. J., & Beveridge, A. (2011). Teacher effectiveness on high- and low-stakes tests. Unpublished manuscript.
Danielson, C. (2007). Enhancing professional practice: A framework for teaching. Alexandria, VA: Association for Supervision and Curriculum Development.
Dunkin, M. J., & Biddle, B. J. (1974). The study of teaching. New York, NY: Holt, Rinehart and Winston.
Erickson, F. (1985). Qualitative methods in research on teaching. In M. C. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 119–161). New York, NY: Macmillan.
Espeland, W. N., & Sauder, M. (2007). Rankings and reactivity: How public measures recreate social worlds. American Journal of Sociology, 113(1), 1–40.
Espeland, W. N., & Stevens, M. L. (1998). Commensuration as a social process. Annual Review of Sociology, 24(1), 313–343.
Floden, R. E. (2001). Research on effects of teaching: A continuing model for research on teaching. In V. Richardson (Ed.), Handbook of research on teaching (4th ed., pp. 3–17). Washington, DC: American Educational Research Association.
Glazerman, S., Loeb, S., Goldhaber, D., Staiger, D., Raudenbush, S., & Whitehurst, G. (2010). Evaluating teachers: The important role of value-added. Washington, DC: Brookings Institution.
Goldenberg, C. (1992). Instructional conversations: Promoting comprehension through discussion. The Reading Teacher, 46(4), 316–326.
Goldenberg, C. (2008). Teaching English language learners: What the research does—and does not—say. American Educator, 33(2), 8–44.
Good, T., Biddle, B., & Brophy, J. (1975). Teachers make a difference.
New York, NY: Holt, Rinehart and Winston.
Good, T., & Grouws, D. (1975). Teacher rapport: Some stability data. Journal of Educational Psychology, 67(2), 179–182.
Good, T., & Grouws, D. (1979). The Missouri mathematics effectiveness project: An experimental study in fourth-grade classrooms. Journal of Educational Psychology, 71(3), 355–362.
Gordon, R. J., Kane, T. J., & Staiger, D. (2006). Identifying effective teachers using performance on the job. Washington, DC: Brookings Institution.
Grossman, P., & Cohen, J. (2011, April). Of cabbages and kings?: Classroom observations and value-added measures. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Grossman, P., Loeb, S., Cohen, J., & Wyckoff, J. (2013). Measure for measure: The relationship between measures of instructional practice in middle school English language arts and teachers' value-added scores. American Journal of Education, 119(3), 445–470.


