ROBERT ROSENTHAL - HARVARD UNIVERSITY

Document Number (FOIA)/ESDN (CREST): CIA-RDP96-00792R000100130006-7
Release Decision: RIFPUB
Original Classification: K
Document Page Count: 54
Document Creation Date: November 4, 2016
Document Release Date: October 28, 1998
Sequence Number: 6
Publication Date: April 2, 1989
Content Type: BRIEF
Robert Rosenthal
Harvard University

My talk today is designed in part both to comfort the afflicted and to afflict the comfortable. The afflicted are those of us who work in the softer, wilder areas of our field--the areas where the results seem ephemeral and unreplicable, and where the r²'s seem always to be approaching zero as a limit. These softer, wilder areas include those of social, personality, clinical, developmental, educational, organizational, and health psychology. They also include parts of psychobiology and cognitive psychology. These softer, wilder areas, however, may not include too much of psychophysics. My message to those of us toiling in these muddy vineyards will be that we are doing better than we might have thought. My message to those of us in any areas in which we feel we have pretty well nailed things down will be that we haven't, and that we could be doing a whole lot better.

How Large Must an Effect Be, To Be Important?

There is a bit of good news-bad news abroad in the land. The good news is that more sophisticated editors, referees, and researchers are becoming aware that reporting the results of a significance test is not a sufficiently enlightening procedure, and that some report of the magnitude of the effect should accompany the p level. The bad news is that we are still not quite sure what to do with such a report of the magnitude of the effect, for example, a correlation coefficient.

There is one bit of training that all psychologists have undergone. From undergraduate days onward we have all been taught that there is only one proper, decent thing to do whenever we see a correlation coefficient--we must square it. For most of the softer, wilder areas of psychology, squaring the correlation coefficient tends to make it go away--vanish into nothingness as it were. That is one of the sources of malaise in the social and behavioral sciences. It is sad and quite unnecessary, as we shall soon see.

The Physicians' Aspirin Study

At a special meeting held on December 18, 1987, it was decided to end prematurely a randomized double-blind experiment on the effects of aspirin on reducing heart attacks (Steering Committee of the Physicians' Health Study Research Group, 1988). The reason for this unusual termination of such an experiment was that it had become so clear that aspirin prevented heart attacks (and deaths from heart attacks) that it would be unethical to continue to give half the physician research subjects a placebo. And what was the magnitude of the aspirin effect in this research? Was r² .90, or .80, or .70, or .60, so that the corresponding rs would have been .95, .89, .84, or .77? No. Well, was r² .50, .40, .30, or even .20, so that the corresponding rs would have been .71, .63, .55, or .45? No. Actually, what r² was, was .0011, with a corresponding r of .034.

Insert Table 1 about here

Table 1 shows the results of the aspirin study in terms of raw counts, percentages, and as a Binomial Effect Size Display (BESD). This display is a way of showing the practical importance of any effect indexed by a correlation coefficient. The correlation is shown to be the simple difference in outcome rates between the experimental and the control groups in this standard table, which always adds up to column totals of 100 and row totals of 100 (Rosenthal & Rubin, 1982b).
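To make the arithmetic concrete, here is a minimal sketch in Python--not any published routine, and the helper names are mine--that recovers r from the raw counts of Table 1 as a phi coefficient and then rebuilds the display from that r alone:

```python
import math

def phi_from_counts(a, b, c, d):
    """Pearson r for a 2x2 table (the phi coefficient).
    a, b = treatment row (bad outcome, good outcome);
    c, d = control row (bad outcome, good outcome)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def besd(r):
    """Binomial Effect Size Display: 'success' rates per 100 for the
    treatment and control rows, constructed to differ by exactly r."""
    return 100 * (0.5 + r / 2), 100 * (0.5 - r / 2)

# Raw counts from Table 1: aspirin, 104 heart attacks vs. 10,933 none;
# placebo, 189 heart attacks vs. 10,845 none.
r = -phi_from_counts(104, 10933, 189, 10845)  # sign flipped so r > 0 favors aspirin
print(round(r ** 2, 4), round(r, 3))          # 0.0011 0.034
print(tuple(round(x, 1) for x in besd(r)))    # (51.7, 48.3): part III of Table 1
```

Note that the display is fully determined by r: the raw counts give the correlation, and the correlation alone regenerates the display.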
This type of result, seen in the physicians' aspirin study, is not at all unusual in biomedical research. Some years earlier, on October 29, 1981, the National Heart, Lung, and Blood Institute discontinued its placebo-controlled study of propranolol because results were so favorable to the treatment that it would be unethical to continue withholding the life-saving drug from the control patients. And what was the magnitude of this effect? Once again the effect size r was .04, and the leading digits of the r² were .00! As behavioral researchers we are not used to thinking of rs of .04 as reflecting effect sizes of practical importance. But when we think of an r of .04 as reflecting a 4% decrease in heart attacks, the interpretation given r in a Binomial Effect Size Display, the r does not appear to be quite so small, especially if we can count ourselves among the 4 per 100 who manage to survive.

Insert Table 2 about here

Additional Results

Table 2 gives three further examples of Binomial Effect Size Displays. In a recent study of 4,462 Army veterans of the Vietnam War era (1965-1971), the correlation between having served in Vietnam (rather than elsewhere) and having suffered from alcohol abuse or dependence was .07 (Centers for Disease Control, 1988). The top display of Table 2 shows that the difference between the problem rates of 53.5 and 46.5 per 100 is equal to the correlation coefficient of .07.

The center display of Table 2 shows the results of a study of the effects of AZT on the survival of 282 patients suffering from AIDS or AIDS-related complex (ARC). Here, too, it was decided to end the clinical trial on the ethical grounds that it would be improper to continue to give placebo to the control group patients (Barnes, 1986).

As a footnote to this display let me add the result of a small informal poll I took a few weeks ago of some physicians spending the year at the Center for Advanced Study in the Behavioral Sciences. I asked them to tell me of some medical breakthrough that was of very great practical importance. Their consensus was that the breakthrough was the effect of cyclosporine in increasing the probability that the body would not reject an organ transplant and that the recipient patient would not die. A multi-center randomized experiment was published in 1983 (Canadian Multicentre Transplant Study Group, 1983). The results of this breakthrough experiment were less dramatic than the results of the AZT study. For the dependent variable of organ rejection the effect size r was .19 (r² = .036); for the dependent variable of patient survival the effect size r was .15 (r² = .022).

The bottom display of Table 2 shows the results of a famous meta-analysis of psychotherapy outcome studies reported by Smith and Glass (1977). An eminent critic (Rimland, 1979) believed that the results of their analysis sounded the "death knell" for psychotherapy because of the modest size of the effect. This modest effect size was an r of .32 accounting for "only 10% of the variance."
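Every display in Table 2 is generated from its r by the rule made explicit in the next section (success rates of .50 plus and minus one-half of r). A one-loop Python sketch, using only the effect sizes quoted above:

```python
# Success rates per 100 implied by each r of Table 2 (and the cyclosporine
# study): rate = 50 plus or minus 100 * r / 2.
for label, r in [("Vietnam service and alcohol problems", 0.07),
                 ("AZT and survival", 0.23),
                 ("Cyclosporine and organ rejection", 0.19),
                 ("Psychotherapy benefit", 0.32)]:
    print(f"{label}: {50 + 100 * r / 2:.1f} vs. {50 - 100 * r / 2:.1f} per 100")
# r = .32 prints 66.0 vs. 34.0 -- the bottom display of Table 2.
```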
Examination of the bottom display of Table 2 shows that it is not very realistic to label as "modest indeed" an effect size equivalent to increasing a success rate from 34% to 66% (for example, reducing a death rate or a failure rate from 66% to 34%). Indeed, as we have seen, the dramatic effects of AZT were substantially smaller (r = .23), and the "breakthrough" effects of cyclosporine were smaller still (r = .19).

Telling How Well We're Doing

The Binomial Effect Size Display is a useful way to display the practical magnitude of an effect size regardless of whether the dependent variable is dichotomous or continuous (Rosenthal & Rubin, 1982b). An especially useful feature of the display is how easily we can go from the display to an r (just take the difference between the success rates of the experimental versus the control group) and how easily we can go from an effect size r to the display (just compute the treatment success rate as .50 plus one-half of r and the control success rate as .50 minus one-half of r).

One effect of the standard use of a display procedure such as the Binomial Effect Size Display to index the practical value of our research results would be to give us more useful and more realistic assessments of how well we are really doing as researchers in the social and behavioral sciences. Employment of the Binomial Effect Size Display has, in fact, shown that we are doing considerably better in our "softer, wilder" sciences than we may have thought we were doing.

So far, our conversation has been intended to comfort the afflicted. In what follows the intent is a bit more to afflict the comfortable. We begin with the topic of replication.

The Meaning of Successful Replication

There is a long tradition in psychology of our urging one another to replicate each other's research. Indeed, there seems to be something nearly scriptural about it--I quote: "If a scholar's work be deemed unreplicable then shall ye gladly cast that scholar out." (That's from either Referees I or Editors II, I believe.) Now, while we have been very good at calling for replications we have not been too good at deciding when a replication has been successful. The issue we now address is: When shall a study be deemed successfully replicated?

Successful replication is ordinarily taken to mean that a null hypothesis that has been rejected at time 1 is rejected again, and with the same direction of outcome, on the basis of a new study at time 2. The basic model of this usage can be seen in Table 3. The results of the first study are described dichotomously as p < .05 or p > .05 (or some other critical level, e.g., .01). Each of these two possible outcomes is further dichotomized as to the results of the second study as p < .05 or p > .05. Thus, cells A and D of Table 3 are examples of failure to replicate because one study was significant and the other was not. Let us examine more closely a specific example of such a "failure to replicate."

Pseudo-Failures to Replicate

The saga of Smith and Jones. Smith has published the results of an experiment in which a certain treatment procedure was predicted to improve performance. She reported results significant at p < .05 [...]

[...] diffuse hypotheses are those tested by F tests with df > 1 in the numerator or by χ² tests with df > 1. For example, suppose the specific question is whether increased incentive level improves the productivity of work groups. We employ four levels of incentive, so that our omnibus F test would have 3 df in the numerator or our omnibus χ² would be on at least 3 df. Common as these tests are, they reflect poorly on our teaching of data analytic procedures. The diffuse hypothesis tested by these omnibus tests usually tells us nothing of importance about our research question. The rule of thumb is unambiguous: Whenever we have tested a fixed effect with df > 1 for χ² or for the numerator of F, we have tested a question in which we are almost surely not interested. What we are almost always interested in is a focused question--a contrast--as the sketch below illustrates.
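A minimal sketch of the focused alternative for the incentive example: a linear trend contrast, with the resulting t converted to an effect size r in the manner of Rosenthal and Rosnow (1985). All the numbers here are made up for illustration.

```python
import math

# Hypothetical data: mean productivity at four incentive levels, n groups
# per level, and the mean square within from the overall ANOVA.
means = [50.0, 53.0, 55.0, 58.0]   # assumed cell means
n = 20                              # assumed groups per condition
ms_within = 100.0                   # assumed error mean square

# Focused question: does productivity rise linearly with incentive?
# Linear contrast weights (lambdas) must sum to zero.
weights = [-3, -1, 1, 3]

L = sum(w * m for w, m in zip(weights, means))               # contrast value
se = math.sqrt(ms_within * sum(w * w for w in weights) / n)  # its standard error
t = L / se
df_error = 4 * (n - 1)
r_contrast = t / math.sqrt(t * t + df_error)  # effect size r for the contrast

print(f"t({df_error}) = {t:.2f}, contrast r = {r_contrast:.2f}")  # t(76) = 2.60, r = 0.29
```

The single df of the contrast asks exactly the question of interest, where the omnibus F on 3 df asks only whether the four means differ somehow.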
The situation is even worse when there are several dependent variables as well as multiple df for the independent variable. The paradigm case here is canonical correlation, and special cases are MANOVA, MANCOVA, multiple discriminant function, multiple path analysis, and complex multiple partial correlation. While all of these procedures have useful exploratory data analytic applications, they are commonly used to test null hypotheses which are scientifically almost always of doubtful value. The effect size estimates they yield (e.g., the canonical correlation) are also almost always of doubtful value. This is not the place to go into detail, but one approach to the problem of analyzing canonical data structures is to reduce the set of dependent variables to some smaller number of composite variables using the principal-components-followed-by-unit-weighting approach. Each composite can then be analyzed serially.

Meta-analytic questions are basically contrast questions. F tests with df > 1 in the numerator or χ²'s with df > 1 are useless in meta-analytic work. That leads to an additional scientific benefit: the increased recognition of contrast analysis. Meta-analytic questions require precise formulation of questions, and contrasts are procedures for obtaining answers to such questions, often in an analysis of variance or table analysis context.

Although most textbooks of statistics describe the logic and the machinery of contrast analyses, one still sees contrasts employed all too rarely. That is a real pity given the precision of thought and theory they encourage and (especially relevant to these times of publication pressure) given the boost in power conferred with the resulting increase in .05 asterisks (Rosenthal & Rosnow, 1985).

A probable increase in the accurate understanding of interaction effects. Probably the most universally misinterpreted empirical results in psychology are the results of interaction effects. A recent survey of 191 research articles involving interactions found only two articles that showed the authors interpreting interactions in an unequivocally correct manner (i.e., by examining the residuals that define the interaction) (Rosnow & Rosenthal, 1989). The rest of the articles simply compared means of conditions with other means, a procedure that does not investigate interaction effects but rather the sum of main effects and interaction effects.

Most standard textbooks of statistics for psychologists provide accurate mathematical definitions of interaction effects but then interpret not the residuals that define those interactions but the means of cells that are the sums of all main effects and all interactions. What examining the residuals actually involves is sketched below.
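A minimal sketch, with made-up cell means for a hypothetical 2 x 2 design: removing the grand mean, the row effects, and the column effects from each cell mean leaves the residuals that define the interaction.

```python
# Assumed cell means: rows = treatment/control, columns = two subject groups
# (purely hypothetical numbers).
cells = [[6.0, 4.0],
         [3.0, 5.0]]

grand = sum(sum(row) for row in cells) / 4
row_eff = [sum(row) / 2 - grand for row in cells]
col_eff = [sum(cells[i][j] for i in range(2)) / 2 - grand for j in range(2)]

# Interaction residuals: cell mean minus grand mean, row effect, column effect.
residuals = [[cells[i][j] - grand - row_eff[i] - col_eff[j] for j in range(2)]
             for i in range(2)]
print(residuals)  # [[1.0, -1.0], [-1.0, 1.0]]: a pure crossover interaction
```

It is these residuals, not the raw cell means, that tell us what the interaction itself looks like.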
In addition, users of SPSS, SAS, BMDP, and virtually all other data-analytic software are poorly served in the matter of interactions, since virtually no programs provide convenient tabular output giving the residuals that define an interaction. The only exception of which I am aware is a little-known package called Data-Text, developed by Arthur Couch and David Armor, for which William Cochran and Donald Rubin provided the statistical consultation. Since many meta-analytic questions are by nature questions of interaction (for example, that opposite-sex dyads will conduct standard transactions more slowly than will same-sex dyads), we can be hopeful that increased use of meta-analytic procedures will bring with it increased sophistication about the meaning of interaction.

Meta-analytic procedures are applicable beyond meta-analyses. Many of the techniques of contrast analyses among effect sizes, for example, can be used within a single study (Rosenthal & Rosnow, 1985). Computing a single effect size from correlated dependent variables, or comparing treatment effects on two or more dependent variables, serve as illustrations (Rosenthal & Rubin, 1986).

The decrease in the splendid detachment of the full professor. Meta-analytic work requires careful reading of research and moderate data analytic skills. We cannot send an undergraduate research assistant to the library with a stack of 5 x 8 cards to bring us back "the results." With narrative reviews that seems often to have been done. With meta-analysis the reviewer must get involved with the actual data, and that is all to the good.

Conclusion

I hope that this paper has provided some comfort to the afflicted in showing that many of the findings of our discipline are neither as small nor as unimportant from a practical point of view as we may have feared. I hope, too, that there may have been some affliction of the comfortable in showing that in our views of replication and of the cumulation of the wisdom of our field there is much yet remaining to be done.

References

Barnes, D. M. (1986). Promising results halt trial of anti-AIDS drug. Science, 234, 15-16.

Canadian Multicentre Transplant Study Group. (1983). A randomized clinical trial of cyclosporine in cadaveric renal transplantation. New England Journal of Medicine, 309, 809-815.

Centers for Disease Control Vietnam Experience Study. (1988). Health status of Vietnam veterans: I. Psychosocial characteristics. Journal of the American Medical Association, 259, 2701-2707.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Collins, H. M. (1985). Changing Order: Replication and Induction in Scientific Practice. Beverly Hills, CA: Sage.
Cooper, H. M., & Rosenthal, R. (1980). Statistical versus traditional procedures for summarizing research findings. Psychological Bulletin, 87, 442-449.

Elashoff, J. D. (1978). Box scores are for baseball. The Behavioral and Brain Sciences, 3, 392.

Fiske, D. W. (1978). The several kinds of generalization. The Behavioral and Brain Sciences, 3, 393-394.

Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.

Glass, G. V. (1978). In defense of generalization. The Behavioral and Brain Sciences, 3, 394-395.

Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-Analysis in Social Research. Beverly Hills, CA: Sage.

Hedges, L. V. (1982). Estimation of effect size from a series of independent experiments. Psychological Bulletin, 92, 490-499.

Hedges, L. V. (1987). How hard is hard science, how soft is soft science? American Psychologist, 42, 443-455.

Jung, J. (1978). Self-negating functions of self-fulfilling prophecies. The Behavioral and Brain Sciences, 3, 397-398.

Lamb, W. K., & Whitla, D. K. (1983). Meta-Analysis and the Integration of Research Findings: A Trend Analysis and Bibliography Prior to 1983. Unpublished manuscript, Harvard University, Cambridge.

Mayo, R. J. (1978). Statistical considerations in analyzing the results of a collection of experiments. The Behavioral and Brain Sciences, 3, 400-401.

Nelson, N., Rosenthal, R., & Rosnow, R. L. (1986). Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist, 41, 1299-1301.

Pool, R. (1988). Similar experiments, dissimilar results. Science, 242, 192-193.

Rimland, B. (1979). Death knell for psychotherapy? American Psychologist, 34, 192.

Rosenthal, R. (1966). Experimenter Effects in Behavioral Research. New York: Appleton-Century-Crofts.

Rosenthal, R. (1969). Interpersonal expectations. In R. Rosenthal & R. L. Rosnow (Eds.), Artifact in Behavioral Research (pp. 181-277). New York: Academic Press.

Rosenthal, R. (1979a). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.

Rosenthal, R. (1979b). Replications and their relative utilities. Replications in Social Psychology, 1(1), 15-23.

Rosenthal, R. (1984). Meta-Analytic Procedures for Social Research. Beverly Hills, CA: Sage.

Rosenthal, R. (1986). Nonsignificant relationships as scientific evidence. Behavioral and Brain Sciences, 9, 479-481.

Rosenthal, R., & Gaito, J. (1963). The interpretation of levels of significance by psychological researchers. Journal of Psychology, 55, 33-38.

Rosenthal, R., & Gaito, J. (1964). Further evidence for the cliff effect in the interpretation of levels of significance. Psychological Reports, 15, 570.

Rosenthal, R., & Rosnow, R. L. (1984). Essentials of Behavioral Research: Methods and Data Analysis. New York: McGraw-Hill.

Rosenthal, R., & Rosnow, R. L. (1985). Contrast Analysis: Focused Comparisons in the Analysis of Variance. New York: Cambridge University Press.

Rosenthal, R., & Rosnow, R. L. (in press). Essentials of Behavioral Research: Methods and Data Analysis (2nd ed.). New York: McGraw-Hill.
Rosenthal, R., & Rubin, D. B. (1978). Interpersonal expectancy effects: The first 345 studies. The Behavioral and Brain Sciences, 3, 377-386.

Rosenthal, R., & Rubin, D. B. (1979). Comparing significance levels of independent studies. Psychological Bulletin, 86, 1165-1168.

Rosenthal, R., & Rubin, D. B. (1982a). Comparing effect sizes of independent studies. Psychological Bulletin, 92, 500-504.

Rosenthal, R., & Rubin, D. B. (1982b). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.

Rosenthal, R., & Rubin, D. B. (1985). Statistical analysis: Summarizing evidence versus establishing facts. Psychological Bulletin, 97, 527-529.

Rosenthal, R., & Rubin, D. B. (1986). Meta-analytic procedures for combining studies with multiple effect sizes. Psychological Bulletin, 99, 400-406.

Rosenthal, R., & Rubin, D. B. (1988). Comment: Assumptions and procedures in the file drawer problem. Statistical Science, 3, 120-125.

Rosenthal, R., & Rubin, D. B. (in press). Effect size estimation for one-sample multiple-choice-type data: Design, analysis, and meta-analysis. Psychological Bulletin.

Rosnow, R. L., & Rosenthal, R. (1989). Definition and interpretation of interaction effects. Psychological Bulletin, 105, 143-146.

Sedlmeier, P., & Gigerenzer, G. (in press). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin.

Skinner, B. F. (1983, August). On the value of research. APA Monitor, p. 39.

Smith, M. L., & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752-760.

Snedecor, G. W., & Cochran, W. G. (1980). Statistical Methods (7th ed.). Ames: Iowa State University Press.

Steering Committee of the Physicians' Health Study Research Group. (1988). Preliminary report: Findings from the aspirin component of the ongoing physicians' health study. The New England Journal of Medicine, 318, 262-264.

Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.

Author Notes

This paper was presented as an EPA Distinguished Lecture at the Meeting of the Eastern Psychological Association, Boston, April 2, 1989. Preparation of this paper was supported in part by the National Science Foundation while the author was a Fellow at the Center for Advanced Study in the Behavioral Sciences. I am grateful for financial support provided by the John D. and Catherine T. MacArthur Foundation, and for improvements suggested by Lynn Gale, Deanna Knickerbocker, Harold Luft, and Lincoln Moses.

Table 1

Effects of Aspirin on Heart Attacks Among 22,000 Physicians

                        Heart Attack   No Heart Attack    Total

I. Raw Counts
  Aspirin                    104           10,933         11,037
  Placebo                    189           10,845         11,034
  Total                      293           21,778         22,071

II. Percentages
  Aspirin                   0.94            99.06            100
  Placebo                   1.71            98.29            100
  Total                     1.33            98.67            100
III. Binomial Effect Size Display
  Aspirin                   48.3            51.7             100
  Placebo                   51.7            48.3             100
  Total                    100             100               200

Table 2

Other Examples of Binomial Effect Size Displays

I. Vietnam Service and Alcohol Problems (r = .07)

                          Problem    No Problem    Total
  Vietnam Veteran           53.5        46.5        100
  Non-Vietnam Veteran       46.5        53.5        100
  Total                    100         100          200

II. AZT in the Treatment of AIDS (r = .23)

                          Death    Survival    Total
  AZT                      38.5      61.5       100
  Placebo                  61.5      38.5       100
  Total                   100       100         200

III. Benefits of Psychotherapy (r = .32)

                          Less Benefit    Greater Benefit    Total
  Psychotherapy                34               66            100
  Control                      66               34            100
  Total                       100              100            200

Table 3

Common Model of Successful Replication: Judgment Is Dichotomous and Based on Significance Testing

                                 First Study
                            p > .05      p < .05
  Second Study   p < .05       A            B
                 p > .05       C            D

Note: Cells A and D, in which one study reached significance and the other did not, are conventionally read as failures to replicate (see text).
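The dichotomy of Table 3 invites a continuous alternative: compare the two studies' effect sizes directly. As a closing sketch of that alternative, in the spirit of Rosenthal and Rubin (1982a), here is the standard Fisher-z test of the difference between two independent rs; the particular rs and Ns below are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def compare_rs(r1, n1, r2, n2):
    """Standard normal Z for the difference between two independent rs."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# Hypothetical pair: study 1 (r = .30, N = 80) is significant at p < .01;
# study 2 (r = .25, N = 30) is not significant. The model of Table 3 would
# call this a failure to replicate, yet the effect sizes barely differ:
print(round(compare_rs(0.30, 80, 0.25, 30), 2))  # 0.24 -- no real disagreement
```

Judged by their effect sizes rather than by their p values, the two hypothetical studies replicate each other very well indeed.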