ROBERT ROSENTHAL - HARVARD UNIVERSITY
Document Type:
Collection:
Document Number (FOIA) /ESDN (CREST):
CIA-RDP96-00792R000100130006-7
Release Decision:
RIFPUB
Original Classification:
K
Document Page Count:
54
Document Creation Date:
November 4, 2016
Document Release Date:
October 28, 1998
Sequence Number:
6
Case Number:
Publication Date:
April 2, 1989
Content Type:
BRIEF
File:
| Attachment | Size |
|---|---|
| CIA-RDP96-00792R000100130006-7.pdf | 1.69 MB |
Body:
Approved For Release 2000/08/10 : CIA-RDP96-00792R000100130006-7
Harvard University
My talk today is designed in part both to comfort the afflicted and to afflict the
comfortable. The afflicted are those of us who work in the softer, wilder areas of our
field--the areas where the results seem ephemeral and unreplicable, and where the
r2's seem always to be approaching zero as a limit. These softer, wilder areas include
those of social, personality, clinical, developmental, educational, organizational, and
health psychology. They also include parts of psychobiology and cognitive psychology. These softer, wilder areas, however, may not include too much of
psychophysics.
My message to those of us toiling in these muddy vineyards will be that we are
doing better than we might have thought. My message to those of us in any areas in
which we feel we have pretty well nailed things down will be that we haven't, and
that we could be doing a whole lot better.
How Large Must an Effect Be, To Be Important?
There is a bit of good news-bad news abroad in the land. The good news is that
more sophisticated editors, referees, and researchers are becoming aware that
reporting the results of a significance test is not a sufficiently enlightening procedure without a report of the size of the effect accompanying the p level. The bad news is that we are still not quite sure what to do with such a report of the magnitude of the effect, for example, a correlation coefficient.
There is one bit of training that all psychologists have undergone. From under-
graduate days onward we have all been taught that there is only one proper, decent
thing to do whenever we see a correlation coefficient--we must square it. For most of
the softer, wilder areas of psychology, squaring the correlation coefficient tends to
make it go away--vanish into nothingness as it were. That is one of the sources of
malaise in the social and behavioral sciences. It is sad and quite unnecessary, as we
shall soon see.
The Physicians' Aspirin Study
At a special meeting held on December 18, 1987, it was decided to end, prematurely, a randomized double-blind experiment on the effects of aspirin on
reducing heart attacks (Steering Committee of the Physicians' Health Study
Research Group, 1988). The reason for this unusual termination of such an experi-
ment was that it had become so clear that aspirin prevented heart attacks (and
deaths from heart attacks) that it would be unethical to continue to give half the physician research subjects a placebo. And what was the magnitude of this effect that was so dramatic as to call for the termination of this research? Was r2 .90, or .80, or .70, or .60, so that the corresponding rs would have been .95, .89, .84, or .77? No. Well, was r2 .50, .40, .30, or even .20, so that the corresponding rs would have been .71, .63, .55, or .45? No. Actually, what r2 was, was .0011, with a corresponding r of .034.
Insert Table 1 about here
Table 1 shows the results of the aspirin study in terms of raw counts, per-
centages, and as a Binomial Effect Size Display (BESD). This display is a way of
showing the practical importance of any effect indexed by a correlation coefficient.
The correlation is shown to be the simple difference in outcome rates between the
experimental and the control groups in this standard table which always adds up to
column totals of 100 and row totals of 100 (Rosenthal & Rubin, 1982b).
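The r reported for a two-by-two table of counts like Table 1 is the phi coefficient, which can be checked directly from the raw counts. A minimal sketch (counts copied from Table 1; the function name is mine, not the paper's):

```python
import math

def phi_coefficient(a, b, c, d):
    """Pearson r for a 2x2 table of counts (the phi coefficient).
    Layout: [[a, b], [c, d]] = [[treated-hit, treated-miss],
                                [control-hit, control-miss]]."""
    num = a * d - b * c
    den = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return num / den

# Raw counts from Table 1: aspirin 104 heart attacks vs 10,933 none;
# placebo 189 heart attacks vs 10,845 none.
r = phi_coefficient(104, 10933, 189, 10845)
print(round(abs(r), 3))  # 0.034 -- r^2 is .0011, yet the BESD shows a 3.4-point swing
```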
This type of result seen in the physicians' aspirin study is not at all unusual in
biomedical research. Some years earlier, on October 29, 1981, the National Heart,
Lung, and Blood Institute discontinued its placebo-controlled study of propranolol
because results were so favorable to the treatment that it would be unethical to
continue withholding the life-saving drug from the control patients. And what was the magnitude of this effect? Once again the effect size r was .04, and the leading digits of the r2 were .00! As behavioral researchers we are not used to thinking of rs
of .04 as reflecting effect sizes of practical importance. But when we think of an r of
.04 as reflecting a 4% decrease in heart attacks, the interpretation given r in a
Binomial Effect Size Display, the r does not appear to be quite so small, especially if
we can count ourselves among the 4 per 100 who manage to survive.
Insert Table 2 about here
Additional Results
Table 2 gives three further examples of Binomial Effect Size Displays. In a
recent study of 4,462 Army veterans of the Vietnam War era (1965-1971), the
correlation between having served in Vietnam (rather than elsewhere) and having
suffered from alcohol abuse or dependence was .07 (Centers for Disease Control,
1988). The top display of Table 2 shows that the difference between the problem
rates of 53.5 and 46.5 per 100 is equal to the correlation coefficient of .07.
The center display of Table 2 shows the results of a study of the effects of AZT
on the survival of 282 patients suffering from AIDS or AIDS-related complex (ARC) (Barnes, 1986). The effect size r was .23. The researchers ended the clinical trial on the ethical grounds that it would be improper to continue to give placebo to the control group patients.
As a footnote to this display let me add the result of a small informal poll I took
a few weeks ago of some physicians spending the year at the Center for Advanced
Study in the Behavioral Sciences. I asked them to tell me of some medical break-
through that was of very great practical importance. Their consensus was that the
breakthrough was the effect of cyclosporine in increasing the probability that the
body would not reject an organ transplant and that the recipient patient would not
die. A multi-center randomized experiment was published in 1983 (Canadian
Multicentre Transplant Study Group, 1983). The results of this breakthrough
experiment were less dramatic than the results of the AZT study. For the dependent
variable of organ rejection the effect size r was .19 (r2 = .036); for the dependent
variable of patient survival the effect size r was .15 (r2 = .022).
The bottom display of Table 2 shows the results of a famous meta-analysis of
psychotherapy outcome studies reported by Smith and Glass (1977). An eminent
critic (Rimland, 1979) believed that the results of their analysis sounded the "death
knell" for psychotherapy because of the modest size of the effect. This modest effect size was an r of .32 accounting for "only 10% of the variance."
Examination of the bottom display of Table-2 shows that it is not very realistic
to label as "modest indeed" an effect size equivalent to increasing a success rate from
34% to 66% (for example, reducing a death rate or a failure rate from 66% to 34%).
Indeed, as we have seen, the dramatic effects of AZT were substantially smaller (r =
.23), and the "breakthrough" effects of cyclosporine were smaller still (r = .19).
Telling How Well We're Doing
The Binomial Effect Size Display is a useful way to display the practical magni-
tude of an effect size regardless of whether the dependent variable is dichotomous or
continuous (Rosenthal & Rubin, 1982b). An especially useful feature of the display
is how easily we can go from the display to an r (just take the difference between the
success rates of the experimental versus the control group) and how easily we can go
from an effect size r to the display (just compute the treatment success rate as .50
plus one-half of r and the control success rate as .50 minus one-half of r).
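The two conversions just described are one-liners. A minimal sketch (function names are mine, not the paper's):

```python
def besd_from_r(r):
    """BESD success rates implied by a correlation r:
    treatment = .50 + r/2, control = .50 - r/2 (Rosenthal & Rubin, 1982b)."""
    return 0.50 + r / 2, 0.50 - r / 2

def r_from_besd(treatment_rate, control_rate):
    """Recover r as the simple difference between the two success rates."""
    return treatment_rate - control_rate

# The psychotherapy display of Table 2 (r = .32) and the Vietnam display (r = .07):
print(tuple(round(x, 2) for x in besd_from_r(0.32)))  # (0.66, 0.34)
print(round(r_from_besd(0.535, 0.465), 2))            # 0.07
```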
One effect of the standard use of a display procedure such as the Binomial
Effect Size Display to index the practical value of our research results would be to
give us more useful and more realistic assessments of how well we are really doing as
researchers in the social and behavioral sciences. Employment of the Binomial
Effect Size Display has, in fact, shown that we are doing considerably better in our
"softer, wilder" sciences than we may have thought we were doing.
So far, our conversation has been intended to comfort the afflicted. In what
follows the intent is a bit more to afflict the comfortable. We begin with the topic of
replication.
The Meaning of Successful Replication
There is a long tradition in psychology of our urging one another to replicate
each other's research. Indeed, there seems to be something nearly scriptural about
it--I quote: "If a scholar's work be deemed unreplicable then shall ye gladly cast that
scholar out." (That's from either Referees I or Editors II, I believe.)
Now, while we have been very good at calling for replications we have not been
too good at deciding when a replication has been successful. The issue we now
address is: When shall a study be deemed successfully replicated?
Successful replication is ordinarily taken to mean that a null hypothesis that
has been rejected at time 1 is rejected again, and with the same direction of outcome,
on the basis of a new study at time 2. The basic model of this usage can be seen in
Table 3. The results of the first study are described dichotomously as p < .05 or p >
.05 (or some other critical level, e.g., .01). Each of these two possible outcomes is
further dichotomized as to the results of the second study as p < .05 or p > .05. Thus,
cells A and D of Table 3 are examples of failure to replicate because one study was
significant and the other was not. Let us examine more closely a specific example of
such a "failure to replicate."
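A less dichotomous way to ask whether a second study agrees with a first is to compare the two effect sizes directly, for instance via Fisher's z transformation. A sketch (the rs and ns are hypothetical illustrations, not results from the paper):

```python
import math

def compare_two_rs(r1, n1, r2, n2):
    """Z for the difference between two independent correlations,
    via Fisher's z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher z of each r
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# Hypothetical pseudo-failure: similar effect sizes but different ns, so the
# larger study reaches p < .05 and the smaller one does not.
z = compare_two_rs(0.30, 80, 0.25, 30)
print(round(z, 2))  # well under 1.96: the two results do not meaningfully disagree
```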
Pseudo-Failures to Replicate
The saga of Smith and Jones. Smith has published the results of an experiment in which a certain treatment procedure was predicted to improve performance. She reported results significant at p < .05.

The decreased frequency of omnibus tests. Omnibus tests are performed by F tests with df > 1 in the numerator or by X2 tests with df > 1. For
example, suppose the specific question is whether increased incentive level improves
the productivity of work groups. We employ four levels of incentive so that our
omnibus F test would have 3 df in the numerator or our omnibus X2 would be on at
least 3 df. Common as these tests are, they reflect poorly on our teaching of data
analytic procedures. The diffuse hypothesis tested by these omnibus tests usually
tells us nothing of importance about our research question. The rule of thumb is
unambiguous: Whenever we have tested a fixed effect with df > 1 for X2 or for the
numerator of F, we have tested a question in which we are almost surely not
interested.
The situation is even worse when there are several dependent variables as well
as multiple df for the independent variable. The paradigm case here is canonical correlation and special cases are MANOVA, MANCOVA, multiple discriminant function, multiple path analysis, and complex multiple partial correlation. While all
of these procedures have useful exploratory data analytic applications they are
commonly used to test null hypotheses which are scientifically almost always of
doubtful value. The effect size estimates they yield (e.g., the canonical correlation)
are also almost always of doubtful value.
This is not the place to go into detail, but one approach to the problem of
analyzing canonical data structures is to reduce the set of dependent variables to
some smaller number of composite variables using the principal-components-followed-by-unit-weighting approach. Each composite can then be analyzed serially.
Meta-analytic questions are basically contrast questions. F tests with df > 1 in
the numerator or X2's with df > 1 are useless in meta-analytic work. That leads to
an additional scientific benefit:
The increased recognition of contrast analysis. Meta-analytic questions require
precise formulation of questions and contrasts are procedures for obtaining answers
to such questions, often in an analysis of variance or table analysis context.
Although most textbooks of statistics describe the logic and the machinery of
contrast analyses, one still sees contrasts employed all too rarely. That is a real pity
given the precision of thought and theory they encourage and (especially relevant to
these times of publication pressure) given the boost in power conferred with the
resulting increase in .05 asterisks (Rosenthal & Rosnow, 1985).
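As a sketch of the machinery, a linear trend contrast across the four incentive levels mentioned earlier might look as follows (the means, group size, and mean square within are hypothetical):

```python
import math

def linear_contrast_t(means, n_per_group, ms_within, weights):
    """t for a planned contrast L = sum(w_j * M_j); weights must sum to zero."""
    assert abs(sum(weights)) < 1e-9
    L = sum(w * m for w, m in zip(weights, means))
    se = math.sqrt(ms_within * sum(w * w for w in weights) / n_per_group)
    return L / se

# Hypothetical productivity means at four increasing incentive levels,
# tested against linear-trend weights rather than an omnibus 3-df F.
means = [50.0, 54.0, 57.0, 61.0]
t = linear_contrast_t(means, n_per_group=10, ms_within=64.0, weights=[-3, -1, 1, 3])
print(round(t, 2))  # 3.18
```

The single focused df asks the one question of interest (does productivity rise with incentive?) instead of the diffuse "are any means unequal?"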
A probable increase in the accurate understanding of interaction effects.
Probably the most universally misinterpreted empirical results in psychology are
the results of interaction effects. A recent survey of 191 research articles involving
interactions found only two articles that showed the authors interpreting inter-
actions in an unequivocally correct manner (i.e., by examining the residuals that
define the interaction) (Rosnow & Rosenthal, 1989). The rest of the articles simply
compared means of conditions with other means, a procedure that does not
investigate interaction effects but rather the sum of main effects and interaction
effects.
Most standard textbooks of statistics for psychologists provide accurate
mathematical definitions of interaction effects but then interpret not the residuals
that define those interactions but the means of cells that are the sums of all main
effects and all interactions.
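Computing the residuals that define an interaction takes only a few lines. A sketch for a two-way table of cell means (the cell values are hypothetical):

```python
def interaction_residuals(cells):
    """Residuals defining the interaction in a two-way table of cell means:
    residual_ij = cell_ij - row_mean_i - col_mean_j + grand_mean."""
    n_rows, n_cols = len(cells), len(cells[0])
    row_means = [sum(row) / n_cols for row in cells]
    col_means = [sum(cells[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    grand = sum(row_means) / n_rows
    return [[cells[i][j] - row_means[i] - col_means[j] + grand
             for j in range(n_cols)] for i in range(n_rows)]

# Hypothetical 2x2 cell means (e.g., treatment x sex):
cells = [[8.0, 4.0],
         [2.0, 6.0]]
print(interaction_residuals(cells))  # [[2.0, -2.0], [-2.0, 2.0]]
```

The residuals, not the raw cell means, are what an interaction claim is about; here the crossover pattern is the entire interaction.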
In addition, users of SPSS, SAS, BMDP, and virtually all other data-analytic
software are poorly served in the matter of interactions since virtually no programs
provide convenient tabular output giving the residuals defining interaction. The
only exception to that of which I am aware is a little-known package called Data-
Text developed by Arthur Couch and David Armor for which William Cochran and
Donald Rubin provided the statistical consultation.
Since many meta-analytic questions are by nature questions of interaction (for
example, that opposite sex dyads will conduct standard transactions more slowly
than will same sex dyads), we can be hopeful that increased use of meta-analytic
procedures will bring with it increased sophistication about the meaning of
interaction.
Meta-analytic procedures are applicable beyond meta-analyses. Many of the
techniques of contrast analyses among effect sizes, for example, can be used within a
single study (Rosenthal & Rosnow, 1985). Computing a single effect size from
correlated dependent variables, or comparing treatment effects on two or more
dependent variables serve as illustrations (Rosenthal & Rubin, 1986).
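One standard meta-analytic combination of this kind is to average Fisher-transformed rs weighted by n − 3 and transform back; a sketch with hypothetical studies (this is a common textbook procedure, not a formula quoted from this paper):

```python
import math

def combine_rs(rs_and_ns):
    """Weighted mean correlation across studies: Fisher-transform each r,
    weight by n - 3, average, and back-transform."""
    num = sum((n - 3) * math.atanh(r) for r, n in rs_and_ns)
    den = sum(n - 3 for _, n in rs_and_ns)
    return math.tanh(num / den)

studies = [(0.30, 40), (0.10, 100), (0.25, 60)]  # hypothetical (r, n) pairs
print(round(combine_rs(studies), 3))  # 0.185
```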
The decrease in the splendid detachment of the full professor. Meta-analytic
work requires careful reading of research and moderate data analytic skills. We
cannot send an undergraduate research assistant to the library with a stack of 5 X 8
cards to bring us back "the results." With narrative reviews that seems often to have
been done. With meta-analysis the reviewer must get involved with the actual data
and that is all to the good.
Conclusion
I hope that this paper has provided some comfort to the afflicted in showing
that many of the findings of our discipline are neither as small nor as unimportant
from a practical point of view as we may have feared. I hope, too, that there
may have been some affliction of the comfortable in showing that in our views of
replication and of the cumulation of the wisdom of our field there is much yet
remaining to be done.
References

Barnes, D. M. (1986). Promising results halt trial of anti-AIDS drug. Science, 234, 15-16.

Canadian Multicentre Transplant Study Group. (1983). A randomized clinical trial of cyclosporine in cadaveric renal transplantation. New England Journal of Medicine, 309, 809-815.

Centers for Disease Control Vietnam Experience Study. (1988). Health status of Vietnam veterans: I. Psychosocial characteristics. Journal of the American Medical Association, 259, 2701-2707.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Collins, H. M. (1985). Changing Order: Replication and Induction in Scientific Practice. Beverly Hills, CA: Sage.

Cooper, H. M., & Rosenthal, R. (1980). Statistical versus traditional procedures for summarizing research findings. Psychological Bulletin, 87, 442-449.

Elashoff, J. D. (1978). Box scores are for baseball. The Behavioral and Brain Sciences, 3, 392.

Fiske, D. W. (1978). The several kinds of generalization. The Behavioral and Brain Sciences, 3, 393-394.

Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.

Glass, G. V. (1978). In defense of generalization. The Behavioral and Brain Sciences, 3, 394-395.

Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-Analysis in Social Research. Beverly Hills, CA: Sage.

Hedges, L. V. (1982). Estimation of effect size from a series of independent experiments. Psychological Bulletin, 92, 490-499.

Hedges, L. V. (1987). How hard is hard science, how soft is soft science? American Psychologist, 42, 443-455.

Jung, J. (1978). Self-negating functions of self-fulfilling prophecies. The Behavioral and Brain Sciences, 3, 397-398.

Lamb, W. K., & Whitla, D. K. (1983). Meta-Analysis and the Integration of Research Findings: A Trend Analysis and Bibliography Prior to 1983. Unpublished manuscript, Harvard University, Cambridge.

Mayo, R. J. (1978). Statistical considerations in analyzing the results of a collection of experiments. The Behavioral and Brain Sciences, 3, 400-401.

Nelson, N., Rosenthal, R., & Rosnow, R. L. (1986). Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist, 41, 1299-1301.

Pool, R. (1988). Similar experiments, dissimilar results. Science, 242, 192-193.

Rimland, B. (1979). Death knell for psychotherapy? American Psychologist, 34, 192.

Rosenthal, R. (1966). Experimenter Effects in Behavioral Research. New York: Appleton-Century-Crofts.

Rosenthal, R. (1969). Interpersonal expectations. In R. Rosenthal and R. L. Rosnow (Eds.), Artifact in Behavioral Research (pp. 181-277). New York: Academic Press.

Rosenthal, R. (1979a). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.

Rosenthal, R. (1979b). Replications and their relative utilities. Replications in Social Psychology, 1(1), 15-23.

Rosenthal, R. (1984). Meta-Analytic Procedures for Social Research. Beverly Hills, CA: Sage.

Rosenthal, R. (1986). Nonsignificant relationships as scientific evidence. Behavioral and Brain Sciences, 9, 479-481.

Rosenthal, R., & Gaito, J. (1963). The interpretation of levels of significance by psychological researchers. Journal of Psychology, 55, 33-38.

Rosenthal, R., & Gaito, J. (1964). Further evidence for the cliff effect in the interpretation of levels of significance. Psychological Reports, 15, 570.

Rosenthal, R., & Rosnow, R. L. (1984). Essentials of Behavioral Research: Methods and Data Analysis. New York: McGraw-Hill.

Rosenthal, R., & Rosnow, R. L. (1985). Contrast Analysis: Focused Comparisons in the Analysis of Variance. New York: Cambridge University Press.

Rosenthal, R., & Rosnow, R. L. (in press). Essentials of Behavioral Research: Methods and Data Analysis (2nd ed.). New York: McGraw-Hill.

Rosenthal, R., & Rubin, D. B. (1978). Interpersonal expectancy effects: The first 345 studies. The Behavioral and Brain Sciences, 3, 377-386.

Rosenthal, R., & Rubin, D. B. (1979). Comparing significance levels of independent studies. Psychological Bulletin, 86, 1165-1168.

Rosenthal, R., & Rubin, D. B. (1982a). Comparing effect sizes of independent studies. Psychological Bulletin, 92, 500-504.

Rosenthal, R., & Rubin, D. B. (1982b). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.

Rosenthal, R., & Rubin, D. B. (1985). Statistical analysis: Summarizing evidence versus establishing facts. Psychological Bulletin, 97, 527-529.

Rosenthal, R., & Rubin, D. B. (1986). Meta-analytic procedures for combining studies with multiple effect sizes. Psychological Bulletin, 99, 400-406.

Rosenthal, R., & Rubin, D. B. (1988). Comment: Assumptions and procedures in the file drawer problem. Statistical Science, 3, 120-125.

Rosenthal, R., & Rubin, D. B. (in press). Effect size estimation for one-sample multiple-choice-type data: Design, analysis, and meta-analysis. Psychological Bulletin.

Rosnow, R. L., & Rosenthal, R. (1989). Definition and interpretation of interaction effects. Psychological Bulletin, 105, 143-146.

Sedlmeier, P., & Gigerenzer, G. (in press). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin.

Skinner, B. F. (1983, August). On the value of research. APA Monitor, p. 39.

Smith, M. L., & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752-760.

Snedecor, G. W., & Cochran, W. G. (1980). Statistical Methods (7th ed.). Ames: Iowa State University Press.

Steering Committee of the Physicians' Health Study Research Group. (1988). Preliminary report: Findings from the aspirin component of the ongoing physicians' health study. The New England Journal of Medicine, 318, 262-264.

Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Author Notes
This paper was presented as an EPA Distinguished Lecture at the Meeting of the Eastern Psychological Association, Boston, April 2, 1989. Preparation of this paper was supported in part by the National Science Foundation while
the author was a Fellow at the Center for Advanced Study in the Behavioral
Sciences. I am grateful for financial support provided by the John D. and Catherine
T. MacArthur Foundation, and for improvements suggested by Lynn Gale, Deanna
Knickerbocker, Harold Luft, and Lincoln Moses.
Table 1

Effects of Aspirin on Heart Attacks Among 22,000 Physicians

I. Raw Counts

| | Heart Attack | No Heart Attack | Total |
|---|---|---|---|
| Aspirin | 104 | 10,933 | 11,037 |
| Placebo | 189 | 10,845 | 11,034 |
| Total | 293 | 21,778 | 22,071 |

II. Percentages

| | Heart Attack | No Heart Attack | Total |
|---|---|---|---|
| Aspirin | 0.94 | 99.06 | 100 |
| Placebo | 1.71 | 98.29 | 100 |
| Total | 1.33 | 98.67 | 100 |

III. Binomial Effect Size Display

| | Heart Attack | No Heart Attack | Total |
|---|---|---|---|
| Aspirin | 48.3 | 51.7 | 100 |
| Placebo | 51.7 | 48.3 | 100 |
| Total | 100 | 100 | 200 |
Table 2

Other Examples of Binomial Effect Size Displays

I. Vietnam Service and Alcohol Problems (r = .07)

| | Problem | No Problem | Total |
|---|---|---|---|
| Vietnam Veteran | 53.5 | 46.5 | 100 |
| Non-Vietnam Veteran | 46.5 | 53.5 | 100 |
| Total | 100 | 100 | 200 |

II. AZT in the Treatment of AIDS (r = .23)

| | Death | Survival | Total |
|---|---|---|---|
| AZT | 38.5 | 61.5 | 100 |
| Placebo | 61.5 | 38.5 | 100 |
| Total | 100 | 100 | 200 |

III. Benefits of Psychotherapy (r = .32)

| | Less Benefit | Greater Benefit | Total |
|---|---|---|---|
| Psychotherapy | 34 | 66 | 100 |
| Control | 66 | 34 | 100 |
| Total | 100 | 100 | 200 |
Table 3

Common Model of Successful Replication: Judgment is Dichotomous and Based on Significance Testing

| | First Study: p > .05a | First Study: p < .05a |
|---|---|---|
| Second Study: p < .05 | A | B |
| Second Study: p > .05 | C | D |

a Or some other critical level, e.g., .01.