REPLICATION AND META-ANALYSIS IN PARAPSYCHOLOGY
Document Type:
Collection:
Document Number (FOIA) /ESDN (CREST):
CIA-RDP96-00792R000100130004-9
Release Decision:
RIFPUB
Original Classification:
K
Document Page Count:
41
Document Creation Date:
November 4, 2016
Document Release Date:
September 5, 2003
Sequence Number:
4
Case Number:
Content Type:
MAGAZINE
File:
Attachment | Size |
---|---|
CIA-RDP96-00792R000100130004-9.pdf | 4.42 MB |
Body:
Statistical Science
1991, Vol. 6, No. 4, 363-403
Replication and Meta-Analysis in
Parapsychology
Abstract. Parapsychology, the laboratory study of psychic phenomena,
has had its history interwoven with that of statistics. Many of the
controversies in parapsychology have focused on statistical issues, and
statistical models have played an integral role in the experimental
work. Recently, parapsychologists have been using meta-analysis as a
tool for synthesizing large bodies of work. This paper presents an
overview of the use of statistics in parapsychology and offers a summary
of the meta-analyses that have been conducted. It begins with some
anecdotal information about the involvement of statistics and statisti-
cians with the early history of parapsychology. Next, it is argued that
most nonstatisticians do not appreciate the connection between power
and "successful" replication of experimental effects. Returning to para-
psychology, a particular experimental regime is examined by summariz-
ing an extended debate over the interpretation of the results. A new set
of experiments designed to resolve the debate is then reviewed. Finally,
meta-analyses from several areas of parapsychology are summarized. It
is concluded that the overall evidence indicates that there is an anoma-
lous effect in need of an explanation.
Key words and phrases: Effect size, psychic research, statistical contro-
versies, randomness, vote-counting.
In a June 1990 Gallup Poll, 49% of the 1236
respondents claimed to believe in extrasensory per-
ception (ESP), and one in four claimed to have had
a personal experience involving telepathy (Gallup
and Newport, 1991). Other surveys have shown
even higher percentages; the University of
Chicago's National Opinion Research Center re-
cently surveyed 1473 adults, of which 67% claimed
that they had experienced ESP (Greeley, 1987).
Public opinion is a poor arbiter of science, how-
ever, and experience is a poor substitute for the
scientific method. For more than a century, small
numbers of scientists have been conducting labora-
tory experiments to study phenomena such as
telepathy, clairvoyance and precognition, collec-
tively known as "psi" abilities. This paper will
examine some of that work, as well as some of the
statistical controversies it has generated.
Jessica Utts is Associate Professor, Division of
Statistics, University of California at Davis, 469
Kerr Hall, Davis, California 95616.
Parapsychology, as this field is called, has been a
source of controversy throughout its history. Strong
beliefs tend to be resistant to change even in the
face of data, and many people, scientists included,
seem to have made up their minds on the question
without examining any empirical data at all. A
critic of parapsychology recently acknowledged that
"The level of the debate during the past 130 years
has been an embarrassment for anyone who would
like to believe that scholars and scientists adhere
to standards of rationality and fair play" (Hyman,
1985a, page 89). While much of the controversy has
focused on poor experimental design and potential
fraud, there have been attacks and defenses of the
statistical methods as well, sometimes calling into
question the very foundations of probability and
statistical inference.
Most of the criticisms have been leveled by psy-
chologists. For example, a 1988 report of the U.S.
National Academy of Sciences concluded that "The
committee finds no scientific justification from
research conducted over a period of 130 years for
the existence of parapsychological phenomena"
(Druckman and Swets, 1988, page 22). The chapter
on parapsychology was written by a subcommittee
chaired by a psychologist who had published a
similar conclusion prior to his appointment to the
committee (Hyman, 1985a, page 7). There were no
parapsychologists involved with the writing of the
report. Resulting accusations of bias (Palmer, Hon-
orton and Utts, 1989) led U.S. Senator Claiborne
Pell to request that the Congressional Office of
Technology Assessment (OTA) conduct an investi-
gation with a more balanced group. A one-day
workshop was held on September 30, 1988, bring-
ing together parapsychologists, critics and experts
in some related fields (including the author of this
paper). The report concluded that parapsychology
needs "a fairer hearing across a broader spectrum
of the scientific community, so that emotionality
does not impede objective assessment of experimen-
tal results" (Office of Technology Assessment,
1989).
It is in the spirit of the OTA report that this
article is written. After Section 2, which offers an
anecdotal account of the role of statisticians and
statistics in parapsychology, the discussion turns to
the more general question of replication of experi-
mental results. Section 3 illustrates how replica-
tion is interpreted by scientists in many
fields. Returning to parapsychology in Section 4, a
particular experimental regime called the "ganz
feld" is described, and an extended debate about
the interpretation of the experimental results is
discussed. Section 5 examines a meta-analysis of
recent ganzfeld experiments designed to resolve the
debate. Finally, Section 6 contains a brief account
of meta-analyses that have been conducted in other
areas of parapsychology, and conclusions are given
in Section 7.
2. STATISTICS AND PARAPSYCHOLOGY
Parapsychology had its beginnings in the investi-
gation of purported mediums and other anecdotal
claims in the late 19th century. The Society for
Psychical Research was founded in Britain in 1882,
and its American counterpart was founded in
Boston in 1884. While these organizations and their
members were primarily involved with investigat-
ing anecdotal material, a_ few of the early re-
searchers were already conducting "forced-choice"
experiments such as card-guessing. (Forced-choice
experiments are like multiple choice tests; on each
trial the subject must guess from a small, known
set of possibilities.) Notable among these was
Nobel Laureate Charles Richet, who is generally
credited with being the first to recognize that prob-
ability theory could be applied to card-guessing
experiments (Rhine, 1977, page 26; Richet, 1884).
F. Y. Edgeworth, partly in response to what he
considered to be incorrect analyses of these experi-
ments, offered one of the earliest treatises on the
statistical evaluation of forced-choice experiments
in two articles published in the Proceedings of the
Society for Psychical Research (Edgeworth, 1885,
1886). Unfortunately, as noted by Mauskopf and
McVaugh (1979) in their historical account of the
period, Edgeworth's papers were "perhaps too diffi-
cult for their immediate audience" (page 105).
Edgeworth began his analysis by using Bayes'
theorem to derive the formula for the posterior
probability that chance was operating, given the
data. He then continued with an argument
"savouring more of Bernoulli than Bayes" in which
"it is consonant, I submit, to experience, to put 1/2
both for α and β," that is, for both the prior proba-
bility that chance alone was operating, and the
prior probability that "there should have been some
additional agency." He then reasoned (using a
Taylor series expansion of the posterior prob-
ability formula) that if there were a large prob-
ability of observing the data given that some
additional agency was at work, and a small objec-
tive probability of the data under chance, then the
latter (binomial) probability "may be taken as a
rough measure of the sought a posteriori probabil-
ity in favour of mere chance." Edge-
worth concluded his article by applying his method
to some data published previously in the same
journal. He found the probability against chance to
be 0.99996, which he said "may fairly be regarded
as physical certainty" (page 199). He concluded:
Such is the evidence which the calculus of
probabilities affords as to the existence of an
agency other than mere chance. The calculus is
silent as to the nature of that agency-whether
it is more likely to be vulgar illusion or ex-
traordinary law. That is a question to be
decided, not by formulae and figures, but by
general philosophy and common sense [page
199].
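Edgeworth's argument can be made concrete with a small numerical sketch. The code below is a minimal illustration only (the data, the chance rate and the model assumed for the "additional agency" are all hypothetical, and SciPy is assumed to be available): with prior probability 1/2 on each hypothesis, the posterior probability of "chance alone" is driven by the ratio of the two likelihoods, which is the essence of Edgeworth's reasoning.

```python
from scipy.stats import binom

# Hypothetical forced-choice data: 40 hits in 100 trials, chance rate 1/4.
n, k, p_chance = 100, 40, 0.25

lik_chance = binom.pmf(k, n, p_chance)   # P(data | chance alone)
lik_agency = binom.pmf(k, n, 0.40)       # assumed model for the "additional agency"

prior = 0.5                              # Edgeworth's 1/2 for each hypothesis
post_chance = prior * lik_chance / (prior * lik_chance + prior * lik_agency)
print(post_chance, 1 - post_chance)      # posterior for chance, and "against chance"
```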
Both the statistical arguments and the experi-
mental controls in these early experiments were
somewhat loose. For example, Edgeworth treated
as binomial an experiment in which one person
chose a string of eight letters and another at-
tempted to guess the string. Since it has long been
understood that people are poor random number (or
letter) generators, there is no statistical basis for
analyzing such an experiment. Nonetheless, Edge-
worth and his contemporaries set the stage for the
use of controlled experiments with statistical evalu-
ation in laboratory parapsychology. An interesting
historical account of Edgeworth's involvement and
the role telepathy experiments played in the early
history of randomization and experimental design
is provided by Hacking (1988).
One of the first American researchers to use statistical methods in parapsychology was John Edgar Coover, who was the Thomas Welton Stanford Psychical Research Fellow in the Psychology Department at Stanford University from 1912 to 1937 (Dommeyer, 1975). In 1917, Coover published a large volume summarizing his work (Coover, 1917). Coover believed that his results were consistent with chance, but others have argued that Coover's definition of significance was too strict (Dommeyer, 1975). For example, in one evaluation of his telepathy experiments, Coover found a two-tailed p-value of 0.0062. He concluded, "Since this value, then, lies within the field of chance deviation, although the probability of its occurrence by chance is fairly low, it cannot be accepted as a decisive indication of some cause beyond chance which operated in favor of success in guessing" (Coover, 1917, page 82). On the next page, he made it explicit that he would require a p-value of 0.0000221 to declare that something other than chance was operating.

It was during the summer of 1930, with the card-guessing experiments of J. B. Rhine at Duke University, that parapsychology began to take hold as a laboratory science. Rhine's laboratory still exists under the name of the Foundation for Research on the Nature of Man, housed at the edge of the Duke University campus.

It wasn't long after Rhine published his first book, Extrasensory Perception, in 1934, that the attacks on his methodology began. Since his claims were wholly based on statistical analyses of his experiments, the statistical methods were closely scrutinized by critics anxious to find a conventional explanation for Rhine's positive results.

The most persistent critic was a psychologist from McGill University named Chester Kellogg (Mauskopf and McVaugh, 1979). Kellogg's main argument was that Rhine was using the binomial distribution (and normal approximation) on a series of trials that were not independent. The experiments in question consisted of having a subject guess the order of a deck of 25 cards, with five each of five symbols, so technically Kellogg was correct.

By 1937, several mathematicians and statisticians had come to Rhine's aid. Mauskopf and McVaugh (1979) speculated that since statistics was itself a young discipline, "a number of statisticians were equally outraged by Kellogg, whose arguments they saw as discrediting their profession" (page 258). The major technical work, which acknowledged that Kellogg's criticisms were accurate but did little to change the significance of the results, was conducted by Charles Stuart and Joseph A. Greenwood and published in the first volume of the Journal of Parapsychology (Stuart
and Greenwood, 1937). Stuart, who had been an
undergraduate in mathematics at Duke, was one of
Rhine's early subjects and continued to work with
him as a researcher until Stuart's death in 1947.
Greenwood was a Duke mathematician, who appar-
ently converted to a statistician at the urging of
Rhine.
Another prominent figure who was distressed
with Kellogg's attack was E. V. Huntington, a
mathematician at Harvard. After corresponding
with Rhine, Huntington decided that, rather than
further confuse the public with a technical reply to
Kellogg's arguments, a simple statement should be
made to the effect that the mathematical issues in
Rhine's work had been resolved. Huntington must
have successfully convinced his former student,
Burton Camp of Wesleyan, that this was a wise
approach. Camp was the 1937 President of IMS.
When the annual meetings were held in December
of 1937 (jointly with AMS and AAAS), Camp
released a statement to the press that read:
Dr. Rhine's investigations have two aspects:
experimental and statistical. On the exper-
imental side mathematicians, of course,
have nothing to say. On the statistical side,
however, recent mathematical work has
established the fact that, assuming that the
experiments have been properly performed,
the statistical analysis is essentially valid. If
the Rhine investigation is to be fairly attacked,
it must be on other than mathematical grounds
[Camp, 1937].
One statistician who did emerge as a critic was
William Feller. In a talk at the Duke Mathemati-
cal Seminar on April 24, 1940, Feller raised three
criticisms of Rhine's work (Feller, 1940). They had
been raised before by others (and continue to be
raised even today). The first was that inadequate
shuffling of the cards resulted in additional infor-
mation from one series to the next. The second was
what is now known as the "file-drawer effect,"
namely, that if one combines the results of pub-
lished studies only, there is sure to be a bias in
favor of successful studies. The third was that the
results were enhanced by the use of optional stop-
ping, that is, by not specifying the number of trials
in advance. All three of these criticisms were ad-
dressed in a rejoinder by Greenwood and Stuart
(1940), but Feller was never convinced. Even in its
third edition published in 1968, his book An Intro-
duction to Probability Theory and Its Applications
still contains his conclusion about Greenwood and
Stuart: "Both their arithmetic and their experi-
ments have a distinct tinge of the supernatural"
(Feller, 1968, page 407). In his discussion of Feller's
position, Diaconis (1978) remarked, "I believe
Feller was confused ... he seemed to have decided
the opposition was wrong and that was that."
Several statisticians have contributed to the
literature in parapsychology to greater or lesser
degrees. T. N. E. Greville developed applicable
statistical methods for many of the experiments in
parapsychology and was Statistical Editor of the
Journal of Parapsychology (with J. A. Greenwood)
from its start in 1937 through Volume 31 in 1967;
Fisher (1924, 1929) addressed some specific prob-
lems in card-guessing experiments; Wilks (1965a, b)
described various statistical methods for parapsy-
chology; Lindley (1957) presented a Bayesian anal-
ysis of some parapsychology data; and Diaconis
(1978) pointed out some problems with certain ex-
periments and presented a method for analyzing
experiments when feedback is given.
Occasionally, attacks on parapsychology have
taken the form of attacks on statistical inference in
general, at least as it is applied to real data.
Spencer-Brown (1957) attempted to show that true
randomness is impossible, at least in finite se-
quences, and that this could be the explanation for
the results in parapsychology. That argument re-
emerged in a recent debate on the role of random-
ness in parapsychology, initiated by psychologist J.
Barnard Gilmore (Gilmore, 1989, 1990; Utts, 1989;
Palmer, 1989, 1990). Gilmore stated that "The ag-
nostic statistician, advising on research in psi,
should take account of the possible inappropriate-
ness of classical inferential statistics" (1989, page
338). In his second paper, Gilmore reviewed several
non-psi studies showing purportedly random sys-
tems that do not behave as they should under
randomness (e.g., Iversen, Longcor, Mosteller,
Gilbert and Youtz, 1971; Spencer-Brown, 1957).
Gilmore concluded that "Anomalous data ...
should not be found nearly so often if classical
statistics offers a valid model of reality" (1990,
page 54), thus rejecting the use of classical statisti-
cal inference for real-world applications in general.
3. REPLICATION
Implicit and explicit in the literature on parapsy-
chology is the assumption that, in order to truly
establish itself, the field needs to find a repeat-
able experiment. For example, Diaconis (1978)
started the summary of his article in Science with
the words "In search of repeatable ESP experi-
ments, modern investigators ... " (page 131). On
October 28-29, 1983, the 32nd International Con-
ference of the Parapsychology Foundation was held
in San Antonio, Texas, to address "The Repeatabil-
ity Problem in Parapsychology." The Conference
Proceedings (Shapin and Coly, 1985) reflect the
diverse views among parapsychologists on the na-
ture of the problem. Honorton (1985a) and Rao
(1985), for example, both argued that strict replica-
tion is uncommon in most branches of science and
that parapsychology should not be singled out as
unique in this regard. Other authors expressed
disappointment in the lack of a single repeatable
experiment in parapsychology, with titles such
as "Unrepeatability: Parapsychology's Only Find-
ing" (Blackmore, 1985), and "Research Strategies
for Dealing with Unstable Phenomena" (Beloff,
1985).
It has never been clear, however, just exactly
what would constitute acceptable evidence of a re-
peatable experiment. In the early days of investiga-
tion, the major critics "insisted that it would be
sufficient for Rhine and Soal to convince them of
ESP if a parapsychologist could perform success-
fully a single 'fraud-proof' experiment" (Hyman,
1985a, page 71). However, as soon as well-designed
experiments showing statistical significance
emerged, the critics realized that a single experi-
ment could be statistically significant just by
chance. British psychologist C. E. M. Hansel quan-
tified the new expectation, that the experiment
should be repeated a few times, as follows:
If a result is significant at the .01 level and
this result is not due to chance but to informa-
tion reaching the subject, it may be expected
that by making two further sets of trials the
antichance odds of one hundred to one will be
increased to around a million to one, thus en-
abling the effects of ESP-or whatever is re-
sponsible for the original result-to manifest
itself to such an extent that there will be little
doubt that the result is not due to chance
[Hansel, 1980, page 298].
In other words, three consecutive experiments at
p ≤ 0.01 would convince Hansel that something
other than chance was at work.
This argument implies that if a particular experi-
ment produces a statistically significant result, but
subsequent replications fail to attain significance,
then the original result was probably due to chance,
or at least remains unconvincing. The problem with
this line of reasoning is that there is no consid-
eration given to sample size or power. Only an
experiment with extremely high power should
be expected to be "successful" three times in
succession.
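To see how demanding Hansel's criterion is, the sketch below (with purely illustrative numbers, assuming SciPy is available) computes the exact binomial power of a single study at a one-tailed 0.01 level and then the probability that three such studies in succession would all reach that level.

```python
from scipy.stats import binom

def power_one_tailed(n, p0, p_true, alpha):
    """Exact power of a one-tailed binomial test of H0: p = p0."""
    # smallest number of hits k with P(X >= k | p0) <= alpha
    k_crit = next(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha)
    return binom.sf(k_crit - 1, n, p_true)

# Illustrative assumptions: 50 trials, chance rate 0.25, true hit rate 0.33.
single = power_one_tailed(n=50, p0=0.25, p_true=0.33, alpha=0.01)
print(single, single ** 3)   # power of one study, and of three in a row
```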
It is perhaps a failure of the way statistics is
taught that many scientists do not understand the
importance of power in defining successful replica-
tion. To illustrate this point, psychologists Tversky
and Kahneman (1982) distributed a questionnaire
to their colleagues at a professional meeting, with
the question:
An investigator has reported a result that you
consider implausible. He ran 15 subjects, and
reported a significant value, t = 2.46. Another
investigator has attempted to duplicate his pro-
cedure, and he obtained a nonsignificant value
of t with the same number of subjects. The
direction was the same in both sets of data.
You are reviewing the literature. What is the
highest value of t in the second set of data that
you would describe as a failure to replicate?
[1982, page 28].
In reporting their results, Tversky and Kahneman
stated:
The majority of our respondents regarded t =
1.70 as a failure to replicate. If the data of two
such studies (t = 2.46 and t = 1.70) are pooled,
the value of t for the combined data is about
3.00 (assuming equal variances). Thus, we are
faced with a paradoxical state of affairs, in
which the same data that would increase our
confidence in the finding when viewed as part
of the original study, shake our confidence
when viewed as an independent study [1982,
page 28].
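The pooling arithmetic behind this point is simple. With equal sample sizes and equal variances, and ignoring the small change in degrees of freedom, the t statistic for the combined data is roughly (t1 + t2)/sqrt(2):

```python
import math

t1, t2 = 2.46, 1.70
print((t1 + t2) / math.sqrt(2))   # about 2.94, i.e. "about 3.00"
```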
At a recent presentation to the History and Phi-
losophy of Science Seminar at the University of
California at Davis, I asked the following question.
Two scientists, Professors A and B, each have a
theory they would like to demonstrate. Each plans
to run a fixed number of Bernoulli trials and then
test H0: p = 0.25 versus Ha: p > 0.25. Professor A
has access to large numbers of students each
semester to use as subjects. In his first experiment,
he runs 100 subjects, and there are 33 successes
(p = 0.04, one-tailed). Knowing the importance of
replication, Professor A runs an additional 100 sub-
jects as a second experiment. He finds 36 successes
(p = 0.009, one-tailed).
Professor B only teaches small classes. Each
quarter, she runs an experiment on her students to
test her theory. She carries out ten studies this
way, with the results in Table 1.
I asked the audience by a show of hands to
indicate whether or not they felt the scientists had
successfully demonstrated their theories. Professor
A's theory received overwhelming support, with
approximately 20 votes, while Professor B's theory
received only one vote.
If you aggregate the results of the experiments
for each professor, you will notice that each con-
ducted 200 trials, and Professor B actually demon-
strated a higher level of success than Professor A,
with 71 as opposed to 69 successful trials. The
one-tailed p-values for the combined trials are
0.0017 for Professor A and 0.0006 for Professor B.
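The combined p-values quoted above can be checked with an exact binomial calculation (SciPy assumed); each professor's 200 trials are pooled against the chance success probability of 0.25.

```python
from scipy.stats import binom

p_A = binom.sf(69 - 1, 200, 0.25)   # Professor A: 69 successes in 200 trials
p_B = binom.sf(71 - 1, 200, 0.25)   # Professor B: 71 successes in 200 trials
print(p_A, p_B)                     # close to the quoted 0.0017 and 0.0006
```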
To address the question of replication more ex-
plicitly, I also posed the following scenario. In
December of 1987, it was decided to prematurely
terminate a study on the effects of aspirin in reduc-
ing heart attacks because the data were so convinc-
ing (see, e.g., Greenhouse and Greenhouse, 1988;
Rosenthal, 1990a). The physician-subjects had been
randomly assigned to take aspirin or a placebo.
There were 104 heart attacks among the 11,037
subjects in the aspirin group, and 189 heart attacks
among the 11,034 subjects in the placebo group
(chi-square = 25.01, p < 0.00001).
After showing the results of that study, I pre-
sented the audience with two hypothetical experi-
ments conducted to try to replicate the original
result, with outcomes in Table 2.
I asked the audience to indicate which one they
thought was a more successful replication. The au-
dience chose the second one, as would most journal
editors, because of the "significant p-value." In
fact, the first replication has almost exactly the
same proportion of heart attacks in the two groups
as the original study and is thus a very close repli-
cation of that result. The second replication has
TABLE 1
Attempted replications for Professor B

| n | Number of successes | One-tailed p-value |
|---|---|---|
| 10 | 4 | 0.22 |
| 15 | 6 | 0.15 |
| 17 | 6 | 0.23 |
| 25 | 8 | 0.17 |
| 30 | 10 | 0.20 |
| 40 | 13 | 0.18 |
| 18 | 7 | 0.14 |
| 10 | 5 | 0.08 |
| 15 | 5 | 0.31 |
| 20 | 7 | 0.21 |
TABLE 2
Hypothetical replications of the aspirin/heart attack study

| | Replication #1: Heart attack | No heart attack | Replication #2: Heart attack | No heart attack |
|---|---|---|---|---|
| Aspirin | 11 | 1156 | 20 | 2314 |
| Placebo | 19 | 1090 | 48 | 2170 |
| Chi-square | 2.596, p = 0.11 | | 13.206, p = 0.0003 | |
very different proportions, and in fact the relative
risk from the second study is not even contained in
a 95% confidence interval for relative risk from the
original study. The magnitude of the effect has
been much more closely matched by the "nonsig-
nificant" replication.
Fortunately, psychologists are beginning to no-
tice that replication is not as straightforward as
they were originally led to believe. A special issue
of the Journal of Social Behavior and Personality
was entirely devoted to the question of replication
(Neuliep, 1990). In one of the articles, Rosenthal
cautioned his colleagues: "Given the levels of sta-
tistical power at which we normally operate, we
have no right to expect the proportion of significant
results that we typically do expect, even if in na-
ture there is a very real and very important effect"
(Rosenthal, 1990b, page 16).
Jacob Cohen, in his insightful article titled
"Things I Have Learned (So Far)," identified an-
other misconception common among social scien-
tists: "Despite widespread misconceptions to the
contrary, the rejection of a given null hypothesis
gives us no basis for estimating the probability that
a replication of the research will again result in
rejecting that null hypothesis" (Cohen, 1990, page
1307).
Cohen and Rosenthal both advocate the use of
effect sizes as opposed to significance levels when
defining the strength of an experimental effect. In
general, effect sizes measure the amount by which
the data deviate from the null hypothesis in terms
of standardized units. For instance, the effect size
for a two-sample t-test is usually defined to be the
difference in the two means, divided by the stan-
dard deviation for the control group. This measure
can be compared across studies without the depen-
dence on sample size inherent in significance lev-
els. (Of course there will still be variability in the
sample effect sizes, decreasing as a function of sam-
ple size.) Comparison of effect sizes across studies is
one of the major components of meta-analysis.
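As a minimal illustration of this effect size measure, using made-up data for a two-sample design:

```python
import numpy as np

control = np.array([3.1, 2.8, 3.5, 3.0, 2.9, 3.3])
treatment = np.array([3.6, 3.4, 3.9, 3.2, 3.8, 3.5])

# Difference in means divided by the control group's standard deviation.
effect_size = (treatment.mean() - control.mean()) / control.std(ddof=1)
print(effect_size)
```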
Similar arguments have recently been made in
the medical literature. For example, Gardner and
Altman (1986) stated that the use of p-values "to
define two alternative outcomes-significant and
not significant-is not helpful and encourages lazy
thinking" (page 746). They advocated the use of
confidence intervals instead.
As discussed in the next section, the arguments
used to conclude that parapsychology has failed to
demonstrate a replicable effect hinge on these mis-
conceptions of replication and failure to examine
power. A more appropriate analysis would compare
the effect sizes for similar experiments across ex-
perimenters and across time to see if there have
been consistent effects of the same magnitude.
Rosenthal also advocates this view of replication:
The traditional view of replication focuses on
significance level as the relevant summary
statistic of a study and evaluates the success of
a replication in a dichotomous fashion. The
newer, more useful view of replication focuses
on effect size as the more important summary
statistic of a study and evaluates the success of
a replication not in a dichotomous but in a
continuous fashion [Rosenthal, 1990b, page 281.
The dichotomous view of replication has been
used throughout the history of parapsychology, by
both parapsychologists and critics (Utts, 1988). For
example, the National Academy of Sciences report
critically evaluated "significant" experiments, but
entirely ignored "nonsignificant" experiments.
In the next three sections, we will examine some
of the results in parapsychology using the broader,
more appropriate definition of replication. In doing
so, we will show that the results are far more
interesting than the critics would have us believe.
4. THE GANZFELD DEBATE IN
PARAPSYCHOLOGY
An extensive debate took place in the mid-1980s
between a parapsychologist and critic, questioning
whether or not a particular body of parapsychologi-
cal data had demonstrated psi abilities. The experi-
ments in question were all conducted using the
ganzfeld setting (described below). Several authors
were invited to write commentaries on the debate.
As a result, this data base has been more thor-
oughly analyzed by both critics and proponents
than any other and provides a good source for
studying replication in parapsychology.
The debate concluded with a detailed series of
recommendations for further experiments, and left
open the question of whether or not psi abilities
had been demonstrated. A new series of experi-
ments that followed the recommendations were
conducted over the next few years. The results of
the new experiments will be presented in Section 5.
4.1 Free-Response Experiments
Recent experiments in parapsychology tend to
use more complex target material than the cards
and dice used in the early investigations, partially
to alleviate boredom on the part of the subjects and
partially because they are thought to "more nearly
resemble the conditions of spontaneous psi occur-
rences" (Burdick and Kelly, 1977, page 109). These
experiments fall under the general heading of
"free-response" experiments, because the subject is
asked to give a verbal or written description of the
target, rather than being forced to make a choice
from a small discrete set of possibilities. Various
types of target material have been used, including
pictures, short segments of movies on video tapes,
actual locations and small objects.
Despite the more complex target material, the
statistical methods used to analyze these experi-
ments are similar to those for forced-choice experi-
ments. A typical experiment proceeds as follows.
Before conducting any trials, a large pool of poten-
tial targets is assembled, usually in packets of four.
Similarity of targets within a packet is kept to a
minimum, for reasons made clear below. At the
start of an experimental session, after the subject is
sequestered in an isolated room, a target is selected
at random from the pool. A sender is placed in
another room with the target. The subject is asked
to provide a verbal or written description of what
he or she thinks is in the target, knowing only that
it is a photograph, an object, etc.
After the subject's description has been recorded
and secured against the potential for later alter-
ation, a judge (who may or may not be the subject)
is given a copy of the subject's description and the
four possible targets that were in the packet with
the correct target. A properly conducted experi-
ment either uses video tapes or has two identical
sets of target material and uses the duplicate set
for this part of the process, to ensure that clues
such as fingerprints don't give away the answer.
Based on the subject's description, and of course on
a blind basis, the judge is asked to either rank the
four choices from most to least likely to have been
the target, or to select the one from the four that
seems to best match the subject's description. If
ranks are used, the statistical analysis proceeds by
summing the ranks over a series of trials and
comparing the sum to what would be expected by
chance. If the selection method is used, a "direct
hit" occurs if the correct target is chosen, and the
number of direct hits over a series of trials is
compared to the number expected in a binomial
experiment with p = 0.25.
Note that the subjects' responses cannot be con-
sidered to be "random" in any sense, so probability
assessments are based on the random selection of
the target and decoys. In a correctly designed ex-
periment, the probability of a direct hit by chance
is 0.25 on each trial, regardless of the response, and
the trials are independent. These and other issues
related to analyzing free-response experiments are
discussed by Utts (1991).
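A minimal sketch of the two analyses just described, for a hypothetical series of ten free-response trials judged against four possible targets (SciPy assumed):

```python
import math
from scipy.stats import binom, norm

ranks = [1, 3, 2, 1, 4, 2, 1, 3, 1, 2]   # rank assigned to the correct target (1 = best)
n = len(ranks)

# Sum-of-ranks analysis: under the null each rank is uniform on {1, 2, 3, 4},
# so the sum has mean 2.5n and variance 1.25n; low sums indicate success.
z = (2.5 * n - sum(ranks)) / math.sqrt(1.25 * n)
p_ranks = norm.sf(z)

# Direct-hit analysis: a hit is a first-place rank, binomial with p = 0.25.
hits = sum(r == 1 for r in ranks)
p_hits = binom.sf(hits - 1, n, 0.25)

print(p_ranks, p_hits)
```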
4.2 The Psi Ganzfeld Experiments
The ganzfeld procedure is a particular kind of
free-response experiment utilizing a perceptual
isolation technique originally developed by Gestalt
psychologists for other purposes. Evidence from
spontaneous case studies and experimental work
had led parapsychologists to a model proposing that
psychic functioning may be masked by sensory in-
put and by inattention to internal states (Honorton,
1977). The ganzfeld procedure was specifically de-
signed to test whether or not reduction of external
"noise" would enhance psi performance.
In these experiments, the subject is placed in a
comfortable reclining chair in an acoustically
shielded room. To create a mild form of sensory
deprivation, the subject wears headphones through
which white noise is played, and stares into a
constant field of red light. This is achieved by
taping halved translucent ping-pong balls over the
eyes and then illuminating the room with red light.
In the psi ganzfeld experiments, the subject speaks
into a microphone and attempts to describe the
target material being observed by the sender in a
distant room.
At the 1982 Annual Meeting of the Parapsycho-
logical Association, a debate took place over the
degree to which the results of the psi ganzfeld
experiments constituted evidence of psi abilities.
Psychologist and critic Ray Hyman and parapsy-
chologist Charles Honorton each analyzed the re-
sults of all known psi ganzfeld experiments to date,
and they reached strikingly different conclusions
(Honorton, 1985b; Hyman, 1985b). The debate con-
tinued with the publication of their arguments in
separate articles in the March 1985 issue of the
Journal of Parapsychology. Finally, in the Decem-
ber 1986 issue of the Journal of Parapsychology,
Hyman and Honorton (1986) wrote a joint article
in which they highlighted their agreements and
disagreements and outlined detailed criteria for
future experiments. That same issue contained
commentaries on the debate by 10 other authors.
The data base analyzed by Hyman and Honorton
(1986) consisted of results taken from 34 reports
written by a total of 47 authors. Honorton counted
42 separate experiments described in the reports, of
which 28 reported enough information to determine
the number of direct hits achieved. Twenty-three of
the studies (55%) were classified by Honorton as
having achieved statistical significance at 0.05.
4.3 The Vote-Counting Debate
Vote-counting is the term commonly used for the
technique of drawing inferences about an experi-
mental effect by counting the number of significant
versus nonsignificant studies of the effect. Hedges
and Olkin (1985) give a detailed analysis of the
inadequacy of this method, showing that it is more
and more likely to make the wrong decision as the
number of studies increases. While Hyman ac-
knowledged that "vote-counting raises many prob-
lems" (Hyman, 1985b, page 8), he nonetheless spent
half of his critique of the ganzfeld studies showing
why Honorton's count of 55% was wrong.
Hyman's first complaint was that several of the
studies contained multiple conditions, each of which
should be considered as a separate study. Using
this definition he counted 80 studies (thus further
reducing the sample sizes of the individual studies),
of which 25 (31%) were "successful." Honorton's
response to this was to invite readers to examine
the studies and decide for themselves if the varying
conditions constituted separate experiments.
Hyman next postulated that there was selection
bias, so that significant studies were more likely to
be reported. He raised some important issues about
how pilot studies may be terminated and not re-
ported if they don't show significant results, or may
at least be subject to optional stopping, allowing
the experimenter to determine the number of tri-
als. He also presented a chi-square analysis that
"suggests a tendency to report studies with a small
sample only if they have significant results"
(Hyman, 1985b, page 14), but I have questioned his
analysis elsewhere (Utts, 1986, page 397).
Honorton refuted Hyman's argument with four
rejoinders (Honorton, 1985b, page 66). In addition
to reinterpreting Hyman's chi-square analysis,
Honorton pointed out that the Parapsychological
Association has an official policy encouraging the
publication of nonsignificant results in its journals
and proceedings, that a large number of reported
ganzfeld studies did not achieve statistical signifi-
cance and that there would have to be 15 studies in
the "file-drawer" for every one reported to cancel
out the observed significant results.
The remainder of Hyman's vote-counting analy-
sis consisted of showing that the effective error rate
for each study was actually much higher than the
nominal 5%. For example, each study could have
been analyzed using the direct hit measure, the
sum of ranks measure or one of two other measures
used for free-response analyses. Hyman carried out
a simulation study that showed the true error rate
would be 0.22 if "significance" was defined by re-
quiring at least one of these four measures to
achieve the 0.05 level. He suggested several other
ways in which multiple testing could occur and
concluded that the effective error rate in each ex-
periment was not the nominal 0.05, but rather was
probably close to the 31% he had determined to be
the actual success rate in his vote-count.
Honorton acknowledged that there was a multi-
ple testing problem, but he had a two-fold response.
First, he applied a Bonferroni correction and found
that the number of significant studies (using his
definition of a study) only dropped from 55% to
45%. Next, he proposed that a uniform index of
success be applied to all studies. He used the num-
ber of direct hits, since it was by far the most
commonly reported measure and was the measure
used in the first published psi ganzfeld study. He
then conducted a detailed analysis of the 28 studies
reporting direct hits and found that 43% were sig-
nificant at 0.05 on that measure alone. Further, he
showed that significant effects were reported by six
of the 10 independent investigators and thus were
not due to just one or two investigators or laborato-
ries. He also noted that success rates were very
similar for reports published in refereed journals
and those published in unrefereed monographs and
abstracts.
While Hyman's arguments identified issues such
as selective reporting and optional stopping that
should be considered in any meta-analysis, the de-
pendence of significance levels on sample size makes
the vote-counting technique almost useless for as-
sessing the magnitude of the effect. Consider, for
example, the 24 studies where the direct hit meas-
ure was reported and the chance probability of a
direct hit was 0.25, the most common type of study
in the data base. (There were four direct hit studies
with other chance probabilities and 14 that did not
report direct hits.) Of the 24 studies, 13 (54%) were
"nonsignificant" at a = 0.05, one-tailed. But if the
367 trials in these "failed replications" are com-
bined, there are 106 direct hits, z = 1.66, and p =
0.0485, one-tailed. This is reminiscent of the
dilemma of Professor B in Section 3.
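The pooled figure quoted above is easily reproduced with a normal approximation and continuity correction (SciPy assumed):

```python
import math
from scipy.stats import norm

n, hits, p0 = 367, 106, 0.25    # combined trials of the "failed replications"
z = (hits - 0.5 - n * p0) / math.sqrt(n * p0 * (1 - p0))
print(z, norm.sf(z))            # about 1.66 and 0.0485, one-tailed
```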
Power is typically very low for these studies. The
median sample size for the studies reporting direct
hits was 28. If there is a real effect and it increases
the success probability from the chance 0.25 to
an actual 0.33 (a value whose rationale will be
made clear below), the power for a study with 28
trials is only 0.181 (Utts, 1986). It should be no
surprise that there is a "repeatability" problem in
parapsychology.
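The power figure can be checked with an exact binomial calculation (SciPy assumed):

```python
from scipy.stats import binom

n, p0, p1, alpha = 28, 0.25, 0.33, 0.05   # median study size, chance and assumed true rates
k_crit = next(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha)
power = binom.sf(k_crit - 1, n, p1)
print(k_crit, power)                       # power is about 0.18
```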
4.4 Flaw Analysis and Future Recommendations
The second half of Hyman's paper consisted of a
"Meta-Analysis of Flaws and Successful Outcomes"
(1985b, page 30), designed to explore whether or
not various measures of success were related to
specific flaws in the experiments. While many crit-
ics have argued that the results in parapsychology
can be explained by experimental flaws, Hyman's
analysis was the first to attempt to quantify the
relationship between flaws and significant results.
Hyman identified 12 potential flaws in the
ganzfeld experiments, such as inadequate random-
ization, multiple tests used without adjusting the
significance level (thus inflating the significance
level from the nominal 5%) and failure to use a
duplicate set of targets for the judging process (thus
allowing possible clues such as fingerprints). Using
cluster and factor analyses, the 12 binary flaw
variables were combined into three new variables,
which Hyman named General Security, Statistics
and Controls.
Several analyses were then conducted. The one
reported with the most detail is a factor analysis
utilizing 17 variables for each of 36 studies. Four
factors emerged from the analysis. From these,
Hyman concluded that security had increased over
the years, that the significance level tended to be
inflated the most for the most complex studies and
that both effect size and level of significance were
correlated with the existence of flaws.
Following his factor analysis, Hyman picked the
three flaws that seemed to be most highly corre-
lated with success, which were inadequate atten-
tion to both randomization and documentation and
the potential for ordinary communication between
the sender and receiver. A regression equation was
then computed using each of the three flaws as
dummy variables, and the effect size for the experi-
ment as the dependent variable. From this equa-
tion, Hyman concluded that a study without these
three flaws would be predicted to have a hit rate of
27%. He concluded that this is "well within the
statistical neighborhood of the 25% chance rate"
(1985b, page 37), and thus "the ganzfeld psi data
base, despite initial impressions, is inadequate ei-
ther to support the contention of a repeatable study
or to demonstrate the reality of psi" (page 38).
Honorton discounted both Hyman's flaw classifi-
cation and his analysis. He did not deny that flaws
existed, but he objected that Hyman's analysis was
faulty and impossible to interpret. Honorton asked
psychometrician David Saunders to write an Ap-
pendix to his article, evaluating Hyman's analysis.
Saunders first criticized Hyman's use of a factor
analysis with 17 variables (many of which were
dichotomous) and only 36 cases and concluded that
"the entire analysis is meaningless" (Saunders,
1985, page 87). He then noted that Hyman's choice
of the three flaws to include in his regression anal-
ysis constituted a clear case of multiple analysis,
since there were 84 possible sets of three that could
have been selected (out of nine potential flaws), and
Hyman chose the set most highly correlated with
effect size. Again, Saunders concluded that "any
interpretation drawn from [the regression analysis]
must be regarded as meaningless" (1985, page 88).
Hyman's results were also contradicted by Harris
and Rosenthal (1988b) in an analysis requested by
Hyman in his capacity as Chair of the National
Academy of Sciences' Subcommittee on Parapsy-
chology. Using Hyman's flaw classifications and a
multivariate analysis, Harris and Rosenthal con-
cluded that "Our analysis of the effects of flaws on
study outcome lends no support to the hypothesis
that ganzfeld research results are a significant
function of the set of flaw variables" (1988b,
page 3).
Hyman and Honorton were in the process of
preparing papers for a second round of debate when
they were invited to lunch together at the 1986
Meeting of the Parapsychological Association. They
discovered that they were in general agreement on
several major issues, and they decided to coauthor
a "Joint Communique" (Hyman and Honorton,
1986). It is clear from their paper that they both
thought it was more important to set the stage for
future experimentation than to continue the techni-
cal arguments over the current data base. In the
abstract to their paper, they wrote:
We agree that there is an overall significant
effect in this data base that cannot reasonably
be explained by selective reporting or multiple
analysis. We continue to differ over the degree
to which the effect constitutes evidence for psi,
but we agree that the final verdict awaits the
outcome of future experiments conducted by a
broader range of investigators and according to
more stringent standards [page 351].
The paper then outlined what these standards
should be. They included controls against any kind
of sensory leakage, thorough testing and documen-
tation of randomization methods used, better re-
porting of judging and feedback protocols, control
for multiple analyses and advance specification of
number of trials and type of experiment. Indeed,
any area of research could benefit from such a
careful list of procedural recommendations.
4.5 Rosenthal's Meta-Analysis
The same issue of the Journal of Parapsychology
in which the Joint Communique appeared also car-
ried commentaries on the debate by 10 separate
authors. In his commentary, psychologist Robert
Rosenthal, one of the pioneers of meta-analysis in
psychology, summarized the aspects of Hyman's
and Honorton's work that would typically be in-
cluded in a meta-analysis (Rosenthal, 1986). It is
worth reviewing Rosenthal's results so that they
can be used as a basis of comparison for the more
recent psi ganzfeld studies reported in Section 5.
Rosenthal, like Hyman and Honorton, focused
only on the 28 studies for which direct hits were
known. He chose to use an effect size measure
called Cohen's h, which is the difference between
the arcsin-transformed proportions of direct hits
that were observed and expected:

h = 2(arcsin √p̂ − arcsin √p),

where p̂ is the observed and p the expected (chance) proportion of direct hits.
One advantage of this measure over the difference
in raw proportions is that it can be used to compare
experiments with different chance hit rates.
If the observed and expected numbers of hits
were identical, the effect size would be zero. Of the
28 studies, 23 (82%) had effect sizes greater than
zero, with a median effect size of 0.32 and a mean
of 0.28. These correspond to direct hit rates of 0.40
and 0.38 respectively, when 0.25 is expected by
chance. A 95% confidence interval for the true
effect size is from 0.11 to 0.45, corresponding to
direct hit rates of from 0.30 to 0.46 when chance is
0.25.
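The correspondence between effect sizes and hit rates quoted here follows from inverting the arcsine transformation; a minimal sketch with a chance rate of 0.25:

```python
import math

def hit_rate(h, p_chance=0.25):
    """Direct hit rate corresponding to Cohen's h relative to the chance rate."""
    return math.sin(math.asin(math.sqrt(p_chance)) + h / 2) ** 2

for h in (0.32, 0.28, 0.11, 0.45):
    print(h, round(hit_rate(h), 2))   # 0.40, 0.38, 0.30, 0.46, as in the text
```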
A common technique in meta-analysis is to calcu-
late a "combined z," found by summing the indi-
vidual z scores and dividing by the square root of
the number of studies. The result should have a
standard normal distribution if each z score has a
standard normal distribution. For the ganzfeld
studies, Rosenthal reported a combined z of 6.60
with a p-value of 3.37 × 10^-11. He also reiterated
Honorton's file-drawer assessment by calculating
that there would have to be 423 studies unreported
to negate the significant effect in the 28 direct hit
studies.
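Both figures can be recovered from the numbers given, assuming the usual Stouffer combination of z scores and Rosenthal's fail-safe-N formula with a one-tailed 0.05 criterion:

```python
import math

z_combined, k = 6.60, 28
sum_z = z_combined * math.sqrt(k)        # Stouffer: z_combined = sum(z) / sqrt(k)
fail_safe_n = (sum_z / 1.645) ** 2 - k   # unreported studies needed to reach p = 0.05
print(round(fail_safe_n))                # about 423
```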
Finally, Rosenthal acknowledged that, because of
the flaws in the data base and the potential for at
least a small file-drawer effect, the true average
effect size was probably closer to 0.18 than 0.28. He
concluded, "Thus, when the accuracy rate expected
under the null is 1/4, we might estimate the ob-
tained accuracy rate to be about 1/3" (1986, page
333). This is the value used for the earlier power
calculation.
It is worth mentioning that Rosenthal was com-
missioned by the National Academy of Sciences to
prepare a background paper to accompany its 1988
report on parapsychology. That paper (Harris and
Rosenthal, 1988a) contained much of the same
analysis as his commentary summarized above.
Ironically, the discussion of the ganzfeld work in
the National Academy Report focused on Hyman's
1985 analysis, but never mentioned the work it had
commissioned Rosenthal to perform, which contra-
dicted the final conclusion in the report.
5. A META-ANALYSIS OF RECENT GANZFELD
EXPERIMENTS
After the initial exchange with Hyman at
the 1982 Parapsychological Association Meeting,
Honorton and his colleagues developed an auto-
mated ganzfeld experiment that was designed to
eliminate the methodological flaws identified by
Hyman. The execution and reporting of the experi-
ments followed the detailed guidelines agreed upon
by Hyman and Honorton.
Using this "autoganzfeld" experiment, 11 experi-
mental series were conducted by eight experi-
menters between February 1983 and September
1989, when the equipment had to be dismantled
due to lack of funding. In this section, the results
of these experiments are summarized and com-
pared to the earlier ganzfeld studies. Much of the
information is derived from Honorton et al. (1990).
Like earlier ganzfeld studies, the "autoganzfeld"
experiments require four participants. The first is
the Receiver (R), who attempts to identify the tar-
get material being observed by the Sender (S). The
Experimenter (E) prepares R for the task, elicits
the response from R and supervises R's judging of
the response against the four potential targets.
(Judging is double blind; E does not know which is
the correct target.) The fourth participant is the lab
assistant (LA) whose only task is to instruct the
computer to randomly select the target. No one
involved in the experiment knows the identity of
the target.
Both R and S are sequestered in sound-isolated,
electrically shielded rooms. R is prepared as in
earlier ganzfeld studies, with white noise and a
field of red light. In a nonadjacent room, S watches
the target material on a television and can hear R's
target description ("mentation") as it is being
given. The mentation is also tape recorded.
The judging process takes place immediately af-
ter the 30-minute sending period. On a TV monitor
in the isolated room, R views the four choices from
the target pack that contains the actual target. R is
asked to rate each one according to how closely it
matches the ganzfeld mentation. The ratings are
converted to ranks and, if the correct target is
ranked first, a direct hit is scored. The entire proc-
ess is automatically recorded by the computer. The
computer then displays the correct choice to R as
feedback.
There were 160 preselected targets, used with
replacement, in 10 of the 11 series. They were
arranged in packets of four, and the decoys for a
given target were always the remaining three in
the same set. Thus, even if a particular target in a
set were consistently favored by Rs, the probability
of a direct hit under the null hypothesis would
remain at 1/4. Popular targets should be no more
likely to be selected by the computer's random
number generator than any of the others in the set.
The selection of the target by the computer is the
only source of randomness in these experiments.
This is an important point, and one that is often
misunderstood. (See Utts, 1991, for elucidation.)
Eighty of the targets were "dynamic," consisting
of scenes from movies, documentaries and cartoons;
80 were "static," consisting of photographs, art
prints and advertisements. The four targets within
each set were all of the same type. Earlier studies
indicated that dynamic targets were more likely to
produce successful results, and one of the goals of
the new experiments was to test that theory.
The randomization procedure used to select the
target and the order of presentation for judging was
thoroughly tested before and during the experi-
ments. A detailed description is given by Honorton
et al. (1990, pages 118-120).
Three of the 11 series were pilot series, five were
formal series with novice receivers, and three were
formal series with experienced receivers. The last
series with experienced receivers was the only one
that did not use the 160 targets. Instead, it used
only one set of four dynamic targets in which one
target had previously received several first place
ranks and. one had never received a first place
rank. The receivers, none of whom had had prior
exposure to that target pack, were not aware that
only one target pack was being used. They each
contributed one session only to the series. This will
be called the "special series" in what follows.
Except for two of the pilot series, numbers of
trials were planned in advance for each series.
Unfortunately, three of the formal series were not
yet completed when the funding ran out, including
the special series, and one pilot study with advance
planning was terminated early when the experi-
menter relocated. There were no unreported trials
during the 6-year period under review, so there was
no "file drawer."
Overall, there were 183 Rs who contributed only
one trial and 58 who contributed more than one, for
a total of 241 participants and 355 trials. Only 23
Rs had previously participated in ganzfeld experi-
ments, and 194 Rs (81%) had never participated in
any parapsychological research.
While acknowledging that no probabilistic con-
clusions can be drawn from qualitative data, Hon-
orton et al. (1990) included several examples of
session excerpts that Rs identified as providing the
basis for their target rating. To give a flavor for the
dream-like quality of the mentation and the amount
of information that can be lost by only assigning a
rank, the first example is reproduced here. The
target was a painting by Salvador Dali called
"Christ Crucified." The correct target received a
first place rank. The part of the mentation R used
to make this assessment read:
... I think of guides, like spirit guides, leading
me and I come into a court with a king. It's
quiet .... It's like heaven. The king is some-
thing like Jesus. Woman. Now I'm just sort of
summersaulting through heaven ... .
Brooding .... Aztecs, the Sun God .... High
priest ....Fear . . . . Graves. Woman.
Prayer . . . . Funeral . . . . Dark.
Death ... Souls .... Ten Commandments.
Moses .... [Honorton et al., 1990].
Over all 11 series, there were 122 direct hits in
the 355 trials, for a hit rate of 34.4% (exact bino-
mial p-value = 0.00005) when 25% were expected
by chance. Cohen's h is 0.20, and a 95% confidence
interval for the overall hit rate is from 0.30 to 0.39.
This calculation assumes, of course, that the proba-
bility of a direct hit is constant and independent
across trials, an assumption that may be question-
able except under the null hypothesis of no psi
abilities.
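These summary figures can be reproduced directly from the counts. The sketch below assumes a recent version of SciPy for the exact binomial test and uses the simple normal approximation for the confidence interval:

```python
import math
from scipy.stats import binomtest

hits, n, p0 = 122, 355, 0.25
p_hat = hits / n

exact_p = binomtest(hits, n, p0, alternative="greater").pvalue
h = 2 * (math.asin(math.sqrt(p_hat)) - math.asin(math.sqrt(p0)))   # Cohen's h
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

print(exact_p, round(h, 2), [round(x, 2) for x in ci])   # ~0.00005, 0.20, [0.30, 0.39]
```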
Honorton et al. (1990) also calculated effect sizes
for each of the 11 series and each of the eight
experimenters. All but one of the series (the first
novice series) had positive effect sizes, as did all of
the experimenters.
The special series with experienced Rs had an
exceptionally high effect size with h = 0.81, corre-
sponding to 16 direct hits out of 25 trials (64%), but
the remaining series and the experimenters had
relatively homogeneous effect sizes given the
amount of variability expected by chance. If the
special series is removed, the overall hit rate is
32.1%, h = 0.16. Thus, the positive effects are not
due to just one series or one experimenter.
Of the 218 trials contributed by novices, 71 were
direct hits (32.5%, h = 0.17), compared with 51
hits in the 137 trials by those with prior ganzfeld
experience (37%, h = 0.26). The hit rates and effect
sizes were 31% (h = 0.14) for the combined pilot
series, 32.5% (h = 0.17) for the combined formal
novice series, and 41.5% (h = 0.35) for the com-
bined experienced series. The last figure drops to
31.6% if the outlier series is removed. Finally,
without the outlier series the hit rate for the com-
bined series where all of the planned trials were
completed was 31.2% (h = 0.14), while it was 35%
(h = 0.22) for the combined series that were termi-
nated early. Thus, optional stopping cannot
account for the positive effect.
There were two interesting comparisons that had
been suggested by earlier work and were pre-
planned in these experiments. The first was to
compare results for trials with dynamic targets
with those for static targets. In the 190 dynamic
target sessions there were 77 direct hits (40%, h =
0.32) and for the static targets there were 45 hits
in 165 trials (27%, h = 0.05), thus indicating
that dynamic targets produced far more successful
results.
The second comparison of interest was whether
or not the sender was a friend of the receiver. This
was a choice the receiver could make. If he or she
did not bring a friend, a lab member acted as
sender. There were 211 trials with friends as
senders (some of whom were also lab staff), result-
ing in 76 direct hits (36%, h = 0.24). Four trials
used no sender. The remaining 140 trials used
nonfriend lab staff as senders and resulted in 46
direct hits (33%, h = 0.18). Thus, trials with friends
as senders were slightly more successful than those
without.
Consonant with the definition of replication based
on consistent effect sizes, it is informative to com-
pare the autoganzfeld experiments with the direct
hit studies in the previous data base. The overall
success rates are extremely similar. The overall
direct hit rate was 34.4% for the autoganzfeld stud-
ies and was 38% for the comparable direct hit
studies in the earlier meta-analysis. Rosenthal's
(1986) adjustment for flaws had placed a more con-
servative estimate at 33%, very close to the
observed 34.4% in the new studies.
One limitation of this work is that the auto-
ganzfeld studies, while conducted by eight experi-
menters, all used the same equipment in the same
laboratory. Unfortunately, the level of fund-
ing available in parapsychology and the cost in
time and equipment to conduct proper experiments
make it difficult to amass large amounts of data
across laboratories. Another autoganzfeld labora-
tory is currently being constructed at the Univer-
sity of Edinburgh in Scotland, so interlaboratory
comparisons may be possible in the near future.
Based on the effect size observed to date, large
samples are needed to achieve reasonable power. If
there is a constant effect across all trials, resulting
in 33% direct hits when 25% are expected by chance,
to achieve a one-tailed significance level of 0.05
with 95% probability would require 345 sessions.
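The 345-session figure can be reproduced, at least approximately, with a standard normal-approximation sample-size calculation for a one-sample test of a proportion (a sketch under the stated assumptions; the original calculation's exact method is not given here):

```python
from math import sqrt
from scipy.stats import norm

p0, p1 = 0.25, 0.33            # chance rate and assumed constant true hit rate
alpha, power = 0.05, 0.95
z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)

# Normal approximation with unequal variances under the null and the alternative
n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2
print(round(n))                # about 345 sessions
```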
We end this section by returning to the aspirin
and heart attack example in Section 3 and expand-
ing a comparison noted by Atkinson, Atkinson,
Smith and Bem (1990, page 237). Computing the
equivalent of Cohen's h for comparing obser-
ved heart attack rates in the aspirin and placebo
groups results in h = 0.068. Thus, the effect size
observed in the ganzfeld data base is triple the
much publicized effect of aspirin on heart attacks.
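That comparison is easy to verify. The sketch below recomputes both effect sizes; the aspirin counts used (104 heart attacks among 11,037 physicians on aspirin versus 189 among 11,034 on placebo) are an assumption here, taken from the widely cited Physicians' Health Study figures rather than restated from Section 3:

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    """Arcsine-transformed difference of two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

aspirin_h  = cohens_h(189 / 11034, 104 / 11037)   # placebo vs. aspirin heart attack rates
ganzfeld_h = cohens_h(122 / 355, 0.25)            # autoganzfeld hit rate vs. chance
print(round(aspirin_h, 3), round(ganzfeld_h, 2))  # about 0.068 vs. about 0.2 -- roughly threefold
```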
6. OTHER META-ANALYSES IN
PARAPSYCHOLOGY
Four additional meta-analyses have been con-
ducted in various areas of parapsychology since the
original ganzfeld meta-analyses were reported.
Three of the four analyses focused on evidence of
psi abilities, while the fourth examined the rela-
tionship between extroversion and psychic func-
tioning. In this section, each of the four analyses
will be briefly summarized.
There are only a handful of English-language
journals and proceedings in parapsychology, so
retrieval of the relevant studies in each of the
four cases was simple to accomplish by searching
those sources in detail and by searching other
bibliographic data bases for keywords.
Each analysis included an overall summary, an
analysis of the quality of the studies versus the size
of the effect and a "file-drawer" analysis to deter-
mine the possible number of unreported studies.
Three of the four also contained comparisons across
various conditions.
6.1 Forced-Choice Precognition Experiments
Honorton and Ferrari (1989) analyzed forced-
choice experiments conducted from 1935 to 1987, in
which the target material was randomly selected
after the subject had attempted to predict what it
would be. The time delay in selecting the target
ranged from under a second to one year. Target
material included items as diverse as ESP cards
and automated random number generators. Two
investigators, S. G. Soal and Walter J. Levy, were
not included because some of their work has been
suspected to be fraudulent.
Overall Results. There were 309 studies re-
ported by 62 senior authors, including more than
50,000 subjects and nearly two million individual
trials. Honorton and Ferrari used z/√n as the
measure of effect size (ES) for each study, where n
was the number of Bernoulli trials in the study.
They reported a mean ES of 0.020, and a mean
z-score of 0.65 over all studies. They also reported a
combined z of 11.41, p = 6.3 × 10^-25. Some 30%
(92) of the studies were statistically significant at
a = 0.05. The mean ES per investigator was 0.033,
and the significant results were not due to just a
few investigators.
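To make the effect size measure concrete, the following sketch shows how ES = z/√n would be computed for a single forced-choice study; the study values here are invented purely for illustration:

```python
from math import sqrt

# Hypothetical forced-choice study: 5 choices per trial, so the chance rate is 0.20
n, hits, p0 = 1000, 230, 0.20

# z score from the normal approximation to the binomial
z = (hits - n * p0) / sqrt(n * p0 * (1 - p0))

# Effect size as used by Honorton and Ferrari (1989)
es = z / sqrt(n)
print(round(z, 2), round(es, 3))   # z about 2.37, ES about 0.075
```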
Quality. Eight dichotomous quality measures
were assigned to each study, resulting in possible
scores from zero for the lowest quality, to eight for
the highest. They included features such as ade-
quate randomization, preplanned analysis and au-
tomated recording of the results. The correlation
between study quality and effect size was 0.081,
indicating a slight tendency for higher quality
studies to be more successful, contrary to claims by
critics that the opposite would be true. There was
a clear relationship between quality and year of
publication, presumably because over the years
experimenters in parapsychology have responded
to suggestions from critics for improving their
methodology.
File Drawer. Following Rosenthal (1984), the
authors calculated the "fail-safe N" indicating the
number of unreported studies that would have to be
sitting in file drawers in order to negate the signifi-
cant effect. They found N = 14,268, or a ratio of 46
unreported studies for each one reported. They also
followed a suggestion by Dawes, Landman and
Williams (1984) and computed the mean z for all
studies with z > 1.65. If such studies were a ran-
dom sample from the upper 5% tail of a N(0,1)
distribution, the mean z would be 2.06. In this case
it was 3.61. They concluded that selective reporting
could not explain these results.
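Both file-drawer checks can be sketched from the reported summary numbers. The fail-safe N below uses one common form of Rosenthal's formula (one-tailed 0.05 criterion); the small discrepancy from the published 14,268 presumably reflects rounding of the mean z to 0.65:

```python
from scipy.stats import norm

k, mean_z = 309, 0.65
sum_z = k * mean_z

# Rosenthal's fail-safe N: unreported null studies needed to drag the
# combined (Stouffer) z below the one-tailed 0.05 criterion of 1.645
stouffer_z = sum_z / k ** 0.5               # about 11.4, as reported
fail_safe_n = sum_z ** 2 / 1.645 ** 2 - k   # about 14,600 (paper reports 14,268)

# Dawes, Landman and Williams check: mean of a N(0,1) variable truncated at 1.65
c = 1.65
truncated_mean = norm.pdf(c) / norm.sf(c)   # about 2.06; the observed mean was 3.61
print(round(stouffer_z, 2), round(fail_safe_n), round(truncated_mean, 2))
```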
Comparisons. Four variables were identified
that appeared to have a systematic relationship to
study outcome. The first was that the 25 studies
using subjects selected on the basis of good past
performance were more successful than the 223
using unselected subjects, with mean effect sizes of
0.051 and 0.008, respectively. Second, the 97 stud-
ies testing subjects individually were more success-
ful than the 105 studies that used group testing;
mean effect sizes were 0.021 and 0.004, respec-
tively. Timing of feedback was the third moderat-
ing variable, but information was only available for
104 studies. The 15 studies that never told the
subjects what the targets were had a mean effect
size of -0.001. Feedback after each trial produced
the best results; the mean ES for the 47 studies
was 0.035. Feedback after each set of trials re-
sulted in mean ES of 0.023 (21 studies), while
delayed feedback (also 21 studies) yielded a mean
ES of only 0.009. There is a clear ordering; as the
gap between time of feedback and time of the
actual guesses decreased, effect sizes increased.
The fourth variable was the time interval be-
tween the subject's guess and the actual target
selection, available for 144 studies. The best results
were for the 31 studies that generated targets less
than a second after the guess (mean ES = 0.045),
while the worst were for the seven studies that
delayed target selection by at least a month (mean
ES = 0.001). The mean effect sizes showed a clear
trend, decreasing in order as the time interval
increased from minutes to hours to days to weeks to
months.
6.2 Attempts to Influence Random Physical
Systems
Radin and Nelson (1989) examined studies de-
signed to test the hypothesis that "The statistical
output of an electronic RNG [random number gen-
erator] is correlated with observer intention in ac-
cordance with prespecified instructions" (page
1502). These experiments typically involve RNGs
based on radioactive decay, electronic noise or pseu-
dorandom number sequences seeded with true ran-
dom sources. Usually the subject is instructed to
try to influence the results of a string of binary
trials by mental intention alone. A typical protocol
would ask a subject to press a button (thus starting
the collection of a fixed-length sequence of bits),
and then try to influence the random source to
produce more zeroes or more ones. A run might
consist of three successive button presses, one each
in which the desired result was more zeroes or
more ones, and one as a control with no conscious
intention. A z score would then be computed for
each button press.
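A small simulation makes the protocol concrete. The sketch below is purely illustrative (the sequence length, chance rate and seeds are arbitrary assumptions, and the simulated source is unbiased, i.e., it behaves exactly as the null hypothesis predicts):

```python
import random
from math import sqrt

def run_sequence(n_bits=200, p=0.5, seed=None):
    """Simulate one button press: collect a fixed-length bit sequence and
    return the z score for the number of ones against the chance rate p."""
    rng = random.Random(seed)
    ones = sum(rng.random() < p for _ in range(n_bits))
    return (ones - n_bits * p) / sqrt(n_bits * p * (1 - p))

# One "run": aim for more ones, aim for more zeroes, then a no-intention control
z_hi, z_lo, z_control = (run_sequence(seed=s) for s in (1, 2, 3))
print(round(z_hi, 2), round(z_lo, 2), round(z_control, 2))
```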
The 832 studies in the analysis were conducted
from 1959 to 1987 and included 235 "control" stud-
ies, in which the output of the RNGs was recorded
but there was no conscious intention involved.
These were usually conducted before and during
the experimental series, as tests of the RNGs.
Results. The effect size measure used was again
z/√n, where z was positive if more bits of the
specified type were achieved. The mean effect size
for control studies was not significantly different
from zero (-1.0 × 10^-5). The mean effect size
for the experimental studies was also very small,
3.2 × 10^-4, but it was significantly higher than the
mean ES for the control studies (z = 4.1).
Quality. Sixteen quality measures were defined
and assigned to each study, under the four general
categories of procedures, statistics, data and the
RNG device. A score of 16 reflected the highest
quality. The authors regressed mean effect size on
mean quality for each investigator and found a
slope of 2.5 × 10^-5 with standard error of 3.2 ×
10^-5, indicating little relationship between quality
and outcome. They also calculated a weighted mean
effect size, using quality scores as weights, and
found that it was very similar to the unweighted
mean ES. They concluded that "differences
in methodological quality are not significant
predictors of effect size" (page 1507).
File Drawer. Radin and Nelson used several
methods for estimating the number of unreported
studies (pages 1508-1510). Their estimates ranged
from 200 to 1000 based on models assuming
that all significant studies were reported. They
calculated the fail-safe N to be 54,000.
6.3 Attempts to Influence Dice
Radin and Ferrari (1991) examined 148 studies,
published from 1935 to 1987, designed to test
whether or not consciousness can influence the
results of tossing dice. They also found 31 "con-
trol" studies in which no conscious intention was
involved.
Results. The effect size measure used was
z/√n, where z was based on the number of throws
in which the die landed with the desired face (or
faces) up, in n throws. The weighted mean ES for
the experimental studies was 0.0122 with a stan-
dard error of 0.00062; for the control studies the
mean and standard error were 0.00093 and 0.00255,
respectively. Weights for each study were de-
termined by quality, giving more weight to high-
quality studies. Combined z scores for the exper-
imental and control studies were reported by Radin
and Ferrari to be 18.2 and 0.18, respectively.
Quality. Eleven dichotomous quality measures
were assigned, ranging from automated recording
to whether or not control studies were interspersed
with the experimental studies. The final quality
score for each study combined these with informa-
tion on method of tossing the dice, and with source
of subject (defined below). A regression of quality
score versus effect size resulted in a slope of -0.002,
with a standard error of 0.0011. However, when
effect sizes were weighted by sample size, there was
a significant relationship between quality and ef-
fect size, leading Radin and Ferrari to conclude
that higher-quality studies produced lower weighted
effect sizes.
File Drawer. Radin and Ferrari calculated
Rosenthal's fail-safe N for this analysis to be
17,974. Using the assumption that all significant
studies were reported, they estimated the number
of unreported studies to be 1152. As a final assess-
ment, they compared studies published before and
after 1975, when the Journal of Parapsychology
adopted an official policy of publishing nonsigni-
ficant results. They concluded, based on that an-
alysis, that more nonsignificant studies were
published after 1975, and thus "We must consi-
der the overall (1935-1987) data base as suspect
with respect to the file-drawer problem."
Comparisons. Radin and Ferrari noted that
there was bias in both the experimental and control
studies across die face. Six was the face most likely
to come up, consistent with the observation that it
has the least mass. Therefore, they examined re-
sults for the subset of 69 studies in which targets
were evenly balanced among the six faces. They
still found a significant effect, with mean and stan-
dard error for effect size of 8.6 × 10^-3 and 1.1 ×
10^-3, respectively. The combined z was 7.617 for
these studies.
They also compared effect sizes across types of
subjects used in the studies, categorizing them as
unselected, experimenter and other subjects, exper-
imenter as sole subject, and specially selected sub-
jects. Like Honorton and Ferrari (1989), they found
the highest mean ES for studies with selected
subjects; it was approximately 0.02, more than twice
that for unselected subjects.
6.4 Extroversion and ESP Performance
Honorton, Ferrari and Bem (1991) conducted a
meta-analysis to examine the relationship between
scores on tests of extroversion and scores on
psi-related tasks. They found 60 studies by 17
investigators, conducted from 1945 to 1983.
Results. The effect size measure used for this
analysis was the correlation between each subject's
extroversion score and ESP score. A variety of
measures had been used for both scores across stud-
ies, so various correlation coefficients were used.
Nonetheless, a stem and leaf diagram of the corre-
lations showed an approximate bell shape with
mean and standard deviation of 0.19 and 0.26,
respectively, and with an additional outlier at r =
0.91. Honorton et al. reported that when weighted
by degrees of freedom, the weighted mean r was
0.14, with a 95% confidence interval covering 0.10
to 0.19.
Forced-Choice versus Free-Response Re-
sults. Because forced-choice and free-response tests
differ qualitatively, Honorton et al. chose to exam-
ine their relationship to extroversion separately.
They found that for free-response studies there was
a significant correlation between extroversion and
ESP scores, with mean r = 0.20 and z = 4.46. Fur-
ther, this effect was homogeneous across both
investigators and extroversion scales.
For forced-choice studies, there was a significant
correlation between ESP and extroversion, but only
for those studies that reported the ESP results
to the subjects before measuring extroversion.
Honorton et al. speculated that the relationship
was an artifact, in which extroversion scores
were temporarily inflated as a result of positive
feedback on ESP performance.
Confirmation with New Data. Following the
extroversion/ESP meta-analysis, Honorton et al.
attempted to confirm the relationship using
the autoganzfeld data base. Extroversion scores
based on the Myers-Briggs Type Indicator were
available for 221 of the 241 subjects who had
participated in autoganzfeld studies.
The correlation between extroversion scores and
ganzfeld rating scores was r = 0.18, with a 95%
confidence interval from 0.05 to 0.30. This is con-
sistent with the mean correlation of r = 0.20 for
free-response experiments, determined from the
meta-analysis. These correlations indicate that ex-
troverted subjects can produce higher scores in
free-response ESP tests.
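The reported interval is consistent with a standard Fisher z calculation on the 221 available subjects (a sketch; the original authors' exact method is not stated here):

```python
from math import atanh, tanh, sqrt

r, n = 0.18, 221
z = atanh(r)                       # Fisher z transform of the correlation
se = 1 / sqrt(n - 3)
lo, hi = tanh(z - 1.96 * se), tanh(z + 1.96 * se)
print(round(lo, 2), round(hi, 2))  # roughly (0.05, 0.30), matching the reported interval
```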
7. CONCLUSIONS
Parapsychologists often make a distinction be-
tween "proof-oriented research" and "process-
oriented research." The former is typically con-
ducted to test the hypothesis that psi abilities exist,
while the latter is designed to answer questions
about how psychic functioning works. Proof-
oriented research has dominated the literature
in parapsychology. Unfortunately, many of the
studies used small samples and would thus be
nonsignificant even if a moderate-sized effect
exists.
The recent focus on meta-analysis in parapsy-
chology has revealed that there are small but
consistently nonzero effects across studies, experi-
menters and laboratories. The sizes of the effects in
forced-choice studies appear to be comparable to
those reported in some medical studies that had
been heralded as breakthroughs. (See Section 5;
also Honorton and Ferrari, 1989, page 301.) Free-
response studies show effect sizes of far greater
magnitude.
A promising direction for future process-oriented
research is to examine the causes of individual
differences in psychic functioning. The ESP/ex-
troversion meta-analysis is a step in that direction.
In keeping with the idea of individual differ-
ences, Bayes and empirical Bayes methods would
appear to make more sense than the classical infer-
ence methods commonly used, since they would
allow individual abilities and beliefs to be modeled.
Jeffreys (1990) reported a Bayesian analysis of some
of the RNG experiments and showed that conclu-
sions were closely tied to prior beliefs even though
hundreds of thousands of trials were available.
It may be that the nonzero effects observed in the
meta-analyses can be explained by something other
than ESP, such as shortcomings in our understand-
ing of randomness and independence. Nonetheless,
there is an anomaly that needs an explanation. As
I have argued elsewhere (Utts, 1987), research in
parapsychology should receive more support from
the scientific community. If ESP does not exist,
there is little to be lost by erring in the direction of
further research, which may in fact uncover other
anomalies. If ESP does exist, there is much to be
lost by not doing process-oriented research, and
much to be gained by discovering how to enhance
and apply these abilities to important world
problems.
ACKNOWLEDGMENTS
I would like to thank Deborah Delanoy, Charles
Honorton, Wesley Johnson, Scott Plous and an
anonymous reviewer for their helpful comments on
an earlier draft of this paper, and Robert Rosenthal
and Charles Honorton for discussions that helped
clarify details.
REFERENCES
ATKINSON, R. L., ATKINSON, R. C., SMITH, E. E. and BEM, D. J.
(1990). Introduction to Psychology, 10th ed. Harcourt Brace
Jovanovich, San Diego.
BELOFF, J. (1985). Research strategies for dealing with unstable
phenomena. In The Repeatability Problem in Parapsychol-
ogy (B. Shapin and L. Coly, eds.) 1-21. Parapsychology
Foundation, New York.
BLACKMORE, S. J. (1985). Unrepeatability: Parapsychology's only
finding. In The Repeatability Problem in Parapsychology
(B. Shapin and L. Coly, eds.) 183-206. Parapsychology
Foundation, New York.
BURDICK, D. S. and KELLY, E. F. (1977). Statistical methods in
parapsychological research. In Handbook of Parapsychology
(B. B. Wolman, ed.) 81-130. Van Nostrand Reinhold, New
York.
CAMP, B. H. (1937). (Statement in Notes Section.) Journal of
Parapsychology 1 305.
COHEN, J. (1990). Things I have learned (so far). American
Psychologist 45 1304-1312.
COOVER, J. E. (1917). Experiments in Psychical Research at
Leland Stanford Junior University. Stanford Univ.
DAWES, R. M., LANDMAN, J. and WILLIAMS, J. (1984). Reply to
Kurosawa. American Psychologist 39 74-75.
DIACONIS, P. (1978). Statistical problems in ESP research. Sci-
ence 201 131-136.
DOMMEYER, F. C. (1975). Psychical research at Stanford Univer-
sity. Journal of Parapsychology 39 173-205.
DRUCKMAN, D. and SWETS, J. A., eds. (1988). Enhancing Human
Performance: Issues, Theories, and Techniques. National
Academy Press, Washington, D.C.
EDGEWORTH, F. Y. (1885). The calculus of probabilities applied
to psychical research. In Proceedings of the Society for
Psychical Research 3 190-199.
EDGEWORTH, F. Y. (1886). The calculus of probabilities applied
to psychical research. II. In Proceedings of the Society for
Psychical Research 4 189-208.
FELLER, W. K. (1940). Statistical aspects of ESP. Journal of
Parapsychology 4 271-297.
FELLER, W. K. (1968). An Introduction to Probability Theory
and Its Applications 1, 3rd ed. Wiley, New York.
FISHER, R. A. (1924). A method of scoring coincidences in tests
with playing cards. In Proceedings of the Society for Psychi-
cal Research 34 181-185.
FISHER, R. A. (1929). The statistical method in psychical re-
search. In Proceedings of the Society for Psychical Research
39 189-192.
GALLUP, G. H., JR., and NEWPORT, F. (1991). Belief in paranor-
mal phenomena among adult Americans. Skeptical Inquirer
15 137-146.
GARDNER, M. J. and ALTMAN, D. G. (1986). Confidence intervals
rather than p-values: Estimation rather than hypothesis
testing. British Medical Journal 292 746-750.
GILMORE, J. B. (1989). Randomness and the search for psi.
Journal of Parapsychology 53 309-340.
GILMORE, J. B. (1990). Anomalous significance in pararandom
and psi-free domains. Journal of Parapsychology 54 53-58.
GREELEY, A. (1987). Mysticism goes mainstream. American
Health 7 47-49.
GREENHOUSE, J. B. and GREENHOUSE, S. W. (1988). An aspirin a
day ... ? Chance 1 24-31.
GREENWOOD, J. A. and STUART, C. E. (1940). A review of Dr.
Feller's critique. Journal of Parapsychology 4 299-319.
HACKING, I. (1988). Telepathy: Origins of randomization in ex-
perimental design. Isis 79 427-451.
HANSEL, C. E. M. (1980). ESP and Parapsychology: A Critical
Re-evaluation. Prometheus Books, Buffalo, N.Y.
HARRIS, M. J. and ROSENTHAL, R. (1988a). Interpersonal Ex-
pectancy Effects and Human Performance Research. Na-
tional Academy Press, Washington, D.C.
HARRIS, M. J. and ROSENTHAL, R. (1988b). Postscript to Interper-
sonal Expectancy Effects and Human Performance Research.
National Academy Press, Washington, D.C.
HEDGES, L. V. and OLKIN, I. (1985). Statistical Methods for
Meta-Analysis. Academic, Orlando, Fla.
HONORTON, C. (1977). Psi and internal attention states. In
Handbook of Parapsychology (B. B. Wolman, ed.) 435-472.
Van Nostrand Reinhold, New York.
HONORTON, C. (1985a). How to evaluate and improve the repli-
cability of parapsychological effects. In The Repeatability
Problem in Parapsychology (B. Shapin and L. Coly, eds.)
238-255. Parapsychology Foundation, New York.
HONORTON, C. (1985b). Meta-analysis of psi ganzfeld research: A
response to Hyman. Journal of Parapsychology 49 51-91.
HONORTON, C., BERGER, R. E., VARVOGLIS, M. P., QUANT, M.,
DERR, P., SCHECHTER, E. I. and FERRARI, D. C. (1990).
Psi communication in the ganzfeld: Experiments with an
automated testing system and a comparison with a meta-
analysis of earlier studies. Journal of Parapsychology 54
99-139.
HONORTON, C. and FERRARI, D. C. (1989). "Future telling": A
meta-analysis of forced-choice precognition experiments,
1935-1987. Journal of Parapsychology 53 281-308.
HONORTON, C., FERRARI, D. C. and BEM, D. J. (1991). Extraver-
sion and ESP performance: A meta-analysis and a new
confirmation. Research in Parapsychology 1990. The Scare-
crow Press, Metuchen, N.J. To appear.
HYMAN, R. (1985a). A critical overview of parapsychology. In A
Skeptic's Handbook of Parapsychology (P. Kurtz, ed.) 1-96.
Prometheus Books, Buffalo, N.Y.
HYMAN, R. (1985b). The ganzfeld psi experiment: A critical
appraisal. Journal of Parapsychology 49 3-49.
HYMAN, R. and HONORTON, C. (1986). Joint communiqué: The
psi ganzfeld controversy. Journal of Parapsychology 50
351-364.
IVERSEN, G. R., LONGCOR, W. H., MOSTELLER, F., GILBERT, J. P.
and YOUTZ, C. (1971). Bias and runs in dice throwing and
recording: A few million throws. Psychometrika 36 1-19.
JEFFREYS, W. H. (1990). Bayesian analysis of random event
generator data. Journal of Scientific Exploration 4 153-169.
LINDLEY, D. V. (1957). A statistical paradox. Biometrika 44
187-192.
MAUSKOPF, S. H. and MCVAUGH, M. (1979). The Elusive Science:
Origins of Experimental Psychical Research. Johns Hopkins
Univ. Press.
MCVAUGH, M. R. and MAUSKOPF, S. H. (1976). J. B. Rhine's
Extrasensory Perception and its background in psychical
research. Isis 67 161-189.
NEULIEP, J. W., ed. (1990). Handbook of replication research in
the behavioral and social sciences. Journal of Social Behav-
ior and Personality 5 (4) 1-510.
OFFICE OF TECHNOLOGY ASSESSMENT (1989). Report of a work-
shop on experimental parapsychology. Journal of the Amer-
ican Society for Psychical Research 83 317-339.
PALMER, J. (1989). A reply to Gilmore. Journal of Parapsychol-
ogy 53 341-344.
PALMER, J. (1990). Reply to Gilmore: Round two. Journal of
Parapsychology 54 59-61.
PALMER, J. A., HONORTON, C. and UTTS, J. (1989). Reply to the
National Research Council study on parapsychology. Jour-
nal of the American Society for Psychical Research 83 31-49.
RADIN, D. I. and FERRARI, D. C. (1991). Effects of consciousness
on the fall of dice: A meta-analysis. Journal of Scientific
Exploration 5 61-83.
RADIN, D. I. and NELSON, R. D. (1989). Evidence for conscious-
ness-related anomalies in random physical systems. Foun-
dations of Physics 19 1499-1514.
RAO, K. R. (1985). Replication in conventional and controversial
sciences. In The Repeatability Problem in Parapsychology
(B. Shapin and L. Coly, eds.) 22-41. Parapsychology Foun-
dation, New York.
RHINE, J. B. (1934). Extrasensory Perception. Boston Society for
Psychical Research, Boston. (Reprinted by Branden Press,
1964.)
RHINE, J. B. (1977). History of experimental studies. In Hand-
book of Parapsychology (B. B. Wolman, ed.) 25-47. Van
Nostrand Reinhold, New York.
RICHET, C. (1884). La suggestion mentale et le calcul des proba-
bilites. Revue Philosophique 18 608-674.
ROSENTHAL, R. (1984). Meta-Analytic Procedures for Social Re-
search. Sage, Beverly Hills.
ROSENTHAL, R. (1986). Meta-analytic procedures and the nature
of replication: The ganzfeld debate. Journal of Parapsychol-
ogy 50 315-336.
ROSENTHAL, R. (1990a). How are we doing in soft psychology?
American Psychologist 45 775-777.
ROSENTHAL, R. (1990b). Replication in behavioral research.
Journal of Social Behavior and Personality 5 1-30.
SAUNDERS, D. R. (1985). On Hyman's factor analysis. Journal of
Parapsychology 49 86-88.
SHAPIN, B. and COLY, L., eds. (1985). The Repeatability Problem
in Parapsychology. Parapsychology Foundation, New York.
SPENCER-BROWN, G. (1957). Probability and Scientific Inference.
Longmans Green, London and New York.
STUART, C. E. and GREENWOOD, J. A. (1937). A review of criti-
cisms of the mathematical evaluation of ESP data. Journal
of Parapsychology 1 295-304.
TVERSKY, A. and KAHNEMAN, D. (1982). Belief in the law of
small numbers. In Judgment Under Uncertainty: Heuristics
and Biases (D. Kahneman, P. Slovic and A. Tversky, eds.)
23-31. Cambridge Univ. Press.
UTTS, J. (1986). The ganzfeld debate: A statistician's perspec-
tive. Journal of Parapsychology 50 395-402.
UTTS, J. (1987). Psi, statistics, and society. Behavioral and
Brain Sciences 10 615-616.
UTTS, J. (1988). Successful replication versus statistical signifi-
cance. Journal of Parapsychology 52 305-320.
UTTS, J. (1989). Randomness and randomization tests: A reply to
Gilmore. Journal of Parapsychology 53 345-351.
UTTS, J. (1991). Analyzing free-response data: A progress report.
In Psi Research Methodology: A Re-examination (L. Coly,
ed.). Parapsychology Foundation, New York. To appear.
WILKS, S. S. (1965a). Statistical aspects of experiments in
telepathy. N.Y. Statistician 16 (6) 1-3.
WILKS, S. S. (1965b). Statistical aspects of experiments in
telepathy. N. Y. Statistician 16 (7) 4-6.
Comment
M. J. Bayarri and James Berger
1. INTRODUCTION
There are many fascinating issues discussed in
this paper. Several concern parapsychology itself
and the interpretation of statistical methodology
therein. We are not experts in parapsychology, and
so have only one comment concerning such mat-
ters: In Section 3 we briefly discuss the need to
switch from P-values to Bayes factors in discussing
evidence concerning parapsychology.
A more general issue raised in the paper is that
of replication. It is quite illuminating to consider
the issue of replication from a Bayesian perspec-
tive, and this is done in Section 2 of our discussion.
Many insightful observations concerning replica-
tion are given in the article, and these spurred us
to determine if they could be quantified within
Bayesian reasoning. Quantification requires clear
delineation. of the possible purposes of replication,
and at least two are obvious. The first is simple
reduction of random error, achieved by obtaining
more observations from the replication. The second
purpose is to search for possible bias in the original
experiment. We use "bias" in a loose sense here, to
refer to any of the huge number of ways in which
the effects being measured by the experiment can
differ from the actual effects of interest. Thus a
clinical trial without a placebo can suffer a placebo
"bias"; a survey can suffer a "bias" due to the
sampling frame being unrepresentative of the
actual population; and possible sources of bias
in parapsychological experiments have been
extensively discussed.
Replication to Reduce Random Error
If the sole goal of replication of an experiment is
to reduce random error, matters are very straight-
forward. Reviewing the Bayesian way of studying
this issue is, however, useful and will be done
through the following simple example.
M. J. Bayarri is Titular Professor, Department of
Statistics and Operations Research, University of
Valencia, Avenida Dr. Moliner 50, 46100 Burjassot,
Valencia, Spain. James Berger is the Richard M.
Brumfield Distinguished Professor of Statistics,
Purdue University, West Lafayette, Indiana 47907.
EXAMPLE 1. Consider the example from Tversky
and Kahneman (1982), in which an experiment
results in a standardized test statistic of z1 = 2.46.
(We will assume normality to keep computations
trivial.) The question is: What is the highest value
of z2 in a second set of data that would be consid-
ered a failure to replicate? Two possible precise
versions of this question are: Question 1: What is
the probability of observing z2 for which the null
hypothesis would be rejected in the replicated ex-
periment? Question 2: What value of z2 would
leave one's overall opinion about the null hypothe-
sis unchanged?
Consider the simple case where Z1 ~ N(z1 | θ, 1)
and (independently) Z2 ~ N(z2 | θ, 1), where θ is
the mean and 1 is the standard deviation of the
normal distribution. Note that we are considering
the case in which no experimental bias is suspected
and so the means for each experiment are assumed
to be the same.
Suppose that it is desired to test H0: θ ≤ 0 versus
H1: θ > 0, and suppose that initial prior opinion
about θ can be described by the noninformative
prior π(θ) = 1. We consider the one-sided testing
problem with a constant prior in this section, be-
cause it is known that then the posterior probabil-
ity of H0, to be denoted by P(H0 | data), equals the
P-value, allowing us to avoid complications arising
from differences between Bayesian and classical
answers.
After observing z1 = 2.46, the posterior distribu-
tion of θ is
π(θ | z1) = N(θ | 2.46, 1).
Question 1 then has the answer (using predictive
Bayesian reasoning)
P(rejecting at level α | z1)
  = ∫_{c_α}^{∞} ∫ N(z2 | θ, 1) π(θ | z1) dθ dz2
  = 1 - Φ((c_α - 2.46)/√2),
where Φ is the standard normal cdf and c_α is the
(one-sided) critical value corresponding to the level,
α, of the test. For instance, if α = 0.05, then this
probability equals 0.7178, demonstrating that there
is a quite substantial probability that the second
experiment will fail to reject. If α is chosen to be
the observed significance level from the first exper-
iment, so that c_α = z1, then the probability that the
second experiment will reject is just 1/2. This is
nothing but a statement of the well-known martin-
gale property of Bayesianism, that what you "ex-
pect" to see in the future is just what you know
today. In a sense, therefore, question 1 is exposed
as being uninteresting.
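Both numbers follow from the predictive distribution Z2 | z1 ~ N(2.46, 2); a quick check (a sketch of the computation, not part of the original discussion):

```python
from math import sqrt
from scipy.stats import norm

z1 = 2.46
c_alpha = norm.ppf(0.95)                  # one-sided 0.05 critical value, about 1.645

# Predictive distribution of Z2 given z1 has mean z1 and variance 1 + 1 = 2
p_reject = 1 - norm.cdf((c_alpha - z1) / sqrt(2))
print(round(p_reject, 4))                 # 0.7178

# If the criterion is set at the observed z1 itself, the probability is exactly 1/2
print(1 - norm.cdf((z1 - z1) / sqrt(2)))  # 0.5
```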
Question 2 more properly focuses on the fact that
the stated goal of replication here is simply to
reduce uncertainty in stated conclusions. The an-
swer to the question follows immediately from not-
ing that the posterior from the combined data
(z1, z2) is
π(θ | z1, z2) = N(θ | (z1 + z2)/2, 1/√2),
so that
P(H0 | data) = Φ(-(z1 + z2)/√2).
Setting this equal to P(H0 | z1) and solving for z2
yields z2 = (√2 - 1)z1 = 1.02. Any value of z2
greater than this will increase the total evidence
against H0, while any value smaller than 1.02 will
decrease the evidence.
Replication to Detect Bias
The aspirin example dramatically raises the is-
sue of bias detection as a motive for replication.
Professor Utts observes that replication 1 gives
results that are fully compatible with those of the
original study, which could be interpreted as sug-
gesting that there is no bias in the original study,
while replication 2 would raise serious concerns of
bias. We became very interested in the implicit
suggestion that replication 2 would thus lead to
less overall evidence against the null hypothesis
than would replication 1, even though in isolation
replication 2 was much more "significant" than
was replication 1. In attempting to see if this is so,
we considered the Bayesian approach to study of
bias within the framework of the aspirin example.
EXAMPLE 2. For simplicity in the aspirin exam-
ple, we reduce consideration to
θ = true difference in heart attack rates between
aspirin and placebo populations multiplied by
1000;
Y = difference in observed heart attack rates be-
tween aspirin and placebo groups in original
study multiplied by 1000;
Xi = difference in observed heart attack rates be-
tween aspirin and placebo groups in Replica-
tion i multiplied by 1000.
We assume that the replication studies are ex-
tremely well designed and implemented, so that
one is very confident that the Xi have mean θ.
Using normal approximations for convenience, the
data can be summarized as
X1 ~ N(x1 | θ, 4.82), X2 ~ N(x2 | θ, 3.63),
with actual observations x1 = 7.704 and x2 =
13.07.
Consider now the bias issue. We assume that the
original experiment is somewhat suspect in this
regard, and we will model bias by defining the
mean of Y to be
η = θ + β,
where β is the unknown bias. Then the data in the
original experiment can be summarized by
Y ~ N(y | η, 1.54),
with the actual observation being y = 7.707.
Bayesian analysis requires specification of a prior
distribution, π(β), for the suspected amount of bias.
Of particular interest then are the posterior distri-
bution of θ, assuming replication i has been
performed, given by
π(θ | y, xi)