EVALUATING FREE-RESPONSE RATING DATA (SCHOUTEN AND CAMFFERMAN)
Document Type:
Collection:
Document Number (FOIA) /ESDN (CREST):
CIA-RDP96-00792R000700960002-4
Release Decision:
RIFPUB
Original Classification:
U
Document Page Count:
3
Document Creation Date:
November 4, 2016
Document Release Date:
October 28, 1998
Sequence Number:
2
Case Number:
Content Type:
RP
File:
Attachment | Size |
---|---|
CIA-RDP96-00792R000700960002-4.pdf | 333.06 KB |
Body:
Approved For Release 2000/08/15 : CIA-RDP96-00792R000700960002-4
Papers Statistical Issues and Methods
carry out a global significance test for such single hypotheses to
which a superordinate null hypothesis can be assigned.
It should be clear that by performing global significance
tests many psi experiments must lose their significance. I remember,
though, that I also mentioned the interexperimental selection
above, to whose avoidance, at the least, all similar psi experiments
should be combined and submitted to a global significance test.
Through such a "meta-analysis," on the other hand, the significance
may increase so that the single experiment loses part of its
meaning.
My second theme is the reduction of beta errors in the statistical
evaluation of psi experiments. The problem is to increase
the statistical efficiency (or power) of the significance tests in such
a way that, despite the avoidance of selection errors, minimal psi
effects can be statistically detected. I confine myself to two different
questions, both of which are of considerable importance to the
practice. The first question is: which are the statistically optimal
Here, it can first be a
single result there is a simple
result with the number of giv
the p value will be strongly i
method of correcting intra- o
atistical correction possible that
.plies the p value of the selected
,ts. Naturally, in this manner,
o that the statistical signifi-
the case of a global sig-
Most of the other methods consist in weighted combinations of
the single results so as to attain a most efficient global significance
test. In the case of standard psi experiments that seems trivial because
one needs only to add the different hits, whose sum can be
evaluated with a CR just as well as the separate results. However,
an analysis of intra- and interindividual distributions of psi scores
shows that the simple addition of hits is one of the statistically
least efficient methods, even for the aggregation of small experimental
units such as individual runs. The reason for this lies in
the strong variability of psi scores, which can vary even in bipolar
fashion between psi-hitting and psi-missing so that the t
special (nonlinear) transformations weighting the single scores according
to their size. Finally, following the method of the likelihood
quotient, I came to a measure which is statistically most efficient
for strongly varying psi scores and is a linear function of
the well-known "run-score variance."
The second question refers to the identification of permissible
forms of selection which one could use to increase the statistical
efficiency. For example, the above definition of selection error al-
lows one to exclude any partial results from the
test of an experiment if the exclusion ensues ac
that, under the null hypothesis, is independent'
lar experimental situations, certain subjects,
can be a great advantage because every
In
because the stat
decreases with the
is allowed to eliminate ti?
able when calculating corre
psi-hitters and psi-missers
variables and other variabl
significant.
the problem of statisti
in Reykjavik, it also
must, apparently, t
regarding reality,
in such an obj
number of factors
,tical efficiency in
hould be
of a multivariate experiment,
criterion or predictor vari-
by performing a factor analysis,
the case of correlated variables
bles. Finally, the so-called
mentioned, according to which one
cases of the distribution of a vari-
f a correlational study, if enough
lanations will not
s of parapsychologists'
id not have any considerabl
effect. One
uch effects.
errors serve the general psychological tendency
given empirical data with one's own expectations
Therefore, the final demand can only be to
tive area as mathematical statistics. Otherwise,
with statisticsL one can prove everything.
obal significance
of the respective
clues that particu-
ertain variables, etc.,
irate them as is. This
EVALUATING FREE-RESPONSE RATING DATA
Sybo A. Schouten and Gert Camfferman (Parapsychology
Laboratory, University of Utrecht, Sorbonnelaan 16,
3584CA Utrecht, The Netherlands)
In recent decades the use of forced-choice methods
in experimental research in parapsychology has gradually declined
in favor of free-response techniques. A disadvantage of free-response
techniques is that they are rather time-consuming. The
discrepancy in time investment between free-response and forced-choice
studies seems acceptable only if it can be shown either that
free-response studies are more sensitive in detecting ESP or that
knowledge is gained from the process analysis that free-response
studies allow. These two potential advantages of free-response
studies require, however, more sensitive techniques for analyzing
free-response data than the evaluations based on hit/miss ratios that
are used with forced-choice methods.
An evaluation method often used in free-response studies
employs a different target set for each trial and has the
subject assign ratings to all pictures of the set. A target set consists
of a number of pictures, one of which is randomly selected
to serve as the actual target in the experiment; the others are
used as controls. The rating values assigned to the pictures are based
on the agreement between the mentation (reported or not) and the content
(or perhaps the symbolic meaning of the content) of the pictures.
Based on the ratings assigned, the pictures can
be ranked and one of the familiar evaluation methods for preferential
ranking may be applied. But by turning ratings into ranks, the
greater sensitivity that the rating method might yield is lost. Hence,
a statistical evaluation is needed that does justice to the higher
sensitivity that ratings might offer. To this end, Z-scores are most
often applied; they were first used and reported by Stanford and Mayer
(JASPR, 1974, 182-191).
When the free-response rating data of an experiment were analyzed
by applying nonparametric tests to the Stanford Z-score distribution
of the targets, a significant result indicating psi-missing was observed.
However, it soon became clear that the result was purely
artifactual and could be explained by the rating behavior of the
subjects. This led us to study the properties of the Stanford
Z-scores in more detail.
Hansen reported to the 1985 PA Convention (RIP 1985, 93-94)
that Z-score distributions are bimodal. We found that Z-score distributions
are nonnormal in all cases, and that they are symmetrical
(though bimodal) only when subjects select ratings with equal
probability from the whole range. Decreasing the size of the target
set yields flatter distributions. Decreasing the range of the ratings
results in more irregular distributions. All distributions have an
upper and a lower limit of Z-scores. When subjects select ratings with
unequal probabilities from the range applied, the distributions become
asymmetrical. Hence it can be concluded that rating behavior
influences the distributions of Stanford Z-scores. This seems an
important problem because in many cases the conditions of the experiment
will influence the rating behavior of subjects. This implies
that any influence of conditions on the rating behavior, and
consequently on the Z-scores, must be eliminated before differences
between conditions in ESP scoring can be properly evaluated.
Stanford Z-scores are also peculiar in some other respects.
Their value and range are rather sensitive to the number of equal
ratings assigned. In the case in which equal values are assigned,
the actual size of the ratings has no influence on the size of the
Stanford Z-score. For instance, 1-0-0-0-0 (first rating is target)
yields the same Stanford Z-score for the target as 100-0-0-0-0; in
both cases the target receives a Stanford Z-score of +1.72. Hence,
the Stanford Z-scores do not always reflect the similarities or dif-
ferences between mentation and targets that subjects express in
their assignment of rating values.
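
The scale-insensitivity described above can be illustrated with a minimal sketch. It assumes that a Stanford Z-score is the target rating minus the mean of all ratings in the set, divided by the sample (n - 1) standard deviation of those ratings; the function name `stanford_z` is hypothetical, and the text does not state which standard-deviation convention was used. Under this assumed convention both rating patterns give about 1.79, so the sketch need not reproduce the +1.72 quoted above; the point is only that the two values coincide.

```python
from statistics import mean, stdev


def stanford_z(ratings, target_index=0):
    """Z-score of the target rating relative to all ratings in the set.

    Assumption: the sample standard deviation (n - 1) is used; the
    source text does not say which convention applies.
    """
    m = mean(ratings)
    s = stdev(ratings)
    if s == 0:
        raise ValueError("all ratings are equal: Z-score is undefined")
    return (ratings[target_index] - m) / s


# Target rated 1 versus target rated 100, all controls rated 0.
z_small = stanford_z([1, 0, 0, 0, 0])
z_large = stanford_z([100, 0, 0, 0, 0])
print(round(z_small, 3), round(z_large, 3))  # identical values
```

Because the score is a ratio of a deviation to a spread computed from the same five numbers, multiplying every rating by a constant leaves it unchanged.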
Another complication is that when relatively many ratings of
equal value are assigned, the Z-score distribution tends to become
discrete rather than continuous. Especially since free-response
studies in general involve few trials, the discrete character of such
distributions violates the assumptions on which many parametric and
nonparametric tests are based. To meet these objections a different
evaluation procedure based on a randomization test is proposed.
The randomization test is based on the sum of ratings over
the trials. For the rating values assigned to the control
pictures it can be assumed that ESP can have no effect on these
ratings. If we randomly select from each trial a control-picture
rating value and take the sum of these ratings, then based on all
possible combinations of ratings for control pictures over the trials
a distribution is obtained which will tend to be normal even when
the ratings themselves were selected with unequal probabilities.
The randomization test provides an answer to the question
to what extent the sum, over the trials, of the ratings assigned to
the target pictures deviates from the mean sum of the ratings assigned
to the control pictures. Consequently, the sum of the ratings
assigned to the target pictures is expressed as a standard
normal score based on the distribution of the sum of the ratings
assigned to the controls. This standard normal score will be called
the "standardized sum-of-ratings score" or SSR score. A good approximation
of this distribution is obtained by calculating its mean
and standard deviation from the means and variances of the ratings
for the controls of the individual trials. The mean of the distribution
of the sum of ratings will be equal to the sum of the means of the
controls for the trials. The standard deviation is found by taking the
square root of the sum of the variances of the control ratings over
the trials.
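
The normal approximation just described can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the names `ssr_score` and `upper_tail_p` and the example ratings are hypothetical, and the per-trial variance is taken as the population variance of the control ratings, on the reasoning that one control rating drawn at random from a trial has exactly that variance.

```python
from math import erf, sqrt
from statistics import mean, pvariance


def ssr_score(trials):
    """Standardized sum-of-ratings score.

    trials: list of (target_rating, [control_ratings]) pairs, one per trial.
    Assumption: per-trial variance is the population variance of the controls.
    """
    target_sum = sum(target for target, _ in trials)
    mean_sum = sum(mean(controls) for _, controls in trials)
    var_sum = sum(pvariance(controls) for _, controls in trials)
    if var_sum == 0:
        raise ValueError("control ratings do not vary: SSR is undefined")
    return (target_sum - mean_sum) / sqrt(var_sum)


def upper_tail_p(z):
    """One-tailed probability of a standard normal score at least z."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))


# Made-up ratings for three trials with four controls each.
trials = [
    (20, [0, 5, 10, 0]),
    (0, [15, 0, 0, 5]),
    (30, [10, 10, 0, 20]),
]
z = ssr_score(trials)
p = upper_tail_p(z)
print(round(z, 2), round(p, 4))
```

With only three trials the normal approximation is of course crude; the example is kept small so the sums can be checked by hand.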
Since SSR scores can be assumed to be standard normal,
their associated probability can be obtained from the standard nor-
mal distribution. SSR scores of different conditions can be com-
pared because SSR scores not only can be considered standard
normal but also are independent of differences between conditions
in range of ratings, rating behavior, or number of controls applied.
To obtain ESP scores for individual trials the rating value
assigned to the target is converted into a standardized average
rating score for the target (SAR score).
The distribution of the sum of ratings for the controls can
be considered as the distribution of ratings associated with that
condition. Reduced to the level of individual trials, we assume this
distribution to be typical for the condition and express all ratings
in terms of this distribution of average ratings. Thus, each rating is
converted into a standard normal score by computing its distance from
the mean of the average ratings for the controls of the trials and
dividing it by the standard deviation observed for these average ratings.
Then for each trial a SAR score for the target is defined as
the difference between this standard normal score for the target
and the average standard normal score for target and controls.
Since the SAR scores are based on true standard normal scores,
which means scores obtained from a normal distribution, SAR scores
can be considered normal too. For each trial the sum of SAR scores
for controls and targets is zero. Therefore, in the case of related
samples we might compare individual achievement over conditions by
calculating a product-moment correlation between the SAR scores of
the two conditions.
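
One possible reading of the SAR construction is sketched below. It is an interpretation, not the authors' procedure: the function name `sar_scores` and the example rows are hypothetical, the target is assumed to sit in the first column, and the sample standard deviation of the per-trial control averages is assumed because the text does not specify the convention.

```python
from statistics import mean, stdev


def sar_scores(trials):
    """Standardized average rating scores for every picture in every trial.

    trials: list of rating rows, target first, controls after.
    Assumption: spread is the sample SD of the per-trial control averages.
    """
    control_means = [mean(row[1:]) for row in trials]
    centre = mean(control_means)
    spread = stdev(control_means)
    if spread == 0:
        raise ValueError("control averages do not vary: SAR is undefined")
    scores = []
    for row in trials:
        # Standard score of each rating in the condition-level distribution.
        z = [(rating - centre) / spread for rating in row]
        z_bar = mean(z)
        # SAR: distance of each standard score from the trial average.
        scores.append([zi - z_bar for zi in z])
    return scores


rows = [
    [20, 0, 5, 10, 0],
    [0, 15, 0, 0, 5],
    [30, 10, 10, 0, 20],
]
scores = sar_scores(rows)
target_sar = [row[0] for row in scores]
print([round(s, 2) for s in target_sar])
```

Under this reading the scores within each trial sum to zero by construction, which matches the property stated in the text.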
Although the randomization test described above seems statistically
sound, we further studied its properties, especially regarding
its sensitivity in detecting ESP. To this end we conducted a computer
simulation of 100 "experiments" for each combination of two
variables. Each experiment consisted of 20 trials with 5 pictures
per trial and was simulated by randomly generating 20 rows of 5
numbers between rating values 0 and 30, inclusive. The two variables
involved were the subjects' rating behavior and the amount of ESP.
For rating behavior we manipulated the probability of selecting rating
values of zero. The amount of ESP was operationalized as the
number of subjects assigning the highest rating value to the target
in addition to what could be expected by chance.
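
A simulation of this kind might look like the sketch below. Several details are assumptions, since the text does not give them: a rating is forced to zero with probability `p_zero` and is otherwise uniform on 0 to 30, ESP is modeled by swapping the highest rating in a trial onto the target in `esp_hits` randomly chosen trials, the target is stored in the first column, and the SSR uses population variances of the controls. All function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, pvariance

N_TRIALS, N_PICTURES, MAX_RATING = 20, 5, 30


def draw_rating(rng, p_zero):
    # Assumed rating behavior: extra probability mass on a rating of zero.
    if rng.random() < p_zero:
        return 0
    return rng.randint(0, MAX_RATING)


def simulate_experiment(rng, p_zero, esp_hits):
    """Return 20 rows of 5 ratings with the target in column 0."""
    rows = [[draw_rating(rng, p_zero) for _ in range(N_PICTURES)]
            for _ in range(N_TRIALS)]
    for row in rng.sample(rows, esp_hits):
        # Assumed ESP model: the target receives the highest rating.
        best = row.index(max(row))
        row[0], row[best] = row[best], row[0]
    return rows


def ssr(rows):
    target_sum = sum(row[0] for row in rows)
    mean_sum = sum(mean(row[1:]) for row in rows)
    var_sum = sum(pvariance(row[1:]) for row in rows)
    return (target_sum - mean_sum) / sqrt(var_sum) if var_sum > 0 else 0.0


def mean_ssr(p_zero, esp_hits, n_experiments=100, seed=0):
    """Average SSR over simulated experiments for one condition."""
    rng = random.Random(seed)
    return mean(ssr(simulate_experiment(rng, p_zero, esp_hits))
                for _ in range(n_experiments))


null_mean = mean_ssr(p_zero=0.0, esp_hits=0)
esp_mean = mean_ssr(p_zero=0.0, esp_hits=5)
print(round(null_mean, 2), round(esp_mean, 2))
```

Varying `p_zero` and `esp_hits` over a grid would give one condition per combination of the two variables, in the spirit of the design described above.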
From the data obtained it can be concluded that in most conditions
the sensitivity of the SSR scores is rather low, and lower than
that of, for instance, a simple binomial test. Only
in extreme cases of rating behavior and amount of ESP do the SSR
scores become more sensitive than the binomial test. For instance,
in the case of 5 ESP hits, when in total 5 + 15/5 = 8 hits can be expected,
the binomial test yields an exact one-tailed probability of p = .01
whereas the SSR score yields on average a Z of 1.7 with an associated
one-tailed probability of .045.
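
The binomial figure can be checked with a short calculation. The sketch assumes 20 trials and a chance hit probability of 1/5 (one target among 5 pictures), as in the simulated experiments; the function name `binom_upper_tail` is hypothetical. Under these assumptions the probability of 8 or more hits is about .03 and that of 9 or more hits is about .01, so the quoted p = .01 is closer to the second value. The text does not state which tail convention was used.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """Exact probability of k or more hits in n trials with hit chance p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


p_at_least_8 = binom_upper_tail(8, 20, 0.2)
p_at_least_9 = binom_upper_tail(9, 20, 0.2)
print(round(p_at_least_8, 4), round(p_at_least_9, 4))
```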
In the same simulation studies Stanford Z-scores were computed.
We know that the distributions of these Z-scores are nonnormal,
but leaving this aside, we found that in most cases the sensitivity
of t-test evaluations based on Stanford Z-scores is comparable
to that of evaluations based on SSR scores. However, SSR
scores appear more sensitive than Stanford Z-scores in cases of
strong ESP and extreme rating behavior.
From these findings some practical conclusions can be drawn.
In general we must assume that the ESP influence on the data is
relatively small. Hence, unless there is reason to expect a strong
ESP influence in the experiment, the binomial test can be assumed to
be more sensitive than an evaluation based on the rating values.
The same applies to experiments in which no extreme rating behavior
can be expected, for instance, an experiment in which
an atomistic approach to the judging is followed. In that case we
generally expect nonzero ratings to be assigned to all pictures, and our
findings show that the SSR scores, as well as Stanford's
Z-scores, are then rather insensitive.
A METHODOLOGY FOR THE DEVELOPMENT OF A
KNOWLEDGE-BASED JUDGING SYSTEM FOR FREE-RESPONSE
MATERIALS
Dick J. Bierman (Dept. of Psychology, University of Amsterdam)
unlikely that this is pu
generally does not displ
judge that accounts for
has been proposed (Morris
expert systems might help
the expertise of magicians
made available to each ind
tise of the best judges
judging system. This
intelligence (Al) to
fused with the use
response material
rtai judges perform consistently
ig argets to a target set. It seems
i e of the judge's psi, since psi
tent behavior. Therefore, it might
ntuitive) knowledge of the specific
er performance on this task. It
AP, 1986, 137-149) that the use of
researchers in tasks where they lack
ction of fraud. Morris argues that
be formalized in such a system and
i researcher. Similarly, the exper-
spouse material could become avail-
s nowledge-based free-response
se of tech iques from the field of artificial
AI techniques for the representation of free-response material
(Maren, RIP 1986, 97-99). According to Maren, the
free-response material and the protocols should be represented
in the form of trees in which the nodes are perceivable "objects,"
like "flames," and the links represent relations, like "adjacent to."
We expect that focusing our attention on the (knowledge used in
the) human matching process might reveal more fundamental informa-
tion about the role of the meaning of the material. It is striking
that in Maren's proposed representation of complex target material
only visual features are present. Actually, the type of visual
matching that Maren proposes to be done by a machine can be bet-
ter performed by a sighted human.