EVALUATING FREE-RESPONSE RATING DATA (SCHOUTEN AND CAMFFERMAN)

Document Number (FOIA) /ESDN (CREST): CIA-RDP96-00792R000700960002-4
Release Decision: RIFPUB
Original Classification: U
Document Page Count: 3
Document Creation Date: November 4, 2016
Document Release Date: October 28, 1998
Sequence Number: 2
Content Type: RP
[...] carry out a global significance test for such single hypotheses to which a superordinate null hypothesis can be assigned. It should be clear that by performing global significance tests many psi experiments must lose their significance. I remember, though, that I also mentioned the interexperimental selection above, for whose avoidance, at the least, all similar psi experiments should be combined and submitted to a global significance test. Through such a "meta-analysis," on the other hand, the significance may increase, so that the single experiment loses part of its meaning.

My second theme is the reduction of beta errors in the statistical evaluation of psi experiments. The problem is to increase the statistical efficiency (or power) of the significance tests in such a way that--despite the avoidance of selection errors--minimal psi effects can be statistically detected. I confine myself to two different questions, both of which are of considerable importance to the practice.

The first question is: which are the statistically optimal [methods of combining single results? ...] a simple statistical correction is possible that [multi]plies the p value of the selected [resul]ts [by] the number of [tests]. Naturally, in this manner, the p value will be strongly i[ncreased], so that the statistical signifi[cance is reduced, as in] the case of a global sig[nificance test].

Most of the other methods consist in weighted combinations of the single results so as to attain a most efficient global significance test. In the case of standard psi experiments that seems trivial, because one needs only to add the different hits, whose sum can be evaluated with a CR just as well as the separate results. However, an analysis of intra- and interindividual distributions of psi scores shows that the simple addition of hits is one of the statistically least efficient methods, even for the aggregation of small experimental units such as individual runs. The reason for this lies in the strong variability of psi scores, which can vary even in bipolar fashion between psi-hitting and psi-missing, so that the [most efficient methods require] special (nonlinear) transformations weighting the single scores according to their size. Finally, following the method of the likelihood quotient, I came to a measure which is statistically most efficient for strongly varying psi scores and is a linear function of the well-known "run-score variance."

The second question refers to the identification of permissible forms of selection which one could use to increase the statistical efficiency. For example, the above definition of selection error allows one to exclude any partial results from the test of an experiment if the exclusion ensues ac[cording to a criterion] that, under the null hypothesis, is independent [of the outcome: particu]lar experimental situations, certain subjects, [or c]ertain variables, etc. This can be a great advantage, because [...] the stat[istical efficiency] decreases with the number of factors [...]. In the case of a multivariate experiment, criterion or predictor vari[ables may be handled] by performing a factor analysis [in] the case of correlated variables [...].
[...] Such errors serve the general psychological tendency [to bring] given empirical data [into agreement] with one's own expectations. Therefore, the final demand can only be to [be as objective in this] area as [in] mathematical statistics. Otherwise, with statistics one can prove everything.

EVALUATING FREE-RESPONSE RATING DATA

Sybo A. Schouten and Gert Camfferman (Parapsychology Laboratory, University of Utrecht, Sorbonnelaan 16, 3584 CA Utrecht, The Netherlands)

During recent decades the use of forced-choice methods in experimental research in parapsychology has gradually declined in favor of free-response techniques. A disadvantage of free-response techniques is that they are rather time consuming. The discrepancy in time investment between free-response and forced-choice studies seems acceptable only if it can be proven either that free-response studies are more sensitive for detecting ESP or that knowledge is gained from the process analysis which free-response studies allow. These two potential advantages of free-response studies require, however, more sensitive techniques for analyzing free-response data than evaluations based on hit/miss ratios, which are used with forced-choice methods.

An evaluation method often used in free-response studies is one that employs a different target set for each trial and has the subject assign ratings to all pictures of the set. A target set consists of a number of pictures from which one is randomly selected to serve as the actual target in the experiment. The others are used as controls. The rating values assigned to the pictures are based on the agreement between the mentation (reported or not) and the content (or perhaps the symbolic meaning of the content) of the pictures.

Based on the ratings assigned to each response, the pictures can be ranked and one of the familiar evaluation methods for preferential ranking may be applied. But by turning ratings into ranks, the greater sensitivity that the rating method might yield is lost. Hence, a statistical evaluation is needed which does credit to the higher sensitivity which ratings might offer. To this end, Z-scores are most often applied, first used and reported by Stanford and Mayer in 1974 (JASPR, 1974, 182-191).

When the free-response rating data of an experiment were analyzed by applying nonparametric tests to the Stanford Z-score distribution of the targets, a significant result indicating psi-missing was observed. However, it soon became clear that the result was purely artifactual and could be explained by the rating behavior of the subjects. This led us to study the properties of the Stanford Z-scores in more detail.

Hansen reported to the 1985 PA Convention (RIP 1985, 93-94) that Z-score distributions are bimodal. We found that Z-score distributions are nonnormal in all cases and are symmetrical, though bimodal, only when subjects select ratings with equal probability from the whole range. Decreasing the size of the target set yields flatter distributions. Decreasing the range of the ratings results in more irregular distributions. All distributions have an upper and a lower limit of Z-scores. In cases in which subjects select ratings with unequal probabilities from the range applied, the distributions become asymmetrical. Hence it can be concluded that rating behavior influences the distributions of Stanford Z-scores. This seems an important problem, because in many cases the conditions of the experiment will influence the rating behavior of the subjects. That implies that an influence of conditions on the rating behavior, and consequently on the Z-scores, must be eliminated before a proper evaluation of the difference as regards ESP scoring can be made.

Stanford Z-scores are also peculiar in some other respects. Their value and range are rather sensitive to the number of equal ratings assigned. In the case in which equal values are assigned, the actual size of the ratings has no influence on the size of the Stanford Z-score. For instance, 1-0-0-0-0 (the first rating is the target's) yields the same Stanford Z-score for the target as 100-0-0-0-0; in both cases the target receives a Stanford Z-score of +1.72. Hence, the Stanford Z-scores do not always reflect the similarities or differences between mentation and targets that subjects express in their assignment of rating values. Another complication is that when relatively many ratings of equal value are assigned, the Z-score distribution tends to become discrete rather than continuous. Especially since free-response studies in general involve few trials, the discrete character of such distributions violates the assumptions on which many parametric and nonparametric tests are based.
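These distributional peculiarities can be explored numerically. The following is a minimal sketch, assuming that a trial's Stanford Z-score is obtained by standardizing the target's rating against the mean and population standard deviation of all ratings in that trial's set; the exact normalization used by Stanford and Mayer may differ in detail (the value of +1.72 quoted above suggests a slightly different scaling), so the particular numbers printed are illustrative only. The function names and simulation parameters below are ours, not the authors'.

    # stanford_z_sketch.py -- illustrative only; the exact Stanford/Mayer
    # normalization may differ from the population-SD version assumed here.
    import random
    import statistics

    def stanford_z(ratings, target_index):
        """Standardize the target's rating against all ratings in the trial.

        Assumption: z = (r_target - mean(ratings)) / population SD(ratings).
        """
        mean = statistics.fmean(ratings)
        sd = statistics.pstdev(ratings)
        if sd == 0:
            return 0.0                       # all ratings equal: z undefined, use 0
        return (ratings[target_index] - mean) / sd

    # Scale invariance when the non-target ratings are all equal:
    print(stanford_z([1, 0, 0, 0, 0], 0))    # same value ...
    print(stanford_z([100, 0, 0, 0, 0], 0))  # ... as this one

    # Empirical Z distribution when ratings are drawn uniformly from 0..30:
    zs = [stanford_z([random.randint(0, 30) for _ in range(5)], 0)
          for _ in range(10_000)]
    print(min(zs), max(zs))                  # bounded above and below

Under this assumption the trials 1-0-0-0-0 and 100-0-0-0-0 produce identical target scores, illustrating the scale invariance noted above, and the empirical distribution is bounded, consistent with the upper and lower limits of the Z-scores described.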
To meet these objections, a different evaluation procedure based on a randomization test is proposed. The randomization test is based on the sum of the ratings over the trials. In assigning rating values to the control pictures it can be assumed that ESP can have no effect on these ratings. If we randomly select from each trial one control-picture rating value and take the sum of these ratings, then, based on all possible combinations of control-picture ratings over the trials, a distribution is obtained which will tend to be normal even when the ratings themselves were selected with unequal probabilities. The randomization test provides an answer to the question to what extent the sum, over the trials, of the ratings assigned to the target pictures deviates from the mean sum of the ratings assigned to the control pictures. Consequently, the sum of the ratings assigned to the target pictures is expressed as a standard normal score based on the distribution of the sum of the ratings assigned to the controls. This standard normal score will be called the "standardized sum-of-ratings score" or SSR score. A good approximation of this distribution is obtained by calculating its mean and standard deviation from the means and variances of the control ratings of the individual trials. The mean of the distribution of the sum of scores is equal to the sum of the means of the controls over the trials. The standard deviation is found by taking the square root of the sum of the variances of the control ratings over the trials. Since SSR scores can be assumed to be standard normal, their associated probability can be obtained from the standard normal distribution. SSR scores of different conditions can be compared because SSR scores not only can be considered standard normal but also are independent of differences between conditions in the range of ratings, rating behavior, or number of controls applied.
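As a concrete illustration of the computation just described, the following sketch standardizes the sum of the target ratings against the normal approximation whose mean is the sum of the per-trial control means and whose standard deviation is the square root of the sum of the per-trial control variances. The data structure and function names are ours; whether the population or the sample variance of the controls should be used is not stated in the text, so the population variance used here is an assumption.

    # ssr_sketch.py -- standardized sum-of-ratings (SSR) score, as described above.
    import math
    import statistics
    from statistics import NormalDist

    def ssr_score(trials):
        """trials: list of (target_rating, [control_ratings]) tuples.

        Returns the SSR score: the sum of the target ratings expressed as a
        standard normal deviate of the approximate distribution of the sum
        of randomly chosen control ratings over the trials.
        """
        target_sum = sum(t for t, _ in trials)
        mean_sum = sum(statistics.fmean(c) for _, c in trials)
        # Population variance assumed here (see lead-in).
        var_sum = sum(statistics.pvariance(c) for _, c in trials)
        return (target_sum - mean_sum) / math.sqrt(var_sum)

    # Example: three trials, each with one target rating and four control ratings.
    trials = [(25, [5, 10, 0, 15]),
              (18, [20, 2, 7, 11]),
              (30, [1, 0, 12, 6])]
    z = ssr_score(trials)
    print(z)                          # positive: targets rated above the controls
    print(1 - NormalDist().cdf(z))    # associated one-tailed probability

With the stated assumptions, a positive SSR score indicates that the targets were on the whole rated above the controls, and its one-tailed probability is read directly from the standard normal distribution, as in the last line.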
To obtain ESP scores for individual trials, the rating value assigned to the target is converted into a standardized average rating score for the target (SAR score). The distribution of the sum of ratings for the controls can be considered the distribution of ratings associated with that condition. Reduced to the level of individual trials, we assume this distribution to be typical of the condition and express all ratings in this distribution of average ratings. Thus, each rating is converted into a standard normal score by computing its distance from the mean of the average ratings for the controls of the trials and dividing it by the standard deviation observed for these average ratings. Then, for each trial, a SAR score for the target is defined as the difference between this standard normal score for the target and the average standard normal score for target and controls. Since the SAR scores are based on true standard normal scores, that is, scores obtained from a normal distribution, the SAR scores can be considered normal too. For each trial the sum of the SAR scores for controls and target is zero. Therefore, in the case of related samples we might compare individual achievement over conditions by calculating a product-moment correlation between the SAR scores of the two conditions.

Although the randomization test described above seems statistically sound, we studied its properties further, especially regarding its sensitivity to detect ESP. To this end we conducted a computer simulation of 100 "experiments" for each combination of two variables. Each experiment consisted of 20 trials with 5 pictures per trial and was simulated by randomly generating 20 rows of 5 numbers between the rating values 0 and 30, inclusive. The two variables involved were the subjects' rating behavior and the amount of ESP. For rating behavior we manipulated the probability of selecting rating values of zero. The amount of ESP was operationalized as the number of subjects assigning the highest rating value to the target in addition to what could be expected by chance.

From the data obtained it can be concluded that in most conditions the sensitivity of the SSR scores is rather low, and lower than when, for instance, a simple binomial test is applied. Only in extreme cases of rating behavior and amount of ESP do the SSR scores become more sensitive than the binomial test. For instance, in the case of 5 ESP hits, when in total 5 + 15/5 = 8 hits can be expected, the binomial yields an exact one-tailed probability of p = .01, whereas the SSR score yields on average a Z of 1.7, with an associated one-tailed probability of .045.

In the same simulation studies Stanford Z-scores were computed. We know that the distributions of these Z-scores are nonnormal, but leaving this aside, we found that in most cases the sensitivity of t-test evaluations based on Stanford Z-scores is comparable to that of evaluations based on SSR scores. However, SSR scores appear more sensitive than Stanford Z-scores in cases of strong ESP and extreme rating behavior.
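A rough re-creation of this kind of simulation clarifies how the SSR score is set against a simple binomial test on top-rated targets. The design parameters (20 trials, 5 pictures, ratings 0 to 30) follow the text, but the way zero-inflated rating behavior and "ESP hits" are injected, and all names below, are our assumptions, not a reproduction of the authors' program; the ssr_score helper repeats the earlier sketch so that this block runs on its own.

    # simulation_sketch.py -- rough re-creation of the sensitivity comparison;
    # how ESP and rating behavior are injected is our assumption, not the authors'.
    import random
    from math import comb, sqrt
    from statistics import NormalDist, fmean, pvariance

    def ssr_score(trials):
        """SSR score for a list of (target_rating, [control_ratings]) tuples."""
        target_sum = sum(t for t, _ in trials)
        mean_sum = sum(fmean(c) for _, c in trials)
        var_sum = sum(pvariance(c) for _, c in trials)
        return (target_sum - mean_sum) / sqrt(var_sum)

    def make_trial(p_zero):
        """Five ratings: each is 0 with probability p_zero, otherwise 1..30."""
        return [0 if random.random() < p_zero else random.randint(1, 30)
                for _ in range(5)]

    def run_experiment(p_zero=0.2, n_esp=0, n_trials=20):
        trials, esp_trials = [], set(random.sample(range(n_trials), n_esp))
        for i in range(n_trials):
            ratings = make_trial(p_zero)
            if i in esp_trials:
                ratings.sort(reverse=True)   # "ESP hit": target gets the top rating
            trials.append((ratings[0], ratings[1:]))   # position 0 = target
        return trials

    def binomial_p(hits, n=20, p=0.2):
        """Exact one-tailed P(X >= hits), X ~ Binomial(n, p)."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(hits, n + 1))

    trials = run_experiment(p_zero=0.5, n_esp=5)
    hits = sum(t > max(c) for t, c in trials)          # target strictly top-rated
    print("binomial p:", binomial_p(hits))
    print("SSR p     :", 1 - NormalDist().cdf(ssr_score(trials)))

Repeating run_experiment 100 times per condition and averaging the two probabilities gives a sensitivity comparison of the kind reported above, under the stated assumptions about how ESP enters the ratings.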
From these findings some practical conclusions can be drawn. In general we must assume that the ESP influence on the data is relatively small. Hence, unless there is reason to expect a strong ESP influence in the experiment, the binomial test can be assumed to be more sensitive than an evaluation based on the rating values. The same applies to experiments in which no extreme rating behavior can be expected, for instance, an experiment in which an atomistic approach to the judging is followed. In that case we expect in general nonzero ratings to be assigned to all pictures, and our findings show that in that case the SSR scores, as well as Stanford's Z-scores, are rather insensitive.

A METHODOLOGY FOR THE DEVELOPMENT OF A KNOWLEDGE-BASED JUDGING SYSTEM FOR FREE-RESPONSE MATERIALS

Dick J. Bierman (Dept. of Psychology, University of Amsterdam)

[Ce]rtai[n] judges perform consistently [better at match]ing [t]argets to a target set. It seems unlikely that this is pu[rely a consequenc]e of the judge's psi, since psi generally does not displ[ay such consis]tent behavior. Therefore, it might [be the (i]ntuitive) knowledge of the specific judge that accounts for [the bett]er performance on this task. It has been proposed (Morris, [...]AP, 1986, 137-149) that the use of expert systems might help researchers in tasks where they lack [expertise, such as the dete]ction of fraud. Morris argues that the expertise of magicians [can] be formalized in such a system and made available to each ind[ividual] researcher. Similarly, the exper[ti]se of the best judges [of free-re]sponse material could become avail[able in a k]nowledge-based free-response judging system. This [requires the u]se of techniques from the field of artificial intelligence (AI) to [...]. AI techniques for the representation of free-response [material have been proposed (Ma]ren, RIP 1986, 97-99). According to Maren, the free-response material and the protocols should be represented in the form of trees in which the nodes are perceivable "objects," like "flames," and the links represent relations, like "adjacent to." We expect that focusing our attention on the (knowledge used in the) human matching process might reveal more fundamental information about the role of the meaning of the material. It is striking that in Maren's proposed representation of complex target material only visual features are present. Actually, the type of visual matching that Maren proposes to be done by a machine can be better performed by a sighted human.
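To make the kind of representation attributed to Maren concrete, here is a minimal sketch of a node-and-link structure for a target description; the class names and the example content are ours and are purely illustrative of the idea of nodes as perceivable objects connected by relational links, not a reconstruction of Maren's or Bierman's system.

    # representation_sketch.py -- minimal node/link structure of the kind
    # described above; class names and example content are illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        label: str                                   # a perceivable "object", e.g. "flames"
        links: list = field(default_factory=list)    # (relation, Node) pairs

        def link(self, relation, other):
            self.links.append((relation, other))

    # A toy target description: flames adjacent to a wooden house.
    flames, house = Node("flames"), Node("wooden house")
    flames.link("adjacent to", house)

    for relation, other in flames.links:
        print(f"{flames.label} --{relation}--> {other.label}")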