Journal of Mathematical Psychology 53 (2009) 195–202
journal homepage: www.elsevier.com/locate/jmp
doi:10.1016/j.jmp.2008.09.004

p_rep: An agony in five Fits

Geoffrey J. Iverson a,b,*, Michael D. Lee a,b, Shunan Zhang a,b, Eric-Jan Wagenmakers c
a Department of Cognitive Sciences, University of California, Irvine, United States
b Institute for Mathematical and Behavioral Sciences, University of California, Irvine, United States
c Department of Psychology, University of Amsterdam, Netherlands
* Corresponding address: Department of Cognitive Sciences, 3151 Social Sciences Plaza, University of California, Irvine, CA 92697-5100, United States.
E-mail addresses: giverson@uci.edu (G.J. Iverson), mdlee@uci.edu (M.D. Lee), szhang@uci.edu (S. Zhang), ej.wagenmakers@gmail.com (E.-J. Wagenmakers).

Article info
Article history: Received 3 June 2008. Available online 22 November 2008.
Keywords: p_rep; Probability of replication; Posterior prediction

Abstract
In 2005 Psychological Science, the flagship journal of the Association for Psychological Science, began their current practice of asking contributors to compute the statistic p_rep in lieu of the traditional p-value. In a polemic comprising five Fits we argue that p_rep is misnamed, commonly miscalculated, misapplied outside a narrow scope, and its large variability often produces values that invite mistrust and mislead the interpretation of data.
Published by Elsevier Inc.

Prelude to the Agony

‘‘Come, listen, my men, while I tell you again,
The five unmistakable marks
By which you may know, wheresoever you go,
The warranted genuine Snarks.’’
The Hunting of the Snark: FIT THE SECOND, The Bellman’s Speech. Lewis Carroll, 1876.

In the May 2005 issue of Psychological Science Peter Killeen introduced the statistic p_rep to the psychological community. He describes p_rep as follows:

‘‘The statistic p_rep estimates the probability of replicating an effect. It captures traditional publication criteria for signal-to-noise ratio, while avoiding parametric inference and the resulting Bayesian dilemma. In concert with effect size and replication intervals, p_rep provides all of the information now used in evaluating research, while avoiding many of the pitfalls of traditional statistical inference’’. (Killeen, 2005a, Abstract).

At the time James Cutting was chief editor of Psychological Science and in an Acknowledgment (Cutting, 2005) that appeared in the December 2005 issue of Psychological Science, he wrote ‘‘and the General Article by Peter Killeen in the May issue may change how all psychologists report their statistics’’. This prediction has turned out to be accurate. Currently, about 60% of contributors to Psychological Science submit values of p_rep when reporting their statistical analyses.

p_rep is intended to be read ‘‘probability of replication’’, and gives the very strong impression that experiments yielding large values of p_rep (currently Psychological Science regards p_rep ≥ 0.85 as large¹) are replicable with high probability. Recently the euphemisms ‘reliable’ and ‘robust’ have crept into use, so that, for example, p_rep = 0.92 is said to indicate a reliable experimental finding. Whatever term is used, the unfortunate and misleading impression is that p_rep = 0.92 indicates an experimental effect has been established. This impression does not encourage substantive replication. If an experimental effect is remotely plausible and p_rep = 0.92, why bother to replicate?

¹ There is no editorial statement that stamps p_rep ≥ 0.85 as the gold standard. Indeed, Killeen (2005a,b,c) suggested p_rep ≥ 0.90. However, authors publishing in Psychological Science routinely declare values of p_rep = 0.86 and above as signaling significant effects. The first clear signs of hesitation occur when p_rep = 0.85, with some authors happy to declare this value significant, whereas others are reluctant to do so.

For its calculation, p_rep requires an analytical context, and to keep matters as simple as possible we shall assume throughout that this context is provided by the independent groups design in which the same number of measurements n is provided by each of an ‘experimental’ and a ‘control’ group.² All measurements are assumed to be mutually independent and normally distributed, with a common known³ variance σ².

² Note that Killeen uses n to denote the combined sample size from both the control and experimental groups, whereas we use n for each group separately. We prefer our approach, because it generalizes more naturally to cases where the number of subjects in each group is not the same.

The parameter of interest to the experimenter is the population effect δ = (µ_E − µ_C)/σ and is estimated by the experimental or substantive effect d = (x̄_E − x̄_C)/σ. Clearly d ∼ N(δ, 2/n) and, as is familiar from elementary statistical theory, d is ‘best unbiased’ for δ. The related quantity z = d√(n/2) is a familiar test statistic in this context. Under the standard null hypothesis H_0 : δ = 0, z is distributed as a standard normal variate (mean 0, variance 1) and one rejects H_0 whenever |z| ≥ z_{α/2} in carrying out the level-α Neyman–Pearson test procedure. Equally familiar is the practice of reporting an associated probability value, or p-value for short; p-values attach themselves to test statistics and in the present context the (two-sided) p-value attached to the statistic |z| is given by

\[ p\text{-value} = 2\,\Phi\!\left(-|d|\sqrt{n/2}\right) = 2\,\Phi(-|z|). \tag{1} \]
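The test statistic and the two-sided p-value of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration under the known-variance, equal-group-size setting described above; the function names and the input values d = 0.56, n = 25 are chosen here for illustration and are not part of the original text.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal distribution: Phi is .cdf, its inverse is .inv_cdf.
STD_NORMAL = NormalDist()


def z_statistic(d: float, n: int) -> float:
    """z = d * sqrt(n / 2) for two independent groups of n observations each."""
    return d * sqrt(n / 2)


def two_sided_p_value(d: float, n: int) -> float:
    """Eq. (1): p-value = 2 * Phi(-|z|)."""
    return 2 * STD_NORMAL.cdf(-abs(z_statistic(d, n)))


def rejects_null(d: float, n: int, alpha: float = 0.05) -> bool:
    """Level-alpha Neyman-Pearson test: reject H_0 when |z| >= z_{alpha/2}."""
    critical_value = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return abs(z_statistic(d, n)) >= critical_value


if __name__ == "__main__":
    d, n = 0.56, 25  # illustrative inputs (assumption, not taken from a data set)
    print(f"z = {z_statistic(d, n):.3f}")
    print(f"two-sided p-value = {two_sided_p_value(d, n):.4f}")
    print(f"reject H_0 at alpha = .05: {rejects_null(d, n)}")
```

Note that n is the per-group sample size, following the convention of this paper rather than Killeen's combined-sample convention.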
Killeen (2005a,b,c) rejects much of the standard frequentist estimation and inference machinery. He has no time for estimation:

‘‘But it is rare for psychologists to need estimates of parameters . . . ’’ (Killeen, 2005a, p. 345);

and even less for frequentist inference:

‘‘Our unfortunate historical commitment to significance tests forces us to rephrase [these] good questions in the negative, attempt to reject those nullities, and be left with nothing we can logically say about the questions—whether p = .100 or p = .001’’ (Killeen, 2005a, pp. 345–346).

Of course, Killeen is not alone in harboring a critical view of frequentist inference. We hold similar opinions. He is also not alone in calling for an alternative methodology. Now the Bayesian School has elaborated a principled, coherent and readily interpretable alternative to classical estimation and inference.

Killeen declares that his alternative is not Bayesian (Killeen, 2005a). Indeed, he offers his ideas as an alternative that avoids the ‘‘Bayesian dilemma’’ (of having to specify a prior distribution on δ). But as we shall soon see, p_rep is a Bayesian calculation, though one that is not carried out on a routine basis in Bayesian inference.

Killeen and Psychological Science propose that experimenters report an (estimate of) the probability that a repetition d_rep of an existing experimental effect d will agree with d in direction, and to do so in lieu of a conventional p-value. This probability of replication p_rep seems new, exciting, and extremely useful. Despite appearances however p_rep is misnamed, commonly miscalculated even by its progenitors, misapplied outside a common but otherwise very narrow scope, and its seductively large values can be seriously misleading. In short, Psychological Science has bet on the wrong horse, and nothing but mischief will follow from its continued promotion of p_rep as a scientifically informative predictive probability of replicability.

³ This unrealistic assumption is one of convenience only. It can be dropped, but to do so would involve us in analytical complications that distract from our main purpose. Our critique of p_rep in no way depends on the assumption of known variance.

FIT THE FIRST: In which p_rep is misnamed

‘‘When I use a word’’, Humpty Dumpty said, in a rather scornful tone, ‘‘it means just what I choose it to mean—neither more nor less.’’
Through the Looking-Glass: Humpty Dumpty, Lewis Carroll, 1872.

Killeen (2005a) chooses to ‘‘Define replication as an effect of the same sign as that found in the original experiment’’ (p. 346, emphasis in original). We think this definition is unfortunate and belies normal usage of the terms ‘replicate’ and ‘reliable’.

To attach a probability to this definition requires a model, and despite the obvious ‘‘Bayesian dilemma’’ Killeen invokes two Bayesian models, the fixed effects model and the random effects model. In the fixed effects model independent experiments are literally replicas of one another. That is, they are identical in all respects save for sampling variability, and that variability is the only source of differences among experimental outcomes. Let us call this model M_1 to distinguish it from the random effects model M_2 in which independent repetitions of an experimental protocol combine uncertainty in the population effect parameter δ with sampling variability. The standard calculation of p_rep is carried out under model M_1:

\[ p_{\mathrm{rep}} = \Pr\left(d \text{ and } d_{\mathrm{rep}} \text{ agree in sign} \mid d, M_1\right). \]

The calculation of p_rep is pictured in Fig. 1. It is the larger of the areas subtended by the posterior predictive f(d_rep | d) above and below zero. Since f(d_rep | d) is not available in frequentist theory, p_rep is a Bayesian construct.

Fig. 1. Two independent experimental effects d_1 and d_2 are drawn from the distribution f(d | δ) generating the data. Each draw gives rise to a different value of p_rep, shown by shaded areas. Note that if the true state of nature δ is close enough to zero, d_1 and d_2 can have opposite signs. Even so, it is clear that p_rep is always greater than 0.5.

We take exception to the terminology and notation that attends the definition of p_rep. The following definition seems more in line with standard English dictionaries and with dictionaries of statistical terms.

Definition 1. Independent experimental effects d_1 and d_2 replicate if (and only if) they are each generated under model M_1. That is, if they are each generated by the same value of δ.

Many experimental designs involve comparisons that invite checks of no effect (e.g., no expected effect of order of treatment or of sex or of age cohort). It is anticipated that these comparisons will rarely be significant, and at the same time it is expected that others repeating the same comparisons would reach similar conclusions. That is, experimental comparisons that everyone expects to reflect no or at most a very small effect are nonetheless thought of as highly replicable. This circumstance, which is a commonplace in every empirical science, is entirely in line with the above definition. In such cases measured effects will, over replications, bounce about zero, and there will be a low probability, near 50%, that any two randomly chosen effects agree in sign. For p_rep however, which

places a premium on experimental effects agreeing in sign, these reliable and replicable null experimental outcomes (which seem so essential for the construction of uncluttered and workable theory), are deemed unlikely to replicate and are scorned as unreliable.

To put things another way: if experimental effects are truly generated under model M_1, they will necessarily replicate according to our definition and it is then most puzzling why one goes to the trouble of computing the probability 1 − p_rep that they will not. Likewise, if repetitions of an experiment are generated under the random effects model M_2 then, according to our definition, they (almost certainly) will not replicate, so why ought one compute the probability p_rep that they will?

Definition 2. Two real numbers x_1 and x_2 concur if they agree in sign. That is, x_1 and x_2 concur if x_1 x_2 ≥ 0.

We believe that p_rep is misnamed: p_rep = Pr(d and d_rep concur | d) = Pr(d d_rep ≥ 0 | d), and a more appropriate notation would employ p_concur in place of p_rep. We shall nonetheless retain the notation p_rep throughout.

The distinction between replication and concurrence is shown pictorially in Fig. 2, in terms of three different combinations of d and d_rep. For true states of nature δ falling on the heavy diagonal line, effects d and d_rep replicate by definition. This means the combination of parameters A shows that observed effects can replicate but do not always concur. Conversely, combination B shows that observed effects can concur but not replicate. Only for the combination C do d and d_rep both replicate and concur.

Fig. 2. The distinction between the notions of ‘replication’ and ‘concurrence’, illustrated by three combinations of d and d_rep. The points A, B and C show different states of nature. The circular contours around each indicate the joint distribution of d and d_rep in each case. Combination A replicates but does not necessarily concur. Combination B concurs but does not replicate. Combination C replicates and concurs.

FIT THE SECOND: In which p_rep is miscalculated

‘‘Two added to one–if that could be done,’’
It said, ‘‘with one’s fingers and thumbs!’’,
Recollecting with tears how, in earlier years
It had taken no pains with its sums.
The Hunting of the Snark: FIT THE FIFTH, The Beaver’s Lesson. Lewis Carroll, 1876.

Of the 60% or so of authors who currently report p_rep values in Psychological Science, a large majority use the recipe of Killeen (2005c)⁴:

‘‘In particular, whenever a p value has been calculated, one can immediately infer p_rep by (a) calculating the z-score corresponding to 1 − p, (b) dividing by the square root of 2, and (c) finding the probability associated with this new z-score:

\[ p_{\mathrm{rep}} = \Phi\!\left[\Phi^{-1}[1 - p]\,/\sqrt{2}\,\right]\text{’’.} \tag{2} \]

⁴ Killeen (2005c) uses the symbol N to denote the cumulative distribution function of a standard normally distributed random variable. We use the Greek letter Φ.

Unfortunately, the computations of authors following this recipe are often wrong. The standard analytical expression for p_rep is⁵

\[ p_{\mathrm{rep}} = \Phi\!\left(|d|\sqrt{n/4}\right). \tag{3} \]

Here d is, as defined above, the observed effect in a comparison of two independent groups, each involving samples of size n. The accompanying (two-sided) p-value is given in Eq. (1).

⁵ An explicit calculation is indicated below in Eq. (7).

Putting Eqs. (1) and (3) together gives

\[ p_{\mathrm{rep}} = \Phi\!\left(\frac{\Phi^{-1}\!\left(1 - \frac{p}{2}\right)}{\sqrt{2}}\right). \tag{4} \]

The difference between Eqs. (2) and (4) appears to be minor. The p-value in Eq. (2) is not halved as it is in Eq. (4) but otherwise the two formulas are identical. Of course the two formulas Eq. (2) and Eq. (4) give different numerical results – a calculation via Eq. (2) is always smaller than via Eq. (4) – but often these differences are rather modest.

In its information for contributors, Psychological Science gives the following examples⁶:

‘‘Thus, typical statistical reports would follow formats like these:
t(50) = 2.68, p_rep = .95, d = 0.76; F(1, 30) = 4.69, p_rep = .90, η² = .135; or ρ = .61, p_rep = .99, d = 1.56’’.

⁶ This recommendation appears for the first time on the inside of the back cover of Psychological Science, 16(12), December 2005. It has remained there unchanged ever since.

For the first two examples, the correct calculation of p_rep via Eq. (4) gives, in turn, p_rep = .97 and p_rep = .91. These values are sufficiently close to the ones quoted in the Journal, namely p_rep = .95 and p_rep = .90, to elicit little more than a shrug. All the same there is unnecessary confusion over how to compute p_rep from a given p-value and it seems to us worthwhile to clarify the matter.

It might be argued that Eq. (2) is appropriate to the p-value from testing a one-sided hypothesis, and in part this is true. Since the one-sided p-value is one-half of the two-sided p-value based on the same data, Eqs. (2) and (4) should yield the same numerical answer. To see how things can (and presently do) go awry, consider how the editors of Psychological Science obtained p_rep = .95 from the fact that t(50) = 2.68. This value of Student’s t statistic gives p = .01 (two-sided) and p = .005 (one-sided). From Eq. (4) or (2) we have (correctly)

\[ p_{\mathrm{rep}} = \Phi\!\left(\frac{\Phi^{-1}(.995)}{\sqrt{2}}\right) = \Phi\!\left(\frac{2.58}{\sqrt{2}}\right) = \Phi[1.824] = .966. \]

What Psychological Science appears to have done instead is to compute the two-sided p-value, p = .01, and to plug that value into the formula Eq. (2) appropriate to the one-sided p-value. That mistaken calculation gives

\[ p_{\mathrm{rep}} = \Phi\!\left(\frac{\Phi^{-1}(.99)}{\sqrt{2}}\right) = \Phi\!\left(\frac{2.33}{\sqrt{2}}\right) = \Phi[1.648] = .95. \]
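The two calculations for t(50) = 2.68 can be checked numerically. The sketch below is only an illustration: the function names are invented here, and the p-values .01 (two-sided) and .005 (one-sided) are taken as reported in the text rather than recomputed from Student's t distribution.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal: .cdf is Phi, .inv_cdf is Phi^{-1}


def p_rep_eq2(p: float) -> float:
    """Eq. (2), Killeen's recipe: Phi(Phi^{-1}(1 - p) / sqrt(2))."""
    return PHI.cdf(PHI.inv_cdf(1 - p) / sqrt(2))


def p_rep_eq3(d: float, n: int) -> float:
    """Eq. (3): Phi(|d| * sqrt(n / 4)), with n observations per group."""
    return PHI.cdf(abs(d) * sqrt(n / 4))


def p_rep_eq4(p_two_sided: float) -> float:
    """Eq. (4): Phi(Phi^{-1}(1 - p/2) / sqrt(2)) for a two-sided p-value."""
    return PHI.cdf(PHI.inv_cdf(1 - p_two_sided / 2) / sqrt(2))


if __name__ == "__main__":
    p_two_sided, p_one_sided = 0.01, 0.005  # as reported for t(50) = 2.68

    # Correct: two-sided p in Eq. (4), or equivalently one-sided p in Eq. (2).
    print(f"Eq. (4), two-sided p: {p_rep_eq4(p_two_sided):.3f}")
    print(f"Eq. (2), one-sided p: {p_rep_eq2(p_one_sided):.3f}")

    # Mistaken: the two-sided p plugged into the one-sided recipe of Eq. (2).
    print(f"Eq. (2), two-sided p: {p_rep_eq2(p_two_sided):.3f}")
```

The first two lines agree at about .966, and the third reproduces the value .95 printed in the journal's information for contributors.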

It seems that both Killeen and Cumming were alert to the potential ambiguity in how to compute the value of p_rep from a given p-value, but their recommendations were buried in an Appendix (Killeen, 2005a) and a Table caption (Cumming, 2005). In any event a little thought shows that the correct connection between p_rep and the p-value from a one-sided test is not Eq. (2) but rather

\[ p_{\mathrm{rep}} = \Phi\!\left(\frac{\Phi^{-1}\!\left(\max\{p,\, 1 - p\}\right)}{\sqrt{2}}\right). \tag{5} \]

Calculation of p_rep must yield a number in the interval [1/2, 1] by its very definition as a posterior predictive probability (and recall Eq. (3) for explicit confirmation); p_rep never takes values below 1/2 and both Eqs. (4) and (5) respect this restriction. Allowing p_rep to take values in [0, 1/2], as is permitted under Eq. (2), is to invite a jarring collision between what p_rep is intended to report and what it does in fact report.

Suppose prior to an experiment you have convinced yourself that the outcome will reflect a negative true effect parameter δ, and you envisage a one-sided test of H_0 : δ ≥ 0 vs. H_1 : δ < 0. Your observed effect d turns out to be positive, contrary to expectations, and the one-sided p-value is 0.88. Eq. (2) gives p_rep = .20. Now this can only mean the following: you have observed an experimental effect that disagrees with expectations. Despite the evidence, you are fairly sure (1 − p_rep = .80) that a repetition of the experiment will yield a negative effect, in conflict with the data at hand but in agreement with your hypothesis. In other words, the evidence at hand has been overridden by your prior expectations and your view of the matter is supported by a small value of p_rep, and the smaller the better! Note that Eq. (5) gives the answer that is intended of a sensible posterior predictive probability of concurrence, namely p_rep = .80. The observed effect is positive and one has a legitimate Bayesian right to anticipate that a replication is more likely than not to produce a positive effect. We hasten to add, however, that this Bayesian prediction is by no means guaranteed to mirror the aleatory behavior of empirical replications. For a more detailed discussion of the critical distinction between substantive empirical replication and posterior predictive replication, consult the fourth and fifth Fits.

One might have expected that contributors to Psychological Science, not to mention reviewers and action editors, would have spotted the difficulty of interpretation that is built into Eq. (2) when a p-value exceeds 1/2, and to have corrected the matter by reporting the complement. Perhaps some did so, but certainly others did not; even Sanabria and Killeen (2007) quote a value of p_rep below 1/2. In Killeen (2005a, Figure 3) the trade-off between p_rep and the (one-sided) p-value based on Eq. (1) is abruptly cut off at p_rep = 1/2, inviting the reader to interpret the tradeoff for p-values greater than 1/2.

FIT THE THIRD: In which p_rep is misapplied

‘‘That’s a great deal to make one word mean,’’ Alice said in a thoughtful tone. ‘‘When I make a word do a lot of work like that,’’ said Humpty Dumpty, ‘‘I always pay it extra.’’
Through the Looking-Glass: Humpty Dumpty, Lewis Carroll, 1872.

The (incorrect) formula Eq. (2) for computing p_rep invites the unwary to carry out the indicated calculation whenever a p-value is available, regardless of the context in which the p-value arose. But it is wise to recall from the first Fit that p_rep is a posterior probability of concurrence, and that last term requires for its very meaning the notion of sign or direction of effect. What is the (unambiguous) direction associated with an interaction in a 3 × 4 ANOVA, or for that matter the fact that the main effect of each variable is significant? More generally, what sense of direction of effect is indicated by the fact that one cognitive model outperforms another on some body of data, as considered by Ashby and O’Brien (2008). As a careful reading of Ashby and O’Brien (2008) shows, their notion of replicability amounts to conventional power or something very similar. Many authors (e.g., Greenwald, Gonzalez, Guthrie, and Harris (1996), Oakes (1986) and Tversky and Kahneman (1971)) earlier used power as a means of quantifying ‘replicability’. But power, the complement of a Neyman–Pearson long-term error rate, is antithetical to Killeen’s views on statistical inference: ‘‘but once p_rep is determined, calculation of traditional significance is a step backward’’, (Killeen, 2005a, p. 349).

While we are on the topic of power, it is noteworthy that p_rep can be viewed as a predictive power calculation. One natural interpretation of predictive power is given in the following definition and calculation⁷:

\[ \pi(\alpha, d) = \Pr\!\left(d_{\mathrm{rep}}\sqrt{n/2}\;\operatorname{sgn} d \ge z_\alpha \,\middle|\, d\right) = \Phi\!\left(\frac{|d|\sqrt{n/2} - z_\alpha}{\sqrt{2}}\right) \]

and it is seen at once that for α = 1/2, π(1/2, d) = p_rep. In plain words, when significance means concurrence (and this is achieved when α = 1/2), p_rep is predictive power. The trade-off between Type I and Type II errors ensures that a large value of α is accompanied by a boost in power. No wonder then that p_rep so often returns large values that can mislead the casual consumer (see the fourth and fifth Fits for further detailed discussion).

⁷ The signum function, abbreviated sgn, indicates the sign (+1 or −1) of a real variable. It is convenient to adopt the convention that sgn(0) = 1.

As a concrete numerical example, suppose you have obtained an experimental effect d = 0.56 based on a sample size n = 25. One computes predictive power = .59 and this provides but modest confidence that a replication will be significant at α = .05 (in the same direction as the original). In contrast p_rep = .92. The message conveyed by predictive power seems somewhat cautious in the first case (α = .05) but quite optimistic in the second (α = .5). The inflated confidence expressed by p_rep is revealed as a legerdemain arising from the mere shift of a decimal point.

Recently Psychological Science seems to have realized that the calculation of p_rep must be confined to its original scope, the simple two independent groups design, and that it does not readily extend beyond that limited scope (it does however extend to linear contrasts in ANOVA and to some analogous problems in regression). It is becoming increasingly common for the same author to report p_rep in a two-group comparison, but to switch to p-value for all other tests.⁸ This is terribly awkward, and anyway prompts the question: why not report p-values for all tests, as was done routinely before the p_rep era? The answer of course is that, for a variety of good reasons, p-values themselves have been regarded as unsatisfactory and misleading. Wagenmakers (2007) gives an extensive review of the many shortcomings of p-values that have been exposed and discussed at length in the literature.

⁸ In his final editorial, Cutting (2007) mixes p_rep and p-value without comment.

We thus discover that p_rep is not only equally unsatisfactory as the p-value when used as a test statistic, it is at the same time considerably more restricted in its scope and interpretation as an object of evidentiary import.

FIT THE FOURTH: In which p_rep invites mistrust

‘‘I quite agree with you’’, said the Duchess; and the moral of that is – ‘Be what you would seem to be’ – or, if you’d like to put it more simply—‘Never imagine yourself not to be otherwise than what it might appear to others that you were or might have been was not otherwise than what you had been would have appeared to them to be otherwise’’.

G.J. Iverson et al. / Journal of Mathematical Psychology 53 (2009) 195–202
199
Alice’s Adventures in Wonderland: The Mock Turtle’s Story.
Lewis Carroll, 1865.
Killeen (2005a, p. 349) discusses prep as a statistical estimator,
saying
‘‘As is the case for all statistics, there is sampling variability
associated with prep, so that any particular value of prep may be
more or less representative of the values found by other studies
executed under similar conditions. It is an estimate’’. [emphasis
added].
The leading question is: What exactly is prep estimating? Address-
ing this question brings out the large variability of prep that all too
frequently produces large numerical values, giving a naive con-
sumer a misleading and exaggerated sense of optimism that a rep-
etition of an experiment will concur with a given one.
Suppose you know the value of the population effect parameter
?. You have in hand an observed effect d based on a per-group
Fig. 3. An example of the density of the posterior random variable ? (? sgn d | d),
calculated using Eq. (8). Also shown are p
sample size n. Suppose a repetition of your experiment yields an
rep , which is the mean of the distribution,
1 ? p/2, which is the median, and p/2, which is the area under the density from 0
independent observed effect drep. What is the probability that the
to 0.5.
two effects agree in sign (concur)? An elementary calculation gives
Pr drep concurs with d | d, ? = Pr ddrep ? 0 | d, ?
prep = .92 and 1 ? p/2 = .976. In contrast, the 95% HPD credible
interval is the rather modest prediction .63 ? Pr (concur | d, ?) ?
n
1. This broad credible interval for Pr (concur | d, ?) comes about
= ? ? sgn d 2
because, regarded as a function of the random variable
?
? | d,
the probability density of Pr (concur | d, ?) = ?(? n/2 sgn d)
= ? (? sgn d) .
(6)
is strongly skewed towards 1 , as shown in Fig. 3. In other words,
?
2
there is considerable posterior uncertainty about the probability
Here and below it is often convenient to write ? = ? n/2; ?
is a ‘non-centrality’ parameter, which determines power, familiar
that a future effect will concur with an original. A very similar
from classical inference. Note that if d and ? disagree in sign, you
and equally undesirable skew attends the predictive density of p-
would base your prediction on the sign of ?, not on the sign of
values (Cumming, in press), and essentially for the same reasons.
?
d, and your predictive probability (Eq. (6)) would be less than 1 .
This stands in contrast to the prediction afforded by $p_{\text{rep}}$ that relies on the sign of $d$, and which takes on values that are necessarily larger than $\tfrac{1}{2}$. We often abbreviate $\Pr(d_{\text{rep}} \text{ concurs with } d \mid d, \delta)$ as $\Pr(\text{concur} \mid d, \delta)$.

Of course one does not know $\delta$, and it seems natural therefore to estimate $\Pr(\text{concur} \mid d, \delta)$. $p_{\text{rep}}$ is the estimator proposed by Killeen (2005a) to do the job. We note that $\Pr(\text{concur} \mid d, \delta)$ can be viewed, though very differently, from both a Bayesian and a frequentist standpoint and we discuss each interpretation in turn.

For Bayesians, it is natural to consider $\Pr(\text{concur} \mid d, \delta)$ as a function of posterior belief $f(\delta \mid d)$. Indeed we have, from Killeen (2005a,b,c), Sanabria and Killeen (2007); and especially Cumming (2005), Doros and Geier (2005), and Macdonald (2005),

$$p_{\text{rep}} = E\left[\Pr(\text{concur} \mid d, \delta)\right] = \int \Pr(\text{concur} \mid d, \delta)\, f(\delta \mid d)\, d\delta, \tag{7}$$

in which the expectation is taken over the posterior distribution⁹ of $\delta$. On the other hand it is unlikely that Bayesians would routinely summarize their posterior belief concerning $\Pr(\text{concur} \mid d, \delta)$ by computing a single number such as $p_{\text{rep}}$ or alternatively $1 - p/2$ (which, as it happens, is the median value of $\Pr(\text{concur} \mid d, \delta)$), when the entire posterior distribution of belief is available. If a summary measure is desired it is more informative to give a credible interval of values. In particular, the inequalities

$$\Phi\left(\sqrt{n/2}\,|d| - z_{\alpha}\right) \le \Pr(\text{concur} \mid d, \delta) \le 1,$$

give the endpoints for the $(1 - \alpha)100\%$ highest probability density (HPD) credible interval. For example, $d = .56$ and $n = 25$ yields

An example of the density of $\Phi(\delta\sqrt{n/2}\,\operatorname{sgn} d \mid d)$ is shown in Fig. 3. The analytic form is as follows: for $0 \le t \le 1$

$$f(t) = \exp\left(\Phi^{-1}(t)\,|z|\right) \exp\left(-z^{2}/2\right), \tag{8}$$

in which $z = d\sqrt{n/2}$ is the z-score corresponding to the observed effect $d$. This density first appeared as a histogram based on a small-scale simulation in Cumming (2005). The most striking feature of the density is the large negative skew that is responsible for broad credible intervals.

Another figure helps to explain why $p_{\text{rep}}$ is often quite large (e.g. $p_{\text{rep}} \ge .85$), even though the true state of nature $\delta$ is quite small and is thus likely to generate many more effects that conflict with an original than are predicted by $1 - p_{\text{rep}}$. In Fig. 4 an observed value of $d$ is imagined to arise from a value of $\delta$ that with probability $\tfrac{1}{2}$ is larger than $d$, and with probability $\tfrac{1}{2}$ is smaller. Three replications that might arise under a value of $\delta$ that exemplifies each possibility are shown as open circles. $p_{\text{rep}}$ is computed as a weighted average over all such imagined scenarios, the weights being provided by the posterior distribution $f(\delta \mid d)$.

Fig. 4 makes it clear that averaging over posterior uncertainty in $\delta$ will often produce large values for $p_{\text{rep}}$, mainly because $\Pr(\text{concur} \mid d, \delta) \approx 1$ when $\delta > d$, even though the true state of nature might be more like the one shown in the lower branch for which replicates can frequently be negative, in conflict with the original.

For a frequentist $\delta$ is unknown but fixed, and as a statistic (i.e., as a function on the sample space) $\Pr(\text{concur} \mid d, \delta) = \Phi(\lambda\,\operatorname{sgn} d)$ is the following dichotomous random variable:

$$\Pr(\text{concur} \mid d, \delta) = \begin{cases} \Phi(\lambda) & \text{with probability } \Phi(\lambda), \\ \Phi(-\lambda) & \text{with probability } \Phi(-\lambda). \end{cases} \tag{9}$$

The value $\Phi(|\lambda|)$ of $\Pr(\text{concur} \mid d, \delta)$ arises whenever $d$ and $\delta$ concur; the value $\Phi(-|\lambda|) = 1 - \Phi(|\lambda|)$ arises if $d$ and $\delta$ conflict in sign. Note that of the two values $\Phi(\lambda)$ and $\Phi(-\lambda)$ one is necessarily $\ge \tfrac{1}{2}$ whereas the other is $\le \tfrac{1}{2}$.

⁹ If one adopts a flat prior on $\delta$ (i.e., $f(\delta) \propto 1$), it is well known that the posterior density of $\delta \mid d$ turns out to be normal with mean $d$ and variance $2/n$. The integral in Eq. (7) is then straightforward and gives the standard expression in Eq. (3) for $p_{\text{rep}}$.
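Eqs. (7) and (8) can be checked numerically. The following is a minimal sketch of ours, not part of the original analysis. It assumes the flat-prior posterior of footnote 9, so that $\lambda = \delta\sqrt{n/2}$ given $d$ is normal with mean $z$ and unit variance; the function names, the midpoint-rule integration, and the choice $\alpha = .05$ are illustrative assumptions.

```python
from math import exp, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}


def z_score(d, n):
    """z = d * sqrt(n / 2), the z-score of an observed effect d."""
    return d * sqrt(n / 2)


def concur_density(t, z):
    """Eq. (8): posterior density of Pr(concur | d, delta) at 0 < t < 1."""
    return exp(STD.inv_cdf(t) * abs(z)) * exp(-z * z / 2)


def p_rep_closed_form(z):
    """Standard expression p_rep = Phi(|z| / sqrt(2))."""
    return STD.cdf(abs(z) / sqrt(2))


def p_rep_by_integration(z, steps=4000, half_width=8.0):
    """Eq. (7) on the lambda scale: average Phi(lambda * sgn d)
    over the assumed posterior lambda | d ~ N(z, 1), midpoint rule."""
    sign = 1.0 if z >= 0 else -1.0
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        lam = z - half_width + (i + 0.5) * h
        total += STD.cdf(lam * sign) * STD.pdf(lam - z) * h
    return total


def hpd_lower_bound(z, alpha=0.05):
    """Lower endpoint Phi(|z| - z_alpha) of the (1 - alpha) HPD interval."""
    return STD.cdf(abs(z) - STD.inv_cdf(1 - alpha))


z = z_score(0.56, 25)
print(round(p_rep_closed_form(z), 4), round(p_rep_by_integration(z), 4))
print(round(hpd_lower_bound(z), 3))
```

For $d = .56$ and $n = 25$ both routes give $p_{\text{rep}} \approx .92$, while under the assumed $\alpha = .05$ the HPD interval for $\Pr(\text{concur} \mid d, \delta)$ runs from roughly .63 to 1, which illustrates how much posterior uncertainty the single number conceals.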

200
G.J. Iverson et al. / Journal of Mathematical Psychology 53 (2009) 195–202
Fig. 4. Given an observed effect $d$, the possibility that $\delta > d$ is exemplified in the upper scenario on the right, which shows three independent replicate effects as open circles. Equally likely is the possibility that $\delta < d$ and a typical scenario is depicted in the lower branch. $p_{\text{rep}}$ is computed as a weighted average over all such scenarios, the weights being provided by the posterior distribution $f(\delta \mid d)$.

Fig. 5. The sampling density of $p_{\text{rep}}$ for $|\lambda| = 0$, $1$, and $1.5$.

Fig. 6. The expected values of $p_{\text{rep}}$ and $\Pr(\text{concur} \mid d, \delta)$ as functions of the non-centrality parameter $\lambda$, with equivalent representative values for effect size $\delta$ and sample size $n$ shown. The systematic bias in $p_{\text{rep}}$ accounts for the difference between the two curves. The large variability in $p_{\text{rep}}$ is evident in the large error bars that represent the 95% equal area intervals.

$p_{\text{rep}} = \Phi(|d|\sqrt{n/4})$ is concentrated on $[\tfrac{1}{2}, 1]$, and as an estimator for $\Pr(\text{concur} \mid d, \delta)$, which takes values in $[0, 1]$, its inability to take values in $[0, \tfrac{1}{2})$ presents it with a very difficult challenge. This restriction on range is especially worrisome for small-to-moderate values of $|\lambda|$. For example if $|\lambda| \approx 0$ the values $\Phi(\lambda)$ and $\Phi(-\lambda)$ are each approximately $\tfrac{1}{2}$ whereas we know from its null distribution that the median value of $p_{\text{rep}}$ is $\Phi\left(\sqrt{1/2}\,\Phi^{-1}(3/4)\right) \approx .68$ and its expected value is $\tfrac{1}{2} + \tfrac{1}{\pi}\arcsin\left(\sqrt{1/3}\right) \approx .70$.

In Fig. 5 we plot the distribution of $p_{\text{rep}}$ for several values of $\lambda$. These functions are members of the following family of expressions indexed by $|\lambda|$:

$$f_{p_{\text{rep}}}(t) = 2\sqrt{2}\, \exp\left(-\lambda^{2}/2\right) \cosh\left(\sqrt{2}\,\Phi^{-1}(t)\,\lambda\right) \exp\left(-\left[\Phi^{-1}(t)\right]^{2}/2\right), \qquad \tfrac{1}{2} \le t \le 1.$$

The null distribution corresponds to $\lambda = 0$.

For small values of $|\lambda|$, $p_{\text{rep}}$ overestimates its target, often by a large amount. On the other hand, for a sufficiently large value of $|\lambda|$ one has $\Phi(|\lambda|) \approx 1$, whence $p_{\text{rep}} \approx 1$ as well. This last rather trite fact does not, however, make $p_{\text{rep}}$ a particularly good estimator for $\Pr(\text{concur} \mid d, \delta)$ even though both probabilities approximate 1. As we will see in Fig. 6, for large $|\lambda|$, $p_{\text{rep}}$ systematically underestimates $\Pr(\text{concur} \mid d, \delta)$ and the silly estimator 1 will often do a better job.

One can understand visually the poor performance of $p_{\text{rep}}$ by plotting its expected value and the long-run expected value of $\Pr(\text{concur} \mid d, \delta)$, which equals $\Phi^{2}(|\lambda|) + \Phi^{2}(-|\lambda|)$, on the same axes, as functions of $\lambda$. These plots are given in Fig. 6. It is evident that the expected value $E[p_{\text{rep}}]$ is much larger than $E[\Pr(\text{concur} \mid d, \delta)]$ for small values of $\lambda$, but is dominated by it for larger values of $\lambda$. The error bars represent the 95% equal area intervals of the sampling distribution of $p_{\text{rep}}$. The bias and imprecision of $p_{\text{rep}}$ as an estimator are evident for all but large effects or sample sizes (values of $\lambda$ in excess of 3.5), for which it is close to 1.

FIT THE FIFTH: The Psychological Science action editor’s dilemma

He was thoughtful and grave—but the orders he gave
Were enough to bewilder a crew.
When he cried “Steer to starboard, but keep her head larboard!”
What on earth was the helmsman to do?

The Hunting of the Snark: FIT THE SECOND, The Bellman’s Speech. Lewis Carroll, 1876.

A fundamental distributional difference underlies the construction of $p_{\text{rep}}$ and $\Pr(\text{concur} \mid d, \delta)$. Under model $M_1$ substantive effects are independent draws from a normally distributed population, with mean $\delta$ and variance $2/n$. These effects are what action editors examine when replications of an original experiment come across their desks; these draws, provided by science, determine $\Pr(\text{concur} \mid d, \delta)$. On the other hand, $p_{\text{rep}}$ is the probability of an event involving values of $d_{\text{rep}} \mid d$ and those values are draws from a normally distributed predictive distribution, with mean $d$ and variance $4/n$. Values of $d_{\text{rep}} \mid d$ are not substantive replicates, and they are not independent. To confuse them with independent, substantive replicates is a mistake.

With this remark in mind we now examine how $p_{\text{rep}}$ can misinform the important business of scientific induction that is carried out daily by authors, reviewers and action editors.

Suppose you are an action editor for Psychological Science and a paper for review comes across your desk that reports a surprising and somewhat controversial finding. The evidence for the effect in question is summarized in the following data: $d = .56$, $n = 25$, $z = 1.98$, $p$ (two-sided) $= .05$, $p_{\text{rep}} = .92$. If this finding is true it will wrinkle the theoretical cloth of an important branch
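The null-distribution constants, the sampling density of $p_{\text{rep}}$, and the bias visible in Fig. 6 can be reproduced with a short sketch. It is ours and rests on stated assumptions: the observed z-score is taken to be normal with mean $\lambda$ and unit variance, integrals use a simple midpoint rule, and all function names are hypothetical.

```python
from math import asin, cosh, exp, pi, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}


def p_rep_density(t, lam):
    """Sampling density of p_rep at 1/2 < t < 1 for non-centrality lam."""
    y = STD.inv_cdf(t)
    return (2 * sqrt(2) * exp(-lam * lam / 2)
            * cosh(sqrt(2) * y * lam) * exp(-y * y / 2))


def density_mass(lam, steps=20000):
    """Midpoint-rule integral of the density over (1/2, 1); should be near 1."""
    h = 0.5 / steps
    return sum(p_rep_density(0.5 + (i + 0.5) * h, lam) * h for i in range(steps))


def expected_p_rep(lam, steps=4000, half_width=8.0):
    """E[p_rep] = E[Phi(|Z| / sqrt(2))] with Z ~ N(lam, 1) (assumed model)."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        zval = lam - half_width + (i + 0.5) * h
        total += STD.cdf(abs(zval) / sqrt(2)) * STD.pdf(zval - lam) * h
    return total


def long_run_concurrence(lam):
    """E[Pr(concur | d, delta)] = Phi^2(|lam|) + Phi^2(-|lam|)."""
    return STD.cdf(abs(lam)) ** 2 + STD.cdf(-abs(lam)) ** 2


# Null distribution (lam = 0): median and mean of p_rep.
NULL_MEDIAN = STD.cdf(sqrt(0.5) * STD.inv_cdf(0.75))
NULL_MEAN = 0.5 + asin(sqrt(1 / 3)) / pi

print(round(NULL_MEDIAN, 3), round(NULL_MEAN, 3))
for lam in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(lam, round(expected_p_rep(lam), 3), round(long_run_concurrence(lam), 3))
```

The printed table compares the two curves of Fig. 6 at a handful of values of $\lambda$: the expected value of $p_{\text{rep}}$ lies above the long-run concurrence probability when $\lambda$ is small and below it when $\lambda$ is large.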

of experimental psychology, and likely promote new directions for empirical and theoretical research.

The referees are enthusiastic and you accept the article for publication. A few months after publication, independent experimental replications make an appearance. In fact the first three lie unopened on your desk. Before opening any one, you ask yourself what you expect on the basis of $p_{\text{rep}}$. All three replications might exhibit positive effects, or none might. You quickly tabulate the various possibilities and the accompanying probabilities as determined by $p_{\text{rep}}$ and the Binomial distribution.¹⁰ You find the probabilities for 0, 1, 2 and 3 concurrences to be .00, .02, .20 and .78, respectively.

Armed with your calculations, you are confident that at least two of the three replications will concur with the original, and you would not be surprised if all three did so. Opening the new submissions you are dismayed to discover that two of the three articles report negative effects, in conflict with the original. The relevant data are: $d_{\text{rep}_1} = 0.40$, $n = 25$, $z = 1.41$, $p$ (two-sided) $= .16$, $p_{\text{rep}} = .84$; $d_{\text{rep}_2} = -0.08$, $n = 25$, $z = -0.28$, $p$ (two-sided) $= .78$, $p_{\text{rep}} = .58$; and $d_{\text{rep}_3} = -0.03$, $n = 25$, $z = -0.11$, $p$ (two-sided) $= .91$, $p_{\text{rep}} = .53$.

The authors who obtained the positive effect $d_{\text{rep}} = 0.40$ claim a replication of the original, despite the rather large p-value, and in their discussion call for an aggressive experimental foray along lines suggested by the original finding. In marked distinction, the authors who found small negative effects are quite critical of the original finding and state quite clearly that their efforts to replicate had failed utterly, and that little purpose would be served by pursuing this particular line of research.

As action editor how are you to react to these unexpected and conflicting findings? Did something quite unusual occur, or is there a subtle causal artifact at work that would explain the two negative outcomes, one that you suspect will be hard if not impossible to uncover?

Neither of these reactions is warranted. In fact the data are quite consistent with one another and with the model $M_1$ that underlies the standard calculation of $p_{\text{rep}}$; and there are multiple considerations that support this view of the data. Based on the original data, the 95% posterior predictive interval for future replications is $[-0.22, 1.34]$ and this interval¹¹ readily accommodates all of the observed replicate effects, though it is not obliged to do so. A $\chi^{2}$ test of the assumption that all four effects are replicates (i.e., that all four are generated by a common value of $\delta$) yields $\chi^{2}(3) = 3.75$, and this $\chi^{2}$ value is not close to signaling any significant differences among the various experimental outcomes. Further, all data are compatible with a true value of $\delta$ about 0.20 (note that the arithmetic average of all four experimental effects is 0.21).¹² Assuming for illustration that $\delta = 0.20$, the probability that any single replicate effect will be negative is .23, and consequently the observed pattern of replications (2 negative, 1 positive) is expected to occur about 12% of the time; this last probability is 6 times larger than the corresponding binomial prediction based on $p_{\text{rep}}$. We remind the reader of the lower branch of Fig. 4.

Postlude to the agony: Caveat emptor

He had bought a large map representing the sea,
Without the least vestige of land:
And the crew were much pleased when they found it to be
A map they could all understand.

The Hunting of the Snark: FIT THE SECOND, The Bellman’s Speech. Lewis Carroll, 1876.

If you must use $p_{\text{rep}}$ do so with caution, a deliberate purpose in mind, and with full awareness of its shortcomings as an estimator of $\Pr(\text{concur} \mid d, \delta)$. As we have seen, $p_{\text{rep}}$ does not quantify ‘replicability’ of experimental effects, it does not appear to generalize beyond linear contrasts, and as an estimator of concurrence it is unreliable and is in fact not even consistent.¹³

To buy into $p_{\text{rep}}$ as it is currently promoted by Psychological Science is to buy into the significance fallacy, the belief that significant effects are highly reliable and replicable (Oakes, 1986; Tversky & Kahneman, 1971).¹⁴ Not only does $p_{\text{rep}}$ encourage that erroneous belief, it sanctions it with the authority and precision of a quantitative calculation.

In particular do not use $p_{\text{rep}}$ as Psychological Science currently does, merely as a convenient way to lower the bar on conventional criteria for significance, allowing Type I errors to triple in frequency over conventional 5% rates, not to mention sanctioning a fourteen-fold increase over the more conservative (but often preferable) 1% standard. Despite the encouraging words that have been bandied about in praise of $p_{\text{rep}}$, the fact remains that $p_{\text{rep}} = .85$ corresponds to a p-value of .14 and $p_{\text{rep}} = .90$ corresponds to a p-value of .07. If, based on a p-value of .14, you would not reject the possibility that $|\delta|$ is quite small, perhaps negligible, why would you offer much better than even odds that a substantive replication would agree in sign with an original?

You protest: surely 85% and 90% are much closer to 100% than they are to 50%. Our response is that this simple fact of arithmetic is as misleading as it is true (for reasons detailed in our fourth and fifth Fits). The probability scale provided by $p_{\text{rep}}$ (or any other probability value for that matter) is the wrong metric on which to evaluate evidence about $\delta$. If you had asked different questions of your data, for instance what does your value of $p_{\text{rep}}$ tell you about $\delta = 0$ versus $\delta \ne 0$, we would encourage you to compute a ratio of probabilities. The resulting Bayes Factor, a ratio of probability densities, is a sensible and readily interpretable means of evaluating (relative) evidence (e.g., Bernardo and Smith (1994), Kass and Raftery (1995) and Lee and Wagenmakers (2005)).¹⁵ It should be the routine business of authors contributing to Psychological Science or any other journal of scientific psychology to report Bayes Factors. Presently only a small handful do so, in stark contrast to current practice in the statistical community.

¹⁰ Actually, this is not the way Bayesian posterior predictive probabilities are computed. However the correct calculations do not change the conclusions of this analysis.

¹¹ Both frequentists and Bayesians (assuming a flat prior on $\delta$) agree on the form of this predictive interval, though they differ considerably on its interpretation. The general form of the interval is $d - z_{\alpha/2}\sqrt{4/n} \le d_{\text{rep}} \le d + z_{\alpha/2}\sqrt{4/n}$.

¹² The symmetric 95% HPD credible interval for $\delta$ based on $d$ is $[.01, 1.11]$. Based on all four experimental effects it is $[-.067, 0.487]$.

¹³ The term ‘consistent’ is standard in statistics (Casella & Berger, 2002). A sequence of estimators $\hat{\theta}_n$ is consistent for a parameter $\theta$ if, for every $\epsilon > 0$ and every $\theta$, $\lim_{n \to \infty} \Pr\left(|\hat{\theta}_n - \theta| \ge \epsilon\right) = 0$. This property, which is usually the very least one requires of an estimator, obviously does not hold for $p_{\text{rep}}$. When $\delta = 0$ we have $\Pr(\text{concur} \mid d, \delta) = \tfrac{1}{2}$; and since, with probability 1, $p_{\text{rep}}$ does not take on the value $\tfrac{1}{2}$ it can hardly be said to be consistent for its target.

¹⁴ The term ‘significance fallacy’ is our terminology. Oakes called the common but unjustified belief in the replicability of significant effects the ‘significance hypothesis’, whereas Tversky and Kahneman discussed the matter in terms of a folk-theorem, a ‘law of small numbers’, in a particular example of their more general study of representativeness.

¹⁵ Under model $M_1$ and a flat prior on $\delta$, a Bayes Factor for selecting between the hypotheses $H_0 : \delta = 0$ and $H_1 : \delta \ne 0$ is given by $B_{01} = \sqrt{n}\,\exp\left(-z^{2}/2\right)$. When $d = 0.56$ and $n = 25$, one has $z = 0.56 \times 5/\sqrt{2} = 1.98$ and $B_{01} = 0.7$. Yes, the data favor $H_1$ over $H_0$, but by a factor that is scarcely worth the mention. If a priori you believed $\Pr(H_0) = .5$, a posteriori you believe $\Pr(H_0 \mid d) = .41$; if a priori you believed $\Pr(H_0) = .1$ the data have modified your belief so that $\Pr(H_0 \mid d) = .065$. In either case, the data $d = 0.56$ and $n = 25$ are pretty much inconsequential. What price then the frequentist asterisk and the declaration “significant at level .05”?
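The action editor's arithmetic in the fifth Fit can be retraced in a few lines. This sketch is ours and makes its assumptions explicit: the naive forecast treats the three replications as independent Bernoulli trials with success probability $p_{\text{rep}}$ (the shortcut flagged in footnote 10), the homogeneity statistic is the sum of squared deviations from the mean effect divided by the variance $2/n$, and the illustrative true effect is taken to be the unrounded average of the four effects rather than exactly 0.20. All names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist, fmean

STD = NormalDist()
N = 25
D_ORIGINAL = 0.56
D_REPLICATES = [0.40, -0.08, -0.03]


def p_rep(d, n):
    """p_rep = Phi(|d| * sqrt(n / 4))."""
    return STD.cdf(abs(d) * sqrt(n / 4))


def binomial_pmf(k, trials, prob):
    """Probability of exactly k successes in independent trials."""
    return comb(trials, k) * prob ** k * (1 - prob) ** (trials - k)


def predictive_interval(d, n, level=0.95):
    """d +/- z * sqrt(4 / n), the interval form given in footnote 11."""
    half = STD.inv_cdf(0.5 + level / 2) * sqrt(4 / n)
    return d - half, d + half


def homogeneity_chi_square(effects, n):
    """Sum of squared deviations from the mean effect, scaled by 2 / n."""
    mean = fmean(effects)
    return sum((d - mean) ** 2 for d in effects) / (2 / n)


# Naive forecast for 0, 1, 2, 3 concurrences out of three replications.
naive = [binomial_pmf(k, 3, p_rep(D_ORIGINAL, N)) for k in range(4)]

interval = predictive_interval(D_ORIGINAL, N)
all_effects = [D_ORIGINAL] + D_REPLICATES
chi_square = homogeneity_chi_square(all_effects, N)

# Assumed true effect: the average of all four observed effects.
delta = fmean(all_effects)
prob_negative = STD.cdf(-delta * sqrt(N / 2))
# Exactly one positive replicate out of three (2 negative, 1 positive).
pattern = binomial_pmf(1, 3, 1 - prob_negative)

print([round(p, 2) for p in naive])
print(tuple(round(x, 2) for x in interval), round(chi_square, 2))
print(round(prob_negative, 2), round(pattern, 2))
```

Under these assumptions the observed pattern of one concurrence is several times more probable than the binomial forecast based on $p_{\text{rep}}$ suggests, which is the point of the example.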

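Two conversions used in the Postlude are easy to reproduce: the two-sided p-value implied by a given $p_{\text{rep}}$, and the Bayes Factor $B_{01}$ of footnote 15 together with the posterior probability of $H_0$ it implies. The sketch below is ours; it assumes $p_{\text{rep}} = \Phi(|z|/\sqrt{2})$ and the flat-prior form $B_{01} = \sqrt{n}\exp(-z^2/2)$ stated in that footnote, and the function names are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist

STD = NormalDist()


def p_value_from_p_rep(p_rep):
    """Invert p_rep = Phi(|z| / sqrt(2)) and return the two-sided p-value."""
    z = sqrt(2) * STD.inv_cdf(p_rep)
    return 2 * (1 - STD.cdf(z))


def bayes_factor_01(d, n):
    """B01 = sqrt(n) * exp(-z^2 / 2) with z = d * sqrt(n / 2) (footnote 15)."""
    z = d * sqrt(n / 2)
    return sqrt(n) * exp(-z * z / 2)


def posterior_prob_h0(prior_h0, bf01):
    """Posterior Pr(H0 | d) from prior Pr(H0) via posterior odds."""
    odds = bf01 * prior_h0 / (1 - prior_h0)
    return odds / (1 + odds)


for value in (0.85, 0.90):
    print(value, round(p_value_from_p_rep(value), 2))

bf01 = bayes_factor_01(0.56, 25)
print(round(bf01, 2), round(posterior_prob_h0(0.5, bf01), 2))
```

With $d = 0.56$ and $n = 25$ the Bayes Factor is about 0.7, so an even prior on $H_0$ moves only to about .41, in line with the figures quoted in footnote 15.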
Acknowledgments

We thank many colleagues for their comments at one stage or another in the preparation of this work: Bill Batcheler, Barbara Dosher, Jean-Claude Falmagne, Yung-Fong Hsu, R. Duncan Luce, Larry Maloney, Roger Ratcliff, Jeffrey Rouder, Ching-Fan Sheu, George Sperling. We are also most grateful to Sheng Kung (Mike) Yi and Si Yi Deng for early help with relevant calculations.

References

Ashby, F. G., & O’Brien, J. B. (2008). The prep statistic as a measure of confidence in model fitting. Psychonomic Bulletin & Review, 15(1), 16–27.
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York: Wiley.
Casella, G., & Berger, R. (2002). Statistical inference (Second ed). Duxbury: Pacific Grove.
Cumming, G. (2005). Understanding the average probability of replication: Comment on Killeen (2005). Psychological Science, 16(12), 1002–1004.
Cumming, G. (in press). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science.
Cutting, J. E. (2005). Acknowledgment. Psychological Science, 16(12), 1013.
Cutting, J. E. (2007). Rhythms of research. Observer, 20(11), 11–14.
Doros, G., & Geier, A. B. (2005). Probability of replication revisited: Comment on an alternative to null-hypothesis significance tests. Psychological Science, 16, 1005–1006.
Greenwald, A. G., Gonzalez, R., Guthrie, D. G., & Harris, R. J. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175–183.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 377–395.
Killeen, P. R. (2005a). An alternative to null-hypothesis significance tests. Psychological Science, 16, 345–353.
Killeen, P. R. (2005b). Replicability, confidence, and priors. Psychological Science, 16, 1009–1012.
Killeen, P. R. (2005c). Tea tests. The General Psychologist, 40(2), 12–15.
Lee, M. D., & Wagenmakers, E.-J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662–668.
Macdonald, R. R. (2005). Why replication probabilities depend on prior probability distributions: A rejoinder to Killeen (2005). Psychological Science, 16(12), 1007–1008.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. Chichester: Wiley.
Sanabria, F., & Killeen, P. R. (2007). Better statistics for better decisions: Rejecting null hypotheses statistical tests in favor of replication statistics. Psychology in the Schools, 44(5), 471–481.
Tversky, A., & Kahneman, D. (1971). The belief in the ‘law of small numbers’. Psychological Bulletin, 76, 105–110.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.
