Judgment and Decision Making, Vol. 4, No. 2, March 2009, pp. 164–174
Criteria for performance evaluation
David J. Weiss*¹, Kristin Brennan¹, Rick Thomas², Alex Kirlik³, and Sarah M. Miller³
¹ California State University, Los Angeles
² University of Oklahoma
³ University of Illinois
Abstract
Using a cognitive task (mental calculation) and a perceptual-motor task (stylized golf putting), we examined differential proficiency using the CWS index and several other quantitative measures of performance. The CWS index (Weiss & Shanteau, 2003) is a coherence criterion that looks only at internal properties of the data without incorporating an external standard. In Experiment 1, college students (n = 20) carried out 2- and 3-digit addition and multiplication problems under time pressure. In Experiment 2, experienced golfers (n = 12), also college students, putted toward a target from nine different locations. Within each experiment, we analyzed the same responses using different methods. For the arithmetic tasks, accuracy information (mean absolute deviation from the correct answer, MAD) using a coherence criterion was available; for golf, accuracy information using a correspondence criterion (mean deviation from the target, also MAD) was available. We ranked the performances of the participants according to each measure, then compared the orders using Spearman’s rs. For mental calculation, the CWS order correlated moderately (rs = .46) with that of MAD. However, a different coherence criterion, degree of model fit, did not correlate with either CWS or accuracy. For putting, the ranking generated by CWS correlated .68 with that generated by MAD. Consensual answers were also available for both experiments, and the rankings they generated correlated highly with those of MAD. The coherence vs. correspondence distinction did not map well onto criteria for performance evaluation.
Keywords: arithmetic, CWS index, judgment, measurement.
1 Introduction
To evaluate the work of a plumber, therapist, or surgeon, it is necessary to assess on-the-job performance. While all professionals have their creative moments, in most fields it is the ability to perform a practiced task consistently well that is the hallmark of the expert. Performance assessment is also the key to determining whether a training program or technical innovation is worthwhile. Ideally, assessment can be objective rather than a matter of opinion. Quantitative assessment of performance attends to measurable aspects of the work, typically the “bottom line” of the outcome of the labor. How many leaks were stopped? How many patients were cured?

Such outcome measures capture what Hammond (1996) refers to as correspondence competence, in that they focus directly on consequences. Outcomes can also be compared to theory-based standards; for example, updating of opinions should be governed by Bayes’s theorem. Hammond (1996) refers to this type of standard as a coherence criterion. These two types of criteria for optimality compare performance to a gold standard, a compelling benchmark against which to measure the behavior. Indeed, some researchers argue that performance can be measured meaningfully only when a gold standard has been agreed upon (Ericsson, 1996). Just as Hammond (1996) hoped that the correspondence-coherence distinction would help to clarify debates about the proper way to evaluate a scientific theory, in this paper we invoke that distinction in the hope of clarifying debates about how to assess performance.

For many professional domains, gold standards simply are not available. What is the outcome that reflects the quality of a film review, the grade assigned by an instructor, or the sentence imposed by a magistrate? Weiss and Shanteau (2003) responded to the challenge that gold standards are elusive by constructing an empirical index, referred to as CWS,¹ that does not incorporate ground truth. They suggested that proficiency has evaluative skill at its core. Whatever the task, one must attend to relevant aspects of the situation and decide what to do. Viewing evaluation as akin to what a measuring instrument does, Weiss and Shanteau (2003) identified two necessary properties of expert judgment: discrimination, responding differently to different stimuli, and consistency, responding similarly to similar stimuli. The CWS index, presented as Equation 1, combines these two properties in a ratio format. The ratio is large when the judge discriminates effectively, and is reduced when the judge is inconsistent. Weiss and Shanteau stressed that the two properties are not conceptually independent. It is easy enough to adopt a strategy that trades off one property at the expense of the other, but achieving both at the same time requires accurate evaluation of the stimuli, the essence of expert judgment.

$$\mathrm{CWS} = \frac{\text{Discrimination}}{\text{Inconsistency}} \tag{1}$$

When they originally proposed the CWS index, Weiss and Shanteau were intentionally non-committal about the measures of discrimination and inconsistency. The trade-off implied by the ratio definition is the heart of the concept, and any measures that reflect the two properties will do. In applications that generate numerical data, including the present ones, discrimination and inconsistency have been operationalized using terms familiar from analysis of variance. An experimental design suitable for CWS analysis may be as simple as the presentation of each of several stimuli more than once. Discrimination means that different stimuli are responded to differently. Accordingly, discrimination is captured by the mean square between stimuli. Inconsistency implies that a given stimulus presented multiple times inspires different responses on the various occasions. Inconsistency is captured by the mean square between replications.

The CWS approach resembles a coherence criterion, in that it examines purely internal properties of behavior. However, it differs from other coherence criteria in that while proficient performance inexorably generates high values of CWS, there is no theory specifying the optimal behavior. Our view is that performance ought to be tied to the external world, and that experts should follow the prescriptive model for their task. However, it is not always possible for an evaluator to know the best answers, and the applicable model is often unknown as well. The absence of optimal answers does not diminish the practical importance of having the capability to evaluate members of the large class of professionals who provide opinions about the status and achievements of people (Weiss, Shanteau, & Harries, 2006).

A more popular approach to evaluating these subjective domains is to compare someone’s responses to those of other people. Opinions often converge toward the truth (Surowiecki, 2004). Consensual answers have often been proposed as surrogates for correct answers (Ashton, 1985; Einhorn, 1974), although the logic of doing so has been criticized (Weiss & Shanteau, 2004). The gist of the criticism is simply that people may agree on poor answers. One may view consensus as a coherence criterion, postulating that there exists across people a common latent structure underlying their opinions (Batchelder & Romney, 1988; Uebersax & Grove, 1990).

In the current project, we employed tasks for which there were indisputably optimal responses, namely mental calculation and golf putting. Accuracy in arithmetic calculation is customarily assessed using a coherence criterion; correct answers are dictated by the abstract, logical rules of mathematics. The accuracy of a putt is usually assessed using a correspondence criterion, how close the ball gets to its target. A goal of the present research was to shed light on CWS’s ability to capture the subjective domains by examining objective domains. We assessed performance for both tasks using the clear-cut gold standards, then assessed that same performance using CWS, which does not make use of such information, and consensus, which provides a “silver standard” (Phillips, 1988) when the group knows what it is doing.

* Preparation of this manuscript was supported by the U. S. Air Force Office of Scientific Research (grant FA9550-04-1-0230 to California State University, Los Angeles) and by NSF (grant DRMS-045216 to the University of Illinois). We wish to thank Arash M. Taefi for writing the program used to present the intuitive arithmetic tasks. Correspondence regarding this article, including requests for reprints, should be sent to David J. Weiss, 609 Colonial Circle, Fullerton CA. 92835 United States. Email: dweiss@calstatela.edu.

¹ The CWS acronym derives from the initials of its creators, David J. Weiss and James Shanteau, along with that of the statistician William Cochran, who independently had previously proposed using an F-ratio to compare response instruments.

1.1 The logic underlying CWS

A CWS assessment entails analyzing responses to a range of stimuli, situations, or scenarios that would normally be handled within the profession. Tasks may be divided into four categories (Weiss & Shanteau, 2003). Judgment is exemplified by auditing a financial statement or diagnosing a patient’s condition. Prediction includes forecasting the weather or advising the parole board. Teaching encompasses training people or setting criteria for testing. Typical physical performance tasks are playing an instrument or shooting a ball. In all cases, evaluating the stimuli underlies proper execution of the tasks. In the latter three categories, additional abilities overlay the requisite judgmental skill. The predictor must anticipate changes that will occur in the future. The instructor must communicate and motivate. The performer requires physical abilities needed to execute the planned maneuvers. The CWS index can be used to assess behavior in all of these categories, but underlying judgmental skill may be obscured by the additional components.

Still, because judgment is paramount, reasonably accurate assessments of demonstrated skill in all of the categories can be achieved with CWS. The key properties, discrimination and consistency, are inherent in the behavior itself, so that measuring the ratio does not require knowledge about how things turned out. Of course, there is more to skilled performance than these two properties. CWS is necessary but not sufficient; in other words, one
who does the task well will generate high CWS, but high CWS does not guarantee that the task was done correctly (Weiss & Shanteau, 2003). The question of how much of the demonstrated skill is captured by the index is essentially an issue of validity.

In order to assess validity, one must have some approximation of the truth. We suggest five presumptions an analyst might make toward that end. Each presumption assumes domain knowledge on the part of the analyst, external knowledge that is provided by experts within the field. This circular reasoning, presuming that the analyst can identify the true domain experts, seems unavoidable in the early stages of research. The first three of the presumptions have been supported in previous research using the CWS framework. The last two have not been tested before.

Presumption 1 is that CWS can distinguish experts from novices; experts should generate higher CWS scores. Weiss & Shanteau (2003) illustrated this capability with data from several domains, including medicine, auditing, and personnel selection. Identifying novices is easy, but we have to assume that we know who the experts are in order to validate. Regarding experience as the equivalent of expertise is risky (Weiss, Shanteau, & Harries, 2006).

Presumption 2 is that CWS decreases systematically with increasing task difficulty. Here, the assumption is that the analyst can identify the more difficult tasks. Shanteau, Friel, Thomas, and Raacke (2005) varied the number of planes in simulated air traffic control, reasoning that having to deal with more aircraft should make the task harder. CWS decreased with the number of planes.

Presumption 3 is that CWS increases with training. Shanteau et al. (2005) also found that CWS increased over training periods, showing improvement long after less sensitive (outcome) measures such as the number of accidents or number of intrusion errors stopped showing performance gains. The assumption in this case is that the analyst knows performance to be in the sub-asymptotic range where increases are possible. In a study of the assessment of upper limb disorders, Williams, Haslam, and Weiss (2008) found that professional ergonomists, who had specialized training, exhibited higher CWS when judging patients’ risk status than members of other professions who also make such judgments regularly. The ergonomists also were superior, according to CWS, to students in ergonomics courses. Similar results for occupational therapy have been reported by Rassafiani, Ziviani, Rodger, and Dalgleish (2008).

The two new experiments reported here examine our fourth and fifth presumptions. The experiments are quite different in nature, but they have in common that there are known correct answers. In the first experiment, college students are asked to carry out intuitive addition and multiplication under time pressure. The tasks in Experiment 1 are purely judgmental. In Experiment 2, golfers putt toward a series of targets. This task involves physical performance as well as an implicit judgment. Our purpose in selecting both a cognitive and a perceptual-motor task was to shed light on the breadth of applicability of CWS as a performance index.

Presumption 4 is that CWS should be associated with the extent to which performance follows a correct process model. For the mental calculation tasks, participants who show higher CWS should be more likely to follow the additive and multiplicative models as assessed by functional measurement. Functional measurement (Anderson, 1979) invokes a coherence criterion, in that there is a normative model for the task; the analysis involves examining the algebraic structure underlying the responses. Because the number of everyday tasks for which a prescriptive model is available is limited, this presumption can be examined only in special cases. Presumption 4 cannot be tested with the putting task.

Presumption 5 is more widely applicable. Presumption 5 is that, in tasks for which correct answers are available, CWS should be higher for those whose answers are closer to correct. Our analytic strategy for testing Presumption 5 is to first rank the performances exhibited by the individuals within an experiment according to the gold standard of correct answers. Next, we rank the same performances according to CWS, which knows nothing of correct answers. High correlation between the two rankings is supporting evidence for this presumption. This comparison is the key empirical contribution of the present paper. If CWS can be shown to capture differences in performance when correct answers are known, that increases confidence in the ability of this relatively new index to provide valid assessments when correct answers are unavailable.

In order for this research strategy to be effective, it is necessary that there be differential ability among the participants. At the same time, they must all be able to do the task with some degree of competency, or the results mean little. We were willing to presume that all college students can do mental calculations. For golf, we required credentials in the form of some experience, as novices have essentially no expertise. To some extent it is a matter of fortune whether the recruits in a study vary sufficiently, but we tried to assist chance by informally seeking a range of self-estimated talent for arithmetic.
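
The ranking comparison described for Presumption 5 can be illustrated with a short sketch. It is not the authors' analysis code: the function names and the five pairs of scores are hypothetical, and Spearman's rs is computed here as the Pearson correlation of average ranks, which is one standard way to handle ties.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman's rs: the Pearson correlation computed on the two sets of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores for five participants (illustrative, not the study's data).
mad_scores = [0.5, 3.0, 1.2, 8.0, 2.1]    # gold standard: lower is better
cws_scores = [40.0, 9.0, 22.0, 1.5, 6.0]  # CWS: higher is better
rs = spearman_rs(mad_scores, cws_scores)
print(round(rs, 2))
```

Because low MAD and high CWS both indicate good performance, agreement between the two orderings shows up as a negative rs; only its magnitude matters for the presumption.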
Employing a gold standard of correctness requires the analyst to choose a rule for penalizing errors. When the response is measured on a numerical scale, it is traditional to use the mean squared deviation (MSD) from the correct answers as an index. Gigone and Hastie (1997) provide an extensive comparison of accuracy measures, favoring MSD because it contains the most information and penalizes large errors, which they see as an advantage. Our view is that while MSD fits nicely with statistical theory, it does not reflect how wrong the answer is from a behavioral perspective (Weiss, Edwards, & Shanteau, 2009).² Although we will report MSD, we deem the mean absolute deviation (MAD) from the correct answers to be the gold standard for performance.

The unique feature of this study is that we use the same data to compare various performance criteria. In evaluating Presumption 5, we use the golf data to provide a direct comparison of a correspondence criterion (accuracy) and a coherence criterion (CWS). Similarly, we invoke Presumption 4 with the mental calculation data to suggest a comparison of two kinds of coherence criteria, one (functional measurement) that incorporates a standard of optimality and one (CWS) that does not. We also evaluate the coherence criterion of consensus with both data sets.

² In everyday life, errors are often penalized on an absolute basis, and occasionally on the basis of extent. For example, a basketball shot either goes in or misses. In golf, the distance the ball lands from the hole contributes to the difficulty of the next shot. We are hard pressed to think of natural situations in which errors are punished in proportion to the square of their magnitude. Using simulation results, Dielman (1986) concluded that the use of absolute value in regression analysis provides better forecasts than does the use of least squares, especially when the data contain outliers.

2 Experiment 1: Mental calculation

The task for participants is to solve math problems in their heads (Busemeyer, 1991; Peterson & Beach, 1967); specifically, to perform mental calculation of either the sum or the product of a pair of numbers. Preliminary work using these tasks suggested that incorporating some time pressure was necessary in order to induce holistic judgments. Explicit use of arithmetic rules was deemed undesirable, because we wanted the laboratory task to simulate real-world judgments, few of which have formulaic solutions. The participant’s incentive on each trial was based on the difference between the response and the correct answer; no credit was given for responses occurring after a time limit specified for each problem type.

Intuitive addition of numerical stimuli has been extensively studied, including research that employed a functional measurement perspective (Anderson, 1968; Levin, 1975). A result of particular interest is that some people exhibit consistent biases, thus implying incorrect answers, while following the appropriate model. This illustrates the key principle that a focus on accuracy may obscure important information. Multiplication is inherently more difficult than adding, and one would expect less accurate answers. Intuitive multiplication has not been studied much beyond one- or two-digit problems (Seitz & Schumann-Hengsteler, 2000).

2.1 Method

Participants and Procedure. Twenty participants were recruited via fliers posted across the California State College, Los Angeles campus, with the qualification being that applicants were enrolled students (any major) and at least 18 years old. Students who claimed to be poor at math were encouraged to participate; those recruited spanned a wide range of (self-assessed) ability. Participants received base compensation at minimum wage level, as well as a bonus for accurate answers.³ They also received a bonus for completing all sessions, which took 2–3 hr.

After receiving instructions regarding use of the computer program, participants performed the arithmetic tasks on a computer in a small individual laboratory, with the experimenter visible in the hallway. The concept of intuitive math was stressed; the use of paper or calculator was prohibited. Participants were told to guess if they did not know the answer, as there was no penalty for wrong answers. They were informed that an answer within 5% of the correct value would be scored as correct for bonus purposes, but the answer had to be entered within 30 sec. There was an additional brief training component for each type of equation, with outcome feedback and illustrative bonus points. No feedback was provided during data collection.

The program presented two types of problems, one calling for adding and the other for multiplying. Most of the problems were expected to be difficult for most students. In an attempt to inhibit explicit calculation, each problem was presented briefly (3 sec for addition, 5 sec for multiplication) before the screen went blank.

The inter-question interval was 15 sec, but the participant could bypass the break by clicking the Enter key. After a block of trials, which lasted 10–20 min, the program informed the participants of how many bonus points had been earned during that block. Three to four blocks of trials were scheduled during each 1 hr session. In between blocks, participants were allowed to take brief rest periods.

Design. The anticipated difficulty of the problems was manipulated by varying the number of digits in the number. Three difficulty levels were used. The same difficulty level was maintained throughout a block. The pairs of numbers that constituted the 64 problems within each difficulty level were constructed according to an 8 x 8 (left position x right position) factorial design, following Anderson (1968). No number was permitted to have a zero as the right-most digit. Addition problems came first, then multiplication; in both cases, the problems were presented in order of increasing difficulty level. Two replications of the design, using the same 64 pairs of numbers in an independently randomized order, were presented for both addition and multiplication problems, thus yielding a total of 12 blocks for each participant. The numbers used are shown in Table 1.

Table 1: Numbers used for addition and multiplication problems within each difficulty level.

2 digit, 2 digit        2 digit, 3 digit        3 digit, 3 digit
Left        Right       Left        Right       Left        Right
Position    Position    Position    Position    Position    Position
18          15          26          195         131         138
29          24          34          268         294         216
33          51          49          391         352         425
46          57          52          453         384         548
64          64          65          575         585         641
79          72          77          628         613         776
83          87          81          746         882         893
96          98          93          982         947         991

³ An answer within 5% of the correct value earned a bonus of $.05. Typical performance earned a bonus of $4–5/hr.

2.2 Measures

Accuracy. Although accuracy sounds transparent enough, there are at least three sensible ways to capture the accuracy of the responses. The most commonly used measure, Mean Squared Deviation (MSD), is the average of the squared deviations, $\left(\sum [C_i - X_i]^2\right)/N$, where $C_i$ is the correct answer and $X_i$ is the response on the i-th trial. The Mean Absolute Deviation (MAD), $\left(\sum |C_i - X_i|\right)/N$, is an accuracy measure that does not weight discrepancies via squaring. The Correlation between correct answers and responses is also an accuracy measure, but it does not distinguish between truly accurate and linearly discrepant responses (Stewart & Lusk, 1994). All of these accuracy measures may be viewed as coherence based, in that they compare correct answers to those specified by a mathematical formula.

CWS. The CWS index (Weiss & Shanteau, 2003) for an individual’s performance is the ratio of discrimination to inconsistency, calculated separately for each task and difficulty level as the mean square between stimuli divided by the mean square within replications. The computation for Equation 1 is that for a single-S design (Weiss, 2006), which is identical to the calculation of an F-ratio in an independent groups design. Accordingly, the data can be entered into a standard ANOVA program as a 64 (stimuli) x 2 (replications) design. CWS may be viewed as a coherence standard, in that it is based on a theory of optimality, but of a special type that does not incorporate correct answers. The CWS ratio depends only upon internal properties of the set of responses.

Consensus. The mean response can be calculated across respondents for each stimulus pair and that mean can be used as a criterion. From the set of consensual criteria, we can construct pseudo-accuracy measures similar to the accuracy measures described above. Our consensus measure was based on MSD, in that we substituted the mean response, $M_i$, for the correct answer, so that Consensus is $\left(\sum [M_i - X_i]^2\right)/N$. We might equally have based the consensus measure on MAD, but did not because the MSD-based version has been traditionally used in the literature.

For an arithmetic task, those who provide correct answers will inevitably agree. However, agreed-upon answers need not be correct. One possibility is the widespread use of a heuristic strategy (Gigerenzer, Todd, & The ABC Research Group, 1999) that simplifies the challenging multiplication task. For example, one might round the numbers to the nearest ten or hundred prior to multiplying, and then make an upward or downward adjustment to correct for rounding.

Model Fit. Because the stimuli were constructed according to a factorial design, it is possible to employ functional measurement analyses (Weiss, 2006) on the responses. Functional measurement invokes a coherence criterion, evaluating the fit of a plausible algebraic model to the observed judgments. For the adding task, an additive model should apply. For the multiplying task, a multiplicative model should apply. These models can be tested using analysis of variance; they predict that specific sources in a factorial analysis of variance will yield significant effects and that others will not (Weiss, 2006). If people do the task perfectly, then the model will fit, and the answers will be accurate. However, it is possible for the model to fit and for the answers to be systematically inaccurate, e.g., by consistently placing higher weight on the number presented on the left (primacy). A potential weakness of the model fit approach is that high variability increases the likelihood that data will appear to support the model because there is insufficient power to reject it.
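
As a rough illustration of how the accuracy, CWS, and consensus measures defined in this section relate to one another, the sketch below computes each for a toy data set. It is a minimal sketch under stated assumptions: the function names, the three problems, the responses, and the group means are all hypothetical, and CWS is computed as a one-way F-ratio (mean square between stimuli over mean square within stimuli), following the description of the single-S computation above.

```python
from statistics import mean

def msd(criterion, responses):
    """Mean squared deviation of responses from the criterion values."""
    return mean((c - x) ** 2 for c, x in zip(criterion, responses))

def mad(criterion, responses):
    """Mean absolute deviation of responses from the criterion values."""
    return mean(abs(c - x) for c, x in zip(criterion, responses))

def correlation(criterion, responses):
    """Pearson correlation between criterion values and responses."""
    mc, mx = mean(criterion), mean(responses)
    cov = sum((c - mc) * (x - mx) for c, x in zip(criterion, responses))
    ss_c = sum((c - mc) ** 2 for c in criterion)
    ss_x = sum((x - mx) ** 2 for x in responses)
    return cov / (ss_c * ss_x) ** 0.5

def cws(responses_by_stimulus):
    """Mean square between stimuli divided by mean square within stimuli.

    responses_by_stimulus[s] lists one participant's repeated responses to
    stimulus s. Raises ZeroDivisionError if the replications never differ.
    """
    n_stimuli = len(responses_by_stimulus)
    n_total = sum(len(r) for r in responses_by_stimulus)
    grand = sum(sum(r) for r in responses_by_stimulus) / n_total
    ss_between = sum(len(r) * (mean(r) - grand) ** 2 for r in responses_by_stimulus)
    ss_within = sum((x - mean(r)) ** 2 for r in responses_by_stimulus for x in r)
    return (ss_between / (n_stimuli - 1)) / (ss_within / (n_total - n_stimuli))

# Hypothetical responses by one participant to three addition problems.
correct = [33, 53, 84]                      # 18 + 15, 29 + 24, 33 + 51
responses = [[33, 35], [50, 56], [85, 83]]  # two replications per problem
group_means = [34.0, 52.0, 86.0]            # hypothetical consensus answers

# Flatten to one entry per trial so that criterion and response line up.
trial_correct = [c for c, reps in zip(correct, responses) for _ in reps]
trial_consensus = [m for m, reps in zip(group_means, responses) for _ in reps]
trial_responses = [x for reps in responses for x in reps]

print("MSD        ", msd(trial_correct, trial_responses))
print("MAD        ", mad(trial_correct, trial_responses))
print("Correlation", correlation(trial_correct, trial_responses))
print("CWS        ", cws(responses))
print("Consensus  ", msd(trial_consensus, trial_responses))  # MSD against group means
```

Note that `cws` never touches `correct`: it uses only the pattern of responses, which is the sense in which the index depends on internal properties of the data alone.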
Table 2: Performance for two individual participants across mental calculation tasks, as assessed by six indices.

                                              Problems
Index          2 + 2          2 + 3          3 + 3          2 x 2          2 x 3             3 x 3
Participant G
MAD             0.06           0.08           0.40           3.12          24.59            250.56
MSD             0.75          12.64          80.45       12396.00      323660.13       39112205.50
Correlation     0.94           0.98           0.96           0.78           0.94              0.93
CWS            16.84          41.05          22.63           3.16          18.94             15.56
Consensus       9.98       15149.85        1519.92       66886.54     6480550.31      436290435.69
Model Fit       1.03           1.04           0.82           1.12           0.92              1.08
Participant E
MAD             0.90          17.49           0.68           3.78         625.25           2236.50
MSD            60.84     6996224.11        2135.85       14784.45  2886585990.68    68662171252.01
Correlation     0.54           0.30           0.64           0.68          -0.12              0.22
CWS             1.24           1.00           1.66           2.44           1.15              1.00
Consensus      78.69     9017726.92        3325.01       65329.33  2870853841.73    67531546588.76
Model Fit       1.00           1.00           1.08           1.06           0.62              0.13
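
Table 2 reports six per-condition values for each index, whereas the participant rankings rest on one score per index. The paper's footnote on averaging says that for CWS, FM, MSD and Consensus the average is the square of the mean of the square roots, and that Fisher's r to z was employed for Correlation. The sketch below shows that arithmetic on Participant G's rows. The function names are hypothetical, and the back-transformation from z to r in `fisher_average` is an assumption, since the footnote's description of that step is not available here.

```python
import math

def root_mean_average(values):
    """Square of the mean of the square roots of the values."""
    return (sum(math.sqrt(v) for v in values) / len(values)) ** 2

def fisher_average(correlations):
    """Average correlations on Fisher's z scale, then convert back to r.

    The back-conversion with tanh is an assumption, not taken from the paper.
    """
    z_values = [math.atanh(r) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))

# Participant G's six per-condition values, copied from Table 2.
cws_g = [16.84, 41.05, 22.63, 3.16, 18.94, 15.56]
corr_g = [0.94, 0.98, 0.96, 0.78, 0.94, 0.93]

print("Average CWS        ", root_mean_average(cws_g))
print("Average Correlation", fisher_average(corr_g))
```

Averaging on the square-root scale keeps one very large condition value from dominating the participant's score, so the result is never larger than the plain arithmetic mean of the six values.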
2.3 Results

Performance. To convey the flavor of the data, we present the performance indices achieved by one of the most successful participants and by one of the least successful in Table 2. These extremes illustrate how the various indices track the same observed behavior. The three accuracy measures (MAD, MSD, Correlation) all report the superiority of Participant G over Participant E in the same way, with lower values for all six of the problem types. MAD and MSD also confirm that multiplication is more difficult than addition and that problem difficulty increases with the number of digits, although Participant E had an especially hard time with adding 2 digit and 3 digit numbers. On the other hand, Correlation was not effective in capturing these expected trends.

Our featured index, CWS, did show the superiority of Participant G’s performance over that of Participant E, but did not fare particularly well in capturing the difficulty we built into the design. The picture presented when Consensus was used as a surrogate for correctness was comparable to that provided by MAD and MSD.

We assessed model fit using the F-ratio of the source that captures deviations from the normative additive (testing the Left x Right interaction) and multiplicative (testing the deviations from bilinearity) models (Weiss, 2006) for the respective tasks. These F-ratios are shown in the bottom line of Table 2. The normative models were quite descriptive, in that the key F-ratios were nonsignificant for most blocks for most participants. This nonsignificance is not attributable to lack of power, because the main effects and (for the multiplicative model) the bilinear component of the interaction were both significant and sizable. Graphically, the appropriate pattern (parallelism in the case of addition, fan in the case of multiplication) was observed for most of the individual plots, especially for addition. Thus, the functional measurement analysis does shed light on the behavior, telling us that people did follow the applicable combination rule. However, the small F-ratios did not distinguish among participants of varying proficiency. In this application, the process analysis was uninformative regarding differential proficiency; Presumption 4 could not be verified. To be fair, functional measurement has never been proposed by its adherents as a tool for assessing performance; nor has the magnitude of a nonsignificant F-ratio ever been proposed to be meaningful.

⁴ To average the index values across the six conditions for each participant, we followed the recommendation of Weiss and Edwards (2005), transforming so that the averaging is carried out on the units of the original measurements. For CWS, FM, MSD and Consensus, the appropriate average is the square of the mean of the square roots of the six individual values. For Correlation, we employed Fisher’s r to z

Participant Rankings. One of our primary purposes was to see how the indices compared in terms of scoring the people. In employment contexts, rankings are the usual basis of decisions. For each index, we ranked the 20 participants according to their average score⁴ across the
Table 3: Rank order correlations (Spearman’s rs) between six performance measures on mental calculation tasks.
MSD
Correlation
Model Fit
CWS
Consensus
MAD
.78*
(–).65*
(–).05
(–).46*
.70*
MSD
(–).55*
(–).34
(–).46*
.92*
Correlation
.07
.86*
(–).50*
Model Fit
.37
(–).55*
CWS
(–).43
* p < .05. n = 20 for all correlations. Minus signs indicate direction only, and are unimportant to the strength
of the relationship.
Figure 1: CWS vs. MAD for mental calculation data from nineteen students. Each data point represents the appropriate index-specific average over the six conditions. Spearman’s rs = (–).50, p <.05. In order to avoid distorting the graphical impression of the relationship, we omitted the data from an outlier whose average CWS was much higher than anyone else’s. With that twentieth student included, rs = (–).46, p <.05. [Scatterplot: MAD on the vertical axis, CWS on the horizontal axis.]
6 conditions, then examined the correspondence among those rankings using Spearman’s rs, the rank-order correlation. The rank orders we compared were based on quality of performance as conveyed by each measure. For MSD, MAD, Consensus, and Model Fit, lower scores indicate better performance. For Correlation and CWS, higher scores indicate better performance. These correlations are presented in Table 3.
We consider MAD to be the gold standard for the task. The other accuracy indices yielded rankings that agreed well, but not perfectly, with that established using MAD. Presumably, MSD generates slightly different orders because it weights large errors differently. Correlation is a less sensitive index, in that it can fail to penalize responses that are incorrect if the errors follow an orderly pattern. Consensus did fairly well, perhaps reflecting the objective nature of the task. People are aiming at the same target, the correct answer, and on average their guess corresponds to that target reasonably well (Surowiecki, 2004). Consensus and MSD, which both square deviations, yielded similar rankings. CWS agreed moderately with the gold standard order, as is shown graphically in Figure 1.
4 (continued) transformation. The ordinary arithmetic mean is appropriate for MAD.
3 Experiment 2: Golf putting
There is an obvious gold standard for a golf shot; the ball either goes in the hole or does not. Within the traditional game, the degree of imperfection of a shot that misses is measured by the number of subsequent shots required to get the ball in. However, the latter measure is confounded with the quality of those subsequent shots. A more pure measure of the imperfection of a shot is the distance between its landing point and the hole.
We employed a laboratory version of golf putting that has proven useful in understanding skilled performance and its attentional limitations (Beilock & Carr, 2001; Beilock, Carr, MacMahon, & Starkes, 2002; Perkins-Ceccato, Passmore, & Lee, 2003). In this stylized abstraction of one of golf’s core skills, the task is to putt to a target. The distance between where the ball lands and the target is analogous to the difference between the correct answer and the stated response in our arithmetic tasks.
For golf, we can invoke a correspondence assessment. The ball is supposed to hit the target, and experience teaches golfers how to achieve that goal. We can also parse each golfer’s putts into a CWS index. In accord with Presumption 5, we anticipated that higher CWS
would be related to more accurate putting. Our logic is that greater discrimination means that the golfer knows to hit the ball farther the more distant the target. Better golfers should also be more consistent, because their strokes are well-regulated. A Consensus criterion, again invoking the argument about a common latent structure guiding the golfers’ efforts, is also available.
3.1 Method
Participants. We report data from twelve experienced golfers between the ages of 18–22. Participants were required to have two or more years of high school varsity golf experience or a Professional Golfers’ Association (PGA) handicap less than 8. The session lasted approximately one hour. The golfers were paid $10 for their time. There was no performance-based incentive.
Experimental design and procedure. Participants received instructions to putt the ball from one of nine different starting points so it stopped as close as possible to a target, marked by a taped X on a uniformly flat synthetic turf mat. Three of the locations were 1.2 m from the target, three were 1.4 m from the target, and the other three were 1.5 m from the target. Following instructions and 10 practice putts, participants performed two blocks of 21 putts. Accuracy (more precisely, amount of inaccuracy) for a single shot was measured by the distance from the center of the ball to the center of the target (in cm). This target feature is slightly more challenging than real golf, in that there is no hole in which the ball can come to rest. Possibly, a shot that rolled gently over the target (and thereby generated an error) might have gone into a real golf hole. The Mean Absolute Deviation (MAD) for an individual at each starting location was computed by averaging the single shot accuracy scores.
We also measured the total distance the ball traveled (in cm), for use in the CWS computation. CWS is defined as the ratio of discrimination to inconsistency. Discrimination, the numerator of the ratio, is calculated here as the mean square between the distances the ball was hit from different starting points. Inconsistency, the denominator, is the mean square between replications, that is, the mean square between the distances when the ball was hit from the same starting point. Thus, as in Experiment 1, CWS is computed like a standard F-ratio.
3.2 Results
The task was fairly challenging. Only 4 putts actually landed on the target. There were 22 additional putts that had a zero angle error and a distance greater than the correct distance; some of these might have gone in a real hole. So all told, about 10% of the putts could have gone in.
We calculated a separate CWS value for each golfer, entering a 21 (stimuli) x 2 (replications) single-S design into the ANOVA program. The rankings from CWS were significantly correlated with those from accuracy as measured by MAD (rs = (–).676, p < .05; n = 12). Thus, although CWS is ignorant of how far away the target is, or whether the shot is accurately directed, it yielded values that were reasonably well correlated with putting performance as measured by a gold standard (distance between the final ball location and the target). The relationship is shown graphically in Figure 2.
Figure 2: CWS vs. MAD for putting data from twelve golfers. Spearman’s rs = (–).676, p < .05. [Scatterplot: MAD, the distance (cm) of the ball from the target, on the vertical axis; CWS on the horizontal axis.]
In the evaluation of Consensus, we used as the consensual answers5 the mean distance the ball was hit toward each of the targets, and calculated each individual’s deviations using the same definition employed for consensus in the mental calculation tasks, $(\sum [M_i - X_i]^2)/N$. The rankings for putting generated by the consensus criterion correlated very highly with the accuracy rankings given by MAD, rs = .888 (p < .05, n = 12).
5 An alternative definition of the consensual answer might be the centroid of the landing points of the putts toward each target. Our data were not collected in a manner that would permit that centroid to be calculated.
4 General Discussion
In the present paper, we illustrate how CWS can be used to assess complex performance in the absence of a true gold standard, using mental calculation and golf putting as prototypes. We chose these tasks because in each case there is an obvious gold standard against which to test the capability of the index. How much proficiency can be
captured if we don’t know the right answers or whether the ball lands on the target? The answer seems to be that a moderate amount of the proficiency can indeed be detected by CWS, a measure that looks only at the discrimination and consistency exhibited by the respondents. Combining these two necessary properties of good judgment yields an index that is able to capture a considerable amount of the variation in how well people did the tasks. The participants in these studies knew nothing of CWS. They were trying to maximize accuracy, not discrimination and consistency. Because accuracy subsumes discrimination and consistency, CWS can serve as a proxy.
CWS is a coherence criterion, albeit an unusual one in that there is no gold standard for a response. Experiment 1 examined how CWS compared with other coherence-based measures. Our Presumption 4, that Model Fit would be associated with CWS, was not supported. More generally, the several criteria we employed for mental calculation did not yield correlated rankings. So we may conclude that not all coherence criteria produce similar evaluations. Experiment 2 compared CWS to a correspondence-based measure. Our Presumption 5, that participants who score well according to an accuracy criterion also should score well according to CWS, was confirmed by the correlation in the rankings for both mental calculation and golf putting. The golf results showed that a theory-based coherence criterion can produce evaluations similar to those generated by a correspondence criterion. The fact that evaluations using coherence criteria do not group themselves conveniently, and that evaluations using coherence criteria do not stand apart from those produced by a correspondence criterion, casts doubt on the value of Hammond’s distinction in this context.
The technique we relied upon, comparing the rankings generated by the various indices, is constrained by the true differential expertise of the participants. The more similarly the contenders performed, the less differential performance there is for CWS (or any performance index) to detect. To that end, we selected tasks our participants already knew how to do but at which they were not so skilled that all performances would be excellent. Accordingly, the exact magnitude of the correlations between CWS and MAD is not critical to our validation; what does matter is that CWS has been shown to detect differences in demonstrated skill in much the same way that MAD did for both a judgment task and a physical performance task.
The CWS index has previously been used to assess the judgmental performance of professionals in several domains for which a true standard is unavailable. These include physicians judging the likelihood that patients had chronic heart failure (Weiss & Shanteau, 2003), occupational therapists prioritizing clients for therapy (Weiss, Shanteau, & Harries, 2006), and ergonomists determining the risk of workers complaining about upper limb disorders (Williams, Haslam, & Weiss, 2008). A lingering concern has been that if a putative expert consistently discriminates an irrelevant feature of the behavior, CWS can be fooled (Weiss & Shanteau, 2003). It is crucial to tap into a dependent variable that captures the heart of performance on the task. Identifying the right variable requires domain knowledge. For example, in CWS investigations of air traffic control (Shanteau et al., 2005), time through sector eventually came to be recognized as the dependent variable of choice. CWS indices built on time through sector were found to be related to task difficulty and to type of training. The evidence from the mental calculation and golf studies presented here suggests that CWS does indeed provide reasonable assessments of proficiency. So long as a sensitive independent variable has been chosen, performance assessment can proceed with the analyst blind to any individual characteristics of the contenders.
CWS’s ability to capture performance on the putting task without target information is particularly impressive. Not only is the measure unaware of whether the ball was struck accurately, it does not even know which of the targets are farther away. Because the distance the ball traveled turned out to contain useful information, in principle it would have been possible to assess differential performance just by knowing how hard the ball was struck.
Our outcome-based assessments would not have been feasible without continuous error measures. Few of the answers in the arithmetic tasks (other than for two digit + two digit addition) were exactly correct. Capturing performance by whether a ball either goes in the hole or not can be insensitive to how well a shot was hit. We noted that only about 10% of the putts could have gone in a hole.
We advocate MAD as the error penalization rule; but our results suggest that the traditional MSD index, in which errors are squared, yields similar rankings. Because we calculated CWS ratios using mean squares, CWS effectively uses squared error penalization. The use of mean squares is not critical to the formulation of CWS, and indeed it is feasible to construct an index using measures of discrimination and inconsistency based on MAD. Whether the analyst’s decision regarding error penalization will have serious consequences depends upon whether large errors occur frequently within the data set.
The CWS methodology for assessing performance is limited to quantifiable, repeatable behaviors. CWS is, like all objective assessment techniques, limited in scope. It does not address the quality of an actor’s performance, an artist’s creation, or a professor’s lecture. As well, when comparing performers, every candidate must face essentially the same conditions. Repeatability within a
person can also be a limitation; some tasks can be done meaningfully only once. The analyst applying CWS must be willing to assume that observations occurring at different moments are in fact comparable (Weiss & Shanteau, 2003). Under those circumstances, which characterize much of the routine work of many professionals, when correspondence measures are unavailable, a reasonable assessment of performance can be achieved with CWS, a coherence criterion.
5 Summary and confession
Hammond (1996) used the coherence-correspondence distinction to help distinguish among metatheories for scientific truth, where the metatheory provides a basis for telling whether a theory is true. We attempted to map that distinction onto performance evaluation. We had thought that comparing observed to optimal responses was employing a correspondence criterion, while the use of CWS, an index based on a theory of expert performance, was an application of a coherence criterion. Accordingly, we examined evaluation methods that did or did not incorporate optimal responses. The reviewers showed us the error in our thinking, in that arithmetic is at its core a theory-driven system that does not depend upon a connection with consequences. So involving optimality did not imply the use of a correspondence approach.
Our imperfect mapping leads us to suggest that the coherence-correspondence distinction may not be well-suited to dichotomizing performance criteria. Empirically, the distinction did not provide two distinct sorts of results. One reason that correspondence does not stand alone may be that correspondence criteria are likely to be temporary in applied contexts. For example, a physician’s competence or a drug’s value might at first be evaluated according to patient outcomes such as survival, a correspondence criterion. But as medical science progresses and theoretical insights evolve, the rather crude index of survival is replaced by physiological indicators. Of course the physiological measures are ultimately connected to survival; but they are connected by a theory, and it is that theory that governs the construction of the instrumentation that reports the measures. Lewin’s (1951) famous dictum that “there is nothing more practical than a good theory” is very pertinent to performance evaluation.
An alternative dichotomization that might be proposed for evaluating performance is using process criteria vs. using outcome criteria. For example, one might evaluate an athlete’s performance according either to purity of style or to scoreboard result. Purity of style is the basis in figure skating and diving. Most other sports, in contrast, are scored according to the final result: who was the fastest, who scored more points, etc. One might expect good form to produce good results, but it remains an empirical question whether the quarterback who throws the most beautiful spiral also completes the most passes. It is interesting to note that modern baseball analysts are beginning to evaluate players according to process (for example, how many pitches a hitter swings at) as opposed to traditional criteria such as batting average (Lewis, 2003).
Applying objective process criteria to behaviors that take place out of sight, most prominently thinking, is challenging. Our Presumption 4 did invoke a process model for mental calculation. Although we had prescriptive algebraic models for the two arithmetic tasks, the measure of discrepancy from model predictions was not associated with the gold standard of correct answers or with most of the other indices. We concluded that functional measurement was not an effective tool for comparing candidates. The moderate correlation between Model Fit and Consensus is an anomaly within this description. We found it puzzling that Model Fit could be correlated with Consensus, Consensus with everything else, and yet Model Fit with nothing else. Error variance may play a role in this seeming paradox; but a more satisfying resolution is that contrary to intuition, correlations are not necessarily transitive (Langford, Schwertman, & Owens, 2001).
In a more positive vein, we were able to confirm that the CWS index was effective in capturing performance, and we now have further justification for recommending it when valid outcome measures are not available. We also found that consensual answers can provide an effective substitute for correct answers when the task is one that people do reasonably well.
References
Anderson, N. H. (1968). Averaging of space and number stimuli with simultaneous presentation. Journal of Experimental Psychology, 77, 383–392.
Anderson, N. H. (1979). Algebraic rules in psychological measurement. American Scientist, 67, 555–563.
Ashton, A. H. (1985). Does consensus imply accuracy in accounting studies of decision making? Accounting Review, 60, 173–185.
Batchelder, W. H., & Romney, A. K. (1988). Test theory without an answer key. Psychometrika, 53, 71–92.
Beilock, S. L., & Carr, T. H. (2001). On the fragility of skilled performance: What governs choking under pressure? Journal of Experimental Psychology: General, 130, 701–725.
Beilock, S. L., Carr, T. H., MacMahon, C., & Starkes, J. L. (2002). When paying attention becomes counterproductive: Impact of divided versus skill-focused at-