NIH Public Access
Author Manuscript
Psychol Rev. Author manuscript; available in PMC 2009 October 1.
NIH-PA Author Manuscript
Published in final edited form as:
Psychol Rev. 2008 October ; 115(4): 787-835. doi:10.1037/a0013118.
A Theory of Eye Movements during Target Acquisition
Gregory J. Zelinsky
Departments of Psychology and Computer Science, Stony Brook University
Abstract
The gaze movements accompanying target localization were examined via human observers and a
computational model (Target Acquisition Model, TAM). Search contexts ranged from fully realistic
scenes, to toys in a crib, to Os and Qs, and manipulations included set size, target eccentricity, and
target-distractor similarity. Observers and the model always previewed the same targets and searched
the identical displays. Behavioral and simulated eye movements were analyzed for acquisition
accuracy, efficiency, and target guidance. TAM's behavior generally fell within the behavioral mean's
95% confidence interval for all measures in each experiment/condition. This agreement suggests that
a fixed-parameter model using spatio-chromatic filters and a simulated retina, when driven by the
correct visual routines, can be a good general purpose predictor of human target acquisition behavior.
Keywords
Overt visual search; computational models; saccade target selection; population coding;
center-of-gravity fixations
1 Introduction
A promissory note is coming due for the visual search community. For decades researchers
have relied on manual button press responses and relatively simple displays to build models
of search, with the promise that these models would one day generalize to more naturalistic
search situations. These efforts have yielded a wealth of basic information, making search one
of the best understood behaviors in all of visual cognition. However, these methodological
choices have also served to limit the types of tasks explored by the search community. Visual
search is much more than the time needed to press a button in response to a red vertical bar.
Rather, it is "how we look for things", a reply that most people would provide when asked
about this behavior. A great deal can be learned from this folk psychological definition. It
reminds us that search is active, a visual and motor interaction with the world characterized by
the convergence of gaze towards a target, with each eye movement changing slightly the visual
information used by the search process. It also reminds us that models of search must be general,
as the "things" that we search for are not often red vertical bars, but rather cups or people or
road signs.
Address correspondence to: Gregory J. Zelinsky, Department of Psychology, Stony Brook University, Stony Brook, NY 11794-2500,
Phone: 631-632-7827, Email: Gregory.Zelinsky@stonybrook.edu.
Publisher's Disclaimer: The following manuscript is the final accepted manuscript. It has not been subjected to the final copyediting,
fact-checking, and proofreading required for formal publication. It is not the definitive, publisher-authenticated version. The American
Psychological Association and its Council of Editors disclaim any responsibility or liabilities for errors or omissions of this manuscript
version, any version derived from this manuscript by NIH, or other third parties. The published version is available at
http://www.apa.org/journals/rev/
In this article I attempt one payment on this theoretical debt. I do this by characterizing eye
movement behavior across a range of search tasks, including the acquisition of targets in
realistic scenes, and by developing a model that inputs the identical stimuli shown to human
observers, and outputs for each trial a sequence of simulated eye movements that align gaze
with the target. Finding good agreement between this simulated and human gaze behavior
would suggest a computationally explicit understanding of overt search at the level of relatively
stimulus non-specific processes. Of course this goal presumes that it is useful to have a general
purpose model of search and to understand this behavior in terms of eye movements. These
topics will be considered briefly in the following sections.
1.1 Defining search in terms of eye movements
This article addresses the use of eye movements to acquire specific search targets, with an
emphasis on the computational underpinnings of this behavior. Given this focus on overt
search, purely attentional contributions to search will not be considered in depth. This includes
the excellent recent work showing the involvement of attention mechanisms during search
tasks (e.g., Bichot, Rossi, & Desimone, 2005; Chelazzi, Miller, Duncan, & Desimone, 2001;
Yeshurun & Carrasco, 1998; see Reynolds & Chelazzi, 2004, for a review). By neglecting this
work my intention is not to suggest that attention plays an unimportant role in search, but rather
that these processes and mechanisms are outside of the scope of the proposed model (see
Section 8.3 for additional discussion of this topic). However, there is one aspect of attention
that must be discussed in the current context, and that is the fact that attention can shift without
an accompanying movement of gaze (e.g., Klein, 1980; Klein & Farrell, 1989; Murthy,
Thompson, & Schall, 2001; Posner, 1980). Given the potential for purely covert shifts of
attention, why is it useful to understand where people move their eyes as they search? There
are several reasons.
First, eye movements can be used to study how attention is allocated during search. Although
the reason for their alignment can be debated (e.g., Deubel & Schneider, 1996; Findlay,
2005; Klein, 1980; Klein & Pontefract, 1994), overt and covert search movements, when they
co-occur, are likely in close spatial register. This is supported by studies showing that attention
is directed to a location in preparation for a saccade to that location (e.g., Deubel & Schneider,
1996; Henderson, 1993; Henderson, Pollatsek, & Rayner, 1989; Hodgson & Muller, 1995;
Hoffman & Subramaniam, 1995; Irwin & Gordon, 1998; Irwin & Zelinsky, 2002; Kowler,
Anderson, Dosher, & Blaser, 1995; Kustov & Robinson, 1996; Rayner, McConkie, & Ehrlich,
1978; Sheliga, Riggio, & Rizzolatti, 1994; Shepherd, Findlay, & Hockey, 1986; see Findlay
& Gilchrist, 2003, and Hoffman, 1998, for reviews), and that manual search measures correlate
highly with the number (and distribution) of gaze fixations occurring during search (Behrmann,
Watt, Black, & Barton, 1997; Bertera & Rayner, 2000; Williams, Reingold, Moscovitch, &
Behrmann, 1997; Zelinsky & Sheinberg, 1995, 1997). These relationships between overt and
covert search suggest that gaze fixations can be used to sample the attentive search process,
pinpointing this covert process to specific locations at specific times. If locations x, y, and z
were fixated in a search scene, one can be reasonably confident that attention visited these
locations as well. And even if one assumes an attention sampling frequency greater than our
capacity to shift gaze, as suggested by high-speed serial search models (e.g., Treisman &
Gelade, 1980; Horowitz & Wolfe, 2003; Wolfe, 1994; Wolfe, Alvarez, & Horowitz, 2000; but
see Findlay, 2004, Motter & Holsapple, 2007, Sperling & Weichselgartner, 1995, and Ward,
Duncan, & Shapiro, 1996), the 5 or so gaze-based samples of attention each second might still
provide a reasonable estimate of the scanpath traversed by covert search (e.g., Zelinsky, Rao,
Hayhoe, & Ballard, 1997).
A second reason for studying eye movements during search is related to the first: eye
movements are directly observable, whereas movements of attention are not. Allocations of gaze during
search can be monitored and quantified with a fair degree of precision using an eye tracker.
The same cannot be said about attention. Unlike eye movements, there is as yet no machine
that can track the individual movements of attention during search (although see Brefczynski
& DeYoe, 1999, for work that is heading in this direction). Instead, the many covert search
movements assumed by high-speed serial models must be inferred from manual reaction times
(RTs), making their distribution, and even existence, necessarily more speculative. A manual
RT also provides no explicit spatial measure of search, and only a single temporal measure
marking the completion of the search process. By failing to capture search as it unfolds, RTs
arguably discard the most interesting aspects of search behavior. In contrast, the saccades and
fixations accompanying search provide a comparatively rich source of information about the
spatio-temporal evolution of the search process. And if oculomotor behavior (fixations and
saccades) is measured from the onset of the search display until the button press search
response, no information is lost relative to a manual RT measure. The RT is simply redefined
in terms of a sequence of fixation durations (Zelinsky & Sheinberg, 1995, 1997), meaning that
even if an eye movement does not occur on a particular trial, that trial would still have a single
fixation duration equal to the RT. The advantages of supplementing an RT measure of search
with oculomotor measures are therefore many, with no meaningful costs.
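The redefinition of RT as a sequence of fixation durations can be made concrete with a minimal sketch. The `Fixation` record and the `reaction_time_ms` helper are hypothetical names introduced here for illustration, and the sketch assumes that saccade flight time is folded into the adjacent fixation durations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fixation:
    x: float            # horizontal gaze position (pixels)
    y: float            # vertical gaze position (pixels)
    duration_ms: float  # time spent at this position before the next saccade or the response


def reaction_time_ms(fixations: List[Fixation]) -> float:
    """Recover the manual RT as the sum of the trial's fixation durations."""
    return sum(f.duration_ms for f in fixations)


# A trial with three fixations between display onset and the button press.
scanning_trial = [
    Fixation(640, 480, 210.0),
    Fixation(300, 410, 180.0),
    Fixation(220, 395, 260.0),
]

# A trial with no eye movement: a single fixation whose duration equals the RT.
steady_trial = [Fixation(640, 480, 520.0)]

print(reaction_time_ms(scanning_trial), reaction_time_ms(steady_trial))
```

The oculomotor record keeps the spatial positions and the temporal order of the fixations, while the summed durations still give the single number that a manual RT would have provided.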
Third, the rich and directly observable oculomotor record makes for a very challenging test of
a search theory. The potential for eye movements to inform search theory has not gone
unnoticed, with several predominantly covert theories of search also making implicit (Itti &
Koch, 2000; Koch & Ullman, 1985; Olshausen, Anderson, & van Essen, 1993; Wolfe, 1994),
and occasionally explicit (Tsotsos et al., 1995; Wolfe & Gancarz, 1996) claims that overt eye
movement behavior will follow naturally from hypothesized covert search dynamics. Although
few theoretical treatments have systematically compared simulated search behavior to human
eye movements (see Rao, Zelinsky, Hayhoe, & Ballard, 1996, 2002, and Navalpakkam & Itti,
2005, for exceptions), there is good reason why this should become common practice. Manual
dependent measures do not adequately constrain search theory, as best exemplified by the
co-existence of serial search models (Treisman & Sato, 1990; Wolfe, 1994, 1998a) and signal
detection approaches (Eckstein, 1998; Palmer, 1994, 1995; Shaw, 1982; Swensson & Judy,
1981; see also Townsend, 1976, 1990) as explanations for the same patterns of RT x set size
functions. Such theoretical debates exist, in part, because the RT dependent measure lacks the
resolution to tease apart conflicting perspectives. By enriching a data set with eye movement
measures, such theoretical debate can be lessened, as it would be very unlikely that two
fundamentally different theories can explain a rich data set equally well. In general, the fixation-
by-fixation movements of gaze impose considerable constraints on any search theory, and any
theory would be strengthened in proportion to its ability to capture these spatio-temporal search
dynamics.
Fourth, unless instructed otherwise people overwhelmingly elect to move their eyes as they
search, and these behaviors deserve a theoretical explanation. At a rate of about 3-5 per second,
saccadic eye movements are our most frequently occurring observable behaviors, and many
of these behaviors are made in the service of visual search (for early observations, see Engel,
1977; Gould, 1973; and Williams, 1966; for reviews, see Findlay & Gilchrist, 2003; Rayner,
1978, 1998; and Viviani, 1990). Yet despite its prevalence in many of our daily activities, few
theories have been devoted specifically to explaining overt search behavior (see Eckstein et
al., 2007; Eckstein, Drescher, & Shimozaki, 2006; Geisler & Chou, 1995; Geisler, Perry, &
Najemnik, 2006; Najemnik & Geisler, 2005, for recent exceptions). Rather, there is a tradition
of treating overt search as a less interesting cousin of covert search, and of subsuming the
discussion of this topic under covert search theory. The rationale for this thinking again has its
roots in the high-speed serial attention model. Although eye movements and covert search
movements are highly correlated, the overt movement, because it has an actual motor
component, is slower and therefore lags behind the faster covert movement. According to this
perspective, if it were possible to speed up the frequency of eye movements, overt and covert
movements would visit the same display locations during search. This premise of the high-
speed search model should be treated with skepticism, for two reasons. First, search with eye
movements is not the same as search without eye movements. Eye movements can both
facilitate search by removing peripheral acuity limitations (Geisler & Chou, 1995) and
occasionally decrease search efficiency through the introduction of strategic biases
(Zelinsky, 1996; Zelinsky & Sheinberg, 1997). Overt and covert search therefore cannot be
equated; searching with eye movements qualitatively changes the search dynamics. Second,
there is growing reason to believe that oculomotor scanning, and not purely covert shifts of
attention, may be the more natural search behavior during a free-viewing task (Findlay,
2004; Findlay & Gilchrist, 1998, 2001). Using a probabilistic model and a free viewing task,
Motter and Holsapple (2007) recently demonstrated that covert shifts of attention occur too
infrequently to dramatically affect search behavior. If true, this means that the role of covert
scanning in search has been overestimated. Searches relying on purely covert shifts of attention
may be the exceptions rather than the rule, with these exceptions limited to fairly unnatural
search tasks when eye movement behavior is highly constrained.
1.2 Describing search in real-world contexts
The past decade has seen rapid growth in the number of studies using complex objects and
scenes as stimuli, and this trend is likely to continue. Real-world stimuli have long been used
to study memory and object recognition (e.g., Nickerson, 1965, 1968; Palmer, 1975; Shepard,
1967; Standing, 1973), and more recently have also appeared prominently in the visual
perception and attention literatures (e.g., Rensink, O'Regan, & Clark, 1997; Simons & Levin,
1997, 1998; Thorpe, Fize, & Marlot, 1996; see also Buswell, 1935). If search is to remain an
attractive topic of scientific enquiry, it too must evolve to accommodate complex and naturally
occurring stimuli.
For the most part this has happened, with search studies now spanning a wide range of contexts
from simple to complex. Simple search contexts are valuable in that they can reveal the visual
features that are, and are not, preattentively available to the search process (e.g., Enns &
Rensink, 1990, 1991; He & Nakayama, 1992; Julesz, 1981; Treisman & Gormican, 1988; for
review, see Wolfe, 1998b), as well as those features that can be used to guide search to a
designated target (e.g., Motter & Belky, 1998; Wolfe, Cave, & Franzel, 1989). To a large extent
the search literature was conceived and nourished on simple stimuli, and the key role that they
continue to play in understanding search behavior should not be underestimated. However,
search targets can also be complex, and several studies have now used complex patterns as
search stimuli, both in the context of object arrays (e.g., Biederman, Blickle, Teitelbaum, &
Klatsky, 1988; Levin, 1996; Levin, Takarae, Miner, & Keil, 2001; Neider & Zelinsky,
2006a; Newell, Brown, & Findlay, 2004; Zelinsky, 1999) as well as targets embedded in simple
and complex scenes (e.g., Aks & Enns, 1996; Biederman, Glass, & Stacy, 1973; Brockmole
& Henderson, 2006; Henderson, Weeks, & Hollingworth, 1999; McCarley et al., 2004;
Neider & Zelinsky, 2006b; Oliva, Wolfe, & Arsenio, 2004; Wolfe, Oliva, Horowitz, Butcher,
& Bompas, 2002; Zelinsky, 2001; Zelinsky et al., 1997). This adoption of complex stimuli has
fueled a new brand of image-based search theory (e.g., Itti & Koch, 2000; Navalpakkam &
Itti, 2005; Oliva, Torralba, Castelhano, & Henderson, 2003; Parkhurst, Law, & Niebur,
2002; Pomplun, 2006; Rao et al., 2002; Torralba, Oliva, Castelhano, & Henderson, 2006;
Zelinsky, 2005a; see Itti & Koch, 2001, for a review), but this theoretical development is still
in its infancy. Consequently, many basic search questions, such as how search is guided to a
complex target, are still not well understood.
Optimistically, one might think that issues of generalization from simple to complex search
contexts are nothing more than a minor theoretical nuisance. Given that target guidance may
rely on relatively basic features, it might not matter whether these features describe a simple
object or a realistic scene. Indeed, this view is central to the "modal model" conception of
search; complex patterns are decomposed into a set of feature primitives, then re-integrated or
"bound" into objects following the application of covert processing. Simple and complex
stimuli might therefore differ in terms of their feature compositions, but acting on these features
would be the same underlying search processes. Less optimistically, the generalization from
simple to complex search patterns might not be straightforward. Finding an unambiguous
representation for a coffee cup target in a real-world context will likely require a feature space
larger than what is normally assumed for colored-bar stimuli. Such increases in the
dimensionality of a feature space can qualitatively change the way features are used by a
system, making a complex pattern potentially more than just the sum of its parts (Kanerva,
1988). This qualitative change might arise due to capacity limits on visual working memory
(Alvarez & Cavanagh, 2004; Luck & Vogel, 1997; Zelinsky & Loschky, 2005), or by
differences in the coding of similarity relationships. For example, if complex objects were
coded using only two dimensions (e.g., color and orientation), this dimensionality constraint
would likely force subsets of these objects to have artificially high estimates of similarity, and
other subsets to have inflated estimates of dissimilarity. However, if this representational
constraint were lessened by coding two hundred dimensions rather than two, these same objects
would likely have very different similarity relationships, with far fewer peaks and valleys. In
some sense, this distinction between simple and complex search stimuli is analogous to the
distinction between artificial (e.g., Bourne, 1970) and natural categories (e.g., Rosch, 1973) in
the concept literature. Categories can be learned for a set of colored geometric objects, but
different rules seem to apply when the category is squirrels, or vehicles, or chairs. Ultimately,
the applicability of a search theory to realistic contexts must be demonstrated--it is not a
foregone conclusion.
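The claim that dimensionality changes similarity relationships can be illustrated with a small simulation. This is a sketch under stated assumptions rather than an analysis from the article: objects are random Gaussian feature vectors, similarity is the cosine between vectors, and the function names are hypothetical. The spread of pairwise similarities stands in for the "peaks and valleys" described above.

```python
import math
import random
from itertools import combinations
from statistics import pstdev


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def similarity_spread(n_objects, n_dims, rng):
    """Standard deviation of all pairwise similarities among random objects."""
    objects = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_objects)]
    sims = [cosine(a, b) for a, b in combinations(objects, 2)]
    return pstdev(sims)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
spread_2d = similarity_spread(40, 2, rng)      # objects coded on two dimensions
spread_200d = similarity_spread(40, 200, rng)  # the same number of objects on 200 dimensions

print(f"spread with 2 dimensions:   {spread_2d:.3f}")
print(f"spread with 200 dimensions: {spread_200d:.3f}")
```

With two dimensions, some pairs look nearly identical and others nearly opposite, so the spread is large; with two hundred dimensions the similarities cluster in a narrow band. Real objects are not random vectors, so the sketch shows only the direction of the effect.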
Complicating the extension of a search theory to realistic contexts is the selection of an
appropriate representational space. The problem is that the dimensions of this space are largely
unknown. Although most people would agree that a coffee cup consists of more visual features
than a colored bar, it is not apparent what these features are. Once the obvious list of candidate
features is exhausted, considerable disagreement will likely arise over what new feature
dimensions to represent (Treisman & Gormican, 1988; Wolfe, 1998b). Restricting discussion
to simple stimuli is one way of avoiding this problem. The features of a colored-oriented bar
are readily apparent; if the bar is green and vertical then these features, and only these features,
require coding. In other words, it is possible to hand pick the feature representation to match
the stimuli. Extending this solution to real-world objects, however, is likely to be arbitrary and
unsatisfying. Moreover, a model of search that uses hand picked features is necessarily more
limited in the range of stimuli to which it can be applied. Models hard-wired to "see" letters
(e.g., Humphreys & Muller, 1993) or oriented color bars (e.g., Wolfe, 1994) might therefore
work well for letter or bar stimuli, but may fail utterly if given realistic objects or scenes.
A general search theory, meaning one able to work with arbitrary designations of targets and
search contexts, should have at least three properties. First, it should be computationally
explicit, and preferably implemented as a working model. When it comes to working with
realistic stimuli, the devil is often in the details. One cannot be certain that a theory will
generalize across contexts unless this generalization is actually demonstrated. Second, a
model's operations should be relatively stimulus independent. If stimulus class A requires one
set of parameters and stimulus class B requires another set, and these parameter settings must
be supplied by a user, the search model cannot be described as general. Third, the model should
be able to flexibly accommodate stimuli ranging in complexity from simple patterns to fully
realistic scenes. One method of achieving such breadth is to represent search patterns using a
featurally diverse repertoire of spatio-chromatic filters (e.g., Itti & Koch, 2000; Rao et al.,
1996, 2002; Zelinsky, 2003). Similar filter-based techniques have been used with great success
to describe early visual processing within the computational vision community (e.g., Daugman,
1980; Lades et al., 1993; Leung & Malik, 2001; Malik & Perona, 1990; Olshausen & Field,
1996; Rohaly, Ahumada, & Watson, 1997; see Landy & Movshon, 1991, for a review), and
their inclusion in a search model would lend a measure of biological plausibility to the
approach. By using a large number of such filters, each tuned to a specific chromatic and spatial
property, a high-dimensional representation can be obtained that makes it unnecessary to hand
pick features to match stimuli. A green vertical bar would generate responses in those parts of
the feature vector coding for "green" and "vertical", and a coffee cup would generate responses
in whatever feature dimensions are specific to the coffee cup.
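The idea of a filter-based feature vector can be sketched as follows. The article specifies only "spatio-chromatic filters" at this point, so the zero-mean Gabor kernels, the opponent-channel formulas, the two scales, the four orientations, and every function name below are illustrative assumptions, not the filters used by TAM.

```python
import math
from typing import Dict, List, Tuple

Pixel = Tuple[float, float, float]  # (r, g, b), each in [0, 1]
SIGMAS = (1.5, 3.0)                        # assumed filter scales (pixels)
ORIENTATIONS = (0.0, 45.0, 90.0, 135.0)    # preferred bar orientation; 90 = vertical
CHANNELS = ("luminance", "red-green", "blue-yellow")


def opponent_channels(p: Pixel) -> Dict[str, float]:
    """Assumed colour-opponent recoding of one RGB pixel."""
    r, g, b = p
    return {"luminance": (r + g + b) / 3.0,
            "red-green": r - g,
            "blue-yellow": b - (r + g) / 2.0}


def gabor_kernel(sigma: float, bar_orientation_deg: float) -> List[List[float]]:
    """Zero-mean Gabor kernel that prefers bars at the given orientation."""
    radius = int(3 * sigma)
    wavelength = 4.0 * sigma
    phi = math.radians(bar_orientation_deg + 90.0)  # carrier runs across the bar
    rows = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            along = x * math.cos(phi) + y * math.sin(phi)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * along / wavelength))
        rows.append(row)
    mean = sum(map(sum, rows)) / ((2 * radius + 1) ** 2)
    return [[v - mean for v in row] for row in rows]


def feature_vector(image: List[List[Pixel]], cx: int, cy: int) -> Dict[tuple, float]:
    """Map (channel, sigma, orientation) to the absolute filter response at (cx, cy)."""
    height, width = len(image), len(image[0])
    features = {}
    for sigma in SIGMAS:
        for theta in ORIENTATIONS:
            kernel = gabor_kernel(sigma, theta)
            radius = len(kernel) // 2
            sums = dict.fromkeys(CHANNELS, 0.0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y, x = cy + dy, cx + dx
                    if 0 <= y < height and 0 <= x < width:  # pixels off the image count as zero
                        weight = kernel[dy + radius][dx + radius]
                        for name, value in opponent_channels(image[y][x]).items():
                            sums[name] += weight * value
            for name in CHANNELS:
                features[(name, sigma, theta)] = abs(sums[name])
    return features


# A green vertical bar, three pixels wide, on a black 21 x 21 background.
size = 21
green_bar = [[(0.0, 1.0, 0.0) if 9 <= x <= 11 else (0.0, 0.0, 0.0)
              for x in range(size)] for _ in range(size)]
features = feature_vector(green_bar, 10, 10)
strongest = max(features, key=features.get)
print(len(features), strongest)
```

No feature was hand-picked for the bar: the same 24-element vector is computed for any image location, and the green vertical bar simply produces its largest response in the entry tuned to the red-green channel and the vertical orientation.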
1.3 Overview
We should ask more of our search theories. Given the many ways that eye movement data can
enrich descriptions of search behavior, theories should strive to predict where each eye
movement will be directed in a scene, and the temporal order of these eye movements (where
do searchers look first, second, etc.). In short, theories of search should also be theories of eye
movements during search. Moreover, a theory should be able to make these predictions
regardless of stimulus complexity, meaning that it should work with realistic objects and scenes
as well as Os and Qs. Such a description of the eye movement behavior accompanying search
would constitute an extremely rigorous test of a theory, perhaps unrealistically so. Still, theories
should aspire towards meeting this standard, as even partial successes will help us to evaluate
what we know, and do not yet know, about search.
The work described in this article will take a small first step towards meeting this rigorous
theoretical challenge. The general approach is to have human and simulated searchers perform
the same tasks and to "see" the same displays. Model testing will consist of comparing the
simulated gaze behavior to the spatially and temporally exact eye position data from the
behavioral experiments. For the reasons outlined in Sections 1.1 and 1.2, this effort will focus
on how gaze becomes aligned with a designated target across a diverse range of tasks. Given
this focus, two important dimensions of search behavior will not be addressed. First, a model
describing the convergence of gaze on a target requires that a target be present; target absent
behavior will therefore not be considered here. Second, although the model will reflect time
in terms of a sequencing of eye movements, no attempt will be made to describe the durations
of individual fixations or to estimate the search response time by summing these fixation
durations. Correcting these omissions, a topic discussed more fully in Section 9, would require
adding decision criteria and stopping rules to the model that do not yet exist. In the interest of
keeping this initial version of the model relatively simple and focused on the model's spatial
behavior, treatment of these topics will be deferred to a future study. To acknowledge this
narrowed focus, I will henceforth refer to the behavior of this model as target acquisition, not
search, with the term "acquisition" here referring to the alignment of gaze with a target.
Similarly, the task required of the human observers is best described as target localization, not
target detection, although the two tasks are obviously related (Bundesen, 1991; Sagi & Julesz,
1985).
The organization of the article is as follows. Section 2 introduces the Target Acquisition Model
(TAM), with more detailed information pertaining to TAM's representations and processes
provided in individual subsections. The basic flow of processing in this model is shown in
Figure 1. Generally, computational vision techniques are used to represent scenes in terms of
simple and biologically-plausible visual feature-detector responses (e.g., colors, orientations,
scales). Visual routines (e.g., Ullman, 1984; Hayhoe, 2000) then act on these representations
to produce a sequence of simulated eye movements. Sections 3-7 describe five experiments
comparing human and simulated target acquisition behavior across a range of tasks. Section 3
describes a task requiring the localization of a target in fully realistic scenes, and Section 4
describes a task using simpler scenes to test more specific predictions of the model. In Sections
5-7, experiments are described that use O and Q stimuli to evaluate TAM under more restrictive
conditions, as well as to better relate its behavior to the basic search literature. These
experiments include a test for set size effects (Section 5), a search asymmetry (Section 6), and
an effect of target-distractor similarity (Section 7). A general discussion is provided in Section
8, in which broad implications for search and attention are discussed, and comparisons are
made to specific search models. Section 9 discusses TAM's limitations, and the article ends
with a brief conclusion in Section 10.
2 The Target Acquisition Model
This section introduces TAM and describes its representations and processes.1 As an overview,
the spatial and temporal dynamics of eye movements during target acquisition are simulated
using processes acting on map-based perceptual representations. Techniques borrowed from
the image processing community are used to obtain a fixation-by-fixation retina transformation
of the search image reflecting the visual acuity limitations at each gaze position. Other image
processing techniques then represent these retina-transformed scenes as collections of
responses from biologically plausible visual feature detectors. Following this feature
decomposition stage, the target and scene representations are compared, with the product of
this comparison being a map indicating the visual similarity between the target and each point
in the search scene (the target map). A proposed location for an eye movement is then defined
in scene space by taking the geometric average of the activity on this map. At any given moment
in processing, the model therefore knows where it is currently fixated and where it would move
its fixation, based on the averaging computation at that given moment. When the distance
between these two coordinates reaches a critical threshold, an eye movement is made to the
proposed location in the scene. A temporal dynamic is introduced by iteratively excluding
activation values from the target map that offer below-threshold evidence for the target. As the
activity on this map changes, so does the geometric average and the proposed fixation point.
Eventually, perhaps after several eye movements, this process isolates the most active values
on the target map. As this happens, the target's role in the averaging computation increases,
resulting in the guidance of gaze towards the location of the suspected target. If the fixated
pattern is determined not to be the target, this false target is inhibited and the cycle begins again
with the selection of a new target candidate for inspection. Processing stops when the target
match exceeds a high detection threshold, which often occurs only after the high-resolution
simulated fovea becomes aligned with the actual target (i.e., the target is fixated).
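The cycle described in this overview can be sketched in a few lines. This is a simplified illustration, not TAM itself: the target map is a small static grid with no retina transformation, the "geometric average" is read as an activity-weighted centroid, the foveal detection test is replaced by a check of whether gaze has landed on a known target cell, and all names and parameter values are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Grid = List[List[float]]


def survivors(target_map: Grid, floor: float) -> List[Tuple[int, int, float]]:
    """Cells whose activity is nonzero and at or above the current threshold."""
    return [(x, y, a) for y, row in enumerate(target_map)
            for x, a in enumerate(row) if a > 0.0 and a >= floor]


def centroid(cells: List[Tuple[int, int, float]]) -> Point:
    """Activity-weighted average location of the surviving cells."""
    total = sum(a for _, _, a in cells)
    return (sum(a * x for x, _, a in cells) / total,
            sum(a * y for _, y, a in cells) / total)


def acquire(target_map: Grid, start: Point, target_cell: Tuple[int, int],
            saccade_dist: float = 0.5, step: float = 0.05,
            max_steps: int = 500) -> Tuple[List[Point], bool]:
    """Return the simulated scanpath and whether the target was fixated."""
    tmap = [row[:] for row in target_map]  # work on a copy so inhibition stays local
    gaze, scanpath, floor = start, [start], 0.0
    for _ in range(max_steps):
        if (round(gaze[0]), round(gaze[1])) == target_cell:
            return scanpath, True            # stand-in for the high detection threshold
        cells = survivors(tmap, floor)
        if not cells:
            return scanpath, False
        proposal = centroid(cells)
        if math.dist(gaze, proposal) >= saccade_dist:
            gaze = proposal                  # eye movement to the proposed location
            scanpath.append(gaze)
        elif len(cells) == 1:
            x, y, _ = cells[0]               # fixated hotspot is not the target:
            tmap[y][x] = 0.0                 # inhibit the false target
            floor = 0.0                      # and begin the cycle again
        else:
            floor += step                    # prune weak evidence from the map


    return scanpath, False


def make_map(distractor_activity: float, target_activity: float) -> Grid:
    grid = [[0.05] * 9 for _ in range(9)]
    grid[1][1] = distractor_activity         # distractor at (x=1, y=1)
    grid[6][7] = target_activity             # target at (x=7, y=6)
    return grid


easy_path, easy_found = acquire(make_map(0.6, 0.9), (4.0, 4.0), (7, 6))
hard_path, hard_found = acquire(make_map(0.8, 0.7), (4.0, 4.0), (7, 6))
print(easy_path, easy_found)
print(hard_path, hard_found)
```

In the first map the initial saccade lands between the two objects, a center-of-gravity fixation, and the second reaches the target. In the second map the distractor has the stronger activity, so gaze is drawn to it first, the false target is inhibited, and the cycle restarts before the target is acquired.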
From the above overview it is clear that processing in this model is a highly dynamic interaction
between several intertwined representations and visual routines. However, what may have been
less clear is that these representations depend, not only on the visual properties of each specific
search scene and target, but also on the specific sequence in which the simulated fovea is
repositioned within a scene. The retina transformation, by progressively blurring peripheral
regions of the search image, will influence how well the target matches the scene. Only those
matches arising from the foveally viewed region of the scene will have the potential to yield a
good match to the target; other matches will be lessened in proportion to the peripheral
degradation of the search image. Because the target map is a representation of these matches,
the visual routines responsible for averaging and thresholding the target map will therefore be
using retina-transformed information when determining the location of the next gaze shift, and
this new gaze shift will result in a new retina-transformed view of the search image and
ultimately a new map of activation values. Given these intertwined retina and activation
constraints, even a small change in fixation position early in a trial might propagate through
this dynamical system and produce a radical change in the final simulated scanpath.
1 Aspects of this model were presented at the 46th meeting of the Psychonomic Society (Zelinsky, 2005b) and at the 2005 Neural
Information Processing Systems meeting (Zelinsky et al., 2006).
Figure 1 shows more concretely the dynamic flow of processing through this model. This
processing can be conceptually divided into four broad stages: (1) the creation of a target map,
(2) target detection, (3) the visual routines involved in eye movement generation, and (4) the
rejection of fixated false targets. These 4 stages are indicated by the dashed boxes in the figure.
The following sub-sections provide a more detailed description of the representations and
processes specific to each of these key stages.
2.1 Creating the Target Map
2.1.1 Input Images--For each simulated search trial, the model accepts two images as input:
one a high-resolution (1280x960 pixel) image of the search scene, the other a smaller,
arbitrarily sized image of the search target. Target images in this study were created by clipping
a patch from the search image, with the target pattern centered in this patch. The model therefore
has precise information about the target's appearance in the search image, although obviously
not its location. Neither the search nor the target images were preprocessed or annotated in any
way.
2.1.2 Retina Transform--Any model making claims about eye movement behavior must
include a foveated retina, without which eye movements would be unnecessary. Human
neuroanatomical constraints are such that the resolution of patterns imaged on the retina is
highest for the central region, known as the fovea. Resolution decreases with increasing
distance from the fovea, with patterns imaged on the peripheral retina appearing blurred.
Although we are often unaware of this peripheral degradation (e.g., McConkie & Rayner,
1976), its implication for search is profound. A foveally viewed target, because it is not
degraded, will yield a good match when compared to a target template stored in working
memory; the same target viewed peripherally will yield a poorer match. By aligning the fovea
with the target, eye movements therefore improve the signal-to-noise ratio and facilitate target
detection (Geisler & Chou, 1995).
In order to capture this basic human constraint on information entering the visual system, TAM
includes a simplified simulated retina (for retina transformations in the context of reading, see
Engbert, Nuthmann, Richter, & Kliegl, 2005; Reichle, Rayner, & Pollatsek, 2003; and Reichle
& Laurent, 2006). The method used to implement retina transformations was borrowed from
Geisler and Perry (1998, 2002; see also Perry & Geisler, 2002), and the interested reader should
consult this earlier work for technical details. Briefly, the approach describes the progressive
blurring of an image originating from a point designated as the center of gaze. The method
takes an image and a fixation coordinate as input, and outputs a retina-transformed version of
the image relative to the fixation coordinate. To accomplish this transformation, a multi-
resolution pyramid of the image (Burt & Adelson, 1983) is pre-computed, and a resolution
map is obtained indicating the degree of low-pass filtering applied to each image point with
respect to its distance from fixation. The retina-transformed image is created by interpolating
over different levels of the pyramid, with the specific levels and interpolation coefficients
determined by the resolution map. Importantly, none of the parameters needed to implement
the retina transformation were free to vary in this study. Computational experiments were
conducted based on a 20° × 15° simulated field of view and a half-resolution eccentricity (e2)
of 2.3°, a value that provides a reasonable estimate of human contrast sensitivity as a function
of viewing eccentricity for a range of spatial frequencies (Geisler & Perry, 1998; see also Levi,
Klein, & Aitsebaomo, 1985). Figure 2 illustrates the effect of this retina transformation for a
representative image used in Experiment 1. Note that retina transformations were performed
only on the search images. Target representations are assumed to be visual working memory
representations formed through foveal viewing of a target preview, and as such not subject to
acuity limitations.
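The resolution map described above can be illustrated with a minimal sketch. It assumes that the 1280 × 960 pixel image spans the full 20° × 15° field of view (64 pixels per degree) and that relative resolution falls off as e2 / (e2 + eccentricity), so that resolution is halved at e2. The function names and the mapping from resolution to pyramid level are hypothetical and are not taken from the Geisler and Perry implementation.

```python
import math

# Assumption: the search image spans the whole simulated field of view,
# so 1280 px / 20 deg = 960 px / 15 deg = 64 pixels per degree.
PIXELS_PER_DEGREE = 1280 / 20.0
E2_DEGREES = 2.3  # half-resolution eccentricity reported in the text


def eccentricity_deg(x, y, fix_x, fix_y):
    """Distance of pixel (x, y) from the fixation point, in degrees."""
    return math.hypot(x - fix_x, y - fix_y) / PIXELS_PER_DEGREE


def relative_resolution(ecc_deg, e2=E2_DEGREES):
    """Assumed falloff: 1.0 at fixation and 0.5 at an eccentricity of e2."""
    return e2 / (e2 + ecc_deg)


def pyramid_blend(ecc_deg, e2=E2_DEGREES):
    """Return (finer_level, coarser_level, weight_on_coarser_level).

    Assumes each pyramid level halves the resolution of the one below it,
    so the fractional level is log2(1 / relative_resolution).
    """
    level = math.log2(1.0 / relative_resolution(ecc_deg, e2))
    finer = math.floor(level)
    return finer, finer + 1, level - finer
```

Under these assumptions a pixel at fixation is drawn entirely from level 0 (the full-resolution image), whereas a pixel 2.3° away is drawn from the level with half that resolution; intermediate eccentricities blend two adjacent levels.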
2.1.3 Collect Filter Responses--Prior to filtering, the target and retina-transformed search
images were separated into one luminance and two opponent-process color channels, similar
to the representation of color in the primate visual system (Hurvich, 1981). The luminance
channel was created by averaging the Red, Green, and Blue components of the RGB images.
Color was coded by R-G and B-Y channels, where Yellow was the average of Red and Green.
For a given image, visual information was extracted from each color-luminance channel using
a bank of 24 Gaussian derivative filters (GDFs). These 24 filters consisted of 2 filter types
(1st and 2nd order Gaussian derivatives), each appearing at 3 spatial scales (7, 15, and 31 pixels)
and 4 orientations (0°, 45°, 90°, and 135°). Convolving these filters with the color-luminance
separated images yielded 24 filter responses per channel, or 72 filter responses relative to a
given location in the composite image.2 Figure 3 shows the responses from these 72 filters
aligned in space over the midpoint of the teddy bear image. Such a 72-dimensional feature
vector provides a sort of feature signature that can be used to uniquely identify the location of
a complex visual pattern in an image. See Zelinsky (2003) for a similar representation applied
to a change detection task, as well as for additional discussion of how GDFs can be used to
extract visual features from images of objects.
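The channel separation and the bookkeeping of the filter bank can be sketched as follows. The channel definitions and the filter counts follow the description above; the function and constant names are hypothetical, and the sketch only enumerates the filters rather than constructing the Gaussian derivative kernels themselves.

```python
from itertools import product


def opponent_channels(r, g, b):
    """Split one RGB pixel into luminance, R-G, and B-Y values."""
    luminance = (r + g + b) / 3.0   # average of Red, Green, and Blue
    red_green = r - g               # R-G opponent channel
    yellow = (r + g) / 2.0          # Yellow is the average of Red and Green
    blue_yellow = b - yellow        # B-Y opponent channel
    return luminance, red_green, blue_yellow


# Hypothetical names for the filter-bank dimensions given in the text.
DERIVATIVE_ORDERS = (1, 2)
SCALES_PX = (7, 15, 31)
ORIENTATIONS_DEG = (0, 45, 90, 135)
CHANNELS = ("luminance", "red_green", "blue_yellow")

# 2 orders x 3 scales x 4 orientations = 24 filters per channel.
FILTER_BANK = list(product(DERIVATIVE_ORDERS, SCALES_PX, ORIENTATIONS_DEG))

# 3 channels x 24 filters = 72 entries in each feature vector.
FEATURE_INDEX = list(product(CHANNELS, FILTER_BANK))
```

Each entry of `FEATURE_INDEX` names one dimension of the feature vector, for example the second-order filter at the 31-pixel scale and 90° orientation applied to the R-G channel.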
Feature vectors were obtained from every pixel location in the retina-transformed search image,
and from one pixel location in the target image, with the coordinate of this target image pixel
referred to here as the Target Vector (TV) point.3 Note that although only a single feature vector
was used to represent the target, this one vector represents information over a patch of the target
image (961 pixels, in the current implementation), with the size of this patch determined by
the scale of the largest filter. In the context of a search task, the filter responses collected from
the image region surrounding the TV point are intended to correspond to a working memory
representation of the target's visual features, similar to the property list of visual features that
is believed to underlie an object file representation (Irwin, 1996; Irwin & Andrews, 1996). As
for the comparatively dense array of feature vectors computed for the search image, this
representation bears a conceptual similarity to the visual feature analysis performed by
hypercolumns in striate cortex (Hubel & Wiesel, 1962). Like an individual hypercolumn, each
feature vector in this array is dedicated to the spatially localized analysis of a visual pattern
appearing at a circumscribed region of visual space.
2.1.4 Create Target Map--A typical search task requires the following three steps: (1) a
target must be represented and held in memory, (2) a search display must be presented and
represented, and (3) the target representation must be compared in some manner to the search
display representation. Steps 1 and 2 are accomplished by TAM through the collection of filter
responses, after which the retina-transformed search image and the target are represented in
the same 72-dimensional feature space. Step 3 is accomplished by correlating the target feature
vector with the array of feature vectors derived for the retina-transformed search image (Figure
4, top). Obtaining these correlations for each point in the search image produces what will be
referred to here as a target map, TM. More formally, if Ft is the target feature vector and Fp is
the feature vector obtained at point p in the retina-transformed search image, then the
corresponding point p in the target map is defined as:
2Little effort was made in this study to determine the set of optimal features for target acquisition, and the reader should bear this in mind
so as not to attach too much importance to the specific features listed here. Indeed, the goal was just the opposite, to represent stimuli
using fairly generic features, and to keep their number small so as to reduce computation time. By doing this, one can be certain that
TAM's behavior does not hinge on the inclusion of highly specialized features in its base representation.
3Although multiple TV points could be computed on the target image, pilot work determined that these feature vectors would be highly
redundant and that the denser target representation would not meaningfully affect task performance, at least not for the relatively small
targets used in the current study.
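$$TM_p = \frac{\sum_{i=1}^{72} F_p(i)\,F_t(i)}{\sqrt{\sum_{i=1}^{72} F_p(i)^2}\,\sqrt{\sum_{i=1}^{72} F_t(i)^2}} \qquad (1)$$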
Unlike a saliency map, which computes a measure of feature contrast between points in an
image (Itti & Koch, 2000; Koch & Ullman, 1985), each point in the target map represents a
measure of visual similarity between the corresponding point in the retina-transformed search
scene and the search target. A typical target map is shown in Figure 4 (bottom). Intensity
represents correlation strength, with brighter points indicating greater target-scene similarity.
The brightest points in the Figure 4 target map correspond to the location of the teddy bear
target in the retina-transformed search image, which is to be expected given that the object in
the search scene most similar to the target is usually the target itself. Note however that TAM
does not guarantee that the brightest points on the target map will correspond to the location
of the target. If the target appears at a more eccentric display location relative to a target-similar
distractor, the target pattern would undergo greater retinal blurring and, consequently, might
have a lower correlation on the target map.
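The comparison step that produces the target map can be sketched in a few lines. This is an illustration rather than TAM's actual code: it assumes a normalized correlation between feature vectors, and the function names and the nested-list layout of the feature array are hypothetical.

```python
import math


def correlation(f_target, f_point):
    """Normalized correlation between two feature vectors (assumed form)."""
    dot = sum(t * p for t, p in zip(f_target, f_point))
    norm = (math.sqrt(sum(t * t for t in f_target))
            * math.sqrt(sum(p * p for p in f_point)))
    return dot / norm if norm > 0 else 0.0


def target_map(f_target, feature_grid):
    """Correlate the target vector with the feature vector at every point.

    feature_grid[row][col] is the feature vector at that image location.
    """
    return [[correlation(f_target, f_point) for f_point in row]
            for row in feature_grid]


def brightest_point(tm):
    """Return the (row, col) of the highest value on the target map."""
    _, location = max((value, (r, c))
                      for r, row in enumerate(tm)
                      for c, value in enumerate(row))
    return location
```

In TAM each vector would have 72 entries and the grid would cover every pixel of the retina-transformed search image; the same functions work on toy vectors of any common length. As noted above, the brightest point need not be the target if retinal blurring has lowered the target's correlation.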
2.1.5 Add Noise to Target Map--A small amount of noise was added to each value in the
target map in order to correct the infrequent occurrence of a computational artifact. I observed
during pilot testing that when a configuration of identical search distractors was perfectly
symmetric around the simulated fixation point, their identical correlations on the target map
would create a stable activation state that would cause TAM's gaze to freeze at an intermediate
display position (see Sections 2.3.2 - 2.3.4 for additional details). The introduction of noise
(normally distributed between .0000001 and .0001 in the current implementation) served to
break these deadlocks and allow gaze to converge on an individual object. This minute level
of noise is consistent with noise inherent in the visual system.4
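A minimal sketch of this tie-breaking step is given below. The text specifies only the range of the noise, so the way the normal distribution is fitted to that range (centered on the midpoint and clipped to the bounds) is an assumption of this sketch, as are the function and constant names.

```python
import random

NOISE_MIN = 1e-7   # lower bound reported in the text
NOISE_MAX = 1e-4   # upper bound reported in the text


def add_noise(tm, rng):
    """Add a small positive noise value to every point on the target map.

    Assumption: noise is drawn from a normal distribution centered on the
    midpoint of the reported range and clipped to that range.
    """
    mean = (NOISE_MIN + NOISE_MAX) / 2.0
    sd = (NOISE_MAX - NOISE_MIN) / 6.0
    noisy = []
    for row in tm:
        noisy_row = []
        for value in row:
            noise = rng.gauss(mean, sd)
            noise = min(max(noise, NOISE_MIN), NOISE_MAX)
            noisy_row.append(value + noise)
        noisy.append(noisy_row)
    return noisy


# Two identical distractor correlations flanking a weaker central point.
symmetric_map = [[0.8, 0.1, 0.8]]
noisy_map = add_noise(symmetric_map, random.Random(0))
```

Because the noise is several orders of magnitude smaller than the correlations, the ordering of clearly different values is preserved, while exactly equal values become distinguishable.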
2.1.6 Update Target Map with Inhibition Map--The final stage in the creation of the
target map consists of updating this map with information about previously rejected distractors.
After a distractor is fixated and determined not to be the target, TAM tags this location on
the target map with a burst of Gaussian distributed inhibition (see Section 2.4.1 for details) so
as to prevent gaze from returning to the attractive lure. This amounts to a form of inhibition of
return (IOR; e.g., Maylor & Hockey, 1985; Posner & Cohen, 1984). The Inhibition Map
(IM) maintains an enduring spatial record of these inhibitory bursts (see Dickinson & Zelinsky,
2005, for a behaviorally explicit use of an inhibition map). Because the target map is derived
anew after each change of gaze (so as to reflect the information present in the new retina-
transformed search image), TAM must have some mechanism in place for inserting into each
new target map the inhibition associated with previously rejected distractors. The current
processing stage accomplishes this updating operation. With each new fixation, values on the
inhibition map (in the range of [-.5, 0] in the current implementation) are added to the new
target map (Figure 5). After a target is detected, the inhibition map is reset to zeros in
preparation for the next run. Like IOR, these inhibitory bursts are assumed to accumulate in a
scene-based reference frame (e.g., Müller & von Mühlenen, 1996; Tipper, Weaver, Jerreat, &
4Although this deadlock behavior reveals a potential weakness of the model, one should keep in mind that the very few observed cases
of this were limited to search displays consisting of extremely simple patterns (such as the O and Q patterns used in Experiment 3), and
only when the target was absent from the search display. Because only target present data are reported here, none of the simulations
conducted for this study would have resulted in deadlock in the absence of noise. Regarding the effect of noise on TAM's behavior, it is
unlikely that its introduction would have resulted in a meaningful re-prioritization of target map signals; noise ranged from .0000001 to .0001,
whereas values on the target map ranged from 0 to 1. Indeed, in pilot simulations conducted with and without the addition of noise,
simulated scanpaths were virtually indistinguishable. TAM uses noise solely to discourage identical values from appearing on the target
map (which is itself an unlikely occurrence in a biological system), not to affect search guidance. This is not true for all models of search,
where noise often plays a more central role. For example, if noise were excluded from Wolfe's (1994) guided search model, the target
would be the first item visited by attention on an unrealistically large proportion of trials.