
User Variance and its Impact on Video Retrieval Benchmarking

Document Description
In this paper, we describe one of the largest multi-site interactive video retrieval experiments conducted in a laboratory setting. Interactive video retrieval performance is difficult to cross-compare as variables exist across users, interfaces and the underlying retrieval engine. Within the framework of TRECVID 2008, we completed a multi-site, multi-interface experiment. Three institutes participated involving 36 users, 12 each from Dublin City University (DCU, Ireland), University of Glasgow (GU, Scotland) and Centrum Wiskunde & Informatica (CWI, the Netherlands). Three user interfaces were developed which all used the same search service. Using a Latin squares arrangement, each user completed 12 topics, leading to 6 TRECVID runs per site, 18 in total. This allows us to isolate the factors of user groups and interfaces from retrieval performance. In this paper we present an analysis of both the quantitative and qualitative data generated from this experiment, demonstrating that for interactive video retrieval with “novice” users, performance can vary by up to 300% for the same system using different sets of users, whilst differences in performance between interfaces were, in comparison, not statistically significant. Our results have implications for the manner in which interactive video retrieval experiments using non-expert users are evaluated. The primary focus of this paper is in highlighting that non-expert users generate very large performance fluctuations which may either mask or discount system variability. The discussion of why this happened is not covered by this paper.


Content Preview
User Variance and its Impact on Video Retrieval Benchmarking
Peter Wilkins¹, Raphaël Troncy², Martin Halvey³, Daragh Byrne¹, Alia Amin⁴, P. Punitha³, Alan F. Smeaton¹, Robert Villa³
¹ Centre for Digital Video Processing & CLARITY, Dublin City University, Ireland
² EURECOM, 2229, route des Cretes, Sophia Antipolis, France
³ Computing Science Dept., University of Glasgow, UK
⁴ CWI Amsterdam, The Netherlands
pwilkins@computing.dcu.ie
ABSTRACT
In this paper, we describe one of the largest multi-site interactive video retrieval experiments conducted in a laboratory setting. Interactive video retrieval performance is difficult to cross-compare as variables exist across users, interfaces and the underlying retrieval engine. Within the framework of TRECVID 2008, we completed a multi-site, multi-interface experiment. Three institutes participated involving 36 users, 12 each from Dublin City University (DCU, Ireland), University of Glasgow (GU, Scotland) and Centrum Wiskunde & Informatica (CWI, the Netherlands). Three user interfaces were developed which all used the same search service. Using a Latin squares arrangement, each user completed 12 topics, leading to 6 TRECVID runs per site, 18 in total. This allows us to isolate the factors of user groups and interfaces from retrieval performance. In this paper we present an analysis of both the quantitative and qualitative data generated from this experiment, demonstrating that for interactive video retrieval with “novice” users, performance can vary by up to 300% for the same system using different sets of users, whilst differences in performance between interfaces were, in comparison, not statistically significant. Our results have implications for the manner in which interactive video retrieval experiments using non-expert users are evaluated. The primary focus of this paper is in highlighting that non-expert users generate very large performance fluctuations which may either mask or discount system variability. The discussion of why this happened is not covered by this paper.

Categories and Subject Descriptors
H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems - Evaluation/methodology, Video; H.3.3 [Information Storage and Retrieval]: Information search & retrieval - Information filtering, search process

General Terms
Experimentation, Human Factors, Measurement

Keywords
TRECVID, CBMIR, Video Retrieval, User Study

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIVR ’09 July 8-10, 2009 Santorini, GR
Copyright 2009 ACM 0-12345-67-8/90/01 ...$5.00.

¹ K-Space is a European Network of Excellence (NoE) in semantic inference for semi-automatic annotation and retrieval of multimedia content.
² Raphaël Troncy participated in this work whilst at CWI.

1. INTRODUCTION
Interactive video retrieval performance is difficult to assess due to a variety of factors. These include the effect of the ‘user in the loop’, search expertise or aptitude of the user, the graphical interface and the retrieval engine. The interplay between these various factors has always been difficult to disambiguate, particularly within benchmarked video retrieval evaluations which favor reporting of mean average precision - a measure of system performance - rather than human performance measures. In TRECVID [16], attempting to disambiguate these factors which may affect the performance of interactive retrieval is difficult, as only a list of saved ‘shots’ is returned by systems. Motivated by this and as part of the TRECVID 2008 interactive video search task, the K-Space¹,² group undertook a novel experiment: conducting a cross-site evaluation using three different search interfaces with a common search engine. In total 36 users, 12 from each of three different sites, were employed to perform searches using each interface. Including multiple geographic sites diversified our user base, whilst decoupling the retrieval engine from the interfaces provided standard and consistent performance to each, thereby facilitating examination of both the user and interface effect and the extent to which these factors may impact on retrieval performance independent of algorithmic performance.

To the best of our knowledge, a content-based interactive video retrieval experiment within laboratory conditions of this size has not previously been undertaken. This experiment demonstrates that by providing a common search engine to multiple user interfaces, whilst gathering quantitative and qualitative metrics from participants, significant insights can be obtained into the factors influencing retrieval performance. We find, through the use of transparent cross-site human performance measurements, that the largest factor determining search performance is the users chosen to
perform the task. We note that the effect of a ‘human in the loop’ may be extremely unpredictable (Section 5) and as such, there is a definite need to share such measures, as they are key to understanding the factors which determine the reported success of a given interactive search system.

This paper is organized as follows. We present related work in the following section. The retrieval engine and interfaces are outlined in Section 3. The background and expertise of the users were carefully documented, as well as their perception of the interfaces and search sessions (Section 4). Furthermore we captured logs of all user interaction to quantify the users’ behavior. Section 5 presents our results and analysis of the outcomes of our experiment. We finish with a discussion of the results and their potential impact on benchmarking activities in Section 6.

2. RELATED WORK
Within the information retrieval (IR) community, a number of evaluation and benchmarking activities have been established. These share the common goal of providing a large scale collection of data in order to achieve ‘benchmarked’ or comparable evaluation across various sites. Benchmarking provides a standardized, metricated evaluation to enable the comparison between information retrieval systems based on performance. Several benchmarking efforts exist, including TREC and CLEF, but perhaps one of the best known multimedia initiatives is the TREC Video Retrieval Evaluation (TRECVID) initiative [16]. Since its inception in 2000, the National Institute of Standards and Technology (NIST) has coordinated this activity annually. The goal of this benchmarking activity is ‘to promote progress in content-based retrieval from digital video via open, metrics-based evaluation’ [16]. In so doing, TRECVID encourages research on multimedia information retrieval by providing a large test collection, uniform scoring procedures, and a forum for the clear comparison of results to those working in the domain. As part of the evaluation, the teams are provided with a development and test corpus of broadcast video footage. Each team then builds a retrieval system using the development data to guide them and performs a series of topic-based experiments. In the interactive search task, a user is provided with a graphical user interface and must complete a set of search topics. For each topic, the user is provided with a set of visual exemplars and descriptive text, e.g. “Find shots of one or more people walking up stairs”, and the search must be completed within 10 minutes. In the 2008 task, users could perform 24 topics and teams were allowed to submit up to 6 runs. The human judged relevant items were then reported to NIST and validated. The interactive search task is particularly important for two major reasons. First, it aims to replicate a real world scenario in which the searcher can react to the search results by, for example, reformulating the queries. Second, it is evident that video retrieval with a ‘human in the loop’ will far outperform any automatic methods. It is as such essential to not only understand the effect of the underlying retrieval engine when discussing system performance but also to quantify the role of the user, the interface and their interplay [5].

Interest in interactive retrieval is widespread across the Information Science domain. Within the text retrieval community, there have been multiple efforts to attempt to disambiguate variables which contribute to retrieval performance. One of the most notable of these activities was the Interactive Track of TREC, beginning in TREC-6 [11]. The objective of this activity is the same as ours, “isolating the effects of topics, human searchers, and other site-specific factors” [11]. Similar to what we have attempted, this activity required that participating sites not only record documents that users saved, but also record complex interaction logs, demographic data and qualitative metrics. However, the key advantage of this activity, which we currently do not have in TRECVID, was that participants in the Interactive Track were required to run, in conjunction with their own retrieval systems, a baseline retrieval system supplied by NIST. This allowed for a comparative evaluation of the abilities of the searchers involved at each site, thus allowing comparisons of systems which took this into consideration. This motivation directly applies to our work, as we seek to replicate this scenario by having all three user interface variations used at each site, with each interface using a common search engine, giving us an idea of the variance within our user set. Having a baseline retrieval system has always been an intention of NIST [15]. However, given the complexities of content-based multimedia information retrieval systems, this has proven difficult to achieve.

To the best of our knowledge, no efforts have been made to share interaction data from TRECVID or other multimedia initiatives. However, the Open Video Digital Library (OVDL) has provided a repository of digital content, and an open interface for browsing and searching the data [12]. Despite the lack of shared interaction data from TRECVID evaluations, the common collections and benchmarking resources provided by this initiative have facilitated a great deal of research into both interactive retrieval and its associated human factors. For example, MediaMill at the University of Amsterdam built rich visualizations of the result-space (e.g. the Fork-, Cross- and Rotor-Browser) that enable users to easily explore the full depth of often-complex result-sets [6]. The team from the National University of Singapore pushes the boundaries of ‘extreme retrieval’ by forcing the user to make judgments on a result’s relevance within a very limited time window [13]. FXPAL has evaluated a collaborative retrieval system under the TRECVID benchmark [1].

Perhaps of most interest are the explorations conducted by researchers at Carnegie Mellon, who have extensively explored user-centered issues in video retrieval. Christel and Conescu previously investigated how best to support the novice within the retrieval process through techniques such as shot suppression and by encouraging the use of different access mechanisms within a shot-based interface [5]. Interestingly, they show that the suppression of previously seen shots did not have the anticipated positive effect on performance. Christel has also discussed the distinctions between novices and experts and outlines the design considerations required to cater for these roles within the retrieval process [3]. Additionally, Christel has considered the use of storyboards, a grid layout of thumbnail images as surrogates representing video for video search, a commonly-adopted metaphor within video retrieval interfaces. He remarks that such storyboards offer many advantages in exploratory, shot-based retrieval but, moving forward, support for longer term search activities needs to be considered [4]. Hauptmann and Christel [8] have also surveyed the ‘state of the art’ in TRECVID search systems, discussing the features which contribute to the success within an interactive
retrieval. They highlight the importance of text retrieval, noting it to be “much more robust than any of the visual features”. Moreover, they highlight the utility of temporal context in interactive retrieval, a topic which Yang and Hauptmann have further explored as a means by which to augment the ranking of search results [8]. They define temporal consistency as “the tendency that the relevant shots ... appear in temporal proximity” for a given semantic concept or query. They note that while the degree to which relevant items are temporally proximal is dependent on the topic, temporal context is extremely useful in video retrieval.

Building upon some of this prior work, the K-Space group conducted an interactive retrieval experiment as part of TRECVID 2007 [2]. The investigation was designed to further explore the role of temporal context within interactive search. This was achieved by creating two search interfaces which offered the polar extremes of temporal context and by logging all user interactions throughout participants’ search sessions. The first variant was recall-oriented, offering a large number of results without any context information, while the second was context-oriented, placing each result in the context of the full broadcast. Apart from sharing the same retrieval engine, both systems also shared a common query input panel, topic description panel and saved shot area. The only major difference was in the presentation of the results from the underlying retrieval engine. Furthermore, the effect of context-provision was explored for both novice and expert searchers. While performance in both systems was comparable, with experts notably outperforming novices, the progression of the search and the search strategies adopted by the users were markedly different for each interface. Interestingly, users of the recall-oriented system often failed to find relevant temporal siblings for a relevant shot. As such the authors suggest that the presentation of some temporal context within shot-based interfaces can be used to significantly and effectively augment the number of relevant items while minimizing user effort (where effort is search reformulation).

3. SYSTEM DESCRIPTION
The three user interfaces developed for the search experiment leveraged a common search engine that makes use of several content-analysis techniques. We briefly detail these, with a more complete explanation in [17].

As no common keyframe set was released by TRECVID, we extracted our own set of keyframes. Our keyframe selection strategy was to extract every second I-Frame from each shot. We extracted low-level visual features from K-frames using several feature descriptors based on the MPEG-7 XM. These descriptors were implemented as part of the aceToolbox³, a set of low-level audio and visual analysis tools developed in the EU aceMedia project. We made use of six different global visual descriptors. These descriptors were Colour Layout, Colour Moments, Colour Structure, Homogeneous Texture, Edge Histogram and Scalable Colour.

³ https://kspace.cdvp.dcu.ie/secure/aceToolbox.zip

The common search engine leveraged multiple modalities to form a response to an information need from a user. The search engine allows for multiple query-by-example, text queries and mixed-modality queries. For visual components of queries we made use of the six global visual features identified earlier; ranking within each was handled by the similarity measures specified in the MPEG-7 specification. These measures for the most part are similar to Euclidean distance. Our retrieval engine also made use of High-Level Features (i.e. concepts), which were generated by the K-Space partners and covered the 36 semantic features required for participation in TRECVID 2007. Further details can be found in [17].

The content-analysis techniques above could be accessed via two mechanisms within the search engine. The first method was to use the outputs of these techniques as ‘filters’ on a result set of shots. The filters could have three states: ‘show only shots matching the filter’, ‘shots not matching the filter’ and no effect (default).

The second method of access incorporated not only the K-Space content-analysis results, but also results from the CU-VIREO374 collection donated by City University of Hong Kong and Columbia University [10], for which we are very grateful. We took the names of the concept detectors and ran these through WordNet, obtaining the synonyms for these terms. Therefore for each shot we had a bag of words which described the visual aspects of that shot. This text for each shot was then augmented with the translated ASR text provided by the University of Twente [9]. This produced for each shot a collection of terms which described the content of the shot, incorporating both visual and audio information. The text was then indexed by Terrier [14], with retrieval results provided through a vector space model.

3.1 Three interfaces for the Interactive Search
The following subsections provide an overview of the user interfaces. Further description can be found in [17].

3.1.1 Shot based Interface (DCU-1)
The ‘shot based’ system presented to the user the ranked list of shots direct from the retrieval engine. The ranked shots are organized left to right, top to bottom (Figure 1). It can be thought of as the more traditional result display that has been used for content-based retrieval interfaces. This interface displays no context for any of the returned results.

Figure 1: Shot-based user interface

3.1.2 Broadcast based Interface (DCU-2)
The ‘broadcast based’ system takes the idea of context to its extreme by ranking not shots, but broadcasts. The magazine/documentary broadcasts which compose the TRECVID 2008 corpus tend to be about one major subject, whilst in previous years a news broadcast could be seen as containing many subjects. With this in mind we can assume that the shots within a broadcast are more homogeneous. As such, ranking broadcasts as opposed to shots appears to be an interesting alternative. In Figure 2, we can see horizontal rows of shots across the results area. Each of these rows is an entire ranked broadcast, with the best-matching broadcast being the first row. When a user issues a query, the ranked list of broadcasts is presented, and each broadcast’s row is centered on the highest matching shot within that broadcast.
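
The broadcast-based presentation can be illustrated with a minimal sketch. The text does not state how a broadcast's score is derived from its shots, so the sketch assumes, purely for illustration, that a broadcast is scored by its best-matching shot. The function name `rank_broadcasts` and the `(broadcast_id, shot_index, score)` input format are hypothetical and are not taken from the K-Space system.

```python
from collections import defaultdict


def rank_broadcasts(shot_scores):
    """Group scored shots into broadcast rows and rank the rows.

    shot_scores: iterable of (broadcast_id, shot_index, score) tuples,
    where shot_index is the temporal position of the shot in its broadcast.

    ASSUMPTION: a broadcast is scored by its highest-scoring shot. The paper
    does not specify the aggregation; this is only one plausible choice.
    """
    by_broadcast = defaultdict(list)
    for broadcast_id, shot_index, score in shot_scores:
        by_broadcast[broadcast_id].append((shot_index, score))

    rows = []
    for broadcast_id, shots in by_broadcast.items():
        shots.sort()  # temporal order within the broadcast
        # Shot on which the row would be centered in the interface.
        center_index, best_score = max(shots, key=lambda s: s[1])
        rows.append({
            "broadcast": broadcast_id,
            "score": best_score,
            "center": center_index,
            "shots": [index for index, _ in shots],
        })

    # Best-matching broadcast becomes the first row.
    rows.sort(key=lambda row: row["score"], reverse=True)
    return rows


if __name__ == "__main__":
    results = [("b1", 0, 0.2), ("b1", 1, 0.9), ("b2", 0, 0.5),
               ("b2", 1, 0.4), ("b1", 2, 0.1)]
    for row in rank_broadcasts(results):
        print(row["broadcast"], "centered on shot", row["center"])
```

Keeping every shot of a broadcast in temporal order, rather than only the matching ones, is what gives each row its context; only the centering depends on the query.
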
Figure 3: Zooming user interface
needed to complete a current task before proceeding to the
next one.
The time given to complete a task was 10 minutes. Micro
breaks were introduced between tasks to allow participants
to refresh themselves. Participants were given a question-
naire during the experiment, with background information
on the individual and their search experience collected prior
to training. For each topics, users interactions were logged
(time, video searched and browsed, video played, video saved
and video removed), providing an extensive amount of infor-
Figure 2: Broadcast-based user interface
mation on the participants interaction with the system. Fol-
lowing each task, participants were asked to appraise their
3.1.3
Zooming Interface (GU)
performance, while at the completion of tasks for an inter-
The Zooming interface leveraged temporal context as well
face, they were required to assess the system. The question-
as a diversity re-ranking of the search results. The value
naire were based on the AttrakDi? questionnaire [7] and
and importance of a search result appears to be based on it’s
probed the hedonic and ergonomic aspects, usability, and
value as a good starting point for a user to ?nd other relevant
positive and negative experiences of using the system. After
shots within a video, by browsing the video, as much as the
the evaluation, we conducted debrie?ng interviews with the
relevance of the result itself. Based on the ability of users to
participants to gain more informal feedback.
easily browse videos, and the willingness of many users to
All users were instructed to save as many shots as possible
do so in order to ?nd relevant material, we constructed an
that matched the TRECVID topic description. If a user was
interface that (a) emphasizes results provide are good start-
unsure, it was left to his/her discretion whether to save the
ing points from which to ?nd material (point-?nding within
shot or not, but the emphasis of the task was to ?nd as many
videos), and (b) extend the video browsing elements of the
matches as possible.
user interface to enable users to more easily view and browse
In total 36 people participated in our study. They are stu-
videos. To address (a) we introduced a diversity based re-
dents or researchers, recruited equally from the 3 di?erent
ranking to the search results which o?ers more ‘starting o?
institutions mentioned above. Participants are mostly male
points’ for browsing. In order to achieve (b) a zooming in-
(75%), aged between 25 to 56 years old (M =29.1, SD=6.2).
terface was implemented to aid users in exploring more of a
While most participants were experienced searchers, they
videos content when engaged in neighborhood search.
are novice video searchers and occasionally use video search-
ing applications (see Table 4). A few (4 from 36 people) were
4.
EVALUATION
advanced video searchers and frequently use video search-
We conducted our interactive video retrieval evaluation
ing applications. We anticipated that some population bias
under laboratory settings, carrying out the experiment in
would be in e?ect, the participants from DCU were from
three geographically di?erent locations (CWI in Amster-
our research group and would be more familiar with the
dam (NL), DCU in Dubin (IE), GU in Glasgow (UK)). Each
concepts of content-based retrieval, than graduate students
participant used the 3 di?erent search interfaces described
from CWI whose specialities lie elsewhere.
above: Shot-based, Broadcast-based, and Zooming interface
(within subject design). In total, they were required to com-
plete 12 video search tasks (4 video search topics per inter-
5.
RESULTS
face) taken from the TRECVID 2008. Users completed a
In this section, we will present an analysis of our user ex-
training session prior to the main task to ensure they were
periment, speci?cally examining issues concerning users and
familiar with the interfaces operations and functionality and
their impact in retrieval performance, from both quantita-
that they fully understood the search tasks. A participant
tive and qualitative perspectives. This analysis will demon-

Age:
22-56 years old,
1 means that the sets are the same. We compute for each
M =29.1, SD=6.2
topic the Jaccard index within each system (DCU-1, DCU-
Gender:
Male (27), Female (9)
2, GU) across sites, and compare this against the Jaccard
Education:
graduate students (13),
index within each site (CWI, DCU, GU). The ?rst compar-
researchers (17), other (6)
ison gives us an indication if users using the same system
General search exp.*:
M =4.6, SD=0.9
save similar shots, whilst the second tells us that within a
Video search exp.*:
M =2.0, SD=1.1
location (e.g. CWI) if users for a topic are saving the same
A?liation:
DCU (12), CWI (12), GU (12)
shots regardless of system used. The results are plotted in
* 1:none, 3:fairly (1 search daily), 5:very frequently (several daily)
Figure 4.
Table 1: Participants demographic (Total: 36)
Saved Shot Overlap: Site v. System
0.4
System
InfAP
P10
DCU-1
DCU-2
DCU-1 (Max)
0.0366
0.4125
GU
0.35
Equivalence
DCU-2 (Max)
0.0306
0.3750
GU (Max)
0.0366
0.3750
0.3
DCU-1 (Med)
0.0123
0.2104
DCU-2 (Med)
0.0077
0.1437
0.25
GU (Med)
0.0121
0.1875
0.2
Table 2: 2008 Interactive Search Results
Site Jaccard Index
0.15
strate that given the same retrieval engine and user inter-
0.1
faces, the performance of individual users has far greater
0.05
impact on search performance than previously anticipated.
Table 2 presents an analysis of our results using the TRECVID
0
evaluation metrics of Inferred Average Precision (InfAP)
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
System Jaccard Index
and Precision@10 (P10). For each system we present two
runs: ‘max’ and ‘med’. We evaluated every user’s perfor-
Figure 4: Saved Shot Set Overlaps
mance for every topic to compose these runs. The ‘max’
run is the selection of topic results which achieved the best
Points which occur on the line in the graph have equal val-
performance in terms of InfAP for that system. The ‘med’
ues for system and site set overlaps. Points which occur to
run is the median run, where the selection of results was
the right of this line have greater overlaps due to the system
obtained by calculating the median InfAP value for every
topic for each system.

The results presented in Table 2 startlingly illustrate the impact of user
selection in composing runs for submission to a benchmarking activity, and the
degree to which users' performance varies. Each of our ‘max’ runs obtains three
times the performance of its equivalent median run. This result was unexpected:
whilst we anticipated some variability in our results due to differences in user
populations, the magnitude of the observed difference is alarming.

Evaluating these runs, we ran significance tests using α = 0.01 and found that
whilst every ‘max’ run was significantly better than the ‘med’ runs, within each
‘class’ of run (i.e. ‘max’ and ‘med’) there was no significant difference (i.e.
the ‘max’ runs were not significantly different from each other, and likewise
for the ‘med’ runs). On the one hand, this means that given equivalent sets of
users, each of the interfaces performed at approximately the same level.
However, within the same system we note the massive discrepancy between the
performance of the ‘max’ and ‘med’ runs. From the same set of users, we were
able to produce representative runs which varied wildly, indicating large
performance variance in our user set.

To investigate this further, we examined which shots a user saved for each topic
for each system, to determine if there were any commonalities. We utilize the
Jaccard index, which provides a measure of how similar or dissimilar two sets
are. A Jaccard index value is in the range [0:1]; a value of 0 means that the
sets are mutually exclusive, a value of

being used, whilst points to the left have greater overlaps due to the site. The
points are labeled according to the system used. Our first observation is that
for the majority of topics, there is very little overlap in the shots saved by
users, as the majority of points lie in the range [-0.05: 0.05], meaning that
there is great variability in the shots selected as relevant by our users.
However, we do see some artifacts in the graph. For example, the GU system (the
star points) has multiple points to the right of the line, indicating that users
of the GU system for certain topics were more likely to save the same shots,
regardless of site. Alternatively, we see that the DCU-2 system for certain
topics features points on the left of the line, indicating for those topics that
users within a site using that system found similar shots. As this interface
promoted browsing the collection, groups (such as DCU) who have previous
experience with content-based retrieval may have examined the collection in
similar ways, resulting in a higher site overlap. The purpose of this graph was
to establish if there was any commonality in the sets of shots the users saved,
and if a bias could account for this (i.e. did users using the same system save
the same shots, or did users from the same site save the same shots). However,
on the whole there was little intersection of the shots users saved, lending
further evidence towards the indications of massive variability of performance
in our user base.

An alternative method for examining the variability of user performance is to
examine the number of shots saved by each user. In examining this, we make the
assumption that if a user saved a shot, then for that user the shot is relevant.
For every topic we determine the average number
of shots saved by each system. Transforming the number of shots saved for each
topic by each user into a Z-score, we are able to express for every topic how
close or far the number of shots saved by a user was in comparison to the mean.
We aggregated this data at the site level. This allows us to express, for a
given site, how its users performed on average with regard to the average
performance overall. Figure 5 displays these graphs, one for every site. The
X-axis of each graph lists the standard deviations: +2σ indicates that users
saved two standard deviations more shots than the average, while −2σ is the
opposite. Each bar on the graph represents the average amount of saves for a
given system.
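The Z-score transformation described above can be made concrete with a short
sketch. It is illustrative only: the record layout (site, system, topic, number
of shots saved), the function name and the choice of the population standard
deviation are assumptions of this example, not details taken from the
experiment.

```python
from collections import defaultdict
from statistics import mean, pstdev


def site_mean_z_scores(records):
    """Average Z-score of saved-shot counts for each site.

    records: iterable of (site, system, topic, shots_saved) tuples, one per
    completed search session (a hypothetical layout for this sketch).
    """
    records = list(records)

    # Collect the counts for every (topic, system) cell, so that each count is
    # compared with the mean for the same topic on the same system.
    counts = defaultdict(list)
    for site, system, topic, saved in records:
        counts[(topic, system)].append(saved)

    z_by_site = defaultdict(list)
    for site, system, topic, saved in records:
        cell = counts[(topic, system)]
        mu = mean(cell)
        sigma = pstdev(cell)  # population SD; an assumption of this sketch
        # A cell with no spread carries no information, so score it as 0.
        z = (saved - mu) / sigma if sigma > 0 else 0.0
        z_by_site[site].append(z)

    # Aggregate at the site level, as the text describes.
    return {site: mean(zs) for site, zs in z_by_site.items()}


# Made-up sessions: three sites, one topic, one system.
sessions = [
    ("DCU", "DCU-1", "topic-1", 10),
    ("CWI", "DCU-1", "topic-1", 6),
    ("GU", "DCU-1", "topic-1", 2),
]
site_z = site_mean_z_scores(sessions)
```

A positive value means that a site's users saved more shots than the mean for
the same topic and system, a negative value that they saved fewer.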
The data displayed here confirms our earlier observations about variance in our
user population. We can see that for the CWI site, shown in the first graph, the
data follows a normal distribution. The majority of the users for CWI are saving
about the average number of shots per topic for each system (i.e. the bulk of
the mass is located within ±1σ). However, this contrasts with both the DCU and
GU sites, which both exhibit a skewed distribution. In the case of DCU, the
distribution is skewed to the left, indicating that on average compared to the
other sites, users at DCU were saving more than the average number of shots for
any given topic and system. Conversely, the GU data is skewed to the right,
showing that on average the GU users saved fewer than the average number of
shots.
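For reference, the saved-shot overlap analysis earlier in this section relies on
the Jaccard index, which can be sketched as follows. The shot identifiers, the
function names and the convention of returning 0 for two empty sets are
assumptions of this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of shot ids."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch: no shots, no overlap
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(saved_by_user):
    """Mean Jaccard index over all pairs of users for one topic.

    saved_by_user: {user_id: set of shot ids saved by that user}.
    """
    pairs = list(combinations(saved_by_user.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up shot ids for three users working on the same topic.
saved = {
    "user-1": {"shot-1", "shot-2"},
    "user-2": {"shot-2", "shot-3"},
    "user-3": {"shot-4"},
}
overlap = mean_pairwise_jaccard(saved)
```

Values near 0 correspond to the situation reported here, where users saved few
shots in common.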
When the previous evidence is taken together, it presents an unexpected picture.
We have found the variability of users' performance to be far greater than
expected. Users saved few shots in common despite using the same information
systems for the same task. Likewise, when we constructed entire ‘retrieval’ runs
for evaluation by taking the best performing result for a topic and the median
result, the difference between these runs was approximately 300%, yet in each
case runs of the same class were not statistically different. We also gathered
qualitative data to obtain further insights into the search session.
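The construction of ‘max’ and ‘med’ runs summarised above can be sketched as
follows, assuming one average precision (AP) score per user for each topic on a
given system. The input layout, the function names and the use of the lower
median (so that the ‘med’ entry is always a result some user actually produced)
are assumptions of this example, not the procedure as implemented for the
submitted runs.

```python
from statistics import mean, median_low


def build_runs(ap_by_topic):
    """Build a 'max' and a 'med' run for one system.

    ap_by_topic: {topic: [AP of each user who completed that topic]}.
    Returns two dicts mapping each topic to the chosen AP score.
    """
    max_run = {topic: max(aps) for topic, aps in ap_by_topic.items()}
    # median_low always returns one of the observed scores (an assumption).
    med_run = {topic: median_low(aps) for topic, aps in ap_by_topic.items()}
    return max_run, med_run


def mean_average_precision(run):
    """Mean of the per-topic AP scores of a run."""
    return mean(list(run.values()))


# Made-up AP scores for two topics, three users each.
scores = {
    "topic-A": [0.02, 0.06, 0.10],
    "topic-B": [0.01, 0.03, 0.20],
}
max_run, med_run = build_runs(scores)
ratio = mean_average_precision(max_run) / mean_average_precision(med_run)
```

With these made-up numbers the ‘max’ run scores a little over three times the
‘med’ run, which is the kind of gap the text reports.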
Figure 5: Average Number Shots Saved by Site

5.1 Qualitative Data

With a large number of users, and consequently a large number of the same topics
completed by different users, there is an obvious challenge in constructing
retrieval runs for evaluation. One approach is to conduct the saved shot
analysis as outlined above, while an alternative is to survey the users to gain
a subjective measure of performance with an interface for any given topic. As
part of the evaluation our participants were asked to gauge this after the
completion of each topic. This offers us the ability to compare actual
performance with perceived performance and to validate its potential utility in
the composition of retrieval runs. The results of this are presented in
Figure 6.

This graph aggregates each user's estimate of their perceived performance per
site for each of the interfaces (listed on the X-axis). The range for this data
was [-3, +3]. From this, we can determine that all sites thought that they
performed poorly using the DCU-2 system, which does correlate with the actual
performance shown in Table 2. However, the users from CWI and DCU expressed no
perception of strongly positive performance for any of the systems, whilst users
from GU thought that they performed well using the system developed within GU.
This is indicative of a potential site bias present in the qualitative data and
would suggest
caution should be used in applying this data in selection strategies.

Figure 6: User estimation of topic performance for each system

As mentioned in Section 3, each of the three interfaces presented results in a
uniquely different format. In the post-system questionnaires, we solicited
subjects' opinions on these interfaces and the techniques used to browse and
present results. Participants were asked to rate their overall perceived
performance with the interface, its ease of use, learnability, and its general
appeal using 7-point semantic differential scales, which yielded results in the
range of -3 to +3. To assess the general appeal of the interface, users were
administered twenty-three separate questions on a 7-point semantic differential
scale. These were based on the AttrakDiff questionnaire [7] and allowed the
assessment of the user interfaces in three broad categories, namely: ergonomic
quality; hedonic quality; and appeal. We applied two-way analysis of variance
(ANOVA) to each differential across all 3 systems and the 24 topics to test the
significance of these results, which are presented in Table 3. The mean value is
displayed along with the standard deviation in brackets; values in bold are
statistically significant.

From the results in Table 3, it appears that participants had a mixed reaction
to the interfaces presented, with contrasting views particularly for the DCU-2
and GU systems. Users found that the DCU-2 system was the hardest to use, whilst
the GU system was the easiest. The DCU-2 system, however, was deemed the most
ergonomic of the interfaces, a potential artifact of the large amount of
temporal context displayed. The GU system scored the highest for hedonic
quality. Reinforcing the previous section, users found that the DCU-2 system was
likely to result in poor performance.

Table 3: Interface Comparison 1) (Total: 36 people)

                                       Mean Score (SD)
                          Shot (DCU-1)   Broadcast (DCU-2)  Zooming (GU)   p.
Interface assessment:
  Easy to use             0.87 (1.36)    -0.74 (1.54)       1.06 (0.49)    F(2,33)=14.51, p<.05
  Easy to learn           0.22 (1.42)    -0.29 (1.26)       0.79 (0.79)    F(2,33)=9.20, p<.05
  Ergonomic quality 2)    -0.72 (0.66)   0.07 (0.99)        -0.83 (0.88)   F(2,33)=25.8, p<.05
  Hedonic quality 2)      -0.58 (-0.57)  -0.64 (0.68)       -0.03 (-0.07)  F(2,33)=6.84, p<.05
  Appeal 2)               -0.74 (0.72)   -0.25 (1.01)       -0.35 (1.10)   F(2,33)=3.01, p=.09
Self assessment:
  Overall performance     0.37 (1.28)    -1.50 (1.17)       1.42 (0.97)    F(2,33)=36.43, p<.05
  using the interface

1) Min: -3, Max: 3
2) AttrakDiff questionnaire [7]

6. DISCUSSION AND CONCLUSION

It is widely accepted that the performance of interactive video retrieval is
impacted by many factors: the interface, the retrieval experts utilized, the
selection of keyframe extraction strategies and the users employed to undertake
the experiment. In this paper, we presented a large-scale retrieval experiment,
making use of non-expert users, spread over three geographic sites. We found
exceptionally high levels of user variance in this activity. The fundamental
implications of this reflect upon how we evaluate retrieval systems and whether
the conclusions we draw are robust.

We are certainly not the first to highlight this issue; indeed many researchers
have commented on this, and NIST itself would like these issues addressed, as
highlighted in TRECVID 2003 [4, 11, 15]. The conclusions of this paper may
appear alarmist, calling into question observations reached from previous
retrieval experiments in TRECVID as there was no user normalization. However,
groups in recent years participating in TRECVID have avoided these complications
by engaging in “expert” runs, that is, the use of a single user who was involved
in the creation of the retrieval system, but isolated from the test collection.
This user is able to maximize the performance of the system they developed. We
can then propose that groups who perform this style of interactive experiment
are able to make more robust observations as they can compare against other
“expert” systems.

The question of why our experiments elicited this large variability in user
performance is an important one. One possibility raised by peer review was the
impact of topic complexity and its relationship to our observations. We
conducted an examination of the correlation between our average user variance
per topic and the overall TRECVID median average precision score per topic,
finding a Pearson correlation of 0.49, with variance greater when the median AP
was high. Given that TRECVID 2008 was a low performing benchmark with regards to
AP, this may account for some of our observations. However, why we observed
variance is not the focus of this work; the fact remains that large user
variance did occur, which leads us to question how to interpret the resulting
metrics.

Interactive video retrieval, both at the implementation and execution level
(building a system and running an experiment), is undoubtedly hard. Within
TRECVID, we are observing a rapid increase in the popularity of fully automatic
search, which with the user removed allows for robust comparisons of retrieval
algorithms. Yet as video retrieval increases in popularity, it becomes ever more
important for us to develop a better understanding of the human involved in the
video retrieval context. The question becomes, what mechanisms can we employ to
conduct non-expert retrieval experiments and achieve robust results?

The ideal solution would be what was employed in the TREC Interactive Track
[11], a baseline system which could be used to benchmark the users involved.
This though would require significant effort and is unlikely to happen in the
near future. Other possibilities include the reporting of additional evaluation
metrics [12], the move away from analyst-oriented deep information seeking
tasks, or the change to more generalized corpora which are less dependent on
“shots”.

However, the activities of benchmarking evaluations such as TRECVID respond to
the needs expressed by the community. We need to discuss what important factors
should be captured so that a greater understanding of interactive video
retrieval can be gained. A simple beginning point would be the timestamping or
inclusion in submitted results only of shots saved by a user in a search
session. This would allow us to cross-compare even at a simple level how users
varied across systems, which currently is not possible with many existing search
results.
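The topic-complexity check mentioned in the discussion above correlates two
per-topic series: user variance and the TRECVID median average precision. A
minimal sketch of the Pearson coefficient used for such a check is given below;
the function name and the sample values are made up for illustration and are not
data from the experiment.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical per-topic values, in the same topic order for both lists.
user_variance_per_topic = [1.0, 2.0, 3.0, 4.0]
median_ap_per_topic = [1.0, 3.0, 2.0, 4.0]
r = pearson(user_variance_per_topic, median_ap_per_topic)
```

A positive coefficient, as reported in the text, indicates that topics with a
higher median AP also tended to show higher user variance.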
Acknowledgments

This paper was supported by the European Commission under contract FP6-027026
(K-Space) and by Science Foundation Ireland under grant 07/CE/I1147 (CLARITY:
Centre for Sensor Web Technologies).

7. REFERENCES

[1] J. Adcock, J. Pickens, M. Cooper, F. Chen, and P. Qvarfordt. Fxpal
    interactive search experiments for trecvid 2007. In TRECVid 2007 - Text
    REtrieval Conference TRECVid Workshop, 2007.
[2] D. Byrne, P. Wilkins, G. Jones, A. F. Smeaton, and N. O’Connor. Measuring
    the impact of temporal context on video retrieval. In CIVR 2008 - ACM
    International Conference on Image and Video Retrieval, 2008.
[3] M. G. Christel. Establishing the utility of non-text search for news video
    retrieval with real world users. In MULTIMEDIA ’07: Proceedings of the 15th
    international conference on Multimedia, pages 707–716, New York, NY, USA,
    2007. ACM.
[4] M. G. Christel. Supporting video library exploratory search: when
    storyboards are not enough. In CIVR ’08: Proceedings of the 2008
    international conference on Content-based image and video retrieval, pages
    447–456, New York, NY, USA, 2008. ACM.
[5] M. G. Christel and R. M. Conescu. Mining novice user activity with trecvid
    interactive retrieval tasks. In CIVR, pages 21–30, 2006.
[6] O. de Rooij, C. G. Snoek, and M. Worring. Balancing thread based navigation
    for targeted video search. In CIVR ’08: Proceedings of the 2008
    international conference on Content-based image and video retrieval, pages
    485–494, New York, NY, USA, 2008. ACM.
[7] M. Hassenzahl, A. Platz, M. Burmester, and K. Lehner. Hedonic and ergonomic
    quality aspects determine a software’s appeal. In CHI ’00: Proceedings of
    the SIGCHI conference on Human factors in computing systems, pages 201–208,
    New York, NY, USA, 2000. ACM.
[8] A. G. Hauptmann and M. G. Christel. Successful approaches in the trec video
    retrieval evaluations. In MULTIMEDIA ’04: Proceedings of the 12th annual
    ACM international conference on Multimedia, pages 668–675, New York, NY,
    USA, 2004. ACM.
[9] M. Huijbregts, R. Ordelman, and F. de Jong. Annotation of heterogeneous
    multimedia content using automatic speech recognition. In Proceedings of
    the second international conference on Semantics And digital Media
    Technologies (SAMT), Lecture Notes in Computer Science, Berlin, December
    2007. Springer Verlag.
[10] Y.-G. Jiang, A. Yanagawa, S.-F. Chang, and C.-W. Ngo. CU-VIREO374: Fusing
    Columbia374 and VIREO374 for Large Scale Semantic Concept Detection.
    Technical report, Columbia University, August 2008.
[11] E. Lagergren and P. Over. Comparing interactive information retrieval
    systems across sites: the trec-6 interactive track matrix experiment. In
    SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR
    conference on Research and development in information retrieval, pages
    164–172, New York, NY, USA, 1998. ACM.
[12] G. Marchionini, B. M. Wildemuth, and G. Geisler. The open video digital
    library: A möbius strip of research and practice. Journal of the American
    Society for Information Science and Technology, 57(12):1629–1643, 2006.
[13] S.-Y. Neo, H. Luan, Y. Zheng, H.-K. Goh, and T.-S. Chua. Visiongo: bridging
    users and multimedia video retrieval. In CIVR ’08: Proceedings of the 2008
    international conference on Content-based image and video retrieval, pages
    559–560, New York, NY, USA, 2008. ACM.
[14] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in
    terrier. Novatica/UPGRADE Special Issue on Web Information Access, Ricardo
    Baeza-Yates et al. (Eds), Invited Paper, 2007.
[15] A. F. Smeaton, W. Kraaij, and P. Over. Trecvid 2003 - an overview. In
    TRECVID 2003 - Text REtrieval Conference TRECVID Workshop, MD, USA, 2003.
    National Institute of Standards and Technology.
[16] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and trecvid. In
    MIR 2006 - 8th ACM SIGMM International Workshop on Multimedia Information
    Retrieval, 2006.
[17] P. Wilkins et al. KSpace at TRECVid 2008. In TRECVid 2008 - Text REtrieval
    Conference, TRECVID Workshop, Gaithersburg, Md., 17-18 November 2006, 2008.
