Computer Pidgin Language:
A new language to talk to your computer?
Stephen Hinde, Guillaume Belrose
Internet Systems and Storage Laboratory
HP Laboratories Bristol
HPL-2001-182
July 30th, 2001*
E-mail: {Stephen_Hinde, Guillaume_Belrose}@hpl.hp.com
Keywords: CPL, voice, speech, telecoms

This paper explores a new concept called Computer Pidgin Language (CPL). This is a radical new approach to dealing with the problem of humans talking to computers. The new approach is to teach people a new language that is efficient for dialogues with computers, a sort of artificial spoken language. We see this as being analogous to how people learn scribble on a PDA. We explore in this paper the motivation for CPL from the appliance, e-service, and infrastructure perspective. We explore some early results from a proof of concept demo that we have built to test these ideas. We also explore some of the wider implications of CPL and longer-term research directions.
* Internal Accession Date Only
Approved for External Publication
© Copyright Hewlett-Packard Company 2001
Computer Pidgin Language:
A new language to talk to your computer?
Stephen Hinde and Guillaume Belrose
13th July 2001
HP Laboratories
Filton Road, Stoke Gifford, Bristol, BS34 8QZ, UK.
{Stephen_Hinde, Guillaume_Belrose}@hpl.hp.com
1. An Introduction to Computer Pidgin Languages

CPL, or Computer Pidgin Language, is a radical departure from the normal approach to Speech Recognition Systems. CPL is inspired by frustration at a perceived lack of progress in Spoken Language Research over the last 20-30 years. The authors believe that systems that only understand people 85% of the time are hardly usable, so speech recognition is very much a last-resort technology or a curiosity. This led us to ask: what could we do that is radically different to improve this?

The radical departure we are proposing is that instead of training a computer system to recognize human speech, we could train the human to speak a new spoken language that is optimized to maximize the efficiency of Automatic Speech Recognition. We call a language of this type a Computer Pidgin Language or CPL.

One of the inspirations for thinking about CPL as an approach to spoken language recognition is observing the evolution of handwriting recognition. There was little progress in the field of handwriting recognition over a period of 20-30 years, until the move to a technique called “Scribble Matching”[17]. “Scribble Matching” was first introduced on the Apple Newton in 1993 and was later reintroduced in a much better form on the Palm. The innovation came with the realization that handwriting could be simplified to a series of standard strokes that could be taught to a human, and which a computer could then recognize easily. This turned handwriting recognition from a long-term research area into a commercial technology. Our question is whether a similar step is possible in the field of speech recognition.

The authors of this paper have started some initial work in this direction to assess whether CPL languages do exist by performing some simple proof of concept experiments.

The simplest class of CPL would consist of a small vocabulary language for use on a mobile phone, a child’s toy, an appliance or PDA: essentially command and control languages. It is this class of CPL language that we have considered in our initial proof of concept experiment, and which we will discuss later in this paper.

However we are aware that potentially there is a vast field of long-term research into the area of CPL. The more long-term questions raised by CPL concern deriving Artificial Languages with similar complexity to human languages. Linguists have been deriving Artificial Languages for many years, but as far as we are aware no one has worked on a language for talking to Computers. The motivation for doing this would be to create the spoken language of Cyber Space that human beings and machines could use to converse: the “Latin” of the cyber scholar.

The early slave traders derived simple “Pidgin” languages for talking to slaves, as they believed the slaves to lack the intelligence to learn Western languages. This inspired our name Computer Pidgin Language. The paradox is that these languages have evolved to form highly grammatically complex languages, the Creole languages. So maybe in the future there will be CCL: Computer Creole Languages.

Human languages are constantly evolving and have been an integral part of evolution. With the moves towards Genetic Algorithms and Programming it is interesting to speculate whether CPL can also undergo evolution and optimization as it goes. Our initial work in this area has used GA techniques to find CPL vocabularies. This GA approach to ASR has proved interesting.

There are however many interesting anthropological, social, linguistic, cultural and psychological questions that would require answering if CPL were to progress towards widespread acceptance. Many of these questions would be better answered by the academic community than by our small research group, so our intention is to stimulate other communities to think about the CPL concept.

2. A review of Appliance, Infrastructure, E-Service integration

There is much speculation by Mobile Service Providers about the rich new possibilities brought about by the intersection of Appliances, E-services and Computer Systems linked by “the always on infrastructure”. In this section we are going to argue that the interface to the human is in danger of being the “weakest-link” in this world.

In the PC world, the “Windows” human computer interface is now common currency amongst the computer literate community, but the interface is far from intuitive and excludes a large section of the population.

In the world of small mobile appliances there is no effective interface for accessing the Internet; WAP, DTMF phone and small PDA interfaces are relatively new and have their limitations. For many years Spoken Language System (SLS) researchers have advocated speech as a rich and natural method of interaction with small devices. However 30-40 years of Spoken Language Systems research has still only led to inadequate and unsatisfactory results.

The new driver of mobility and appliance computing is creating a strong business pull for efficient human computer interfaces. However there is a strong tension between the human’s ability to communicate and the computer’s ability to understand (Figure 1). This is where we believe innovation is required to break this tension.

[Figure 1 diagram: “Tension” between User and System.
 User (human/appliance): speech; text via phone; IVR via phone; small keyboards; small displays.
 System capabilities: inadequate ASR/NL; forms; inadequate free form text/NL.]

Figure 1: Computer dialog system inefficiency measured against efficient appliance based human interfaces.

The current largest application of Speech technology is to the Interactive Voice Response (IVR) market. The high cost of human operators in the call center has motivated call center operators to deploy IVR systems. However the interfaces remain very simple, the set up costs are high, and usability is less than optimum.

With the convergence of Internet and Telecom technologies there has been a webifying of the IVR story and IVR systems have been renamed VoiceWeb
systems[11] or WebIVR. These new systems offer improved interfaces to the Web back end technologies and a standard Markup language for authoring IVR applications. However the basic speech technologies remain the same, so the efficiency of the human interface remains poor.

Science fiction has led popular imagination to expect us to talk to computers as in Star Trek or other science fiction (Figure 2). There seems to be no evidence in current research that we are even close to making this happen.

Captain Kirk: “Computer….”
Computer Voice: “Yes Captain”
Captain: “What can you tell me about the Phaos system?”
Computer Voice: “B-type planet, can support human life, 3 moons”..

Figure 2: Star Trek computer interface.

The MIT Galaxy system represents the State-of-the-Art in terms of advanced research systems. These systems move speech systems into the realm of free conversation or “mixed initiative”. They feature domains of discourse such as talking about weather or travel[1][2].

The most advanced SLS in the best research groups in the world, such as the DARPA Communicator system and the MIT Galaxy system, should give us a glimpse of what commercial systems will be possible in, say, 10 years. However these systems still suffer from inadequacies and problems, in terms of high set up cost, errors, and limited domains of discourse[4].

3. Artificial languages

All language is man-made, but artificial languages are made systematically for some particular purpose. They take many forms, from mere adaptations of an existing writing system (numerals), through completely new notations (sign language), to fully expressive systems of speech devised for fun (Tolkien) [8], or secrecy (Poto and Cabengo) or learnability (Esperanto)[9]. There have been artificial languages produced of no value at all such as Dilingo [10] and artificial language toolkits[5].

Esperanto, another artificial language, was invented by Dr. Ludwig L. Zamenhof of Poland, and was first presented to the public in 1887. It has enjoyed some recognition as an international language, being used, for example, at international meetings and conferences. The vocabulary of Esperanto is formed by adding various affixes to individual roots and is derived chiefly from Latin, Greek, the Romance languages, and the Germanic languages. The grammar is based on that of European languages but is greatly simplified and regular. Esperanto has a phonetic spelling. It uses the symbols of the Roman alphabet, each one standing for only one sound. A simplified revision of Esperanto is Ido, short for Esperandido. The French philosopher Louis Couturat introduced Ido in 1907, but it failed to replace Esperanto.

As far as we are aware there has not been an artificial spoken language produced for its ability to talk to computers.

4. E-Inclusion and Computer Dialogues

E-Inclusion is a current term that is being used to talk about how to give people of varying cultural diversity and social backgrounds access to E-Services. We observe that currently there are some 6,000 plus active languages in the world. In fact linguists have no idea what the exact number of active languages is. The current approach to E-Inclusion is “Localization”, or to force people to learn English or some other widely spoken language. One approach to Spoken Language E-Inclusion is to advocate an international language, which would be universally available around the world to talk to devices. If this spoken language were more efficient for talking to computers than the existing languages, people might be motivated to learn it.

5. Human languages and protocols

Humans have an innate ability to learn complex languages. This is one of the prime differentiators of the human species over other species: their ability to use complex language structures. This ability is markedly strong in children and at later ages varies according to environment and individual capability. It is interesting to speculate how languages evolved. We can obviously see the origins of language in more primitive species, but we could also think of languages as being a particular form of protocol, and all species have a very strong ability to learn and work with protocols, that is, systems of operating outside their immediate bodies. Our hypothesis is that the human being is a highly evolved protocol engine, whereas the computer is a new boy on the block in terms of
evolution of protocols. Interestingly, with the move towards genetic algorithms (GA) and genetic programming (GP) we see computers starting to use evolutionary methods to work with protocols. We have started to use GAs in our design of CPL.

6. Creation of a CPL vocabulary

This part describes a method that was put in place to create small size CPL vocabularies. Such a vocabulary can be used to control a device with simple commands. It consists of a set of words that is designed to optimize the efficiency of an automatic speech recognizer.

6.1. Data representation

A CPL word is represented as a sequence of p consonant vowel units represented with the ARPAbet notation[18]. We use a subset of the British English phone set that does not contain the following phones: “ia” as in peer, “ea” as in pair, “oh” as in pot and “ua” as in poor. These phones are not used in American English.

ARPAbet  Example  ARPAbet  Example
B        But      Iy       Bean
P        Put      Ih       Pit
D        Den      Ae       Pat
T        Ten      Aa       Barn
G        Game     Ah       Putt
K        Can      Ao       Born
F        Full     Ay       Buy
V        Very     Ax       About
S        Some     Ey       Bay
Z        Zeal     Eh       Pet
Dh       Then     Er       Burn
Sh       Ship     Ow       No
L        Like     Aw       Now
R        Run      Oy       Boy
Y        Yes      Uh       Good
W        Went     Uw       Boon
Hh       Hat      Ng       Long
M        Man      Ch       Chain
N        Not      Jh       Jane

Figure 3: Phonetic alphabet.

6.2. The optimization problem

We use a genetic algorithm ([15][16]) exploring the space of phonemes to find a vocabulary with the lowest confusability. The GA manipulates a population of N individuals. Each individual contains a DNA string coding a set of k words (i.e. a vocabulary V). The DNA string is a sequence of k*p*2 phonemes, for example “f’aa’dh’er k’aa’s’ay” (the symbol ‘ is used to separate two phonemes).

We randomly create an initial seed of N individuals. Each individual is evaluated by a fitness function. After evaluation the N fittest individuals are selected to form the next generation of the population. We use a ranking selection algorithm whereby the probability of selecting an individual is related to its rank within the population. The crossover operator exchanges, between two individuals, fragments of DNA cut around a randomly selected crossover point. The mutation operator replaces a consonant by a consonant, and a vowel by a vowel. The frequencies of crossover and mutation are controlled by probabilities.

6.3. The fitness function

The fitness function measures the confusion of the vocabulary created by an individual. The goal of the GA is to minimize this function.

We have at our disposal a phoneme confusion matrix A generated from recognizing a training set of British English utterances with the ABBOT speech recognizer. This recognizer is a hybrid RNN/HMM large vocabulary continuous speech recognizer that was developed by Cambridge University and University of Sheffield.

This matrix provides the conditional probability $a_{ji} = \Pr(y = p_i \mid x = p_j)$ of recognizing a phoneme as $p_j$ when it is actually $p_i$.

We created a confusion function $\mathrm{conf}(p_i, p_j)$ with the following properties:

$\mathrm{conf}(p_i, p_i) = 1$
$\mathrm{conf}(p_i, p_j) = a_{ji}$
$\mathrm{conf}(p_i, p_j) = \varepsilon$ if $a_{ji} = 0$
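As an illustration only, the three properties of the confusion function can be written as a small Python function. The toy 3x3 matrix `A`, the phone list `PHONES`, and the helper name `conf` are assumptions made for this sketch; they are not the ABBOT-derived matrix used in the experiments. `EPSILON` plays the role of the small value ε (0.0001 in this paper) that replaces zero entries.

```python
# Toy phone inventory and confusion matrix (illustrative values only,
# not the matrix measured with the ABBOT recognizer).
PHONES = ["b", "p", "d"]

# A[j][i] stands for a_ji, following the index order used in the text.
A = [
    [0.90, 0.08, 0.02],
    [0.10, 0.90, 0.00],
    [0.03, 0.00, 0.97],
]

EPSILON = 0.0001  # floor used when a matrix entry is exactly zero


def conf(pi: str, pj: str) -> float:
    """Confusion between phones pi and pj, following the three properties."""
    if pi == pj:
        return 1.0                      # conf(pi, pi) = 1
    i, j = PHONES.index(pi), PHONES.index(pj)
    a_ji = A[j][i]
    return a_ji if a_ji > 0 else EPSILON  # zero entries become EPSILON


if __name__ == "__main__":
    print(conf("b", "b"))  # identical phones
    print(conf("b", "p"))  # value read from the matrix
    print(conf("p", "d"))  # zero entry, replaced by EPSILON
```

Keeping the floor inside `conf` means that any word-level product built on top of it never collapses to exactly zero because of a single unobserved confusion.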
For some entries in the matrix, the confusion between two phonemes can be null. However, in practice, there is still some probability that these phonemes are confused. We set this probability to a small value ε (ε = 0.0001).

We propose two methods to evaluate the confusion between two words from the vocabulary. Given two words $W_i = \{p_{i1}, p_{i2}, \ldots, p_{i,2p}\}$ and $W_j = \{p_{j1}, p_{j2}, \ldots, p_{j,2p}\}$, their confusion can be the sum or the product of the confusion of the phonemes composing these words.

(A) $\mathrm{conf}(W_i, W_j) = \prod_{t=1}^{2p} \mathrm{conf}(p_{it}, p_{jt})$

(B) $\mathrm{conf}(W_i, W_j) = \sum_{t=1}^{2p} \mathrm{conf}(p_{it}, p_{jt})$

We propose three methods to evaluate the overall confusion of a set of words. For a given vocabulary V, its confusion can be the average confusion of the words composing this vocabulary, or the total confusion of all the words from the vocabulary, or the worst (i.e. highest) confusion.

(C) $\mathrm{conf}(V) = \sum_{i=1}^{k} \sum_{j \neq i} \mathrm{conf}(W_i, W_j)$

(D) $\mathrm{conf}(V) = \frac{1}{k(k-1)} \sum_{i=1}^{k} \sum_{j \neq i} \mathrm{conf}(W_i, W_j)$

(E) $\mathrm{conf}(V) = \max_{i \neq j} \mathrm{conf}(W_i, W_j)$

6.4. Constraining the evolution

While generating words, we noticed that the GA produced words that were not easy to pronounce. We added arbitrary constraints to the structure of the words in order to tackle this problem.

• Only one diphthong is allowed per word. Diphthongs are: ‘ey’, ‘ay’, ‘oy’, ‘ow’, ‘aw’.
• Short vowels are not allowed at the end of the word. Short vowels are ‘aa’, ‘ao’, ‘ih’, ‘eh’, ‘ae’, ‘ah’, ‘uh’ and ‘aw’.
• The phones zh and th are not used.
• The phone ng is not a valid starting consonant.
• The phones r, y and w are not allowed either side of a diphthong.
• The phone hh is not allowed as a second consonant.

If an individual does not respect one of these constraints, it receives a penalty that increases its fitness value (which the GA minimizes) and so reduces its chance of surviving the selection process.

6.5. Results of the evolution

We created different sets of 26 words with various parameters and confusion matrices. In all cases, the algorithm quickly converges towards a stable solution.

The graph below shows the evolution of the best individual fitness for a population of 5000 individuals that evolved during 300 iterations. The GA converges to a solution where the fitness of the best individual is 0.011036.

[Figure 4 plot: best individual fitness (y-axis, 0 to 0.07) against iteration (x-axis, 1 to 281). Plot title: “Evolution over time of the best individual fitness with methods (A) and (D), 5000 individuals, 300 iterations”.]

Figure 4: Evolution over time of the fitness of the best individual in the population.

We devised a batch mode program that runs the GA a certain number of times with the same parameters. We use it to determine whether or not the GA converges towards similar solutions. The batch mode ran 90 times with a population of 5000 individuals evolving during 500 iterations with the methods (A) and (C). The graph indicates that the results of the GA are consistent over time.
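The batch procedure and the GA of Section 6.2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' program: the phone subsets, the synthetic `conf` function (a stand-in for the ABBOT-derived matrix), the parameter values, the elitism step, and all function names are hypothetical, and the pronounceability constraints of Section 6.4 are omitted. Fitness here uses the product (A) over phonemes and the average (D) over word pairs.

```python
import random
from itertools import permutations

CONSONANTS = ["b", "p", "d", "t", "k", "s", "z", "m"]  # assumed subset
VOWELS = ["iy", "aa", "ey", "ow", "uw"]                 # assumed subset
SIMILAR = [{"b", "p"}, {"d", "t"}, {"s", "z"}, {"iy", "ey"}, {"ow", "uw"}]
K, P = 4, 2  # k words per vocabulary, p consonant-vowel units per word


def conf(a, b):
    """Synthetic phoneme confusion (toy values, not measured data)."""
    if a == b:
        return 1.0
    return 0.2 if {a, b} in SIMILAR else 0.01


def word_conf(w1, w2):
    """Method (A): product of phoneme confusions, position by position."""
    result = 1.0
    for a, b in zip(w1, w2):
        result *= conf(a, b)
    return result


def fitness(dna):
    """Method (D): average confusion over ordered pairs of distinct words."""
    n = 2 * P
    words = [dna[i:i + n] for i in range(0, len(dna), n)]
    pairs = list(permutations(words, 2))
    return sum(word_conf(a, b) for a, b in pairs) / len(pairs)


def random_dna(rng):
    """A DNA string of k*p*2 phonemes, alternating consonant and vowel."""
    return [rng.choice(CONSONANTS if i % 2 == 0 else VOWELS)
            for i in range(K * P * 2)]


def mutate(dna, rng, rate):
    """Replace a consonant by a consonant and a vowel by a vowel."""
    return [rng.choice(CONSONANTS if i % 2 == 0 else VOWELS)
            if rng.random() < rate else ph
            for i, ph in enumerate(dna)]


def crossover(a, b, rng):
    """Exchange DNA fragments around a random crossover point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def run_ga(seed, n=40, iterations=60, p_cross=0.7, p_mut=0.05):
    """Run one GA and return the best (lowest) fitness found."""
    rng = random.Random(seed)
    pop = [random_dna(rng) for _ in range(n)]
    for _ in range(iterations):
        pop.sort(key=fitness)                  # lowest confusion first
        weights = [n - r for r in range(n)]    # ranking selection weights
        parents = rng.choices(pop, weights=weights, k=n)
        children = [pop[0]]                    # assumed elitism: keep best
        for a, b in zip(parents[::2], parents[1::2]):
            if rng.random() < p_cross:
                a, b = crossover(a, b, rng)
            children += [mutate(a, rng, p_mut), mutate(b, rng, p_mut)]
        pop = children[:n]
    return min(fitness(d) for d in pop)


if __name__ == "__main__":
    # Batch mode: same parameters, different random seeds.
    print([run_ga(seed) for seed in range(5)])
```

Comparing the values returned by the batch run gives a rough check of whether independent runs settle on similar fitness levels, which is the question the batch mode program is meant to answer.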
[Figure 5 plot: best individual fitness (y-axis, 0 to 0.016) against tries (x-axis, 1 to 81). Plot title: “Variation of the best individual fitness with methods (A) and (D), 5000 individuals, 500 iterations, 90 tries”.]

Figure 5: Comparison of the fitness of the best individual for 90 tries.

6.6. Experiments

For the proof of concept, we selected a CPL set created with the methods (A) and (C) and compared its efficiency against the International Spelling Alphabet.

CPL       Spelling alphabet   CPL        Spelling alphabet
Zeejoy    Alpha               Failu      November
Highma    Bravo               Seree      Oscar
Wooper    Charlie             Farngoy    Papa
Gowka     Delta               Poingy     Quebec
Zappay    Echo                Wass-eye   Romeo
Neefa     Foxtrot             Kozer      Sierra
Yeffoy    Golf                Persha     Tango
Pooboy    Hotel               Loicher    Uniform
Ggeamay   India               Showshu    Victor
Jower     Juliet              Hoochay    Whiskey
Juicy     Kilo                Shoiby     X-ray
Mgu       Lima                Saanga     Yankee
Shuki     Mike                Norshoy    Zulu

Figure 6: CPL set and International Spelling alphabet.

We recorded six speakers reading all the words from the CPL vocabulary and the international spelling alphabet. Among the six speakers, five were British English speakers, and one was a native French speaker. Among the British English speakers, one was a female speaker. The waveforms were passed to the ABBOT speech recognizer for analysis.

Speakers         Spelling Alphabet   CPL set
Male Speaker 1   2                   2
Female Speaker   1                   3
MS2              2                   0
MS3              1                   1
MS4              4                   1
MS5              3                   0
Total errors     13                  7

Figure 7: Recognition results

These early results show that the CPL set performs better than the English equivalent, with the number of errors reduced by roughly half (7 against 13). Nevertheless, although these early results sound promising, further experiments are required to prove the efficiency of CPL over English.

7. CPL Applications

7.1. Overview.

CPL can be used to control and interact with simple speech enabled devices. With a phone for instance, the user could say funny sounding sentences to send SMS messages to their friends (like “juicy failu”). We thought of speech-enabled toys, like virtual dogs that could walk, bark, execute tricks, move and communicate with their masters using CPL.

We strongly believe that there is a place for CPL in the children and teenagers market, where the users are actually motivated to learn new ways of communicating to enjoy fun experiences.
7.2. The CPL phone

Figure 8: Graphical User Interface of the CPL Phone

As a proof of concept, we devised a software simulation of a simple hands-free phone using the Microsoft synthesizer and recognizer. This phone allows the user to perform actions such as answering or ending a call, dialing a number or sending predefined SMS messages to recipients stored in a directory. In total, a vocabulary of 24 words was used to control all the functionalities.

The CPL phone demo provides an English and a CPL version (this set of commands is a cut down version of the CPL set). The authors of the demo made an arbitrary mapping between the CPL words and the corresponding commands. The user is free to switch between the two languages at any time and can access help in the course of the dialogue.

The user can also take part in an interactive tutorial to learn the language. This tutorial involves two animated cartoon characters called Mrs. and Mr. Mike, who teach the user the CPL words by saying them (via Text to Speech) and displaying them on a screen.

8. Further development

As it stands today, the CPL concept is still at an early stage. The word generation methods put in place promise interesting results. However, further experiments are required to demonstrate the whole potential of CPL.

8.1. Further experiments

First of all, we need to put in place more tests with users in order to be more confident about how much better CPL is than English in terms of recognition accuracy.

By setting up user studies, we could collect answers to questions such as “How easy is it to use and learn CPL?”, or “Would you like to use it in your everyday life?” and tackle the problem of usability of such artificial languages.

Nevertheless, if a demo such as the CPL phone proves to give good results and to be a fun and novel experience, we are pretty confident that the user would be motivated to learn and use the language.

8.2. New research areas

During the development of the proof of concept, we came across interesting research problems.

8.2.1. Cool words

We noticed while running the word generation experiment that some words produced by the GA sounded quite familiar, “cool” or funny. Words such as “juicy”, “wooper” or “wass eye” fall in this category, and we actually found out that they are quite easy to pronounce and remember. On the other hand, the CPL vocabulary contains words like “farngoy” or “poingy” which are less attractive to the ear and therefore more difficult to use.

The idea of “cool sounding” words is quite difficult to grasp and to formalize as criteria that can be used by a computer program. However, what we propose is for the user to set a list of favorite words he likes to listen to (like “hey man”, “what’s up”, “see ya”). The fitness function of the GA can be modified to create words that sound similar to those from the favorite list (for example using phone correlation techniques [19]).

8.2.2. Spoken scribble English

Our approach to CPL was to create a completely new language that people would have to learn and practice.

We are aware that many people would not have the motivation or the time required to learn yet another language. We could take a similar approach to scribble as it is used on a PDA. Scribble does not force the user to learn a new written language. It just encourages the user to modify the way characters are written so they can be recognized more easily. We could use a similar approach for CPL by creating a simple set of rules which, when applied, produce slight variations to the English language to make it easier to understand.
9. Acknowledgements

We would like to thank the few people who encouraged us in pursuing this outrageous line of enquiry including John Manley and Steve Wright. Roger Tucker played a major role, giving us advice and assistance in the experimental results. Michael Mc Ternan worked with us producing the CPL phone demo. We would also like to thank the members of the Voice Web team Marianne Hickey, Paul Brittan, and Lawrence Wilcock who created a supportive community where ideas could be born.

10. References

[1] Polifroni J. and Seneff S., "GALAXY-II as an Architecture for Spoken Dialogue Evaluation", Proc. 2nd Int. Conf. on Language Resources and Evaluation (LREC), Athens, Greece, May 31-June 2, 2000.
[2] Zue V. et al., "JUPITER: A Telephone-Based Conversational Interface for Weather Information", IEEE Trans. on Speech and Audio Proc., V.8, N.1, Jan 2000.
[3] Taylor P.A., Black A. and Caley R., "The Architecture of the Festival Speech Synthesis System", Third ESCA Workshop in Speech Synthesis, 1998, pp. 147-151.
[4] Hickey M. and Brittan P., "Lessons from the Development of a Conversational Interface", TR, HP Labs Bristol.
[5] The Artificial Language Construction Kit Web Chapter, http://zompist.com/kit.html
[6] Jones C. B., "Science Fiction and Society: Artificial Languages".
[7] http://www.i5ive.com/linkcategory.cfm/6513/10597
[8] Ardalambion: Of the Tongues of Arda, the invented world of J.R.R. Tolkien, http://www.uib.no/people/hnohf/
[9] "Esperanto and Science Fiction: Jules Verne", article, October 29, 1999, http://www.i5ive.com/article.cfm/1146/27066
[10] Dilingo 2000, an artificial language of no particular use to anyone, http://www.dilingo.com
[11] VoiceWeb, Nuance overview, http://www.nuance.com/partners/voiceweb.html
[12] Ruhlen M., "Where do Languages come from?", http://www.exploratorium.edu/exploring/language/index.html
[13] Constructed grammar FAQ, http://personalweb.sierra.net/~spynx/FAQ/index.html
[14] Constructed Human Languages, http://www.quetzal.com/conlang.html
[15] Koza J. R., "Genetic Programming", Bradford Books, Dec. 1992.
[16] Michalewicz Z., "Genetic Algorithms", Springer-Verlag, 1992.
[17] Hull R., Reynolds D. and Gupta D., "Scribble Matching", HP Labs external technical report HPL-94-61, July 14, 1994.
[18] Arpabet, http://www.billnet.org/phon/arpabet.php
[19] Cox S. and Dasmahapatra S., "A high level approach to confidence estimation in speech recognition", University of East Anglia.
[20] Microsoft Speech API, http://www.microsoft.com/speech/