Spoken Arabic Dialect Identification Using Phonotactic Modeling

Fadi Biadsy and Julia Hirschberg
Department of Computer Science
Columbia University, New York, USA
{fadi,julia}@cs.columbia.edu

Nizar Habash
Center for Computational Learning Systems
Columbia University, New York, USA
habash@ccls.columbia.edu
Abstract

The Arabic language is a collection of multiple variants, among which Modern Standard Arabic (MSA) has a special status as the formal written standard language of the media, culture and education across the Arab world. The other variants are informal spoken dialects that are the media of communication for daily life. Arabic dialects differ substantially from MSA and each other in terms of phonology, morphology, lexical choice and syntax. In this paper, we describe a system that automatically identifies the Arabic dialect (Gulf, Iraqi, Levantine, Egyptian and MSA) of a speaker given a sample of his/her speech. The phonotactic approach we use proves to be effective in identifying these dialects with considerable overall accuracy: 81.60% using 30s test utterances.

1 Introduction

For the past three decades, there has been a great deal of work on the automatic identification (ID) of languages from the speech signal alone. Recently, accent and dialect identification have begun to receive attention from the speech science and technology communities. The task of dialect identification is the recognition of a speaker’s regional dialect, within a predetermined language, given a sample of his/her speech. The dialect-identification problem has been viewed as more challenging than that of language ID due to the greater similarity between dialects of the same language. Our goal in this paper is to analyze the effectiveness of a phonotactic approach, i.e., one making use primarily of the rules that govern phonemes and their sequences in a language (a technique which has often been employed by the language ID community), for the identification of Arabic dialects.

The Arabic language has multiple variants, including Modern Standard Arabic (MSA), the formal written standard language of the media, culture and education, and the informal spoken dialects that are the preferred method of communication in daily life. While there are commercially available Automatic Speech Recognition (ASR) systems for recognizing MSA with low error rates (typically trained on Broadcast News), these recognizers fail when a native Arabic speaker speaks in his/her regional dialect. Even in news broadcasts, speakers often code switch between MSA and dialect, especially in conversational speech, such as that found in interviews and talk shows. Being able to identify dialect vs. MSA as well as to identify which dialect is spoken during the recognition process will enable ASR engines to adapt their acoustic, pronunciation, morphological, and language models appropriately and thus improve recognition accuracy.

Identifying the regional dialect of a speaker will also provide important benefits for speech technology beyond improving speech recognition. It will allow us to infer the speaker’s regional origin and ethnicity and to adapt features used in speaker identification to regional origin. It should also prove useful in adapting the output of text-to-speech synthesis to produce regional speech as well as MSA, which is important for the development of spoken dialogue systems.

In Section 2, we describe related work. In Section 3, we discuss some linguistic aspects of Arabic dialects which are important to dialect identification. In Section 4, we describe the Arabic dialect corpora employed in our experiments. In Section 5, we explain our approach to the identification of Arabic dialects. We present our experimental results in Section 6. Finally, we conclude in Section 7 and identify directions for future research.

2 Related Work

A variety of cues by which humans and machines distinguish one language from another have been explored in previous research on language identification. Examples of such cues include phone inventory and phonotactics, prosody, lexicon, morphology, and syntax. Some of the most successful approaches to language ID have made use of phonotactic variation. For example, the Phone Recognition followed by Language Modeling (PRLM) approach uses phonotactic information to identify languages from the acoustic signal alone (Zissman, 1996). In this approach, a phone recognizer (not necessarily trained on a related language) is used to tokenize training data for each language to be classified. Phonotactic language models generated from this tokenized training speech are used during testing to compute language ID likelihoods for unknown utterances.

Similar cues have successfully been used for the identification of regional dialects. Zissman et al. (1996) show that the PRLM approach yields good results classifying Cuban and Peruvian dialects of Spanish, using an English phone recognizer trained on TIMIT (Garofolo et al., 1993). The recognition accuracy of this system on these two dialects is 84%, using up to 3 minutes of test utterances. Torres-Carrasquillo et al. (2004) developed an alternate system that identifies these two Spanish dialects using Gaussian Mixture Models (GMM) with shifted-delta-cepstral features. This system performs less accurately (accuracy of 70%) than that of Zissman et al. (1996). Alorfi (2008) uses an ergodic HMM to model phonetic differences between two Arabic dialects (Gulf and Egyptian Arabic) employing standard MFCC (Mel Frequency Cepstral Coefficients) and delta features. With the best parameter settings, this system achieves high accuracy of 96.67% on these two dialects. Ma et al. (2006) use multi-dimensional pitch flux features and MFCC features to distinguish three Chinese dialects. In this system the pitch flux features reduce the error rate by more than 30% when added to a GMM based MFCC system. Given 15s of test utterances, the system achieves an accuracy of 90% on the three dialects.

Intonational cues have been shown to be good indicators to human subjects identifying regional dialects. Peters et al. (2002) show that human subjects rely on intonational cues to identify two German dialects (Hamburg urban dialects vs. Northern Standard German). Similarly, Barakat et al. (1999) show that subjects distinguish between Western vs. Eastern Arabic dialects significantly above chance based on intonation alone.

Hamdi et al. (2004) show that rhythmic differences exist between Western and Eastern Arabic. The analysis of these differences is done by comparing percentages of vocalic intervals (%V) and the standard deviation of intervocalic intervals (ΔC) across the two groups. These features have been shown to capture the complexity of the syllabic structure of a language/dialect in addition to the existence of vowel reduction. The complexity of syllabic structure of a language/dialect and the existence of vowel reduction in a language are good correlates with the rhythmic structure of the language/dialect, hence the importance of such a cue for language/dialect identification (Ramus, 2002).

As far as we could determine, there is no previous work that analyzes the effectiveness of a phonotactic approach, particularly the parallel PRLM, for identifying Arabic dialects. In this paper, we build a system based on this approach and evaluate its performance on five Arabic dialects (four regional dialects and MSA). In addition, we experiment with six phone recognizers trained on six languages as well as three MSA phone recognizers and analyze their contribution to this classification task. Moreover, we make use of a discriminative classifier that takes all the perplexities of the language models on the phone sequences and outputs the hypothesized dialect. This classifier turns out to be an important component, although it has not been a standard component in previous work.

3 Linguistic Aspects of Arabic Dialects

3.1 Arabic and its Dialects

MSA is the official language of the Arab world. It is the primary language of the media and culture. MSA is syntactically, morphologically and phonologically based on Classical Arabic, the language of the Qur’an (Islam’s Holy Book). Lexically, however, it is much more modern. It is not a native language of any Arabs but is the language of education across the Arab world. MSA is primarily written, not spoken.

The Arabic dialects, in contrast, are the true native language forms. They are generally restricted in use to informal daily communication. They are not taught in schools or even standardized, although there is a rich popular dialect culture of folktales, songs, movies, and TV shows. Dialects are primarily spoken, not written. However, this is changing as more Arabs gain access to electronic media such as emails and newsgroups. Arabic dialects are loosely related to Classical Arabic. They are the result of the interaction between different ancient dialects of Classical Arabic and other languages that existed in, neighbored and/or colonized what is today the Arab world. For example, Algerian Arabic has many influences from Berber as well as French.

Arabic dialects vary on many dimensions, primarily geography and social class. Geolinguistically, the Arab world can be divided in many different ways. The following is only one of many such divisions, covering the main Arabic dialects:

• Gulf Arabic (GLF) includes the dialects of Kuwait, Saudi Arabia, Bahrain, Qatar, United Arab Emirates, and Oman.
• Iraqi Arabic (IRQ) is the dialect of Iraq. In some dialect classifications, Iraqi Arabic is considered a sub-dialect of Gulf Arabic.
• Levantine Arabic (LEV) includes the dialects of Lebanon, Syria, Jordan, Palestine and Israel.
• Egyptian Arabic (EGY) covers the dialects of the Nile valley: Egypt and Sudan.
• Maghrebi Arabic covers the dialects of Morocco, Algeria, Tunisia and Mauritania. Libya is sometimes included.

Yemenite Arabic is often considered its own class. Maltese Arabic is not always considered an Arabic dialect. It is the only Arabic variant that is considered a separate language and is written with Latin script.

Socially, it is common to distinguish three sub-dialects within each dialect region: city dwellers, peasants/farmers and Bedouins. The three degrees are often associated with a class hierarchy from rich, settled city-dwellers down to Bedouins. Different social associations exist as is common in many other languages around the world.

The relationship between MSA and the dialect in a specific region is complex. Arabs do not think of these two as separate languages. This particular perception leads to a special kind of coexistence between the two forms of language that serve different purposes. This kind of situation is what linguists term diglossia. Although the two variants have clear domains of prevalence, formal written (MSA) versus informal spoken (dialect), there is a large gray area in between and it is often filled with a mixing of the two forms.

In this paper, we focus on classifying the dialect of audio recordings into one of five varieties: MSA, GLF, IRQ, LEV, and EGY. We do not address other dialects or diglossia.

3.2 Phonological Variations among Arabic Dialects

Although Arabic dialects and MSA vary on many different levels (phonology, orthography, morphology, lexical choice and syntax), we will focus on phonological differences in this paper.¹ MSA’s phonological profile includes 28 consonants, three short vowels, three long vowels and two diphthongs (/ay/ and /aw/). Arabic dialects vary phonologically from standard Arabic and each other. Some of the common variations include the following (Holes, 2004; Habash, 2006):

¹ It is important to point out that since Arabic dialects are not standardized, their orthography may not always be consistent. However, this is not a relevant point to this paper since we are interested in dialect identification using audio recordings and without using the dialectal transcripts at all.

The MSA consonant (/q/) is realized as a glottal stop /’/ in EGY and LEV and as /g/ in GLF and IRQ. For example, the MSA word /ṭari:q/ ‘road’ appears as /ṭari:’/ (EGY and LEV) and /ṭari:g/ (GLF and IRQ). Other variants are also found in sub-dialects, such as /k/ in rural Palestinian (LEV) and /dj/ in some GLF dialects. These changes do not apply to modern and religious borrowings from MSA. For instance, the word for ‘Qur’an’ is never pronounced as anything but /qur’a:n/.

The MSA alveolar affricate (/dj/) is realized as /g/ in EGY, as /j/ in LEV and as /y/ in GLF. IRQ preserves the MSA pronunciation. For example, the word for ‘handsome’ is /djami:l/ (MSA, IRQ), /gami:l/ (EGY), /jami:l/ (LEV) and /yami:l/ (GLF).

The MSA consonant (/k/) is generally realized as /k/ in Arabic dialects with the exception of GLF, IRQ and the Palestinian rural sub-dialect of LEV, which allow a /č/ pronunciation in certain contexts. For example, the word for ‘fish’ is /samak/ in MSA, EGY and most of LEV but /simač/ in IRQ and GLF.

The MSA consonant /θ/ is pronounced as /t/ in LEV and EGY (or /s/ in more recent borrowings from MSA), e.g., the MSA word /θala:θa/ ‘three’ is pronounced /tala:ta/ in EGY and /tla:te/ in LEV. IRQ and GLF generally preserve the MSA pronunciation.
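The consonant correspondences described so far lend themselves to a small lookup table. The following Python sketch is illustrative only: the ASCII phone labels (`'` for the glottal stop, `th` for /θ/, `T` for the emphatic t), the `REALIZATIONS` table and both helper functions are assumptions made here and are not part of the system described in this paper. Only the general realization of /q/, /dj/ and /θ/ per dialect is recorded; the context-dependent /k/ to /č/ alternation, sub-dialect variants, borrowings from MSA and vowel changes are left out.

```python
# Typical dialect realizations of three MSA consonants, as listed in the text.
# ASCII labels are placeholders: "'" = glottal stop, "th" = /θ/.
REALIZATIONS = {
    "q":  {"EGY": "'", "LEV": "'", "GLF": "g", "IRQ": "g"},
    "dj": {"EGY": "g", "LEV": "j", "GLF": "y", "IRQ": "dj"},
    "th": {"EGY": "t", "LEV": "t", "GLF": "th", "IRQ": "th"},
}
DIALECTS = ("EGY", "GLF", "IRQ", "LEV")


def realize(msa_phones, dialect):
    """Replace each MSA phone by its typical realization in `dialect`."""
    return [REALIZATIONS.get(p, {}).get(dialect, p) for p in msa_phones]


def dialects_matching(msa_phones, observed_phones):
    """Dialects whose typical realization equals the observed phone list."""
    return [d for d in DIALECTS if realize(msa_phones, d) == observed_phones]


jamil = ["dj", "a", "m", "i:", "l"]  # MSA /djami:l/ 'handsome'
tariq = ["T", "a", "r", "i:", "q"]   # MSA 'road'; "T" stands for the emphatic t
print(realize(jamil, "EGY"))
print(dialects_matching(tariq, ["T", "a", "r", "i:", "g"]))
```

Under these assumptions, the /g/ reflex of /q/ narrows the candidates to GLF and IRQ, which mirrors why a single phone is rarely enough to separate all dialects and why phone sequences are modeled instead.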

The MSA consonant /δ/ is pronounced as /d/ in LEV and EGY (or /z/ in more recent borrowings from MSA), e.g., the word for ‘this’ is pronounced /ha:δa/ in MSA versus /ha:da/ (LEV) and /da/ (EGY). IRQ and GLF generally preserve the MSA pronunciation.

The MSA consonants /ḍ/ (emphatic/velarized d) and /δ̣/ (emphatic /δ/) are both normalized to /ḍ/ in EGY and LEV and to /δ̣/ in GLF and IRQ. For example, the MSA sentence /δ̣alla yaḍrubu/ ‘he continued to hit’ is pronounced /ḍall yuḍrub/ (LEV) and /δ̣all yuδ̣rub/ (GLF). In modern borrowings from MSA, /δ̣/ is pronounced as /ẓ/ (emphatic z) in EGY and LEV. For instance, the word for ‘police officer’ is /δ̣a:biṭ/ in MSA but /ẓa:biṭ/ in EGY and LEV.

In some dialects, a loss of the emphatic feature of some MSA consonants occurs, e.g., the MSA word /laṭi:f/ ‘pleasant’ is pronounced as /lati:f/ in the Lebanese city sub-dialect of LEV. Emphasis typically spreads to neighboring vowels: if a vowel is preceded or succeeded directly by an emphatic consonant (/ḍ/, /ṣ/, /ṭ/, /δ̣/) then the vowel becomes an emphatic vowel. As a result, the loss of the emphatic feature does not affect the consonants only, but also their neighboring vowels.

Other vocalic differences among MSA and the dialects include the following: First, short vowels change or are completely dropped, e.g., the MSA word /yaktubu/ ‘he writes’ is pronounced /yiktib/ (EGY and IRQ) or /yoktob/ (LEV). Second, final and unstressed long vowels are shortened, e.g., the word /maṭa:ra:t/ ‘airports’ in MSA becomes /maṭara:t/ in many dialects. Third, the MSA diphthongs /aw/ and /ay/ have mostly become /o:/ and /e:/, respectively. These vocalic changes, particularly vowel drop, lead to different syllabic structures. MSA syllables are primarily light (CV, CV:, CVC) but can also be (CV:C and CVCC) in utterance-final positions. EGY syllables are the same as MSA’s although without the utterance-final restriction. LEV, IRQ and GLF allow heavier syllables including word initial clusters such as CCV:C and CCVCC.

4 Corpora

When training a system intended to classify languages or dialects, it is of course important to use training and testing corpora recorded under similar acoustic conditions. We are able to obtain corpora from the Linguistic Data Consortium (LDC) with similar recording conditions for four Arabic dialects: Gulf Arabic, Iraqi Arabic, Egyptian Arabic, and Levantine Arabic. These are corpora of spontaneous telephone conversations produced by native speakers of the dialects, speaking with family members, friends, and unrelated individuals, sometimes about predetermined topics. Although the data have been annotated phonetically and/or orthographically by LDC, in this paper, we do not make use of any of these annotations.

We use the speech files of 965 speakers (about 41.02 hours of speech) from the Gulf Arabic Conversational Telephone Speech database for our Gulf Arabic data (Appen Pty Ltd, 2006a).² From these speakers we hold out 150 speakers for testing (about 6.06 hours of speech).³ We use the Iraqi Arabic Conversational Telephone Speech database (Appen Pty Ltd, 2006b) for the Iraqi dialect, selecting 475 Iraqi Arabic speakers with a total duration of about 25.73 hours of speech. From these speakers we hold out 150 speakers⁴ for testing (about 7.33 hours of speech). Our Levantine data consists of 1258 speakers from the Arabic CTS Levantine Fisher Training Data Set 1-3 (Maamouri, 2006). This set contains about 78.79 hours of speech in total. We hold out 150 speakers for testing (about 10 hours of speech) from Set 1.⁵ For our Egyptian data, we use CallHome Egyptian and its Supplement (Canavan et al., 1997) and CallFriend Egyptian (Canavan and Zipperlen, 1996). We use 398 speakers from these corpora (75.7 hours of speech), holding out 150 speakers for testing (about 28.7 hours of speech).⁶

² We excluded very short speech files from the corpora.
³ The 24 speakers in the devtest folder and the last 63 files, after sorting by file name, in the train2c folder (126 speakers). The sorting is done to make our experiments reproducible by other researchers.
⁴ Similar to the Gulf corpus, the 24 speakers in the devtest folder and the last 63 files (after sorting by file name) in the train2c folder (126 speakers).
⁵ We use the last 75 files in Set 1, after sorting by name.
⁶ The test speakers were from the evaltest and devtest folders in CallHome and CallFriend.

Unfortunately, as far as we can determine, there is no data with similar recording conditions for MSA. Therefore, we obtain our MSA training data from TDT4 Arabic broadcast news. We use about 47.6 hours of speech. The acoustic signal was processed using forced-alignment with the transcript to remove non-speech data, such as music. For testing we again use 150 speakers, this time identified automatically from the GALE Year 2 Distillation evaluation corpus (about 12.06 hours of speech). Non-speech data (e.g., music) in the test
corpus was removed manually. It should be noted that the data includes read speech by anchors and reporters as well as spontaneous speech spoken in interviews in studios and through the phone.

5 Our Dialect ID Approach

Since, as described in Section 3, Arabic dialects differ in many respects, such as phonology, lexicon, and morphology, it is highly likely that they differ in terms of phone-sequence distribution and phonotactic constraints. Thus, we adopt the phonotactic approach to distinguishing among Arabic dialects.

5.1 PRLM for dialect ID

As mentioned in Section 2, the PRLM approach to language identification (Zissman, 1996) has had considerable success. Recall that, in the PRLM approach, the phones of the training utterances of a dialect are first identified using a single phone recognizer.⁷ Then an n-gram language model is trained on the resulting phone sequences for this dialect. This process results in an n-gram language model for each dialect to model the dialect distribution of phone sequence occurrences. During recognition, given a test speech segment, we run the phone recognizer to obtain the phone sequence for this segment and then compute the perplexity of each dialect n-gram model on the sequence. The dialect with the n-gram model that minimizes the perplexity is hypothesized to be the dialect from which the segment comes.

⁷ The phone recognizer is typically trained on one of the languages being identified. Nonetheless, a phone recognizer trained on any language might be a good approximation, since languages typically share many phones in their phonetic inventory.

Parallel PRLM is an extension to the PRLM approach, in which multiple (k) parallel phone recognizers, each trained on a different language, are used instead of a single phone recognizer (Zissman, 1996). For training, we run all phone recognizers in parallel on the set of training utterances of each dialect. An n-gram model on the outputs of each phone recognizer is trained for each dialect. Thus if we have m dialects, k × m n-gram models are trained. During testing, given a test utterance, we run all phone recognizers on this utterance and compute the perplexity of each n-gram model on the corresponding output phone sequence. Finally, the perplexities are fed to a combiner to determine the hypothesized dialect. In our implementation, we employ a logistic regression classifier as our back-end combiner. We have experimented with different classifiers such as SVM and neural networks, but the logistic regression classifier was superior. The system is illustrated in Figure 1.

We hypothesize that using multiple phone recognizers as opposed to only one allows the system to capture subtle phonetic differences that might be crucial to distinguish dialects. Particularly, since the phone recognizers are trained on different languages, they may be able to model different vocalic and consonantal systems, hence a different phonetic inventory. For example, an MSA phone recognizer typically does not model the phoneme /g/; however, an English phone recognizer does. As described in Section 3, this phoneme is an important cue to distinguishing Egyptian Arabic from other Arabic dialects. Moreover, phone recognizers are prone to many errors; relying upon multiple phone streams rather than one may lead to a more robust model overall.

5.2 Phone Recognizers

In our experiments, we have used phone recognizers for English, German, Japanese, Hindi, Mandarin, and Spanish, from a toolkit developed by Brno University of Technology.⁸ These phone recognizers were trained on the OGI multilanguage database (Muthusamy et al., 1992) using a hybrid approach based on Neural Networks and Viterbi decoding without language models (open-loop) (Matejka et al., 2005).

⁸ www.fit.vutbr.cz/research/groups/speech/sw/phnrec

Since Arabic dialect identification is our goal, we hypothesize that an Arabic phone recognizer would also be useful, particularly since other phone recognizers do not cover all Arabic consonants, such as pharyngeals and emphatic alveolars. Therefore, we have built our own MSA phone recognizer using the HMM toolkit (HTK) (Young et al., 2006). The monophone acoustic models are built using 3-state continuous HMMs without state-skipping, with a mixture of 12 Gaussians per state. We extract standard Mel Frequency Cepstral Coefficients (MFCC) features from 25 ms frames, with a frame shift of 10 ms. Each feature vector is 39D: 13 features (12 cepstral features plus energy), 13 deltas, and 13 double-deltas. The features are normalized using cepstral mean normalization. We use the Broadcast News TDT4 corpus (Arabic Set 1; 47.61 hours of speech; downsampled to 8 kHz) to train our acoustic models.
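The frame and feature arithmetic above can be made concrete with a short sketch. It assumes the 13 static coefficients per frame are already available and only shows how a 39-dimensional vector could be assembled. The two-point difference used for the deltas, the repetition of edge frames, and the choice to apply mean normalization to the static coefficients are simplifying assumptions, not the HTK configuration used in the paper, and all function names are hypothetical.

```python
def frame_count(num_samples, sample_rate=8000, frame_ms=25, shift_ms=10):
    """Number of full analysis frames in a signal of `num_samples` samples."""
    frame_len = sample_rate * frame_ms // 1000  # 200 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000      # 80 samples at 8 kHz
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // shift


def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[i] for f in frames) / n for i in range(dim)]
    return [[f[i] - means[i] for i in range(dim)] for f in frames]


def deltas(frames):
    """Central difference over neighbouring frames; edge frames are repeated."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def build_features(static):
    """static: list of 13-dim frames (12 cepstra + energy) -> 39-dim frames."""
    static = cepstral_mean_normalize(static)
    d1 = deltas(static)   # 13 deltas
    d2 = deltas(d1)       # 13 double-deltas
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Three artificial frames whose 13 coefficients are all 0, 2 and 4.
toy = [[0.0] * 13, [2.0] * 13, [4.0] * 13]
features = build_features(toy)
print(len(features[0]), frame_count(8000))
```

With a 10 ms shift, one second of 8 kHz audio yields just under 100 frames, so a 30s test utterance corresponds to roughly 3000 feature vectors before phone decoding.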

Figure 1: Parallel Phone Recognition Followed by Language Modeling (PRLM) for Arabic Dialect Identification.
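The data flow in Figure 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the paper trains tri-gram models with SRILM, whereas the sketch uses an add-one smoothed bigram model to stay short; the recognizer names `rec_a` and `rec_b` and all phone strings are invented; and the back-end classifier that would consume the perplexity vector is not shown.

```python
import math
from collections import Counter


class BigramLM:
    """Add-one smoothed bigram model over phone symbols (a simplification)."""

    def __init__(self, sequences):
        self.bigrams = Counter()
        self.context = Counter()
        self.vocab = {"</s>"}
        for seq in sequences:
            toks = ["<s>"] + list(seq) + ["</s>"]
            self.vocab.update(seq)
            for a, b in zip(toks, toks[1:]):
                self.bigrams[(a, b)] += 1
                self.context[a] += 1

    def perplexity(self, seq):
        toks = ["<s>"] + list(seq) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unseen phones
        log_prob = 0.0
        for a, b in zip(toks, toks[1:]):
            p = (self.bigrams[(a, b)] + 1) / (self.context[a] + v)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(toks) - 1))


def train_models(training):
    """training[recognizer][dialect] -> phone sequences; returns k x m models."""
    return {r: {d: BigramLM(seqs) for d, seqs in by_dialect.items()}
            for r, by_dialect in training.items()}


def perplexity_features(models, decoded):
    """decoded[recognizer] -> phone sequence of one test utterance."""
    return {(r, d): lm.perplexity(decoded[r])
            for r, by_dialect in models.items()
            for d, lm in by_dialect.items()}


def prlm_decision(models, recognizer, phones):
    """Single-recognizer PRLM: the dialect whose model has minimum perplexity."""
    scores = {d: lm.perplexity(phones) for d, lm in models[recognizer].items()}
    return min(scores, key=scores.get)


# Invented phone strings standing in for the output of two phone recognizers.
training = {
    "rec_a": {"EGY": [list("gamil"), list("gaga")],
              "GLF": [list("yamil"), list("yaya")]},
    "rec_b": {"EGY": [list("GA"), list("GAG")],
              "GLF": [list("YA"), list("YAY")]},
}
models = train_models(training)
decoded = {"rec_a": list("gama"), "rec_b": list("GA")}
features = perplexity_features(models, decoded)  # k x m = 2 x 2 perplexities
print(prlm_decision(models, "rec_a", decoded["rec_a"]))
```

In the parallel setting the k × m perplexities in `features` are not reduced by a per-recognizer minimum; they form the input vector of the combiner, which lets the classifier learn how much weight each recognizer deserves.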
The pronunciation dictionary is generated as described in Biadsy et al. (2009). Using these settings we build three MSA phone recognizers: (1) an open-loop phone recognizer which does not distinguish emphatic vowels from non-emphatic (ArbO), (2) an open-loop with emphatic vowels (ArbOE), and (3) a phone recognizer with emphatic vowels and with a bi-gram phone language model (ArbLME). We add a new pronunciation rule to the set of rules described in Biadsy et al. (2009) to distinguish emphatic vowels from non-emphatic ones (see Section 3) when generating our pronunciation dictionary for training the acoustic models for the phone recognizers. In total we build 9 (Arabic and non-Arabic) phone recognizers.

6 Experiments and Results

In this section, we evaluate the effectiveness of the parallel PRLM approach on distinguishing Arabic dialects. We first run the nine phone recognizers described in Section 5 on the training data described in Section 4, for each dialect. This process produces nine sets of phone sequences for each dialect. In our implementation, we train a tri-gram language model on each phone set using the SRILM toolkit (Stolcke, 2002). Thus, in total, we have 9 × (number of dialects) tri-grams.

In all our experiments, the 150 test speakers of each dialect are first decoded using the phone recognizers. Then the perplexities of the corresponding tri-gram models on these sequences are computed, and are given to the logistic regression classifier. Instead of splitting our held-out data into test and training sets, we report our results with 10-fold cross validation.

We have conducted three experiments to evaluate our system. The first is to compare the performance of our system to Alorfi’s (2008) on the same two dialects (Gulf and Egyptian Arabic). The second is to attempt to classify four colloquial Arabic dialects. In the third experiment, we include MSA as well in a five-way classification task.

6.1 Gulf vs. Egyptian Dialect ID

To our knowledge, Alorfi’s (2008) work is the only work dealing with the automatic identification of Arabic dialects. In this work, an Ergodic HMM is used to model phonetic differences between Gulf and Egyptian Arabic using MFCC and delta features. The test and training data used in this work was collected from TV soap operas containing both the Egyptian and Gulf dialects and from twenty speakers from the CallHome Egyptian database. The best accuracy reported by Alorfi (2008) on identifying the dialect of 40 utterances of duration of 30 seconds each of 40 male speakers (20 Egyptians and 20 Gulf speakers) is 96.67%.

Since we do not have access to the test collection used in Alorfi (2008), we test a version of our system which identifies these two dialects only on our 150 Gulf and 150 Egyptian speakers, as described in Section 4. Our best result is 97.00% (Egyptian and Gulf F-Measure = 0.97) when using only the features from the ArbOE, English, Japanese, and Mandarin phone recognizers. While our accuracy might not be significantly higher than that of Alorfi’s, we note a few advantages of our experiments. First, the test sets of both dialects are from telephone conversations, with the same recording conditions, as opposed to a mix of different genres. Second, in our system we test 300 speakers as opposed to 40, so our results may be more reliable. Third, our test data includes female
4 dialects
seconds
accuracy
Gulf
Iraqi
Levantine
Egyptian
5
60.833
49.2
52.7
58.1
83
15
72.83
60.8
61.2
77.6
91.9
30
78.5
68.7
67.3
84
94
45
81.5
72.6
72.4
86.9
93.7
60
83.33
75.1
75.7
87.9
94.6
120
84
75.1
75.4
89.5
96
)$$#
Dur.
Acc. (%)
Phone Recognizers
("#
5s
60.83
ArbOE+ArbLME+G+H+M+S
($#
15s
72.83
ArbOE+ArbLME+G+H+M
'"#
30s
78.50
ArbO+H+S
45s
81.5
ArbE+ArbLME+H+G+S
'$#
60s
83.33
ArbOE+ArbLME+E+G+H+M
&"#
D#
120s
84.00
ArbOE+ArbLME+G+M
&$#
%"#
Table 1: Accuracy of the four-way classification (four col-
%$#
,--./0-1#
loquial Arabic dialects) and the best combination of phone
2.34#567809./8#
""#
recognizers used per test-utterances duration; The phone
:/0;<#567809./8#
"$#
recognizers used are:
E=English, G=German, H=Hindi,
=8>0?@?8#567809./8#
M=Mandarin, S=Spanish, ArbO=open-loop MSA without
!"#
AB1C@0?#567809./8#
"#
)"#
*$#
!"#
%$#
)+$#
emphatic vowels, ArbOE=open-loop MSA with emphatic
E89F6.G8/0?-8#H./0<I?#<?#98-I?H9#
vowels, ArbLME=MSA with emphatic vowels and bi-gram
phone LM
Figure 2: The accuracies and F-Measures of the four-way
classification task with different test-utterance durations
speakers as well as male, so our results are more general.

6.2 Four Colloquial Arabic Dialect ID

In our second experiment, we test our system on four colloquial Arabic dialects (Gulf, Iraqi, Levantine, and Egyptian). As mentioned above, we use the phone recognizers to decode the training data to train the 9 tri-gram models per dialect (9x4=36 tri-gram models). We report our 10-fold cross validation results on the test data in Figure 2. To analyze how dependent our system is on the duration of the test utterance, we report the system accuracy and the F-measure of each class for different durations (5s – 2m). The longer the utterance, the better we expect the system to perform. We can observe from these results that regardless of the test-utterance duration, the best distinguished dialect among the four dialects is Egyptian (F-Measure of 94% with 30s test utterances), followed by Levantine (F-Measure of 84% with 30s), and the most confusable dialects, according to the classification confusion matrix, are Gulf and Iraqi Arabic (F-Measures of 68.7% and 67.3%, respectively, with 30s). This confusion is consistent with dialect classifications that consider Iraqi a sub-dialect of Gulf Arabic, as mentioned in Section 3.

We were also interested in testing which phone recognizers contribute the most to the classification task. We observe that employing a subset of the phone recognizers as opposed to all of them provides us with better results. Table 1 shows which phone recognizers are selected empirically for each test-utterance duration condition (footnote 9).

Footnote 9: Starting from all phone recognizers, we remove one recognizer at a time; if the cross-validation accuracy decreases, we add it back. We have experimented with automatic feature selection methods, but with the empirical (‘greedy’) selection we typically obtain higher accuracy.

We observe that the MSA phone recognizers are the most important phone recognizers for this task, usually when emphatic vowels are modeled. In all scenarios, removing all MSA phone recognizers leads to a significant drop in accuracy. German, Mandarin, Hindi, and Spanish typically contribute to the classification task, but the English and Japanese phone recognizers are less helpful. It is possible that the more useful recognizers are able to capture more of the distinctions among the Arabic dialects; however, it might also be that the overall quality of the recognizers varies.

6.3 Dialect ID with MSA

Considering MSA as a dialectal variant of Arabic, we are also interested in analyzing the performance of our system when including it in our classification task. In this experiment, we add MSA as the fifth dialect. We perform the same steps described above for training, using the MSA corpus described in Section 4. For testing, we also use our 150 hypothesized MSA speakers as our test set. Interestingly, in this five-way classification, we observe that the F-Measure for the MSA class in the cross-validation task is always above 98% regardless of the test-utterance duration, as shown in Figure 3.

It would seem that MSA is rarely confused with any of the colloquial dialects: it appears to have a distinct phonotactic distribution. This explanation is supported by linguists, who note that MSA differs from Arabic dialects in terms of its phonology, lexicon, syntax and morphology, which appears to lead to a profound impact on its phonotactic distribution. Similar to the four-way classification task,
Egyptian was the most easily distinguished dialect (F-Measure=90.2%, with 30s test utterances), followed by Levantine (79.4%), and then Iraqi and Gulf (71.7% and 68.3%, respectively). Due to the high MSA F-Measure, the five-way classifier can also be used as a binary classifier to distinguish MSA from colloquial Arabic (Gulf, Iraqi, Levantine, and Egyptian) reliably.

It should be noted that our classification results for MSA might be inflated for several reasons: (1) The MSA test data were collected from Broadcast News, which includes read (anchor and reporter) speech, as well as telephone speech (for interviews). (2) The identities of the test speakers in the MSA corpus were determined automatically, and so might not be as accurate.

As a result of the high identification rate of MSA, the overall accuracy in the five-way classification task is higher than that of the four-way classification. Table 2 presents the phone recognizers selected and the accuracy for each test-utterance duration. We observe here that the most important phone recognizers are those trained on MSA (ArbO, ArbOE, and/or ArbLME). Removing them completely leads to a significant drop in accuracy. In this classification task, we observe that all phone recognizers play a role in the classification task in some of the conditions.

4 dialects (accuracy and per-dialect F-Measure, in %, by test-utterance duration in seconds)

seconds  accuracy  Gulf  Iraqi  Levantine  Egyptian
5        68.6667   54.5  50.7   60         77.9
15       76.6667   57.3  62.6   73.8       90.7
30       81.6      68.3  71.7   79.4       90.2
45       84.8      69.9  73.6   86.2       94.9
60       86.933    76.8  76.5   85.4       96.3
120      87.86     79.1  77.4   90.1       93.6

Figure 3: The accuracies and F-Measures of the five-way classification task with different test-utterance durations
[Plot: x-axis, test-utterance duration in seconds; curves for accuracy and for the Gulf, Iraqi, Levantine, Egyptian, and MSA F-Measures]

Dur.   Acc. (%)   Phone Recognizers
5s     68.67      ArbO+ArbLME+H+M
15s    76.67      ArbLME+G+H+J+M
30s    81.60      ArbO+ArbOE+E+G+H+J+M+S
45s    84.80      ArbOE+ArbLME+E+G+H+J+M+S
60s    86.93      ArbOE+ArbLME+G+J+M+S
120s   87.86      ArbO+ArbLME+E+S

Table 2: Accuracy of the five-way classification (4 colloquial Arabic dialects + MSA) and the best combination of phone recognizers used per test-utterance duration. The phone recognizers used are: E=English, G=German, H=Hindi, J=Japanese, M=Mandarin, S=Spanish, ArbO=open-loop MSA without emphatic vowels, ArbOE=open-loop MSA with emphatic vowels, ArbLME=MSA with emphatic vowels and bi-gram phone LM

7 Conclusions and Future Work

In this paper, we have shown that four Arabic colloquial dialects (Gulf, Iraqi, Levantine, and Egyptian) plus MSA can be distinguished using a phonotactic approach with good accuracy. The parallel PRLM approach we employ thus appears to be effective not only for language identification but also for Arabic dialect ID.

We have found that the most distinguishable dialect among the five variants we consider here is MSA, independent of the duration of the test utterance (F-Measure is always above 98.00%). Egyptian Arabic is second (F-Measure of 90.2% with 30s test utterances), followed by Levantine (F-Measure of 79.4% with 30s test utterances). The most confusable dialects are Iraqi and Gulf (F-Measures of 71.7% and 68.3%, respectively, with 30s test utterances). This high degree of Iraqi-Gulf confusion is consistent with some classifications of Iraqi Arabic as a sub-dialect of Gulf Arabic. We have obtained a total accuracy of 81.60% in this five-way classification task when given 30s-duration utterances. We have also observed that the most useful phone streams for classification are those of our Arabic phone recognizers, typically those with emphatic vowels.

As mentioned above, the high F-measure for MSA may be due to the MSA corpora we have used, which differ in genre from the dialect corpora. Therefore, one focus of our future research will be to collect MSA data with similar recording conditions to the other dialects to validate our results. We are also interested in including prosodic features, such as intonational, durational, and rhythmic features, in our classification. A more long-term and general goal is to use our results to improve ASR for cases in which code-switching occurs between MSA and other dialects.

Acknowledgments

We thank Dan Ellis, Michael Mandel, and Andrew Rosenberg for useful discussions. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023 (approved for public release, distribution unlimited). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.

