Modeling Context and Language Variation for Non-Native Speech Recognition
Tien-Ping Tan, Laurent Besacier
LIG/GETALP Laboratory, UMR CNRS 5217
UJF - BP 53, 38041 Grenoble Cedex 9, France
Abstract

Non-native speakers often face difficulty in pronouncing like native speakers. This paper proposes to model the pronunciation variation in non-native speakers' speech using only acoustic models, without the need for a non-native speech corpus. Variation in terms of context and language is modeled. The combination of both modelings resulted in absolute WER reductions of as much as 16% and 6% for native Vietnamese and Chinese speakers of French, respectively.

Index Terms: non-native ASR, context modeling, language modeling, interpolation, merging

1. Introduction

Automatic speech recognition applications are becoming increasingly popular. However, even as automatic speech recognition has matured, recognition performance on non-native speakers remains low.

Non-native speech is characterized by a slower speaking rate, broader acoustic distributions compared to native speech, pronunciation mistakes, a smaller working vocabulary and disfluency. Non-native speakers face difficulty in articulating the target language phonemes. New phonemes which are not found in the speaker's mother tongue are a challenge, at least for beginners, to articulate. On the other hand, for similar phonemes which exist in both the target language and the speaker's native language, speakers may have trouble changing certain articulation habits which are specific to their mother tongue.

Current automatic speech recognition systems take advantage of context cues for accurate recognition. As a result, non-native speakers are often unable to profit from them; on the contrary, context modeling may even end up deteriorating their recognition. So, for non-native speakers, we may want to revisit the context dependent acoustic modeling concept, to reduce the sharpness of the context and to combine the features of both the speaker's native language (L1) and non-native speech (L2) to model these variations.

Getting non-native speech for acoustic modeling is often difficult and in some cases unfeasible. Therefore, research in acoustic model adaptation attempts to use limited non-native speech or the speaker's mother tongue to improve the target language acoustic models. The proposed method needs only the target language acoustic models and the speaker's native language acoustic model to model context and language features of non-native speakers.

This paper is organized as follows. In Section 2, we present related works in non-native acoustic modeling. In Section 3, we describe our approach for modeling context and language variations. Section 4 gives the experimental results and, finally, conclusions are drawn in Section 5.

2. Related Works

In general, three common approaches for non-native acoustic modeling are acoustic model reconstruction (re-training), acoustic model interpolation and acoustic model merging.

2.1. Acoustic Model Reconstruction

Through acoustic model training, a new model tailored to non-native speakers can be created. Current ASR systems which use sharp triphone context dependent (CD) information do not perform very well with non-natives. On the other hand, a context independent (CI) model, or a context dependent model with fewer and more appropriately shared states, may turn out to suit non-native speakers better. The non-native speaker's language cues can also be fused into the target language acoustic model by training it with some non-native speech or with the speaker's native language. Another interesting approach is to incorporate language variations through state tying and interpolation during acoustic model reconstruction.

2.2. Acoustic Model Interpolation

Instead of modeling the variation between two models through acoustic model reconstruction, acoustic model interpolation can also be used to produce the same effect. Acoustic model interpolation can be performed by estimating a priori weights to be multiplied with the two or more acoustic models involved. The nearest corresponding Gaussian from the source model (constructed from the speaker's native language (L1), or constructed from a small amount of non-native speech (L2)) is selected for interpolation for every Gaussian in the target state using a certain distance measure. Other variants can also be found.
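As an illustration, the sketch below interpolates the parameters of one target Gaussian with its selected source Gaussian under an a priori weight. This is a minimal sketch under our own assumptions (function name, diagonal-covariance arrays), not code from the works cited above.

```python
import numpy as np

def interpolate_gaussian(mu_tg, var_tg, mu_sc, var_sc, w):
    """Interpolate a target Gaussian with its nearest source Gaussian.
    w is the a priori weight of the target model (w = 1.0 keeps the
    target unchanged); means and diagonal variances are numpy arrays."""
    mu_new = w * mu_tg + (1.0 - w) * mu_sc
    var_new = w * var_tg + (1.0 - w) * var_sc
    return mu_new, var_new
```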
2.3. Acoustic Model Merging

In acoustic model merging, the non-native speaker's language features are represented by combining one or more corresponding models from the target and source (which can be L1 or L2) acoustic models to form a new model. The transition weights can be estimated automatically or manually. For example, in Figure 1 below, a target model /s_tg/ and a source model /s_sc/ are combined to create a new model /p/ with six states. Transition weights w1 and w2 can be assigned manually or automatically to the target and source models.
Figure 1. A merged model
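A minimal sketch of the merged topology of Figure 1, assuming three emitting states per model: the two state sequences are kept as parallel paths, and the entry weights w1 and w2 decide which path the decoder takes. The class names here are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Path:
    states: list         # emission distributions, left to right
    entry_weight: float  # probability of entering this path

@dataclass
class MergedHMM:
    paths: list

def merge_models(target_states, source_states, w1, w2):
    """Six-state merged model: the target path is entered with weight
    w1 and the source path with weight w2 (w1 + w2 should sum to 1)."""
    assert abs(w1 + w2 - 1.0) < 1e-6, "entry weights should sum to 1"
    return MergedHMM(paths=[Path(target_states, w1),
                            Path(source_states, w2)])
```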
3. Modeling Context and Language Variations

Acoustic model merging and interpolation are interesting approaches for modeling, in a simple and fast manner, the pronunciation variation which exists among non-native speakers. We propose in this paper a hybrid approach of merging and interpolation to model context and language variations.

The general approach of interpolation is to select, for every Gaussian in a target state, the nearest corresponding Gaussian from the source state using a certain distance measure. Instead, we propose to carry out the interpolation in a different manner, where every Gaussian in the target state is treated as the 'centroid' for the Gaussians in the source state. The subsequent step is to find the nearest associated centroid (target Gaussian) for all the source Gaussians, using a distance measure such as the Euclidean distance or an approximated divergence distance:

Euclidean distance = ( Σ (µj – µk)² )^(1/2)

Approximated divergence distance = ( Σ (µj – µk)² / (σj σk) )^(1/2)

where µj, µk are the means and σj, σk the standard deviations of Gaussians j and k, and the sums are taken over the feature dimensions.
Every source Gaussian will be associated with only one target Gaussian, while a target Gaussian may be associated with zero or more source Gaussians. When the distance between the associated target Gaussian and the source Gaussian is below a threshold, their means, variances and mixture weights are interpolated (case 1). Otherwise, merging is performed: for those target Gaussians without any associated source Gaussian (case 3), or for source Gaussians that are far (more than the threshold) from their associated target Gaussian (case 2). In cases 2 and 3, the mixture weights are reduced by the interpolation weight. The threshold can be calculated, for example, by measuring the average distance among the Gaussians and then multiplying it by a constant. The resulting model is a hybrid model of interpolation and merging:

p_new,sn = w · p_tg,sn + (1-w) · p_sc,sn,   if p_sc,sn ≠ Ø and d(p_tg,sn, p_sc,sn) ≤ dist   (1)

p_new,sn = p_sc,sn, ω_new,sn = (1-w) · ω_sc,sn,   if p_tg,sn ≠ Ø and d(p_tg,sn, p_sc,sn) > dist   (2)

p_new,sn = p_tg,sn, ω_new,sn = w · ω_tg,sn,   if p_sc,sn = Ø   (3)

where p_new,sn is the interpolated/merged Gaussian, p_tg,sn the target Gaussian and p_sc,sn the source Gaussian; w is the interpolation weight, 0 ≤ w ≤ 1.0; ω is the mixture weight of the Gaussian; d() is a distance function and dist is the threshold distance.

In cases where the models are context dependent, the matching triphone context is looked up. If there is no matching triphone context, the context independent model is used instead.
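To make the three cases concrete, the sketch below is our reading of equations (1)-(3) for a single tied state, assuming diagonal-covariance GMMs held in numpy arrays and the approximated divergence distance. The data layout, function names and the final renormalization of the mixture weights are our own assumptions, not the exact implementation used here.

```python
import numpy as np

def hybrid_interpolate_merge(tg, sc, w, dist_threshold=None):
    """Hybrid interpolation-merging of one tied state, equations (1)-(3).

    tg, sc: dicts with 'mu' (n, d), 'var' (n, d) and 'omega' (n,) arrays
    for the target and source GMMs; w: interpolation weight in [0, 1].
    dist_threshold=None interpolates every source Gaussian (no case 2)."""
    def divergence(mu_j, var_j, mu_k, var_k):
        # approximated divergence distance between two diagonal Gaussians
        return np.sqrt(np.sum((mu_j - mu_k) ** 2 /
                              (np.sqrt(var_j) * np.sqrt(var_k))))

    n_tg = len(tg['omega'])
    new_mu, new_var, new_omega = [], [], []
    has_partner = np.zeros(n_tg, dtype=bool)

    for j in range(len(sc['omega'])):
        # associate each source Gaussian with its nearest target 'centroid'
        d = [divergence(sc['mu'][j], sc['var'][j], tg['mu'][k], tg['var'][k])
             for k in range(n_tg)]
        k = int(np.argmin(d))
        if dist_threshold is None or d[k] <= dist_threshold:
            # case 1: interpolate means, variances and mixture weights
            has_partner[k] = True
            new_mu.append(w * tg['mu'][k] + (1 - w) * sc['mu'][j])
            new_var.append(w * tg['var'][k] + (1 - w) * sc['var'][j])
            new_omega.append(w * tg['omega'][k] + (1 - w) * sc['omega'][j])
        else:
            # case 2: keep the distant source Gaussian, scale its weight
            new_mu.append(sc['mu'][j].copy())
            new_var.append(sc['var'][j].copy())
            new_omega.append((1 - w) * sc['omega'][j])

    for k in np.where(~has_partner)[0]:
        # case 3: keep target Gaussians without any source partner
        new_mu.append(tg['mu'][k].copy())
        new_var.append(tg['var'][k].copy())
        new_omega.append(w * tg['omega'][k])

    omega = np.array(new_omega)
    return {'mu': np.array(new_mu), 'var': np.array(new_var),
            'omega': omega / omega.sum()}  # renormalize mixture weights
```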
3.1. Language Variation

Research has shown that there is an interaction between the target language and the native phonology system of non-native speakers. The proposed method can be used to model the language variation between a target language acoustic model (target model) and a speaker's native language acoustic model (source model). Before the proposed method is carried out, the target states and their corresponding source states can first be matched using a knowledge-based approach, such as the IPA table, or data-driven methods, like a confusion matrix. Figure 2 below shows an example of what takes place.

Figure 2. Acoustic space. p_FR,s1g1 is French /p/ of state 1, Gaussian 1, and p_VN,s1g1 is Vietnamese /p/ of state 1, Gaussian 1. p_VN,s1g1 and p_VN,s1g2 will be associated with p_FR,s1g1; both will be interpolated with p_FR,s1g1. p_FR,s1g2 and p_VN,s1g3, which are far away from each other, will be merged (both Gaussians are kept and their mixture weight values are recalculated).
3.2. Context Variation

A context independent model is a bit flat, while a context dependent model can be too sharp for non-native speakers. So, the approach discussed above can also be used for modeling between contexts: the context can be modeled to reach an intermediate state between these two extremes. When modeling context variation, the model with the smaller number of states is treated as the target model, while the other is considered the source model. The process of modeling context variation is similar to modeling language variation, discussed in the previous section. One difference is that, since all models with the bigger number of states also belong to the model with the smaller number of states (both are from the same language), all source Gaussians are assumed to have a target Gaussian interpolation partner, so no threshold needs to be set.

For example, suppose we have a CI model (target model) and a CD model (source model). All CD triphones are matched to their corresponding CI monophones. Next, the corresponding CI Gaussian for every CD Gaussian is found using a certain distance measure. Interpolation is then performed on the CI Gaussians with their associated CD Gaussians, while CI Gaussians without any interpolation partner are merged, as sketched below.
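A short sketch of this context-variation case, reusing hybrid_interpolate_merge from above with no distance threshold. The model dictionaries and the 'l-c+r' triphone naming convention are hypothetical stand-ins for the actual model format.

```python
def model_context_variation(cd_model, ci_model, w):
    """Interpolate-merge every CD tied state with the CI state of its
    centre monophone. Models are hypothetical dicts: name -> state GMM.
    dist_threshold=None: every CD Gaussian gets a CI partner (no case 2)."""
    new_model = {}
    for triphone, cd_state in cd_model.items():
        centre = triphone.split('-')[1].split('+')[0]  # assumes 'l-c+r' names
        new_model[triphone] = hybrid_interpolate_merge(
            ci_model[centre], cd_state, w=w, dist_threshold=None)
    return new_model
```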
4. Experiments

The experiments were carried out on our non-native French corpus [11] using the CMU Sphinx 3 ASR system. There are two groups of non-native speakers: Chinese and Vietnamese. Each speaker read about a hundred sentences related to the tourism domain. The baseline French continuous HMM acoustic model was created from the BREF120 corpus [12]. As for the source languages, we have a 15-hour Vietnamese corpus [13] and a 5-hour Mandarin Chinese corpus. The general domain trigram language model was created using Le Monde newspaper texts, and subsequently interpolated with a tourism domain language model (from the NESPOLE project).

In most cases, the interpolated acoustic model will have a different number of Gaussians per state. Currently, the Sphinx architecture is not capable of handling a varied number of Gaussians per state. To model this, we set all states to the maximum number of Gaussians possible; the means, variances and mixture weights of the empty Gaussians are set to zero.
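One way to realize this padding, as a sketch under the same assumed array layout as above: every state is padded to a common Gaussian count, and the padded components get zero mixture weight so they never contribute to the likelihood.

```python
import numpy as np

def pad_state(mu, var, omega, max_n):
    """Pad one state's GMM from n to max_n Gaussians for a fixed-size
    layout such as Sphinx 3 expects. Padded components get zero means,
    zero variances and, crucially, zero mixture weight, so they are inert."""
    n, d = mu.shape
    pad = max_n - n
    return (np.vstack([mu, np.zeros((pad, d))]),
            np.vstack([var, np.zeros((pad, d))]),
            np.concatenate([omega, np.zeros(pad)]))
```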
4.1. Baseline Experiment Results

Some baseline tests which model the context and language features were prepared for comparison against our proposed approach.

4.1.1. Context Variation

Acoustic models with different numbers of tied states were trained, while maintaining the number of Gaussians at 16. Besides that, CI models with different numbers of Gaussians were also prepared. The results are shown below.

Table 1. WER of non-natives using CD acoustic models with different numbers of tied states (16 Gaussians per state)

Table 2. WER of non-natives using CI acoustic models with different numbers of Gaussians

The average WERs for the native Vietnamese and Chinese speakers are high. The high difficulty of the database was confirmed by human perception tests¹, which showed average WERs of 12.1% and 11.3% respectively. The results show that sharp context modeling does not improve, and even degrades (for the less experienced Chinese speakers), performance compared to context independent (CI) modeling.

¹ Human listeners were asked to transcribe non-native utterances, with an unlimited number of replays allowed for each utterance.
4.1.2. Language Variation

Before modeling language variation, we need to determine the matching phonemes in the target language and the speaker's native language. We used a combination of knowledge-based and data-driven approaches to determine the target language phoneme substitutions made by non-native speakers.

Acoustic model interpolation and merging were both tested. The interpolation described in Section 2.2 was applied. For acoustic model merging, instead of using the architecture in Figure 1, we assumed it is possible to transit from a Gaussian in one state to another Gaussian in the corresponding state of the other language, and vice versa. So, each new model had only three states, where the corresponding states in the target and source models were merged to become one. Transition weights to the different models were instead applied to the mixture weights of the models, as sketched below.
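A sketch of this merging variant, under the same assumed state representation as before: the Gaussians of the two corresponding states are pooled into a single state, and the transition weights are pushed onto the mixture weights.

```python
import numpy as np

def merge_states_mixture_level(tg, sc, w):
    """Pool the Gaussians of corresponding target and source states into
    one state of a three-state model; the weights w and (1 - w) act on
    the mixture weights instead of on the transitions of Figure 1."""
    return {'mu': np.vstack([tg['mu'], sc['mu']]),
            'var': np.vstack([tg['var'], sc['var']]),
            'omega': np.concatenate([w * tg['omega'],
                                     (1 - w) * sc['omega']])}
```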
Sixteen-Gaussian CI French and speaker's native language (Vietnamese and Chinese) acoustic models were used in the experiment. The results showed that acoustic model merging performs better than interpolation in most cases.

Figure 3. Interpolation and merging at different weights (x-axis: weight (FR / speaker's native language), from 1.0/0.0 to 0.0/1.0)

4.2. Proposed Approach Experiment Results

This section presents the results of our proposed approach.

4.2.1. Context Variation

The context modeling was carried out using a 43-state CI model and an 8129-state CD model, both with 16 Gaussians per state. The interpolation-merging produced a CD model with 8129 states and an average of 25 Gaussians per state (except for CI weight = 0 and CI weight = 1.0). The approximated divergence distance was used as the distance measure.

We noticed that there was a slight decrease in WER when the CI weight is at 1.0, compared to the original CI result. Note that when the CI weight equals 1.0, the algorithm produces a model with 8129 states where all triphones are replaced by their respective monophones. The best WERs for native Vietnamese and Chinese speakers of French were achieved at a CI weight of 0.7: 51.5% and 54.0% respectively.

Figure 4. Interpolation-merging of a French CI and CD model (x-axis: weight (CI/CD)). A CI weight of 1.0 denotes the CI model, and a CI weight of 0.0 denotes the CD results.

We validated the result on another corpus by conducting the test on 23 native German speakers of English from the ISLE corpus [14]. The TIMIT corpus [15] was used to create CI and CD models with 1120 states. The models were then interpolated-merged to create a new CD acoustic model. The test showed that when the CI weight is at 1.0, the WER for the non-native speakers was 57.9%; on the other hand, the WER of the CD acoustic model was 63.5%. At a CI weight of 0.5, the German speakers recorded a reduced WER of 55.1%.

The results from our context modeling show that, when an appropriate weight is used, the hybrid method produces encouraging ASR results. The weight to apply corresponds with the speaker's experience in the language: the Vietnamese speakers, who are more experienced, show larger WER improvements than the Chinese speakers. We also found that the WER of native French speakers (not shown in the graph) increased only slightly, by 2% compared to the CD model, when the CI weight equals 0.5.
4.2.2. Language Variation

The approach can also be used to model language variation. A French CI acoustic model with 43 states was interpolated-merged with the speaker's native language CI acoustic model; both have 16 Gaussians per state. The Euclidean distance was used as the distance measure. The resulting models have an average of 26 Gaussians per state. The threshold was set at about two times the average Gaussian distance.

The test showed that the proposed hybrid method performed better than the interpolation approach. It is also better than acoustic model merging in most cases, and the results are achieved using fewer Gaussians than acoustic model merging. For example, at an FR weight of 0.5, the hybrid models have WERs of 51.2% and 53.4% for Vietnamese and Chinese respectively, while with merging the WERs are 53.6% and 55.7% respectively.
Figure 5. Interpolation-merging of the French CI model and the Vietnamese/Chinese CI models (x-axis: weight (FR / speaker's native language), from 1.0/0.0 to 0.0/1.0)

4.2.3. Context and Language Variations

In this test, we investigate the combined effect of modeling the context and language variations. We used the CD model obtained after our proposed context modeling at a CI weight of 0.5. This CD model was then interpolated-merged with the speaker's native acoustic model. The Vietnamese CD model, which contains 5123 states, and the Chinese CD model, with 1102 states, were used.

Figure 6. Combination of context and language modeling (x-axis: weight (FR / speaker's native language))

At a French (FR) weight of 0.5, the test showed an overall improvement in WER (compared to the baseline presented in Table 1) from 60.6% to 44.1% for Vietnamese speakers and from 58.5% to 52.1% for Chinese speakers of French. On the other hand, the WERs for native French speakers increased only about 3% with the modified Vietnamese and Chinese models, compared to the baseline CD model.

5. Conclusions

We have presented a hybrid approach of interpolation and merging to model context and language variations which can be used offline. By applying a weight of 0.5, the method will generally give good results. When non-native speech is available, the weight can be estimated automatically: since there is only one parameter, one possibility is to compare the acoustic scores of acoustic models created at different weights and to select the one which produces the highest value. One side effect of the resulting hybrid model is that it increases the total number of Gaussians. Appropriate clustering of the Gaussians, or the elimination of certain unnecessary Gaussians, may be useful to reduce the number of Gaussians and thus the overall complexity of the model.
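A sketch of that one-parameter search, with hypothetical callables standing in for model construction and for scoring held-out non-native speech (e.g. via forced alignment):

```python
def select_weight(weights, build_model, acoustic_score):
    """Return the interpolation weight whose interpolated-merged model
    gives the highest acoustic score on a small non-native dev set.
    build_model and acoustic_score are hypothetical callables."""
    return max(weights, key=lambda w: acoustic_score(build_model(w)))

# Example: best = select_weight([i / 10 for i in range(11)], build, score)
```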
6. References

[1] L. M. Arslan and J. H. L. Hansen, "Frequency Characteristics of Foreign Accented Speech", ICASSP 97, 2:1123-1126, Munich, 1997.
[2] J. Flege, E. Frieda and T. Nozawa, "Amount of Native-Language (L1) Use Affects the Pronunciation of an L2", Journal of Phonetics, 25:169-186, 1997.
[3] Y. R. Oh, J. S. Yoon and H. K. Kim, "Acoustic Model Adaptation based on Pronunciation Variability Analysis for Non-Native Speech Recognition", ICASSP 2006, 1:I-137-I-140, Toulouse, 2006.
[4] U. Uebler and M. Boros, "Recognition of Non-native German Speech with Multilingual Recognizers", Eurospeech 1999, 2:911-913, Budapest, 1999.
[5] Y. Liu and P. Fung, "Modeling Partial Pronunciation Variations for Spontaneous Mandarin Speech Recognition", Computer Speech and Language, 17:357-379, 2003.
[6] Y. Liu and P. Fung, "Multi-Accent Chinese Speech Recognition", ICSLP 2006, 133-136, Pittsburgh, 2006.
[7] S. M. Witt, "Use of Speech Recognition in Computer-assisted Language Learning", PhD Thesis, University of Cambridge, 1999.
[8] Z. Wang, T. Schultz and A. Waibel, "Comparison of Acoustic Model Adaptation Techniques on Non-native Speech", ICASSP 2003, 1:540-543, 2003.
[9] T.-P. Tan and L. Besacier, "Acoustic Model Interpolation for Non-Native Speech Recognition", ICASSP 2007, 1009-1012, Hawaii, 2007.
[10] N. Minematsu, K. Osaki and K. Hirose, "Improvement of Non-native Speech Recognition by Effectively Modeling Frequently Observed Pronunciation Habits", Eurospeech 2003, 2597-2600, Geneva, 2003.
[11] T.-P. Tan and L. Besacier, "A French Non-Native Corpus for Automatic Speech Recognition", LREC 2006, 1610-1613, Genoa, 2006.
[12] L. F. Lamel, J. L. Gauvain and M. Eskénazi, "BREF, a Large Vocabulary Spoken Corpus for French", Eurospeech 91, 505-508, Genoa, 1991.
[13] V.-B. Le, T. Do-Dat, E. Castelli, L. Besacier and J. F. Serignat, "Spoken and Written Language Resources for Vietnamese", LREC 2004, 599-602, Lisbon, 2004.
[14] W. Menzel, E. Atwell, P. Bonaventura, D. Herron, P. Howarth, R. Morton and C. Souter, "The ISLE Corpus of Non-Native Spoken English", LREC 2000, 957-963, Athens, 2000.
[15] W. M. Fisher, G. R. Doddington and K. M. Goudie-Marshall, "The DARPA Speech Recognition Research Database: Specifications and Status", Proceedings of the DARPA Workshop on Speech Recognition, 93-99, 1986.