A multilingual speech recognizer can learn a new language from little data

Researchers in the DigiTala project experimented with new methods to improve the accuracy of automatic speech recognition (ASR) for non-native speech. The results are promising for the automatic recognition and assessment of learners' speech in small languages, such as Finnish.

A high-performance ASR is vital for the automatic assessment of spoken language skills, but the speech of language learners can be difficult to recognize due to, for example, mispronunciations and grammatical errors. Moreover, traditional ASR systems require large amounts of transcribed speech in the target language in order to function properly.

It is more challenging to develop ASR for small, or low-resource, languages such as Finnish and Swedish than for English. This is because low-resource languages have considerably fewer speakers, so less training data is available. DigiTala's ASR studies focus on the speech of learners of Finnish and Swedish, groups for which very little data exists.

Target data used only for fine-tuning

Instead of traditional training data (speech with matching text), the researchers first pre-trained the ASR on large amounts of speech without text. The pre-training was unsupervised, meaning that the models learned from the audio alone, without any transcripts.

Both the Finnish and Swedish ASRs were pre-trained on a large multilingual dataset containing speech from 23 languages. The Swedish ASR was additionally pre-trained on a large untranscribed Swedish speech corpus; no comparable corpus was available for Finnish.

The ASR systems were then fine-tuned with transcribed speech from learners of Finnish and Swedish; both fine-tuning datasets were collected in the DigiTala project. The performance of the new ASRs was compared to that of systems trained only on the small target-language data.
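The two-stage recipe can be sketched in miniature. The code below is a toy stand-in, not the project's actual ASR setup: "pre-training" here is just an unsupervised projection (PCA) learned from plentiful unlabelled vectors, and "fine-tuning" is a small logistic-regression classifier trained on a handful of labelled examples. All data and names are illustrative; the point is only that a representation learned without labels makes the final supervised step work with very little labelled data.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain(unlabelled, dim=2):
    """Unsupervised stage: learn a low-dimensional projection (PCA)
    from unlabelled data only, a toy stand-in for speech pre-training."""
    X = unlabelled - unlabelled.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:dim]  # projection matrix onto the top principal components

def finetune(proj, X, y, lr=0.5, steps=500):
    """Supervised stage: logistic regression on the projected features,
    trained on a small labelled set (gradient descent)."""
    Z = X @ proj.T
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(Z @ w + b)))
        g = p - y
        w -= lr * Z.T @ g / len(y)
        b -= lr * g.mean()
    return proj, w, b

def predict(model, X):
    proj, w, b = model
    return ((X @ proj.T) @ w + b > 0).astype(int)

# Synthetic "speech features": two hidden classes in 20 dimensions.
centers = rng.normal(size=(2, 20))
unlabelled = np.vstack([rng.normal(c, 0.3, size=(500, 20)) for c in centers])

# Only 10 labelled examples, mimicking the small transcribed learner data.
X_small = np.vstack([rng.normal(c, 0.3, size=(5, 20)) for c in centers])
y_small = np.array([0] * 5 + [1] * 5)

proj = pretrain(unlabelled)
model = finetune(proj, X_small, y_small)

X_test = np.vstack([rng.normal(c, 0.3, size=(50, 20)) for c in centers])
y_test = np.array([0] * 50 + [1] * 50)
acc = (predict(model, X_test) == y_test).mean()
print(f"test accuracy: {acc:.2f}")
```

The unlabelled set is 100 times larger than the labelled one, yet only the projection is learned from it; the classifier itself sees just ten transcribed examples, mirroring the pre-train-then-fine-tune division of labour described above.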

Self-learnt machines were more accurate

The pre-trained ASRs proved more accurate than ASRs trained only on the small transcribed dataset in the target language. This held even when the pre-training data was multilingual: the multilingually pre-trained ASR also outperformed a monolingual ASR without pre-training.

In other words: a self-learnt machine can learn a new language from relatively little target data, even without previous "knowledge" of the target language. This makes the development of ASR more efficient for small languages and for other small speaker groups, such as language learners.

Read more about the development of ASRs for non-native speech in Yaroslav Getman’s Master’s Thesis.

The article on the development of L2 Swedish ASR is available in the Interspeech proceedings.