Greater Linguistic Diversity in the MT Universe
February 17, 2020 by Sonja
In November 2019 Amazon announced that it would be incorporating 22 new languages into its translation service Amazon Translate, including languages like Swahili and Tamil.
This reflects an overall trend: growing numbers of “rare” languages are being offered for automatic translation, and the quality of the output keeps improving. Does that mean we’ll soon have the legendary Babel fish from The Hitchhiker’s Guide to the Galaxy, a tiny helper that sits in your ear and interprets on the fly from any language in the universe? The jury is still out on that one, but in this blog post we’ll outline what’s currently possible with the technology available, and what belongs in the realm of science fiction (for now).
High-resource vs. low-resource languages
Artificial intelligence forms the cornerstone of modern machine translation (MT) systems. The “fuel” needed to power these engines is data. For MT systems specifically, that means enormous bilingual corpora: at least 10 million sentence pairs are required to train an engine of this kind. MT providers tend to find this training material by combing the web.
That’s why languages with a strong internet presence, such as the Romance languages, English, and German (“high-resource languages”), are ideal for this process. “Low-resource languages” are a different story altogether. This term refers to languages in which comparatively little content is available online, including Croatian, Slovenian, and Hindi.
Yes, that’s right: Hindi! While it is one of the most widely spoken languages in the world, many official publications and commercial documents in India are available primarily in English, the country’s second official language. That’s why you’ll find comparatively little high-quality bilingual content with Hindi as the source or target language on the internet.
A brave new world of languages
Low-resource languages are currently experiencing a golden age. There are several reasons for this:
- MT’s technological basis: Contemporary MT relies on neural networks and deep learning. In contrast to conventional statistical and rule-based translation algorithms, this technology works well even with language combinations that have very different grammatical structures, like Japanese and English. This means that machine translations from Chinese into German, for example, are achieving an impressive level of quality that would have been inconceivable just five years ago.
- The current phase of the MT innovation cycle: Neural MT has reached the peak of its innovation curve, which means that improvements to this technology have slowed down for high-resource languages, shifting the focus of investment and development to low-resource languages. This development has also been accompanied by a diversification of the market. While major players like Google, Microsoft, Amazon, and IBM initially dominated the industry, more and more niche providers that focus on one specific content type (e.g., medical translations) or on less widespread languages have since emerged. For example, Yandex is considered a specialist for Russian, Baidu has become the first port of call for Chinese, and Naver Papago is particularly popular in Korea right now.
- Google is pursuing a further approach to improving translation quality for rare languages with its Massively Multilingual NMT System. The word “massive” is no exaggeration: the system is trained on a breathtaking 25 billion sentence pairs. Rather than covering a single language pair, it currently supports several dozen languages and even more language combinations. Its major advantage is that what the model learns from high-resource languages can be transferred to low-resource languages and drawn on for less common language combinations, such as translations from French into Irish.
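To make the idea of one shared model for many languages a little more concrete, here is a minimal Python sketch of two techniques commonly described for multilingual NMT training: prefixing each source sentence with a token that names the target language, and temperature-based sampling, which shows the model small corpora more often than their size alone would suggest. This is an illustration under assumptions, not Google’s actual implementation: the function names, the `<2xx>` token format, the temperature value, and the corpus sizes are all hypothetical.

```python
def sampling_probabilities(pair_counts, temperature=5.0):
    """Return a sampling probability for each language pair.

    temperature=1.0 samples in proportion to corpus size; higher values
    flatten the distribution so that small corpora are drawn more often.
    """
    total = sum(pair_counts.values())
    scaled = {
        pair: (count / total) ** (1.0 / temperature)
        for pair, count in pair_counts.items()
    }
    norm = sum(scaled.values())
    return {pair: value / norm for pair, value in scaled.items()}


def tag_source(sentence, target_lang):
    """Prefix a source sentence with a token naming the target language."""
    return f"<2{target_lang}> {sentence}"


# Hypothetical corpus sizes in sentence pairs, invented for illustration.
corpus_sizes = {"fr-en": 40_000_000, "de-en": 30_000_000, "ga-en": 500_000}

proportional = sampling_probabilities(corpus_sizes, temperature=1.0)
flattened = sampling_probabilities(corpus_sizes, temperature=5.0)

for pair in corpus_sizes:
    print(f"{pair}: {proportional[pair]:.3f} -> {flattened[pair]:.3f}")

# One model serves every direction; the tag tells it which language to produce.
print(tag_source("Bonjour le monde", "ga"))
```

With a temperature of 1, the small Irish-English corpus in this made-up example would account for less than 1% of the training examples. Raising the temperature increases its share considerably, which is one way a shared model can give low-resource languages more attention.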
Technology has long helped to break down language barriers, and that’s truer now than it has ever been before. This is great news, especially for companies that cater to markets in Eastern Europe, Scandinavia, and Asia, where machine translation will increasingly help to drive down localization costs in the future.