Do Google and Alexa ignore small languages?
Main image: Astana, the capital of Kazakhstan. About half of the country's 18 million people speak Kazakh. Credit: Alex J. Butler via Flickr, CC BY 2.0
Imagine if English wasn't the universal language of the internet. What if you couldn't read this article online except as a version mangled by Google Translate? And what if Alexa didn't understand anything you asked it? Now imagine that you got in touch with Google and Amazon and asked them to add English to their systems… and they said 'No thanks – do it yourself'.
Kazakh is a language spoken by around half of the 18 million people in Kazakhstan, a vast country in Central Asia that borders both Russia and China – but its relatively sparse population means it can get overlooked by tech giants such as Google.
“As a commercial market Kazakhstan is not really interesting to Google because it doesn't generate the right amount of money out of advertising,” says Rauan Kenzhekhanuly, founder of the nonprofit WikiBilim Public Foundation, who in 2011 set up a Kazakh language version of Wikipedia, a huge initial act of translation that was to prove critical.
He's since been the driving force behind an attempt to boost Kazakh in online machine translation tools. “It's very important for small languages to be able to give access to any website, and to translate websites and articles in your language,” he says, before underlining just how entrenched English and Russian are as written languages in Kazakhstan. “At university, even if you study Kazakh literature and language you'll be obliged to find textbooks in Russian or English.”
Google's ambivalence towards cultures on the margins is pretty standard behavior, and perhaps understandable. A few years ago the Faroe Islands – home to around 50,000 people – petitioned Google to add the islands to Google Street View, then used sheep fitted with cameras to make it happen.
Lost in translation
To be fair to Kazakhstan, it's taken some drastic steps to meet the world halfway. After getting 7,000 articles in Kazakh on Wikipedia, Kenzhekhanuly spearheaded a project to boost that to 210,000 to please Google.
“We started to communicate with Google, but they explained that they don't really do anything to bring minor languages into the Google Translate service,” he says. “They said that it's up to you – you have to provide us with tons of text – and they asked for 10,000 articles.”
After far surpassing that figure for mirror translations from Kazakh into English (and back) thanks to the work of 350 volunteers in Kazakhstan, Google's system was able to build its first translations. Kazakh is now available as a simple text-to-text system on Google Translate, though it won't translate entire websites or spoken Kazakh, nor will it translate via a camera using the Google Translate app (which is mostly used for translating menus).
As easy as ABC
There is one more rather drastic step that Kazakhstan has taken to make its language easier to integrate into the wider world: it's changing its entire alphabet. Working on the presumption that the Russian Cyrillic alphabet used to write Kazakh is both a hangover from rule by the USSR and off-putting to English-speaking visitors, in 2017 the government announced plans to transition to using the Roman alphabet completely by 2025.
It's already being used in schools, which is no surprise since the decree read: “For the sake of the future of our children we should make this decision and create it as a condition of entry for our wider global integration.”
Whatever linguistic concessions Kazakhstan makes to the tech world, advances in machine translation will lessen translation issues in the very near future. After 55 years as part of the USSR, which ended in 1991, Kazakhstan is partly fighting against the continued domestic dominance of the Russian language. That is ironic, because just this summer a British company became the first to crack the historically tricky Russian-to-English translation.
“In Russian, a word might have 12 variations in meaning, with inflexions used instead of word order, but in English it's just three or four and a fixed word order,” says Mihai Vlad, VP of Machine Translation at UK-based SDL. “So generic machine translation technology is not enough for a language like Russian; you need an engine that addresses the specific ways of phrasing.”
The solution proved to be Neural Machine Translation (NMT), which is built on the same neural network techniques that have been responsible for recent advances in image recognition and speech recognition. “What's different is the way words are being converted into numbers,” explains Vlad. “Every word gets coded into an array of numbers, and those numbers get passed through a neural network that uses matrix multiplication, and you end up with word-embedding that essentially captures the meaning of the word or sentence.”
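To make Vlad's description a little more concrete, here is a minimal sketch of the idea in Python. It is a toy illustration, not SDL's engine: the four-word vocabulary, the three-number embeddings, the random values and every function name are assumptions made for the example. A real NMT system learns its numbers from millions of translated sentences and passes them through many more layers.

```python
import random

# Toy vocabulary (an assumption for this sketch; real systems use tens of thousands of tokens).
VOCAB = ["the", "cat", "sat", "mat"]
EMBED_DIM = 3

random.seed(0)  # fixed seed so the illustrative numbers are reproducible
# Hypothetical embedding matrix: one row of EMBED_DIM numbers per word.
# In a trained network these values are learned, not random.
EMBEDDINGS = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]


def one_hot(word):
    """Code a word as an array of numbers: 1 at its vocabulary position, 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


def matvec(vector, matrix):
    """Multiply a row vector by a matrix using plain Python lists."""
    columns = range(len(matrix[0]))
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in columns]


def embed_word(word):
    """One-hot vector times the embedding matrix picks out that word's embedding."""
    return matvec(one_hot(word), EMBEDDINGS)


def embed_sentence(words):
    """Crude sentence embedding: the average of the word embeddings."""
    vectors = [embed_word(w) for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    print(embed_word("cat"))
    print(embed_sentence(["the", "cat", "sat"]))
```

The matrix multiplication here does nothing more than look up one row of the table, which is why trained systems can swap it for a direct lookup. The "meaning" only emerges once training has nudged related words towards similar rows.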
Latin languages have proved much easier to map, but German, Russian and most of the Asian languages have required NMT – essentially custom-made language-mapping engines – to become readable by machines.
What about voice recognition?
If having a Kazakh-language Wikipedia and getting Kazakh onto Google Translate is helping keep the small language alive and flourishing, what about Alexa, Google Assistant and Siri? So far the global growth in speech recognition has been in voice assistant hardware, not software, with all the big players limited in what languages they handle:
Alexa: English, German and Japanese
Google Assistant: English, French, German, Italian, Japanese and Spanish
Siri: English, Arabic, Chinese, Danish, Dutch, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Malay, Norwegian, Portuguese, Russian, Spanish, Swedish, Thai and Turkish
“We'd love to be part of those technologies, and right now we are working on bringing Kazakh to the speech-to-speech system,” says Kenzhekhanuly. This is not just so people in Kazakhstan can indulge in novelty nonsense like getting an Echo to set kitchen timers, and asking about the weather – the stakes are much higher. This is about accessing the future of technology.
“If you have your language included in speech-to-speech then you will get access to platforms that access smartphones, but also smart cars,” says Kenzhekhanuly. For example, the driverless cars of the future will surely communicate with their ‘drivers’ primarily using voice, but if it’s left up to the car manufacturers and tech companies, only the world’s really big languages – Mandarin Chinese, English and Spanish – will be catered for.
Back in Kazakhstan, work will continue on fusing the Kazakh language into the fabric of the internet – and specifically Google Translate – because Kenzhekhanuly is convinced of its vital importance in the modern age.
“It's not perfect, but the beauty of the technology is that it's improving constantly,” he says. “As a piece of technology, there is no other that is closer to imitating the human brain, and that's why it's so important for Kazakh to be part of it – these platforms are not only information platforms, but also linguistic platforms.”
TechRadar's Next Up series is brought to you in association with Honor