
Machine translation has been something of a holy grail in AI and language technologies for decades, and for good reason. In a world of ever-increasing international business and cooperation, effective communication is crucial. Fast, reliable automated translation would therefore be of tremendous value, but despite serious efforts it remains far from realization.
When I was in high school, I started thinking about this problem and decided to give it a try – making a program that would translate sentences from Icelandic to English and vice versa. It couldn’t be that hard, could it? So off I went – happily ignorant of the enormity of the task. After spending some 2 or 3 months of free-time programming on it (a lot at the time), I began to realize what I was getting myself into. There had definitely been progress, but the goal seemed to have moved dramatically further away.
The last thing I want to do is scare off somebody who wants to give it a try, but I thought it might help to share my experience and some of the things I’ve learned since from various sources, mostly from people who have given the problem a lot more time and thought than I have. Remember that this is one of those problems where not knowing it can’t be done is the only way to succeed.
First, how far had I gotten with my project?
- I had a vocabulary of about 250 words in both languages, derived from a list of common English words in a course book. The vocabulary could be searched for translations in either direction.
- I had managed to make a program that could correctly derive all possible inflections of Icelandic nouns and adjectives from their base forms. It handled some 95-99% of all words (not just the small vocabulary) correctly and included a lot of exceptions. Note that Icelandic nouns have some 16 different inflected forms and adjectives 72!
- Similarly, my system handled the much simpler inflections of English, notably the definite and indefinite articles, the plural -s and the -ed past-tense ending.
- Using these tools, my system would parse sentences word by word and use the vocabulary to translate each word into the other language, often (sometimes, at least) applying the correct inflection to the translated word.
By careful selection, my system could perfectly translate some simple sentences (something of the form “I see a dog”, which becomes “Ég sé hund”). Utterances such as “Black dogs” (“Svartir hundar”) also worked, but anything more complex would turn up funny grammatical errors that no person would ever make.
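The word-by-word pipeline described above can be sketched in a few lines. Everything here is an invented toy: the vocabulary, the inflection table and the feature tags are hypothetical samples for illustration only, and real Icelandic morphology is vastly richer (the 16 noun forms and 72 adjective forms mentioned above).

```python
# Toy sketch of a word-by-word dictionary translator with a hand-written
# inflection table. All data below is invented for illustration.

# Base-form vocabulary: English lemma -> Icelandic lemma
VOCAB = {"i": "ég", "see": "sjá", "dog": "hundur", "black": "svartur"}

# Inflection table: (Icelandic lemma, grammatical feature) -> surface form
INFLECT = {
    ("sjá", "1sg"): "sé",                 # first person singular
    ("hundur", "acc-sg"): "hund",         # accusative singular
    ("hundur", "nom-pl"): "hundar",       # nominative plural
    ("svartur", "nom-pl-masc"): "svartir" # nom. plural masculine
}

def translate(tagged_words):
    """tagged_words: list of (English lemma, feature tag or None)."""
    out = []
    for word, feat in tagged_words:
        if word.lower() == "a":   # Icelandic has no indefinite article
            continue
        lemma = VOCAB[word.lower()]
        # fall back to the bare lemma if no inflected form is listed
        out.append(INFLECT.get((lemma, feat), lemma))
    return " ".join(out)

print(translate([("I", None), ("see", "1sg"), ("a", None), ("dog", "acc-sg")]))
# -> ég sé hund
print(translate([("Black", "nom-pl-masc"), ("dog", "nom-pl")]))
# -> svartir hundar
```

Note that the feature tags are supplied by hand here; deciding them automatically is exactly the syntax-analysis problem discussed below.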
Not all bad, but thinking about the next steps of improvement caused me to find myself another pet project. The grammatical errors my system made were usually a matter of word order, wrongly chosen inflections or – worst of all – wrongly diagnosed words that can have different meanings depending on context, e.g. “search” as a verb (to search for something) instead of a noun (a search for something).
The key to solving all of these problems seemed to be parsing the syntax of sentences and teaching the system the grammar of each language. I think I made one attempt at a solution but soon left it to program 3-dimensional rotating boxes or something of the sort.
Many people have taken similar steps (many of them before me, of course) and many have obviously gone far beyond. Automated translation systems such as Systran (used by both Altavista’s Babelfish and Google) are good enough to help make sense of a paragraph or find out what a website is all about, but you will quickly notice a lot of funny stuff that ruins your trust in the system right away. Translating a sentence from one language to another and back again can give quite amusing results (try it yourself at “Lost in Translation”). Other tools such as InterTran are not as good – but all the more amusing.

Even with advanced syntax handling, like the Systran system has, there are still severe problems. And these seem to be problems that cannot be solved without some kind of understanding of what the text being translated actually means. At this point you’ve really got some problems. To get anywhere here, you need “common sense”: a collection of the things that are so obvious to us that we never really think of them, but that are essential to our understanding of language and the world in general (see the previous Wetware entries “Gathering Common Sense” and “Google Miner”).
But is this really the best approach?
Radically different approaches like statistical machine translation have shown promising results. In this method, a translation system tries to “make its own rules” from a collection of parallel texts in two languages (originally translated by human translators). The rules are based on statistical models: the system notes that, when used in a specific context, a word is usually translated in a specific way. There is no understanding involved, only statistics and the recognition of patterns.
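The statistical idea can be illustrated in its crudest form: count, over a parallel corpus, which target-language word co-occurs most often with each source word, and treat the winner as its translation. The corpus below is invented, and real systems (e.g. the IBM alignment models) refine these counts iteratively rather than using raw co-occurrence, but the principle – patterns instead of understanding – is the same.

```python
from collections import Counter, defaultdict

# Tiny invented English-Icelandic parallel corpus.
parallel = [
    ("i see a dog", "ég sé hund"),
    ("i see a cat", "ég sé kött"),
    ("i sleep", "ég sef"),
    ("dogs sleep", "hundar sofa"),
    ("cats sleep", "kettir sofa"),
]

# Count how often each target word appears in sentences
# containing each source word.
cooc = defaultdict(Counter)
for src, tgt in parallel:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

def best_translation(word):
    """Naively pick the most frequently co-occurring target word."""
    return cooc[word].most_common(1)[0][0]

print(best_translation("i"))      # -> ég
print(best_translation("sleep"))  # -> sofa
```

With a corpus this small, many words remain ambiguous (every word in a two-word sentence co-occurs equally with every target word); resolving that ambiguity from millions of sentence pairs is where the real statistical machinery comes in.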
If we look at how infants do it, it’s surely not by first learning a huge vocabulary, then learning how to inflect every single word, then the grammar, and finally adding a database of common sense to polish the process. It is really the other way around, isn’t it? Learning a language in school is somewhat different, often starting with syntax and inflections and then gradually building the vocabulary, assuming that common sense is already in place.
However, when trying to build useful tools in AI or other mimicry fields, it’s not always best to do things exactly the way the role model we’re aiming for does them. Many useful applications of AI use approaches that are radically different from the way a human brain works, e.g. fuzzy logic and expert systems.
Bearing this in mind, it’s always nice to think outside the box. Could there be some dramatically different way to approach machine translation? With the statistical model in mind, might it, for example, be possible to feed “truckloads” of text in different languages into a system that would try to deduce patterns from all of them, not just from a single language pair?
One gets the feeling that the key to good machine translation might be the same as the key to making believable chat agents that can keep up human-like conversations, as both need at least to “fake” some kind of understanding of what is being said. Any hints there?
I don’t know. But in any case this is a very interesting problem to think about.
Links:
The Translation Page – A nice guide to available online machine translation software.