Current State of Machine Translation

Led by Francis, this is a discussion of the state of the art in machine translation

Our expectations for machine translation depend on what we're asking MT to do. Machine translation for dissemination (publishing): translate into a text that can be cleaned up by hand.

Machine translation for assimilation: allowing people to understand the gist. "uphands, robbery it is" - we can all understand it, even if it's a lousy translation.

Often you can't make a useful translation - one that's faster to post-edit than translating from scratch.

You can get translation for post-editing that works well within the same language group - Spanish/English, Spanish/Catalan might do 90% of the work. Fran focuses on MT between closely related languages.

Google Translate - results are surprisingly fantastic. It uses large amounts of pre-translated texts, effectively translation memories. It iterates through a huge amount of text doing word alignment and phrase alignment - note that "casa roja" and "red house" appear in the same places in aligned texts. The system pulls apart texts, looks for translated words and phrases, and chooses the right ones based on probability models.

"estación de tren" - don't translate it as "season of the train", though that's one way "estación" could go. This can work fantastically well if you've got many millions of sentences aligned - 20 million sentences in one language pair for Google (Olaf argues that it's at least 100 million).
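A minimal sketch of the idea behind this (not Google's actual pipeline): estimate phrase translation probabilities by relative frequency from aligned phrase pairs, so the frequent translation wins. The mini corpus below is invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical phrase pairs extracted from a sentence-aligned corpus.
aligned_phrase_pairs = [
    ("estacion de tren", "train station"),
    ("estacion de tren", "train station"),
    ("estacion de tren", "railway station"),
    ("estacion seca", "dry season"),
    ("casa roja", "red house"),
]

counts = defaultdict(Counter)
for src, tgt in aligned_phrase_pairs:
    counts[src][tgt] += 1

def best_translation(src_phrase):
    """Pick the target phrase with the highest relative frequency P(tgt | src)."""
    candidates = counts[src_phrase]
    total = sum(candidates.values())
    tgt, n = candidates.most_common(1)[0]
    return tgt, n / total

print(best_translation("estacion de tren"))  # ('train station', 0.66...)
```

With enough aligned text, "estación de tren" is overwhelmingly seen next to "train station", which is why the season/station ambiguity rarely trips the system up.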

How good will Google get? Possibly good enough that we'll be able to post-edit the text. But there will probably always be a human element.

Corpus, corpora - a set of documents. An aligned corpus: documents aligned at the sentence level, in different languages. One example is the rules for joining the European Union. The Europarl corpus - European Parliament proceedings - 30 million sentences in 14 languages - the gold standard.
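As a concrete picture of what "aligned at the sentence level" means, here is a minimal sketch assuming the common Europarl-style layout: two plain-text files, one sentence per line, where line N of each file is a translation pair. The file names are hypothetical.

```python
def read_aligned_corpus(src_path, tgt_path):
    """Yield (source sentence, target sentence) pairs from two parallel files."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield src_line.strip(), tgt_line.strip()

# for es, en in read_aligned_corpus("europarl.es", "europarl.en"):
#     ...  # feed the pairs to word/phrase alignment
```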

Statistical machine translation relies on these parallel corpora. But Breton/French has only 30,000 sentences in aligned corpora.

Rule-based machine translation: source text, intermediate representation, then target language. Breaking the text into noun phrases, verbs, subordinate clauses. Welsh is Verb-Subject-Object; English is Subject-Verb-Object. It's lots of work, though not as much as people think. SYSTRAN works on this model - it's surprising how bad it is. The intermediate representation distinguishes words by parts of speech - "wound" (verb or noun). Chunks - noun phrase, verb phrase. Full parsing - parse a sentence into noun and verb trees.
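A toy sketch of the transfer step in such a pipeline, using the Welsh/English word-order difference mentioned above: analyse a VSO clause into (verb, subject, object), reorder it to SVO, and substitute words from a tiny lexicon. The lexicon and clause are invented for illustration, not a real analyser.

```python
toy_lexicon = {"gwelodd": "saw", "y dyn": "the man", "y ci": "the dog"}

def transfer_vso_to_svo(verb, subject, obj):
    """Reorder a Welsh-style VSO analysis into English SVO and translate each chunk."""
    return " ".join(toy_lexicon[chunk] for chunk in (subject, verb, obj))

print(transfer_vso_to_svo("gwelodd", "y dyn", "y ci"))  # "the man saw the dog"
```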

Disambiguating parts of speech - rule-based: if it's at the beginning of the sentence in Welsh, it's a verb. Statistical: the probability of a verb beginning a sentence in Welsh is quite high.
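The two styles side by side, as a minimal sketch with made-up counts: the rule-based tagger fires a hard-coded rule, the statistical tagger picks the tag with the highest estimated probability for the same context.

```python
def rule_based_tag(token_position):
    # Rule from the discussion: sentence-initial words in Welsh are verbs.
    return "VERB" if token_position == 0 else "UNKNOWN"

# Hypothetical counts of tags seen sentence-initially in a tagged Welsh corpus.
initial_tag_counts = {"VERB": 870, "NOUN": 90, "OTHER": 40}

def statistical_tag(counts):
    """Return the most probable tag and its estimated probability."""
    total = sum(counts.values())
    tag = max(counts, key=counts.get)
    return tag, counts[tag] / total

print(rule_based_tag(0))                    # VERB
print(statistical_tag(initial_tag_counts))  # ('VERB', 0.87)
```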

Another approach to intermediary languages might be interlinguas, but there's less research in this field right now. (Olaf points to a Japanese/Italian collaboration to develop an interlingua between the two. It goes beyond parts of speech into sense disambiguation - here's a word in a specific sense. Professor Della Santa worked in Japan for decades on this, developing a computer-based interlingua called UNL.)

Will the semantic web save us? A word in the source language can be translated in two different ways; metadata might help us choose the right sense. In truth, word sense disambiguation is almost always done statistically, though systems are often built with manual approaches, coding senses by hand.
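A toy sketch of sense disambiguation from context: score each sense of an ambiguous word by how many of its typical context words appear in the sentence, and pick the best-scoring sense. The sense inventory and context words here are invented; real systems learn them from sense-annotated data.

```python
sense_contexts = {
    "wound (injury)": {"bandage", "bleed", "hurt", "heal"},
    "wound (past of wind)": {"clock", "rope", "spring", "around"},
}

def disambiguate(sentence_tokens):
    """Pick the sense whose typical context overlaps most with the sentence."""
    tokens = set(sentence_tokens)
    return max(sense_contexts, key=lambda sense: len(sense_contexts[sense] & tokens))

print(disambiguate(["she", "wound", "the", "clock", "spring"]))  # wound (past of wind)
```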

English/Japanese - rule-based grammars seem to work well, but statistics work better for vocabularies. Rule-based systems work very well for languages with complex case systems; statistical translation works poorly when there are lots and lots of cases, inflections, etc.

Morphological analysis, lemmatization - find the "lemma", the core word: the singular, present-tense, default-gender form. Rule-based machine translation systems can be useful for making search engines, spell checkers, and part-of-speech taggers, which are good for grammar checkers. The rules can be written in XML and exported for other systems.
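A minimal sketch of rule-based lemmatization: strip inflectional endings with a small suffix table to recover the lemma. The suffix rules below are a toy Spanish-flavoured example, not a real morphological analyser.

```python
suffix_rules = [
    ("ciones", "ción"),   # estaciones -> estación
    ("as", "a"),          # rojas -> roja
    ("os", "o"),          # rojos -> rojo
]

def lemmatize(word):
    """Return the lemma by applying the first matching suffix rule."""
    for suffix, replacement in suffix_rules:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(lemmatize("estaciones"))  # estación
```

In a real system, a table like this would be much larger and (as noted above) could be stored in XML so other tools - spell checkers, taggers - can reuse it.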

The importance of deciding what you want your translation system to do: it might not make sense to translate Basque to Spanish - most people in the Basque autonomous region speak Spanish as well. But Basque to English might be useful.

MT for dissemination is easier with a reduced language set - the Canadian METEO system was a rule-based system for translating weather forecasts. With a rule-based system in a restricted domain, "all bets are off" - you can get perfect translations.

Esperanto? The kind of Esperanto you get is based on the mother language of the speaker.

How would a community write better for machine translation? Make translations freely available - license under CC-BY or something similar. Jer: we'd love to do this - how can we help? Use the software that's already there; use the software for machine translation...

Why is machine translation so bad around gender? Translating from a language with gendered pronouns, like English, this isn't hard; from languages like Spanish, where the pronoun is often dropped, it's often very difficult.

Darius - is machine translation going to get good enough that social translation is no longer necessary? If you're translating between Catalan and Spanish and not using MT, you're wasting time. Human translators with a translation memory might translate 3000 - 6000 words a day. A rule-based MT system helped Spanish/Brazilian Portuguese translators improve from 3000 to 6000 a day. So, for these closely related languages, it's already good enough. For unrelated languages, it might take 10-20 years.

Lots of open copyright questions - what's the copyright status on the government documents of the former Yugoslavia? It would be very helpful for building translation systems...

Petra - controversial terms like "sex worker" are usually translated very badly. Is it possible to get activism around this? Axel - at Mozilla, they're always coining terms; they ask people to walk through the translation process with them and document the process.

EthanZ 15:01, 22 June 2009 (UTC)