
REPOST: On machine translation

Language is not pure information; it's information shorthand. It assumes a high degree of already-shared knowledge about the world. Some of these assumptions are near-universal; many are not.

Japanese and English (my languages) offer a great example, especially as it pertains to machine translation. Whereas English is a subject-predicate language, where basically all the information is encoded in the language stream, Japanese is a topic-comment language, where, once set, the "subject" is not restated until it changes. Beginning Anglophone learners of Japanese often make the mistake of adding "wa" to every sentence to mark what they think of as the subject, when it does not need to be there. "Wa" is a topic marker, not a subject marker.

This is a fundamentally different way of thinking about language and, therefore, about the world. Germanic languages seek to operate regardless of context; Asian languages seek to augment (or "comment on") it. If you've ever felt that Japanese people who speak English are beating around the bush or being vague, part of that is cultural, but part of it is the language itself, which does not require explicitness. A big part of learning Japanese (or, for Japanese people, of learning English) is learning how to think about the world and about human interactions in a very different way.

Machines aren't human. They are information processors. They don't know what a "cat" is; they just know that it's a piece of code that can be slotted into a certain place in a syntactic structure. Until machines are really intelligent (and I don't think that will be anytime soon), expect more crappy translations than useful ones. Anyone who tells you otherwise is probably selling something (a crappy machine translator, to be exact!).