Bridging the Online Language Barrier

BROOKE GLADSTONE: English may be the internet’s dominant language, but it’s employed by only 27 percent of internet users. Chinese seems certain to surpass it within a few years, and dozens of other languages are experiencing huge growth. Will vast swaths of cyberspace be lost to us, blocked by an insurmountable language barrier? OTM’s Mark Phillips set out to explore whether the Net is developing any new tools to fix a very old communication problem.
MARK PHILLIPS: Every once in a while I stumble onto a webpage that isn't in English. My immediate reaction is, oh, this is wrong, and I hit the back button as fast as I can. I know this is proof of some sort of Amero-Anglo hegemonic bias or something, but let's face it, we English speakers have gotten used to thinking this way.
ETHAN ZUCKERMAN: There’s been this sense that English is the default language on the internet.
MARK PHILLIPS: Ethan Zuckerman, cofounder of the multilingual blog network Global Voices, says this is because the internet started as a project by the U.S. military and then grew with the help of American universities. Zuckerman says at first the ingrained English bias had a big upside.
ETHAN ZUCKERMAN: Things were easier. It was easier to have the sense that we were all in the same conversation, that we were all laughing at the same jokes. There’s something great about a common language. A common language is a first step towards communication across cultural boundaries.
PROFESSOR DAVID BELLOS: Well, we've been here before.
MARK PHILLIPS: David Bellos is the director of the Program in Translation and Intercultural Communication at Princeton University.
PROFESSOR DAVID BELLOS: From about the 4th century A.D. until the age of Newton and Leibniz and Spinoza, there was Latin, and it was very convenient. Nobody grew up speaking Latin. Latin was a learned language. But it was the language of philosophy, religion, law, debate, science for more than a thousand years, throughout the whole of Europe. And it was wonderful, very convenient. And right now there are far more users of English as a foreign language than there are speakers of English as a native language. English is, in that sense, the new Latin.
MARK PHILLIPS: But even Zuckerman has noticed big changes in just the past few years. Back in 2004, he was at a dinner in Amman, Jordan with about 20 bloggers who were chatting away in Arabic.
ETHAN ZUCKERMAN: But almost all of them were blogging in English, at that point. I pulled aside my friend Ahmad Humeid and said, why is it that you guys write in English? You’re all more comfortable in Arabic. And Ahmad said, look, if we want the world to listen to us, we've got to speak in English because they don't speak Arabic. Out of that group of people that I had dinner with, a lot of those people blog in Arabic now. And I've gone back and I've talked to some of them and said, well, what made you switch? And what one said to me was, now our friends, our peers, our neighbors, they're all online. That’s who we want to reach.
MARK PHILLIPS: In the Renaissance, people moved away from Latin and began writing and printing in the local vernaculars, and that helped produce, well, a renaissance. The same goes for the Web now.
ETHAN ZUCKERMAN: It has allowed billions of people to come onto this network. It’s really hard to imagine 400 million Chinese users coming onto the internet, if English was a precursor for them to participating in this space.
MARK PHILLIPS: Non-English growth is skyrocketing. Arabic internet users, for example, have increased by over 2,000 percent over the past decade.
ETHAN ZUCKERMAN: The one negative is that it introduces this huge source of fragmentation, lots of separate internets. And they're separated by a couple of different things, but language is the big one and language is the one that turns out to be hardest to overcome.
MARK PHILLIPS: One dream is that translation software could make all the content on the Web available to everyone, no matter what language it’s in. A Popular Science article from April, 1958 shows just how old this vision is. Of course, they didn't have the internet back then, but the article explains -
[CLIP/BEEP/MUSIC UP AND UNDER]
MALE CORRESPONDENT: Each year, millions of reports on scientific research are published, a big fraction of them in foreign languages. [BEEP] In this mass of Russian, Dutch, Chinese and Hindustani data are clues to H-power, interplanetary flight, more powerful batteries, longer-wearing tires. [BEEP] What we need is a machine to read one language and type in another, an automatic translator.
[END CLIP]
PROFESSOR DAVID BELLOS: The history of machine translation really begins during the Second World War when, in the U.K., and especially at Bletchley Park, enemy codes were cracked.
MARK PHILLIPS: Princeton Professor David Bellos:
PROFESSOR DAVID BELLOS: After the war, just as computers were being invented, the bright idea came that maybe you could use these wonderful new machines to do code cracking and that maybe languages could be looked at as if they were in code, as if the real meaning of the thing was actually the English and the Russian was just, you know, one of these complicated ways of masking what the real meaning was.
MARK PHILLIPS: First you teach the computer vocabulary, apple equals yablaka, and then you teach it all the rules and grammar, do it for every language and, boom, you've got a Star Trek-style universal translator.
PROFESSOR DAVID BELLOS: It didn't produce the results they wanted.
MARK PHILLIPS: David Bellos:
PROFESSOR DAVID BELLOS: The reason it didn't was that it was based on not a very sophisticated idea of what language actually is. What I am saying isn't in code for something else, it is what I'm saying. So there are really very strict limits on what you can do with machine translation, based on the idea of code. By the early 1960s, they'd pretty much given up.
MARK PHILLIPS: This rules-based machine translation was a failure, but there was still another method called statistical translation. Think of it as a behavioral approach. The underlying grammar and syntax don't matter, but repeated exposure to language, as it’s actually used, does. It’s like how babies learn. You don't diagram sentences for them. They just hear you say stuff and copy you. The catch is to teach the machine, you have to load huge amounts of text into the computer. Back in the 1960s, they didn't have enough data to make a statistical machine translation work. Now we do, says Michael Galvez, a project manager at Google Translate.
MICHAEL GALVEZ: What we do is we actually use hundreds of billions of words that Google infrastructure has access to.
MARK PHILLIPS: It’s a two-step process. First, Google’s computers pull it all in, recognize the language and create what they call a language model. There’s one for each of the 52 languages currently on the service. As they get more data for a particular language, the computers get a better feel for it. It knows from a statistical standpoint that in English, the sentence “The boy are sad” is very rare, just as a five-year-old knows that sounds weird. But the language model only teaches the computer how to speak each language by itself. The next step is to learn how to go between multiple languages. Google’s Michael Galvez says, for that:
MICHAEL GALVEZ: We also build what’s called a translation model, using previous human translation that we have access to, documents from the EU, the United Nations, very high-quality translation corpora.
MARK PHILLIPS: Everything spoken or written at the United Nations is automatically translated into six languages.
[U.N. HUBBUB/MANY LANGUAGES AT ONCE] Google uses U.N. and European Union transcripts, along with tons of other professional high-quality translations, to build this translation model, which allows their computers to take a sentence and predict what it would be in another language. Michael Galvez:
MICHAEL GALVEZ: We take the language model and the translation model and we put these two models together, and we basically create the machine translation system out of this.
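[EDITOR'S NOTE: The two-model idea described above can be illustrated with a minimal Python sketch. This is not Google's system. The four-sentence English corpus, the Spanish-to-English word table and its probabilities, and the function names lm_logprob and translate are all invented for illustration. The language model here is a simple bigram counter with add-one smoothing, and the translation model is a hand-written word-for-word table; a real system would learn both from very large collections of text.]

```python
import math
from collections import Counter
from itertools import product

# Toy English text standing in for the huge monolingual data a real system uses.
ENGLISH_CORPUS = [
    "the boy is sad",
    "the boy is happy",
    "the girl is sad",
    "the boys are sad",
]

# Language model: count how often each word follows another.
context_counts, bigram_counts, vocab = Counter(), Counter(), set()
for sentence in ENGLISH_CORPUS:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab.update(tokens)
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))


def lm_logprob(sentence):
    """Log-probability of an English sentence under the bigram model
    (add-one smoothing, so unseen word pairs are rare but not impossible)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total


# Translation model: hypothetical word table with made-up probabilities,
# standing in for what would be learned from human-translated documents.
TRANSLATION_TABLE = {
    "el": {"the": 1.0},
    "niño": {"boy": 0.7, "child": 0.3},
    "está": {"is": 0.5, "are": 0.5},
    "triste": {"sad": 1.0},
}


def translate(source):
    """Score every word-for-word candidate by translation-model probability
    plus language-model probability (both in log space); return the best."""
    options = [TRANSLATION_TABLE[word].items() for word in source.split()]
    best_sentence, best_score = None, -math.inf
    for choice in product(*options):
        candidate = " ".join(word for word, _ in choice)
        tm_score = sum(math.log(prob) for _, prob in choice)
        score = tm_score + lm_logprob(candidate)
        if score > best_score:
            best_sentence, best_score = candidate, score
    return best_sentence
```

[In this toy setup the word table cannot decide between "is" and "are", so the language model breaks the tie: translate("el niño está triste") returns "the boy is sad", because "the boy are sad" scores as much rarer English.]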
MARK PHILLIPS: It produces startlingly accurate results. Plug in an article from a Spanish-language newspaper and it reads like an English article that just needs a trip to the copy editor. And the original text doesn't have to be perfect to get a good translation.
MICHAEL GALVEZ: If Google Translate has seen a particular construction many times in the past, then it should be able to translate this correctly. So it can take the wrong word and translate it into the right word in the foreign language. And that’s because it sees these mistakes on the Web and it basically learns from them.
MARK PHILLIPS: Galvez had me type a sentence for translation with the word “receive” spelled incorrectly, with the “i” before the “e” – a mistake I usually make. Google’s computers ignored my misspelling and translated it into Spanish correctly. And since there’s plenty of slang on the Web, Google Translate also knows what to do with that, like the abbreviation for “laugh out loud.” I just typed in LOL, and that’s “ja, ja, ja.”
MICHAEL GALVEZ: With a “J.”
MARK PHILLIPS: Yes.
[LAUGHTER] So it found LOL.
[LAUGHTER]
MICHAEL GALVEZ: Yeah, in Spanish it’s “ja, ja, ja” with a J, yes.
MARK PHILLIPS: Google has even integrated Translate into its search engine. Galvez had me enable the translated search feature and type in -
MICHAEL GALVEZ: Okay, let's do Belgian waffles.
MARK PHILLIPS: - Belgian waffles. Okay, delicious-looking. [LAUGHING]
MARK PHILLIPS: Google translates Belgian waffles into a bunch of languages, searches, and then shows me the best foreign language sites translated back into English.
MICHAEL GALVEZ: Let me see. If you go to the first link, it says, Belgian waffles. It’s - it’s actually translated from French. So if you click on that link, you will actually be able to navigate to the French site, and you’re actually experiencing this website now in English. If you click on “waffle” it will actually show you the recipes in English.
MARK PHILLIPS: Wow. Being able to search and read all French recipes on the Web without speaking a word of French is amazing. But Google Translate makes plenty of mistakes. Using the Google Translate app on my phone, I tried to have a conversation with someone who speaks Chinese. I started by asking him a simple question. Can I ask you a couple of questions?
[CHINESE TRANSLATION]
MAN: This thing doesn't make sense. When you say, can I ask you a couple of questions, “couple,” I mean, it can translate as the – a husband and wife. They translate as a – can I ask you - this couple questions?
MARK PHILLIPS: Like a husband and wife.
MAN: Yes.
MARK PHILLIPS: Let me try it again without saying “a couple.” Okay, can I ask you some questions?
[CHINESE TRANSLATION]
MAN: Yeah.
MARK PHILLIPS: That's right?
MAN: Yeah. [CHINESE]
MARK PHILLIPS: What were you trying to say?
MAN: I say, of course you can.
[CHINESE TRANSLATION]
MARK PHILLIPS: The answer was “gay.”
[LAUGHTER][CHINESE TRANSLATION]
MAN: That’s wrong.
[LAUGHTER/VOICES IN BACKGROUND]
MARK PHILLIPS: The voice recognition was the first problem. It had trouble turning his voice into text. But even when we typed in the text, the translation only gave us the gist of what the other person said. And that might help a tourist find the bathroom, but is it really enough to usher in a new era of communication? Would an article translated by a machine be enough to persuade community leaders in Africa to adopt a new way to combat malaria? Also, machine translation takes away the nuance and art of prose. Would an English-speaker read an eloquent and satirical Saudi blogger, if the translation just barely made sense? Or will we always need a human translator to get the whole picture?
ETHAN ZUCKERMAN: The solution isn't machine translation just getting better or human translators just getting more pervasive. The solution is some combination of the two.
MARK PHILLIPS: Blogger Ethan Zuckerman.
ETHAN ZUCKERMAN: We're doing a much, much better job of figuring out how to organize these communities of human translators, take advantage of people’s willingness to volunteer or willingness to work for small sums of money. I think that translation is going to go from being esoteric, rare and expensive to becoming fairly commonplace, participatory and expected.
MARK PHILLIPS: A new website, called Meedan.net, is an example of this kind of project. It translates stories about the Arab world from both English and Arab media. The short posts on the home page are always translated by a person.
ED BICE: The idea is a Wikipedia-style approach to translation.
MARK PHILLIPS: Meedan’s founder, Ed Bice:
ED BICE: Any registered translator on our system, and we now have about a thousand people who are capable of generating translations on Meedan, can contribute to improving a translation.
MARK PHILLIPS: Meedan doesn't rely solely on people, though. When you comment on a story, it’s instantly sent through a machine translator. The result is that everything on the website is in both languages. But where Meedan is particularly innovative is how it presents the translations. The traditional way is to put the toggle button at the top of the webpage where you choose your language. Click on English, and all of the Arabic on the page disappears. That’s how Google Translate works. But Ed Bice says Meedan does it differently.
ED BICE: So on Meedan we have side-by-side. Page right is Arabic, which, conveniently, it’s a right-to-left language and we have page left, English.
MARK PHILLIPS: Of course, I only read the English column on the left. But if it’s a translation, it has a gray background and I intuitively know it’s a translation. A white background lets me know I'm reading a comment originally written in English. Ed Bice says seeing the comments in Arabic, even if you can't understand them, is a potent reminder that you’re talking to someone who may have a very different point of view.
ED BICE: We think you need a visual cue. And from having that visual cue, you can actually see this cross-language conversation happening on the website.
MARK PHILLIPS: You immediately see a ping-pong back and forth between the two languages, something you'd lose if there was just one column in your language. At Meedan, a two-sentence story about a Syrian man throwing his shoe at the Turkish prime minister generated 26 comments. The first two were originally in English, then two from Arabic, then five that were originally in English, then three from Arabic, and so on. At first, the two languages seemed to be having different conversations, but as the thread continues, English speakers begin responding to Arabic comments, and vice versa. It becomes a cross-cultural conversation about the meaning of shoe throwing.
ED BICE: There are two narratives that describe almost any emerging situation or policy decision regarding the U.S. and the Middle East, and the most obvious dividing line for those narratives is linguistic.
MARK PHILLIPS: So if sites like Meedan put these two narratives side by side and translate them, will a third narrative emerge? Will our differences dissolve and world peace reign? Ethan Zuckerman says, not so fast.
ETHAN ZUCKERMAN: When you can read what people say in their own languages, it’s often a lot less diplomatic. It’s often a lot more nationalistic. They tend to be unaware of the other audiences that might be paying attention. My favorite example of this, in many ways, is Jack Cafferty of CNN.
JACK CAFFERTY: We’re in hock to the Chinese up to our eyeballs, as we continue to import their junk…
ETHAN ZUCKERMAN: Who referred to the Chinese as:
[CLIP]
JACK CAFFERTY: …basically the same bunch of goons and thugs they’ve been for the last 50 years.
[END CLIP]
ETHAN ZUCKERMAN: That comment was widely translated, circulated all throughout the Chinese Web. And some Chinese citizens sued CNN for a billion dollars, one dollar for each offended Chinese person. And while the lawsuit, I don't think, had legal implications, CNN was, was forced to apologize. Cafferty got on the air and apologized and said, no, no, no, I was talking about the government, I wasn't talking about the people. As we get better and better and better at translating, I think what it’s really going to do is force us to address each other’s preconceptions, prejudices, biases. But unless we can actually hear what people are saying, it’s very hard to start on that process.
MARK PHILLIPS: So it’s not just about generating good translations. It’s also about being aware of them, because if we know the rest of the world is listening to us, we'll be more conscious of our casual jabs and unexamined assumptions and begin to hear ourselves as others hear us. And we need to hear them. Remember, 73 percent of internet users are not using English right now, and that number is growing fast. If we want to access all the internet has to offer, we have to give up the idea that English is the Latin of the digital age. Never again can we count on one language to connect us all. For On the Media, I'm Mark Phillips.
[MUSIC UP AND UNDER]