What is Transliteration? A Complete Guide

Definition

Transliteration is the process of converting text from one script into another. It works by preserving the sound of the original word, not its meaning.

Put simply: the letters change, but the pronunciation stays the same. The language itself doesn't change.

"The act of writing words or letters in the characters of another alphabet."
— Cambridge Dictionary[1]

"The act or process of writing words using the alphabet of a different language."
— Oxford Learner's Dictionaries[2]

Transliteration vs. Translation

These two terms are often confused, but they describe fundamentally different operations:

Operation | What changes? | Example
Transliteration | The script (writing system) | namaste → नमस्ते
Translation | The language (meaning) | Hello → नमस्ते

When you transliterate "namaste" into Devanagari as "नमस्ते", the meaning is not involved at all. Only the spelling is mapped to the nearest phonetic equivalent in the target script.

Translation works differently. It converts the meaning of a word, regardless of how it sounds.
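To make the नमस्ते example concrete, here is a minimal Python sketch of a script-to-script mapping from Devanagari to Latin. The tiny character tables and the function name `devanagari_to_latin` are assumptions made for illustration; a real system would cover the full script and follow a published standard.

```python
# Toy subset of Devanagari: just enough to handle the example word.
CONSONANTS = {"न": "n", "म": "m", "स": "s", "त": "t"}
# Dependent vowel signs (matras), written as escapes because they are combining marks.
VOWEL_SIGNS = {"\u093e": "ā", "\u093f": "i", "\u0940": "ī", "\u0947": "e", "\u094b": "o"}
VIRAMA = "\u094d"  # suppresses the inherent "a" of the preceding consonant


def devanagari_to_latin(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # A bare consonant carries an inherent "a" unless a vowel sign
            # or a virama follows it.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch == VIRAMA:
            continue  # already accounted for when the consonant was emitted
        else:
            out.append(ch)  # pass through anything outside the toy table
    return "".join(out)


print(devanagari_to_latin("नमस्ते"))  # namaste
```

Nothing in the function knows what the word means. It only maps written units to written units, which is the point of the distinction drawn above.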

Transliteration Across Language Pairs

Transliteration isn't limited to English or Hindi. It works between any two writing systems. Here are a few examples across different language pairs:

Original Language | Script | Word | Transliteration | Target Script
Hindi | Devanagari | नमस्ते | Namaste | Latin
Arabic | Arabic | محمد | Muhammad | Latin
Japanese | Hiragana | とうきょう | Tōkyō | Latin
Chinese | Hanzi | 北京 | Běijīng | Latin
Russian | Cyrillic | Москва | Moskva | Latin
Greek | Greek | Αθήνα | Athína | Latin
Arabic | Arabic | محمد | Мухаммад | Cyrillic (Russian)
Arabic | Arabic | القاهرة | Аль-Кахира | Cyrillic (Russian)

Romanization: Transliteration Into the Latin Script[4]

When the target script is the Latin (Roman) alphabet, the process has its own name: romanization. So the Arabic name "محمد", when romanized, becomes "Muhammad".

Romanization is the most widely studied form of transliteration. That's largely because the Latin script is used internationally for science, travel documents, and the internet.

Several formal standards govern romanization for different source scripts:

Hepburn Romanization (Japanese)[5]

Hepburn romanization was created in 1867 by American missionary James Curtis Hepburn. He designed it with English phonology in mind, so that English speakers could naturally pronounce Japanese words without special training.

It is the most widely used Japanese romanization system in the world. You'll find it on road signs, train timetables, and passports, and it's taught to most learners of Japanese as a foreign language.

  • The kana し is written shi (not "si"), because English readers would mispronounce "si."
  • The kana ちゃ is written cha (not "tya").
  • Long vowels are shown with a macron: ō, ū (e.g., Tōkyō, not Tokyo).
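The three conventions above can be sketched in a few lines of Python. This is a toy illustration, not a full Hepburn implementation: the kana tables cover only the examples in this section, the function name `hiragana_to_hepburn` is hypothetical, and the long-vowel step assumes every "ou" or "uu" sequence is a long vowel, which is not true across word-internal boundaries in real Japanese.

```python
# Toy kana tables: just enough for the examples in this section.
DIGRAPHS = {"ちゃ": "cha", "きょ": "kyo"}  # kana + small kana, read as one unit
KANA = {"し": "shi", "ち": "chi", "と": "to", "う": "u", "き": "ki", "お": "o"}
LONG_VOWELS = {"ou": "ō", "oo": "ō", "uu": "ū"}  # simplification, see lead-in


def hiragana_to_hepburn(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:  # two-kana units must be matched first
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(KANA.get(text[i], text[i]))  # unknown kana pass through
            i += 1
    romaji = "".join(out)
    for plain, macron in LONG_VOWELS.items():
        romaji = romaji.replace(plain, macron)
    return romaji


print(hiragana_to_hepburn("とうきょう").capitalize())  # Tōkyō
print(hiragana_to_hepburn("し"))  # shi
```

Matching digraphs before single kana is what keeps ちゃ from coming out as "chiya"-style output; the order of the lookups matters more than the size of the table.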

Pinyin — Hanyu Pinyin (Chinese)[6]

Pinyin (literally "spelled sounds") was developed in the 1950s by the People's Republic of China. It uses the Latin alphabet plus tone marks over vowels to represent both pronunciation and tone in Standard Mandarin.

Today it's the official romanization standard in mainland China and Singapore, recognised by the United Nations, and the main method for typing Chinese characters on digital devices.

  • 北京 → Běijīng (the diacritics show tones: ě = falling-rising, ī = high level)
  • 毛泽东 → Máo Zédōng
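Because Pinyin is so often typed with tone numbers instead of diacritics ("bei3 jing1"), a short sketch of the conversion may help. The code below applies the usual placement rule (the mark goes on "a" or "e" if present, on the "o" of "ou", otherwise on the last vowel). The function names are hypothetical, and the sketch assumes syllables are already separated by spaces and ignores details such as apostrophes between syllables.

```python
import unicodedata

# Combining marks for tones 1-4; tone 5 (neutral) stays unmarked.
TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300"}


def mark_syllable(syllable: str) -> str:
    """Convert one numbered syllable such as 'bei3' to its tone-marked form."""
    if not syllable or syllable[-1] not in "12345":
        return syllable
    body, tone = syllable[:-1], syllable[-1]
    if tone == "5":
        return body
    lower = body.lower()
    if "a" in lower:
        pos = lower.index("a")
    elif "e" in lower:
        pos = lower.index("e")
    elif "ou" in lower:
        pos = lower.index("ou")  # index of the 'o'
    else:
        pos = max(lower.rfind(v) for v in "iouü")  # last vowel
    if pos < 0:
        return body  # no vowel found; leave the syllable untouched
    marked = body[:pos + 1] + TONE_MARKS[tone] + body[pos + 1:]
    # NFC folds letter + combining mark into a single precomposed character.
    return unicodedata.normalize("NFC", marked)


def numbered_to_pinyin(text: str) -> str:
    """Join space-separated numbered syllables into one tone-marked word."""
    return "".join(mark_syllable(s) for s in text.split())


print(numbered_to_pinyin("Bei3 jing1"))  # Běijīng
```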

ISO 9 (Cyrillic to Latin)[7]

ISO 9:1995 is published by the International Organization for Standardization. It defines a single, universal table for converting Cyrillic characters into Latin characters, covering 118 characters across all Cyrillic-script languages.

These include Slavic languages such as Russian, Ukrainian, and Bulgarian, as well as non-Slavic languages of the former Soviet Union. Its defining property is full reversibility: the original Cyrillic text can be reconstructed unambiguously from the ISO 9 transliteration.
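Reversibility follows from the table being one-to-one: each Cyrillic letter maps to exactly one Latin character, so the table can simply be inverted. The sketch below illustrates this with the modern Russian alphabet only, a small subset of the standard. The helper names are hypothetical, and capital Ъ and Ь are left out of the capital-letter table because their Latin counterparts have no separate uppercase form.

```python
import unicodedata

# Russian subset of the ISO 9:1995 table (lowercase letters).
ISO9_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "ë",
    "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "ш": "š", "щ": "ŝ", "ъ": "ʺ",
    "ы": "y", "ь": "ʹ", "э": "è", "ю": "û", "я": "â",
}

NFC = lambda s: unicodedata.normalize("NFC", s)

# Normalise both sides so every entry is a single precomposed character.
TO_LATIN = {NFC(c): NFC(l) for c, l in ISO9_LOWER.items()}
for cyr, lat in list(TO_LATIN.items()):
    if lat.upper() != lat:  # add capitals where the Latin side has one
        TO_LATIN[cyr.upper()] = lat.upper()

TO_CYRILLIC = {lat: cyr for cyr, lat in TO_LATIN.items()}
assert len(TO_CYRILLIC) == len(TO_LATIN)  # one-to-one, hence invertible


def to_latin(text: str) -> str:
    return "".join(TO_LATIN.get(ch, ch) for ch in NFC(text))


def to_cyrillic(text: str) -> str:
    return "".join(TO_CYRILLIC.get(ch, ch) for ch in NFC(text))


print(to_latin("Москва"))               # Moskva
print(to_cyrillic(to_latin("Щука")))    # Щука
```

The round trip only holds for input that is purely Cyrillic to begin with; Latin letters already present in the source would be converted on the way back.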

Beyond Romanization: Transliteration Into Other Scripts

Romanization moves text into the Latin script, but that's only one direction. Transliteration can work between any two scripts, and two other important directions have their own names:

Cyrillization[8]

Cyrillization is the counterpart of romanization: it takes a word from a non-Cyrillic script and renders it in the Cyrillic alphabet.

It's used for writing foreign names and words in Russian, Ukrainian, Serbian, Macedonian, Bulgarian, and other Cyrillic-script languages. There are two approaches:

  • Transliteration: systematic character-to-character mapping, used when the source language has consistent spelling.
  • Transcription: phonetic rendering of the word's pronunciation, used for languages like English and French whose spelling is irregular.

Take "Shakespeare" as an example. In Russian it becomes Шекспир. That's not a letter-for-letter map; it's a phonetic approximation based on how English speakers pronounce the name.

Transcription into Chinese Characters[9]

In Chinese, this process is called 音译 (yīnyì): phonetic transcription into Chinese characters (Hanzi).

Because Chinese characters are largely monosyllabic logograms, foreign words must be broken into syllables. Each syllable is matched to a Chinese character with a similar sound. Where possible, characters are chosen that also carry a neutral or favourable meaning:

  • "McDonald's" → 麦当劳 (Màidāngláo), where the characters approximate the sound and loosely mean "wheat, serve, labour."
  • "Coca-Cola" → 可口可乐 (Kěkǒu Kělè), meaning "tasty and enjoyable", an unusually favourable match of sound and meaning.

Official transcriptions in China are standardised by the Xinhua News Agency's Names of the World's Peoples dictionary.

Challenges of Transliteration

Transliteration sounds straightforward, but it comes with some real-world complications worth knowing about:

  • Sounds with no letter in the target script: Many scripts lack characters for sounds common in other languages. Hindi's retroflex consonants (ट, ड) have no direct Latin equivalents, so romanization schemes fall back on diacritics such as ṭ and ḍ.
  • Multiple romanization systems: Japanese alone has Hepburn, Kunrei-shiki, and Nihon-shiki. The same word can be spelled differently depending on which system is used, which causes confusion.
  • Context-dependent pronunciation: Arabic and Hebrew usually leave short vowels unwritten. A single written word may therefore have several valid readings, and a transliteration system must pick one.
  • Diacritics and special characters: Strict, reversible systems like ISO 9 require diacritics that many keyboards can't easily produce, which limits their practical use.
  • Proper name inconsistency: The same person's name can appear in dozens of romanised forms across different countries and eras (e.g. Muammar Gaddafi / Qaddafi / Kadhafi).
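One common workaround for the diacritics problem is to keep the strict form as the stored value and derive a plain-ASCII key for typing and search. The sketch below uses Python's standard `unicodedata` module; the function name `fold_diacritics` is hypothetical. Note that folding is lossy: it throws away exactly the information that makes a system like ISO 9 reversible, so it should complement the strict form, not replace it.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Drop combining marks, e.g. for building a keyboard-friendly search key."""
    decomposed = unicodedata.normalize("NFD", text)  # split letter + mark
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("Tōkyō"))       # Tokyo
print(fold_diacritics("Máo Zédōng"))  # Mao Zedong
```

Folding does not help with the proper-name problem in the last bullet: "Gaddafi" and "Qaddafi" differ in base letters, not in diacritics.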

Best Practices for Transliteration

Whether you're working on a research project, a database, or a travel guide, a few simple principles will save you a lot of headaches:

  • Use a recognised standard: Choose a published system (ISO, ALA-LC, BGN/PCGN) relevant to your source language and state it clearly.
  • Be consistent: Apply the same system throughout a document or database.
  • Keep the original script: Where possible, include the original script alongside the transliteration, e.g. "नमस्ते (Namaste)."
  • Don't mix up transliteration and transcription: Transliteration is script-to-script; transcription is sound-to-script. Mixing them up causes errors.
  • Think about your audience: A reversible ISO 9 transliteration suits library cataloguing; a pronunciation-friendly system like Hepburn works better for a general travel guide.
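For a database, the first three principles translate naturally into a record layout: store the original, store the rendering, and store the name of the system that produced it. The sketch below is one possible shape, with hypothetical class and field names; the "IAST" label in the usage example is an assumption about which scheme was applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransliteratedTerm:
    original: str       # text in the source script, kept verbatim
    source_script: str  # e.g. "Devanagari"
    rendering: str      # the transliteration itself
    system: str         # the standard used, stated explicitly

    def display(self) -> str:
        """Original script first, transliteration in parentheses."""
        return f"{self.original} ({self.rendering})"


term = TransliteratedTerm("नमस्ते", "Devanagari", "Namaste", "IAST")
print(term.display())  # नमस्ते (Namaste)
```

Keeping `system` on every record makes inconsistencies visible: two rows for the same source word with different renderings can be explained, or flagged, by comparing that field.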

Machine Transliteration Models

Software can perform transliteration automatically. In a landmark study, Oh, Choi, and Isahara (2006) compared four distinct machine transliteration models within the same experimental framework[3]:

Model | How it works | Best suited for
Grapheme-based | Maps source characters (graphemes) directly to target characters using learned spelling patterns. No phonetic information used. | Languages with consistent spelling-to-sound correspondence.
Phoneme-based | Converts source graphemes into phonemes first, then maps those phonemes to target characters. Pronunciation is the intermediary. | Languages with irregular spelling but consistent pronunciation.
Hybrid | Combines grapheme-based and phoneme-based probabilities using linear interpolation. Leverages the strengths of both. | General-purpose; consistently outperforms either alone.
Correspondence-based | Establishes explicit alignments between graphemes and phonemes, treating them as jointly paired units rather than independent sequences. | Complex scripts where grapheme and phoneme information must be modelled together.

The study found that the hybrid and correspondence-based models performed best. Combining all four in an ensemble improved accuracy even further.
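The linear interpolation used by the hybrid model is simple to state: each candidate's score is a weighted sum of the probability assigned by the grapheme-based model and the probability assigned by the phoneme-based model. The sketch below shows only that combination step. The function names, the candidate labels, and all probabilities are made up for illustration and are not taken from the study.

```python
def hybrid_score(p_grapheme: float, p_phoneme: float, weight: float = 0.5) -> float:
    """Linear interpolation: weight * P_grapheme + (1 - weight) * P_phoneme."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * p_grapheme + (1.0 - weight) * p_phoneme


def rank_candidates(grapheme_probs, phoneme_probs, weight=0.5):
    """Score every candidate seen by either model and sort best-first."""
    candidates = set(grapheme_probs) | set(phoneme_probs)
    scored = {
        c: hybrid_score(grapheme_probs.get(c, 0.0), phoneme_probs.get(c, 0.0), weight)
        for c in candidates
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Invented probabilities for two hypothetical output candidates.
grapheme_probs = {"candidate_A": 0.6, "candidate_B": 0.3}
phoneme_probs = {"candidate_A": 0.2, "candidate_B": 0.7}
for name, score in rank_candidates(grapheme_probs, phoneme_probs):
    print(name, round(score, 2))
```

With equal weights, candidate_B wins even though the grapheme-based model alone prefers candidate_A, which is the kind of correction a hybrid is meant to provide. In practice the weight would be tuned on held-out data.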

Where Does Google Transliteration Fit?

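Google's transliteration, used in Google Input Tools, doesn't fit neatly into any single category from the taxonomy of the 2006 study by Oh, Choi, and Isahara, published in the Journal of Artificial Intelligence Research (JAIR). Its approach has evolved through three distinct phases: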

  1. Phase 1, Phoneme-based: Early Google transliteration used phonetic rules to map Latin input to the nearest-sounding characters in the target script (e.g. "namaste" → नमस्ते).
  2. Phase 2, Statistical hybrid: A statistical language model trained on large parallel corpora was added to rank candidate outputs by their likelihood in context, combining grapheme-level patterns with phonetic mapping.
  3. Phase 3, Neural sequence-to-sequence: Modern versions use recurrent neural networks (RNNs) trained with Connectionist Temporal Classification (CTC) loss. This handles polyphonic scripts such as Arabic and Devanagari, where a character's pronunciation depends on context.

Google's transliteration started as a phoneme-based model, then became a statistical hybrid, and is now best described as a neural sequence model. That last category didn't exist when the 2006 JAIR taxonomy was written, but it builds directly on the hybrid foundation the study identified as most effective.
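For readers unfamiliar with CTC, the decoding side is easy to illustrate. A CTC-trained network emits one label per input step, including a special "blank"; the simplest decoder collapses consecutive repeats and then drops the blanks. The sketch below is a generic greedy CTC decoder, not Google's code, and the blank symbol and the per-step labels are invented for the example.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol


def ctc_greedy_decode(step_labels) -> str:
    """Collapse consecutive repeats, then remove blanks."""
    collapsed = [label for label, _ in groupby(step_labels)]
    return "".join(label for label in collapsed if label != BLANK)


# Invented per-step best labels from an imaginary network.
print(ctc_greedy_decode(list("nn_aa__mm_a")))  # nama
print(ctc_greedy_decode(list("l_l")))          # ll
```

The blank is what lets the model output a genuine double letter: "l_l" decodes to "ll", whereas "ll" without a blank collapses to a single "l".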


References:

Sambhu Raj Singh
