2014-03-02

Representing Indian languages using the Latin script

Indian languages are written in a diverse set of scripts, most of them derived from the Brahmi script rather than from the Phoenician script. These scripts neither look like Latin nor follow the familiar A, B, … ordering. Further, many Indian languages have phonemes absent from the languages Latin was traditionally used for. As a consequence, most of these scripts have far more than 26 base glyphs (not to mention ligature forms). Mapping these to Latin characters becomes important for two distinct uses:
  1. storing/presenting Indian language content
  2. inputting Indian language content.
While Unicode encodes most popular scripts used for Indian languages, and even some rare ones, Latin letters continue to be used to represent Indian languages everywhere: in email, in SMS messages, in web‐pages, in file‐names; pretty much ubiquitously. But how does one map the diverse phoneme set (between 30 and 50 for most Indian languages) onto the 26 letters of the Latin alphabet?

[Likewise, physical keyboards with a QWERTY layout dominate the world; how to allow combinations of characters on the QWERTY keyboard to represent the diverse character sets in Indian languages? I'll address this in another post.]

As usual, there are many options.

"The nice thing about standards is that you have so many to choose from." Andrew S. Tanenbaum

ISO‐15919, an international scholarly standard, and a few other schemes – IAST, Hunterian, National Library at Kolkata, ALA‐LC – use diacritic (accent) marks over or under Latin characters. Harvard‐Kyoto, Velthuis, ITRANS, SLP1, WX, VedaType and ISO‐15919's limited character set option restrict themselves to 7‐bit ASCII but use punctuation characters or mixed case to make up the difference.

For example, here are the same Sanskrit characters in a few sample schemes:
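| Devanagari | ISO‐15919 | Hunterian | ISO‐15919‐lcs | Harvard‐Kyoto | ITRANS |
|---|---|---|---|---|---|
| आ | ā | ā | aa | A | aa/A |
| ऋ | r̥ | ri | ,r | R | RRi/R^i |
| ए | ē | e | ee | e | e |
| ं (anusvāra) | ṁ | m | ;m | M | M |
| ख् | kh | kh | kh | kh | kh |
| ञ् | ñ | n | ~n | J | ~n/JN |
| ड् | ḍ | d | .d | D | D |
| श् | ś | sh | ;s | z | sh |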
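
To make the comparison concrete, here is a minimal Python sketch that converts between two of the ASCII‐only columns, Harvard‐Kyoto and ISO‐15919's limited character set. It knows only the eight pairs listed above; `HK_TO_ISO_LCS` and `hk_to_iso_lcs` are hypothetical names chosen for illustration, and a real converter would need the complete tables of both schemes.

```python
# Harvard-Kyoto -> ISO-15919 limited-character-set pairs, taken only from
# the comparison rows above. This is NOT a complete table for either scheme.
HK_TO_ISO_LCS = {
    "A": "aa",   # long a
    "R": ",r",   # syllabic r
    "e": "ee",
    "M": ";m",   # anusvara
    "kh": "kh",
    "J": "~n",   # palatal nasal
    "D": ".d",   # retroflex d
    "z": ";s",   # palatal sibilant
}


def hk_to_iso_lcs(text: str) -> str:
    """Greedy longest-match conversion; unknown characters pass through."""
    longest = max(len(key) for key in HK_TO_ISO_LCS)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that "kh" wins over "k" + "h".
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in HK_TO_ISO_LCS:
                out.append(HK_TO_ISO_LCS[chunk])
                i += len(chunk)
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(hk_to_iso_lcs("khAD"))
```

Note that the lookup is case‐sensitive by necessity: in Harvard‐Kyoto, upper‐case letters carry meaning of their own, so the input cannot be case‐folded before conversion.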

What a mess!

I'm going to address what I think should be the hallmarks of a good scheme for representing Indian language text using Latin characters – how one can figure out if such a scheme was thoughtfully, carefully designed and not thrown together in the middle of a Usenet discussion.

[Note: Indian languages being phonetic, I'm sometimes careless about the phoneme vs. written character distinction. I carelessly use the word “character” for both; the meaning should be clear].

Here's my prioritised list of features a good Indian language representation scheme should have:

  1. Unambiguity: the scheme must preserve the integrity of the script. Put another way, the mapping should be reversible from Latin back to the original script without loss. [Shockingly, the Hunterian scheme fails this basic test.]
  2. Meaningfulness: the scheme must use Latin letters phonetically close to the original Indian language sound. This automatically rules out absurdities like using f for the Sanskrit velar nasal (ङ्) [vide the WX notation].
  3. Pan‐linguistic consistency: identical phonemes across Indian languages must have identical representations. Especially as Indian languages have heavily borrowed from Sanskrit, it's inconsistent if the "same" word has multiple Latin representations. Sadly, many schemes fail to satisfy this requirement: in ITRANS, the Sanskrit word केवल is represented as “kevala”, but its Tamil borrowing is represented as “kēvala”.
  4. Pan‐linguistic consistency: conversely, a single Latin representation must identify the same phoneme across Indian languages. Some schemes fail this requirement too: ḷ would mean a syllabic dental liquid in Sanskrit but could mean a retroflex liquid consonant in Tamil.
  5. Fidelity to pronunciation: the scheme should aid pronunciation, or at least, not encourage distortion. ITRANS's RRi for a syllabic alveolar trill, its x for one conjunct consonant and its GY/dny for another are all misspellings (for phonetic languages, misspellings and mispronunciations go together!)
  6. Modularity and symmetry: in Indian languages, the phonemes (and thus the characters) have relationships among them: clearly, the palatal nasal stop has a relationship with the other nasals, as well as a different relationship to the other palatal stops, and yet another relationship with palatal/semi‐palatal vowels. Any scheme may have various mechanisms to represent such features: perhaps an underdot may represent retroflexion, or perhaps doubling a Latin letter may indicate vowel‐length. These mechanisms should be modular and symmetric, i.e. they should work independent of one another, and they should always mean the same thing. If doubling the vowel ‘a’ indicates a long ‘a’ sound, doubling the vowel ‘i’ should indicate a long ‘i’ sound. If adding an underdot to ‘t’ makes it retroflex, adding an underdot to ‘s’ should make that retroflex too.
  7. Alphabet restrictions: the scheme must not use punctuation-marks or mixed-casing. Punctuation‐marks in the middle of words look ugly. ;L'ik:e t"h..i;s. As does miXEd‐CaSiNg. Using punctuation marks also collides with well‐understood orthographic usages: it seems like a shame to not be able to use commas in transliterated Tamil text, just because the scheme gives the comma a special meaning. And starting every sentence with a lower‐case letter is too edgy for my taste.
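
Requirements 1 and 6 lend themselves to a mechanical check. The following is a rough Python sketch under stated assumptions, not a full implementation: `collisions` flags any Latin spelling shared by more than one sound (which makes the reverse mapping lossy), while `retroflex` and `long_vowel` show what a modular mechanism looks like when one combining mark always means one feature. All names are hypothetical, the two‐entry tables are illustrative fragments, and they assume that Hunterian writes both the dental and the retroflex stop as plain "d".

```python
import unicodedata


def collisions(mapping):
    """Return the Latin strings that more than one source sound maps to."""
    seen = {}
    for sound, latin in mapping.items():
        seen.setdefault(latin, []).append(sound)
    return {latin: sounds for latin, sounds in seen.items() if len(sounds) > 1}


# Illustrative fragments only (assumption: Hunterian spells both stops "d",
# whereas ISO-15919 marks the retroflex one with an underdot).
HUNTERIAN = {"dental d": "d", "retroflex d": "d"}
ISO_15919 = {"dental d": "d", "retroflex d": "\u1e0d"}  # U+1E0D is d with dot below

COMBINING_DOT_BELOW = "\u0323"
COMBINING_MACRON = "\u0304"


def retroflex(base: str) -> str:
    """Mark any base letter as retroflex using the same underdot."""
    return unicodedata.normalize("NFC", base + COMBINING_DOT_BELOW)


def long_vowel(base: str) -> str:
    """Mark any vowel as long using the same macron."""
    return unicodedata.normalize("NFC", base + COMBINING_MACRON)


print(collisions(HUNTERIAN))   # non-empty: the mapping cannot be reversed
print(collisions(ISO_15919))   # empty: every spelling identifies one sound
print([retroflex(c) for c in "tds"], [long_vowel(v) for v in "aiu"])
```

An empty result from `collisions` is necessary for reversibility but not sufficient on its own: a full check would also have to confirm that multi‐letter spellings such as kh cannot be confused with a sequence of shorter ones.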

Only one of the standard schemes, ISO‐15919, fulfils six of the seven requirements completely and the seventh partially (it uses punctuation, but only to represent some combinations that cannot normally occur in the language). In addition to being well thought‐out and sane, it is an international standard and is widely used in scholarly publications. Its one drawback is that the official standard is not available free of cost; ISO charges more than $100 for an electronic copy. However, it's fully documented at Dr. Anthony Stone's website and is usable today. I use it every day, and so should you!
