2017-05-22

Case markers are a fallacy

When I studied grammar, I learnt the eight case markers (‐ஐ/‐ஆல்/‐ஓடு/‐உடன்/…). For a few years now I have been learning Sanskrit. Sanskrit has case markers too. If grammar exists to explain a language, then case markers are indispensable for explaining nouns in Sanskrit. Whether in prose or in verse, the suffixes that come at the end of a noun are just eight; through those eight, every noun varies in eight ways. It is these suffixes that are regarded as the markers of that variation.

The more I spoke that language, the more I found myself thinking about Tamil. It struck me that cases are as ill-suited to Tamil as they are well-suited to Sanskrit. I looked into it a little. As it turns out, Agastya and all the others who wrote Tamil grammars knew Sanskrit grammar. Having seen the rules of that grammar, they forced them into Tamil grammar. Dr. Harold Schiffman (ஆரொலுடு சிப்புமன்) explains this superbly in English, with many examples and much evidence.

2017-05-19

On Kogul


If you follow Tamil urban pop culture, you’ve probably heard of the ‘kogul’ moniker that describes native Tamil speakers’ pronunciation of foreign words. In Sanskrit, ‘gōkula/goːkul̪ə/’ means ‘cattle-station’, and through synecdoche also means Krishna’s cattle-shed near modern-day Vrindavan in Uttar Pradesh. The derived term ‘gokul’, pronounced /goːkul̪/ is a personal name in various Indian languages. Tamil speakers pronounce the imported word variously as /goːgul/, /koːgul/ and /goːkul/, and thus the label. ‘Kogul’ began life in web-comics and has seen usage in diverse forms – in blog titles, photo captions, and as a Twitter hashtag. A search reveals >50000 ghits.

The reason Tamil speakers pronounce voiceless and voiced consonants interchangeably has to do with Tamil phonemic rules. The specifics are unique to Tamil, but the phenomenon exists in all world languages.

Each language has a fixed set of phonemes, and various context-dependent rules around which phonemes are valid in what contexts. For instance, English allows the /h/ phoneme in every syllabic position except the final. In the initial position, /pr/, /pl/ and /tr/ clusters are allowed but not /tl/. Sanskrit words cannot end in a palatal phoneme. Sanskrit also has sandhi rules when two phonemes come together either in the middle of a word or between words. And so on.

When a language borrows a word from another language, it perforce has to adapt the phonemes into its own set of valid phonemes. This is why gairaigo words in Japanese often sound very different from the source words (e.g. /bijinesu manejimento/ from ‘business management’, or /raibaru/ from ‘rival’). Other examples include English /ˈke-chəp/ from Amoynese ‘ke2 jap1’ and the Spanish ‘chofer’ from French ‘chauffeur’. Not just loan-words, but cognates too are pronounced very differently in different child languages: compare the pronunciations of Stephen, Etienne, Esteban, Stefan and Estephanos across English, French, Spanish, German and Greek respectively.

In the Indian context, languages across India have borrowed Sanskrit words over millennia so systematically that grammarians have classified the borrowings: tatsama words are those that have retained their Sanskrit phonetics, while tadbhava are those with morphed pronunciations, like Hindi ghar/gʰəɽ/ from Sanskrit gr̥ha/gɽ̩ɦə/, pyās/pjaːs̪/ from pipāsā/pipaːs̪aː/ and lakhan/l̪əkʰən̪/ from lakṣmaṇa/l̪əkʂməɳə/ (Note the Hindi dental nasal in place of the Sanskrit retroflex nasal). Sanskrit has also borrowed words, fitting them into its phonemic scheme, e.g. pravāla/pɽəvaːl̪ə/ from the Tamil pavaṛa/pəvəɻə/, and dramiḍa/d̪ɽəmiɖə/ from the Tamil word for the language itself, tamiṛ/t̪əmiɻ/.

So there’s nothing unnatural or wrong about Tamil people pronouncing the Sanskrit-derived word ‘Gokul’ as /koːgul/. So why all this fuss?

In the 21st century, Indo-Aryan culture, centered around northern India, is on the ascendant. Bollywood movies are popular throughout urban southern India, and among the rural middle-class; there are 130 million non-native speakers of Hindi; ‘Indian food’, ‘Indian attire’ and even ‘Indian accent’ have come to mean the food, clothing styles and manner of speaking common in Hindi-speaking cities. Names like Rahul, Neha, Diya and Amar, once only found in the north of India, are now common throughout India, and are percolating into non-Hindi-speaking villages where few understand their meanings. Schwa elision, a standard feature of Hindi, is widespread in names of people, places and establishments well outside the Hindi-sphere.

At the same time, there’s renewed interest in “Sanskritic purity”: there is a popular movement to declare oneself a Sanskrit-speaker in the Indian census, there are many popular Sanskrit-quote-of-the-day web-feeds, and Carnatic music singers now mispronounce tadbhava words as if they were tatsama words. In fact, the kogul phenomenon was given prominence by a well-known Carnatic music blogger.

So, on the one hand, we have these Sanskrit purists trying to outdo one another in insisting on applying Sanskrit rules to non-Sanskrit languages. And on the other hand, North Indian cultural practices have become normative and beyond judgement. That leaves the poor Tamils, pronouncing words in their language the way they’ve always done and according to the rules their grammar prescribes. Except now their pronunciation is being judged against a standard founded on ignorance and built on snobbery.

2016-07-11

Alavandar's catuḥślōkī

yāmunācāryēṇa viracitāṁ catuḥślōkīnāmnīṁ stutiṁ mātāmahyā naikēbhyō vatsarēbhyaḥ prāgupadiṣṭōham । tasyā arthamavētuṁ yatamānaḥ anvayaṁ lilēkhiṣāmi ।

svādayanniha sarvēṣāṁ trayyantārthaṁ sudurgraham ।
stōtrayāmāsa yōgīndrastaṁ vandē yāmunāhvayam ॥

kāntastē puruṣōttamaḥ phaṇipatiḥ śayyāsanaṁ vāhanaṁ vēdātmā vihagēśvarō yavanikā māyā jaganmōhinī ।
brahmēśādisuravrajaḥ sadayitastvaddāsadāsīgaṇaḥ śrīrityēva ca nāma tē bhagavati brūmaḥ kathaṁ tvāṁ vayam ॥

bhagavati । tē kāntaḥ puruṣōttamaḥ । tē śayyā phaṇipatiḥ । tē āsanaṁ vāhanaṁ vēdātmā vihagēśvaraḥ । tē yavanikā māyā jaganmōhinī । tvaddāsadāsīgaṇaḥ sadayitaḥ brahmēśādisuravrajaḥ । tē nāma ca śrīrityēva । kathaṁ vayaṁ tvāṁ brūmaḥ ।

atra tu śriyaṁ mahārājñīṁ bhāvayan śrēṣṭhatamāni tadupalakṣaṇāni varṇayati kaviḥ । bhagavati । hē sarvamaṅgalavati । "asyāsti" iti matup । kāntaḥ priyatamō dayitaḥ । kartari ktaḥ । puruṣōttamaḥ uttamaḥ puruṣō nārāyaṇaḥ । "na nirdhāraṇē" ityasmātṣaṣṭhītatpuruṣaniṣēdhātkarmadhārayavyutpattiḥ । śayyā talpam । phaṇipatiḥ phaṇināṁ sarpāṇāṁ patiḥ ādiśēṣaḥ । phaṇaḥ sphaṭā । "ata iniṭhanau" iti matubartha in । āsanamāsandaḥ āsyata asminniti । vāhanaṁ yānaṁ vāhayatīti । vihagēśvarō vihagānāṁ khagānāmīśvarō garuḍaḥ । lakṣmīstu viṣṇuvakṣaḥsthalasthitā nityānapāyinī garutmati āstē tēna vāhyatē cēti yāvat । parasparaṁ virōdhēpi lakṣmīnārāyaṇau śēṣaṁ garuḍamubhāvapi bhuñjāta iti vicitrāspadam । yavanikā tiraskariṇī । purā rājastriyō hriyā yavanikāntarhitā iti । jagatō mōhaḥ jaganmōhaḥ । "ata iniṭhanau" iti matubartha in strītvāt ṅī ca । māyā jaganmōhinī lōkānmōhayati yathā tē mahālakṣmīṁ draṣṭuṁ nālam । brahmā caturmukhaḥ īśō rudraḥ । tau ādī yasya sa brahmēśādiḥ । tathāvidhaḥ suravrajō dēvasamūhaḥ । sadayitaḥ sapriyaḥ patnībhiḥ sahitaḥ । mahārājñyā dāsadāsyaḥ kvēti cēt gaṇōyamēvēti । sarvāṇi rājalakṣaṇānyupavarṇya antē nāmaiva anupamamityāha । pāñcarātrāgamāt śrīyatē śrayatē śruṇōti śrāvayati śruṇāti śrīṇāti cēti śrīnāmadhēyā । tathā satyāṁ tvayi kathaṁ vā brūmō vayamiti kavērāścaryaṁ bhaktiśca ।

yasyāstē mahimānamātmana iva tvadvallabhōpi prabhurnālaṁ mātumiyattayā niravadhiṁ nityānukūlaṁ svataḥ ।
tāṁ tvāṁ dāsa iti prapanna iti ca stōṣyāmyahaṁ nirbhayō lōkaikēśvari lōkanāthadayitē dāntē dayāṁ tē vidan ॥

lōkaikēśvari । lōkanāthadayitē । dāntē । yasyāḥ tē niravadhiṁ svatō nityānukūlaṁ mahimānam ātmanaḥ (mahimānam) iva tvadvallabhaḥ prabhurapi iyattayā mātuṁ na alam । tāṁ tvāṁ dāsaḥ iti prapannaścēti nirbhayōhaṁ tē dayāṁ vidan stōṣyāmi ।

kaviratra dēvyā anantaṁ mahimānaṁ varṇayannāha । lōkaikēśvari । lōkānāṁ sarvajanānām ēkā kēvalā īśavarī īśitrī sambuddhau । ēkōlpārthē pradhānē ca prathamē kēvalē tathā । sādhāraṇē samānēpi saṅkhyāyāṁ ca prayujyatē । lōkanāthadayitē । jagatpatipriyē । dāntē । damavati । damum̐ upaśamē ityasmāt ktaḥ kartari । "anunāsikasya kvijjhalōḥ kṅiti" iti upadhādairghyam । niravadhim avadhirahitam aparimitam । svataḥ lōkanāthāt tvadvallabhāt nāma śrīmannārāyaṇataḥ nityānukūlaṁ sadānusāram । mahimānaṁ mahattvam । ātmanaḥ svasya tannāma śrīpatēḥ । tvadvallabhastava priyatamaḥ । prabhuḥ svāmī punaśca bhagavān । iyattayā ētāvaditi । mātuṁ tōlayitum । na alam aparyāptaḥ । yasyāḥ tādr̥k mahimā yaṁ mātuṁ bhagavānapi aparyāptaḥ tādr̥śāṁ tāṁ tvāṁ stōṣyāmi nutiṁ kariṣyē । kathaṁbhūtaḥ stōṣyāmīti cēt । dāsaḥ kiṅkaraḥ । prapannaḥ śaraṇaṁ gataḥ । nirbhayaḥ apagatatrāsaḥ । kimarthaṁ nirbhaya iti cēt । bhavatyāḥ dayāṁ karuṇāṁ vidan jānānaḥ । mahālakṣmyāḥ apārāṁ karuṇāṁ samyak avabudhyan taddayayā bhavasāgaraṁ tarīṣyāmīti viśvasan vigatabhayaḥ stōṣyāmīti abhiprāyaḥ ।

īṣattvatkaruṇānirīkṣaṇasudhāsandhukṣaṇādrakṣyatē naṣṭaṁ prāktadalābhatastribhuvanaṁ sampratyanantōdayam ।
śrēyō na hyaravindalōcanamanaḥkāntāprasādādr̥tē saṁsr̥tyakṣaravaiṣṇavādhvasu nr̥ṇāṁ sambhāvyatē karhicit ॥

tribhuvanam anantōdayaṁ rakṣyatē īṣat tvatkaruṇānirīkṣaṇasudhāsandhukṣaṇāt prāk naṣṭaṁ tadalābhataḥ । nr̥ṇāṁ saṁsr̥tyakṣaravaiṣṇavādhvasu śrēyaḥ na hi sambhāvyatē aravindalōcanamanaḥkāntāprasādāt r̥tē ।

atha śriyaḥ anantā karuṇā taddhētūni phalāni ca varṇyantē । tribhuvanaṁ lōkatrayaṁ bhūrbhuvaḥsvargātmakam । ēkavadbhāvī dviguḥ trayāṇāṁ bhuvanānāṁ lōkānāṁ samāhāra iti । anantōdayaṁ anantāḥ udayāḥ yasya tādr̥k tribhuvanaṁ niravadhikālamiti । prāk naṣṭaṁ caturyugāntē pralīnaṁ parabrahmaṇi । kvēti cēt । tadalābhataḥ । tvaddayākaṭākṣam anavāpya । "atigrahāvyathanakṣēpēṣvakartari tr̥tīyāyāḥ" iti tasim̐ । sarvanāmnaḥ tvatkaruṇānirīkṣaṇasudhāsandhukṣaṇēnvayaḥ । prāṅnaṣṭamadhunā rakṣyatē bhujyatē । kuta iti । tvatkaruṇānirīkṣaṇasudhāsandhukṣaṇāt । pañcamī hētau । tava karuṇāyāḥ dayāyāḥ nirīkṣaṇamēva dr̥ṣṭirēva sudhā amr̥tam । tasyāḥ sandhukṣaṇaṁ sphūrtiḥ । tasmāt tasya hētunā । jaganti varīvr̥tati tasyāḥ karuṇayā cēt puruṣārthā api kiṁ tasyāḥ dayayā labhyēran । āmityāha yathā । aravindalōcanamanaḥkāntāprasādāt । aravindē tāmarasē iva lōcanē yasya saḥ nārāyaṇaḥ । tasya kāntā sahadharmiṇī śrīḥ । tasyā manaḥ cētaḥ । tatprasādaḥ tōṣāt anugrahaḥ । tasmādr̥tē taṁ vinā nēti । r̥tē kiṁ na । nr̥ṇāṁ manuṣyāṇāṁ saṁsr̥tau saṁsārasambaddhēṣu aiśvaryādiṣu phalēṣu akṣarē kaivalyamōkṣē vaiṣṇavādhvani paramapadaprāptau śrēyaḥ bhadraṁ na hi sambhāvyatē upapadyatē karhicit kadācana । kimuta itarē puruṣārthāḥ mōkṣōpi dēvyāḥ karuṇākaṭākṣāvalambita ityabhipraiti ।

śāntānandamahāvibhūti paramaṁ yadbrahma rūpaṁ harērmūrtaṁ brahma tatōpi tatpriyataraṁ rūpaṁ yadatyadbhutam ।
yānyanyāni yathāsukhaṁ viharatō rūpāṇi sarvāṇi tānyāhuḥ svairanurūparūpavibhavairgāḍhōpagūḍhāni tē ॥

yat śāntānandamahāvibhūti paramaṁ brahma rūpaṁ । tatōpi tat priyataraṁ (rūpaṁ) yat mūrtaṁ brahma atyadbhutam । yathāsukhaṁ viharataḥ harēḥ yāni anyāni rūpāṇi । tāni sarvāṇi tē svairanurūparūpavibhavaiḥ gāḍhōpagūḍhāni āhuḥ (pramāṇāḥ) ।

atha divyadampatyōranapāyitvam upavarṇayati । nārāyaṇasya yāni rūpāṇi mūrtayaḥ santi । kīdr̥ṁśi । prathamaṁ divyātmasvarūpam । śāntānandamahāvibhūti । śāntaṁ ṣaḍbhirūrmibhiḥ pipāsākṣucchōkamōhajarāmr̥tyubhī rahitam । ānandam jñānānandamayam । mahāvibhūti sarvavyāpi । tathā śāntam ānandaṁ mahāvibhūti śāntānandamahāvibhūtīti । paramaṁ parō māsyēti paramaṁ nirupamam । brahma br̥hat br̥ṁhaṇaṁ ca । adhunā divyamaṅgalavigrahaḥ । mūrtaṁ mūrtimat brahma brahmaśabdavācyam atyadbhutamatyāścaryakaraṁ tatōpi amūrtarūpādapi priyataraṁ lōkānāṁ saulabhyāt yadrūpaṁ tat । harirviṣṇuḥ yathāsukhaṁ yathēcchaṁ viharati parikrāmati । tathābhūtasya yāni anyāni rūpāṇi syuḥ । vibhavādimūrtīnāṁ grahaṇamatra । āhuḥ pramāṇā iti śēṣaḥ । kimāhuḥ । tāni sarvāṇi rūpāṇi tē tava svaiḥ svakīyaiḥ anurūparūpavibhavaiḥ anurūpāṇām anukūlānāṁ rūpāṇāṁ mūrtīnāṁ vibhavaiḥ kīrtimadbhiḥ gāḍhōpagūḍhāni gāḍham dr̥ḍham upagūḍhāni āliṅgitāni anapāyīni ।

ākāratrayasampannāmaravindanivāsinīm ।
aśēṣajagadīśitrīṁ vandē varadavallabhām ॥

sarvaṁ śrīkr̥ṣṇārpaṇamastu ॥

2014-03-27

A custom keymap for Indian languages

As we saw in the last couple of posts, keying in Indian languages using a QWERTY keyboard requires keyboard/IME software as well as a standardised way to map the Latin alphabet to the characters in the Indian language du jour. As before, I use Google's Input Tools on Windows and Lipika on OS X. Unlike a representation format (which can use diacritics or other accent marks), a key-map can only employ characters that can be typed on a QWERTY keyboard. So while I use ISO-15919 as the representation format, I needed a key-map as well. As in the previous post, here were my requirements:
  1. Meaningfulness
  2. Pan‐linguistic consistency
  3. Fidelity to pronunciation
  4. Modularity and symmetry
  5. Alphabet restrictions: the scheme must use only Latin characters to represent phonemes; the scheme may use punctuation marks to represent non-phonemic punctuation-like characters in the target language.
With these requirements, I set about to create a key-map I could use. I'd start with my requirements, and in the end, if the key-map ended up resembling an existing "standard", I'd just stick with that instead.

I started out by identifying characters in Tamil and Sanskrit (the 2 Indian languages I write in) based on phonetics and history; this identification process is important for pan-linguistic consistency.

Vowels and Dependents
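| Sanskrit (Devanagari) | ISO‐15919 | Tamil | Key-map |
| अ | a | அ |  |
| आ | ā | ஆ |  |
| इ | i | இ |  |
| ई | ī | ஈ |  |
| उ | u | உ |  |
| ऊ | ū | ஊ |  |
| ऋ | r̥ |  |  |
| ॠ | r̥̄ |  |  |
| ऌ | l̥ |  |  |
| ॡ | l̥̄ |  |  |
|  | e | எ |  |
| ए | ē | ஏ |  |
| ऐ | ai | ஐ |  |
|  | o | ஒ |  |
| ओ | ō | ஓ |  |
| औ | au | ஔ |  |
| ं (anusvāra) | ṁ |  |  |
| ँ (anunāsika) | m̐ |  |  |
| ः (visarga) | ḥ |  |  |
| (upadhmānīya) | ḫ |  |  |
| (jihvāmūlīya) | ẖ |  |  |
|  | ḵ | ஃ (āythayeṛuttu) |  |
|  | ' |  |  |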

Consonants
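| Sanskrit (Devanagari) | ISO‐15919 [1] | Tamil | Key-map |
| क् | k | க் |  |
| ख् | kh |  |  |
| ग् | g |  |  |
| घ् | gh |  |  |
| ङ् | ṅ | ங் |  |
| च् | c | ச் |  |
| छ् | ch |  |  |
| ज् | j |  |  |
| झ् | jh |  |  |
| ञ् | ñ | ஞ் |  |
| ट् | ṭ | ட் |  |
| ठ् | ṭh |  |  |
| ड् | ḍ |  |  |
| ढ् | ḍh |  |  |
| ण् | ṇ | ண் |  |
|  | ṯ | ற் |  |
|  | ṉ | ன் |  |
| त् | t | த் |  |
| थ् | th |  |  |
| द् | d |  |  |
| ध् | dh |  |  |
| न् | n | ந் |  |
| प् | p | ப் |  |
| फ् | ph |  |  |
| ब् | b |  |  |
| भ् | bh |  |  |
| म् | m | ம் |  |
| य् | y | ய் |  |
| र् | r | ர் |  |
|  | ṛ | ழ் |  |
| ळ् | ḷ | ள் |  |
|  |  | ல் |  |
| ल् | l |  |  |
| व् | v | வ் |  |
| श् | ś |  |  |
| ष् | ṣ |  |  |
| स् | s |  |  |
| ह् | h |  |  |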

The next step was filling in the key-combinations that were "natural" and "obvious".
  1. Given the existence of short and long vowels, using lower- and upper-case letters for vowels seems natural.
  2. Naturally, any unmarked consonant in ISO-15919 can be mapped to the bare letter.
  3. Representing retroflexion by upper-casing the corresponding dental consonant is standard-practice. By modularity, we can do the same for liquids and sibilants too.
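
The three "obvious" rules above are mechanical enough to sketch in code. This is only an illustration, not the tool used to build the key-map: the function name `natural_key` is made up for this example, and it assumes ISO-15919 marks long vowels with a macron and retroflexion with an underdot. Anything the three rules don't cover (the nasals ṅ and ñ, the sibilant ś, the syllabic vowels) returns `None` and is left for a case-by-case decision.

```python
import unicodedata

MACRON = "\u0304"      # combining macron: long vowel in ISO-15919
DOT_BELOW = "\u0323"   # combining dot below: retroflexion in ISO-15919


def natural_key(iso):
    """Return the 'obvious' key sequence for an ISO-15919 symbol,
    or None if none of the three rules applies."""
    # Decompose so that "ā" becomes "a" + combining macron, etc.
    decomposed = unicodedata.normalize("NFD", iso)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = {ch for ch in decomposed if unicodedata.combining(ch)}

    if not marks:
        return base                          # rule 2: unmarked -> bare letter(s)
    if marks == {MACRON} and len(base) == 1 and base in "aeiou":
        return base.upper()                  # rule 1: long vowel -> upper case
    if marks == {DOT_BELOW}:
        return base[0].upper() + base[1:]    # rule 3: retroflex -> upper case
    return None                              # needs a hand-picked key


for symbol in ["a", "ā", "kh", "ṭ", "ṭh", "ṣ", "ñ"]:
    print(symbol, "->", natural_key(symbol))
```

Because rule 3 only looks at the underdot, it applies to stops, liquids and sibilants alike, which is the modularity the third point asks for.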

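| Sanskrit (Devanagari) | ISO‐15919 | Tamil | Key-map |
| अ | a | அ | a |
| आ | ā | ஆ | A |
| इ | i | இ | i |
| ई | ī | ஈ | I |
| उ | u | உ | u |
| ऊ | ū | ஊ | U |
| ऋ | r̥ |  |  |
| ॠ | r̥̄ |  |  |
| ऌ | l̥ |  |  |
| ॡ | l̥̄ |  |  |
|  | e | எ | e |
| ए | ē | ஏ | E |
| ऐ | ai | ஐ | ai |
|  | o | ஒ | o |
| ओ | ō | ஓ | O |
| औ | au | ஔ | au |
| ं (anusvāra) | ṁ |  |  |
| ँ (anunāsika) | m̐ |  |  |
| ः (visarga) | ḥ |  |  |
| (upadhmānīya) | ḫ |  |  |
| (jihvāmūlīya) | ẖ |  |  |
|  | ḵ | ஃ (āythayeṛuttu) |  |
|  | ' |  |  |
| क् | k | க் | k |
| ख् | kh |  | kh |
| ग् | g |  | g |
| घ् | gh |  | gh |
| ङ् | ṅ | ங் |  |
| च् | c | ச் | c |
| छ् | ch |  | ch |
| ज् | j |  | j |
| झ् | jh |  | jh |
| ञ् | ñ | ஞ் |  |
| ट् | ṭ | ட் | T |
| ठ् | ṭh |  | Th |
| ड् | ḍ |  | D |
| ढ् | ḍh |  | Dh |
| ण् | ṇ | ண் | N |
|  | ṯ | ற் |  |
|  | ṉ | ன் |  |
| त् | t | த் | t |
| थ् | th |  | th |
| द् | d |  | d |
| ध् | dh |  | dh |
| न् | n | ந் | n |
| प् | p | ப் | p |
| फ् | ph |  | ph |
| ब् | b |  | b |
| भ् | bh |  | bh |
| म् | m | ம் | m |
| य् | y | ய் | y |
| र् | r | ர் | r |
|  | ṛ | ழ் |  |
| ळ् | ḷ | ள் | L |
|  |  | ல் |  |
| ल् | l |  | l |
| व् | v | வ் | v |
| श् | ś |  |  |
| ष् | ṣ |  | S |
| स् | s |  | s |
| ह् | h |  | h |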

6 issues remain: Dravidian alveolar consonants, the Dravidian approximant, Sanskrit nasals, Sanskrit sibilants, Sanskrit syllabic vowels, and miscellaneous rarely used dependents.
  1. Dravidian alveolar consonants: from the point of view of tongue-position, alveolar stops are intermediate between dental stops and retroflex stops. From this, a natural choice of key-combination for an alveolar stop is a juxtaposition of the keys for the corresponding dental and retroflex stops. Likewise for the alveolar liquid ல்.
  2. Dravidian approximant: based on usage, I picked 'z' as the key for the approximant ழ். The fact that non-native speakers mispronounce the approximant as a voiced sibilant adds credibility to this choice :-)
  3. Sanskrit nasals and sibilants: there are 2 remaining nasals (ङ्, ञ्) and one remaining sibilant (श्). The palatal nasal is both a palatal stop and a nasal; a natural representation combines the nasality of 'n' with the palatalness of 'j' or 'c'; we thus get 'nj' and 'nc' as possible key-combinations. By correspondence, the palatal sibilant श् is 'sc' or 'sj', and the velar nasal ङ् 'nk' or 'ng'.
Looks like the consonants are done! Here they are:
Consonants
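| Sanskrit (Devanagari) | ISO‐15919 | Tamil | Key-map |
| क् | k | க் | k |
| ख् | kh |  | kh |
| ग् | g |  | g |
| घ् | gh |  | gh |
| ङ् | ṅ | ங் | nk/ng |
| च् | c | ச் | c |
| छ् | ch |  | ch |
| ज् | j |  | j |
| झ् | jh |  | jh |
| ञ् | ñ | ஞ் | nc/nj |
| ट् | ṭ | ட் | T |
| ठ् | ṭh |  | Th |
| ड् | ḍ |  | D |
| ढ् | ḍh |  | Dh |
| ण् | ṇ | ண் | N |
|  | ṯ | ற் | tT/Tt |
|  | ṉ | ன் | nN/Nn |
| त् | t | த் | t |
| थ् | th |  | th |
| द् | d |  | d |
| ध् | dh |  | dh |
| न् | n | ந் | n |
| प् | p | ப் | p |
| फ् | ph |  | ph |
| ब् | b |  | b |
| भ् | bh |  | bh |
| म् | m | ம் | m |
| य् | y | ய் | y |
| र् | r | ர் | r |
|  | ṛ | ழ் | z |
| ळ् | ḷ | ள் | L |
|  |  | ல் | lL/Ll |
| ल् | l |  | l |
| व् | v | வ் | v |
| श् | ś |  | sc/sj |
| ष् | ṣ |  | S |
| स् | s |  | s |
| ह् | h |  | h |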
  4. Sanskrit syllabic vowels: The Sanskrit syllabic vowels (ऋ, ॠ, ऌ, ॡ – the last one not actually used) present a problem. The mid-central vowel inherent in these is absent in European languages and thus lacks a symbol; it can however be described as mid-way between 'y' and 'w'. 'y' is already used up in our scheme, but 'w' is free! Using 'w' also ensures people don't mispronounce it as a front-vowel. We thus get 'rw', 'Rw', 'lw' and 'Lw' respectively.
  5. Misc. dependent letters: There are a few different dependent letters that can only exist attached to a vowel: the anusvāra, the anunāsika, the visarga and its two other forms, the jihvāmūlīya and the upadhmānīya, and the āythayeṛuttu. The anusvāra is traditionally represented by an 'M', and the anunāsika by 'MM'; we can stick with those. The visarga, likewise, is an 'H'. The upadhmānīya is closest to the Latin 'f', and we can use that. The jihvāmūlīya and the āythayeṛuttu are both velar/glottal and as such 'K' is the most suitable.
We finally have a complete key-map for vowels and dependents! Here it is:
Vowels and Dependents
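| Sanskrit (Devanagari) | ISO‐15919 | Tamil | Key-map |
| अ | a | அ | a |
| आ | ā | ஆ | A |
| इ | i | இ | i |
| ई | ī | ஈ | I |
| उ | u | உ | u |
| ऊ | ū | ஊ | U |
| ऋ | r̥ |  | rw |
| ॠ | r̥̄ |  | Rw |
| ऌ | l̥ |  | lw |
| ॡ | l̥̄ |  | Lw |
|  | e | எ | e |
| ए | ē | ஏ | E |
| ऐ | ai | ஐ | ai |
|  | o | ஒ | o |
| ओ | ō | ஓ | O |
| औ | au | ஔ | au |
| ं (anusvāra) | ṁ |  | M |
| ँ (anunāsika) | m̐ |  | MM |
| ः (visarga) | ḥ |  | H |
| (upadhmānīya) | ḫ |  | f |
| (jihvāmūlīya) | ẖ |  | K |
|  | ḵ | ஃ (āythayeṛuttu) | K |
|  | ' |  | ' |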
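
To show how such a key-map behaves when you actually type, here is a minimal sketch of a greedy longest-match converter from typed keys to ISO-15919. It is not the downloadable keymap and not how Google Input Tools or Lipika work internally; `KEYMAP` and `to_iso15919` are names invented for this example, the dictionary covers only part of the tables above, and where the tables offer alternatives ('nk/ng', 'nc/nj', 'sc/sj') it arbitrarily picks one.

```python
# Partial key-map: typed key sequence -> ISO-15919 (hypothetical subset).
KEYMAP = {
    # vowels
    "a": "a", "A": "ā", "i": "i", "I": "ī", "u": "u", "U": "ū",
    "rw": "r\u0325", "Rw": "r\u0325\u0304",   # syllabic r, short and long
    "e": "e", "E": "ē", "ai": "ai", "o": "o", "O": "ō", "au": "au",
    # dependents
    "M": "ṁ", "H": "ḥ",
    # consonants
    "k": "k", "kh": "kh", "g": "g", "gh": "gh", "ng": "ṅ",
    "c": "c", "ch": "ch", "j": "j", "jh": "jh", "nj": "ñ",
    "T": "ṭ", "Th": "ṭh", "D": "ḍ", "Dh": "ḍh", "N": "ṇ",
    "t": "t", "th": "th", "d": "d", "dh": "dh", "n": "n",
    "p": "p", "ph": "ph", "b": "b", "bh": "bh", "m": "m",
    "y": "y", "r": "r", "l": "l", "L": "ḷ", "v": "v",
    "sc": "ś", "S": "ṣ", "s": "s", "h": "h",
    "z": "ṛ",                                  # the Dravidian approximant
}
MAX_KEY_LEN = max(len(key) for key in KEYMAP)


def to_iso15919(typed):
    """Convert typed keys to ISO-15919, always taking the longest key
    sequence that matches at the current position."""
    out = []
    i = 0
    while i < len(typed):
        for size in range(MAX_KEY_LEN, 0, -1):
            chunk = typed[i:i + size]
            if chunk in KEYMAP:
                out.append(KEYMAP[chunk])
                i += size
                break
        else:
            out.append(typed[i])   # spaces, punctuation: pass through
            i += 1
    return "".join(out)


print(to_iso15919("krwSNa"))
print(to_iso15919("tamiz"))
```

Longest-match is what lets 'Th' and 'rw' win over 'T' + 'h' and 'r' + 'w'. Since the key-map avoids punctuation inside words, ordinary commas and full stops pass through untouched.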

You can download the keymap for Tamil and Sanskrit from http://code.ambari.sh/keymap.

Footnotes:

1 Unfortunately, ISO-15919 does not distinguish between alveolar and dental liquids; Tamil has only the former, while Sanskrit only the latter. As such, I've had to make a few minor modifications to ISO-15919, where ற and ல are concerned. Thanks to Greg for pointing this out in the comments.

2014-03-02

Representing Indian languages using the Latin script

Indian languages are written in a diverse set of scripts, most of them derived from the Brahmi script and not from the Phoenician script. These scripts neither look like Latin, nor do they have the familiar A, B, … ordering. Further, many Indian languages have many phonemes not present in languages Latin was traditionally used for. As a consequence, many of these scripts have many more than 26 base glyphs (not to mention ligature forms). Mapping these to Latin characters becomes important for 2 distinct uses:
  1. storing/presenting Indian language content
  2. inputting Indian language content.
While Unicode encodes most popular scripts used for Indian languages and even some rare ones, Latin‐letters continue to be used in representing Indian languages everywhere. They're used in email, in SMS messages, in web‐pages, in file‐names; pretty much ubiquitously. But how to map the diverse phoneme set (between 30–50 for most Indian languages) into the 26 letters in the Latin alphabet?

[Likewise, physical keyboards with a QWERTY layout dominate the world; how to allow combinations of characters on the QWERTY keyboard to represent the diverse character sets in Indian languages? I'll address this in another post.]

As usual, there are many options.

"The nice thing about standards is that you have so many to choose from." Andrew S. Tanenbaum

ISO‐15919, an international scholastic standard, and a few other schemes – IAST, Hunterian, National Library of Kolkata, ALA‐LC – use diacritic (accent) marks over/under Latin characters. Harvard‐Kyoto, Velthuis, ITRANS, SLP1, WX, VedaType and ISO‐15919's limited character set option are schemes that restrict themselves to 7‐bit ASCII but use punctuation characters.

For example, here're the same Sanskrit characters in a few sample schemes:
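| Devanagari | ISO‐15919 | Hunterian | ISO‐15919‐lcs | Harvard‐Kyoto | ITRANS |
| आ | ā | ā | aa | A | aa/A |
| ऋ | r̥ | ri | ,r | R | RRi/R^i |
| ए | ē | e | ee | e | e |
| ं (anusvāra) | ṁ | m | ;m | M | M |
| ख् | kh | kh | kh | kh | kh |
| ञ् | ñ | n | ~n | J | ~n/JN |
| ड् | ḍ | d | .d | D | D |
| श् | ś | sh | ;s | z | sh |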

What a mess!

I'm going to address what I think should be the hallmarks of a good scheme for representing Indian language text using Latin characters – how one can figure out if such a scheme was thoughtfully, carefully designed and not thrown together in the middle of a Usenet discussion.

[Note: Indian languages being phonetic, I'm sometimes careless about the phoneme vs. written character distinction. I carelessly use the word “character” for both; the meaning should be clear].

Here's my prioritised list of features a good Indian language representation scheme should have:

  1. Unambiguity: the scheme must preserve the integrity of the script. Put another way, the mapping should be reversible from Latin back to the original script without loss. [Shockingly, the Hunterian scheme fails this basic test.]
  2. Meaningfulness: the scheme must use Latin letters phonetically close to the original Indian language sound. This automatically rules out absurdities like using f for the Sanskrit velar nasal (ङ्) [vide. the wx notation].
  3. Pan‐linguistic consistency: identical phonemes across Indian languages must have identical representations. Especially as Indian languages have heavily borrowed from Sanskrit, it's inconsistent if the "same" word has multiple Latin representations. Sadly, many schemes fail to satisfy this requirement: in ITRANS, the Sanskrit word केवल is represented as “kevala”, but its Tamil borrowing is represented as “kēvala”.
  4. Pan‐linguistic consistency: conversely, a single Latin representation must identify the same phoneme across Indian languages. Again, some schemes fail this requirement too — in some schemes, ḷ would mean a syllabic dental liquid in Sanskrit but could mean a retroflex liquid consonant in Tamil.
  5. Fidelity to pronunciation: the scheme should aid pronunciation, or at least, not encourage distortion. ITRANS's RRi for a syllabic alveolar trill, its x for a conjunct consonant and its GY/dny for another conjunct consonant are all misspellings (for phonetic languages, misspelling and mispronunciations go together!)
  6. Modularity and symmetry: in Indian languages, the phonemes (and thus the characters) have relationships among them: clearly, the palatal nasal stop has a relationship with the other nasals, as well as a different relationship to the other palatal stops, and yet another relationship with palatal/semi‐palatal vowels. Any scheme may have various mechanisms to represent such features: perhaps an underdot may represent retroflexion, or perhaps doubling a Latin letter may indicate vowel‐length. These mechanisms should be modular and symmetric, i.e. they should work independent of one another, and they should always mean the same thing. If doubling the vowel ‘a’ indicates a long ‘a’ sound, doubling the vowel ‘i’ should indicate a long ‘i’ sound. If adding an underdot to ‘t’ makes it retroflex, adding an underdot to ‘s’ should make that retroflex too.
  7. Alphabet restrictions: the scheme must not use punctuation-marks or mixed-casing. Punctuation‐marks in the middle of words look ugly. ;L'ik:e t"h..i;s. As does miXEd‐CaSiNg. Using punctuation marks also collides with well‐understood orthographic usages: it seems like a shame to not be able to use commas in transliterated Tamil text, just because the scheme gives the comma a special meaning. And starting every sentence with a lower‐case letter is too edgy for my taste.
Only one of the standard schemes fulfils 6 of the 7 requirements completely and the 7th partially (it uses punctuation, but only to represent some combinations that cannot normally occur in the language): ISO‐15919. In addition to being well thought‐out and sane, it is an international standard and is widely used in scholastic publications. Its one drawback is that the official standard is not available free of cost; instead, ISO charges more than $100 for an electronic copy. However, it's fully documented at Dr. Anthony Stone's website and is usable today. I use it every day, and so should you!
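
The first requirement, unambiguity, is easy to test mechanically. The sketch below is only an illustration: `SAMPLE` and `collisions` are names made up here, and the data is four characters only (ञ् and ड् from the sample table, plus the dentals न् and द्, which both schemes write as n and d). A scheme that sends two source characters to the same Latin string cannot be reversed. Passing this check is necessary but not sufficient, since sequences can also collide (e.g. 'ai' versus 'a' followed by 'i').

```python
from collections import defaultdict

# Hypothetical four-character sample: Devanagari -> Latin, per scheme.
SAMPLE = {
    "ISO-15919": {"ञ्": "ñ", "न्": "n", "ड्": "ḍ", "द्": "d"},
    "Hunterian": {"ञ्": "n", "न्": "n", "ड्": "d", "द्": "d"},
}


def collisions(mapping):
    """Return {latin: [source characters]} for every Latin string that
    more than one source character maps to."""
    by_latin = defaultdict(list)
    for source, latin in mapping.items():
        by_latin[latin].append(source)
    return {latin: sources for latin, sources in by_latin.items()
            if len(sources) > 1}


for scheme, mapping in SAMPLE.items():
    clashes = collisions(mapping)
    verdict = "reversible on this sample" if not clashes else "ambiguous"
    print(scheme, "->", verdict, clashes)
```

On this sample, ISO‐15919 has no clashes, while Hunterian folds the palatal and dental nasals into 'n' and the retroflex and dental stops into 'd'.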

2014-01-04

On keying in Indian languages

Right from the time I started out learning Sanskrit, I've had to type in Devanagari to send email, do my assignments, write blog-posts etc. Sometime later, I realised I'd like to be able to write in my mother-tongue, Tamil, too: I was rapidly becoming more fluent at writing in Sanskrit than in Tamil, and that bothered me! I used to use Windows and now use OS X, and have used a variety of tools to be able to compose in Indic scripts. I thought I'd write about my journey, what tools I loved, used, hated and discarded and what I now use.

In 2008, to begin with, my requirements were: types Devanagari, works on Windows Vista+, works across all applications (no browser plug-in for me) and has configurable key-maps. That last was very important. There were multiple competing ways to render Sanskrit in ASCII (and corresponding ways to type Devanagari on a US-ASCII keyboard): Harvard-Kyoto, IAST, ITRANS, etc. They had nothing in common, except for one thing: they all sucked. [I'll explain why in a follow-up post.] So I needed to use a keymap I liked, one I created myself.

I started out using Ajit Krishnan's wonderful Mudgala IME. It was very straight-forward to use, and supported custom keyboards. Switching to and out of it was so easy, and it never crashed once in more than a year of using it. But 2 things made me want moar: first, Mudgala IME didn't support Tamil very well. Secondly, Mudgala IME worked by installing a low-level keyboard-hook. Windows has a way to support software keyboards and Input Method Editors so that the keyboard/IME would show up in the Language Bar, and other Good Things™ would happen. Mudgala IME instead worked at a much lower level by intercepting every key-stroke, translating it to the corresponding Devanagari code-point, and inserting it into the current window. This approach had a couple of drawbacks: as above, it didn't integrate well with Windows or with the keyboard shortcuts Windows provides to switch input keyboards. Worse, the keyboard shortcut Mudgala provided to switch to and out of it was a global shortcut and interfered with other software's use of that shortcut. If you didn't understand the above, suffice it to say that its integration with Windows is non-standard :-)

In 2009, Google came out with its Indic language input tools; this has now become Google Input Tools for Windows. You had to install a separate piece of software for each language you needed to type in. It did support custom keyboards, but you had to switch to your custom keyboard explicitly, and worse, you had to do that every time you switched languages. Worst of all, if you had multiple custom keyboards per language (say, a keyboard to output Devanagari and one to output ISO-15919), you had to use the GUI to switch keyboards each time. Kind of bad UX, but it worked OK otherwise. There were also occasional bugs involving typing too soon after switching to the Google IME: the thing needed a few seconds to initialise, and it would swallow key-strokes till then!

In 2013, Google released an update that unified the various language-specific IMEs: you now had to install just one piece of software, and pick various languages at installation-time. Unhappily, the documentation on setting up custom keyboards was never updated: the page refers to a Scheme directory of C:/Program Files/Google/Google [Language] Input/, but the actual directory is:

%ProgramData%/Google/Google Input Tools/com.google.input_tools.t13n.ime.{language-name}/schemes/
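
If you'd rather compute that directory than type it, here is a small sketch. The helper `scheme_dir` is hypothetical, the default `C:\ProgramData` is only the usual location of %ProgramData% (yours may differ), and the language name `tamil` below is a guess at what goes into the `{language-name}` slot, not something I've checked against an installation.

```python
from pathlib import PureWindowsPath


def scheme_dir(language_name, program_data=r"C:\ProgramData"):
    """Build the custom-scheme directory for one language.

    program_data is assumed to be the expansion of %ProgramData%;
    language_name fills the {language-name} slot in the path above.
    """
    return (PureWindowsPath(program_data)
            / "Google"
            / "Google Input Tools"
            / f"com.google.input_tools.t13n.ime.{language_name}"
            / "schemes")


print(scheme_dir("tamil"))   # "tamil" is a placeholder language name
```

`PureWindowsPath` only manipulates the path as text, so the snippet runs on any OS; it doesn't check that the directory exists.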

The other bugs remain, but this is, in my experience, the most solid and clean Indic typing solution on Windows today. I use it to type Sanskrit (in Devanagari, Grantha, IPA, ISO-15919) and Tamil (in Vattu, IPA, ISO-15919) and it's never let me down.

Lately, though, I've switched from my trusty Thinkpad to my wife's old MacBook Air, and as soon as I switched, I realised there was a paucity of good IMEs on OS X. Happily, my friend Ranganath Atreya has come up with Lipika, an IME that also supports the same custom-keyboard format Google IME does (yay for compatibility!). I was one of the beta-testers, and I love it.

At work, I use Windows and don't want to install software; when I occasionally want to send email, I use Google Input Tools online. It's pretty Google-y: barebones but efficient.

On my Windows Phone, Microsoft still does not allow 3rd party keyboards :-( I make do with Binu's Type Tamil and Type Sanskrit, which let you input text, which you then have to copy and paste into wherever you want. This sucks, so I'm still looking to you, Microsoft!