AspellPansari

From CMWiki

Points to consider when developing a spell-checker for Hindi, by Hariram Pansari <hariraama@gmail.com>:

(1) In Hindi, different users write the same word with many alternative spellings, which creates many problems. The standardised spellings in the "मानक हिन्दी वर्तनी" booklet, published by the Department of Official Language, GOI (probably in 1965), have become outdated for computing, where a word is formed from standardised characters rather than from the glyph pieces used on manual typewriters and in unstandardised 7-bit/8-bit TTF Devanagari fonts, which have limitations. The booklet needs to be revised.

Much confusion arises from alternative spellings produced by alternative renderings in Hindi/Devanagari.

For example :

Replacing the fifth consonant of a Varga with the vowel modifier 'Anuswar' (अङ्क=अंक, शङ्ख=शंख .... सम्भाल=संभाल) is scientifically and logically wrong. But the BIS ISCII document accepted it in order to match the old practice of the Hindi typewriter (where characters were broken into pieces to fit somehow onto the 48 keys of an English keyboard). This creates many problems. The spell checkers in some software auto-replace words like 'शङ्ख' with 'शंख', which is not justified linguistically.

Moreover, in alphabetical sorting order शंख appears near the beginning of a dictionary, while शङ्ख appears hundreds of words later.
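The sorting effect can be seen in a minimal sketch that assumes plain code-point ordering (a real dictionary may use a different collation). The third word, शक, is added only for illustration.

```python
# Two spellings of the same word, written with explicit code points.
SHANKH_ANUSWAR = "\u0936\u0902\u0916"        # श + anuswar + ख
SHANKH_CONJUNCT = "\u0936\u0919\u094D\u0916"  # श + ङ + halant + ख
SHAK = "\u0936\u0915"                         # श + क, an unrelated word

# Python compares strings code point by code point.
# Anuswar (U+0902) sorts before every consonant, ङ (U+0919) after क (U+0915).
ordered = sorted([SHANKH_CONJUNCT, SHAK, SHANKH_ANUSWAR])
print(ordered)
```

Under this ordering the two spellings of one word are separated by every word that begins with श followed by a vowel sign or a consonant up to ङ.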

(2) The right sequence of characters has to be input for conjuncts to render. In the absence of any standard, many OT fonts use different sequences for conjuncts, which creates problems in spell checking. A spell-checked or corrected word may render wrongly in another OT font of the same language.

(3) The use of ZWJ and ZWNJ for alternative rendering of conjuncts also creates great problems in Hindi and Indic spell checking, e.g.:

क(0915)+halant(094D)+ष(0937) = क्ष

क(0915)+halant(094D)+ZWJ(200D)+ष(0937) = क्‍ष

क(0915)+halant(094D)+ZWNJ(200C)+ष(0937) = क्‌ष

Should these two characters, ZWJ(200D) and ZWNJ(200C), be ignored by the spell-checker rules? If they are ignored, or if the word is replaced with the spell checker's preferred form, how can the user's sequences and intended rendering be kept intact?
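One possible answer, sketched here as an assumption rather than a settled rule, is to ignore the joiners only when looking a word up and to leave the user's text untouched. The function names and the one-word dictionary are hypothetical.

```python
ZWJ = "\u200D"
ZWNJ = "\u200C"

def lookup_key(word: str) -> str:
    """Return the word without joiners; used for dictionary lookup only."""
    return word.replace(ZWJ, "").replace(ZWNJ, "")

def is_known(word: str, dictionary: set) -> bool:
    """Check the joiner-free form, leaving the original word unchanged."""
    return lookup_key(word) in dictionary

plain = "\u0915\u094D\u0937"            # क + halant + ष
with_zwj = "\u0915\u094D\u200D\u0937"   # क + halant + ZWJ + ष
with_zwnj = "\u0915\u094D\u200C\u0937"  # क + halant + ZWNJ + ष

dictionary = {plain}
print(is_known(with_zwj, dictionary), is_known(with_zwnj, dictionary))
```

Because only the lookup key is stripped, a word that is accepted keeps whatever rendering the user chose.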

(4) Large Hindi corpora or word lists have to be collected, built, proofread and standardised with correct spellings. For this, users' "personal dictionaries" may be collected from time to time to enlarge the spell-check dictionaries.

(5) Spell checkers are generally tedious to use, as they stop at an unknown word every few words. There should be 'replace all' and 'ignore all' options so that, once the right word has been chosen, all later occurrences are handled automatically.

(6) As provided in the CDAC products Leapoffice/ISM, "user-defined settings" options are a must for a spell checker for Indian languages, since different spellings are considered correct or wrong in different cases.

(7) An Indic grammar checker may be more difficult, especially for Hindi, where verbs also carry a gender suffix. For example: राम जाता है, सीता जाती है, राम और सीता दोनों जाते हैं. In जाता, जाती, जाते the maatraa suffixes आ, ई and ए need a special algorithm. But it will not apply in all cases. For example: राम मजबूत है, सीता मजबूत है, राम और सीता दोनों मजबूत हैं।
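A minimal sketch of how 'replace all' and 'ignore all' decisions might be remembered for the rest of a document. The class and method names are hypothetical and not taken from any existing spell checker.

```python
class SpellSession:
    """Remembers 'replace all' and 'ignore all' decisions for one document."""

    def __init__(self, dictionary):
        self.dictionary = set(dictionary)
        self.replacements = {}  # misspelt word -> correction chosen once
        self.ignored = set()    # words the user chose to ignore everywhere

    def replace_all(self, wrong, right):
        self.replacements[wrong] = right

    def ignore_all(self, word):
        self.ignored.add(word)

    def resolve(self, word):
        """Return the word to emit, or None if the user still has to decide."""
        if word in self.replacements:
            return self.replacements[word]
        if word in self.dictionary or word in self.ignored:
            return word
        return None
```

The checker would stop only when `resolve` returns `None`, so each unknown word interrupts the user at most once.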


Some rules for Hindi Spell-checker

(1) The Hindi spell-checker corpus should be compact, with a defined way of maintaining a database in the format "prefix+base_word+suffix". For example: अ+समर्थ+ता= असमर्थता, or this may appear as अ~समर्थ~ता= असमर्थता, so that the same dictionary can be used as a hyphenation dictionary for breaking long words at the right places (in the hot zone at a line end).
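As a rough sketch of how one entry in this format could serve both purposes, the hypothetical function below splits an entry on the separator and returns the full word together with the permitted break positions, counted in code points.

```python
def parse_entry(entry: str, sep: str = "~"):
    """Split 'prefix~base~suffix' into (word, break_offsets)."""
    parts = entry.split(sep)
    word = "".join(parts)
    breaks = []
    position = 0
    for part in parts[:-1]:  # no break point after the last part
        position += len(part)
        breaks.append(position)
    return word, breaks

# अ ~ समर्थ ~ ता, written with explicit code points
entry = "\u0905~\u0938\u092E\u0930\u094D\u0925~\u0924\u093E"
word, breaks = parse_entry(entry)
print(word, breaks)
```

The spell checker would use `word` for lookup, while a hyphenator would use `breaks` to decide where a long word may be split at a line end.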

(2)

Use of ड(0921) and ड़(095C)

Use of ढ(0922) and ढ़(095D)

The two letters in each pair are separate sounds and are encoded independently in Unicode.


(2.1)

Out of old habit from typing on manual typewriters, people wrongly use

ड(0921)+Nukta(093C) for forming ड़(095C)

and use

ढ(0922)+Nukta(093C) for forming ढ़(095D)


A rule needs to be defined in the spell-checker algorithm so that

All ड(0921)+Nukta(093C) should be auto-replaced with ड़(095C)

All ढ(0922)+Nukta(093C) should be auto-replaced with ढ़(095D)


(2.2)

People wrongly use ड(0921) and ढ(0922) in place of ड़(095C) and ढ़(095D).


In this regard there are some rules from Indic Lexic

1. ड़(095C) and ढ़(095D) never come at the beginning of a word.

If one appears at the beginning of a word, it should be

auto-replaced with ड(0921) or ढ(0922) respectively.


2. ड़(095C) and ढ़(095D) never join with any character to form a conjunct.

So if one appears as

halant(094D)+ड़(095C) or ड़(095C)+halant(094D)

or

halant(094D)+ढ़(095D) or ढ़(095D)+halant(094D)

it should be auto-replaced with ड(0921) or ढ(0922) respectively.
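The two rules above can be sketched as follows. This is a minimal illustration that assumes the input is a single word already using the precomposed letters U+095C and U+095D; the function name is hypothetical.

```python
HALANT = "\u094D"
# ड़ -> ड and ढ़ -> ढ
FLAP_TO_PLAIN = {"\u095C": "\u0921", "\u095D": "\u0922"}

def fix_flaps(word: str) -> str:
    """Replace ड़/ढ़ at word start or next to a halant with ड/ढ."""
    chars = list(word)
    for i, ch in enumerate(chars):
        if ch not in FLAP_TO_PLAIN:
            continue
        at_start = i == 0
        after_halant = i > 0 and chars[i - 1] == HALANT
        before_halant = i + 1 < len(chars) and chars[i + 1] == HALANT
        if at_start or after_halant or before_halant:
            chars[i] = FLAP_TO_PLAIN[ch]
    return "".join(chars)

print(fix_flaps("\u095C\u0930"))  # ड़ at the start of a word becomes ड
```

A ड़ or ढ़ in the middle of a word with no halant beside it, as in लड़का, is left alone.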


(3)

Use of ल(0932) and ळ(0933)


People have the wrong old practice of using ल(0932)+Nukta(093C) to form ळ(0933).

A rule needs to be defined in the spell-checker algorithm so that

All ल(0932)+Nukta(093C) should be auto-replaced with ळ(0933)


(4)

Use of ळ(0933) and ऴ(0934)


People have the wrong old practice of using ळ(0933)+Nukta(093C) to form ऴ(0934).

A rule needs to be defined in the spell-checker algorithm so that

All ळ(0933)+Nukta(093C) should be auto-replaced with ऴ(0934)


(5)

Use of य(092F) and य़(095F)


For writing Marathi, Oriya and Bengali words in Devanagari,

people have the wrong old practice of using य(092F)+Nukta(093C) to form य़(095F).

A rule needs to be defined in the spell-checker algorithm so that

All य(092F)+Nukta(093C) should be auto-replaced with य़(095F)


(6)

Use of क़(0958) ख़(0959) ग़(095A) ज़(095B) फ़(095E)

People have the wrong practice of typing the above characters as base_character+Nukta.

In most IMEs it is easier to type an extra nukta than to type the above characters separately.

So rules need to be defined in the spell-checker algorithm so that:

All क(0915)+Nukta(093C) should be auto-replaced with क़(0958)

All ख(0916)+Nukta(093C) should be auto-replaced with ख़(0959)

All ग(0917)+Nukta(093C) should be auto-replaced with ग़(095A)

All ज(091C)+Nukta(093C) should be auto-replaced with ज़(095B)

All फ(092B)+Nukta(093C) should be auto-replaced with फ़(095E)
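The nukta rules in points (2.1) and (3) to (6) can be sketched as one lookup table. This is a minimal illustration with a hypothetical function name. An explicit table is used because Unicode normalisation will not do this job: U+0958 to U+095F are excluded from canonical composition, so Python's `unicodedata.normalize("NFC", ...)` splits them back into base letter + nukta.

```python
import re

NUKTA = "\u093C"

# base letter -> letter to use instead of base + nukta, as listed in the rules
COMPOSED = {
    "\u0915": "\u0958",  # क -> क़
    "\u0916": "\u0959",  # ख -> ख़
    "\u0917": "\u095A",  # ग -> ग़
    "\u091C": "\u095B",  # ज -> ज़
    "\u0921": "\u095C",  # ड -> ड़
    "\u0922": "\u095D",  # ढ -> ढ़
    "\u092B": "\u095E",  # फ -> फ़
    "\u092F": "\u095F",  # य -> य़
    "\u0932": "\u0933",  # ल -> ळ
    "\u0933": "\u0934",  # ळ -> ऴ
}

BASE_PLUS_NUKTA = re.compile("([" + "".join(COMPOSED) + "])" + NUKTA)

def compose_nukta(text: str) -> str:
    """Replace each base letter + nukta pair with the single letter."""
    return BASE_PLUS_NUKTA.sub(lambda m: COMPOSED[m.group(1)], text)

print(compose_nukta("\u091C\u093C\u0930\u093E"))  # ज + nukta + रा
```

Text that already uses the single code points passes through unchanged, so the rule can be applied repeatedly without harm.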


(7)

[Earlier, when Hindi/Devanaagarii typewriters were manufactured (by just replacing the Roman character types with Devanaagarii character types), there was a limit of 48 keys, so some characters had to be left out and some had to be cut down to their half forms only. All nasalised syllables had to be formed with Anuswaar and Chandrabindu.] Following this tradition, some software based on ISCII-1991 provided Hindi spell-check options for auto-replacing every पञ्चमाक्षर of a consonant varga + halant [ङ्,ञ्,ण्,न्,म्] (if followed by any of the first four characters of that varga) with अनुस्वार (U0902). But, confused by this, some people wrongly use Anuswaar in words like "संमति(सम्मति), संमान(सम्मान), अंवय(अन्वय), कंव(कण्व), वांमय(वाङ्मय)" as well.

Therefore:

as Unicode and OT fonts now have vast scope and there is no longer any technical limit, a Unicode spell checker should provide for auto-replacing all


"अनुस्वार(U0902)+क[or ख, ग, घ]" to "ङ्(U0919+U094D)+क[or ख, ग, घ]"

"अनुस्वार(U0902)+च[or छ, ज, झ]" to "ञ्(U091E+U094D)+च[or छ, ज, झ]"

"अनुस्वार(U0902)+ट[or ठ, ड, ढ]" to "ण्(U0923+U094D)+ट[or ठ, ड, ढ]"

"अनुस्वार(U0902)+त[or थ, द, ध]" to "न्(U0928+U094D)+त[or थ, द, ध]"

"अनुस्वार(U0902)+प[or फ, ब, भ]" to "म्(U092E+U094D)+प[or फ, ब, भ]"

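The five replacements can be sketched as a single table from each stop consonant to the nasal of its varga. This is a minimal illustration with hypothetical names, and it looks only at the character immediately after the Anuswar.

```python
ANUSWAR = "\u0902"
HALANT = "\u094D"

# (fifth letter of the varga, first four letters of the same varga)
VARGAS = [
    ("\u0919", "\u0915\u0916\u0917\u0918"),  # ङ : क ख ग घ
    ("\u091E", "\u091A\u091B\u091C\u091D"),  # ञ : च छ ज झ
    ("\u0923", "\u091F\u0920\u0921\u0922"),  # ण : ट ठ ड ढ
    ("\u0928", "\u0924\u0925\u0926\u0927"),  # न : त थ द ध
    ("\u092E", "\u092A\u092B\u092C\u092D"),  # म : प फ ब भ
]
NASAL_FOR = {stop: nasal for nasal, stops in VARGAS for stop in stops}

def expand_anuswar(text: str) -> str:
    """Replace anuswar + varga stop with nasal + halant + the same stop."""
    out = []
    for i, ch in enumerate(text):
        following = text[i + 1] if i + 1 < len(text) else ""
        if ch == ANUSWAR and following in NASAL_FOR:
            out.append(NASAL_FOR[following] + HALANT)
        else:
            out.append(ch)
    return "".join(out)

print(expand_anuswar("\u0936\u0902\u0916"))  # शंख
```

An Anuswar before a letter outside the five vargas, as in संसार, is left as typed.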


(8) Some people have the wrong practice of inserting a space before '।' (purnaviram/danda, 0964) and '॥' (double danda, 0965), whereas an appropriate space width before them is already built into the font at the design level. Because of this mistake, if the Danda or Double_Danda falls in the hot zone at a line end, the last word of the sentence stays at the line end and the Danda or Double_Danda shifts to the beginning of the next line. This looks wrong.

So a rule needs to be defined in the spell-checker algorithm so that

all space(0020)+Danda(0964) should be auto-replaced with Danda(0964), i.e. space to be deleted.

all space(0020)+Double_Danda(0965) should be auto-replaced with Double_Danda(0965), i.e. space to be deleted.

(9) Similar to point (8) above, some people have the wrong practice of not inserting a space after '।' (purnaviram/danda, 0964) and '॥' (double danda, 0965). Because of this mistake two sentences are joined together, which creates problems in NLP processing.

So a rule needs to be defined in the spell-checker algorithm so that

All Danda(0964) should be auto-replaced with Danda(0964)+space(0020) if space(0020) not found after Danda(0964).

All Double_Danda(0965) should be auto-replaced with Double_Danda(0965)+space(0020) if space(0020) not found after Double_Danda(0965).


(10)

Similarly

All space(0020)+,(comma U002C) should be auto-replaced with ,(comma U002C) only.

All space(0020)+;(semi-colon U003B) should be auto-replaced with ;(semi-colon U003B) only.

All space(0020)+?(question mark U003F) should be auto-replaced with ?(question mark U003F) only.

All space(0020)+॰(abbreviation U0970) should be auto-replaced with ॰(abbreviation U0970) only.


A space must be inserted after '.' (full stop), ',' (comma), ';' (semi-colon), '?' (question mark) and '॰' (abbreviation) (and other punctuation marks if developers feel it necessary), if one is not already there.
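The spacing rules in points (8) to (10) can be sketched with two regular expressions. This is a naive illustration with hypothetical names. It deliberately leaves out '.', whose several meanings make a blanket rule unsafe, and it would also split a number written as "1,000", so a real rule needs more context.

```python
import re

# danda, double danda, comma, semi-colon, question mark, abbreviation sign
MARKS = "\u0964\u0965,;?\u0970"

SPACE_BEFORE = re.compile(" +(?=[" + MARKS + "])")
# a mark followed directly by something that is neither space nor a mark
MISSING_AFTER = re.compile("([" + MARKS + "])(?=[^\\s" + MARKS + "])")

def fix_spacing(text: str) -> str:
    """Delete spaces before the marks and add a missing space after them."""
    text = SPACE_BEFORE.sub("", text)
    return MISSING_AFTER.sub("\\1 ", text)

print(fix_spacing("a \u0964b"))
```

Both substitutions run without asking the user anything, and running them a second time changes nothing further.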


All the above rules should work without user interference during spell checking, i.e. the spell checker should not stop for the user to select an option. It should proceed without a break.


Other rules


(11)

Some people, as in English, use '.' (002E) for the following multiple meanings (functions):

1. Fullstop (Sentence_ending mark)

2. Abbreviation (For example : Mr.A.B.Singh, Rs.=Rupees)

3. Decimal point (For example : 456.50, 34.567)

4. Multiplication sign (in maths, as 2.2=4)

5. Dot (in website addresses, for example: www.sarai.net)

This creates many problems in NLP processing (such as computer-aided translation), e.g. where to break a sentence, how to deal with abbreviations, etc.


But in Devanagari Unicode separate signs are encoded :

For fullstop (sentence_end)= । (Danda, 0964)

For Paragraph_end (specially found in poems) = ॥ (Double_Danda, 0965)

For abbreviation = ॰ (Abbreviation, 0970)


For this reason English to Hindi computer translation is much more difficult than Hindi to English computer translation.


So user-definable option settings should be provided in the program for:

Replacing all .(002E) found at the end of a sentence with । (Danda, 0964)

Replacing all .(002E) used as an abbreviation mark with ॰ (Abbreviation, 0970)

Replacing all .(002E) used as a decimal point, i.e. between two numerals, with · 'middle dot' (00B7), as some conventions use it for this purpose.
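A rough sketch of the three replacements follows. The heuristics are assumptions: a dot between two digits is treated as a decimal point (so the multiplication use "2.2=4" would be changed as well), a dot after a word from a caller-supplied abbreviation list becomes the abbreviation sign, and a dot directly after a Devanagari character and before whitespace or the end of the text becomes a Danda. The function name and the abbreviation list are hypothetical.

```python
import re

DANDA = "\u0964"
ABBREVIATION_SIGN = "\u0970"
MIDDLE_DOT = "\u00B7"

DECIMAL_DOT = re.compile("(?<=\\d)\\.(?=\\d)")
# dot right after a Devanagari character, before whitespace or end of text
SENTENCE_DOT = re.compile("(?<=[\u0900-\u097F])\\.(?=\\s|$)")

def replace_dots(text: str, abbreviations=frozenset()) -> str:
    """Replace '.' by middle dot, abbreviation sign or danda, in that order."""
    text = DECIMAL_DOT.sub(MIDDLE_DOT, text)
    for abbr in abbreviations:  # naive: also matches inside longer words
        text = text.replace(abbr + ".", abbr + ABBREVIATION_SIGN)
    return SENTENCE_DOT.sub(DANDA, text)

print(replace_dots("456.50"))
```

Dots in Latin-script text such as www.sarai.net match none of the three patterns and stay as they are.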


(12)

Some people confuse ड़ (DDDHA, 095C) with ङ (NGA, 0919) and use the former in place of the latter; this should be taken care of if a rule can be defined to correct it.


(13)

People (especially non-Hindi speakers) make frequent mistakes in the use of

'i' maatra (093F) and 'ii' maatra (0940)

'u' maatra (0941) and 'uu' maatra (0942)


Some available special rules are to be set for this. (to be mentioned....)


(14)

As it was technically difficult to put a Chandrabindu (U0901) on letters that carry a maatraa above the line (because of limitations in manual Hindi typewriters and 7-bit/8-bit TTF fonts), the spelling rules accepted the use of Anuswar in place of Chandrabindu.

But the two are separate sounds and separately encoded characters.

Now, with Unicode-encoded scripts and OT fonts, there is no such limit. In some Hindi fonts such as Raghu and Mangal, Chandrabindu also appears clearly over maatraas above the line. So provision should be made to correct the spelling to Chandrabindu where Anuswar is used wrongly.

more... (to continue...)


