TAMIL PART-OF-SPEECH TAGGER CUM SPELL CHECKER

README.DOC

Copy the file tagtamil.zip into a separate directory on your hard disk and give the command

    C:> pkunzip tagtamil.zip

(or use any other compatible compression program). This will uncompress all the files related to the Tamil part-of-speech tagger and spell checker. You require an IBM compatible computer with a VGA color monitor, either a 386 or a 486 machine, to run the programs in this package.

The following stand-alone executable programs are included in this package:

1) tamiltag.exe - Part-of-speech tagger for Tamil.
2) spltamil.exe - Experimental spelling checker for Tamil.
3) getwords.exe - A utility program to extract words from a text file.
4) correct.exe - Corrects your source text by consulting the error file.
5) help.exe - Command line instructions for the tagger and spell checker.
6) translat.exe - A prototype English - Tamil translation system.
7) tamilize.exe - Transliterates Roman to Tamil (monolingual documents only).
8) tagtex.exe - Prepares a Tamil TeX file for printing using wntml fonts.
9) write.exe - Like the DOS type command, but it prompts at every page.
10) ted.com - Public domain text editor. (write.exe and this program can be used to read tamilized documents.)
11) tamil.com - Tamil ascii driver. (Run this first!)
12) normal.com - Restores normal ascii from Tamil ascii.

The programs in this package are based on the morphological knowledge base that I have developed for Tamil. TAMILTAG.EXE is a morphological processor that recognizes Tamil words and outputs the root form of the input word along with suitable tags for its affixes. The system is written in the programming language Prolog. It incorporates the idea of level-ordered morphology as illustrated in the theory of lexical phonology, and the concept of two-level morphology as introduced in the PC-KIMMO system.
Morphemes in a given Tamil word are recognized in a specific order that conforms to the three-way classification of Tamil suffixes into level-1, level-2 and level-3 suffixes. Once a suffix belonging to a specific level is recognized, that suffix is stripped off and the surface form of the rest of the word is reconstructed. This process is repeated until all the suffixes in the given word are recognized and the root form of the word is obtained. The two-level morphological forms, i.e. the surface and lexical forms, are stored in a built-in PROLOG database as part of the system.

This tagger may fail to recognize a word for one of the following reasons:

1) The given word is not in the dictionary (words.dat), which consists of only one thousand words.
2) The rule for the specific form is yet to be coded in the system.
3) Problems due to lexical homonymy. For example, the Tamil word col ('say', 'word') is both a noun and a verb. Since the dictionary lists this word only as a verb, only its verbal forms such as connaan (said-he), conna (that which is said), collum (will say-it) etc. will be recognized, and nominal declensions such as coRkaluTan 'with the words', collai 'word-obj.', collaal 'by the word' etc. will be ignored by this system.

Only one of the roles of an inflectionally homonymous word will be recognized by this system. For example, the word paar-tt-a-tu in Tamil can be interpreted in three different ways:

a) as a neuter singular finite verb, 'saw-it',
b) as a verbal noun meaning 'seeing' (past), and
c) as a participial noun with the meaning 'that which saw'.

This system is designed in such a way that it identifies only the last interpretation, i.e. 'that which saw', and thus the above example is tagged as [[nom, paar, pa_ajp_pn_neut.sg]] (i.e. a nominative noun formed from the neuter singular adjectival participle of the verb 'paar').
Thus, the word 'paarttatum' will be interpreted by this system as a neuter singular adjectival participle with the conjunctive suffix ('the one which saw was also'), and the consecutive meaning 'soon after one saw' will be ignored.

SPELL CHECKER:

The program SPLTAMIL.EXE is an experimental spell checker for Tamil. It is essentially an application of the morphological tagger described above. The input file for this program must be an ascii file with Tamil words arranged one word per line. (The program getwords.exe, provided in this package, can be used to extract all the words from your text file without repetition.) When a word does not conform to any of the morphological rules in the rule base, the system assumes that it is misspelled. A file called error.dat is created automatically by the system to store all the misspelled words and their corrections. The program correct.exe can then be used to apply the corrections to your source file.

Three sample files, namely sample.dat, akilan.dat and taginput.dat, are provided in this package. Use these files with the programs getwords.exe, spltamil.exe, tamiltag.exe, correct.exe and tamilize.exe.

Both command line input and file input are supported. Most operations are self-explanatory, since the programs present a number of options. Options are provided for consulting the transliteration table, example inputs and the abbreviation table, and for file handling. Run the program help.exe to read detailed instructions on how to use the programs in this package.

This is a work in progress. Any suggestions to improve these programs will be greatly appreciated. I hope that this work will be useful for further research on Tamil NLP applications such as information retrieval, machine translation, pedagogical applications, natural language interfaces etc., besides other practical applications such as a spell checker and a thesaurus.
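
ILLUSTRATIVE SKETCH (not part of the package):

The spell-checking flow described in the SPELL CHECKER section (extract each word once, then treat any word the morphological rules cannot reduce to a known root as misspelled) can be pictured with the short Python sketch below. It is only a rough model under stated assumptions and is not the Prolog code of this package. The names ROOTS, SUFFIXES, strip_to_root, extract_unique_words and find_unrecognized are hypothetical; the root 'paar' and the suffixes tt, a and tu come from the paar-tt-a-tu example in this document. The sketch ignores the suffix levels and the reconstruction of surface forms after stripping, so it cannot handle forms such as paarttatum, and the word pattern it uses is a guess at the transliteration alphabet.

```python
import re

# Hypothetical toy lexicon; the real system reads words.dat.
ROOTS = {"paar"}

# Hypothetical flat suffix list taken from the README's paar-tt-a-tu example.
# The real rule base groups suffixes into three levels, which is not modelled here.
SUFFIXES = ["tu", "a", "tt"]


def strip_to_root(word):
    """Strip known suffixes from the end of the word until a known root remains.

    Returns the root, or None if no sequence of strippings reaches a root.
    """
    if word in ROOTS:
        return word
    for suffix in SUFFIXES:
        # Require a non-empty remainder after removing the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            root = strip_to_root(word[: -len(suffix)])
            if root is not None:
                return root
    return None


def extract_unique_words(text):
    """Return each word once, in order of first appearance.

    Case is preserved because the transliteration distinguishes
    letters such as R/r and T/t (assumed word pattern: ASCII letters only).
    """
    seen = set()
    words = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token not in seen:
            seen.add(token)
            words.append(token)
    return words


def find_unrecognized(words):
    """Return the words that cannot be reduced to a known root."""
    return [w for w in words if strip_to_root(w) is None]


if __name__ == "__main__":
    sample = "paarttatu paar paarttatu paarttaxu"
    unique = extract_unique_words(sample)
    print("\n".join(unique))          # one word per line, as spltamil.exe expects
    print(find_unrecognized(unique))  # candidates for error.dat
```

In this toy model a word outside the small lexicon is flagged exactly as a genuinely misspelled word would be, which mirrors the first failure reason listed for the tagger.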

SYSTEM REQUIREMENTS:

An IBM compatible PC with a 386 or higher configuration and a VGA color monitor.

My other NLP oriented Tamil learning programs are kept at the ftp sites clr.nmsu.edu under the directory CLR/multiling/indian/Tamil and wuarchive.wustl.edu under /pub/doc/misc/tamil/pulavan.

Copyright:

This is freeware, and no prior permission from the author is required either to use these programs or to copy them for others. However, if any part of this system is used in any form for commercial purposes, prior permission from the author is requested. I release this as part of my previous Tamil learning package called "PULAVAN". I claim no responsibility for any consequences that might arise from proper or improper use of any of these programs.

Permanent address:

Vasu Renganathan
East street, Kazhanivasal, Manganallur bazaar P.O.
Mayiladuthurai R.M.S. 609 404. India.
Phone: 011 91 4364 83452.
Email: vasu@u.washington.edu