TAMIL PART-OF-SPEECH TAGGER CUM SPELL CHECKER

README.DOC

Copy the file tagtamil.zip into a separate directory on your hard disk and 
give the command C:> pkunzip tagtamil.zip (or use any other compatible 
decompression program).  This will uncompress all the files related to the 
Tamil part-of-speech tagger and spell checker.  You require an IBM compatible 
computer (either a 386 or a 486 machine) with a VGA color monitor to run the 
programs in this package. 

The following stand-alone executable programs are included in this package: 

  1) tamiltag.exe  - Part-of-speech tagger for Tamil.
  2) spltamil.exe  - Experimental spelling checker for Tamil.
  3) getwords.exe  - A utility program to extract words from a text file.
  4) correct.exe   - Corrects your source text consulting the error file.
  5) help.exe      - Command line instructions for tagger and spell checker.
  6) translat.exe  - A prototype English - Tamil translation system.
  7) tamilize.exe  - Transliterates Roman to Tamil (only monolingual documents).
  8) tagtex.exe    - Prepares Tamil TeX file for printing using wntml fonts.
  9) write.exe     - Like DOS type.com, but it prompts for every page.
 10) ted.com       - Public domain text editor. (write.exe and this program
                       can be used to read tamilized documents).
 11) tamil.com     - Tamil ascii driver. (Run this first!).
 12) normal.com    - Brings back normal ascii from Tamil ascii.

   The programs in this package are based on the morphological knowledge 
base that I have developed for Tamil.  TAMILTAG.EXE is a morphological 
processor that recognizes Tamil words and outputs the root form of the input 
word along with suitable tags for its affixes.  The system is written in the 
programming language Prolog and incorporates the ideas of level ordered 
morphology, as illustrated in the theory of lexical phonology, and the 
concept of two-level morphology, as introduced in the PC-KIMMO system.  
Morphemes in a given Tamil word are recognized in a specific order conforming 
to the three way classification of Tamil suffixes into level-1, level-2 and 
level-3 suffixes.  Once a suffix belonging to a specific level is recognized, 
that suffix is stripped off and the surface form of the rest of the word is 
reconstructed.  This process is repeated until all the suffixes in the given 
word are recognized and the root form of the word is obtained.  The two-level 
forms, i.e., the surface and lexical forms, are stored in a built-in Prolog 
database as part of the system. 
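
The stripping procedure described above can be sketched roughly as follows.  
This is a minimal illustration in Python, not the Prolog code of 
TAMILTAG.EXE.  The suffix table, the assignment of suffixes to levels, the 
outermost-first order and the names SUFFIXES, ROOTS and analyze are all 
assumptions made for the example, and the reconstruction of the surface form 
after each stripping step is left out.  Only the segmentation paar-tt-a-tu 
and the tag names are taken from this file. 

```python
# Toy tables for illustration only; the level assignments are assumed,
# not taken from the real rule base.
SUFFIXES = {
    3: {"tu": "pn_neut.sg"},  # pronominal neuter singular
    2: {"a": "ajp"},          # adjectival participle
    1: {"tt": "pa"},          # past tense
}
ROOTS = {"paar"}              # stand-in for the dictionary (words.dat)


def analyze(word):
    """Strip one suffix per level, outermost level first (an assumption).

    Returns (root, tag_string), or None if the remaining stem is not a
    known root.
    """
    rest, tags = word, []
    for level in (3, 2, 1):
        for suffix, tag in SUFFIXES[level].items():
            if rest.endswith(suffix) and len(rest) > len(suffix):
                rest = rest[: -len(suffix)]
                tags.append(tag)
                break
    if rest not in ROOTS:
        return None
    # Tags were collected from the outside in; report them root-outward.
    return rest, "_".join(reversed(tags))


print(analyze("paarttatu"))  # ('paar', 'pa_ajp_pn_neut.sg')
```

A word whose remaining stem is not in ROOTS yields None, which corresponds to 
the tagger failing to recognize the word. 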
      
This tagger may fail to recognize a word for one of the following reasons:

  1) The given word is not in the dictionary (words.dat), which contains 
     only one thousand words.
  2) The rule for the specific form is yet to be coded in the system.
  3) Lexical homonymy.  The Tamil word col ('say', 'word'), for instance, 
     is both a noun and a verb.  Since the dictionary lists this word only 
     as a verb, only its verbal forms, such as connaan (said-he), conna 
     (that which is said) and collum (will say-it), will be recognized.  
     The nominal declensions, such as coRkaluTan 'with the words', collai 
     'word-obj.' and collaal 'by the word', will be ignored by this system.  
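
The effect of lexical homonymy can be illustrated with a small Python sketch.  
It is not the actual rule base: the names LEXICON, SUFFIXES, undouble and 
recognize, the tag labels, and the rule that undoes a doubled final consonant 
(coll -> col) are assumptions made for the example.  The word forms and 
glosses are the ones quoted above. 

```python
# One category per root, mirroring the statement that the dictionary
# lists col only as a verb.
LEXICON = {"col": "verb"}

# Toy suffix table: suffix -> (category it attaches to, illustrative tag).
SUFFIXES = {
    "um": ("verb", "future-it"),  # collum  'will say-it'
    "ai": ("noun", "obj"),        # collai  'word-obj.'
    "aal": ("noun", "by"),        # collaal 'by the word'
}


def undouble(stem):
    """Undo a doubled final letter (coll -> col); a crude assumed rule."""
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        return stem[:-1]
    return stem


def recognize(word):
    """Return (root, tag) if a suffix matches the root's listed category."""
    for suffix, (category, tag) in SUFFIXES.items():
        if word.endswith(suffix):
            root = undouble(word[: -len(suffix)])
            if LEXICON.get(root) == category:
                return root, tag
    return None


print(recognize("collum"))   # ('col', 'future-it')
print(recognize("collai"))   # None
print(recognize("collaal"))  # None
```

Because col is listed only as a verb, the noun suffixes never match, so the 
nominal forms are rejected even though the root itself is in the lexicon. 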

   Only one of the readings of an inflectionally homonymous word will be 
recognized by this system.  For example, the word paar-tt-a-tu in Tamil can 
be interpreted in three different ways: a) as a neuter singular finite verb, 
'saw-it'; b) as a verbal noun meaning 'seeing' (past); and c) as a 
participial noun meaning 'that which saw'.  This system is designed to 
identify only the last interpretation, 'that which saw', and thus the above 
example is tagged as [[nom, paar, pa_ajp_pn_neut.sg]] (i.e., a nominative 
noun with the neuter singular adjectival participle form of the verb 
'paar').  Likewise, the word 'paarttatum' will be interpreted by this system 
as a neuter singular adjectival participle with conjunctive suffix ('the one 
which saw was also'), and the consecutive meaning 'soon after one saw' will 
be ignored. 

SPELL CHECKER:

The program SPLTAMIL.EXE is an experimental spell checker for Tamil.  It is 
essentially an application of the above morphological tagger.  The input file 
for this program must be an ascii file with Tamil words arranged as one word 
per line.  (The program getwords.exe, provided in this package, can be used 
to extract all the words from your text file without repetition).  When a 
word does not conform to any of the morphological rules provided in the rule 
base, this system assumes that it is a misspelled word.  A file called 
error.dat will be automatically created by this system to store all the 
misspelled and corrected words.  The program correct.exe can be used to 
apply the corrections to your source file.  Three sample files, namely 
sample.dat, akilan.dat and taginput.dat, are provided in this package.  Use 
these files with the programs getwords.exe, spltamil.exe, tamiltag.exe, 
correct.exe and tamilize.exe.  
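
The word extraction and checking steps described above can be sketched in 
Python as follows.  This only illustrates the idea and is not the code of 
getwords.exe or spltamil.exe: the function names, the punctuation that is 
stripped, the sample text and the KNOWN set standing in for the rule base are 
all assumptions, and the layout of error.dat is not reproduced here. 

```python
def extract_words(text):
    """Return each distinct word once, in order of first occurrence."""
    seen = set()
    words = []
    for token in text.split():
        word = token.strip(".,;:!?()\"")  # assumed punctuation set
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


def find_unrecognized(words, recognizes):
    """Collect the words that the recognizer rejects (assumed misspelled)."""
    return [word for word in words if not recognizes(word)]


# Stand-in for the morphological rule base; a real check would run the
# tagger on each word instead of looking it up in a fixed set.
KNOWN = {"connaan", "paarttatu", "collum"}

sample = "connaan paarttatu, connaan collumm."
words = extract_words(sample)
errors = find_unrecognized(words, lambda word: word in KNOWN)

print("\n".join(words))  # one word per line, as the spell checker expects
print(errors)            # ['collumm']
```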

Both command line input and file input are supported in this system.  Most 
of the operations of this system are self-explanatory, since a number of 
options are provided.  There are options for consulting the transliteration 
table, example inputs and abbreviation table, and for file handling. 

Run the program help.exe to read the detailed instructions on how to use the 
programs in this package. 

This is a work in progress.  Any suggestion to improve these programs will be 
greatly appreciated.  I am hoping that this work will be of much use for 
further research related to Tamil NLP applications such as information 
retrieval, machine translation, pedagogical applications and natural language 
interfaces, besides other practical applications such as a spell checker 
and a thesaurus.  

SYSTEM REQUIREMENTS:  An IBM compatible PC with 386 or higher configuration 
and a VGA color monitor.    

My other NLP oriented Tamil learning programs are available at the ftp sites 
clr.nmsu.edu under the directory CLR/multiling/indian/Tamil and 
wuarchive.wustl.edu under /pub/doc/misc/tamil/pulavan.

Copyright:  This is freeware, and no prior permission from the author is 
required either to use these programs or to copy them to others.  However, if 
any part of this system is used in any form for commercial purposes, prior 
permission from the author is requested.  I release this as part of my 
previous Tamil learning package called "PULAVAN".  I claim no responsibility 
for any kind of consequences that might occur due to proper or improper use of 
any of these programs.  

Permanent address:
Vasu Renganathan
East street,
Kazhanivasal,
Manganallur bazaar P.O. 
Mayiladuthurai R.M.S.  609 404.
India.
Phone: 011 91 4364 83452.

Email: vasu@u.washington.edu