NATURAL LANGUAGE PROCESSING


PROBABILISTIC METHODS FOR PROCESSING HIGHLY INFLECTED NATURAL LANGUAGES

We are applying our expertise in OA to the development of natural language processing tools. At this stage we concentrate on rare Semitic languages. Developing new fundamental statistical and probabilistic methodologies is an important part of this effort. The ultimate goal is to create spell checkers, syntax and morphology analyzers, electronic dictionaries and machine translation systems based on stochastic methods. Our research might also be useful for the automatic retrieval of historical texts and for Internet search engines.
The example of a spell checker illustrates well the tremendous problems created by the complex morphology of Semitic languages. A spell checker should offer suggestions for incorrectly spelled words. The suggestion list is made up of the words that are close to the misspelled word. To measure closeness between words, one can use an OA-score or a similar measure. For time efficiency, one needs to reduce words to a skeletal form and apply a pattern-matching algorithm.
One difficulty for Semitic languages is the complex morphology of the verbs: the primary meaning of a verb is defined by its root, which consists of three consonants. Complicated suffixes and prefixes carry information about the gender, person and number of the subject and the object. For example, take the root "flg" (want) in Amharic. Then "ifälgalähu" (I want) and "ifälgäwalähu" (I want something) differ only because of the object. Note that the suffix added for the object changes with the subject: "tfäligaläsh" (you want) but "tfälgiwaläsh" (you want something). The vowels between the consonants of the root define various modes, and the verb is also inflected for benefactive, malefactive, causative, transitive, passive, dative and negative. For example, "I have" is expressed differently in "I have money" and "I have a problem". The difference arises because the first sentence refers to something beneficial to the subject, whereas the second does not.
The problem is that every verb has many such forms, so this is not an isolated phenomenon but a systematic one. To make matters worse, several suffixes and prefixes are often added to the root at the same time. To summarize: a verb appears in thousands of different shapes in a text. Since many words are derived from a relatively small number of roots, there are huge numbers of related and unrelated words that have a high degree of similarity. This is the primary reason why the spell check problem has not yet been solved in a satisfactory manner for Arabic, for example.
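As a rough illustration of the suggestion step described above, the following Python sketch reduces words to a consonant skeleton, uses the skeleton to preselect candidates, and then ranks them by closeness to the misspelled word. It rests on several assumptions that are not part of our method as stated: plain Levenshtein edit distance stands in for the OA-score, the skeleton is obtained simply by dropping vowels from a romanized form, and the function names and the four-word lexicon are hypothetical.

```python
from collections import defaultdict

# Assumption: words are romanized, and these characters count as vowels.
VOWELS = set("aeiouä")


def skeleton(word):
    """Reduce a romanized word to its consonant skeleton by dropping vowels."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


def edit_distance(a, b):
    """Levenshtein distance (a stand-in for the OA-score), two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def build_index(lexicon):
    """Group the words of the lexicon by their skeleton."""
    index = defaultdict(list)
    for word in lexicon:
        index[skeleton(word)].append(word)
    return index


def suggest(word, index, max_skeleton_dist=1, limit=5):
    """Return up to `limit` lexicon words, closest to `word` first."""
    key = skeleton(word)
    candidates = []
    # Cheap filter on the short skeletons; a linear scan is enough for a sketch.
    for skel, words in index.items():
        if edit_distance(key, skel) <= max_skeleton_dist:
            candidates.extend(words)
    # Rank the surviving candidates by distance on the full word forms.
    candidates.sort(key=lambda w: (edit_distance(word, w), w))
    return candidates[:limit]


# Hypothetical toy lexicon: four romanized forms of one Amharic verb.
LEXICON = ["ifälgalähu", "ifälgäwalähu", "tfäligaläsh", "tfälgiwaläsh"]
INDEX = build_index(LEXICON)

if __name__ == "__main__":
    print(suggest("ifälgalahu", INDEX))  # misspelling: "a" typed for "ä"
```

Even in this toy lexicon, the four forms collapse to skeletons that differ by only one to three consonants, which shows in miniature why a skeleton filter alone admits many related candidates and a finer closeness measure is needed for ranking.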

TEXT CLASSIFICATION

Text classification is still a very important task: for example, when you build a chatbot, you first need to determine the intention of the message received. This can be viewed as a form of topic classification. Our published articles on the subject: