A new Arabic stemming algorithm

Eiman Tamah AlShammari; Jessica Lin

doi:10.36505/ExLing-2008/02/0004/000063

Authors

Eiman Tamah AlShammari, College of Computer and Information Sciences, King Saud University, Saudi Arabia
Jessica Lin, Computational Linguistics Program, University of Arizona, USA

Keywords:

Arabic, stemming, morphology

Abstract

Text processing is a vital step in information retrieval, text mining, and natural language processing. It includes several stages, such as normalization, stop word removal, and stemming. Stemming is the process of reducing a word to its root. Because languages differ in structure and rules, stemming is language-dependent. This research introduces a new stemming algorithm for the Arabic language. Arabic is one of the most complex languages, both spoken and written. However, it is also one of the most common languages in the world. It is the base from which many other languages are derived. Despite the wide usage of the language, technology and development for Arabic have been limited. The main reason lies in the formation rules of Arabic, as the language exhibits a very complicated morphological structure. Existing Arabic stemmers suffer from high stemming error rates. They blindly stem all words and perform poorly, especially with compound words, proper nouns, and foreign Arabized words. The main cause of this problem is the stemmer’s lack of knowledge of the word's lexical category (e.g., noun, verb, or preposition). This paper presents a new stemming algorithm that relies on Arabic morphology and Arabic syntax. The automated addition of syntactic knowledge reduced both stemming error and stemming cost. Additionally, the new algorithm automatically creates its own list of proper nouns and compound words based on the processed corpus.
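The abstract names the preprocessing stages (normalization, stop word removal, stemming) and the problem of blindly stemming proper nouns, but gives no implementation details. The following is a minimal sketch of that pipeline under stated assumptions; it is not the authors' algorithm. The stop word list, the affix lists, and the `protected` set (standing in for a list of proper nouns and compound words) are hypothetical and far smaller than any real resource.

```python
import re

# Hypothetical, tiny resources for illustration only; not taken from the paper.
STOP_WORDS = {"في", "من", "على", "الى", "عن"}  # stored in normalized form
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")

# Short vowel marks (U+064B to U+0652) and the tatweel elongation mark (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize(word):
    """Strip diacritics and tatweel, and unify alef variants to bare alef."""
    word = DIACRITICS.sub("", word)
    return re.sub("[أإآ]", "ا", word)


def light_stem(word):
    """Remove at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(tokens, protected=frozenset()):
    """Normalize, drop stop words, and stem everything not in `protected`."""
    result = []
    for token in tokens:
        word = normalize(token)
        if word in STOP_WORDS:
            continue
        # Words known to be proper nouns or compounds are left unstemmed.
        result.append(word if word in protected else light_stem(word))
    return result
```

With these toy lists, `preprocess(["إيران"])` strips the final two letters of the country name as if they were a suffix, while `preprocess(["إيران"], protected={"ايران"})` leaves the normalized name intact. This is the kind of error that knowledge of a word's category is meant to prevent.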
