A new Arabic stemming algorithm
DOI:
https://doi.org/10.36505/ExLing-2008/02/0004/000063
Keywords:
Arabic, stemming, morphology
Abstract
Text processing is a vital step in information retrieval, text mining, and natural language processing. It includes several stages, such as normalization, stop-word removal, and stemming. Stemming is the process of reducing a word to its root. Because languages differ in structure and rules, the task of stemming is language-dependent. This research introduces a new stemming algorithm for the Arabic language. Arabic is one of the most complex languages, both spoken and written. However, it is also one of the most common languages in the world. It is the base from which many other languages are derived. Despite the wide usage of the language, technology and development for Arabic have been limited. The main reason lies in the formation rules of Arabic, as the language exhibits a very complicated morphological structure. Existing Arabic stemmers suffer from high stemming error rates. They blindly stem all words and perform poorly, especially with compound words, proper nouns, and foreign Arabized words. The main cause of this problem is the stemmer’s lack of knowledge of the word’s lexical category (i.e., noun, verb, preposition, etc.). This paper presents a new stemming algorithm that relies on Arabic morphology and Arabic syntax. The automated addition of syntactic knowledge reduced both stemming error and stemming cost. Additionally, the new algorithm automatically creates its own list of proper nouns and compound words based on the processed corpus.
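To make the contrast between blind stemming and category-aware stemming concrete, the following is a minimal Python sketch. It is not the algorithm presented in the paper: the affix lists, the `light_stem` and `stem_tokens` functions, and the hand-supplied `protected` set are all hypothetical and chosen only for illustration. In particular, the protected set is supplied by hand here, whereas the paper's algorithm builds its list of proper nouns and compound words automatically from the corpus.

```python
# Illustrative affix lists (an assumption, far from complete).
# Longer prefixes come first so that "وال" is tried before "ال".
PREFIXES = ("وال", "بال", "كال", "فال", "ال")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")

MIN_STEM_LENGTH = 3  # assumed lower bound on what remains after stripping


def light_stem(word):
    """Blindly strip at most one prefix and one suffix from a word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word


def stem_tokens(tokens, protected):
    """Stem every token except those in the protected set.

    `protected` stands in for words that should not be stemmed,
    such as proper nouns or compound words (hypothetical input).
    """
    return [tok if tok in protected else light_stem(tok) for tok in tokens]


if __name__ == "__main__":
    tokens = ["المعلمون", "القاهرة"]
    print(stem_tokens(tokens, protected=set()))        # blind stemming
    print(stem_tokens(tokens, protected={"القاهرة"}))  # proper noun kept intact
```

With an empty protected set, the proper noun "القاهرة" (Cairo) is stripped like any other word, which is the kind of error the abstract attributes to stemmers that ignore what kind of word they are processing. Passing it in the protected set leaves it untouched while ordinary words are still stemmed.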
License
Copyright (c) 2008 Eiman Tamah AlShammari (Author)

Articles are published under the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.