Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.11889/4443
Title: Arabic text categorization based on Arabic Wikipedia
Authors: Yahya, Adnan
Salhi, Ali
Keywords: Natural language processing (Computer science)
Computer network resources - Arab countries
Computational linguistics
Linguistics - Databases
Text processing (Computer science)
Issue Date: 1-Feb-2014
Publisher: ACM: Association for Computing Machinery
Citation: Adnan Yahya and Ali Salhi. 2014. Arabic Text Categorization Based on Arabic Wikipedia. ACM Transactions on Asian Language Information Processing 13, 1, Article 4 (February 2014), 20 pages. DOI: http://dx.doi.org/10.1145/2537129
Series/Report no.: doi:10.1145/2540989
Abstract: This paper describes an algorithm for categorizing Arabic text. It relies on highly categorized, corpus-based data sets obtained from the Arabic Wikipedia, using manual and automated processes to build and customize the categories. The algorithm was developed by starting from a simple categorization idea and then moving to a more complex one. We applied tests and filtration criteria to arrive at the best and most efficient results the algorithm can achieve. Categorization depends on the statistical relation between the input text and the reference (training) data, supported by well-defined Wikipedia-based categories. The algorithm supports two levels of categorization: categories are grouped into a hierarchy of main categories and subcategories. This introduces a challenge because certain subcategories are correlated and main categories overlap. We argue that our algorithm achieves good performance compared to other methods reported in the literature.
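The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: score an input text against reference data, pick a main category first, then a subcategory within it. Everything here is an assumption made for illustration and not the authors' method: the toy `REFERENCE` data, whitespace tokenization, raw term counts, cosine similarity as the "statistical relation", and the function names `term_counts`, `cosine`, and `categorize`.

```python
import math
from collections import Counter

# Hypothetical toy reference data (not from the paper):
# main category -> subcategory -> sample reference text.
REFERENCE = {
    "رياضة": {
        "كرة القدم": "كرة قدم مباراة لاعب هدف",
        "كرة السلة": "كرة سلة مباراة لاعب نقاط",
    },
    "اقتصاد": {
        "بنوك": "بنك قرض فائدة سوق",
        "أسهم": "أسهم سوق بورصة تداول",
    },
}


def term_counts(text):
    """Count whitespace-separated tokens (an assumed, very naive tokenizer)."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(text, reference=REFERENCE):
    """Return (main category, subcategory) using two-level scoring."""
    doc = term_counts(text)

    # Level 1: a main category's profile is the sum of its subcategory profiles.
    main_profiles = {
        main: sum((term_counts(sample) for sample in subs.values()), Counter())
        for main, subs in reference.items()
    }
    main = max(main_profiles, key=lambda m: cosine(doc, main_profiles[m]))

    # Level 2: compare only against subcategories of the chosen main category.
    subs = reference[main]
    sub = max(subs, key=lambda s: cosine(doc, term_counts(subs[s])))
    return main, sub


if __name__ == "__main__":
    print(categorize("مباراة كرة سلة"))
```

Restricting the second pass to the subcategories of the chosen main category keeps the comparison small, but it also means a wrong main-category choice cannot be recovered, which is one way the overlap between main categories mentioned in the abstract could hurt accuracy.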
Description: ACM Transactions on Asian Language Information Processing. Vol. 13, No. 1, Article 4. February 2014
URI: http://hdl.handle.net/20.500.11889/4443
Appears in Collections: Fulltext Publications

Files in This Item:
File                     Description   Size       Format
TALIP_PreFinalCopy.pdf                 651.3 kB   Adobe PDF

