Title: Quality Assessment of General and Categorized Arabic Text Corpora
Authors: Yahya, Adnan; Salhi, Ali
Keywords: Corpora (Linguistics); Corpora (Linguistics) - Data processing; Computational linguistics; Web search engines; Economic geography - Mathematical models; Multivariate analysis; Natural language processing (Computer science); Language processing (Arabic)
Issue Date: 2014
Abstract: Many Natural Language Processing and Information Retrieval methods rely on extensive use of text corpora, and the credibility of their results can be heavily influenced by the quality of the underlying corpus. Much research has applied Arabic corpora to various tasks of Arabic Information Processing. In this paper we discuss a suite of metrics that can be used to assess the quality of Arabic corpora. We borrow heavily from the extensive work on corpus quality for other languages and adapt some of those metrics for Arabic. We also apply these measures to sample corpora, including categorized corpora, and report on the results. The main corpora we experiment with are a general corpus extracted from newspaper articles (AlQuds newspaper) covering the years 2009-2012 and a highly categorized corpus (split into 9 major and 25 minor categories) built from Arabic Wikipedia. We employ different filtration methods and examine different corpus features as quality metrics. The metrics are based on parameters such as character/word N-gram frequencies and Zipf’s law applicability. We also study error rates, vocabulary properties, monolinguality, the effects of normalization, and corpus stability with size growth. We intend to make our corpus quality assessment tools available online for possible use by other researchers.
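Illustration (not part of the original record): the abstract names word N-gram frequencies and Zipf's law applicability among the metrics. The sketch below shows one common way such quantities might be computed. It is not the authors' tool. The function names `word_ngrams` and `zipf_slope` are hypothetical, the input is assumed to be an already tokenized list of words, and the log-log least-squares fit is an assumed choice of method. A slope near -1 is what an idealized Zipfian corpus would give.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: tokens is a pre-tokenized list of hashable items
    (words or n-gram tuples) with at least two distinct types.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct types to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Because `zipf_slope` accepts any hashable items, the same fit could be applied to the tuples returned by `word_ngrams` to inspect bigram or trigram distributions.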
Appears in Collections: Fulltext Publications

Files in This Item:
File: 921659f6eee8023ea31077c36690082dc089.pdf
Size: 1.11 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.