Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.11889/4371
Title: Quality Assessment of General and Categorized Arabic Text Corpora
Authors: Yahya, Adnan
Salhi, Ali
Keywords: Corpora (Linguistics)
Corpora (Linguistics) - Data processing
Computational linguistics
Web search engines
Economic geography - Mathematical models
Multivariate analysis
Natural language processing (Computer science)
Language processing (Arabic)
Issue Date: 2014
Abstract: Many Natural Language Processing and Information Retrieval methods rely on the extensive use of text corpora, and the credibility of their results can be heavily influenced by the quality of the underlying corpus. Much research has applied Arabic corpora to various tasks of Arabic Information Processing. In this paper we discuss a suite of metrics that can be used to ascertain the quality of Arabic corpora. We borrow heavily from the extensive work on corpus quality for other languages and try to adapt some of those measures for Arabic. We also apply these measures to sample corpora, including categorized corpora, and report on the results. The main corpora we experiment with are a general corpus extracted from newspaper articles (AlQuds newspaper) covering the years 2009-2012 and a highly categorized corpus (split into 9 major and 25 minor categories) built from Arabic Wikipedia. We employ different filtration methods and examine different corpus features as quality metrics. The metrics are based on parameters such as character/word N-gram frequencies and Zipf’s law applicability. We also study error rates, vocabulary properties, monolinguality, the effects of normalization, and corpus stability with size growth. We intend to make our corpus quality assessment tools available online for possible use by other researchers.
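Illustration: The abstract names word N-gram frequencies and Zipf's law applicability as quality metrics. The Python sketch below is one minimal way such a check could look. It is not the authors' tool: the function names (word_ngrams, rank_frequency, zipf_slope), the whitespace-style token list, and the least-squares fit on a log-log scale are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Count word N-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rank_frequency(tokens):
    """Return word frequencies sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    The classic form of Zipf's law predicts a slope close to -1.
    This fitting method is an illustrative assumption, not the
    procedure used in the paper.
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data (not from the paper): four common Arabic function words whose
# counts are 12, 6, 4, 3, i.e. exactly proportional to 1/rank.
toy_tokens = []
for rank, word in enumerate(["في", "من", "على", "إلى"], start=1):
    toy_tokens.extend([word] * (12 // rank))

slope = zipf_slope(rank_frequency(toy_tokens))
bigrams = word_ngrams(toy_tokens, 2)
```

On this toy input the fitted slope is -1 up to floating-point error, because the counts were constructed to follow 1/rank exactly. On a real corpus the slope would be estimated from far more word types, and a large deviation from -1 could be one signal worth inspecting.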
URI: http://hdl.handle.net/20.500.11889/4371
Appears in Collections: Fulltext Publications

Files in This Item:
File: 921659f6eee8023ea31077c36690082dc089.pdf
Size: 1.11 MB
Format: Adobe PDF

