DC Field | Value | Language
dc.contributor.author | Yahya, Adnan | -
dc.contributor.author | Salhi, Ali | -
dc.description.abstract | Many Natural Language Processing and Information Retrieval methods are based on the extensive use of text corpora. The credibility of the results can be heavily influenced by the underlying corpus quality. Much research has applied Arabic corpora to various tasks of Arabic Information Processing. In this paper we discuss a suite of metrics that can be used to ascertain the quality of Arabic corpora. We borrow heavily from the extensive work on corpora quality for other languages and try to adapt some of these measures for Arabic. We also apply these measures to sample corpora, including categorized corpora, and report on the results. The main corpora we experiment with are a general corpus extracted from newspaper (AlQuds newspaper) articles covering the years 2009-2012 and a highly categorized corpus (split into 9 major and 25 minor categories) built from Arabic Wikipedia. We employ different filtration methods and discuss and examine different corpora features as quality metrics. The metrics are based on parameters such as character/word N-gram frequencies and Zipf’s law applicability. We also study error rates, vocabulary properties, monolinguality, the effects of normalization, as well as corpora stability with size growth. It is our intention to make our corpora quality assessment tools available online for possible use by other researchers. | en_US
dc.subject | Corpora (Linguistics) | en_US
dc.subject | Corpora (Linguistics) - Data processing | en_US
dc.subject | Computational linguistics | en_US
dc.subject | Web search engines | en_US
dc.subject | Economic geography - Mathematical models | en_US
dc.subject | Multivariate Analyse | en_US
dc.subject | Natural language processing (Computer science) | en_US
dc.subject | Language processing (Arabic) | en_US
dc.title | Quality Assessment of General and Categorized Arabic Text Corpora | en_US
newfileds.department | Engineering and Technology | en_US
item.fulltext | With Fulltext | -
Appears in Collections: Fulltext Publications
Files in This Item:
File | Description | Size | Format
921659f6eee8023ea31077c36690082dc089.pdf | | 1.11 MB | Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.