Please use this identifier to cite or link to this item:
http://hdl.handle.net/20.500.11889/4371
Title: | Quality Assessment of General and Categorized Arabic Text Corpora |
Authors: | Yahya, Adnan; Salhi, Ali |
Keywords: | Corpora (Linguistics); Corpora (Linguistics) - Data processing; Computational linguistics; Web search engines; Economic geography - Mathematical models; Multivariate Analyse; Natural language processing (Computer science); Language processing (Arabic) |
Issue Date: | 2014 |
Abstract: | Many Natural Language Processing and Information Retrieval methods are based on the extensive use of text corpora. The credibility of the results can be heavily influenced by the underlying corpus quality. Much research has used Arabic corpora in various Arabic Information Processing tasks. In this paper we discuss a suite of metrics that can be used to ascertain the quality of Arabic corpora. We borrow heavily from the extensive work on corpora quality for other languages and try to adapt some of these metrics for Arabic. We also apply these measures to sample corpora, including categorized corpora, and report on the results. The main corpora we experiment with are a general corpus extracted from AlQuds newspaper articles covering the years 2009-2012 and a highly categorized corpus (split into 9 major and 25 minor categories) built from Arabic Wikipedia. We employ different filtration methods and examine different corpus features as quality metrics. The metrics are based on parameters such as character/word N-gram frequencies and Zipf's law applicability. We also study error rates, vocabulary properties, monolinguality, the effects of normalization, and corpus stability with size growth. It is our intention to make our corpora quality assessment tools available online for possible use by other researchers. |
URI: | http://hdl.handle.net/20.500.11889/4371 |
Appears in Collections: | Fulltext Publications |
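
The abstract names character/word N-gram frequencies and Zipf's law applicability as quality parameters. As an illustration only, the sketch below shows one common way such quantities could be computed. It is not the authors' tooling: the function names `char_ngrams` and `zipf_slope`, the whitespace tokenization, and the least-squares fit on a log-log scale are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a string (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption for this sketch: a slope near -1 is read as rough
    agreement with Zipf's law; the paper may use a different test.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    # Toy input; a real corpus would be read from files and normalized first.
    sample = "في البيت كتاب و في المكتبة كتاب و في المدرسة قلم"
    tokens = sample.split()  # naive whitespace tokenization (assumption)
    print(zipf_slope(tokens))
    print(char_ngrams(sample, 2).most_common(3))
```

On a real corpus the slope would typically be computed over many thousands of word types, and tracking it as the corpus grows is one possible way to look at stability with size.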
Files in This Item:
File | Description | Size | Format
---|---|---|---
921659f6eee8023ea31077c36690082dc089.pdf | | 1.11 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.