Diacritic-Based Matching of Arabic Words

Jarrar, Mustafa; Zaraket, Fadi; Asia, Rami; Amayreh, Hamzeh

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.11889/5888

DC Field	Value	Language
dc.contributor.author	Jarrar, Mustafa	-
dc.contributor.author	Zaraket, Fadi	-
dc.contributor.author	Asia, Rami	-
dc.contributor.author	Amayreh, Hamzeh	-
dc.date.accessioned	2019-03-26T06:28:48Z	-
dc.date.available	2019-03-26T06:28:48Z	-
dc.date.issued	2018-12	-
dc.identifier.citation	Mustafa Jarrar, Fadi Zaraket, Rami Asia, Hamzeh Amayreh: Diacritic-Based Matching of Arabic Words. ACM Transactions on Asian and Low-Resource Language Information Processing. Volume 18, No. 2, Pages 10:1-10:21, ACM, December 2018. ISSN 2375-4699.	en_US
dc.identifier.issn	2375-4699	-
dc.identifier.uri	http://hdl.handle.net/20.500.11889/5888	-
dc.description.abstract	Words in Arabic consist of letters and short vowel symbols, called diacritics, inscribed atop regular letters. Changing diacritics may change the syntax and semantics of a word, turning it into another word. This makes it difficult to compare words based solely on string matching. Typically, Arabic NLP applications resort to morphological analysis to battle ambiguity originating from this and other challenges. In this paper, we introduce three alternative algorithms to compare two words with possibly different diacritics. We propose the Subsume knowledge-based algorithm, the Imply rule-based algorithm, and the Alike machine-learning-based algorithm. We evaluated the soundness, completeness, and accuracy of the algorithms against a large dataset of 86,886 word pairs. Our evaluation shows an accuracy of 100% for Subsume, 99.32% for Imply, and 99.53% for Alike. Although accurate, Subsume was able to judge only 75% of the data. Both Subsume and Imply are sound, while Alike is not. We demonstrate the utility of the algorithms using a real-life use case in lemma disambiguation and in linking hundreds of Arabic dictionaries.	en_US
dc.language.iso	en_US	en_US
dc.publisher	ACM	en_US
dc.relation.ispartofseries	Vol. 18, No 2;	-
dc.subject	Natural language processing (Computer science)	en_US
dc.subject	Phonology, Arabic - Data processing	en_US
dc.subject	Grammar, Arabic - Phonology	en_US
dc.subject	Language resources	en_US
dc.subject	Computational linguistics	en_US
dc.subject	Diacritics	en_US
dc.subject	Disambiguation	en_US
dc.subject	Ambiguity - Data processing	en_US
dc.title	Diacritic-Based Matching of Arabic Words	en_US
dc.type	Article	en_US
newfileds.department	Engineering and Technology	en_US
newfileds.conference	ACM Asian and Low-Resource Language Information Processing	en_US
newfileds.item-access-type	open_access	en_US
newfileds.thesis-prog	none	en_US
newfileds.general-subject	Computers and Information Technology \| الحاسوب وتكنولوجيا المعلومات	en_US
item.grantfulltext	open	-
item.languageiso639-1	other	-
item.fulltext	With Fulltext	-
Appears in Collections:	Fulltext Publications
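
Illustrative note: the abstract observes that plain string matching fails when two spellings of a word carry different diacritics. The Python sketch below shows that problem and one simple way around it. It is not the paper's Subsume, Imply, or Alike algorithm; the function names (split_letters, strip_diacritics, compatible) and the rule "two words are compatible when their base letters agree and no letter carries conflicting marks" are assumptions made only for illustration. The diacritic set is assumed to be the Unicode range U+064B to U+0652.

```python
# Assumed diacritic set: tanween, fatha, damma, kasra, shadda, sukun (U+064B..U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_letters(word):
    """Return a list of (letter, marks) pairs, one pair per base letter."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding letter; skip a stray leading mark
                letter, marks = pairs[-1]
                pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def strip_diacritics(word):
    """Drop all diacritics, keeping only the base letters."""
    return "".join(letter for letter, _ in split_letters(word))


def compatible(a, b):
    """Hypothetical rule (not from the paper): same base letters, and wherever
    both words mark a letter, the marks must be the same set."""
    pa, pb = split_letters(a), split_letters(b)
    if len(pa) != len(pb):
        return False
    for (la, ma), (lb, mb) in zip(pa, pb):
        if la != lb:
            return False
        if ma and mb and set(ma) != set(mb):
            return False
    return True


if __name__ == "__main__":
    kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf+fatha, ta+fatha, ba+fatha
    kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"  # kaf+damma, ta+kasra, ba+fatha
    bare = "\u0643\u062A\u0628"                      # kaf, ta, ba with no diacritics
    print(kataba == bare)              # plain string equality misses the match
    print(compatible(kataba, bare))    # undiacritized form is compatible
    print(compatible(kataba, kutiba))  # conflicting vowels are not
```

A rule this simple treats a partially diacritized word as compatible with any fuller spelling that does not contradict it, and it ignores the linguistic knowledge that the paper's algorithms rely on.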

Files in This Item:

File	Description	Size	Format
JZAA18.pdf	Article	5.98 MB	Adobe PDF
