An efficient, font independent word and character segmentation algorithm for printed Arabic text

Qaroush, Aziz; Jaber, Bassam; Mohammad, Khader; Washaha, Mahdi; Maali, Eman; Nayef, Nibal

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.11889/8363

Title:	An efficient, font independent word and character segmentation algorithm for printed Arabic text
Authors:	Qaroush, Aziz; Jaber, Bassam; Mohammad, Khader; Washaha, Mahdi; Maali, Eman; Nayef, Nibal
Keywords:	Arabic OCR;Machine learning;Neural networks (Computer science);Optical Character Recognition;Word segmentation;Speech perception - Arabic language;Character segmentation;Natural language processing (Computer science);Image analysis;Image processing - Digital techniques;Segmentation techniques;Baseline;Projection profile
Issue Date:	2022
Publisher:	Journal of King Saud University - Computer and Information Sciences
Abstract:	Character segmentation is a necessary and the most critical stage in an Arabic OCR system, and it has attracted the interest of a wide range of researchers. However, the cursive nature of Arabic script poses extra challenges that need further investigation. A reliable and efficient Arabic OCR system that is independent of font variations is therefore highly desirable. In this paper, an indirect, font-independent word and character segmentation algorithm for printed Arabic text is investigated. The proposed algorithm takes a binary line image as input and produces a set of binary images, each containing one character or ligature, as output. Segmentation is performed at two levels. At the first level, word segmentation applies a vertical projection to the input line image and uses the Interquartile Range (IQR) method to differentiate between gaps that separate words and gaps within a word. At the second level, a projection profile method is used together with a set of font-independent statistical and topological features to identify the correct segmentation points among all potential points. The algorithm was tested on the APTI dataset with a variety of font types, sizes, and styles. It was evaluated on 1800 lines (approximately 24,816 words), with an average accuracy of 97.7% for word segmentation and 97.51% for character segmentation.
URI:	http://hdl.handle.net/20.500.11889/8363
DOI:	10.1016/j.jksuci.2019.08.013
Appears in Collections:	Fulltext Publications
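
The first segmentation level described in the abstract (vertical projection plus an IQR test on gap widths) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the ink = 1 convention, and the outlier rule "gap width > Q3 + k * IQR" with k = 1.5 are assumptions made for illustration, since the record does not state the exact decision rule.

```python
import numpy as np

def vertical_projection(line_img):
    """Count ink pixels in each column of a binary line image (assumes ink = 1)."""
    return np.asarray(line_img).astype(bool).sum(axis=0)

def find_gaps(profile):
    """Return (start_column, width) for each run of empty columns between ink columns."""
    ink_cols = np.flatnonzero(profile > 0)
    if ink_cols.size == 0:
        return []
    # A jump larger than 1 between consecutive ink columns marks an empty run.
    jumps = np.diff(ink_cols)
    return [(int(ink_cols[i] + 1), int(jumps[i] - 1))
            for i in np.flatnonzero(jumps > 1)]

def word_gaps(gaps, k=1.5):
    """Keep gaps whose width is an upper outlier under the IQR rule.

    Assumed rule (hypothetical, not taken from the paper):
    a gap separates two words when width > Q3 + k * IQR.
    """
    widths = np.array([w for _, w in gaps], dtype=float)
    if widths.size == 0:
        return []
    q1, q3 = np.percentile(widths, [25, 75])
    threshold = q3 + k * (q3 - q1)
    return [(s, w) for s, w in gaps if w > threshold]
```

Calling `word_gaps(find_gaps(vertical_projection(img)))` on a binary line image returns the column ranges that this sketch treats as word boundaries; the remaining, narrower gaps are treated as gaps within a word. The test is direction-agnostic, so the right-to-left reading order of Arabic only matters when the resulting word images are ordered.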

Files in This Item:

File	Description	Size	Format
An efficient, font independent word and character segmentation algorithm for printed Arabic text.pdf		2.96 MB	Adobe PDF
