An edge-based text region extraction from document images using connected component analysis

R. Pradeep Kumar Reddy; N. Subramanyam; Dr. C. Nagaraju

Year: 2014
Volume: 4
Issue: 7

Author:
R. Pradeep Kumar Reddy¹, N. Subramanyam², C. Nagaraju³
Total Page Count: 13
Page Number: 230 to 242

¹Assistant Professor, Department Of Computer Science And Engineering, Y.S.R Engineering College Of YVU, Proddatur-516360, Andhra Pradesh, India

²Academic Consultant, Department Of Computer Science And Engineering, Y.S.R Engineering College Of YVU, Proddatur-516360, Andhra Pradesh, India

³Associate Professor, Department Of Computer Science And Engineering, Y.S.R Engineering College Of YVU, Proddatur-516360, Andhra Pradesh, India

Online published on 13 August, 2014.

Abstract

Detecting text embedded in complex colour document images is a challenging problem. Text extraction has many potential uses, including image search and document archiving. Its objective is to recognize the text and graphic components of a document and to extract the intended information as a human would. This paper proposes an edge-based technique that uses connected component analysis to separate text and non-text regions in a document image. The maximum edge magnitude is detected by convolving the image with compass masks in eight major directions. In the subsequent localization step, the edge magnitude is compared with a threshold value to generate the edge map. Morphological operations are applied for tasks such as boundary detection, noise removal, component identification and convex hull computation. Run-length smoothing is used to find the connected components in both the horizontal and vertical directions. Using connected component analysis and pixel neighborhood, a bounding box is drawn for each component; each block is then classified as either text or non-text using spatial features such as the height, width and area of its component. The text blocks are then given as input to the segmentation stage of an OCR system, which converts them into an electronic, machine-editable representation.

Keywords

Compass mask, Connected Component Analysis, Morphological Operators, OCR, Run-Length Smoothing, Text Extraction, Threshold
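
The stages named in the abstract (compass-mask edge detection, thresholding, run-length smoothing, connected component labelling and size-based classification) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not say which compass masks are used, so Kirsch masks are assumed here; the function names, the sequential horizontal-then-vertical smoothing, the 8-connectivity and all numeric limits are likewise hypothetical choices made for illustration. Morphological clean-up is omitted.

```python
# Illustrative sketch only: Kirsch masks, gap sizes and size limits are assumptions.
from collections import deque

# Outer ring of a 3x3 mask, clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
BASE = [5, 5, 5, -3, -3, -3, -3, -3]  # ring values of the "north" Kirsch mask

def compass_masks():
    """Return the eight Kirsch masks by rotating the ring in 45-degree steps."""
    masks = []
    for shift in range(8):
        mask = [[0] * 3 for _ in range(3)]
        for i, (r, c) in enumerate(RING):
            mask[r][c] = BASE[(i - shift) % 8]
        masks.append(mask)
    return masks

def edge_map(gray, threshold):
    """Binary edge map: 1 where the maximum compass response exceeds threshold."""
    h, w = len(gray), len(gray[0])
    masks = compass_masks()
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):          # border pixels are left at 0
        for x in range(1, w - 1):
            best = max(
                sum(m[i][j] * gray[y + i - 1][x + j - 1]
                    for i in range(3) for j in range(3))
                for m in masks
            )
            edges[y][x] = 1 if best > threshold else 0
    return edges

def smooth_runs(line, max_gap):
    """Fill runs of 0s of length <= max_gap that lie between two 1s."""
    out = list(line)
    last_one = None
    for i, v in enumerate(line):
        if v:
            if last_one is not None and 0 < i - last_one - 1 <= max_gap:
                for k in range(last_one + 1, i):
                    out[k] = 1
            last_one = i
    return out

def rlsa(binary, h_gap, v_gap):
    """Run-length smoothing: horizontal pass, then vertical pass on its result."""
    horiz = [smooth_runs(row, h_gap) for row in binary]
    cols = [smooth_runs(col, v_gap) for col in zip(*horiz)]
    return [list(row) for row in zip(*cols)]

def bounding_boxes(binary):
    """Label 8-connected components; return (top, left, bottom, right) per component."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def is_text_block(box, min_h=2, max_h=40, min_aspect=1.5, min_area=6):
    """Hypothetical rule: text blocks are short, wide and not tiny."""
    top, left, bottom, right = box
    height, width = bottom - top + 1, right - left + 1
    return (min_h <= height <= max_h
            and width / height >= min_aspect
            and width * height >= min_area)
```

A typical call chain under these assumptions would be `boxes = bounding_boxes(rlsa(edge_map(gray, threshold), h_gap, v_gap))`, followed by `[b for b in boxes if is_text_block(b)]` to keep the candidate text blocks. Here `gray` is a list of rows of grey-level values, and the threshold and gap sizes would have to be tuned to the scan resolution.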
