ZENITH International Journal of Multidisciplinary Research

  • Year: 2014
  • Volume: 4
  • Issue: 7

An edge-based text region extraction from document images using connected component analysis

  • Author: R. Pradeep Kumar Reddy1, N. Subramanyam2, C. Nagaraju3
  • Total Page Count: 13
  • DOI:
  • Page Number: 230 to 242

1Assistant Professor, Department Of Computer Science And Engineering, Y.S.R Engineering College Of YVU, Proddatur-516360, Andhra Pradesh, India

2Academic Consultant, Department Of Computer Science And Engineering, Y.S.R Engineering College Of YVU, Proddatur-516360, Andhra Pradesh, India

3Associate Professor, Department Of Computer Science And Engineering, Y.S.R Engineering College Of YVU, Proddatur-516360, Andhra Pradesh, India

Abstract

Detecting text embedded in complex colour document images is a challenging problem. Text extraction has many potential uses, including image search and document archiving. Its objective is to distinguish the text and graphic components of a document and to extract the intended information as a human would. This paper proposes an edge-based technique that uses connected component analysis to separate text and non-text regions in a document image. The maximum edge magnitude is detected by convolution filtering with compass masks in eight major directions. In the subsequent localization step, the edge magnitude is compared with a threshold value to generate the edge map. Morphological operations are applied for tasks such as boundary detection, noise removal, component identification, and convex hull computation. Run-length smoothing is used to find the connected components in both the horizontal and vertical directions. Using connected component analysis and pixel neighbourhood, a bounding box is drawn for each component; each block is then classified as text or non-text using spatial features such as the height, width, and area of its component. The text blocks are then passed to the segmentation stage of an OCR system, which converts them into an electronic, machine-editable representation.
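
The pipeline outlined in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not name the compass mask family, so Kirsch masks are assumed here; the function names, the order in which horizontal and vertical smoothing are applied, and the classification limits (max_height, min_aspect, min_area) are all hypothetical. The morphological operations are omitted for brevity.

```python
from collections import deque

# Clockwise order of the eight border cells of a 3x3 mask.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
KIRSCH_RING = [5, 5, 5, -3, -3, -3, -3, -3]  # assumption: Kirsch weights


def compass_masks():
    """Build eight 3x3 masks by rotating the border weights one step at a time."""
    masks = []
    for shift in range(8):
        mask = [[0] * 3 for _ in range(3)]
        for i, (r, c) in enumerate(RING):
            mask[r][c] = KIRSCH_RING[(i - shift) % 8]
        masks.append(mask)
    return masks


def edge_map(gray, threshold):
    """Return 1 where the maximum compass response exceeds the threshold."""
    h, w = len(gray), len(gray[0])
    masks = compass_masks()
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):          # border pixels are left as 0
        for x in range(1, w - 1):
            best = max(
                sum(m[r][c] * gray[y + r - 1][x + c - 1]
                    for r in range(3) for c in range(3))
                for m in masks
            )
            edges[y][x] = 1 if best > threshold else 0
    return edges


def rlsa_rows(binary, max_gap):
    """Horizontal run-length smoothing: fill background gaps of at most
    max_gap pixels that lie between two foreground pixels."""
    out = [row[:] for row in binary]
    for row in out:
        last = None                    # column of the previous foreground pixel
        for x, value in enumerate(row):
            if value:
                if last is not None and 0 < x - last - 1 <= max_gap:
                    for k in range(last + 1, x):
                        row[k] = 1
                last = x
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def bounding_boxes(binary):
    """Label 8-connected components; return (top, left, bottom, right) boxes."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def is_text_block(box, max_height=40, min_aspect=1.5, min_area=20):
    """Hypothetical rule on height, width and area; limits are placeholders."""
    top, left, bottom, right = box
    height, width = bottom - top + 1, right - left + 1
    return (height <= max_height and width / height >= min_aspect
            and width * height >= min_area)


def extract_text_blocks(gray, threshold, h_gap, v_gap):
    """Edge map -> horizontal then vertical smoothing -> boxes -> text filter."""
    smoothed = rlsa_rows(edge_map(gray, threshold), h_gap)
    smoothed = transpose(rlsa_rows(transpose(smoothed), v_gap))
    return [box for box in bounding_boxes(smoothed) if is_text_block(box)]
```

Here gray is assumed to be a list of rows of grey-level intensities. In practice the threshold and the two smoothing gaps would need tuning to the scan resolution and font size of the documents being processed.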

Keywords

Compass mask, Connected Component Analysis, Morphological Operators, OCR, Run-Length Smoothing, Text Extraction, Threshold