Department of Computer Applications, Kalasalingam University, Krishnankoil, Srivilliputtur (via), Tamil Nadu, India, 626 190. E-mail:
*Corresponding author E-mail: angadiub@gmail.com
In bioinformatics, an enormous amount of biological data is being accumulated by genome sequencing projects all over the globe. Transforming this data into useful information and knowledge has become an important and challenging task for both computer scientists and biologists. One of the problems arising in the analysis of biological sequences is the discovery of similar motifs/features in a set of sequences. Such motifs usually correspond to residues conserved during evolution because of an important structural or functional role. In this paper, we develop a new algorithm, GENETAL, based on genetic theory, for the discovery of motifs/features in biological sequences and text documents. The algorithm produces all motifs that appear in at least a user-defined minimum number of sequences. On large data sets it is very efficient, in both space and time complexity, compared to other existing algorithms. We also demonstrate clustering of DNA/protein sequences and text documents, using GENETAL as the feature extraction algorithm together with a simple incremental clustering technique and the Jaccard coefficient as the dissimilarity measure.
Motif discovery, Clustering, DNA/Protein sequences, Pattern recognition
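
To make the clustering pipeline named in the abstract concrete, the following is a minimal sketch and not the GENETAL algorithm itself. It assumes that fixed-length substrings (k-mers) stand in for discovered motifs, that a motif is kept when it occurs in at least `min_support` sequences, and that the incremental clustering is a single-pass, leader-style scheme with a dissimilarity `threshold`. The function names, the value of `k`, the threshold, and the toy sequences are all illustrative assumptions.

```python
from collections import Counter


def kmers(seq, k=3):
    """Return the set of length-k substrings of seq (a stand-in for motifs)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def frequent_kmers(seqs, k=3, min_support=2):
    """Keep k-mers that occur in at least min_support distinct sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(kmers(s, k))  # each sequence contributes a k-mer once
    return {m for m, c in counts.items() if c >= min_support}


def jaccard_dissimilarity(a, b):
    """1 - |A & B| / |A | B|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def incremental_cluster(feature_sets, threshold=0.5):
    """Single-pass clustering: an item joins the first cluster whose leader
    is within threshold, otherwise it starts a new cluster."""
    leaders, clusters = [], []
    for idx, feats in enumerate(feature_sets):
        for c, leader in enumerate(leaders):
            if jaccard_dissimilarity(feats, leader) <= threshold:
                clusters[c].append(idx)
                break
        else:
            leaders.append(feats)
            clusters.append([idx])
    return clusters


# Toy data (assumed for illustration only).
sequences = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTGGGGG"]
frequent = frequent_kmers(sequences, k=3, min_support=2)
features = [kmers(s) & frequent for s in sequences]
clusters = incremental_cluster(features, threshold=0.5)
print(clusters)
```

Because each sequence is compared only against cluster leaders, the pass over the data is made once, which is the property that makes incremental schemes attractive for large sets; the result does depend on the order in which sequences arrive.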