Identification of Protein Coding Regions in Microbial Genomes Using Unsupervised Clustering


Authors

Konda, Jayashree

Issue Date

2009

Type

Thesis

Abstract

The genomes of many organisms have now been sequenced: their nucleotide sequence is known, but the locations of genes and, most importantly, of coding regions remain unknown. Identifying coding regions is of vital importance, as they code for proteins. Distinguishing between coding and non-coding regions is a difficult undertaking and has been the subject of many research efforts. We describe here an unsupervised clustering algorithm for identifying protein coding regions in microbial genomic DNA sequences. The algorithm is based on a simple measure, the vector of nucleotide frequencies in a sliding window, and uses an ab initio iterative Markov modeling procedure to partition genomic sequences into three classes: coding, coding on the opposite strand, and non-coding regions. The algorithm is very efficient and can be applied to any type of microbial genome, including those of uncharacterized microorganisms. Building on a method developed by Audic and Claverie, we improved the accuracy of finding coding regions and also located the nearest transition point from one class to another with an accuracy matching or exceeding that of the best currently used gene detection methods. The method was tested on 18 complete microbial genomes from GenBank, which cover four major phylogenetic lineages (Gram-negative, Gram-positive, cyanobacteria, and archaea). The results showed improved performance in predicting the coding regions of microbial genomes.
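
The sliding-window measure named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `window_frequencies`, the window length of 96, and the step of 12 are all assumptions chosen for illustration, and the Markov modeling and clustering steps are not shown.

```python
from collections import Counter

BASES = "ACGT"


def window_frequencies(seq, window=96, step=12):
    """Return a list of (start, [fA, fC, fG, fT]) for each sliding window.

    The window and step sizes are illustrative assumptions, not values
    taken from the thesis.
    """
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        counts = Counter(seq[start:start + window])
        # Ignore ambiguous symbols such as N when normalizing.
        total = sum(counts[b] for b in BASES)
        if total == 0:
            freqs = [0.0] * len(BASES)
        else:
            freqs = [counts[b] / total for b in BASES]
        result.append((start, freqs))
    return result
```

Each window yields one four-component vector, and a clustering procedure would then assign these vectors to classes such as coding, coding on the opposite strand, and non-coding.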

License

In Copyright (All Rights Reserved)
