Decoding DNA: scalable genome annotation software, generative AI, and application to the cactus pear genome

Authors

Lomas, Johnathan

Issue Date

2025

Type

Dissertation

Language

en_US

Keywords

Bioinformatics pipelines, Genome Annotation, Genomics, Large Language Models

Abstract

The annotation of protein-coding genes from a raw genome assembly involves identifying the sequence regions that are transcribed into mRNA transcripts, spliced, and ultimately translated into protein. Thus, genome annotation is a fundamental task that critically supports systems-level biological inquiry by providing a reference for ‘omics’ analyses, including RNA sequencing, proteomics, and comparative genomics. Despite advances in genome sequencing, which enable routine chromosome-level genome assembly, complete identification of gene structures in eukaryotes remains a significant challenge. For example, existing genome annotation pipelines suffer from a lack of automation, an inability to control precision, and poor performance in predicting alternative splicing.
The goal of this work is to develop improved computational tools for genome annotation and to apply the improved methods to annotate the genome of cactus pear (Opuntia cochenillifera). To address challenges in computational efficiency and precision, an automated bioinformatics pipeline called Sylvan was developed that computes a comprehensive genome annotation from disparate evidence sources and filters spurious gene models using a semi-supervised random forest classifier. In benchmarking trials involving Arabidopsis thaliana and Oryza sativa, the pipeline outperformed current standards, such as MAKER and BRAKER, in both F1 similarity and BUSCO completeness. Sylvan was used to annotate the genome of Opuntia cochenillifera, representing the first genome sequence and assembly in the genus and a foundational tool for research into crassulacean acid metabolism (CAM) and drought tolerance in plants. To improve the capacity of genome annotation tools to predict full-length, alternatively spliced transcripts ab initio, a deep learning transformer model was developed to ‘translate’ a DNA sequence into its text-based annotation. This generative strategy provides increased flexibility to predict hierarchical and overlapping gene structures that are not possible with one-dimensional segmentation models.
