Decoding DNA: scalable genome annotation software, generative AI, and application to the cactus pear genome
Authors
Lomas, Johnathan
Issue Date
2025
Type
Dissertation
Language
en_US
Keywords
Bioinformatics pipelines, Genome Annotation, Genomics, Large Language Models
Abstract
The annotation of protein-coding genes from a raw genome assembly involves identifying the sequence regions that are transcribed into mRNA transcripts, spliced, and ultimately translated into protein. Thus, genome annotation is a fundamental task that critically supports systems-level biological inquiry by providing a reference for ‘omics’ analyses, including RNA sequencing, proteomics, and comparative genomics. Despite advances in genome sequencing, which enable routine chromosome-level genome assembly, complete identification of gene structures in eukaryotes remains a significant challenge. For example, existing genome annotation pipelines suffer from a lack of automation, an inability to control precision, and poor performance in predicting alternative splicing.
The goal of this work is to develop improved computational tools for genome annotation and to apply the improved methods to annotate the genome of cactus pear (Opuntia cochenillifera). To address challenges in computational efficiency and precision, an automated bioinformatics pipeline called Sylvan was developed that computes a comprehensive genome annotation from disparate evidence sources and filters spurious gene models using a semi-supervised random forest classifier. In benchmarking trials involving Arabidopsis thaliana and Oryza sativa, the pipeline outperformed current standards, such as MAKER and BRAKER, in both F1 similarity and BUSCO completeness. Sylvan was used to annotate the genome of Opuntia cochenillifera, representing the first genome sequence and assembly in the genus and a foundational tool for research into crassulacean acid metabolism (CAM) and drought tolerance in plants. To improve the capacity of genome annotation tools to predict full-length, alternatively spliced transcripts ab initio, a deep learning transformer model was developed to ‘translate’ a DNA sequence into its text-based annotation. This generative strategy provides increased flexibility to predict hierarchical and overlapping gene structures that are not possible with one-dimensional segmentation models.