Undergraduate Summer Research on RNA Structure Prediction

Overview


[Figure: secondary structure]
RNA folding is like nature's molecular origami. Instead of paper, we have sequences of nucleotides. Instead of creases in the paper, hydrogen bonds form between nucleotides and cause the linear chain to fold back on itself. Such folded RNA molecules perform a variety of essential biological functions, and those functions depend on the structure created by this folding. Understanding RNA folding is therefore an important part of understanding these vital functions. Unfortunately, computational methods cannot reliably predict the molecular structures of RNA sequences.

Abstractly, an RNA sequence can be treated as a string over the nucleotide alphabet A, C, G, and U (which stand for adenine, cytosine, guanine, and uracil, respectively). The string is said to fold into a set of nested base pairs, known as a secondary structure, via the pairing of complementary nucleotides (for example, a G is complementary to a C, not to an A).
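The string view above can be made concrete with a small sketch. It assumes the widely used dot-bracket notation, in which matching parentheses mark paired positions and dots mark unpaired ones; the notation, the function names, and the inclusion of the G-U wobble pair are assumptions made for illustration and are not taken from the project description.

```python
# Pairs treated as complementary in this sketch: the Watson-Crick pairs
# plus the G-U wobble pair (an assumption; the text only gives G-C as an example).
COMPLEMENTARY = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def parse_dot_bracket(structure):
    """Return the sorted list of base pairs (i, j), i < j, in a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % j)
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r at position %d" % (symbol, j))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return sorted(pairs)


def is_nested(pairs):
    """True if no two pairs cross, i.e. there are no i < k < j < l with (i, j) and (k, l)."""
    return not any(i < k < j < l for (i, j) in pairs for (k, l) in pairs)


def is_valid_structure(sequence, pairs):
    """Check that each position pairs at most once, pairs are nested, and bases are complementary."""
    positions = [p for pair in pairs for p in pair]
    if len(positions) != len(set(positions)):
        return False
    return is_nested(pairs) and all(
        (sequence[i], sequence[j]) in COMPLEMENTARY for i, j in pairs
    )


pairs = parse_dot_bracket("(((...)))")
print(pairs, is_valid_structure("GGGAAAUCC", pairs))
```

A set of pairs such as (0, 2) and (1, 3) fails the nestedness check because the two pairs cross; pairings of that kind belong to the three-dimensional structure rather than to the secondary structure.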


[Figure: 3-dimensional molecule]
Further, non-nested pairings then form, creating additional folds that eventually produce the three-dimensional structure. Accurate prediction of secondary structure is an important intermediate step toward correctly predicting the three-dimensional structure. Predicting the secondary structure of an RNA sequence is typically approached as an optimization problem based on the strength and stability of the RNA nucleotide pairings. This makes many aspects of predicting RNA secondary structure amenable to mathematical modeling and analysis.
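To illustrate what "approached as an optimization problem" can look like, here is a minimal sketch of a classic simplification: a Nussinov-style dynamic program that maximizes the number of nested complementary pairs. This is a stand-in chosen for brevity, not the thermodynamic objective used in practice; the pairing rules, the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair), and the function name are assumptions.

```python
# Pairs allowed in this sketch: Watson-Crick plus G-U wobble (an assumption).
CAN_PAIR = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def max_pairs(sequence, min_loop=3):
    """Maximum number of nested base pairs, with at least min_loop bases inside each pair."""
    n = len(sequence)
    if n == 0:
        return 0
    # best[i][j] = maximum number of nested pairs within sequence[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is left unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with some k, leaving at least min_loop bases between.
            for k in range(i, j - min_loop):
                if (sequence[k], sequence[j]) in CAN_PAIR:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


print(max_pairs("GGGAAAUCC"))
```

The table is filled from short subsequences to long ones, so each entry only depends on entries already computed. Free energy minimization follows the same recursive idea but scores loops and stacked pairs rather than counting pairs.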

A natural question, then, is "how close are these predictions to the structures that appear in nature?" The answer is not clear, which brings us to the proposed project areas for this summer.



Parametric Analysis of the Nearest Neighbor Thermodynamic Model

The optimization problem mentioned in the overview is typically based on the Nearest Neighbor Thermodynamic Model (NNTM), and seeks to minimize the free energy of the structure. The energy function used to score the secondary structures involves thousands of parameters. Yet the optimal secondary structures obtained under this scoring scheme are often very different from the native structures, when the latter are known. This project will focus on using geometric combinatorics to analyze the space of parameters used in the energy function and the set of possible optimal structures.

Comparing and Clustering RNA Secondary Structures

As mentioned above, a structure that has minimum free energy as defined by the NNTM may not be the native structure. It is therefore of interest to return a set of suboptimal structures instead of a single structure with minimum free energy; that is, to produce a set of structures whose free energy is within some range of the minimum. Such sets become very large for long RNA sequences, and methods are needed to analyze them. Students in this project will learn about various metrics on RNA secondary structures, and use those metrics along with clustering algorithms to summarize or classify the types of structures that are near optimal.
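As a rough illustration of pairing a metric with a clustering step, the sketch below uses the base-pair distance (the number of pairs present in one structure but not the other) and a simple greedy "leader" clustering with a distance threshold. Both choices, the dot-bracket input format, and all function names are assumptions made for this example; the project may use other metrics and algorithms.

```python
def pairs_of(structure):
    """Set of base pairs (i, j) encoded by a balanced dot-bracket string."""
    stack, pairs = [], set()
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            pairs.add((stack.pop(), j))
    return frozenset(pairs)


def base_pair_distance(a, b):
    """Number of base pairs found in exactly one of the two structures."""
    return len(pairs_of(a) ^ pairs_of(b))


def leader_clusters(structures, threshold):
    """Greedy clustering: a structure joins the first cluster whose leader
    (its first member) is within `threshold`; otherwise it starts a new cluster."""
    clusters = []
    for s in structures:
        for members in clusters:
            if base_pair_distance(s, members[0]) <= threshold:
                members.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Hypothetical near-optimal structures for a 9-nucleotide sequence.
candidates = ["(((...)))", "((.....))", ".((...)).", "(...)....", "(....)..."]
for group in leader_clusters(candidates, threshold=2):
    print(group)
```

With a threshold of 2, the three structures that share the long-range stem fall into one group and the two short hairpins into another. Leader clustering depends on the order of the input, which is one reason to compare it against other clustering algorithms.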



Prerequisites

The projects are interdisciplinary, involving mathematics, computer science, and biology. Students should have taken at least two proof-based mathematics courses (e.g., abstract algebra, number theory, analysis, topology). As the projects involve some computational components, we especially encourage applications from students who have programming experience. Students with less experience in advanced mathematics but significant programming experience (the equivalent of at least two project-based computer science courses) will also be considered. All relevant biology will be presented during the early weeks of the project.

Inquiries

Please visit the GA Tech School of Mathematics Summer REU page for application details.
For more information about the Undergraduate Summer Research on RNA Structure Prediction, please e-mail RNAREU@math.gatech.edu