Identification and Classification of Intrinsically Disordered Regions in Proteins

Partner: Marc Singleton, Biophysics graduate program, UC Berkeley, Academic

Overview

Project Description

Regions within proteins can be broadly classified into two types: ordered and disordered. Ordered regions assume a defined three-dimensional structure and are identified by their unique sequence of amino acids. Disordered regions, by contrast, can adopt a variety of structures, and their amino acid sequences can vary dramatically between equivalent proteins in different species. The relationship between sequence, disorder, and function in a protein remains poorly understood. Because roughly one third of proteins contain substantial regions of disorder, this gap poses a significant barrier to predicting protein function. The goal of this project is to evaluate existing methods of disorder prediction, improve on them as necessary, and develop models for classifying disordered regions using state-of-the-art machine learning techniques.

Expected Deliverable

The primary deliverable would be a model that segments protein sequences into ordered and disordered regions. If time allows, the students can develop a second model to further classify the disordered regions into subcategories.
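
As a rough illustration of what segmenting a sequence could mean in practice, the sketch below assumes a predictor that emits one disorder score between 0 and 1 per residue. That output format is an assumption, as the project does not specify it. Scores are thresholded, and consecutive residues with the same label are merged into regions. The function name segment_by_threshold, the 0.5 cutoff, and the example scores are all hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

# (start, end_exclusive, label), using 0-based residue positions.
Segment = Tuple[int, int, str]


def segment_by_threshold(
    scores: Sequence[float], threshold: float = 0.5
) -> List[Segment]:
    """Merge consecutive residues into ordered/disordered regions.

    A residue is called disordered when its score is >= threshold.
    The 0.5 default is an illustrative assumption, not a recommended cutoff.
    """
    segments: List[Segment] = []
    position = 0
    # groupby yields one run per stretch of residues sharing the same label.
    for is_disordered, run in groupby(scores, key=lambda s: s >= threshold):
        length = sum(1 for _ in run)
        label = "disordered" if is_disordered else "ordered"
        segments.append((position, position + length, label))
        position += length
    return segments


# Made-up per-residue scores for a 10-residue sequence (not real data).
example_scores = [0.1, 0.2, 0.1, 0.7, 0.9, 0.8, 0.6, 0.3, 0.2, 0.1]
example_segments = segment_by_threshold(example_scores)
print(example_segments)
# [(0, 3, 'ordered'), (3, 7, 'disordered'), (7, 10, 'ordered')]
```

A real model would likely need more than a fixed cutoff, for example smoothing the scores or enforcing a minimum region length, but the output shape (a list of labeled intervals) is a convenient target for both evaluation and downstream classification.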

What would a successful semester look like to you?

I would consider the semester a success if a student comes away with a greater understanding of the structure and function of proteins, bioinformatics, and modern data science techniques. Ideally, the first 2 weeks would be spent on background reading, software setup, and detailed project planning. The following 6 weeks would be dedicated to evaluating existing disorder prediction tools. Depending on those results, the next 6 weeks would focus on developing novel models, either to predict disorder or to further classify the disordered sequences identified by an existing tool. The final weeks would consist of organizing and preparing the results for presentation.

Data

Models

Conclusion