About this course
Intro: What is protein function and what are aspects of function that can be predicted?
Predicting protein function using sequence: sequence alignments, multiple sequence alignments, motifs, domain assignment, annotation transfer by homology, de novo predictions.
Predicting protein function using structure: structural alignments, structural motifs, annotation transfer via structure similarity.
From structure prediction to function prediction: comparative modeling; prediction of subcellular location, protein-protein interactions, protein-DNA and protein-RNA interactions, protein-substrate interactions, protein networks, Gene Ontology (GO), Enzyme Commission (EC) classes, enzymatic activity, and functional classes (e.g. GO classes).
Prediction of the effect of single point mutations (sequence variants) on protein function and the organism. Prediction of phenotype from genotype.
As in the first part (Protein Prediction 1), the module focuses on machine learning-based methods, with particular emphasis on strategies to avoid over-estimating performance. Protein structure plays an increasingly important role in function prediction.
Learning outcomes
Students will learn to understand crucial aspects of protein function prediction and to apply state-of-the-art methods to these objectives in computational biology.
As almost all major solutions are based on ML and AI, the most important challenges for the field are to avoid over-fitting, to cope with data bias, and generally to create ML/AI solutions that will not mislead users. Students will therefore learn key strategies to avoid data leakage and over-estimates of performance. These issues are particularly relevant for PP2 because substantially less experimental data is available for protein function than for structure.
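One common strategy against data leakage is to keep similar proteins on the same side of the train/test split. The following is a minimal sketch of that idea, not the course's prescribed procedure. It assumes that cluster assignments already exist (for instance from a sequence-clustering step run beforehand); the function name `cluster_split`, the toy protein IDs, and the cluster labels are all hypothetical.

```python
import random
from collections import defaultdict


def cluster_split(cluster_of, test_fraction=0.2, seed=0):
    """Split protein IDs into train/test so that no cluster spans both sets.

    cluster_of: dict mapping protein ID -> cluster ID (assumed to come from
    a clustering step that is not shown here).
    """
    # Group protein IDs by their cluster.
    members = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        members[cluster_id].append(protein_id)

    # Shuffle whole clusters, not individual proteins.
    cluster_ids = sorted(members)  # sorted first so the seed is reproducible
    random.Random(seed).shuffle(cluster_ids)

    n_total = len(cluster_of)
    train, test = [], []
    for cluster_id in cluster_ids:
        # Fill the test set cluster by cluster until it reaches the target size.
        target = test if len(test) < test_fraction * n_total else train
        target.extend(members[cluster_id])
    return train, test


# Toy example: six made-up proteins in three made-up clusters.
toy_clusters = {
    "P1": "c1", "P2": "c1",
    "P3": "c2", "P4": "c2",
    "P5": "c3", "P6": "c3",
}
train_ids, test_ids = cluster_split(toy_clusters, test_fraction=0.3, seed=1)
```

Because entire clusters are assigned to one set, the realized test fraction can deviate from the requested one when clusters are large; this is the price for keeping related sequences out of both sets at once.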
Students will learn how evolutionary information from multiple sequence alignments (MSAs) is crucial to advance prediction methods, and what the limitations of such solutions are. Given how the field has evolved over the last few years, another major focus will be on understanding the power of protein Language Models (pLMs), which now outperform MSA-based methods in several aspects of function prediction.
Compared to structure prediction, function prediction poses particular challenges. What these challenges are, how they might be overcome, and what they teach us for other ML/AI applications are part of the lessons.
Ultimately, students will learn to develop their own prediction methods (in groups guided by tutors) by combining existing methods or algorithms and/or by creating new methods of their own. Students will learn to critically analyze and evaluate published methods (as readers of the publication, as peer-reviewers, and as competitors). Based on the outcome of these evaluations, they will learn to create a tool that is readily usable by experimental and computational biologists. Almost all solutions will involve developing or applying methods based on machine learning (ML) and artificial intelligence (AI). This means that, during the exercises, students will be able to convert an abstract idea of a solution into pseudo-code, taking technical aspects into account, and optionally further into executable programs.
Examination
The module is graded by an oral exam at the end of the semester. The exam takes 20 minutes for an individual and 60 minutes for a group. The questions will be drawn entirely from the lectures and the project work.
Project work in the exercises leads up to an original scientific analysis - typically the development or testing of a machine learning-based solution - which the group will present at the end of the semester. Students who contribute crucially to a successful project will receive a 0.3 grade bonus, provided that they pass the exam.
In the exam, successful participants demonstrate their ability to devise and discuss appropriate computational approaches to a biological problem in the field of protein function prediction. For instance, they choose the appropriate methods depending on the type of data available (such as sequence, structure, experimental annotations). They can also choose the appropriate data abstraction level (such as GO level, EC classes, or structural classes).
They demonstrate their understanding of the concepts by choosing appropriate solution approaches for the given tasks, and they can evaluate these by discussing the pros and cons of alternative approaches from both a biological and a technical perspective. They can demonstrate their ability to create a usable tool implementing a solution approach down to the level of pseudo-code.
Most importantly, students show that they have learned what it takes to communicate with experimental biologists about their needs for predictions and about the strengths and limitations of particular prediction methods.
More details will be announced at the beginning of the lecture series.
Course requirements
"Protein Prediction 2 - Function" for Bioinformatics targets those who study Bioinformatics at the Master level. Typically, students have completed a bachelor's degree in bioinformatics and have had hands-on experience with machine learning and artificial intelligence. Ideally, participants have successfully completed the first part of this series, i.e., Protein Prediction 1 for Bioinformatics; however, very motivated students may benefit from taking PP2 without PP1. For students of other disciplines, there is another course available.
Resources
- T Goldberg, T Hamp & B Rost (2012) LocTree2 predicts localization for all domains of life. Bioinformatics 28:i458-65 https://pubmed.ncbi.nlm.nih.gov/22962467/
- M Littmann, M Heinzinger, C Dallago, K Weissenow & B Rost (2021) Protein embeddings and deep learning predict binding residues for various ligand classes. Sci Rep 11:23916 https://pubmed.ncbi.nlm.nih.gov/34903827/
- T Hamp & B Rost (2015) Evolutionary profiles improve protein-protein interaction prediction from sequence. Bioinformatics 31:1945-50 https://pubmed.ncbi.nlm.nih.gov/25657331/
Activities
Lectures, seminars, exercises, and problems for individual and team study: the students apply the theory presented in the lecture by writing a protein function prediction method in the exercise, starting from data in varying form (depending on the problem at hand). In some cases, they will get the complete input from the tutors; in others, they will have to write database parsers and generate the input/output data they need during the lab work. Each team will thoroughly estimate the performance of the tool it created and will present its results to peers and tutors.
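Estimating performance usually means reporting an error estimate alongside the score itself. The sketch below is one possible way to do this, assuming a simple per-protein accuracy and a plain bootstrap over test proteins; the function name `bootstrap_accuracy` and the toy localization labels are hypothetical and not taken from the course material.

```python
import random


def bootstrap_accuracy(y_true, y_pred, n_boot=1000, seed=0):
    """Return (accuracy, bootstrap standard error) for paired label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    # 1 if the prediction matches the annotation, else 0.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    accuracy = sum(correct) / n

    # Resample test proteins with replacement and recompute the accuracy.
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)

    # The standard deviation of the resampled accuracies is the error estimate.
    mean = sum(samples) / n_boot
    variance = sum((s - mean) ** 2 for s in samples) / (n_boot - 1)
    return accuracy, variance ** 0.5


# Toy example with made-up localization labels for five proteins.
observed = ["nucleus", "cytoplasm", "nucleus", "membrane", "cytoplasm"]
predicted = ["nucleus", "cytoplasm", "membrane", "membrane", "cytoplasm"]
acc, stderr = bootstrap_accuracy(observed, predicted)
```

With so few test proteins the error estimate is large, which is the point of reporting it: a score without its uncertainty can suggest differences between methods that the data do not support.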
Additional information
- More info: Course page on website of Technical University of Munich
- About studying within the Euroteq alliance: https://euroteq.eurotech-universities.eu/initiatives/building-a-european-campus/course-catalogue/
- Level: Master
- Contact hours per week: 0
- Instructors: Burkhard Rost
