How I Used Yasara and FoldX to Analyze Mutations and Ligand Docking for Proteins

18 min readApr 15, 2024

In my previous article, I discussed the problem of protein structure prediction and the impact of AlphaFold2 in solving said problems. However, there are still two others aspects of the protein field.

The question of what balance of interatomic forces or bonds dictate the structure of the protein
The kinetics question of what routes or pathways some proteins use to fold

This article will cover my project, where I combined the technologies of FoldX with Yasara to study how proteins interacts with other molecules and how mutations affect a proteins structure, thus tackling mainly the second problem. Through this project, I gained a better understanding of what causes a protein to misfold, and how improper folding results in a protein not achieving its native structure.

Furthermore, while it is not at the core of the protein folding problem, I also got a better understanding of how a protein interacts with peptides and DNA. This is key because proteins have emerged as promising drug delivery vehicles and are often the targets for many therapeutics. Thus, understanding how they interact with molecules is necessary to ensure the efficacy and safety of your therapeutic.

Lets take a step back though. Why is it important to study these problems? To understand this, lets understand the history of the protein folding field.

The History of Protein Folding

Protein folding is the process by which a protein assumes its characteristic structure, known as the native state. The fundamental question in this field is how an amino acid sequence both specifies a native structure and a pathway to attain said structure. To boil it down even further, the protein folding problem is the question of how order (a native structure) arose from disorder (an on-rigid structure) in proteins.

From this overarching question, there are three main subquestions that have largely been answered from theory and statistical mechanics combined with experiments:

How did the various native structures of proteins arise from interatomic driving forces encoded within their amino acid sequences?

The answer to this is that chain randomness is overcome by solvation-based codes. In other words, although the linear sequence of amino acids in a protein chain may initially lack specific structure (chain randomness), solvation-based codes, which involve complex interactions between the protein chain and surrounding solvent molecules, play a crucial role in overcoming this randomness and guiding the protein towards its correct folded structure.

How does protein folding occur so fast given that a protein has to search through countless possible confirmations (structures) before finding its native structure?

To answer this, we use the needle-in-a-haystack metaphor, where native states are found efficiently because protein haystacks (conformational ensembles) are funnel-shaped. In other words, as the protein gradually folds, the amount of potential structures exponential decreases to the point that the protein only has a few possible native structure and can easily find the needle (native structure) in the large amount of possible protein structures (haystack).

Is there a folding mechanism, a particular route or pathway that is both specific enough to explain how a given amino-acid sequence reaches its native state, and general enough to say why most proteins have such routes?

In other words, is there a mechanism or narrative of a protein’s folding pathway that is general enough to apply to all proteins, yet still granular to account for the differences across proteins. This problem is super tricky and still unsolved because it can’t be an atom-by-atom nanosecond-by-nanosecond sequence of events, for that would surely apply only to one protein. However, the funnel concept alone is also not a sufficient descriptor of the mechanism because it is too generic to tell us differences in folding routes of different proteins.

A “Brief” Chronology

The notion of a folding “problem” first emerged in 1957, with the first experimental determinations of globular protein structures by Perutz et al. Here, John Kendrew and Max Perutz created the first atomic structure of the protein myoglobin, showing that proteins had atomistically detailed structures that were different for different proteins, and that those differences matter for function. The last part is super important and is at the core of the idea that a proteins structure determines its function.

Myoglobin’s three-dimensional structure as revealed X-ray crystallography (Source)

A key milestone was the work of Dr. Christian Anfinsen on ribonuclease (RNase). In 1961, Dr Anfinsen’s hypothesis was that the native structure of a protein is the thermodynamically stable structure; it depends only on the amino acid sequence and on the conditions of solution, and not on the kinetic folding route or sequence of steps by which a protein folds. In other words, the information required to fold a protein into its three-dimensional shape, or the folding code, comes from its sequence of amino acids.

This work was absolutely key for the field, and he was later recognized for the Noble Prize in Chemistry in 1972 for his work. Two powerful conclusions followed from Dr. Anfinsen’s work:

Now that we know that it doesn’t matter where the protein folds, the research enterprise of in vitro protein folding began, meaning that we can understand how proteins fold and their native structures through experiments inside test tubes rather than inside cells.
The Anfinsen principle suggests that while evolution can modify the sequence of amino acids in a protein, the actual folding of that protein into its proper shape and the speed at which it happens are determined by the laws of physics and chemistry.

Following Anfinsen’s work, in vitro studies showed that the folding process typically occurs on a milliseconds-to-seconds time scale, much faster than the rate estimated assuming that folding proceeds by a random search of all possible conformations.

Based upon this observation, in 1968, Cyrus Levinthal proposed that a random conformation search does not occur in folding and that proteins fold by specific ‘folding pathways.’ On these pathways, the protein molecule passes through well-defined partially-structured intermediate states. In other words, the protein may start to form secondary structural elements such as alpha helices or beta sheets, but the overall structure is not fully formed and these states only exist temporarily during the folding process. Each intermediate state is a step closer to the final folded structure.

Levinthal thought this because given the knowledge that proteins fold super fast, it would take forever for a protein to search through the estimated 10³⁰⁰ possible structures to find the native state. Out of a near infinite possible ways to fold, a protein thus picks one in just tens of microseconds; this same task would take 30 years of compute time.

The implications of this study was that even though folding is stochastic, it must still involve some physical basis for not exploring the vastness of entire conformational spaces. It was later understood that the chain randomly formed increasingly many hydrophobic-hydrophobic (HH) contacts, lowering the free energy and at the same time reducing the remaining space of searchable conformations. Folding is not a random search, but rather pruned by exponentially diminishing the number of conformations available to search as the protein becomes increasingly more compact.

Following these key experiments, one of the earliest proposed mechanisms for protein folding, the nucleation-condensation model, was formulated. This model was based on nucleation theory, which suggests that when a protein folds, the process begins with the formation of a stable nucleus, which serves as the starting point for the folding of the protein molecule into its native conformation. This initial formation of a nucleus could be then considered the rate-limiting step in the folding process. Based on this idea, the model required a small number of residues (folding nucleus) to form their native contacts in order for the folding reaction to proceed fast into the native state.

Furthermore, the cooperative nature of the protein folding is similar to the behavior exhibited in first-order phase transitions, which happen via formation of nuclei (nucleation) followed by their growth. Because of these similarities, terminology used in studies of phase transitions, such as energy landscapes and nucleation, was introduced into the discussion of protein folding.

The concept of energy landscape was key for the protein folding field. A key conclusion now is that proteins have funnel-shaped energy landscapes: many high-energy states and few low-energy states.

A simple chemical reaction goes from reactants A to product B. A protein cannot just go from its unfolded state straight to its folded state because its reactant, the denatured or unfolded state, is not a single microscopic structure. It’s a state of disorder where the protein’s amino acids are not organized. This goes back to the idea that folding is a transition from disorder to order, not from one structure to another. Simple one-dimensional reaction path diagrams do not capture the reduction in conformational degeneracy. A funnel, on the other hand, accurately portrays a proteins conformational heterogeneity, or just the existence of multiple possible structures that a protein can adopt. As a protein folds, the amount of possible structures decreases and the funnel becomes narrower to represent that.

Now I know I stated before that protein folding can be done in vitro because the folding code lies in the amino acid sequence, there are a few key differences between in vitro and in vivo studies that still makes it important to study protein folding in vivo.

The first difference is that protein folding is usually assisted by molecular machinery, such as chaperones and molecular cofactors. Molecular chaperones, such as the heat shock protein Hsp70, often help in the protein folding process. One way that they do this is by binding to the exposed hydrophobic regions on the protein surface, shielding them from other proteins and allowing the protein to fold without interference from other molecules. They essentially create a specialized microenvironment that promotes efficient and accurate protein folding.

Furthermore, nearly one third of all proteins in living cells interact with small molecule cofactors. The work of Wittung-Stafshede et al. was key to revealing the role of cofactors, showing that bound metals stabilize the native fold which means that cofactor binding to unfolded proteins dramatically accelerates folding time.

The second big difference is that the concentrations of macromolecular solutes in cells can be hundreds of grams per liter in cells, but most in vitro studies are performed in buffered solution with less than 1% of the concentration. The crowded environment in vivo means that their is less space for the amino acids chains to move and fold without the bumping into other molecules. Furthermore, a more crowded environment means that proteins have the risk of interacting with other molecules, altering the folding pathway and potentially leading to different conformations.

So… Why Does This Matter?

Now that we have a general idea of the field. Let’s circle back to the initial question: why is studying protein folding and proteins important in the first place?

Proteins are vital in the drug development and drug discovery space. They have not only shown promise as delivery vehicles, but even as therapeutics. Protein can be genetically engineered to bind to specific targets with high affinity and selectivity, meaning that there is a low chance that they bind to the wrong molecule and cause unwanted side-effects. Moreover, proteins are have lower immunogenicity, which is important if you want to do numerous dosages. A specific application that is of interest to me is their ability to revert misfolded proteins. Proteins could be potentially tailor made to bind to misfolded proteins and fufill the missing part of their structure, restoring their function. If this were to become a reality, neurodegenerative diseases such as Alzheimer’s and Parkinson’s would be a thing of the past.

Just thinking generally however, among the classes of molecules, proteins occupy a niche of special importance: Not only are they components of all living things and involved in every major cellular processes, and not only do they matter for health and disease as discussed, but their material powers are extraordinary: they are working components of being that walk (function/interact), talk (send signals and communicate), and think (decide how they want to fold and which structure they will adopt). Proteins are truly a special molecule and better understanding them will unlock the secrets to life itself.

What is FoldX and Yasara?

Now that we understand why this problem is important, lets discuss what I worked on.

The first tool that I used was Yasara, which is a computer program for molecular visualizing, modeling, and dynamics. It allows me to view protein structures on my computer and interact with them. I used the Yasara model, which came free using an academic license.

The features in this included everything from the Yasara View package along with some other cool features:

Analyze contacts, hydrogen bonds, hydrophobic/pi-pi/cation-pi interactions and protein secondary structure.
Align small molecules like ligands automatically, by superposing them on the largest common fragment.
Measure distances, angles and dihedrals between groups of atoms like helices or planar side-chains.
Show and calculate partial Van der Waals, molecular and solvent accessible surfaces, distinguish between outer and inner surfaces, calculate volumes. Analyze and show cavities formed by these surfaces.
Identify cis-peptide bonds and wrong stereoisomers. Stereoisomers is where molecules have the same molecular formula and sequence of bonded atoms, but differ in the three-dimensional orientations of their atoms in space.

While Yasara is cool, FoldX is the really cool part of the project. Centered around the laboratory of Luis Serrano at the European Molecular Biology Laboratory in Heidelberg and at Center for Genomic Regulation in Barcelona, FoldX has constructed the FoldX suite, a package of all of FoldX’s biggest technologies.

Released in 2019, FoldX’s newest FoldX5 suite contains all of the latest technology:

FoldX

FoldX provides fast estimations of how important a certain interaction (disulphide bonds, hydrophobic, electrostatic, polar and hydrogen bonding) is to the stability of the protein. FoldX uses a detailed atomic structure description of proteins and incorporates different energy factors, which are adjusted based on real-world protein engineering data. Its energy calculation is computationally inexpensive, making it suitable for protein design and predicting protein structure and folding pathways quickly and accurately.

For its energy function, FoldX uses the following equation to calculate the free energy of unfolding (ΔG) of a target protein:

Here, “a — l” are relative weights of the different energy terms in calculating free energy. The proteins interaction with a solvent is treated in two steps. First, FoldX considers the overall liquid environment as a desolvation factor, which changes as atoms become more buried or shielded from the surrounding environment. This desolvation factor is divided into two parts: one related to non-polar groups (ΔGsolvH) and and another to polar groups (ΔGsolvP). Those water molecules that make more than two hydrogen bonds with the protein are calculated explicitly in the ΔGwb term. Van der Waals terms (ΔGvdw) are calculated similarly to desolvation, but experimental energy transfers from water to vapor are also included.

Hydrogen bonds are figured out using basic geometric rules, and their energy (ΔGhbond) is determined from protein engineering experiments involving double mutant cycles. Double mutant cycles are super interesting and are a type of technique where two amino acid residues within a protein are first individually mutated to different amino acids, and the effects of these mutations on the protein’s stability are measured. Then, a double mutant is created where both residues are mutated simultaneously. By comparing the effects of the single mutations to the double mutation, we can infer the energetic contribution of the interaction between the two residues to the overall stability of the protein.

The electrostatic contribution to the free energy (ΔGel) is determined using Coulomb’s law: the force between two charged objects is directly proportional to the product of their charges and inversely proportional to the square of the distance between them.

For protein complexes, or two or more protein molecules that come together, there’s an extra electrostatic contribution calculated between atoms from different polypeptide chains, denoted as ΔGkon.

What differentiates FoldX from other force fields — which are just sets of mathematical equations to describe the interactions between atoms and molecules — lies in how they estimate entropy. While traditional methods involve simulations to capture the flexibility of protein structures, FoldX calculates the entropy pentaly for fixing the protein backbone in a specific conformation (ΔSmc) by analyzing the distribution of amino acids seen in crystal structures. This penalty is then adjusted based on factors like accessibility (how easily other atoms can interact with a specific part of the protein structure) of backbone atoms and energetics (energy changes) of hydrogen bonds. The entropy cost (ΔSsc) of locking a side chain into a specific shape is determined by adjusting a predefined set of entropy parameters, calculated by Abagayan and colleagues, according to how buried the side chain is.

Additionally, FoldX considers steric clashes between atoms in the structure to calculate a term called ΔGclash. Steric clashes are just when atoms in a molecular structure come too close to each other and their electron clouds repel each other, resulting in strain or distortion of the molecule’s structure as atoms cannot occupy the same space simultaneously.

Finally, FoldX can mutate the 20 natural aminocids. When mutating DNA, the code finds the matching base pair and changes both at once to maintain pairing.

LoopX

LoopX allows for quick and precise predictions of the structure of protein loops — regions which connect two secondary structures. It uses a library, LoopXDB of irregular structure elements grouped based on the distance between the regular (well-defined recurring) parts that surround the loop. This tool also helps researchers swap loops with others from the library, aiding in protein engineering by finding possible variations.

PepX

PepX allows for fast and accurate prediction of peptide docking by means of BackXDB, a library of protein building blocks, and PepXDB, a library of protein-peptide binding motifs.

The PepX algorithm takes a peptide sequence and a domain structure and finds the possible shapes the peptide can take when it binds to other molecules. It does this by using interaction rules with parts of the protein known to be important for its binding. To speed things up, it uses a shortcut algorithm (CSP) to narrow down the search among the many possible shapes stored in the PepXDB database. This database holds over 7 million interactions from various protein structures, giving a wide range of peptide-binding information.

BackX

While this module is currently under development, BackX models backbone moves. This is important because protein backbones are flexible and move, allowing them to adjust to fit nicely with other proteins.

DnaX

Over the last 25 years, there has been noticeable progress in terms of increasing the number of protein-DNA crystal structures stored in the PDB (Protein Data Bank). DnaX is designed to predict how DNA binds to proteins by analyzing these structures. The team has analyzed over 2500 protein-DNA crystal structures and organized the findings into a database based on their structural characteristics.

Example of a protein binding to DNA (Source)

After that rather long explanation, you can see the real power of these tools and all that they can do. They really are amazing and the fact that they are available for anyone to use is even better.

Showcase

Now let’s get to the fun part, here I will my results using these tools.

Ligand Docking

One of the things that I did was analyze ligand docking. Here is one of the proteins that I analyzed, I got this protein by searching through the Protein Data Bank and this was a COX-1 structure found in humans. I then easily imported it onto Yasara and selected one of the many types of ways that I could view this molecule. What is really cool about Yasara is that you can see the molecule in many different ways. For example, this is the ribbon structure of with side chains. A-helices are shown as coiled ribbons or thick tubes, β-sheets as arrows, and non-repetitive coils or loops as lines or thin tubes.

Here is an example of another way that I can model this protein. This model is called the ball and stick model because all it shows are the balls (atoms) and sticks (connections between them). While this model doesn’t give you a very good idea of where the helices and sheets are, or the full shape of the protein, it is simple and is an example of how customizable this tool is.

The first thing that I did was I studied ligand binding onto this protein. I first “cleaned” up the molecule by removing extra cofactors and ligands (those large red and blue balls that appear above), and also not showing any side chains, as all of these things are not needed when performing docking studies. Now, my protein looks like this:

Now I want to find my ligand. I went on pubchem and I wanted to see how aspirin would bind to COX1, which is my receptor.

Now, running the simulation, because I did not specify a specific place where the ligand should bind, I got a bunch of possible places where aspirin (all of the large red, blue, and grey circles) could bind to COX1.

While there are many places where aspirin could bind, we need to find the best place. The best one is the one with the highest binding energy in kcal/mol. Once you run the docking analysis, you will get a file that shows an analysis of all of the results and you can look there to find the best one. Here is an image of my best structure, you can see the aspirin ligand in the middle of the protein:

Predicting the Effect of Mutations

Mutations in proteins can have various origins. Some possibilities include errors during DNA replication, exposure to radiation or certain chemicals, or as a result of genetic recombination processes. Mutations can have varying effects on a proteins structure and/or function. For example, mutations can have no effect at all, be stabilizing or destabilizing; in the last two cases, these types of mutations can lead to diseases.

Traditionally, mutations are made in the wet lab to study the effect of a single residue position on protein stability, interaction with peptide ligand, and much more. However, such procedures in the wet lab are laborious and costly. With the introduction of bioinformatic tools such as FoldX, this process is now a lot easier. You can first use these tools to predict the effects of mutations and then test the really interesting ones in the lab.

In terms of how FoldX interprets mutations, the main focus is on the prediction of free energy changes (what happens to the free energy of the protein when we mutate an Asp to a Tyr for example). FoldX will then calculate the free energy of the wild type (WT) and the mutant (MT) and calculate the difference:

ddG (change) = dG(MT) — dG (WT)

As a rule of thumb:

ddG(change) > 0 : the mutation is destabilizing (weaken interactions within the protein)

ddG(change) < 0 : the mutation is stabilizing (strengthen interactions within the protein)

The protein that I am working with now is called 2AC0, which is a part of a tetrameric complex of the transcription factor P53 bound to DNA. After repairing or minimizing the structure, here is what my protein looks like:

I then used FoldX to mutate the ALA 159 residue to Trp. I also selected the option to show the Van der Waals (VdW) clashes in WT and mutant.

Zoomed-in-view on the original Ala159 region.

Now here is a view on the mutated Ala159Trp region, notice how there are a lot more red Van der Waals clashes. Remember how, anything above 0kcal/mol is assumed to be destabilizing. When I analyzed it I saw an energy change of +29 kcal/mol!

Conclusion

While I could show a lot more, I recommend that you just download the tools for yourself and try them out. This project was amazing and working with all of these tools was truly a gift. Hopefully, you also see the amazing power of these tools and all that they are capable of. This is honestly a huge testament to the scientific community and what they are able to produce and willing to make available.

If you liked this article, have any questions, or want to provide any feedback, please email me (natank929@gmail.com) or check out my LinkedIn to connect further.

Sources

As always, the information that I presented in this article did not come from thin air, here are the key resources that I used to inform my learning: