by Bennett Kapili
Proteins are the primary catalytic units of biology. They are composed of up to twenty different amino acids strung together into sequences that are usually a few hundred amino acids in length. Each of the twenty amino acids has different chemical properties, such as size, affinity for water, and electrical charge, and the interaction of these amino acids with their environment and with each other gives proteins their distinct three-dimensional shapes. This structure, in turn, is critically important in defining the protein’s catalytic properties. While a particular protein might have slightly different amino acid sequences in different organisms, with accompanying variations in folding structure, only a small subset of possible sequences results in the protein’s proper functioning. By studying a protein’s variations in amino acid sequences, we can gain valuable insight into how the protein works.
Here, we will borrow fundamental concepts from information theory and apply them to the study of molecular evolution. In particular, we will study amino acid sequence variation in the protein nitrogenase reductase (NifH), which is an ancient protein found in Bacteria and Archaea and is critically important for sustaining life on Earth. NifH helps convert nitrogen gas – which most organisms cannot use to grow – into a form of nitrogen that all life on Earth can use to grow. Without the action of this protein and the handful of other proteins with which it works, there would not be enough usable nitrogen to sustain all the living organisms on Earth. The NifH protein works by splitting apart molecules of ATP (the most common energetic currency for life on Earth) and using the energy that is released to transfer an electron to another protein. That electron is then passed to a molecule of nitrogen gas, which, after accumulating six electrons, is converted into ammonia. The electrons are transferred using a small cluster of iron and sulfur atoms that is embedded inside the NifH protein. Since the amino acids that are involved with splitting the ATP and holding the iron-sulfur cluster are absolutely essential for proper NifH functioning, we would expect little sequence variation at these positions. We’ll explore this hypothesis using an information-theoretic framework.
First, I downloaded all the publicly available amino acid sequences of NifH (n = 8876 sequences) from GenBank, a database maintained by the National Center for Biotechnology Information. Then, I constructed an amino acid sequence alignment, which is the placement of the sequences into a matrix, with each row corresponding to one linear sequence and each column corresponding to the equivalent position across the sequences. Each matrix column therefore captures the amount of amino acid variation that exists at a particular position in the protein. From this alignment, I created a phylogenetic tree to estimate which sequences were the most different based on a model of evolution. I picked the 750 sequences that were the most spread out on the tree to analyze in this study. Then, I estimated the entropy of each column in the alignment and the mutual information between columns using the Jiao-Venkat-Han-Weissman estimator [1]. I then applied the average product correction described in Dunn et al., 2008 [2] to the mutual information estimates to account for phylogenetic signal within the sequences.
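The average product correction can be sketched in a few lines. This is a minimal illustration of the formula from Dunn et al., 2008 (corrected MI = MI minus the product of the two columns' mean MI values divided by the overall mean MI), not the code used for this analysis. The function name and the toy matrix are hypothetical, and the sketch assumes at least one pair of columns has nonzero mutual information.

```python
def average_product_correction(mi):
    """Subtract the average product correction from a symmetric MI matrix.

    mi is a list of lists; mi[i][j] is the mutual information between
    alignment columns i and j. Diagonal entries are ignored.
    Assumes the overall mean MI is nonzero.
    """
    n = len(mi)
    # Mean MI of each column with every other column.
    col_mean = [
        sum(mi[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)
    ]
    # Mean MI over all off-diagonal pairs.
    overall_mean = sum(col_mean) / n
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                apc = col_mean[i] * col_mean[j] / overall_mean
                corrected[i][j] = mi[i][j] - apc
    return corrected


# Toy matrix (hypothetical values): a uniform background of 1.0 with one
# pair, columns 0 and 1, carrying extra signal.
toy_mi = [
    [0.0, 3.0, 1.0, 1.0],
    [3.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
for row in average_product_correction(toy_mi):
    print([round(value, 3) for value in row])
```

The idea is that MI shared broadly across many columns (as expected from common ancestry) is subtracted, so the pair with signal beyond the background keeps the largest corrected value.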
Results & Discussion
To study sequence variation, we can calculate the entropy of each column in the sequence alignment. We can then compare each position’s entropy to the entropy of a theoretical position in which every amino acid occurs with equal probability. When we do this for NifH, we see little sequence variation in the positions that are involved with ATP splitting (called the “Walker A”, “Switch I”, and “Switch II” sequence motifs) and iron-sulfur cluster binding (Figure 1A). This is exactly what we expected! In Figure 1, the points closer to the dashed line have less sequence variation than points farther below the dashed line. Interestingly, the Walker B motif, which is also involved in ATP splitting, has much more sequence variation than the other motifs (Figure 1A). Also, we see that the amino acids that compose the alpha helices and beta strands – 3D structures that help give NifH its shape – can, at times, have quite high sequence variation (Figure 1B). We’ll return to this later.
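To make the per-column calculation concrete, here is a minimal sketch. It makes several assumptions that are not from the analysis itself: it uses the simple plug-in (frequency-based) entropy estimate rather than the Jiao-Venkat-Han-Weissman estimator, it measures entropy in bits, it skips gap characters, and it takes "relative entropy" to mean the distance from a uniform distribution over the 20 amino acids. The toy alignment is made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_entropy(column):
    """Plug-in Shannon entropy of one alignment column, in bits."""
    # Keep only standard residues; gaps and ambiguity codes are skipped.
    residues = [aa for aa in column if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("column contains no standard amino acids")
    n = len(residues)
    counts = Counter(residues)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def relative_entropy_to_uniform(column):
    """Divergence from a uniform distribution over 20 residues, in bits.

    Equals log2(20) minus the column entropy: about 4.32 bits for a fully
    conserved column and 0 for a column where all residues are equally common.
    """
    return math.log2(len(AMINO_ACIDS)) - column_entropy(column)


# Toy alignment (hypothetical sequences, one string per row).
alignment = ["GKGGIGKS", "GKGGVGKS", "GKGGIGKT", "GKGGLGKS"]
for position, column in enumerate(zip(*alignment)):
    print(position, round(relative_entropy_to_uniform(column), 3))
```

Under these assumptions, a fully conserved column sits at the maximum value, and more variable columns fall progressively below it.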
When we map the relative entropy estimates onto the 3D protein structure, we get a strong visual understanding of the relationship between sequence variation and protein function. The protein structure shown in Figure 2 is the structure of NifH during ATP splitting, where ADP-AlF4– serves as an analog of ATP [3]. First, we see very little sequence variation around the iron-sulfur cluster (Figure 2A). We also see very little sequence variation around the area that interacts with ATP (dark blue stripes in Figure 2B; ADP-AlF4– positions shown in Figure 2F). In addition, we see that the surface of the protein – that is, the area that is exposed to the chemical environment within the cell – has a relatively high amount of sequence variation.
To investigate these relationships further, we can plot the relative entropy as a function of distance to the iron-sulfur cluster and to ADP-AlF4–, and as a function of how exposed each amino acid is to the outside. We see that amino acids separated from the iron-sulfur cluster by <5 angstroms (five one-hundred-millionths of a centimeter; a distance that allows the formation of hydrogen and ionic bonds) have very little sequence variation (Figure 3A). This low sequence variation extends to ~10 Å, suggesting that not only is direct interaction with the iron-sulfur cluster important, but the structure of the protein in the vicinity is very important as well. We see similar results for distance to ATP (Figure 3B). When we look at the relationship between sequence variation and exposure to the intracellular environment (measured here as the percent of an amino acid’s surface area that is accessible to water), we see decreasing relative entropy with increasing exposure to the environment (Figure 3C). Interestingly, though, we see that not all amino acids buried inside the structure are completely conserved. There are 50 amino acids located completely inside the NifH structure, 32 of which display meaningful variation (defined here as having entropy >0.1, which roughly corresponds to the most common amino acid accounting for <99% of the sequences at a position).
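As a rough check on that threshold, consider a position where one amino acid appears in 99% of sequences and a single alternative appears in the remaining 1%. Assuming entropy is measured in bits (the units are an assumption here), the entropy of such a position is:

```latex
H = -0.99 \log_2 0.99 - 0.01 \log_2 0.01 \approx 0.014 + 0.066 \approx 0.081 \text{ bits}
```

This is close to the 0.1 cutoff. Spreading the remaining 1% over several different amino acids raises the value slightly.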
We can use mutual information, or how much we can reduce our uncertainty about the identity of the amino acid at position X if we know the amino acid at position Y, to further study these amino acids. In particular, variation at one amino acid tucked inside the protein structure might be allowed if variation at another amino acid helps “compensate” for the changes. Positions that show these patterns of compensating amino acid substitutions are coevolving, and we can measure the statistical dependency between positions to detect such behavior. However, to distinguish directly coevolving positions from indirectly coevolving ones (i.e., the amino acid at position X directly influencing the identity of the amino acid at position Y vs. X indirectly influencing Z through Y interacting with Z), we will only consider the pairing with the highest mutual information for each amino acid. This approach is an application of the data processing inequality, which has been applied to the study of gene regulation networks [4]. To evaluate whether the magnitude of a mutual information estimate is meaningful, we will compare it to the estimates produced from randomized NifH sequences. In particular, for each mutual information estimate, we will calculate the number of standard deviations that it is above the average mutual information value between positions in the randomized sequences. These values are called “Z-scores”, and we will refer to the pairs involving positions and their strongest coevolving partners as “maxZ pairs”.
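The Z-score and maxZ steps can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used here: it uses the plug-in mutual information estimate (not the Jiao-Venkat-Han-Weissman estimator), it skips the average product correction, and it assumes the randomization is a shuffle of the residues within one column, which keeps each column's composition but breaks any pairing between columns. All function names and the toy columns are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import mean, stdev


def entropy(symbols):
    """Plug-in Shannon entropy in bits."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(col_x, col_y):
    """Plug-in I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(col_x) + entropy(col_y) - entropy(list(zip(col_x, col_y)))


def z_score(col_x, col_y, n_shuffles, rng):
    """Standard deviations by which the observed MI exceeds the shuffled mean."""
    observed = mutual_information(col_x, col_y)
    shuffled = list(col_y)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # assumption: shuffle within the column
        null.append(mutual_information(col_x, shuffled))
    spread = stdev(null)
    return (observed - mean(null)) / spread if spread > 0 else 0.0


def max_z_pairs(columns, n_shuffles=200, seed=0):
    """Map each column index to (index of its highest-Z partner, that Z-score)."""
    rng = random.Random(seed)
    n = len(columns)
    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            z[i][j] = z[j][i] = z_score(columns[i], columns[j], n_shuffles, rng)
    best = {}
    for i in range(n):
        partner = max((k for k in range(n) if k != i), key=lambda k: z[i][k])
        best[i] = (partner, z[i][partner])
    return best


# Toy columns (hypothetical): columns 0 and 1 covary perfectly, column 2 does not.
toy_columns = ["AAAACCCCAAAACCCC", "DDDDEEEEDDDDEEEE", "GSGSGSGSGSGSGSGS"]
for index, (partner, score) in max_z_pairs(toy_columns).items():
    print(index, partner, round(score, 2))
```

A real analysis would also need a cutoff on the Z-score before calling a maxZ pair coevolving; no cutoff is applied in this sketch.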
Using this approach, we detect 56 directly coevolving amino acids. When we focus on the maxZ pairs, we see that the distribution of the distances between the amino acids in each pair is quite different from the distribution of the distances between all the amino acids (Figure 4). In particular, these distances tend to be much shorter than would be expected if we picked amino acid pairs at random. This is consistent with what we would expect if the amino acids are directly interacting with each other! Interestingly, we find that, of the 32 completely buried amino acids that display meaningful variation, 23 are found to coevolve with another amino acid (Table 1). Furthermore, of the 14 coevolving pairs (some amino acids were maxZ partners to multiple other amino acids), 8 occurred between amino acids that are both completely buried! This suggests that coevolution can indeed explain some of the sequence variation found inside the NifH protein structure – some amino acid substitutions appear to “work” if another substitution simultaneously occurs nearby. Interestingly, the two amino acids that showed the most sequence variation (positions 151 and 196) were found to directly coevolve (Table 1). Although separated by 44 amino acids in the linear sequence, they are only separated by 3.6 Å in the NifH structure.
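One simple way to make the distance comparison is to compute the median distance for the maxZ pairs and for all possible pairs. The sketch below assumes one representative 3D coordinate per amino acid (for example, taken from a structure file) and compares medians. Which atoms define the distance, and the function name, are assumptions. The coordinates are invented for illustration.

```python
import math
from itertools import combinations
from statistics import median


def median_distances(coords, pairs):
    """Return (median distance over the given pairs, median over all pairs).

    coords maps a position number to an (x, y, z) tuple in angstroms.
    pairs is a list of (position, position) tuples, e.g. maxZ pairs.
    """
    all_distances = [
        math.dist(coords[i], coords[j]) for i, j in combinations(sorted(coords), 2)
    ]
    pair_distances = [math.dist(coords[i], coords[j]) for i, j in pairs]
    return median(pair_distances), median(all_distances)


# Invented coordinates: two tight clusters of residues about 30 angstroms apart.
toy_coords = {
    1: (0.0, 0.0, 0.0),
    2: (3.0, 0.0, 0.0),
    3: (0.0, 4.0, 0.0),
    4: (30.0, 0.0, 0.0),
    5: (30.0, 4.0, 0.0),
}
toy_pairs = [(1, 2), (4, 5)]
pair_median, background_median = median_distances(toy_coords, toy_pairs)
print(pair_median, background_median)
```

If coevolving pairs tend to be in direct contact, the first number should be clearly smaller than the second, as it is in this toy case.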
Coevolution can also explain some of the sequence variation we observed earlier in the alpha helices and beta strands (Figure 1B). In NifH, there are 156 amino acids located on alpha helices or beta strands, 116 of which showed meaningful sequence variation (roughly >1%). From this set of positions, 38 positions were found to coevolve – roughly a third of the positions on alpha helices and beta strands showing meaningful variation! These formed 30 coevolving pairs (some amino acids were maxZ partners with multiple others), including many instances of coevolution between separate alpha helices, between separate beta strands, and between an alpha helix and a beta strand (Table 2). This suggests that, while amino acids forming alpha helices and beta strands may display noticeable sequence variation at first glance, there is actually less freedom to vary than there appears. Our mutual information analysis suggests that variation at one position is permissible if variation at another site simultaneously occurs. However, there are likely many more coevolving positions, particularly those that are strongly coevolving, where the identity of an amino acid at position X is inflexible and requires a particular amino acid at position Y. The entropy at both positions would therefore be 0, and since there can be no further reduction in uncertainty, the mutual information between the positions will be 0. Our approach is therefore unable to detect coevolving positions that show no sequence variation.
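This limitation follows from a standard identity: mutual information can never exceed the entropy of either position. Writing H for entropy and I for mutual information:

```latex
I(X;Y) = H(X) + H(Y) - H(X,Y), \qquad 0 \le I(X;Y) \le \min\{H(X),\, H(Y)\}
```

So if either position is perfectly conserved, with H(X) = 0 or H(Y) = 0, the mutual information is forced to 0 no matter how tightly the two amino acids depend on each other.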
In summary, we were able to show not only that the amino acids surrounding the iron-sulfur cluster and ATP molecules in NifH show very little variation, but also that many of the positions on the inside of the NifH structure appear to coevolve!
1. Jiao, J., Venkat, K., Han, Y. & Weissman, T. Minimax estimation of discrete distributions. IEEE Int. Symp. Inf. Theory – Proc. 2015–June, 2291–2295 (2015).
2. Dunn, S. D., Wahl, L. M. & Gloor, G. B. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 24, 333–340 (2008).
3. Schindelin, H., Kisker, C., Schlessman, J. L., Howard, J. B. & Rees, D. C. Structure of ADP·AlF4–stabilized nitrogenase complex and its implications for signal transduction. Nature 387, 370–376 (1997).
4. Margolin, A. A. et al. ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7, 1–15 (2006).