Jesse Gibson and Robert Gordan
A brief history of biological information
The flow of information is a central paradigm in biology and the DNA comprising an organism’s genome is in many ways the fundamental unit of information storage and processing in living things. The information stored in this molecular sequence contains instructions for building the molecular machines that carry out the chemical functions of life and when and where to turn them on and off. Information from the environment is accumulated as genetic variation over the course of evolution, and the information that proves useful is carefully transmitted from generation to generation.
Ever since the Human Genome Project was declared complete in 2003, after roughly 13 years of work, DNA sequencing technology has developed rapidly and its cost has fallen extraordinarily fast, outpacing the analogous improvements in computing technology predicted by Moore’s Law. With the ever increasing ease of sequencing, the genomic DNA sequences of more and more organisms have found their way into the scientific literature, alongside a plethora of computational tools to interpret the wealth of data. Among those tools, some have looked to Shannon’s Information Theory to model the reading and comparison of biological sequences as a communication problem. While some of the most prominent applications of information theory are in the realm of electronic communications, interestingly, Shannon’s PhD work focused on topics in theoretical genetics. For a good review of these approaches, take a look at this article.
The evolution of genome size
If you’ve taken an introductory biology class then when you think of evolution, one of the first things that comes to mind is probably variation in the sequence of proteins – a gradual diversification of the basic functional units of life with the ones that give their hosts an edge in reproduction being expanded in future generations. Evolutionary biologists have also considered other kinds of selective pressures. Consider a simple self-replicating system that may even be composed of a single molecule of RNA as proposed in the RNA world hypothesis. We can imagine that in a population of these kinds of molecules, being shorter would mean the ability to reproduce faster, letting the replicator expand in the population. It’s not an unreasonable stretch of the imagination to see how this kind of pressure might remain influential in the evolution of simple enough organisms like bacteria or even non-living replicators like viruses.
Looking at the issue of size from another perspective, we know that while the lengths of many genomes are macroscopic (the human genome laid out flat is about 2 meters long), they are packaged into spaces that are typically only a few micrometers across. Here is a great website to get a sense of scales in biology. This challenge is even more pronounced in viruses which tightly pack their genomes into a protein-based capsule, often taking advantage of the pressure to inject their genetic material into a host cell. In these contexts, we see a push towards shorter genomes not only for the benefit of shorter replication times but also for their superior physical compressibility. With these size constraints as a backdrop, we asked ourselves if we might see a similar evolutionary pressure on the informational compressibility of genomes.
Information Density of DNA Sequences
Trying to model the way that environmental information is transmitted to DNA sequences is incredibly complex. It is comparatively simple, on the other hand, to think about the entropy rate of a given sequence. Imagining DNA as a random source pulling symbols from an alphabet of size four – adenine (A), thymine (T), guanine (G), and cytosine (C) – Shannon’s theory tells us that the entropy of such a source can be at most $\log_2 4 = 2$ bits per symbol. In other words, if we wanted to represent DNA in a computer, a maximally complex sequence would require two bits to represent each nucleotide. We wondered if, in the same way that simple replicators like bacteria and viruses feel pressure to have a shorter genome, they might also be closer to this information encoding limit. We hoped to build on the work of others who have considered sequence entropy of genomes by examining a larger set of sequences and taking several complementary analytical approaches.
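To make the two-bits-per-nucleotide bound concrete, here is a minimal sketch of a fixed 2-bit encoding that packs four bases into each byte. The particular bit assignment in `CODE` and the function names `pack` and `unpack` are illustrative assumptions, not something used in the analysis.

```python
# Hypothetical 2-bit code; any fixed assignment of the four bases would work.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index i is the base whose code is i


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpack can read from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, n_bases):
    """Invert pack; n_bases is needed because the last byte may be padded."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])
```

A sequence of n bases then occupies ceil(n / 4) bytes, which is the baseline any compressor has to beat in order to show that a genome carries less than 2 bits per nucleotide.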
Genomic Sequence Cohort
Before performing our analyses we wanted to gather a set of genomic DNA sequences representing a broad sample of the tree of life. While the National Center for Biotechnology Information (NCBI) maintains an enormous database of submitted sequences, we chose to focus on a subset that we annotated with various factors we thought might influence information density, to help with interpreting our results. We chose representatives of the major groups of cellular life – eukaryotes, bacteria, archaea – as well as several non-‘living’ replicators like viruses, organelles, and viroids. The set includes organisms that reproduce sexually or asexually, that have one or many cells, and that hunt, make their own food, or take resources from others. A brief summary of the dataset is included in the table at the end of this post.
One straightforward and informal way to explore the information density of genetic data is to compress it using off-the-shelf compression algorithms. For genetic sequences drawn from high-entropy distributions, the sequences should not be very compressible. However, many organisms have highly repetitive DNA that can be compressed.
Although DNA compression has received much academic attention, state-of-the-art genome compression algorithms leverage a reference genome, allowing them to store only differences with respect to this baseline. These algorithms are fundamentally interested in the conditional entropy of a genome with respect to the reference, whereas we are interested in the entropy of the sequence in isolation.
In order to estimate the compressibility of a series of genomes, we fed them into the zlib, lzma, gzip, bzip2, zstandard, and blosc compression libraries, and kept the most successful (smallest) compressed size for each genome.
Zlib uses the DEFLATE algorithm, which combines Huffman coding with the LZ77 algorithm. Bzip2 performs the Burrows-Wheeler transform, rearranging symbols, before a Huffman coding step. Lzma uses an improved dictionary compression scheme similar to LZ77. Zstandard also uses an LZ77-style algorithm, followed by very fast entropy coding using the tANS algorithm.
Lzma consistently performed the best in terms of compressed file size among all the compression algorithms used. It was most effective in ~93% of cases.
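The "try several compressors and keep the smallest output" procedure can be sketched as follows. This is a minimal illustration that assumes the genome is already loaded as a plain string of bases and uses only the codecs in Python's standard library. The third-party zstandard and blosc libraries are left out, and `best_compression` is a hypothetical helper name.

```python
import bz2
import lzma
import zlib

# Standard-library codecs only; zstandard and blosc are third-party and omitted.
COMPRESSORS = {
    "zlib": lambda b: zlib.compress(b, 9),
    "bz2": lambda b: bz2.compress(b, 9),
    "lzma": lambda b: lzma.compress(b, preset=9),
}


def best_compression(seq):
    """Return (codec name, bits per base) for the smallest compressed output."""
    raw = seq.encode("ascii")  # one byte per base before compression
    sizes = {name: len(fn(raw)) for name, fn in COMPRESSORS.items()}
    best = min(sizes, key=sizes.get)
    return best, 8 * sizes[best] / len(seq)
```

The bits-per-base figure includes each format's header and checksum bytes, so it slightly overstates the cost for very short sequences.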
We can see from the chart above that the compressibility of genomes aligns roughly with our expectation of information density based on reasoning from the pressures on genome size. By far the least compressible organism is the tiny viroid, which must fit all its genetic data into a very physically constrained space. The viruses are the next least compressible organisms. On the upper end of the scale, we see a large amount of variety and no clear trendline. Some organisms in the same kingdom have very different genome sizes and similar compression ratios.
If all our genomic sequences were drawn from the same distribution, we would expect to see the data points fall along an “efficient frontier” with the left side asymptotically approaching the theoretical maximum compression rate defined by the entropy and the right side going to 1 as the number of symbols in the genomic sequence declines. We see this pattern roughly above, suggesting that the distributions are indeed similar. Some outliers are the viroid in the bottom right and two of the bacterial genomes.
K-Mer Counting Approach
Calculating the entropy of a random variable requires knowing the underlying probability distribution. In the case of genomes, we don’t have access to any kind of theoretical generating distribution – if such a thing even exists it would probably be incredibly complicated to model, trying to represent all of the underlying evolutionary processes.
Instead, what we do have access to are the genome sequences themselves – individual realizations of this source, often with upwards of tens of millions of samples. If we’re concerned purely with global properties of the sequence, this is more than enough to approximate the underlying distribution. The simplest way to perform that approximation is to count the empirical frequency with which different ‘words’ appear throughout the sequence. From this we can calculate the entropy as $H = -\sum_{w} \hat{p}(w) \log_2 \hat{p}(w)$, where $\hat{p}(w)$ is the empirical frequency of word $w$.
These words could be as short as a single nucleotide, although intuitively this feels like oversimplifying the complexities of a genome. Past work has found that dinucleotide frequencies make for good signatures of a given organism. Going further, we know from Information Theory that encoding symbols from a given source in longer tuples allows us to get closer to the optimal encoding (at the source entropy). At a certain point, however, empirical calculation becomes problematic, because the number of possible length-k sequences (k-mers), $4^k$, outpaces the length of the sequence being analyzed. With these considerations, we first calculated the entropy of all of our genome sequences using the empirical distribution of k-mers of length 1 through 5.
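An exact k-mer entropy calculation along these lines might look like the sketch below. It assumes overlapping windows and divides by k to report bits per nucleotide. Both choices are assumptions about the details, and `kmer_entropy` is a hypothetical name.

```python
from collections import Counter
from math import log2


def kmer_entropy(seq, k):
    """Empirical k-mer entropy in bits per nucleotide (overlapping windows)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    # Plug-in Shannon entropy of the observed k-mer frequencies.
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    return h / k  # normalize so k = 1..5 are comparable, at most 2 bits


# Example: entropy for k-mers of length 1 through 5 on a toy sequence.
toy = "ACGTTGCAACGTAGCTAGCTAGGATCCA"
entropies = {k: kmer_entropy(toy, k) for k in range(1, 6)}
```

For a sequence such as "ACGTACGT..." the 1-mer entropy is the full 2 bits, but the 2-mer entropy per nucleotide drops to about 1 bit, since each base largely determines the next. This is the kind of structure that longer words can pick up.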
While the idea of k-mer counting is simple in concept, it can be very difficult to do quickly, especially when considering sequences on the order of or more characters in length. To get around this difficulty we implemented a random-sampling-based approach, with the idea that if we count the number of times k-mers appear at some subset of randomly chosen locations in the genome, we can approximate the global properties of the sequence. To test this idea we ran the sampling algorithm with varying sample sizes on the C. ruddii and E. coli genomes – two sequences which span an order of magnitude difference in length but are short enough to quickly calculate the exact k-mer distributions in addition to the approximations. The results of this experiment can be seen below.
These plots show the error relative to the exact k-mer entropy value for the mean value of simulations of a given size, with error bars representing the variance in the simulated results. We can see that by the time even a hundredth of the genome has been sampled, most of these measurements have converged to the value calculated based on the exact empirical distribution. For the longer genome of E. coli they converge even earlier, which makes sense since small local features get averaged out more when looking at global properties of longer genomes. The caveat here is that while we can look at a small fraction of longer genomes to approximate global k-mer distributions, doing so with smaller genomes will clearly present a biased sample (1% of the ~300 base pair long GYSV sequence is only 3 bases), and so in doing this analysis we chose to set a lower bound on the number of loci considered. Using this characterization of the sampling method, we went on to calculate k-mer based entropies for all of the selected genome sequences by sampling 5% of all loci, with a lower bound of 5000 samples per genome. Plotting the results of this sampling for all of the genomes considered gives the following plot, where the average number of bits per nucleotide is plotted against genome length on a log scale.
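A sampling-based estimate with the 5% fraction and 5000-sample floor described above could be sketched like this. The name `sampled_kmer_entropy`, the sampling without replacement, and the fallback to counting every locus when the genome has fewer loci than the floor are all assumptions made for illustration. The text does not say how very short genomes such as GYSV were handled.

```python
import random
from collections import Counter
from math import log2


def sampled_kmer_entropy(seq, k, fraction=0.05, min_samples=5000, seed=0):
    """Estimate k-mer entropy (bits per nucleotide) from randomly chosen loci."""
    n_loci = len(seq) - k + 1
    n_samples = max(min_samples, int(fraction * n_loci))
    rng = random.Random(seed)
    if n_samples >= n_loci:
        # Assumption: for short genomes, fall back to the exact count.
        starts = range(n_loci)
    else:
        # Distinct start positions, i.e. sampling without replacement.
        starts = rng.sample(range(n_loci), n_samples)
    counts = Counter(seq[i:i + k] for i in starts)
    total = sum(counts.values())
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    return h / k
```

Because the plug-in entropy estimator is biased downward when the number of samples is small relative to the $4^k$ possible words, the floor on the sample count matters more as k grows.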
Looking at this plot, there doesn’t seem to be a clear correlation between length and information density. One trend we can note is that there seem to be fewer organisms with low global sequence entropy at higher genome lengths, a trend which seems contrary to our initial intuition. It’s also possible, however, that we simply didn’t sample enough larger genomes to observe the rare low-entropy genomes. We can also see that the leftmost points move downward rapidly with increasing k, but this is probably because at that length scale the number of possible k-mers starts to outpace the length of the genome. Coloring this plot by kingdom and then by lifestyle (independent, parasitic, symbiotic, or communal) gives the following plots.
Here we see that the viruses (red in the top plot) tend to be very information dense, having average entropy between 1.9 and 2 bits per nucleotide which actually does line up with our expectations. We notice that the bacteria however (purple in the top plot) are fairly spread out. Strangely enough there does seem to be a nearly linear relationship in the log space between genome length and information density for these organisms.
We were excited to realize that the two lowest entropy genome sequences we analyzed belonged to C. ruddii and N. deltocephalinicola – two bacteria that live within the cells of other organisms. We hoped that this might be a trend among such endosymbionts, and so analyzed several other bacterial endosymbionts and even organelles. Unfortunately the trend did not continue very strongly. There may be some vague hope for this idea even so: when comparing two endosymbionts of the tsetse fly, one of which is ‘primary’ (cannot be cultured outside of the host cells) and the other of which is ‘secondary’ (has been cultured alone), the primary endosymbiont had much lower sequence entropy.
It is possible that by including even more organisms in our analysis we may be able to notice more interesting trends, or more likely outliers like C. ruddii. Another possible way to build up the data set would be to pick some fixed length of DNA (say about bases) and sample subsequences of that length from many different organisms. This helps to remove length as a confounding factor – longer sequences tend to be more compressible because there’s more chance for repetition – and lets us focus more on differences between the organisms themselves.
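The fixed-length idea could be sketched as below. The window length is left as a parameter rather than a specific value, and `sample_windows` is a hypothetical helper, not part of the analysis described in this post.

```python
import random


def sample_windows(seq, window_len, n_windows, seed=0):
    """Draw n_windows random subsequences of equal length from one genome."""
    if window_len > len(seq):
        raise ValueError("genome is shorter than the requested window")
    rng = random.Random(seed)
    last_start = len(seq) - window_len
    # Windows may overlap; each start position is drawn independently.
    starts = [rng.randint(0, last_start) for _ in range(n_windows)]
    return [seq[s:s + window_len] for s in starts]
```

Each window could then be run through the same compression or k-mer entropy measurement, so that every organism is compared on sequences of identical length.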
Comparison Between Approaches
We compared the results of the entropy estimation and compression experiments by looking at the relationship between the compression rate implied by the estimated entropy and the one empirically observed from the compression algorithms.
The naive representation of the genetic alphabet is 2 bits, so dividing the 2-mer entropy by 2 gives the implied compression ratio. For the generic compression results, we divided the bits in the compressed file by the number of base pairs in the genome. This means bits/bp is strictly overestimated, but this should not matter for large enough genomes, where the metadata makes up a vanishing fraction of the overall file size. However, as shown in the chart, there is no meaningful linear relationship between the two quantities. Furthermore, while it may be unsurprising that 2-mers have close to 2 bits of entropy, the relative failure of compression algorithms that can jointly compress the whole genome to improve significantly upon this baseline suggests an inability to exploit long-term dependency or structure in the genome. While off-the-shelf compression algorithms cannot be expected to model the distribution of DNA on a biological level, we know that many organisms possess long repeated sequences.
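The two quantities being compared can be written down as a short sketch. The function names are hypothetical, and the entropy input is assumed to be expressed in bits per nucleotide already.

```python
def implied_ratio_from_entropy(bits_per_base):
    """Compression ratio implied by an entropy estimate, relative to 2 bits/base."""
    return bits_per_base / 2.0


def observed_bits_per_base(compressed_bytes, n_bases):
    """Bits per base from a compressed file size; includes format overhead."""
    return 8 * compressed_bytes / n_bases


def observed_ratio(compressed_bytes, n_bases):
    """Observed compression ratio relative to the naive 2-bit encoding."""
    return observed_bits_per_base(compressed_bytes, n_bases) / 2.0
```

For example, a 4,000-base genome that compresses to 1,000 bytes comes out at exactly 2 bits per base, a ratio of 1.0, meaning the compressor did no better than the naive 2-bit encoding.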
At the outreach event, we wanted to show how basic concepts of information relate to DNA and the genetic system underlying all organisms, including humans. First, we used DNA and protein production as a concrete example of a system that employs a simple code to reproduce complex information. For this demonstration, we used stick and ball magnet toys to make shapes representing amino acids and proteins. We presented a complex shape representing a protein and then challenged visitors to come up with ways to describe how to build the shape without access to visual information. With some help and prompting on our part, the students found out they could employ an “alphabet” of individual shapes. In our case, the letters of this alphabet (the amino acids) were simpler color-coded shapes.
Participants were then able to compose the complex protein by linking and folding these simple “amino acids”.
Many of our younger visitors got extra creative and came up with their own irregular proteins! As participants played with the demonstration and thought about how to represent information, we used a very simple slideshow to present, in broad strokes, how the genetic code works, drawing again on the alphabet analogy to explain how characters/sequences stand in for shapes.
Quite a few of the kids that came up to our station hadn’t heard of DNA before, but they quickly grasped the link between DNA and information. We followed up by challenging some of the intuition that flows from this notion of DNA as information. Most participants agreed with the idea that more complex organisms would need to store more information in their genetic code, and therefore would have longer genomes. This stage of our outreach activity tied in with our project exploring various measures of information on various reference genomes. To help our visitors test their assumptions, we put the images of four organisms (Sorangium cellulosum, pufferfish, humans, and bread wheat) on a poster and threaded pieces of yarn through a hole below each image. While only a tiny bit showed in the front, the full length of each piece of yarn corresponded to the actual unspooled length of DNA. Visitors could pull on these bits of yarn to discover the actual size of the organism’s DNA. In line with expectations about “complexity”, the bacterium had the shortest genome, followed by the pufferfish and then the human. However, many were surprised to discover that the wheat had the longest genome of all, coming in at five meters, five times longer than the human genome. This helped them understand that the relationship wasn’t quite as clear-cut as they might have thought. Parents, who had more background and thus stronger intuition about length, were the most surprised, while kids seemed to enjoy the tactile aspect, as with the magnet component of our outreach activity.
Table of Organism Information
|Organism Name||Brief Description||Total Genome Length (bases)|
|Caenorhabditis elegans||Nematode, common model organism|
|Escherichia coli||Bacteria, common model organism|
|Drosophila melanogaster||Fruit fly, common model organism|
|Schistosoma mansoni||Parasitic blood fluke|
|Saccharomyces cerevisiae||Budding yeast, common model organism|
|Danio rerio||Zebrafish, common model organism|
|Dictyostelium discoideum||Facultative multicellular slime mold|
|Hepatitis D||Human virus|
|Carsonella ruddii||Endosymbiotic bacteria|
|Mycoplasma genitalium||Bacteria with smallest genome of free-living organism|
|Schmidtea mediterranea||Highly regenerative flatworm|
|Helicobacter pylori||Human pathogenic bacteria|
|Thermoplasma acidophilum||Low pH extremophile, archaea|
|Agrobacterium tumefaciens||Bacteria used in plant genetic engineering|
|M. jannaschii||Deep ocean extremophile, archaea|
|Pyrococcus furiosus||Thermophile, archaea|
|TMV||Tobacco mosaic virus||6395|
|Pandoravirus salinus||Virus with largest known viral genome|
|Bacillus subtilis||Sporulating bacteria, common model organism|
|Chlamydomonas reinhardtii||Green algae|
|Takifugu rubripes||Puffer fish, one of the smallest vertebrate genomes|
|Sorangium cellulosum||Bacteria with one of the largest bacterial genomes|
|Schizosaccharomyces pombe||Fission yeast|
|RSV||Rous Sarcoma Virus||9392|
|Nasuia deltocephalinicola||Bacterial endosymbiont|
|Ebola virus||Human virus|
|GYSV||Grapevine yellow speckle viroid||366|
|Staphylococcus aureus||Human pathogenic bacteria|
|Aplysina aerophoba||Tube sponge|
|Plasmodium falciparum||Malaria parasite|
|Ostreococcus lucimarinus||Protist with one of the smallest eukaryotic genomes|
|Neurospora crassa||Red bread mold|
|Apis mellifera||Honey bee|
|Nasonia vitripennis||Jewel wasp|
|Anopheles gambiae||Mosquito, common malaria host|
|Genlisea aurea||Carnivorous plant with relatively small genome|
|Sterkiella histriomuscorum||Protist that undergoes genome fragmentation|
|Loxodonta africana||African elephant|
|Eschrichtius robustus||Grey whale|
|Belgica antarctica||Antarctic midge, one of smallest insect genomes|
|Cytophaga hutchinsonii||Cellulose-eating bacteria|
|Ashbya gossypii||Multinucleate fungus|
|Volvox carteri||Simple multicellular protist|
|Hydra vulgaris||Hydra, simple highly regenerative invertebrate|
|Ophiocordyceps unilateralis||Parasitic fungus, invades ant colonies|
|Vibrio fischeri||Communal bacterium|
|Octopus bimaculoides||California two spot octopus|
|A. thaliana chloroplast||Organelle|
|H. sapiens mitochondria||Organelle|
|A. thaliana mitochondria||Organelle|
|Richelia intracellularis||Bacterial intracellular endosymbionts|
|Buchnera sp. APS||Pea aphid endosymbiont|
|Wigglesworthia glossinidia||Tse-tse fly primary endosymbiont|
|Wolbachia pipientis||Intracellular parasite of insects|
|Calothrix rhizosoleniae||Intracellular endosymbionts|
|Endosymbiont of R. pachyptila||Tube worm endosymbiont – replaces digestive system|
|Sodalis glossinidius||Tse-tse fly secondary endosymbiont|
|Ooceraea biroi||Raider ant|
|Sinorhizobium meliloti||Legume endosymbiont|