In this post we’re interested in storing information in DNA. Why? Well, because it’s so small. No, seriously: DNA base pairs are something like a third of a nanometer in length along the DNA strand. Meanwhile, bits of today’s hardware memory take up space on the order of tens or hundreds of nanometers. That’s a sizable jump in density, especially as building at smaller process nodes becomes harder and harder. Ok, so putting data in DNA is a cool idea. However, DNA is susceptible to tricky kinds of errors: insertions and deletions. Let’s draw on some of our information science knowledge and try to understand why these are so tricky.
The phrase “information science” already tells you that we’re going to be talking a lot about information. What information? Well, this is a central question to setting up any kind of communications scheme. Let’s say I tell you “hey, check out this blog post!” I have an idea in my head that might be described in terms of some number of neurons firing with certain frequency, and then the sentence leaves my mouth as a set of vibrations with particular length and frequency. Maybe these vibrations enter a text-to-speech program on my phone, and suddenly they’re 30 characters of text data, etc…
Ok, so to make things easier we define what setting we’re interested in, and we abstract that setting to bits of data crossing a channel from sender to receiver. Now our problems pretty much always look something like this: I have n bits of message data and I want to encode it into k bits and send those across the channel so that you can decode what I sent. Usually we measure the rate of an encoding, n/k: how much message information each transmitted bit carries. We also measure how good the decoder is at its job.
Well cool! Except we’re already kind of cheating! See, it’s important to our evaluations that we not be sneaking any information through a back door, and we are: in the form of high-level synchronization. Synchronization as a word describes things that act at the same time. In this context, it refers to the sender and receiver both knowing that the message is being sent, “message-level” synchronization. “Bit-level” synchronization, similarly, refers to the receiver knowing which received bits correspond to which sent encoded bits.
Message-level synchronization is often a reasonable real-world assumption, and bit-level synchronization can be accomplished in most contexts by timing recovery techniques. Such techniques give a picture of the sent encoded bits as well as timing gaps corresponding to erasures. However, the “reception” of DNA is disconnected from any kind of symbol timing: a portion of a message received as “AA” may have come from an encoded string “AAA”, “ACA”, “ATA”, “AGA”, or just “AA”. This gives us the concept of deletions, a form of synchronization error.
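To see the scale of the ambiguity, we can enumerate it directly. This is a minimal sketch (not from the project code) that finds every length-3 string over the DNA alphabet that could have produced a received “AA” via a single deletion:

```python
from itertools import product

def deletions(s):
    """All strings obtainable from s by deleting exactly one symbol."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}

received = "AA"
# Every length-3 string over {A, C, G, T} that can become "AA"
# after one symbol is deleted.
candidates = ("".join(p) for p in product("ACGT", repeat=3))
sources = sorted(c for c in candidates if received in deletions(c))
print(sources)
# ['AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'AGA', 'ATA', 'CAA', 'GAA', 'TAA']
```

Ten different three-symbol strings, plus “AA” itself with no deletion at all, and the receiver gets no timing hint about which one was sent.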
Deletions and insertions (the “opposite” kind of synchronization error) are both so tricky because they invalidate assumptions we usually like to make about information shared between sender and receiver. Synchronization codes have to ensure that instances of these errors on different members of their codebooks will still leave them unambiguous: even the single example in the previous paragraph should already give the reader a sense of how hard this task is.
One way of approaching the problem, as taken by Wang, 2012, is with a concatenated code. The general idea of a concatenated code is to encode and decode twice, once with an outer code giving some set of properties, and again with an inner code giving some other set. In this case, the outer code will be some standard “good” code like an LDPC or turbo code, making the usual assumptions of bit-level synchronization. We then can satisfy these assumptions with an inner synchronization code.
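The structure is easy to sketch. The following toy pipeline is purely illustrative: a 3x repetition code stands in for the LDPC/turbo outer code, and the inner synchronization code is left as an identity placeholder, just to show how the two layers nest:

```python
def outer_encode(bits):
    # Stand-in for an LDPC/turbo encoder: repeat each bit 3 times.
    return [b for b in bits for _ in range(3)]

def outer_decode(bits):
    # Majority vote over each group of 3 received bits.
    return [int(sum(bits[i:i + 3]) >= 2) for i in range(0, len(bits), 3)]

def inner_encode(bits):
    # Placeholder for a synchronization (e.g. marker) code.
    return bits

def inner_decode(bits):
    return bits

msg = [1, 0, 1, 1]
sent = inner_encode(outer_encode(msg))      # outer first, then inner
recovered = outer_decode(inner_decode(sent))  # decode in reverse order
print(recovered == msg)  # True
```

The key point is the nesting order: the inner decoder restores bit-level synchronization first, so the outer decoder can assume it, exactly as it would over an ordinary substitution/erasure channel.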
In our EE376a project we follow Wang in implementing simple marker synchronization codes. In such a code, a particular sequence of “pilot” bits is inserted at a set interval throughout the input. Knowledge of what and where these bits are is synchronized between the sender and receiver, so the receiver can use this information to help detect where insertions and deletions occurred, and then pass this information along to the decoder for the outer code.
While our marker code implementations are sufficient to use in such a scheme, we continue to work on a complete concatenated encoding/decoding model to conduct error correction analysis. For future work, we hope to use these full models to replicate some of Wang’s simulated results, and if possible to apply the scheme in actual DNA coding in the lab.
While working on this project, we had the opportunity to engage in the information science outreach event at Nixon Elementary. Some of the concepts of synchronization and concatenated codes were likely to be a bit tricky for our audience, but we took the opportunity to teach about DNA as a form of information storage, with a focus on recovering information from incomplete sets.
For an activity, we gave the kids pictures of animals with erasures and asked them to draw what they thought should go in the missing parts. To some of them this was just a fun kind of puzzle game, but even so we’re happy that it got them thinking about recovering information from corrupted messages. For others, the activity seemed to prompt some really great questions about DNA’s role in conveying information in the body! One favorite was “how do the [ribosomes] know if the DNA is [mutated too much]?” which sparked an interesting discussion about probabilistic processes with several of the adults around as well.
Overall, this project was a great opportunity to question assumptions and learn new things, both about DNA and about synchronization codes. And hopefully this blog post has been the same for you! Next time you see a communication scheme, ask yourself how much implicit synchronization it relies on, and how it ensures that synchronization. For more reading on synchronization codes, check out some of our sources:
M. Mitzenmacher, 2019:
F. Wang, 2012: