PJS in Fearnan

Programming Language DNA

14th August 2017 – Capital and Conflict.
‘Last week, a team of hackers [University of Washington] demonstrated that it is possible to lace physical DNA with malicious software – when that DNA is tested, the results themselves take over the computer on which they are being analysed. From that point forward, any amount of digital mischief can be performed. The DNA transmission mechanism is a platform, and could be used to host any software, nefarious or not. In theory, you could keep a copy of Flight Simulator or the audio of your favourite song in your saliva if you wanted to.’

Is that exciting or scary? I suppose that depends on your point of view and on quite how mischievous or devious you might be. But it got me thinking about the difference, or relationship, between a computer programming language and the DNA found in nearly every one of our cells. In computing there are many languages, each designed for different purposes. At University I came across the teaching language Pascal, Bell Labs’ C and its object-oriented descendant C++. At work I used IBM’s mainframe language PL/I and Java, whose syntax borrows heavily from C++. Ultimately, whichever one you use, a computer understands nothing more than the flow of electrons, an electrical current passing through switches. For this to happen, human-readable code must be translated into machine code that drives those switches. An electrical current passes through, or is blocked by, millions of logic gates and causes things to happen according to the rules of Boolean algebra (an interesting story for another day). In other words, a computer understands nothing more than on or off, millions of times over. We represent this as 1s and 0s, binary. It is a two-character instruction set.
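To make the 1s and 0s a little more concrete, here is a minimal Python sketch showing the bit pattern behind a few letters. The function name `to_bits` is my own invention for illustration, and it assumes plain ASCII text at one byte per character.

```python
def to_bits(text: str) -> str:
    """Return the 8-bit pattern of each ASCII character, separated by spaces."""
    # encode("ascii") yields one integer (0..127) per character;
    # format(..., "08b") writes it as eight binary digits.
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


print(to_bits("DNA"))  # 01000100 01001110 01000001
```

Every letter, digit and punctuation mark you type ends up as a pattern like this before the machine can do anything with it.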

DNA is a four-character instruction set represented by A, C, G and T. This got me wondering about how much storage capacity there might be in each cell in terms of bytes and how I might relate this to something familiar. I have seen DNA described as the equivalent of so many hundreds of books or volumes, but that is a rather vague and inexact measurement. The amount of data-equivalence per printed page depends on many factors that must be defined for the comparison to work - the typeface, font size, the page and margin size and so on. If we are to compare it to a printed book, why not use one that has become a virtual SI unit in itself, in the manner that ‘football pitch’ has become a modern-day standard unit of measurement for dummies? A number of candidates present themselves – the Bible, the complete works of Shakespeare, War and Peace. Let’s go with the Bible and let’s use basic 8-bit ASCII coding.

Each character occupies one byte. One byte is made up of 8 bits, so that we have a 256-character set represented by values from 00000000 through to 11111111, or 2 to the power 8. This is ample for characters a..z, A..Z, 0..9 plus all standard English punctuation marks. There are slightly fewer than 800,000 words in the Bible, the exact number depending on translation. If we assume that each word is on average 6 letters long we have 4.8m letters. A space is also a character, as are punctuation marks. So let’s add on 1m of these by assuming one space per word and one punctuation mark per four words. There are also 1,189 chapters and 31,102 verses to be numbered, which adds comparatively little. Taking one megabyte as 1,000,000 bytes, we can therefore represent the entire Bible in a basic, unadorned text file of less than 6,000,000 bytes (6 MB). A standard 700 MB CD can thus hold more than 110 such Bibles.

We have already noted that DNA has four characters in its coding set. The letters represent the bases adenine, cytosine, guanine and thymine. Adenine always pairs with thymine and cytosine with guanine, giving four possible pairings at each position: A-T, T-A, C-G, G-C. These four possibilities could be represented in 2-bit binary: 00, 01, 10, 11. The haploid human genome consists of about 3 billion base pairs grouped into 23 chromosomes. We inherit a set of chromosomes from each of our parents, so that the diploid genome of 46 chromosomes consists of about 6 billion base pairs. Two bits is a quarter of a byte and so 6 x 10^9 x 0.25 = 1,500,000,000 bytes (1.5 GB). This means that in nearly every one of our cells we have chemical storage equal to the digital storage of two CDs and capable of containing about 250 Bibles. And bear in mind that it would take about 500 of these information-packed cells to create a visible blob the size of a full stop!
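For anyone who wants to check my sums, the arithmetic above can be sketched in a few lines of Python. This is a back-of-envelope illustration only. The constants are the rough figures used in the text; the allowance of four characters per chapter or verse number and the particular mapping of bases to bit patterns are my own assumptions, and all the names are made up for the example.

```python
# Rough figures from the text; every one of them is an estimate.
WORDS_IN_BIBLE = 800_000           # "slightly fewer than 800,000 words"
AVG_WORD_LENGTH = 6                # assumed average letters per word
CHAPTERS, VERSES = 1_189, 31_102
BIBLE_BUDGET = 6 * 10**6           # the 6 MB ceiling, with 1 MB = 1,000,000 bytes
CD_BYTES = 700 * 10**6             # nominal 700 MB CD
DIPLOID_BASE_PAIRS = 6 * 10**9     # about 6 billion base pairs
BITS_PER_BASE_PAIR = 2             # four possibilities fit in two bits


def bible_bytes() -> int:
    """Estimate the size of a plain-text Bible at one byte per character."""
    letters = WORDS_IN_BIBLE * AVG_WORD_LENGTH
    spaces = WORDS_IN_BIBLE              # one space per word
    punctuation = WORDS_IN_BIBLE // 4    # one mark per four words
    # Assumption (not in the text): a generous 4 characters per number.
    numbering = (CHAPTERS + VERSES) * 4
    return letters + spaces + punctuation + numbering


def genome_bytes() -> int:
    """Bytes needed to store the diploid genome at two bits per base pair."""
    return DIPLOID_BASE_PAIRS * BITS_PER_BASE_PAIR // 8


# Arbitrary mapping chosen for illustration; any one-to-one assignment works.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(sequence: str) -> bytes:
    """Pack a string of bases into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))    # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)


print(bible_bytes())                    # estimated Bible size in bytes
print(CD_BYTES // BIBLE_BUDGET)         # whole 6 MB Bibles per CD
print(genome_bytes())                   # genome size in bytes
print(genome_bytes() / CD_BYTES)        # genome size in CDs
print(genome_bytes() // BIBLE_BUDGET)   # whole 6 MB Bibles per genome
print(pack("ACGT").hex())               # four bases squeezed into one byte
```

With these assumptions the Bible estimate comes out just under the 6 MB ceiling, and `pack("ACGT")` yields the single byte `1b`, which is the whole point: four letters of DNA fit into the space of one letter of English.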

That 1.5 GB of data is not just random information or garbage. It is an instruction set, software, a program that can be read, executed and copied. When viewed in those terms it becomes completely understandable that anyone who learns the syntax of the language and the requirements of the compiler can, with suitable equipment, amend the code, or even write executable programs of their own. Medical scientists are making great leaps and bounds in the debugging of the human genome to fix genetic defects and illnesses. Designer babies are no longer the stuff of science fiction. The ultimate genetic defect is the aging process with its inevitable outcome, death. Is it within the grasp of humankind to defeat ‘natural’ death? If it’s possible to debug the code in order to correct genetic errors then it logically follows that code can be written that performs other functions, things that might be less salubrious and of a more heinous and malevolent nature. This is, after all, an evil world, Satan’s world. Bwa-ha-ha-ha! What was that observation recorded in Genesis 11:6 that led to direct divine intervention? ‘There is nothing that they may have in mind to do that will be impossible for them…’