This post will teach you the basics of what DNA is and how it can be translated to be read on a computer to be analysed in Python. You will also learn how to write two useful programs needed to analyse DNA.

What is DNA?


DNA (Deoxyribonucleic acid) is a molecule that encodes genetic information for making RNA and proteins inside cells (like binary 1s and 0s but for living organisms). Strands of DNA are spread across chromosomes (46 chromosomes in humans, 23 maternal and 23 paternal) in cell nuclei which make up genes determining characteristics and functions of our body. The structure and function was identified between the 1860s and 1950s by work between multiple scientists⁽¹⁾.

Structure of DNA

DNA has a double-helical structure made of four molecules called nucleotides. Around 3.2 billion nucleotides make up the human genome⁽²⁾. One strand is made of adjacent nucleotides linked by phosphodiester bonds between specific carbon atoms of a ribose sugar. Nucleotides consist of a negatively charged sugar-phosphate backbone and a base, either Adenine (A), Thymine (T), Guanine (G) or Cytosine (C). Bases form bonds with each other, adenine always forms a double bond with thymine and guanine forever forms a triple bond with cytosine in DNA, making up the double helix structure.

Figure 1. Shows the structure of DNA and how nucleotides come together to form chromosomes.

Figure 1. Shows the structure of DNA and how nucleotides come together to form chromosomes.

Setup


Now we have a little background on DNA, let’s get started. First, we will need to install Python 3 and an IDE (we could use notepad but it won’t be pretty) used to write code in. I use VSCode so I’ll go through how to set it up.

Installing Python 3

  1. You can download python by clicking the link. we want to download a version of Python 3 and click install.

  2. Verify Python is installed correctly by launching the command prompt on Windows or terminal on Mac and type the following command and press enter.

python

python test

  1. You should see Python followed by the version, indicating it has been successfully installed. Try type 2 + 2 and hit enter. Well done! we’ve got python installed 😄.

Installing VSCode

Now instead of writing all our code in the terminal, using an IDE to write our code allows the use of plugins and extra features like code highlighting and autocomplete.

  1. Download VSCode and launch it.
  2. On the side menu to the left press the extensions icon. extensions icon
  3. Search for Python and install the extension. This will let you use Python in VSCode. Python extension

We’re now finally ready to start playing with DNA on our computers ⚡.

DNA on computers


DNA sequences can be identified using labeled dideoxyribonecleoside triphosphates* and gel electrophoresis which produces sequences that can be read by a detector. On computers we can simply store sequences as text e.g. “AGTGTCCCTG”, in Python we can store DNA sequences of millions of bases long in a string variable and analyse it.

Counting DNA length

We are going to write a function that counts the bases of a short sequence and prints it out, there is some extra code in the snippets below too. This is just code that lets python run our code. Remember the indentations are important in Python code and anything following a # or "”” is just a comment to explain what the lines are doing and do not need to be written.

  1. In a folder create a new file and save it as a *.py file e.g. sequence_count.py and open it in VSCode.
  2. Follow and copy the code below all the lines have comments explaining what they do.
#define our functions
def print_sequence_length():
  seq = "AGTGTCCCTG" #store DNA strand as a string in the variable seq
  print(len(seq)) #prints the sequence length


if __name__ == "__main__": #telling python to run the code below
  print_sequence_length() #run our function
  1. Press the play button at the top right corner of the window play button and this will run the code in the built-in terminal. You could also open a terminal in the same folder and run the python command followed by your file path.
python sequence_count.py

You’ve just made your first program that counts sequence bases and prints it! Next, we’re going to get learn how to import a entire genome using a common file format used to store DNA sequences.

Parsing FASTA files

If you stored real DNA sequences in your python code it would get really messy, so it’s best to import sequences that have been stored on external files. A common file type used is the FASTA format (just a formatted text file with *.fasta, *fna extensions) Heres an example.

>Biospace_1218
CCTGCGGAAGATCGGCGTCGGTCTAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGC
>Biospace_2730
CCATCGGTAGCGCTTCCTTCGGCTAATTAAGTCCCTATCCAGGCGCTCCGACGAAGGTCT
ATATCCATTTGTCAGCAGACACTC
>Biospace_4280
CCACCCTCGTGGTATGGCTAGGCATTTTCGAACCGGATAACGCTTCAGACAAGCCCGGAC
TGGGAACCTGCGCGCAGTAGGTAGAAT

Each sequence is identified by “>” followed by the name and sometimes extra information about it. The lines following the one starting with “>” is the sequence until we reach the next identifier. To be able to work with these sequences we need to first read, clean and parse the sequences, so lets start by downloading a FASTA file from here or you can find one from NCBI. I have provided the entire genome of the bacteria Candidatus Carsonella ruddii which has one of the smallest genomes found at a 159,662 base pairs of DNA and 182 protein-encoding genes.

Entire genome base frequency

Lets make a new file and write a function to read the FASTA file and another to count the number of occurrences of bases in the genome.

#imports a library included in Python to give extra functions
from collections import Counter

"""function takes in a filepath then opens & parses the
fasta file returning a dictionary {'label':'seq'}."""
def open_and_parse_fasta(filepath):
    FASTA_file = open(f"{filepath}", "r")
    FASTA_dict = {} 
    FASTA_label = ""
    for line in FASTA_file: 
        line = line.rstrip() #removes spaces and newline chars each line
        if line.startswith(">"): #catches sequence label
            FASTA_label = line[1:] #set the label as 1st char and after
            #set sequence label in dict with empty string
            FASTA_dict[FASTA_label] = "" 
        else: #must be sequence
            #add sequence to end of dict entry
            FASTA_dict[FASTA_label] += line 
    return FASTA_dict

#function to count base occurence and print results
def count_bases():
  #run our open FASTA function and set returned FASTA_dict to FASTA variable
  FASTA = open_and_parse_fasta("./GCF_000010365.1_ASM1036v1_genomic.fna")
  #select the sequence we want to look at from the dict
  carsonella_ruddi = FASTA.get('NC_008512.1 Candidatus Carsonella ruddii PV DNA, complete genome')
  #counts and prints the bases in the genome
  print(Counter(carsonella_ruddi)) 

if __name__ == "__main__":
  count_bases()

If everything works correctly, assuming you used the genome for Carsonella ruddi you should get these results.

Counter({'A': 66734, 'T': 66481, 'C': 13501, 'G': 12946})

Congratulations! 🎉 you’ve made multiple programs that can be used to analyse DNA that would take humans forever to do. If there are any questions or you have feedback on how to improve the code you can post them in the comments below.

In the next post, we will be writing more code that will create reverse complements and mimic DNA replication.

References

[expand]

  1. Pray, L., 2008. Discovery Of DNA Double Helix: Watson And Crick. [online] Nature.com. Available at: https://www.nature.com/scitable/topicpage/discovery-of-dna-structure-and-function-watson-397/ [Accessed 22 April 2020].

  2. Chial, H., 2008. Human Genome Project: Sequencing The Human Genome. [online] Nature.com. Available at: https://www.nature.com/scitable/topicpage/dna-sequencing-technologies-key-to-the-human-828/# [Accessed 22 April 2020].

[/expand]

Appendix

(*) These are derivates of normal deoxyribonucleoside triphosphates but lack the 3’ OH group causing termination of replication. Usually they incorporate a fluorescent or radioactive molecule thus, the end of the DNA is labeled and can be detected.

Full code on Github.