Practical 10

In this practical we will will get acquainted with Biopython.

Slides

The slides of the introduction can be found here: Intro

Biopython

From the Biopython tutorial: The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology.

The goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes. Biopython features include parsers for various Bioinformatics file formats (FASTA, FASTQ, BLAST, PDB, Clustalw, Genbank,…), access to online services (NCBI, Expasy,…), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS…), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.

In this practical we will see some features of Biopython but refer to biopython documentation to discover all its features, recipes etc.

These notes are largely based on what available here.

Installation of Biopython

To test if Biopython is installed in your system you can import the library with:

import Bio

if the execution gives:

Biopython needs to be installed.

In windows installing Biopython should be as easy as opening the command prompt as admininstrator (typing cmd and then right clicking on the link choosing run as admininstrator) and then pip3 install biopython. If that does not work, you should be able to install biopython with the following: python3.xx -m pip install biopython (where you should change python3.xx with the path to your python3.xx executable).

In linux sudo pip3 install biopython will install biopython for python3 up to python3.5. On python 3.6 and later the command is: python3.6 -m pip install biopython or, more in general, pythonX.X -m pip install biopython.

Detailed installation instructions can be found here.

General overview

Biopython provides the capability of parsing standard bioinformatics formats into python usable data structures. Some of the supported tools and formats are:

  1. Blast output – both from standalone and WWW Blast
  2. Clustalw
  3. FASTA
  4. GenBank
  5. PubMed and Medline
  6. ExPASy files, like Enzyme and Prosite
  7. SCOP, including ‘dom’ and ‘lin’ files
  8. UniGene
  9. SwissProt

among the others. In particular it provides a sequence class to perform the most common operations on sequences, parsing of alignments, clustering etc. and it can quickly connect to the most popular databases to download data or parse information on the fly.

Sequence objects

Biopython has a specific object Seq to deal with biological sequences which is a lot more powerful than the str object of python when it comes to manage biological data. The object Seq has methods like translate() and reverse_complement() which are very handy.

As in the case of python’s str objects, the object Seq is immutable. The mutable version is MutableSeq.

All the information on Seq objects can be found here.

The alphabet

Seq objects are composed by two information: the actual sequence and the alphabet the sequence refers to. We normally will use the IUPAC alphabet as defined in the Bio.Alphabet.IUPAC which provides a suitable alphabet to deal with DNA, RNA and proteins. Several options are available, including Bio.Alphabet.generic_dna or Bio.Alphabet.generic_protein, Bio.Alphabet.ThreeLetterProtein and other.

Specifying an alphabet quickly tells you what information you are dealing with and moreover it provides a way to specify a constraint on the data.

If the definition of the alphabet is not specified, the default one is used (Alphabet).

In [1]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

#No alphabet specified
s = Seq("GATTACATAATA")
dna_seq = Seq("GATTATACGTAC", IUPAC.unambiguous_dna)
print("S:", s)
print("S's alphabet:", s.alphabet)
print("dna_seq:", dna_seq)
print("dna_seq's alphabet:", dna_seq.alphabet)

my_prot = Seq("MGNAAAAKKGSEQE", IUPAC.protein)
print("my_prot:", my_prot)
print("my_prot's alphabet:", my_prot.alphabet)
S: GATTACATAATA
S's alphabet: Alphabet()
dna_seq: GATTATACGTAC
dna_seq's alphabet: IUPACUnambiguousDNA()
my_prot: MGNAAAAKKGSEQE
my_prot's alphabet: IUPACProtein()

Looping, slicing and concatenation

Note also that Seq objects behave like strings, but the consistency of the alphabet is checked too. For example we cannot concatenate a unambiguous_dna with a IUPAC.protein sequence. This is the error message we would get:

To do it one should change the alphabet to generic_alphabet:

In [2]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_alphabet

dna_seq = Seq("GATTATACGTAC", IUPAC.unambiguous_dna)
my_prot = Seq("MGNAAAAKKGSEQE", IUPAC.protein)

my_prot.alphabet = generic_alphabet

#Does it really make sense though?!?
print(dna_seq + my_prot)
GATTATACGTACMGNAAAAKKGSEQE

We can also loop through the elements and slice strings with the usual [S:E:step] operator (and as usual S is included, E is excluded). We can also convert a Seq object into a string.

In [3]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

dna_seq = Seq("GATTATACGTACGGCTA", IUPAC.unambiguous_dna)

for base in dna_seq:
    print(base, end = " ")

print("")

sub_seq = dna_seq[4:10]
print(sub_seq)

#Let's reverse the string:

print("Reversed: ", dna_seq[::-1])
#from Seq to string:
dna_str = str(dna_seq)
print("As string:", dna_str)
print(type(dna_str))
G A T T A T A C G T A C G G C T A
ATACGT
Reversed:  ATCGGCATGCATATTAG
As string: GATTATACGTACGGCTA
<class 'str'>

The object Seq provides several methods similar to the corresponding methods of strings:

  1. Seq.count(s) : counts the number of times s appears in the sequence;
  2. Seq.upper() : makes the sequence of the object Seq upper case
  3. Seq.lower() : makes the sequence of the object Seq lower case

Example: Let’s compute the GC content of the sequence “GATTRWWACGTACGGCTASATTACSCCGGCTA”.

In [4]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.SeqUtils import GC

dna_seq = Seq("GATTRWWACGTACGGCTASATTACSCCGGCTA", IUPAC.ambiguous_dna)

gc = (dna_seq.count("G") + dna_seq.count("C"))/len(dna_seq)
print("GC % is {:.2f}".format(100*gc))

correct_gc = GC(dna_seq)
print("GC % is {:.2f}".format(correct_gc))

print("original:", dna_seq)
print("lower_case:",dna_seq.lower())
GC % is 40.62
GC % is 46.88
original: GATTRWWACGTACGGCTASATTACSCCGGCTA
lower_case: gattrwwacgtacggctasattacsccggcta

The difference in the results goes down to the fact that Biopython’s GC method takes into account ambiguous nucleotides like (S which stands for C or G ref iupac). This is a method provided in the SeqUtils module (more info on all the methods here).

Complement and reverse complement

A usual operation on DNA strings is the reverse complement. Biopython provides two specific methods to do that:

  1. Seq.complement() to complement the sequence
  2. Seq.reverse_complement() to reverse complement the sequence.

Note that these methods will fail with a Value Error if we try to apply them to something that is not a nucleotide sequence (e.g. proteins).

In [5]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)

print("Sequence (alphabet: {}):".format(my_seq.alphabet))
print(my_seq)
comp  = my_seq.complement()
print("")
print("Complement (alphabet: {}):".format(comp.alphabet))
print(comp)
print("")
revcomp = my_seq.reverse_complement()
print("Rev Complement (alphabet: {}):".format(revcomp.alphabet))
print(revcomp)
Sequence (alphabet: IUPACUnambiguousDNA()):
GATCGATGGGCCTATATAGGATCGAAAATCGC

Complement (alphabet: IUPACUnambiguousDNA()):
CTAGCTACCCGGATATATCCTAGCTTTTAGCG

Rev Complement (alphabet: IUPACUnambiguousDNA()):
GCGATTTTCGATCCTATATAGGCCCATCGATC

Transcription and translation

Remember that:

[taken from the Biopython tutorial].

Biopython has a method to transcribe a DNA sequence into a mRNA sequence (and one to go back):

  1. Seq.transcribe() that transcribes a DNA sequence into mRNA (converting Ts in Us).
  2. Seq.back_transcribe() that back-transcribes a mRNA sequence into DNA (converting Us in Ts).

The alphabet of the transcribed sequence is converted into mRNA. Note that transcribing a protein will result in a ValueError.

Alike, a DNA or mRNA sequence can be translated into a protein with:

Seq.translate(table=N, to_stop=True/False, stop_symbol =symbol) that converts a DNA sequence or mRNA into the corresponding protein.

Optionally, the translate method accepts a parameter table to specify the NCBI translation table used (see below for codes). to_stop makes the translation stop at the first stop codon if True (in this case the stop codon is not translated!), stop_symbol to replace the asterisk with any other stop character (Es. “@”).

image0 Stop codons will be represented by an asterisk.

Example: Let’s translate into mRNA the coding sequence: ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG.

In [6]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
                 IUPAC.unambiguous_dna)
print(coding_dna)

mrna = coding_dna.transcribe()
print(mrna)
print("")
print(mrna.alphabet)
print("")
print("... and back")
print(mrna.back_transcribe())
print("")
print("Translation to protein:")
prot = mrna.translate()
print(prot)
print("")
print("Up to first stop:")
print(mrna.translate(to_stop = True))
print("")
print("Mitocondrial translation: (TGA is W!)")
mit_prot = mrna.translate(table=2)
print(mit_prot)

ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG

IUPACUnambiguousRNA()

... and back
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG

Translation to protein:
MAIVMGR*KGAR*

Up to first stop:
MAIVMGR

Mitocondrial translation: (TGA is W!)
MAIVMGRWKGAR*

Translation tables can be accessed at runtime through:

In [7]:
from Bio.Data import CodonTable
#equivalent to:
#standard_table = CodonTable.unambiguous_dna_by_id[1]
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
#equivalent to:
#standard_table = CodonTable.unambiguous_dna_by_id[2]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]


print(standard_table)
print(standard_table)

print(mito_table)
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

MutableSeq

Seq objects are immutable and therefore we cannot change their value. If we need to do so, we need to convert them into MutableSeq objects to avoid a runtime TypeError that will occur when we try to change a Seq. Seq to MutableSeq conversion can be done by:

  1. Seq.tomutable() to convert a Seq to a MutableSeq.
  2. MutableSeq.toseq() to convert a MutableSeq to a Seq.

This will allow us to assign the value of characters within the MutableSeq with MutableSeq[ind] = newValue. Note also that methods like complement change the actual object if applied to a MutableSeq:

In [8]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
print("Coding DNA:")
print(coding_dna)
mut_dna = coding_dna.tomutable()
print("Mutable DNA:")
print(mut_dna)
mut_dna[10:20] = "TATATATATA"
print("Mutable DNA now:")
print(mut_dna)
print("")
print("Rev comp (Seq not changed!):")
coding_dna.reverse_complement()
print(coding_dna)
print("Rev comp (MutableSeq changed!):")
mut_dna.reverse_complement()
print(mut_dna)
Coding DNA:
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
Mutable DNA:
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
Mutable DNA now:
ATGGCCATTGTATATATATACTGAAAGGGTGCCCGATAG

Rev comp (Seq not changed!):
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
Rev comp (MutableSeq changed!):
CTATCGGGCACCCTTTCAGTATATATATACAATGGCCAT

Sequence annotation

Sequence records are annotations associated to sequences. In biopython there is a SeqRecord object to store this information. SeqRecord are part of the Bio.SeqRecord module which therefore needs to be imported prior to be used.

The SeqRecord class has the following attributes:

  1. SeqRecord.seq : the sequence (the Seq object)
  2. SeqRecord.id : the identifier of the sequence, typically an accession number
  3. SeqRecord.name : a “common” name or identifier sometimes identical to the accession number
  4. SeqRecord.description : a human readable description of the sequence
  5. SeqRecord.letter_annotations : a per letter annotation using a restricted dictionary (e.g. quality)
  6. SeqRecord.annotations : a dictionary of unstructured annotation (e.g. organism, publications,…)
  7. SeqRecord.features : a list of SeqFeature objects with more structured information (e.g. genes pos).
  8. SeqRecord.dbxrefs : a list of database cross references.

Information on SeqFeature objects can be found here.

Although you can manually create a SeqRecord, normally you will access SeqRecords populated by methods like SeqIO, here you can find how to manually create a SeqRecord.

Example:

Read a fasta file NC005816.fna containing the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1 and retrieve some information about the sequence.

In [9]:
from Bio import SeqIO

record = SeqIO.read("file_samples/NC_005816.fna", "fasta")

print(record)
print("")
print("Sequence [first 30 bases]:")
print(record.seq[0:30])
print("")
print("The id:")
print(record.id)
print("")
print("The description:")
print(record.description)
print("")
print("The record is a: ", type(record))

ID: gi|45478711|ref|NC_005816.1|
Name: gi|45478711|ref|NC_005816.1|
Description: gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
Number of features: 0
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet())

Sequence [first 30 bases]:
TGTAACGAACGGTGCAATAGTGATCCACAC

The id:
gi|45478711|ref|NC_005816.1|

The description:
gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence

The record is a:  <class 'Bio.SeqRecord.SeqRecord'>

Sequence I/O

The Bio.SeqIO module aims to provide a simple way to work with several different sequence file formats. Detailed information on the module can be found here.

WARNING: When dealing with very large FASTA or FASTQ files, the overhead of working with all these objects can make scripts too slow. In this case SimpleFastaParser and FastqGeneralIterator parsers might be better as they return just a tuple of strings for each record.

For example, to use SimpleFastaParser:

In [8]:
from Bio.SeqIO.FastaIO import SimpleFastaParser


with open("file_samples/NC_005816.fna") as in_handle:
    for title, seq in SimpleFastaParser(in_handle):
        s = seq[0:50]+"\n"+seq[51:101]+"\n..."
        print("Header:\n{}\nSequence:\n{}".format(title,s))

#Another example:
print("\n")
labels = ["1st","2nd","3rd"]

with open("file_samples/contigs82.fasta") as cont_handle:
    for l in labels:
        ID, seq  = next(SimpleFastaParser(cont_handle))

        print(l, "entry:")
        print(ID, " has size ", len(seq))
        print(seq[:50]+"...")
        print("")

Header:
gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
Sequence:
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATC
AGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAG
...


1st entry:
MDC020656.85  has size  2802
GAGGGGTTTAGTTCCTCATACTCGCAAAGCAAAGATACATAAATTTAGAA...

2nd entry:
MDC013284.379  has size  5173
TATCGTTTCCTCTGAGTAGAATATCGTTATAACAAGATTTTTTTTTTCCT...

3rd entry:
MDC018185.241  has size  23761
AAAACGAGGAAAATCCATCTTGATGAACAGGAGATGCGGAGGAAAAAAAT...

BioSeqIO.parse()

The method Bio.SeqIO.parse is used to parse some sequence data into a SeqRecord iterator. In particular, the basic syntax is:

SeqRecordIterator = Bio.SeqIO.parse(filename, file_format)

where filename is typically an open handle to a file and file_format is a lower case string describing the file format. Possible options include fasta, fastq-illumina, abi, ace, clustal… all the options are available here.

Example: Let’s read the first 10 entries in the .fasta file contigs82.fasta printing off the length of the sequence and the first 50 bases of each sequence followed by “…”.

In [11]:
from Bio import SeqIO
cnt = 0
for seq_record in SeqIO.parse("file_samples/contigs82.fasta", "fasta"):
    if(cnt == 10):
        break
    print("Seq {} has length {}".format(seq_record.id, len(seq_record)))
    print(seq_record.seq[:50]+"...")
    print("")
    cnt += 1
Seq MDC020656.85 has length 2802
GAGGGGTTTAGTTCCTCATACTCGCAAAGCAAAGATACATAAATTTAGAA...

Seq MDC001115.177 has length 3118
TGAATGGTGAAAATTAGCCAGAAGATCTTCTCCACACATGACATATGCAT...

Seq MDC013284.379 has length 5173
TATCGTTTCCTCTGAGTAGAATATCGTTATAACAAGATTTTTTTTTTCCT...

Seq MDC018185.243 has length 22724
CCATTAGTGACCCCCAATGCGGATTAACCAAGCACGGTCAAGATAACCAA...

Seq MDC018185.241 has length 23761
AAAACGAGGAAAATCCATCTTGATGAACAGGAGATGCGGAGGAAAAAAAT...

Seq MDC004527.213 has length 3551
CTTGTATGTTGAAGCTTTGTGAGTGGAGCATATAGGTTGAGGTAGTGTTC...

Seq MDC003661.174 has length 3334
AGTAAGGTTGTAGAATTCAATTTCCACGGAACTCCAGCACAGCTTAGGCA...

Seq MDC012176.157 has length 2236
ATAGTCACTGACAGCCGGTGACAGTTCGACAACAATAAGTTCAGGATGTT...

Seq MDC001204.812 has length 5804
TAATGTCCGAAAAAGATGAAAATGTAGTTTGCATGAAGAAGGAGCTGTCC...

Seq MDC001204.810 has length 9798
ATGCTGCTGACATATTCCAATCGACTACACGTCACATCCGTTGCTATTAT...

Note that Bio.SeqIO.parse returns an iterator, therefore it is possible to manually fetch one SeqRecord after the other with the next(iterator) method.

Example: Let’s get the first 3 entries of the .fasta file contigs82.fasta printing off the length of the sequence and the first 50 bases of each sequence followed by “…”.

In [12]:
from Bio import SeqIO

seqIterator = SeqIO.parse("file_samples/contigs82.fasta", "fasta")

labels = ["1st","2nd","3rd"]
for l in labels:
    seqRec = next(seqIterator)
    print(l, "entry:")
    print(seqRec.id, " has size ", len(seqRec.seq))
    print(seqRec.seq[:50]+"...")
    print("")

1st entry:
MDC020656.85  has size  2802
GAGGGGTTTAGTTCCTCATACTCGCAAAGCAAAGATACATAAATTTAGAA...

2nd entry:
MDC001115.177  has size  3118
TGAATGGTGAAAATTAGCCAGAAGATCTTCTCCACACATGACATATGCAT...

3rd entry:
MDC013284.379  has size  5173
TATCGTTTCCTCTGAGTAGAATATCGTTATAACAAGATTTTTTTTTTCCT...

Sequences as dictionaries

The module Bio.SeqIO also has three different ways to allow random access to elements:

  1. Bio.SeqIO.to_dict(file_handle/iterator) : builds a dictionary of all the SeqRecords keeping them in memory and allowing modifications to the records. This potentially uses a lot of memory but is very fast;
  2. Bio.SeqIO.index(filename,file_type) : builds a sort of read-only dictionary, parses the elements into SeqRecords on demand (i.e. it returns an iterator!). This method is slower, but more memory efficient;
  3. Bio.SeqIO.index_db(indexName.idx,filenames, file_format) : builds a read-only dictionary, but stores ids and offsets on a SQLite3 database. It is slower but uses less memory.

The input of Bio.SeqIO.to_dict(file_handle/iterator) is a open file handler or the iterator to SeqRecords (like the output of Bio.SeqIO.parse()).

Example: Let’s read all the .fasta file contigs82.fasta in memory printing off the entry corresponding to contig “MDC019140.399”.

In [13]:
from Bio import SeqIO
seqDict = SeqIO.to_dict(SeqIO.parse("file_samples/contigs82.fasta", "fasta"))

print(list(seqDict.keys())[0:5])
print("")
print("Number of sequences: ", len(seqDict))
print("")
mySeq = seqDict["MDC019140.399"]
print("Description:", mySeq.description)
print("")
print(mySeq)


['MDC020656.85', 'MDC001115.177', 'MDC013284.379', 'MDC018185.243', 'MDC018185.241']

Number of sequences:  82

Description: MDC019140.399

ID: MDC019140.399
Name: MDC019140.399
Description: MDC019140.399
Number of features: 0
Seq('GACGGGCGGGGAGGGTTTGGTTTTTTTTGGTTTTTAAAAAATTCAGGTTTNNAA...TTT', SingleLetterAlphabet())

The input of Bio.SeqIO.index(filename,file_type) is the file name to read sequences from and the file type (like fasta, genebank,…).

Example: Let’s read all the .fasta file contigs82.fasta in memory (as an index) printing off the entry corresponding to contig “MDC019140.399”.

In [14]:
from Bio import SeqIO
seqDict =  SeqIO.index("file_samples/contigs82.fasta", "fasta")
print("SeqDict is an iterator:")
print(seqDict.keys())
print("")
print(list(seqDict.keys()))
print("")
print("Number of sequences: ", len(seqDict))
print("")
mySeq = seqDict["MDC019140.399"]
print("Description:", mySeq.description)
print("")
print(mySeq)
SeqDict is an iterator:
<odict_iterator object at 0x7fc4c416ff68>

['MDC020656.85', 'MDC001115.177', 'MDC013284.379', 'MDC018185.243', 'MDC018185.241', 'MDC004527.213', 'MDC003661.174', 'MDC012176.157', 'MDC001204.812', 'MDC001204.810', 'MDC004389.256', 'MDC024257.15', 'MDC018297.229', 'MDC001802.364', 'MDC016621.241', 'MDC014057.243', 'MDC021015.302', 'MDC018185.242', 'MDC051782.000', 'MDC017187.314', 'MDC017187.311', 'MDC012865.410', 'MDC000427.83', 'MDC017187.319', 'MDC017187.318', 'MDC004081.319', 'MDC021913.275', 'MDC015147.205', 'MDC000038.355', 'MDC016032.95', 'MDC052568.000', 'MDC008119.414', 'MDC026201.7', 'MDC003995.601', 'MDC009211.561', 'MDC009211.567', 'MDC054294.001', 'MDC004364.265', 'MDC002360.219', 'MDC003408.117', 'MDC015155.172', 'MDC053310.000', 'MDC019140.398', 'MDC019140.399', 'MDC011390.337', 'MDC007154.375', 'MDC006346.716', 'MDC010588.505', 'MDC002519.240', 'MDC031322.5', 'MDC010588.502', 'MDC006346.711', 'MDC011551.182', 'MDC002717.156', 'MDC006346.719', 'MDC007838.447', 'MDC007018.186', 'MDC017873.233', 'MDC016296.138', 'MDC019067.226', 'MDC036568.1', 'MDC014019.318', 'MDC007995.528', 'MDC026961.60', 'MDC013443.168', 'MDC022800.298', 'MDC021558.159', 'MDC002479.192', 'MDC010751.301', 'MDC000219.190', 'MDC020963.161', 'MDC020963.162', 'MDC051637.000', 'MDC005174.220', 'MDC040033.7', 'MDC019674.147', 'MDC010450.877', 'MDC007097.457', 'MDC016278.70', 'MDC013723.254', 'MDC002838.179', 'MDC009771.217']

Number of sequences:  82

Description: MDC019140.399

ID: MDC019140.399
Name: MDC019140.399
Description: MDC019140.399
Number of features: 0
Seq('GACGGGCGGGGAGGGTTTGGTTTTTTTTGGTTTTTAAAAAATTCAGGTTTNNAA...TTT', SingleLetterAlphabet())

The input of Bio.SeqIO.index_db(indexName.idx,filenames, file_format) is the name of the (SQLite3) index to store (.idx), the file (or files) to index and the format of those files like fasta, genebank,…

Example: Let’s read all the .fasta file contigs82.fasta in memory (as an index) printing off the entry corresponding to contig “MDC019140.399”.

In [15]:
from Bio import SeqIO
seqDict =  SeqIO.index_db("file_samples/ctgs.idx","file_samples/contigs82.fasta", "fasta")
print("SeqDict is an iterator:")
print(seqDict.keys())
print("")
print(list(seqDict.keys()))
print("")
print("Number of sequences: ", len(seqDict))
print("")
mySeq = seqDict["MDC019140.399"]
print("Description:", mySeq.description)
print("")
print(mySeq)
SeqDict is an iterator:
<generator object _SQLiteManySeqFilesDict.__iter__ at 0x7fc4c416fbf8>

['MDC020656.85', 'MDC001115.177', 'MDC013284.379', 'MDC018185.243', 'MDC018185.241', 'MDC004527.213', 'MDC003661.174', 'MDC012176.157', 'MDC001204.812', 'MDC001204.810', 'MDC004389.256', 'MDC024257.15', 'MDC018297.229', 'MDC001802.364', 'MDC016621.241', 'MDC014057.243', 'MDC021015.302', 'MDC018185.242', 'MDC051782.000', 'MDC017187.314', 'MDC017187.311', 'MDC012865.410', 'MDC000427.83', 'MDC017187.319', 'MDC017187.318', 'MDC004081.319', 'MDC021913.275', 'MDC015147.205', 'MDC000038.355', 'MDC016032.95', 'MDC052568.000', 'MDC008119.414', 'MDC026201.7', 'MDC003995.601', 'MDC009211.561', 'MDC009211.567', 'MDC054294.001', 'MDC004364.265', 'MDC002360.219', 'MDC003408.117', 'MDC015155.172', 'MDC053310.000', 'MDC019140.398', 'MDC019140.399', 'MDC011390.337', 'MDC007154.375', 'MDC006346.716', 'MDC010588.505', 'MDC002519.240', 'MDC031322.5', 'MDC010588.502', 'MDC006346.711', 'MDC011551.182', 'MDC002717.156', 'MDC006346.719', 'MDC007838.447', 'MDC007018.186', 'MDC017873.233', 'MDC016296.138', 'MDC019067.226', 'MDC036568.1', 'MDC014019.318', 'MDC007995.528', 'MDC026961.60', 'MDC013443.168', 'MDC022800.298', 'MDC021558.159', 'MDC002479.192', 'MDC010751.301', 'MDC000219.190', 'MDC020963.161', 'MDC020963.162', 'MDC051637.000', 'MDC005174.220', 'MDC040033.7', 'MDC019674.147', 'MDC010450.877', 'MDC007097.457', 'MDC016278.70', 'MDC013723.254', 'MDC002838.179', 'MDC009771.217']

Number of sequences:  82

Description: MDC019140.399

ID: MDC019140.399
Name: MDC019140.399
Description: MDC019140.399
Number of features: 0
Seq('GACGGGCGGGGAGGGTTTGGTTTTTTTTGGTTTTTAAAAAATTCAGGTTTNNAA...TTT', SingleLetterAlphabet())

Note that a file ctgs.idx has been created to store ids and offsets.

Writing sequence files

SeqRecords can be written out to files by using

N = Bio.SeqIO.write(records,out_filename, file_format)

where records is a list of the SeqRecords to write, out_filename is the string with the filename to write and file_format is the format of the file to write. N is the number of sequences written.

WARNING: If you write a file that is already present, SeqIO.write will just rewrite it without telling you.

Example: Write three protein records to file then read the file and print the three sequences.

In [16]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

rec1 = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
+"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
+"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
+"SSAC", generic_protein),
id="gi|14150838|gb|AAK54648.1|AF376133_1",
description="chalcone synthase [Cucumis sativus]")

rec2 = SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ" \
+"DMVVVEIPKLGKEAAVKAIKEWGQ", generic_protein),
id="gi|13919613|gb|AAK33142.1|",
description="chalcone synthase [Fragaria vesca subsp. bracteata]")

rec3 = SeqRecord(Seq("MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC" \
+"EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP" \
+"KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN" \
+"NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV" \
+"SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW" \
+"IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT" \
+"TGEGLEWGVLFGFGPGLTVETVVLHSVAT", generic_protein),
id="gi|13925890|gb|AAK49457.1|",
description="chalcone synthase [Nicotiana tabacum]")

my_records = [rec1, rec2, rec3]
print(rec1)

seqWritten = SeqIO.write(my_records, "file_samples/my_proteins.fa", "fasta")
print(seqWritten, "sequences written!")
print("")
print("Reading file")
seqIT = SeqIO.parse("file_samples/my_proteins.fa", "fasta")

cnt = 1
for s in seqIT:
    print("Sequence ", cnt, "(", s.id, ")")
    print(s.seq)
    print("")
    cnt += 1
ID: gi|14150838|gb|AAK54648.1|AF376133_1
Name: <unknown name>
Description: chalcone synthase [Cucumis sativus]
Number of features: 0
Seq('MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVG...SAC', ProteinAlphabet())
3 sequences written!

Reading file
Sequence  1 ( gi|14150838|gb|AAK54648.1|AF376133_1 )
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGDGAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNMSSAC

Sequence  2 ( gi|13919613|gb|AAK33142.1| )
YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQDMVVVEIPKLGKEAAVKAIKEWGQ

Sequence  3 ( gi|13925890|gb|AAK49457.1| )
MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMCEKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQPKSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAENNKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELVSAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFWIAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGTTGEGLEWGVLFGFGPGLTVETVVLHSVAT

Multiple sequence alignment

Multiple Sequence Alignments, as the name says, are a collection of multiple sequences which have been aligned together – usually with the insertion of gap characters, and addition of leading or trailing gaps – such that all the sequence strings are the same length.

Such an alignment can be regarded as a matrix of letters, in biopython each row is a SeqRecord object. These alignments are stored in an object MultipleSeqAlignment

Parsing multiple sequence alignments

The function Bio.AlignIO.parse() returns an iterator of MultipleSeqAlignment objects. In the frequent case that we have to deal with a single alignment we will have to use the Bio.AlignIO.read() function.

The basic syntax of the two functions:

Bio.AlignIO.parse(file_handle, alignment_format)
Bio.AlignIO.read(file_handle, alignment_format)

where file_handle is the handler to the opened file, while the alignment_format is a lower case string with the alignment format (e.g. fasta, clustal, stockholm, mauve, phylip,…). Check here for all possible formats.

When more than one multiple alignment has to be read we will have to use the parse function, for single entries, we would use the function read.

Example: Load and visualize the seed alignment of the Piwi (PF02171) family stored in the pfam (stockholm) format. Information on the format can be found here. The file is available here: PF02171_seed.sth.

In [17]:
from Bio import AlignIO

alignments = AlignIO.read("file_samples/PF02171_seed.sth", "stockholm")

print(alignments)

SingleLetterAlphabet() alignment with 16 rows and 395 columns
YLFFILDK-NSPEP-YGSIKRVCNTMLGVPSQCAISKHILQS--...QDV AGO1_SCHPO/500-799
FILCILPERKTSDI-YGPWKKICLTEEGIHTQCICPIKI-----...FTK AGO6_ARATH/541-851
FILCVLPDKKNSDL-YGPWKKKNLTEFGIVTQCMAPTRQPND--...FMK AGO4_ARATH/577-885
CIIVVLQS-KNSDI-YMTVKEQSDIVHGIMSQCVLMKNVSRP--...HVK TAG76_CAEEL/660-966
LIVVVLPG--KTPI-YAEVKRVGDTVLGIATQCVQAKNAIRT--...HLV O16720_CAEEL/566-867
TFVFIITD-DSITT-LHQRYKMIEKDTKMIVQDMKLSKALSV--...LWN O62275_CAEEL/594-924
DILVGIAR-EKKPD-VHDILKYFEESIGLQTIQLCQQTVDKMMG...NYK YQ53_CAEEL/650-977
TIVFGIIA-EKRPD-MHDILKYFEEKLGQQTIQISSETADKFMR...NYK NRDE3_CAEEL/673-1001
MLVVMLAD-DNKTR-YDSLKKYLCVECPIPNQCVNLRTLAGKSK...SLH Q17567_CAEEL/397-708
IVMVVMRS-PNEEK-YSCIKKRTCVDRPVPSQVVTLKVIAPRQQ...SIN AUB_DROME/555-852
LILCLVPN-DNAER-YSSIKKRGYVDRAVPTQVVTLKTTKNRSL...NLH PIWI_DROME/538-829
IVVCLLSS-NRKDK-YDAIKKYLCTDCPTPSQCVVARTLGKQQT...SIH PIWL1_HUMAN/555-847
GIMLVLPE-YNTPL-YYKLKSYLINS--IPSQFMRYDILSNRNL...VNR PIWI_ARCFU/110-406
CFALIIGKEKYKDNDYYEILKKQLFDLKIISQNILWENWRKDDK...NWK Y1321_METJA/426-699
LVIVFLEEYPKVDP-YKSFLLYDFVKRELLKKMIPSQVILNRTL...GIE O67434_AQUAE/419-694
LLLAILPD-NNGSL-YGDLKRICETELGLISQCCLTKHVFKI--...YLE AGO10_ARATH/625-946

The code above prints a summary of the information, but each Bio.Align.MultipleSeqAlignment is composed of several alignment records (SeqRecord) that we can access with all their information looping through the object. Each SeqRecord contains several information like the ID, Name, Description, Number of features, start, end and sequence.

Example:

Load the seed alignment of the Piwi (PF02171) family stored in the pfam (stockholm) format PF02171_seed.sth. For each record print the description, start and the end point, the sequence and the external dbxrefs (if any).

In [31]:
from Bio import AlignIO

alignments = AlignIO.read("file_samples/PF02171_seed.sth", "stockholm")


for align in alignments:
    start = align.annotations["start"]
    end = align.annotations["end"]
    seq = align.seq
    desc = align.description
    dbref = ",".join([x for x in align.dbxrefs])
    print("{} S:{} E:{}".format(desc, start, end))
    if(len(dbref) > 0):
        print(dbref)
    print("{}".format(seq))
    print("")
ID: AGO1_SCHPO/500-799
Name: AGO1_SCHPO
Description: AGO1_SCHPO/500-799
Number of features: 0
/accession=O74957.1
/start=500
/end=799
Seq('YLFFILDK-NSPEP-YGSIKRVCNTMLGVPSQCAISKHILQS---------KPQ...QDV', SingleLetterAlphabet())
AGO1_SCHPO/500-799 S:500 E:799
YLFFILDK-NSPEP-YGSIKRVCNTMLGVPSQCAISKHILQS---------KPQYCANLGMKINVKVGGIN-CSLIPKSNP----LGNVPTL---------ILGGDVYHPGVGA----------TGVSIASIVASVD-LNGCKYTAVSRSQPRHQEVIEG-MKD------------IVVYLLQGFRAMTKQ-QPQRIIYFRDGTSEGQFLSVINDELSQIKEACH-------SLSPKYN--PKILVCTTQKRHHARFFIKNKSDG----------------------DRNGNPLPGTII---EKHVTHPYQYDFYLISHPSLQGVSVPVHYTVLHDEIQMPPDQF-QTL------CYNLCYVYARAT----SAVSLVPPVYYAHLVSNLARYQDV

ID: AGO6_ARATH/541-851
Name: AGO6_ARATH
Description: AGO6_ARATH/541-851
Number of features: 0
/accession=O48771.2
/start=541
/end=851
Seq('FILCILPERKTSDI-YGPWKKICLTEEGIHTQCICPIKI------------SDQ...FTK', SingleLetterAlphabet())
AGO6_ARATH/541-851 S:541 E:851
FILCILPERKTSDI-YGPWKKICLTEEGIHTQCICPIKI------------SDQYLTNVLLKINSKLGGIN-SLLGIEYSYNIPLINKIPTL---------ILGMDVSHGPPGR---------ADVPSVAAVVGSKCWPLISRYRAAVRTQSPRLEMIDSLFQPIENTE--KGDNGIMNELFVEFYRTSRARKPKQIIIFRDGVSESQFEQVLKIEVDQIIKAYQ-------RLGESDV--PKFTVIVAQKNHHTKLFQAKGPE---------------------------NVPAGTVV---DTKIVHPTNYDFYMCAHAGKIGTSRPAHYHVLLDEIGFSPDDL-QNL------IHSLSYVNQRST----TATSIVAPVRYAHLAAAQVAQFTK

ID: AGO4_ARATH/577-885
Name: AGO4_ARATH
Description: AGO4_ARATH/577-885
Number of features: 0
/accession=Q9ZVD5.2
/start=577
/end=885
Seq('FILCVLPDKKNSDL-YGPWKKKNLTEFGIVTQCMAPTRQPND-----------Q...FMK', SingleLetterAlphabet())
AGO4_ARATH/577-885 S:577 E:885
FILCVLPDKKNSDL-YGPWKKKNLTEFGIVTQCMAPTRQPND-----------QYLTNLLLKINAKLGGLN-SMLSVERTPAFTVISKVPTI---------ILGMDVSHGSPGQ---------SDVPSIAAVVSSREWPLISKYRASVRTQPSKAEMIESLVKKNGT-----EDDGIIKELLVDFYTSSNKRKPEHIIIFRDGVSESQFNQVLNIELDQIIEACK-------LLDANWN--PKFLLLVAQKNHHTKFFQPTSPE---------------------------NVPPGTII---DNKICHPKNNDFYLCAHAGMIGTTRPTHYHVLYDEIGFSADEL-QEL------VHSLSYVYQRST----SAISVVAPICYAHLAAAQLGTFMK

ID: TAG76_CAEEL/660-966
Name: TAG76_CAEEL
Description: TAG76_CAEEL/660-966
Number of features: 0
/accession=P34681.2
/start=660
/end=966
Seq('CIIVVLQS-KNSDI-YMTVKEQSDIVHGIMSQCVLMKNVSRP---------TPA...HVK', SingleLetterAlphabet())
TAG76_CAEEL/660-966 S:660 E:966
CIIVVLQS-KNSDI-YMTVKEQSDIVHGIMSQCVLMKNVSRP---------TPATCANIVLKLNMKMGGIN--SRIVADKITNKYLVDQPTM---------VVGIDVTHPTQAEM-------RMNMPSVAAIVANVD-LLPQSYGANVKVQKKCRESVVY-LLD------------AIRERIITFYRHTKQ-KPARIIVYRDGVSEGQFSEVLREEIQSIRTACL-------AIAEDFR--PPITYIVVQKRHHARIFCKYQNDM---------------------VGKAKNVPPGTTV---DTGIVSPEGFDFYLCSHYGVQGTSRPARYHVLLDECKFTADEI-QSI------TYGMCHTYGRCT----RSVSIPTPVYYADLVATRARCHVK

ID: O16720_CAEEL/566-867
Name: O16720_CAEEL
Description: O16720_CAEEL/566-867
Number of features: 0
/accession=O16720.2
/start=566
/end=867
Seq('LIVVVLPG--KTPI-YAEVKRVGDTVLGIATQCVQAKNAIRT---------TPQ...HLV', SingleLetterAlphabet())
O16720_CAEEL/566-867 S:566 E:867
LIVVVLPG--KTPI-YAEVKRVGDTVLGIATQCVQAKNAIRT---------TPQTLSNLCLKMNVKLGGVN-SILLPNVRPR---IFNEPVI---------FLGCDITHPAAGD---------TRKPSIAAVVGSMD-AHPSRYAATVRVQQHRQEIITD-LTY------------MVRELLVQFYRNTRF-KPARIVVYRDGVSEGQLFNVLQYELRAIREACV-------MLESGYQ--PGITFIAVQKRHHTRLFAADKADQ---------------------VGKAFNIPPGTTV---DVGITHPTEFDFFLCSHAGIQGTSRPSHYHVLWDDNDLTADEL-QQL------TYQMCHTYVRCT----RSVSIPAPAYYAHLVAFRARYHLV

ID: O62275_CAEEL/594-924
Name: O62275_CAEEL
Description: O62275_CAEEL/594-924
Number of features: 0
/accession=O62275.1
/start=594
/end=924
Seq('TFVFIITD-DSITT-LHQRYKMIEKDTKMIVQDMKLSKALSV--IN---AGKRL...LWN', SingleLetterAlphabet())
O62275_CAEEL/594-924 S:594 E:924
TFVFIITD-DSITT-LHQRYKMIEKDTKMIVQDMKLSKALSV--IN---AGKRLTLENVINKTNVKLGGSN--YVFVDAKKQL-----DSHL---------IIGVGISAPPAGTKYAM-ENKGVLNPNVIGYAYNA--QHNQEFSGDFVLNSASQDTLAP-IED------------IVMHSLNEYQKFHDGGLPRRVIVYRTGTSEGNHGSIMAYEIPLARAAMR-------DFSPDIQ----LVYIVVSKDHSFRFFKPDLASLASRPQATSSTASRHSAMPAAPKAWDLNIAPGILV---DSIVTNPACKQFFLNSHITLQGTAKTPLYTVLADDAKVSMTAL-EDI------TYKLCHLHQIVG----LPTSLPTPLYVANEYAKRGRNLWN

ID: YQ53_CAEEL/650-977
Name: YQ53_CAEEL
Description: YQ53_CAEEL/650-977
Number of features: 0
/accession=Q09249.1
/start=650
/end=977
Seq('DILVGIAR-EKKPD-VHDILKYFEESIGLQTIQLCQQTVDKMMGG----QGGRQ...NYK', SingleLetterAlphabet())
YQ53_CAEEL/650-977 S:650 E:977
DILVGIAR-EKKPD-VHDILKYFEESIGLQTIQLCQQTVDKMMGG----QGGRQTIDNVMRKFNLKCGGTNFFVEIPNAVRGKAVCSNNETLRKKLLEHVQFIGFEISHGASRTLFDRSRSQMDGEPSVVGVSYSLT---NSTQLGGFTYLQTQKEYKLQKLDE------------FFPKCVRSYKEHSKT-LPTRIVIYRVGAGEGNFNRVKE-EVEEMRRTFD-------KIQPGYR--PHLVVIIAQRASHARVFPSCISGN---------------------RATDQNIPSGTCV---ENVLTSYGYDEFILSSQTPLIGTVRPCKYTILVNDAKWSKNEL-MHL------TYFRAFGHQVSY----QPPSVPDVLYAAENLAKRGRNNYK

ID: NRDE3_CAEEL/673-1001
Name: NRDE3_CAEEL
Description: NRDE3_CAEEL/673-1001
Number of features: 0
/accession=Q21691.1
/start=673
/end=1001
Seq('TIVFGIIA-EKRPD-MHDILKYFEEKLGQQTIQISSETADKFMRD----HGGKQ...NYK', SingleLetterAlphabet())
NRDE3_CAEEL/673-1001 S:673 E:1001
TIVFGIIA-EKRPD-MHDILKYFEEKLGQQTIQISSETADKFMRD----HGGKQTIDNVIRKLNPKCGGTNFLIDVPESVGHRVVCNNSAEMRAKLYAKTQFIGFEMSHTGARTRFDIQKVMFDGDPTVVGVAYSL--KHSAQLGGFSYFQESRLHKLTN-LQE------------KMQICLNAYEQSSSY-LPETVVVYRVGSGEGDYPQIVN-EVNEMKLAAR-------KKKHGYN--PKFLVICTQRNSHIRVFPEHINERG--------------------KSMEQNVKSGTCV---DVPGASHGYEEFILCCQTPLIGTVKPTKYTIIVNDCRWSKNEI-MNV------TYHLAFAHQVSY----APPAIPNVSYAAQNLAKRGHNNYK

ID: Q17567_CAEEL/397-708
Name: Q17567_CAEEL
Description: Q17567_CAEEL/397-708
Number of features: 0
/accession=Q17567.1
/start=397
/end=708
Seq('MLVVMLAD-DNKTR-YDSLKKYLCVECPIPNQCVNLRTLAGKSKDGGENKNLGS...SLH', SingleLetterAlphabet())
Q17567_CAEEL/397-708 S:397 E:708
MLVVMLAD-DNKTR-YDSLKKYLCVECPIPNQCVNLRTLAGKSKDGGENKNLGSIVLKIVLQMICKTGGA---LWKVNIPL-------KSTM---------IVGYDLYHDSTLK-----------GKTVGACVSTTS-NDFTQFYSQTRPHENPTQLGNN-LTH------------FVRKSLKQYYDNNDKTLPSRLILYRDGAGDGQIPYIKNTEVKLVRDACDAVTDKAAELSNKVQEKIKLAFIIVTKRVNMRILKQGSSS-----------------------KSAINPQPGTVV---DTTVTRPERMDFYLVPQFVNQGTVTPVSYNIIHDDTGLGPDKH-QQL------AFKLCHLYYNWQ----GTVRVPAPCQYAHKLAFLTAQSLH

ID: AUB_DROME/555-852
Name: AUB_DROME
Description: AUB_DROME/555-852
Number of features: 0
/accession=O76922.1
/start=555
/end=852
Seq('IVMVVMRS-PNEEK-YSCIKKRTCVDRPVPSQVVTLKVIAPRQQKP---TGLMS...SIN', SingleLetterAlphabet())
AUB_DROME/555-852 S:555 E:852
IVMVVMRS-PNEEK-YSCIKKRTCVDRPVPSQVVTLKVIAPRQQKP---TGLMSIATKVVIQMNAKLMGA---PWQVVIPL-------HGLM---------TVGFDVCHSPKNK-----------NKSYGAFVATMDQKESFRYFSTVNEHIKGQELSEQ-MSV------------NMACALRSYQEQHRS-LPERILFFRDGVGDGQLYQVVNSEVNTLKDRLDEI---YKSAGKQEG--CRMTFIIVSKRINSRYFT-----------------------------GHRNPVPGTVV---DDVITLPERYDFFLVSQAVRIGTVSPTSYNVISDNMGLNADKL-QML------SYKMTHMYYNYS----GTIRVPAVCHYAHKLAFLVAESIN

ID: PIWI_DROME/538-829
Name: PIWI_DROME
Description: PIWI_DROME/538-829
Number of features: 0
/accession=Q9VKM1.1
/start=538
/end=829
Seq('LILCLVPN-DNAER-YSSIKKRGYVDRAVPTQVVTLKTTKNRSL--------MS...NLH', SingleLetterAlphabet())
PIWI_DROME/538-829 S:538 E:829
LILCLVPN-DNAER-YSSIKKRGYVDRAVPTQVVTLKTTKNRSL--------MSIATKIAIQLNCKLGYT---PWMIELPL-------SGLM---------TIGFDIAKSTRDR-----------KRAYGALIASMDLQQNSTYFSTV-TECSAFDVLANTLWP------------MIAKALRQYQHEHRK-LPSRIVFYRDGVSSGSLKQLFEFEVKDIIEKLKTE---YARVQLSP---PQLAYIVVTRSMNTRFFLN-----------------------------GQNPPPGTIV---DDVITLPERYDFYLVSQQVRQGTVSPTSYNVLYSSMGLSPEKM-QKL------TYKMCHLYYNWS----GTTRVPAVCQYAKKLATLVGTNLH

ID: PIWL1_HUMAN/555-847
Name: PIWL1_HUMAN
Description: PIWL1_HUMAN/555-847
Number of features: 0
/accession=Q96J94.1
/start=555
/end=847
Seq('IVVCLLSS-NRKDK-YDAIKKYLCTDCPTPSQCVVARTLGKQQT-------VMA...SIH', SingleLetterAlphabet())
PIWL1_HUMAN/555-847 S:555 E:847
IVVCLLSS-NRKDK-YDAIKKYLCTDCPTPSQCVVARTLGKQQT-------VMAIATKIALQMNCKMGGE---LWRVDIPL-------KLVM---------IVGIDCYHDMTAG-----------RRSIAGFVASIN-EGMTRWFSRCIFQDRGQELVDG-LKV------------CLQAALRAW-NSCNEYMPSRIIVYRDGVGDGQLKTLVNYEVPQFLDCLK-------SIGRGYN--PRLTVIVVKKRVNTRFFAQSGGR-------------------------LQNPLPGTVI---DVEVTRPEWYDFFIVSQAVRSGSVSPTHYNVIYDNSGLKPDHI-QRL------TYKLCHIYYNWP----GVIRVPAPCQYAHKLAFLVGQSIH

ID: PIWI_ARCFU/110-406
Name: PIWI_ARCFU
Description: PIWI_ARCFU/110-406
Database cross-references: PDB; 2W42 B; 110-406;, PDB; 1YTU B; 110-406;, PDB; 2BGG B; 110-406;, PDB; 1W9H A; 110-406;, PDB; 2BGG A; 110-406;, PDB; 1YTU A; 110-406;, PDB; 2W42 A; 110-406;
Number of features: 0
/accession=O28951.1
/start=110
/end=406
Per letter annotation for: secondary_structure
Seq('GIMLVLPE-YNTPL-YYKLKSYLINS--IPSQFMRYDILSNRNL--------TF...VNR', SingleLetterAlphabet())
PIWI_ARCFU/110-406 S:110 E:406
PDB; 2W42 B; 110-406;,PDB; 1YTU B; 110-406;,PDB; 2BGG B; 110-406;,PDB; 1W9H A; 110-406;,PDB; 2BGG A; 110-406;,PDB; 1YTU A; 110-406;,PDB; 2W42 A; 110-406;
GIMLVLPE-YNTPL-YYKLKSYLINS--IPSQFMRYDILSNRNL--------TFYVDNLLVQFVSKLGGK---PWILNVDPEK--------------------GSDIIIGTGAT--------RIDNVNLFCFAMVFK-KDGTMLWNEISPIVTSSEYLTY-LKS------------TIKKVVYGFKKSNPDWDVEKLTLHVSG----KRPKMKDGETKILKETVE-------ELKKQEMVSRDVKYAILHLNETHPFWVMGDPN------------------------NRFHPYEGTKVKLSSKRYLLTLLQPYLKRNGLEMVTPIKPLSVEIVSDN--WTSEEYYHNVHEILDEIYYLSKMNWRGF----RSRNLPVTVNYPKLVAGIIANVNR

ID: Y1321_METJA/426-699
Name: Y1321_METJA
Description: Y1321_METJA/426-699
Number of features: 0
/accession=Q58717.1
/start=426
/end=699
Seq('CFALIIGKEKYKDNDYYEILKKQLFDLKIISQNILWENWRKDDK--------GY...NWK', SingleLetterAlphabet())
Y1321_METJA/426-699 S:426 E:699
CFALIIGKEKYKDNDYYEILKKQLFDLKIISQNILWENWRKDDK--------GYMTNNLLIQIMGKLGIK---YFILDSKTPYDY----------------IMGLDTGLGIFGN------------HRVGGCTVVYDSEGKIRRIQPIETPAPGERLHLP----------------YVIEYLEN--KANIDMENKNILFLRDG-------FIQNSERNDLKEISK-------ELN------SNIEVISIRKNNKYKVFTSDYRI------------------------GSVFGNDGIFLPHKTPFGSNPVKLSTWLRFNCGNEEGLK---IN----------ESI-MQL------LYDLTKMNYSALYGEGRYLRIPAPIHYADKFVKALGKNWK

ID: O67434_AQUAE/419-694
Name: O67434_AQUAE
Description: O67434_AQUAE/419-694
Database cross-references: PDB; 1YVU A; 419-694;, PDB; 2F8S A; 419-694;, PDB; 2F8T A; 419-694;, PDB; 2F8S B; 419-694;, PDB; 2NUB A; 419-694;, PDB; 2F8T B; 419-694;
Number of features: 0
/accession=O67434.1
/start=419
/end=694
Per letter annotation for: secondary_structure
Seq('LVIVFLEEYPKVDP-YKSFLLYDFVKRELLKKMIPSQVILNRTLKN---ENLKF...GIE', SingleLetterAlphabet())
O67434_AQUAE/419-694 S:419 E:694
PDB; 1YVU A; 419-694;,PDB; 2F8S A; 419-694;,PDB; 2F8T A; 419-694;,PDB; 2F8S B; 419-694;,PDB; 2NUB A; 419-694;,PDB; 2F8T B; 419-694;
LVIVFLEEYPKVDP-YKSFLLYDFVKRELLKKMIPSQVILNRTLKN---ENLKFVLLNVAEQVLAKTGNIP--YKLKEIEGKVDA----------------FVGIDISRITRDG----------KTVNAVAFTKIFNSKGELVRYYLTSYPAFGEKLTEK----------------AIGDVFSLLEKLGF-KKGSKIVVHRDG-------RLYRDEVAAFK-----------KYGELYG--YSLELLEIIKRNNPRFFSNEKFI------------------------------KGYFYKLSEDSVILATYNQVY-------EGTHQPIKVRKVYGE--LPVEVL-CSQ------ILSLTLMNYSSF----QPIKLPATVHYSDKITKLMLRGIE

ID: AGO10_ARATH/625-946
Name: AGO10_ARATH
Description: AGO10_ARATH/625-946
Number of features: 0
/accession=Q9XGW1.1
/start=625
/end=946
Seq('LLLAILPD-NNGSL-YGDLKRICETELGLISQCCLTKHVFKI---------SKQ...YLE', SingleLetterAlphabet())
AGO10_ARATH/625-946 S:625 E:946
LLLAILPD-NNGSL-YGDLKRICETELGLISQCCLTKHVFKI---------SKQYLANVSLKINVKMGGRN-TVLVDAISCRIPLVSDIPTI---------IFGADVTHPENGE---------ESSPSIAAVVASQDWPEVTKYAGLVCAQAHRQELIQDLYKTWQDPVRGTVSGGMIRDLLISFRKATGQ-KPLRIIFYRDGVSEGQFYQVLLYELDAIRKACA-------SLEPNYQ--PPVTFIVVQKRHHTRLFANNHRDKN-------------------STDRSGNILPGTVV---DTKICHPTEFDFYLCSHAGIQGTSRPAHYHVLWDENNFTADGI-QSL------TNNLCYTYARCT----RSVSIVPPAYYAHLAAFRARFYLE

Writing multiple alignments

Biopython provides a function Bio.AlignIO.write() to write alignments to file. The basic syntax is:

N = Bio.AlignIO.write(alignments,outfile,file_format)

where alignments are a MultipleSeqAlignment object with the alignments to write to the output file with name outfile that has format file_format (a low case string with the file format). N is the number of entries written to the file.

Example: Let’s create some multiple sequence alignments and store them in phylip format. The output file is present here my_malign.phy. Finally, read and print the content of the file.

In [19]:

from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
from Bio import AlignIO

align1 = MultipleSeqAlignment([
SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
])
align2 = MultipleSeqAlignment([
SeqRecord(Seq("GTCAGC-AG", generic_dna), id="Delta"),
SeqRecord(Seq("GACAGCTAG", generic_dna), id="Epsilon"),
SeqRecord(Seq("GTCAGCTAG", generic_dna), id="Zeta"),
])
align3 = MultipleSeqAlignment([
SeqRecord(Seq("ACTAGTACAGCTG", generic_dna), id="Eta"),
SeqRecord(Seq("ACTAGTACAGCT-", generic_dna), id="Theta"),
SeqRecord(Seq("-CTACTACAGGTG", generic_dna), id="Iota"),
])

my_alignments = [align1, align2, align3]
N = AlignIO.write(my_alignments, "file_samples/my_malign.phy", "phylip")

print("{} entries written to the file".format(N))
print("")
alignments = AlignIO.parse("file_samples/my_malign.phy", "phylip")
for align in alignments:
    print(align)

3 entries written to the file

SingleLetterAlphabet() alignment with 3 rows and 12 columns
ACTGCTAGCTAG Alpha
ACT-CTAGCTAG Beta
ACTGCTAGDTAG Gamma
SingleLetterAlphabet() alignment with 3 rows and 9 columns
GTCAGC-AG Delta
GACAGCTAG Epsilon
GTCAGCTAG Zeta
SingleLetterAlphabet() alignment with 3 rows and 13 columns
ACTAGTACAGCTG Eta
ACTAGTACAGCT- Theta
-CTACTACAGGTG Iota

Note that it is also possible to convert one format into the other (provided that all information needed for the second format is available) by using the function:

Bio.AlignIO.convert(input_file, input_file_format, output_file, output_file_format)

basically by passing the input file name and format and output file name and format.

Ex:

Bio.AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.aln", "clustal")

converts a stockholm file into a clustal.

Manipulating alignments

It is possible to slice alignments using the [] operator applied on a SeqRecord. In particular:

  1. SeqRecord[i,j] returns the jth character of alignment i as a string;
  2. SeqRecord[:,j] returns all the jth characters of the multiple alignment as a string;
  3. SeqRecord[:,i:j] returns a MultipleSeqAlignment with the sub-alignments going for i to j (excluded)
  4. SeqRecord[a:b,i:j] similar to 3. but for alignments going from a to b (excluded) only

One can also combine different portions of the alignment by slicing and adding the slices together with

SeqRecord[:,i:j] + SeqRecord[:,t:k]

Example: Load the seed alignment of the Piwi (PF02171) family stored in the pfam (stockholm) format PF02171_seed.sth. For each record print the description, the first 20 bases and the final 20 bases of the sequence.

In [20]:
from Bio import AlignIO

alignments = AlignIO.read("file_samples/PF02171_seed.sth", "stockholm")

edited_aligns = alignments[:,0:20] + alignments[:,-20:]


for align in edited_aligns:
    print(align.description, "\t", align.seq)

AGO1_SCHPO/500-799       YLFFILDK-NSPEP-YGSIKLVPPVYYAHLVSNLARYQDV
AGO6_ARATH/541-851       FILCILPERKTSDI-YGPWKIVAPVRYAHLAAAQVAQFTK
AGO4_ARATH/577-885       FILCVLPDKKNSDL-YGPWKVVAPICYAHLAAAQLGTFMK
TAG76_CAEEL/660-966      CIIVVLQS-KNSDI-YMTVKIPTPVYYADLVATRARCHVK
O16720_CAEEL/566-867     LIVVVLPG--KTPI-YAEVKIPAPAYYAHLVAFRARYHLV
O62275_CAEEL/594-924     TFVFIITD-DSITT-LHQRYLPTPLYVANEYAKRGRNLWN
YQ53_CAEEL/650-977       DILVGIAR-EKKPD-VHDILVPDVLYAAENLAKRGRNNYK
NRDE3_CAEEL/673-1001     TIVFGIIA-EKRPD-MHDILIPNVSYAAQNLAKRGHNNYK
Q17567_CAEEL/397-708     MLVVMLAD-DNKTR-YDSLKVPAPCQYAHKLAFLTAQSLH
AUB_DROME/555-852        IVMVVMRS-PNEEK-YSCIKVPAVCHYAHKLAFLVAESIN
PIWI_DROME/538-829       LILCLVPN-DNAER-YSSIKVPAVCQYAKKLATLVGTNLH
PIWL1_HUMAN/555-847      IVVCLLSS-NRKDK-YDAIKVPAPCQYAHKLAFLVGQSIH
PIWI_ARCFU/110-406       GIMLVLPE-YNTPL-YYKLKLPVTVNYPKLVAGIIANVNR
Y1321_METJA/426-699      CFALIIGKEKYKDNDYYEILIPAPIHYADKFVKALGKNWK
O67434_AQUAE/419-694     LVIVFLEEYPKVDP-YKSFLLPATVHYSDKITKLMLRGIE
AGO10_ARATH/625-946      LLLAILPD-NNGSL-YGDLKIVPPAYYAHLAAFRARFYLE

Example: Convert the seed alignment of the Piwi (PF02171) family stored in the pfam (stockholm) format PF02171_seed.sth into phylip format. Print some stats on the data.

In [21]:
from Bio import AlignIO

alignments = AlignIO.read("file_samples/PF02171_seed.sth", "stockholm")

out = AlignIO.convert("file_samples/PF02171_seed.sth",
                      "stockholm",
                      "file_samples/PF02171_seed.aln",
                      "clustal")

print("N. of seq: {}\nLen of seq: {}".format(
                                                len(alignments),
                                                len(alignments[0])))
print("{} multiple alignments converted to phylip".format(out))
N. of seq: 16
Len of seq: 395
1 multiple alignments converted to phylip

Biopython’s pairwise2 alignment

Biopython supports running some classic multiple alignment tools like clustal, MUSCLE but these require to be installed in your system first.

Biopython has its own module to make pairwise alignment. It provides two algorithms: Smith-Waterman for local alignment and Needleman-Wunsch for global alignment. These methods are implemented in two Biopython functions of the Bio.pairwise2 module:

pairwise2.align.globalxx() and pairwise2.align.localxx().

The basic syntax of the two methods is the same:

aligns = pairwise2.align.globalxx(seq1,seq2)
aligns = pairwise2.align.localxx(seq1,seq2)

where seq1 and seq2 are two str objects. These methods return a list of alignments (at least one) that have the same optimal score. Each alignment is represented as tuples with the following 5 elements in order:

  1. The alignment of the first sequence;
  2. The alignment of the second sequence;
  3. The alignment score;
  4. The start of the alignment (for global alignments this is always 0);
  5. The end of the alignment (for global alignments this is always the length of the alignment).

The xx in globalxx or localxx encodes the penalties and scores for gaps and matches (and mismatches) of the alignment. The first letter is the score for a match the second letter is the penalty for a gap:

Match parameters can be:

  • x : means that a match scores 1 a mismatch 0;
  • m : the match and mismatch score are passed as additional params after the sequence (es. aligns = pairwise2.align.globalmx(seq1,seq2, 1, -1) to set 1 as match score and -1 as mismatch penalty.

Gap parameters can be:

  • x : gap penalty is 0;
  • s : same gap open and gap extend penalties for the 2 sequences (passed as additional params after seqs).
  • d : different gap open and gap extend penalties for the 2 seqs (additional params after the seqs).

Example:

Let’s perform the alignment of the two sequences “ACCGTTATATAGGCCA” and “ACGTACTAGTATAGGCCA”

In [34]:
from Bio import pairwise2
from Bio import SeqIO


alignments = pairwise2.align.globalxx("ACCGTTATATAGGCCA",
                                      "ACGTACTAGTATAGGCCA")

for i in range(len(alignments)):
    print(alignments[i])

print("")
print("Looping through aligns")
for align in alignments:
        print(align[0])
        print(align[1])
        print("Score: {}, Start: {}, End: {}".format(align[2],
                                                     align[3],
                                                     align[4]))
        print("")

alignments = pairwise2.align.globalms("ACCGTTATATAGGCCA",
                                      "ACGTACTAGTATAGGCCA",
                                      1,-1,-0.5,-0.2)
print("")
print("Match: 1, Mismatch: -1, Gap open: -0.5, Gap extend: -0.2")
for align in alignments:
    print(align[0])
    print(align[1])
    print("Score: {}, Start: {}, End: {}".format(align[2],
                                                 align[3],
                                                 align[4]))
    print("")

('ACCGT--TA-TATAGGCCA', 'A-CGTACTAGTATAGGCCA', 15.0, 0, 19)
('ACCGT--TA-TATAGGCCA', 'AC-GTACTAGTATAGGCCA', 15.0, 0, 19)

Looping through aligns
ACCGT--TA-TATAGGCCA
A-CGTACTAGTATAGGCCA
Score: 15.0, Start: 0, End: 19

ACCGT--TA-TATAGGCCA
AC-GTACTAGTATAGGCCA
Score: 15.0, Start: 0, End: 19


Match: 1, Mismatch: -1, Gap open: -0.5, Gap extend: -0.2
ACCGT--TA-TATAGGCCA
A-CGTACTAGTATAGGCCA
Score: 13.5, Start: 0, End: 19

ACCGT--TA-TATAGGCCA
AC-GTACTAGTATAGGCCA
Score: 13.5, Start: 0, End: 19

It is also possible to specify a substitution matrix like the BLOSUM to improve the alignment of protein sequences. Substitution matrices like the blosum62 are present in Biopython and before using it we need to import it with from Bio.SubsMat.MatrixInfo import blosum62.

You can find all the substitution matrices available here.

Example: Align the alpha and beta subunits of the human hemoglobin protein stored in hem_alpha.fasta and hem_beta.fasta with globalxx and with a blosum62 substitution matrix and gap open -10 and gap extension -0.5.

In [23]:
from Bio import pairwise2
from Bio import SeqIO
from Bio.pairwise2 import format_alignment
from Bio.SubsMat.MatrixInfo import blosum62



seq1 = SeqIO.read("file_samples/hem_alpha.fasta", "fasta")
seq2 = SeqIO.read("file_samples/hem_beta.fasta", "fasta")


alignments = pairwise2.align.globalxx(seq1.seq,seq2.seq)
print(len(alignments), "optimal alignments")

align = alignments[0]
print(align[0])
print(align[1])
print("Score: {}, Start: {}, End: {}".format(align[2],align[3],align[4]))
print("")

alignments = pairwise2.align.globalds(seq1.seq, seq2.seq, blosum62, -10, -0.5)
print(len(alignments), "optimal alignments")

align = alignments[0]
print(align[0])
print(align[1])
print("Score: {}, Start: {}, End: {}".format(align[2],align[3],align[4]))

80 optimal alignments
MV-LSPADKTNV---K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-
MVHL-----T--PEEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H
Score: 72.0, Start: 0, End: 217

2 optimal alignments
MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Score: 292.5, Start: 0, End: 149

Exercises

  1. Write a python function that reads a genebank file given in input and prints off the following information:
    1. Identifier, name and description;
    2. The first 100 characters of the sequence;
    3. Number of external references (dbxrefs) and ids of the external refs.
    4. The name of the organism (hint: check the annotations dictionary at the key “organism”)
    5. Retrieve and print all (if any) associated publications (hint: annotation dictionary, key:”references”)
    6. Retrieve and print all the locations of “CDS” features of the sequence (hint: check the features )

Hint: go back and check the details of the SeqRecord object.

Test the program downloading some files from genebank like this or this

Show/Hide Solution

  1. Write a python program that loads a pfam file (stockholm format .sth) and reports for each record of the alignment:
    1. the id of the entry
    2. the start and end points
    3. the number of gaps and the % of gaps on the total length of the alignment
    4. the number of external database references (dbxrefs), and the first 3 external references comma separated (hint: use join).

Print these information to the screen. Finally, write this information in a tab separated file (.tsv) having the following format: #ID\tstart\tend\tnum_gaps\tpercentage_gaps\tdbxrefs.

You can test the program using the TP53 family file PF00870_seed.sth.

Show/Hide Solution

  1. Given a multiple sequence alignment stored in “phylip” format, write three methods: readAlignment that reads the input file and prints the number of sequences present, printAlignments(alignments) that prints the alignments to the screen and computeConsensus(alignments,minFrequency) that creates the consensus of all the alignments. MinFrequency is the minimum frequency (that has to be > 0.5) to keep a base in the consensus. “?” is put put if frequency < minFrequency for all possible bases.

Ex. if alignments are:

ATC-G
AAC-G
AAG-G
ATCGT

computeConsensus(alignments,0.6) is:

A?C-G

Test the script with the file alpha-globin.phy.

Show/Hide Solution

  1. Load the contigs present in the filtered_contigs.fasta file and translate each DNA sequence into the corresponding protein. Count the number of stop codons (i.e. *) for each sequence and print them to the user (e.g. MDC020656.85 51). Finally, write the translated proteins in another .fasta file (e.g. filtered_contigs_translated.fasta).

Show/Hide Solution

  1. Write the following three functions:
    1. countMatches(s1,s2) that takes two sequences aligned with pairwise2 local or global of the same length and returns how many elements match;
    2. countMismatches(s1,s2) that takes two sequences aligned with pairwise2 local or global of the same length and returns how many elements are different (i.e. they are not gaps but the character is different);
    3. countGapOpens(s1,s2) that takes two sequences aligned with pairwise2 local or global of the same length and returns the number of gap opened in the alignment;
    4. countGapExtensions(s1,s2) that takes two sequences aligned with pairwise2 local or global of the same length and returns the number of gap extensions in the alignment
    5. getScore(s1,s2, matchScore, mismatchPenalty, gapOpenPenalty, gapExtensionPenalty) and returns the score of the alignment of s1 and s2 (aligned with pairwise2 as before) given the scores and penalties in input.

Finally, align the sequences of the Interleukin-12 chain A (let’s call it s1) IL12A.fasta and B IL12B.fasta (let’s call it s2) with the following parameters:

pairwise2.align.globalxx(s1,s2)
pairwise2.align.globalms(s1,s2,,1,-0.1,-1,-0.1)
pairwise2.align.globalxs(s1, s2,1,-0.5,-5,-0.5)
pairwise2.align.globalxd(s1, s2,1,-0.5,-5,-0.5,-5,-5)

and check the score as computed from pairwise2 and from your functions (use the first of all the alignments to check). Note the effect of changing parameters.

Show/Hide Solution