Custom Strategies
One of the major features of Goldilocks is its extensibility. Strategies are both easily customisable and interchangeable, as they all share a common interface. This interface also provides a platform for users with some knowledge of Python to construct their own custom census rules. One such example follows below:
A Simple ORF Finder
Code Sample
# Import Goldilocks and the BaseStrategy class
from goldilocks import Goldilocks
from goldilocks.strategies import BaseStrategy

# Define a new class for your custom strategy that inherits from BaseStrategy
class MyCustomSimpleORFCounterStrategy(BaseStrategy):

    # Initialising function boilerplate, required to set up some properties of the census
    def __init__(self, tracks=None, min_codons=1):
        # Initialise the custom class with super
        super(MyCustomSimpleORFCounterStrategy, self).__init__(
            tracks=list(range(0, 3)),  # Use range to specify a counter for
                                       # each of the three possible forward
                                       # reading frames in which to search
                                       # for open reading frames
            label="Forward Open Reading Frames"  # Y-Axis Plot Label
        )
        self.MIN_CODONS = min_codons

    # This function defines the actual behaviour of a census for a given region
    # of sequence and the current counting track (one of three reading frames)
    def census(self, sequence, track_frame, **kwargs):
        STARTS = ["ATG"]
        STOPS = ["TAA", "TGA", "TAG"]
        CODON_SIZE = 3

        # Split input sequence into codons. Open a frame if a START is found
        # and increment the ORF counter if a STOP is encountered afterward
        orfs = orf_open = 0
        # range (rather than the Python 2-only xrange) works on Python 2 and 3
        for i in range(track_frame, len(sequence), CODON_SIZE):
            codon = sequence[i:i+CODON_SIZE].upper()
            if codon in STARTS and orf_open == 0:
                orf_open = 1
            elif codon in STOPS and orf_open > 0:
                # orf_open counts the START plus every codon seen since
                if orf_open > self.MIN_CODONS:
                    orfs += 1
                orf_open = 0
            elif orf_open > 0:
                orf_open += 1
        return orfs

# Organise and execute the census
sequence_data = {"hs37d5": {"file": "/store/ref/hs37d5.1-3.fa.fai"}}
g = Goldilocks(MyCustomSimpleORFCounterStrategy(min_codons=30), sequence_data,
               length="1M", stride="1M", is_faidx=True, processes=4)
Implementation Description
Strategies are defined as Python classes that inherit from the BaseStrategy class found in the goldilocks.strategies subpackage. A class needs just two method definitions to comply with the shared interface: __init__, the class initialiser that sets up the strategy’s internals via the BaseStrategy parent class, and census, the function responsible for the behaviour of the strategy itself.
The example presented is a very simple open reading frame counter. It searches the three forward frames for start codons that are later followed by one of the three stop codons. The "tracks" in this example are the three possible frames. Note that our __init__ accepts a tracks argument with a default of None but never uses it: this particular strategy does not need tracks from the user. Instead, the track list is provided by the strategy itself and passed to the BaseStrategy __init__ in the super call, forcing tracks to be the list [0, 1, 2]. The elements of this list are used as integer offsets from which to begin splitting input DNA sequences when conducting the census later, which is why on this occasion we don’t want to allow the user to specify their own tracks. Other strategies, such as the included NucleotideCounterStrategy, just pass the tracks argument from the user through to the super __init__.
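To make the role of the track values as offsets concrete, here is a minimal sketch that splits a short sequence into codons for each of the three forward frames. The helper name codons_in_frame and the example sequence are illustrative and not part of Goldilocks. Unlike the census in the code sample, which also visits a trailing partial codon (harmlessly, since it can never match a start or stop), this helper keeps only complete triplets.

```python
CODON_SIZE = 3


def codons_in_frame(sequence, frame_offset):
    """Return the complete codons of `sequence` read from `frame_offset`.

    Illustrative helper only; it is not part of the Goldilocks API.
    """
    last_start = len(sequence) - CODON_SIZE  # final index a full codon can start at
    return [
        sequence[i:i + CODON_SIZE].upper()
        for i in range(frame_offset, last_start + 1, CODON_SIZE)
    ]


example = "ATGAAATAGC"  # made-up 10 bp sequence
frames = {offset: codons_in_frame(example, offset) for offset in range(3)}

for offset in sorted(frames):
    print(offset, frames[offset])
```

Each offset yields a different set of triplets from the same bases, which is why the strategy keeps a separate counter (track) per frame.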
For a given array of sequence data and a frame offset (track_frame), the census function splits the sequence into nucleotide triplets from the offset and searches for open reading frames. A subsequence is considered an ORF by this strategy if the ATG START codon is encountered and later followed by any STOP codon.
Our example finishes with the familiar specification of the location of input sequence data and the construction of the census itself. Here we specify a census of all 1Mbp regions with no overlap (that is, the stride is equal to the size of the regions) and instantiate our new MyCustomSimpleORFCounterStrategy with a keyword argument requiring valid ORFs to be at least 30 codons in length (excluding start and stop).
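The relationship between region length and stride can be sketched in a few lines. This is an assumption-laden illustration, not Goldilocks' own implementation: region_starts is a hypothetical helper, coordinates are taken as 0-based, and a trailing partial region is simply dropped. Goldilocks' actual conventions may differ.

```python
MBP = 10 ** 6  # one megabase pair


def region_starts(sequence_length, length, stride):
    """Return 0-based start positions of every full-length region.

    Hypothetical helper: assumes 0-based coordinates and drops any
    trailing region shorter than `length`.
    """
    return list(range(0, sequence_length - length + 1, stride))


# Stride equal to length: regions tile the sequence with no overlap
tiled = region_starts(5 * MBP, MBP, MBP)

# Stride of half the length: each region overlaps its neighbour by 50%
overlapping = region_starts(5 * MBP, MBP, MBP // 2)

print(len(tiled), len(overlapping))
```

With the stride equal to the length, each base belongs to exactly one region; a smaller stride produces more regions that share bases with their neighbours.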
Every strategy’s census function is expected to return a numerical result that can be used to rank and sort regions. In this scenario, census returns the number of ORFs found.
Note also that strategies may specify any number of keyword arguments that are not found in the BaseStrategy. In our example, min_codons can be set by a user to specify how many codons must lie between a start codon and a stop codon for the span to be counted as an open reading frame. We store this value as a member of the strategy object (self.MIN_CODONS) at the end of __init__ and use it in census to ensure the orfs counter is only incremented when the length of the current open reading frame meets the provided threshold. One could store any number of configurable parameters inside the strategy class in this fashion. This framework allows one to increase the complexity of strategies while still providing a friendly and interchangeable interface for end users.
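The effect of the threshold is easiest to see on a toy input. The sketch below lifts the census logic into a standalone function so it can be run without Goldilocks or a reference genome; count_orfs is a hypothetical name and the sequences are made up for illustration.

```python
STARTS = {"ATG"}
STOPS = {"TAA", "TGA", "TAG"}
CODON_SIZE = 3


def count_orfs(sequence, frame_offset, min_codons=1):
    """Count ORFs in one forward frame, mirroring the census logic above.

    Standalone illustration only; not part of the Goldilocks API.
    """
    orfs = 0
    orf_open = 0  # 0 = no open frame, else 1 + codons seen since the START
    for i in range(frame_offset, len(sequence), CODON_SIZE):
        codon = sequence[i:i + CODON_SIZE].upper()
        if codon in STARTS and orf_open == 0:
            orf_open = 1
        elif codon in STOPS and orf_open > 0:
            # Count the ORF only if enough codons separate START and STOP
            if orf_open > min_codons:
                orfs += 1
            orf_open = 0
        elif orf_open > 0:
            orf_open += 1
    return orfs


toy = "ATGAAACCCTAG"  # START, two intervening codons, STOP

print(count_orfs(toy, 0, min_codons=1))  # two intervening codons >= 1
print(count_orfs(toy, 0, min_codons=3))  # two intervening codons < 3
print(count_orfs("ATGTAG", 0))           # no intervening codons
```

Because orf_open starts at 1 on the START codon, the comparison orf_open > min_codons is equivalent to requiring at least min_codons codons between START and STOP. With the default of 1, a START immediately followed by a STOP is therefore not counted.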