What is Mojo Hand?
Mojo Hand is software (and a web service) that helps molecular biologists and their ilk design DNA-binding proteins. Read all about it: manuscript.
What's in a name?
TALEN happens to be a homophone of talon, which has the connotation of a fierce predatory grip, of a hawk diving from the sky to strike at tiny camouflaged prey. This is fitting for a protein that can be designed to bind a specific sequence buried deep in the genome: the TALEN reaches into the genome and tightly grasps a sequence that is almost indistinguishable from its surroundings.
There is also a somewhat magical aspect to editing the genome. To the public, the prospect of editing the genome appears as fantastic as the success of sulpha drugs and vaccines in their day. To capture both of these ideas, we decided to call the program Mojo Hand. Freely associating, it has the word hand, which signifies its ability to grasp the DNA specifically. Mojo is a term from hoodoo that refers to a small cloth bag containing amulets. In the African-American folk tradition, it can be called a hand, mojo hand, conjure bag, trick bag, jomo, etc.
There is also a song called Mojo Hand recorded by Sam John "Lightnin'" Hopkins (1912--1982) for Fire Records in 1960. Several recordings exist, including this video.
Who is responsible for Mojo Hand?
Mojo Hand was designed and implemented by Kevin Neff and David Argue, with help from Steve Ekker, members of his laboratory, and collaborators. See the members of the focus group (below) and the acknowledgements in the manuscript.
Members of the focus group include: Patrick Blackburn, Randall Krug, Jarryd Campbell, Sumedha Penheiter, Wiebin Liu, Alvin Ma, Melissa McNulty, Karl Clark, and Chris Ward.
Where do I start?
Let's say you're interested in SUC2, a sucrose transport protein found in Arabidopsis thaliana. The first thing to do is find the unique identifier used in the NCBI Gene database. Go to the NCBI Gene database and search for "suc2 sucrose" or something similar.
One of the hits is for Arabidopsis. There are similarly named genes in other species, which is interesting but not terribly important for our purposes. The identifier you found should be 838877. Click that result or enter the identifier in a new search. Either way, you'll get a detailed report about the gene in human-readable format, as shown below. To get your bearings, you may want to find the link to the familiar GenBank output of NCBI Nucleotide a little lower down the page.
Now, go to talendesign.org and you'll be confronted by our TALEN design tool. Enter the identifier for your gene and email address. Then click the green button. Just use the defaults for the other parameters.
The output will look something like the excerpt below. Let's go through each section. The gene and exon are indicated first, followed by the numerical index (39 in this case) of the candidate binding site. Some candidate binding sites do not have any adequate restriction sites, so do not be alarmed if the first site is not 1. The full TALEN sequence is also printed, with or without CpG islands highlighted.
C2_E4 site 39
GGTAATATCTGTGGGAGGGGACCATTCGGACGGAACTATTCGGGTGGTGG

TAL1:   5'-GGTAATATCTGTGGGAGGT-3'  28--46 (18)
Spacer: 5'-GGACCATTCGACG-3'  46--59 (13)
TAL2:   5'-AACTATTCGGTGGTGG-3'  60--74 (14)
Binding Strand (reverse complement): 5'-CCACCACCGAATAGTT-3'
BLAST against nt/nt database

TAL1: NN NN NG NI NI NG NI NG HD NG NN NG NN NN NN NI NN NN NG
TAL2: HD HD NI HD HD NI HD HD NN NI NI NG NI NN NG NG

          PCR-Buffer Score (0-9)    Enzyme Cut Locations        Recognition
Enzyme    std  therm  phu  crim     First (abs)  Second (rel)   Site            Suppliers
------------------------------------------------------------------------------------------
TaqI/     9    9      2    9        53                          GGACCATTCGACG   Invitrogen Minotech Thermo SibEnzyme Nippon Takara Roche NEB Toyobo CHIMERx Promega Sigma Bangalore Vivantis EURx CinnaGen
FspEI/    0    0      0    0        49           +121           GGACCATTCGACG   NEB
------------------------------------------------------------------------------------------
The next section lists the sequences to which the TAL effector domains bind. The second TAL site is on the opposite strand, so its reverse complement is shown in addition to the forward sequences. The index of the beginning and end of each part of the binding site is also shown. The length of each part is shown in parentheses.
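The relationship between the forward TAL2 sequence and its binding strand can be checked by hand. The following is a minimal Python sketch, not code taken from Mojo Hand; the function name is chosen here for illustration. It reproduces the binding strand printed in the excerpt above.

```python
# Map each unambiguous DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# TAL2 as printed on the forward strand in the excerpt above.
tal2_forward = "AACTATTCGGTGGTGG"
print(reverse_complement(tal2_forward))  # CCACCACCGAATAGTT
```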
The next section is the RVDs for the two binding sites. They are normally tab delimited to make copy/paste operations compatible with spreadsheet software.
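The RVD lists in the excerpt are consistent with the commonly cited one-RVD-per-base code (NI for A, HD for C, NN for G, NG for T). The sketch below assumes that code; it is an illustration in Python, not Mojo Hand's own implementation, and the names are hypothetical. It joins the RVDs with tabs, as described above.

```python
# Commonly cited RVD code; an assumption here, not read from Mojo Hand's source.
BASE_TO_RVD = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def rvds_for(binding_strand: str) -> str:
    """Return tab-delimited RVDs for a binding strand written 5' to 3'."""
    return "\t".join(BASE_TO_RVD[base] for base in binding_strand)


# TAL1 binds the forward strand; TAL2 binds the reverse complement.
print(rvds_for("GGTAATATCTGTGGGAGGT"))
print(rvds_for("CCACCACCGAATAGTT"))
```

For the example site, both lines agree with the TAL1 and TAL2 rows in the excerpt once the tabs are replaced by spaces.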
The final section is a list of enzymes that cut once or twice in the amplicon defined by the long flanking length (usually 150 on either side of the pair of binding sites). The name of each enzyme is listed with its activity in common full-strength PCR buffers. The PCR-buffer scores are rescaled to a range of 0-9, with 0 indicating no activity. The position of the first cut is absolute, measured from the start of the amplicon; the position of the second cut is given relative to the first. If the second cut is 3' to the first cut, the distance will be positive.
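To turn the two cut columns into fragment sizes for a gel, add the relative offset to the absolute position. The Python sketch below does that arithmetic for the FspEI row of the excerpt. The amplicon length of 350 is a made-up value for illustration, and the exact indexing convention Mojo Hand uses for cut positions is an assumption.

```python
def second_cut_absolute(first_abs: int, second_rel: int) -> int:
    """Convert the relative position of the second cut to an absolute one."""
    return first_abs + second_rel


def fragment_sizes(amplicon_length: int, cuts: list) -> list:
    """Return fragment lengths after cutting the amplicon at the given positions."""
    edges = [0] + sorted(cuts) + [amplicon_length]
    return [right - left for left, right in zip(edges, edges[1:])]


first = 49                                   # "First (abs)" for FspEI
second = second_cut_absolute(first, +121)    # "Second (rel)" for FspEI
print(second)                                # 170
print(fragment_sizes(350, [first, second]))  # [49, 121, 180]
```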
And if you set verbose output to anything over 0, it will include some other output at the beginning of the file. This is mainly for troubleshooting and may not be of general use. For SUC2, it looks like this:
Executed 15:14:49 9 May 2012
1  E1  15 15 15   392-- 406   407-- 421   422-- 435  GGTCCAATCTCCGGT ATGCTTGTTCAGCCT ATCGTCGGTTACCAC (GTGGTAACCGACGAT)
2  E1  17 13 15   392-- 408   409-- 421   422-- 435  GGTCCAATCTCCGGTAT GCTTGTTCAGCCT ATCGTCGGTTACCAC (GTGGTAACCGACGAT)
3  E1  20 15 20   402-- 421   422-- 436   437-- 455  CCGGTATGCTTGTTCAGCCT ATCGTCGGTTACCAC AGTGACCGTTGCACCTCAAG (CTTGAGGTGCAACGGTCACT)
4  E1  15 14 17   512-- 526   527-- 540   541-- 556  GTTTTCCTTATCGGT TACGCTGCCGATAT AGGTCACAGCATGGGCG (CGCCCATGCTGTGACCT)
5  E1  16 13 17   512-- 527   528-- 540   541-- 556  GTTTTCCTTATCGGTT ACGCTGCCGATAT AGGTCACAGCATGGGCG (CGCCCATGCTGTGACCT)
6  E1  18 18 15  1202--1219  1220--1237  1238--1251  GGTTTCATGTCTCTTGGT GTTGAATGGATTGGTCGG AAATTGGGAGGAGCT (AGCTCCTCCCAATTT)
7  E1  20 16 15  1202--1221  1222--1237  1238--1251  GGTTTCATGTCTCTTGGTGT TGAATGGATTGGTCGG AAATTGGGAGGAGCT (AGCTCCTCCCAATTT)
8  E1  19 18 19  1212--1230  1231--1248  1249--1266  CTCTTGGTGTTGAATGGAT TGGTCGGAAATTGGGAGG AGCTAAAAGGCTTTGGGGT (ACCCCAAAGCCTTTTAGCT)
9  E1  20 17 19  1212--1231  1232--1248  1249--1266  CTCTTGGTGTTGAATGGATT GGTCGGAAATTGGGAGG AGCTAAAAGGCTTTGGGGT (ACCCCAAAGCCTTTTAGCT)
...
My gene doesn't have an mRNA feature. What do I do?
If your gene doesn't have mRNA features associated with it, you can use CDS or misc_RNA features instead, or abandon subsequence features and process the whole sequence.
How do I prioritize the CDS features rather than the mRNA associated with my gene?
Use the option --cds-index=1. If there are multiple CDS records, you can set the value to 2, 3, etc. Incidentally, you can give misc_RNA the highest priority in the same way, using --misc-rna-index=#.
How do I select one of many mRNA features?
You can select subsequence features by index using --mrna-index=#, or you can identify the particular transcript accession with the option --mrna-transcript-id=... The same trick works for the other types of features.
Why is my email required? Are you going to like sell it or something?
Your email address is needed to ensure that NCBI staff can contact you if you are violating the end-user agreement for E-Utilities. Read all about it here. Otherwise, we don't keep track of who is using this service. We keep track of the number of hits because that is a sign of problems with the web site. If you're concerned that your research will be scooped, the source code and various ancillary files are freely available. And if you're really concerned about security, you could always install an onion router first.
According to NCBI, what's the difference between mRNA, CDS, misc_RNA, and exon features?
Here are the definitions of the subsequence features according to the DDBJ/EMBL/GenBank Feature Table Definition.
mRNA - messenger RNA; includes 5' untranslated region (5'UTR), coding sequences (CDS, exon) and 3' untranslated region (3'UTR).
CDS - coding sequence; sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon); feature includes amino acid conceptual translation.
misc_RNA - any transcript or RNA product that cannot be defined by other RNA keys (prim_transcript, precursor_RNA, mRNA, 5'UTR, 3'UTR, exon, CDS, sig_peptide, transit_peptide, mat_peptide, intron, polyA_site, ncRNA, rRNA and tRNA).
exon - region of genome that codes for portion of spliced mRNA, rRNA and tRNA; may contain 5'UTR, all CDSs and 3'UTR.
How do I force Mojo Hand to process the entire sequence, regardless of GenBank features?
Use the option --region=all or download your sequence and upload (or cut/paste) it in FASTA format. When file uploads are processed, it is assumed that all regions of interest are designated on separate lines, so no further subsequence information is used.
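Downloaded FASTA files usually wrap each sequence across many lines, whereas the FASTA text box expects each sequence on a single line after its '>' header. The Python sketch below is one way to unwrap a file before uploading or pasting it; the function name is hypothetical and this is not part of Mojo Hand.

```python
def to_single_line_fasta(text: str) -> str:
    """Join the wrapped sequence lines of each FASTA record into one line."""
    records = []  # each entry is [header, list of sequence chunks]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records.append([line, []])
        elif records:
            records[-1][1].append(line)
    return "\n".join(f"{header}\n{''.join(chunks)}" for header, chunks in records)


wrapped = ">gene_E1\nACGT\nTTGA\n>gene_E2\nGGCC\n"
print(to_single_line_fasta(wrapped))
```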
How do I specify a particular mRNA transcript accession instead of using the numeric order of the mRNA features?
Use the --mrna-transcript-id option. For example, suppose you're working with PIH1D2 (120379), which has several mRNA features. Importantly, the two associated with our gene differ in the last exon; the two versions of that exon are substantial in length and do not overlap. See the GenBank record for PIH1D2.
If you want the mRNA feature from transcript NM_138789.3, you add --mrna-transcript-id=NM_138789.3 to the command line.
Can I use Mojo Hand as a generic tool for searching DNA sequences?
Yes, you can search for any consensus sequence you can imagine. But it differs from other search methods (such as BLAST) because you must first specify a fragment of a genome to search. Unless you're doing something very off-beat, you're probably better off using BLAST.
How do I search for a consensus sequence that has requirements at the end, rather than the preceding T needed for TAL effectors?
The consensus sequence should start with s.* followed by your postfix requirements. If the sequence should end with AAA, then use --template="s.*eAAA"
Are there any other tools to help me design TALENs?
Yes, yes there are. For example, Bogdanove at ISU has a similar web service, called TALEN Targeter. It requires that the full sequence be submitted in FASTA format, which can be inconvenient when the analysis of all exons is needed. Mojo Hand performs the automated download and extraction of exons for most genes. Also, Mojo Hand uses a more extensive database of commercially available restriction enzymes than the ISU Targeter.
Another web service is called idTAL. It does perform automated download of genes based on Ensembl gene identifiers, but artificially limits the genes available for analysis based on species. For example, they do not include zebrafish, which is a problem for the Ekker lab. They also do not support restriction fragment length polymorphism (RFLP) for detecting activity. Finding sequences for which a TALEN can be made is quite easy; the challenge is finding one with a single restriction site within the spacer.
I ran Mojo Hand and it hangs. What's wrong?
The most frequent problem is that the user has made an error in the gene identifier. Unique identifiers of genes differ between databases. OMIM records, for example, usually won't work. And even if you get some results, it probably won't be the sequence you want.
What is TALBLAST?
TALBLAST is a web service that uses BLASTN and the nt database to discover potential off-site binding that could lead to artefacts. If a TALEN site is not unique within a genome, the TALEN protein may edit a genome at the incorrect location.
TALBLAST is available as a stand-alone application or it can be used through the output of Mojo Hand, which includes links so you can check the uniqueness of any particular TALEN. The core of TALBLAST is a script that searches for each TAL site separately and interprets the results. To set it up on your own machine, you'll need stand-alone BLAST (BLAST+) installed locally, and you should run blastn in "blastn-short" mode.
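For readers setting up a local search, the Python sketch below assembles a blastn command line in "blastn-short" mode. It only builds and prints the command; it is not how TALBLAST itself is known to call BLAST. The query file name tal_sites.fasta and the choice of tabular output columns are assumptions made for illustration.

```python
import shlex


def blastn_short_command(query_fasta: str, db: str = "nt") -> list:
    """Build (but do not run) a BLAST+ blastn command for short queries."""
    return [
        "blastn",
        "-task", "blastn-short",   # mode suited to short query sequences
        "-db", db,                 # a locally installed BLAST database
        "-query", query_fasta,     # hypothetical FASTA file of TAL sites
        # Tabular output; the column choice is an assumption for this sketch.
        "-outfmt", "6 qseqid saccver qstart qend sstart send gaps evalue bitscore",
    ]


cmd = blastn_short_command("tal_sites.fasta")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run once BLAST+ and a database are installed.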
TALBLAST returns accession numbers for the subject sequence where my TALEN's bits match. What happened to the Gene Identifier?
The accession numbers are for sequences in the Nucleotide database. There are often multiple sources for a given gene, so several accession numbers may appear for it; this does not necessarily indicate an off-site effect.
To decode the accession numbers, refer to the NCBI documentation on that topic.
In TALBLAST output, what are the fields?
query id, subject acc.ver, q. start, q. end, s. start, s. end, gaps, evalue, bit score
The query ID is the sequence that's being searched for. The subject is the nt database. The query start (q.start) and end are the part of the TAL site that was found to match part of the genome. If the length of the match is less than the number of RVDs, you have a partial match. The subject start and end tell you where in a particular bit of the genome the query sequence matches. The number of gaps, expected value, and bit score are also given.
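The partial-match rule above (match length shorter than the number of RVDs) is easy to apply in a script. The Python sketch below parses one whitespace-delimited row with the nine fields listed above and applies the rule. The delimiter, the names, and the example rows are assumptions for illustration; the accession XX000000.1 is made up.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_acc: str
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    gaps: int
    evalue: float
    bit_score: float


def parse_hit(line: str) -> Hit:
    """Parse one row of nine whitespace-delimited fields."""
    f = line.split()
    return Hit(f[0], f[1], int(f[2]), int(f[3]), int(f[4]), int(f[5]),
               int(f[6]), float(f[7]), float(f[8]))


def is_partial(hit: Hit, n_rvds: int) -> bool:
    """A hit is partial if it covers fewer query bases than there are RVDs."""
    return (hit.q_end - hit.q_start + 1) < n_rvds


full = parse_hit("TAL1 XX000000.1 1 19 1204 1222 0 0.002 38.2")   # made-up row
short = parse_hit("TAL1 XX000000.1 3 19 5010 5026 0 0.05 34.2")   # made-up row
print(is_partial(full, 19), is_partial(short, 19))
```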
This program uses NCBI EUtilities, which requires that users enter their email address. It helps the NCBI staff track problems. (We will not collect, store, or use your email address for nefarious commercial or academic plots.) For example, if you are using this program in batch mode on a cluster and you accidentally send 10000 requests, NCBI will probably want to contact you to let you know what's going on.
Straight from the horse's mouth: NCBI EUtilities Handbook
Parameter: Gene ID
Use gene identifiers for the NCBI Gene database. Try 3239, which is HOXD13 in humans. For help finding this identifier, see the SUC2 example above.
Parameter: FASTA text box
Paste in the DNA sequence you're interested in. Each sequence should be on a single line and the preceding line should look like '>gene_E1'. Try this sequence:
Retrieves a sequence from the NCBI Nucleotide database. Enter the accession, beginning, and end of the sequence. For example, HOXD13 can be retrieved with NC_000002.11, 176957532, and 176960666. You can find this information in GenBank records. This should be considered a last-ditch option.
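The same three values can be used to fetch the region directly from NCBI with the E-utilities efetch endpoint. The Python sketch below only builds the request URL; it does not contact NCBI, and it is not necessarily the request Mojo Hand sends. The email address shown is a placeholder.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, start: int, stop: int, email: str) -> str:
    """Build an efetch URL for a sub-range of a Nucleotide record in FASTA format."""
    params = {
        "db": "nuccore",       # the Nucleotide database
        "id": accession,
        "seq_start": start,    # first base to retrieve
        "seq_stop": stop,      # last base to retrieve
        "rettype": "fasta",
        "retmode": "text",
        "email": email,        # lets NCBI staff contact you about problems
    }
    return f"{EFETCH}?{urlencode(params)}"


# HOXD13 region from the example above; the email is a placeholder.
print(efetch_url("NC_000002.11", 176957532, 176960666, "you@example.org"))
```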
Parameter: Consensus Sequence
TAL binding sites can be identified by a consensus sequence based on naturally occurring TAL effectors in rice. Recent evidence indicates that the consensus sequence may be less complicated than previously thought. Because there is some debate, we have made TAL Tool as general as possible, and the consensus sequence is defined here based on a simple notation. Use s to indicate the start of a binding site and e to indicate the end. Bases are indicated as majuscules (uppercase letters). When several bases are acceptable, use the notation [AG] to indicate the choice. Non-standard base codes (X, Y, etc.) are not supported. Use . to indicate a base when the particular base is not important or .* for 0 or more unspecified bases.
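Apart from the s and e markers, this notation reads like an ordinary regular expression, so a template can be tried out on a test sequence before submitting a job. The Python sketch below rests on that assumption: it turns s and e into a named capture group and leaves everything else untouched. Mojo Hand's own matcher may behave differently (for example, in how it applies the binding-length limits or handles overlapping sites), and the test sequence is made up.

```python
import re


def template_to_regex(template: str) -> "re.Pattern":
    """Translate a consensus template into a regex with a 'site' capture group."""
    parts = []
    for ch in template:
        if ch == "s":
            parts.append("(?P<site>")   # start of the binding site
        elif ch == "e":
            parts.append(")")           # end of the binding site
        else:
            parts.append(ch)            # bases, '.', '.*' and '[AG]' pass through
    return re.compile("".join(parts))


# A site of any length that must be followed by AAA.
match = template_to_regex("s.*eAAA").search("CCGGAAATT")
print(match.group("site"))  # CCGG
```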
Parameter: Single Binding Site
Mojo Hand can find pairs of binding sites or single isolated binding sites. If you check this box, the consensus sequence is used to find isolated binding sites. When checked, no restriction analysis is performed.
Parameter: Highlight CpG Islands
There may be some effects of methylation on TAL binding, so this option allows users to highlight CpG islands in the output of Mojo Hand to make it easier to choose the ideal TALEN.
Parameter: Sequence Type
TAL binding sites can be designed in any region of the genomic DNA. In many applications, conserved exons are the target, but there may be uses for finding TAL sites in intronic and intergenic regions. Intronic regions can only be examined if the gene is downloaded from Gene or Nucleotide. The entire sequence is used for file submissions.
Parameter: Short Flank Length
Some additional sequence on either side of each exon/intron may be needed when searching for binding sites. This is especially important for very short exons or binding sites very near the beginning or end of an exon.
Parameter: Long Flank Length
Some additional sequence on either side of a binding site is used when looking for restriction sites. The flanking sequence is particularly important when TAL sites are near the beginning or end of an exon. It also sets the length of the amplicon considered when searching for single- and double-cutting enzymes.
Parameter: Minimum Distance to 2nd Cut
If the enzyme restriction site matches twice within the sequence, this value defines the minimum distance that the second cut site must be from the cut site found in the spacer.
Parameter: Minimum Spacer Length
The spacer between the two TAL binding sites plays an important role in TALEN design. If it is too short or too long, the FokI domains will not bind to form an effective endonuclease.
Parameter: Minimum Binding Length
The specificity of TAL binding depends on the length of the binding site and number of repeat-variable di-residues (RVDs). Short binding sites may be promiscuous and very long sequences may be troublesome to construct. TALBLAST requires at least 17 bases, so if you want to analyze the output for uniqueness in a genome or rank off-site effects, use a value of 17.
Parameter: Maximum Binding Length
The specificity of TAL binding depends on the length of the binding site and number of repeat-variable di-residues (RVDs). Short binding sites may be promiscuous and very long sequences may be troublesome to construct.
Parameter: mRNA Index
Use this option to force Mojo Hand to use a particular mRNA feature. They're numbered according to their location in the GenBank file, starting from 1.
Parameter: mRNA Transcript
Use this option to force Mojo Hand to use a particular mRNA feature. The first encountered feature will be used if no value is provided here. The value must be a properly formatted accession.
Parameter: misc_RNA Index
Use this option to force Mojo Hand to use misc_RNA features instead of mRNA or other features that may be present. Also, if multiple misc_RNA features are present, you may select which will be used. They're numbered according to their location in the GenBank file, starting from 1.
Parameter: CDS Index
Use this option to force Mojo Hand to use CDS features instead of mRNA or other features that may be present. Also, if multiple CDS features are present, you may select which will be used. They're numbered according to their location in the GenBank file, starting from 1.
The verbose output option produces additional output that may be useful if an error is suspected. Its main utility is for developers; most researchers will not need to adjust this value. Values of 1 and 2 will produce debugging output that may be of somewhat general use.
Copyright © Mayo Clinic 2012
Updated Tue Aug 8 09:30:00 CDT 2014