TCF: Transposon Cluster Finder

Joti Giordano, Yongchao Ge, Yevgeniy Gelfand, Gyorgy Abrusan, Gary Benson,
Peter E. Warburton, 2007
Evolutionary History of Mammalian Transposons Determined by Genome-Wide
Defragmentation
PLoS Computational Biology Vol 3, No. 7, e137

http://www.mssm.edu/labs/warbup01/paper/files.html


--- HOW TCF WORKS ---

TCF defragments RepeatMasker data and identifies transposons that have
inserted themselves into other transposons. For TCF to join two fragments
together, they must have the same name, the same orientation, and be within
500 unmasked base pairs of each other.

After defragmenting the data, TCF counts how many times every transposon
interrupts every other transposon in the input data set and produces an
interruption matrix. This matrix can be used as input for IMA to produce a
chronological ordering of the transposable elements. TCF also produces a
custom track that can be used with the UCSC genome browser to display the
clusters.

TCF takes into consideration special cases that do not represent independent
transposition events, and does not count them as interruptions. In
particular, it can recognize intact LTR transposons and L1 5' inversions that
appear in the RepeatMasker data as multiple separate transposons.


--- LTR DETECTION ---

LTR transposons often travel in pairs, with an internal transposon between
them. The pair of LTRs should therefore be counted as a single insertion.
TCF identifies intact LTR transposons in which two LTRs with the same name
precisely flank a full-length internal LTR element in the same orientation.
The second LTR element is not considered an independent transposon, and is
not counted as interrupting anything. See Giordano et al. 2007 for more
details.

The user can specify which LTRs go with which internal elements by using an
LTR file. Each line of the file should have the name of an LTR followed by a
tab and one or more tab-delimited names of internal transposons that the LTR
can pair up with.
For example, suppose you run TCF using an LTR file with the following
contents:

    THE1A    THE1A-int    THE1B-int    THE1C-int    THE1D-int
    MSTB1    MSTA-int     MSTA1-int

The THE1A element will be considered an LTR of THE1A-int, THE1B-int,
THE1C-int, and THE1D-int for the purpose of detecting intact LTRs. Likewise,
MSTB1 will be considered an LTR of MSTA-int and MSTA1-int. Therefore, if TCF
sees a pair of MSTB1 elements closely flanking a full-length MSTA-int or
MSTA1-int, it will consider it an intact LTR and will not count the second
MSTB1 as interrupting anything.

You can also use the star character (*) as a wildcard to represent zero or
more characters in a transposon name within the LTR file. So, the following
would be valid entries:

    LTR*    *ERV*
    *       *-int    MER*1

Line 1 above would have the effect of associating any transposon that has a
name beginning with "LTR" with any internal transposon with the letters "ERV"
in its name. Line 2 would allow any transposon of any sort to be considered
an LTR of any other transposon with a name ending in "-int", or starting with
"MER" and ending with "1". Note that these rules are case-sensitive.

Also note that TCF does not perform LTR detection when using the RepeatMasker
defragmentation algorithm (see below). RepeatMasker identifies LTRs itself
and assigns them ID numbers accordingly.

The LTR file name should be specified by using the -l (lowercase L) command
line option:

    tcf -l ltr_file repeatmasker_file

If you do not use the -l option, then TCF will not try to detect intact LTRs.


--- L1 5' INVERSION ---

TCF looks for cases where the 5' end of an L1 element has been inverted. If
it encounters two adjacent L1 elements with contiguous repeat indices but
opposite orientation, the second L1 is not considered an independent
transposon, and is not counted as interrupting anything. See Giordano et al.
2007 for more details. This applies to all genomes.

TCF does not identify L1 inversions when using the RepeatMasker
defragmentation algorithm (see below).
L1 5' inversion detection can be disabled by using the -I option:

    tcf -I repeatmasker_file


--- OUTLIER PAIRS ---

The user can specify certain pairs of transposons as "outliers," i.e. special
cases that should not be counted. If you provide a file with a list of pairs
of transposon names, TCF will never count the first transposon in the pair as
interrupting the second.

For example, suppose you run TCF using an outlier file with the following
contents:

    AluSx    AluY
    AluY     AluSx
    L2       L3

Regardless of what TCF finds in the input data, it will never count an AluSx
as interrupting an AluY (due to line 1 in the outlier file), nor AluY
interrupting AluSx (line 2), nor L2 interrupting L3 (line 3).

Each line specifies first an interrupting transposon name, followed by a tab,
and then an interruptee. The list of pairs may be arbitrarily long. Note that
in the above example, although TCF will not count interruptions by L2 into
L3, it will count interruptions by L3 into L2.

The outlier file name should be specified by using the -o command line
option:

    tcf -o outlier_file repeatmasker_file


--- THE PUTBACK FILE ---

When TCF creates the interruption matrix, it only includes elements that
interact with at least 30% of the other elements in the data set. This
percentage can be set by using the -c command line option. For one transposon
to interact with another means to either interrupt it or be interrupted by it
at least once.

You can specify that certain elements should always be included in the
matrix, even if they do not meet the minimum percent connectedness, by
listing the transposon names in a "putback" file, with one transposon name
per line.

For example, suppose you run TCF using a putback file with the following
contents:

    L1MC3
    LTR13
    MERm

The transposons L1MC3, LTR13, and MERm will be included in the matrix
regardless of how connected they are. In other words, if they are taken out
for having a low percent connected, they will be put back.
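
The connectedness rule and the putback list described above can be
illustrated with a small Python sketch. This is not TCF's own code. It
assumes a single filtering pass, takes "percent connected" to mean the share
of the other elements in the data set that an element interacts with, and
uses a made-up function name (select_connected) and a made-up 5x5 matrix.
Whether TCF recomputes connectedness after removing elements is not stated
here, so treat the details as assumptions.

```python
def select_connected(names, matrix, min_percent=30.0, putback=()):
    """Return the names kept under a single-pass connectedness filter.

    matrix[i][j] is the number of times element i interrupts element j
    (rows interrupt columns). Hypothetical helper, not part of TCF.
    """
    n = len(names)
    always_keep = set(putback)
    kept = []
    for i, name in enumerate(names):
        # Two elements interact if either one interrupts the other
        # at least once.
        partners = sum(
            1
            for j in range(n)
            if j != i and (matrix[i][j] > 0 or matrix[j][i] > 0)
        )
        # Assumption: the percentage is taken over the n - 1 other elements.
        percent = 100.0 * partners / (n - 1) if n > 1 else 0.0
        if percent >= min_percent or name in always_keep:
            kept.append(name)
    return kept


# Made-up interruption counts for illustration only.
names = ["AluY", "L1PA2", "L2", "MIR", "MERm"]
matrix = [
    [0, 3, 5, 2, 0],  # AluY interrupts L1PA2, L2 and MIR
    [0, 0, 1, 0, 0],  # L1PA2 interrupts L2
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

print(select_connected(names, matrix))
print(select_connected(names, matrix, putback={"MERm"}))
```

In this toy data MIR interacts with only one of the four other elements
(25%), so it drops out at the default 30% threshold, while MERm, which
interacts with nothing, is retained only when it is listed in the putback
set.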
The putback file name should be specified by using the -p command line
option:

    tcf -p putback_file repeatmasker_file

The minimum percent connected can be specified using the -c command line
option:

    tcf -c 30 repeatmasker_file


--- DEFRAGMENTATION ALGORITHMS ---

TCF combines fragments into what it refers to as units by examining their
repeat indices. Two fragments within 500 base pairs of nontransposon sequence
will be defragmented if their repeat indices are increasing, or overlap by no
more than 50% of the length of the shorter fragment.

The maximum distance between the fragments can be set by using the -s option:

    tcf -s 500 repeatmasker_file

RepeatMasker has its own defragmentation algorithm. Each repeat in a
RepeatMasker file has an ID number in the rightmost column. Two fragments
with the same ID may have been defragmented by RepeatMasker.

TCF can use the RepeatMasker ID numbers to defragment the data if you choose;
however, the RepeatMasker file must be processed with the section.pl Perl
script before TCF can make use of the RepeatMasker ID numbers. This script
adds an additional column to the RepeatMasker file that allows TCF to use the
RepeatMasker ID numbers correctly. It reads a RepeatMasker file as input and
writes the modified file to standard output, so you will want to redirect its
output to a file:

    perl section.pl repeatmasker_file.txt > sectioned_repeatmasker_file.txt

Use the -r option to direct TCF to use the RepeatMasker defragmentation
instead of its own algorithm:

    tcf -r sectioned_repeatmasker_file


--- EXCLUDED REGIONS ---

The user can specify coordinates in the input genome that should be ignored
by TCF, such as regions containing tandem repeats that may produce spurious
interruptions. Each line of the file should contain the chromosome name,
start, and end coordinates, separated by tabs. There must also be a fourth
column containing a label which explains why the region is to be excluded
from the analysis.
The label must not contain any spaces, but can otherwise be any text you
choose. This label will be used to identify the excluded clusters in the
custom track. Note that if a cluster overlaps an excluded region, the entire
cluster will be excluded.

For example, suppose you run TCF using an exclude file with the following
contents:

    chr1    13688       14311       SegmentalDuplication
    chr9    89731067    89735730    LTRArray
    chrX    82839554    82839843    TandemRepeat

Any interruptions that are found in the regions of chr1:13688-14311,
chr9:89731067-89735730, or chrX:82839554-82839843 will not be counted. Any
clusters in those regions will be marked in the custom track with the label
from the fourth column.

The exclude file name should be specified by using the -x command line
option:

    tcf -x exclude_file repeatmasker_file


--- REPEATMASKER FILES ---

TCF takes RepeatMasker input in two different file formats.

The first file format is a RepeatMasker file that contains only the columns
that TCF uses. It must begin with a pound sign (#) in the very first position
of the first line. This line is a comment header; it will be ignored, and is
typically used to label the columns of the input file. You cannot use the -r
option with this file format. TCF expects the following columns (in this
order):

    Divergence (in thousandths)
    Chromosome
    Genomic_Start
    Genomic_End
    Strand
    Repeat_Name
    Repeat_Class
    Repeat_Family
    Repeat_Start
    Repeat_End
    Repeat_Left

Here is an example of how this input might look:

    #milliDiv  genoName  genoStart  genoEnd  strand  repName  repClass  repFamily  repStart  repEnd  repLeft
    266        chr1      1347       1538     -       L1MC     LINE      L1         -498      5668    5449
    294        chr1      1540       1643     -       MER5B    DNA       MER1_type  -74       104     1
    230        chr1      5127       5218     -       MIR      SINE      MIR        -65       143     49
    322        chr1      8769       8912     +       L2       LINE      L2         2942      3105    -314

The other format is a complete RepeatMasker file. A full RepeatMasker file
begins with 3 header lines: two lines of column labels followed by a blank
line.
A sample input file might look like this:

       SW  perc perc perc  query     position in query           matching  repeat          position in  repeat
    score  div. del. ins.  sequence    begin     end    (left)    repeat   class/family     begin  end (left)   ID

      439  26.6 16.2  1.9  chr1         1348    1538 (247248181) C  L1MC   LINE/L1          (498) 5668   5449    3
      278  29.4  1.9  1.0  chr1         1541    1643 (247248076) C  MER5B  DNA/MER1_type     (74)  104      1    4
      310  23.0  3.8  0.0  chr1         5128    5218 (247244501) C  MIR    SINE/MIR          (65)  143     49    5
      266  32.2 14.7  0.0  chr1         8770    8912 (247240807) +  L2     LINE/L2           2942 3105  (314)    7

As of this writing, RepeatMasker files can be obtained from

    ftp://hgdownload.cse.ucsc.edu/goldenPath

or

    http://hgdownload.cse.ucsc.edu/downloads.html

Look in the directory for the latest version of the species you are
interested in. For example, the directory for the March 2006 version of the
human genome is called hg18. Under that directory, there will be a directory
called bigZips if you are using FTP, or an entry called "Full data set" if
you are using a web browser. There you will see a file called chromOut.zip.
Download this file and decompress it. The file may be quite large (over
100 MB), so UCSC recommends that you use FTP rather than downloading from the
web site.

After unzipping the downloaded RepeatMasker file, there might be a separate
file for each chromosome or contig. If so, run the concat.pl script on the
.out files as follows:

    perl concat.pl *.out > repeatmasker_file.txt

This will concatenate the separate files into one big file.


--- OUTPUT FILES ---

chrN_clusters.html: Lists the clusters for each chromosome. See Figure 1 in
    Giordano et al. 2007.
chrN_l1.html: Lists the clusters that contain L1 5' inversions for each
    chromosome with links to the genome browser. See Data Set 2 in the paper.
chrN_ltr.html: Lists the clusters that contain intact LTRs for each
    chromosome. See Data Set 1 in the paper.
customtrack.txt: A custom track for viewing clusters with the UCSC genome
    browser.
features_all.txt: Shows information such as number of fragments, units, and
    interruptions for each repeat. See Table S1 in the paper.
features_matrix.txt: Shows percent connected and number of interruptions for
    each element in the interruption matrix. See Table S2 in the paper.
matrix.txt: Interruption matrix showing how many times the transposons in the
    rows interrupted the transposons in the columns. See Figure 2 in the
    paper.
names.txt: Column and row labels for the matrix in matrix.txt. See Figure 2
    in the paper.
removed.txt: Shows the elements that were not included in the matrix because
    they were not sufficiently connected.
stats.txt: General information like the number of clusters, number of
    interruptions, etc. See Table S3 in the paper.

Note that for scaffold genomes, it is best to run TCF with the -q option.
This prevents TCF from writing out files that are specific to each contig,
such as chrN_clusters.html, chrN_l1.html, and chrN_ltr.html. Otherwise, the
program will produce hundreds or even thousands of output files.

The html files have links that allow you to view the clusters using the UCSC
genome browser. By default, they link to the March 2006 version of the human
genome.
You can specify a different genome using the -g option, followed by the name
of the UCSC database for the appropriate species:

    tcf -g hg18 repeatmasker_file

Here are the database names for the most recent versions of several genomes
as of this writing:

    Human, March 2006:       hg18
    Chimp, March 2006:       panTro2
    Rhesus, January 2006:    rheMac2
    Mouse, February 2006:    mm8
    Rat, November 2004:      rn4
    Cat, March 2006:         felCat3
    Dog, May 2005:           canFam2
    Horse, January 2007:     equCab1
    Cow, March 2005:         bosTau2
    Opossum, January 2006:   monDom4

Consult the UCSC genome browser downloads page for the most up-to-date
database names:

    http://hgdownload.cse.ucsc.edu/downloads.html


--- COMMAND LINE OPTIONS ---

TCF has one required argument, which is the name of the RepeatMasker file,
and several optional arguments which are described here. The optional
arguments should precede the input file name.

    -c connectedness   Specify minimum percent connected for inclusion in
                       the matrix. connectedness is a real number between
                       0 and 100. The default is 30%.
    -g database_name   Specify the input genome.
    -I                 Do not detect L1 5' inversions.
    -l ltr_file        Use ltr_file to define LTR pairs.
    -o outlier_file    Use outlier_file to define outlier pairs.
    -p putback_file    Always include elements listed in putback_file.
    -q                 Quiet mode: no chromosome-specific output.
    -r                 Use RepeatMasker defragmentation.
    -s space           Set the max unmasked base pairs between fragments to
                       be defragmented. Has no effect if RepeatMasker
                       defragmentation is used.
    -x exclude_file    Ignore regions listed in exclude_file.

Note for Mac and Unix users: TCF does not allow you to group command line
arguments. For example, instead of

    tcf -rq repeatmasker_file

use

    tcf -r -q repeatmasker_file


--- IMA: Interruption Matrix Analysis ---

IMA is a program that analyzes the interruption matrix produced by TCF and
provides a chronological ordering of the transposons based on how many times
each one interrupts all the others.
It can be accessed at:

    http://tandem.bu.edu/cgi-bin/ima/imaweb.exe

In order to use IMA, you must first run TCF. Then submit the matrix.txt and
names.txt files that TCF creates to IMA using the web site shown above.
Depending on the size of the matrix, IMA can take several hours to run, so
you will have to submit an e-mail address so that IMA can send you the
results.