F-Seq: A Feature Density Estimator for High-Throughput Sequence Tags

Tag sequencing using high-throughput sequencing technologies are now regularly employed to identify specific sequence features such as transcription factor binding sites (ChIP-seq) or regions of open chromatin (DNase-seq). To intuitively summarize and display individual sequence data as an accurate and interpretable signal, we developed F-Seq, a software package that generates a continuous tag sequence density estimation allowing identification of biologically meaningful sites whose output can be displayed directly in the UCSC Genome Browser.

F-Seq Version 1.81
usage: fseq [options]... [file(s)]...
-b <background dir>     background directory (default=none)
-d <input dir>          input directory (default=current directory)
-f <arg>                fragment size (default=estimated from data)
-h                      print usage
-l <arg>                feature length (default=600)
-o <output dir>         output directory (default=current directory)
-of <wig | bed | npf>   output format (default wig)
-p <ploidy dir>         ploidy/input directory (default=none)
-s <arg>                wiggle track step (default=1)
-t <arg>                threshold (standard deviations) (default=4.0)
-v                      verbose output
Note: DNase HS data (5' ends) - set -f 0

Newest Version: Download from github

Version 1.84: Download

Bioinformatics paper version: Download

Source available: Download

Software requirements

This software requires Java version 1.5 or greater.

To see your version of java (or if it is installed), type ‘java -version’

If java is not installed or you do not have the correct version, download here .

Current accepted input formats:
Bed

Currently accepted output formats:
Wiggle

Bed

narrowPeak

narrowPeak: Narrow (or Point-Source) Peaks Format

This format is used to provide called peaks of signal enrichment
based on pooled, normalized (interpreted) data. It is a BED6+3 format.

field type description
chrom string Name of the chromosome
chromStart int The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd int The ending position
of the feature in the chromosome or scaffold. The chromEnd base is not
included in the display of the feature. For example, the first 100 bases
of a chromosome are defined as chromStart=0, chromEnd=100, and span the
bases numbered 0-99.
name string Name given to a region (preferably unique). Use ‘.’ if no name is assigned.
score int Indicates how
dark the peak will be displayed in the browser (1-1000). If ’0′, the
DCC will assign this based on signal value. Ideally average signalValue
per base spread between 100-1000.
strand char +/- to denote strand or orientation (whenever applicable). Use ‘.’ if no orientation is assigned.

e.g.

chrX    9091548 9091648 .       0       .       182     5.0945  -1  50
chrX    9358722 9358822 .       0       .       91      4.6052  -1  40
chrX    9391082 9391182 .       0       .       182     9.2103  -1  75

 

Usage in Linux

Make sure ‘bin/fseq’ is executable (chmod 0755 bin/fseq)

For a list of options, type ‘fseq -h’

Example: fseq -v -of wig chr1.bed chr2.bed

This takes as input the chr1.bed and chr2.bed files.
Will use verbose output and outputs the densities in the wiggle format.

A perl script has been included which converts data from mapview format (MAQ alignment results) to BED format.

Usage:  mapviewToBed.pl <MIN QUAL> <MAX_HITS> <FILE>
        MIN_QUAL        = Exclude alignments with <= this mapping quality score.
        MAX_HITS        = Exclude alignments with > this number of hits.
        FILE            = Mapview file to convert.

Example: perl mapviewToBed.pl 0 2 sequence.mapview
This takes as input sequence.mapview, only uses sequences aligning in fewer than 2 locations with 0 or 1 mismatches, and only sequences with an alignment score of 0 or better.

Features

Estimated fragment length.

As F-seq was created initially for DNase-seq data, no shift based on fragment length was actually necessary and a simple command line procedure was used if the user wanted to shift sequence based on strand. To bring F-seq up to speed with other technologies, we implemented a method to determine the distance between positive and negative strand sequence densities using the Wilcoxon-Mann-Whitney Rank Sum Test. This test allows for a non-parametric estimate of the difference in positive-negative sequence distances from the expected value at 0 by ranking all possible positive-negative sequence differences and taking the median difference as the estimate. For practical reasons we limit the fragment shift to less than 500bp. This method also allows for calculating a confidence interval for the median distance but it has not been implemented.

Strict enforcement of strand.

In addition to fragment length estimation for shifting sequences based on strand, we have also decided to enforce strandedness. That is, one expects a fragment from ChIP-seq to only be available to be pulled down only if it is pointing toward the bound protein. Because of this, we assume that any other fragments are simply noise and they are discarded.

Background model.

The human genome contains a large amount of DNA that exists in multiple copies. To attempt to call peaks in regions which may exist more than once in the genome, we developed a background model which represents how many places a given sequence can align. We have generative alignability tracks by converting a set of all longest unique n-mers at each genomic position into alignment tracks for 20-mers (DNase) and 35-mers (all others). Anything that aligns more than four times is discarded by setting the background track to 0 at this position. The value at each position is then multiplied by the density estimate to correct for peaks that may be diluted by their sequences being placed in different genomic positions by MAQ which randomly assigns locations of reads which align in more than one place.

Background directory – This directory must contain a formatted .bff tracks which represents alignability across the genome.

bff files for sequences of 20bp: HG18 HG19

bff files for sequences of 35bp: HG18 HG19

NEW – test bff files based on CRG mappability annotations

bff files for sequences of 100bp: HG19

Creating your own bff files (NEW)

bff files for use with F-Seq can be created for any sequence assembly using our java program bffBuilder. bffBuilder takes as input a fixed step, step=1 wig file that contains a header and one value on each succeeding line for each base. For assemblies with multiple chromosomes, each chromosome should have a bff file generated separately. Each line of the wig file should contain a value between 0 and 1 that reflects the uniqueness of sequences aligning to that base on either strand. For example, if the sequence tag length is 35 bases, and the 35bp sequence starting at that position on the positive strand occurs 3 times in the genome, and the 35bp sequence starting on the negative strand occurs 5 times in the genome, this could be assigned the value 1/8 or 0.125. We have also been successful in creating bff files from wig files generated by the software program GEM Mapper (see files above from the CRG Mappability annotation). Note: input wig files should have exactly one value for every base.

 

Usage: bffBuilder chr1.wig
Example wig input:
fixedStep chrom=chr1 start=1 step=1
0
0
0.5
0.25
1

A chr1.bff will be generated, based on the “chrom=chr1″ parameter in the header line of the wig file.

Copy number / karyotype correction.

Parallel to the issue of alignability in the genome is that of copy number variations. Copy number variations lead to regions of chromosomes being over-represented in the genomic DNA as compared to the genome build. These are especially problematic when performing analysis on cancer cell lines (eg. HelaS3 or K562). We want to remove the effect of regions by normalizing the signal based on a non-enriched input sample. To do this, an estimate of the copy depth of all sequences is computed by running a 10kb window across the aligned input sequences in the genome and comparing the sequence depth to the mean genomic sequencing depth. In a karyotypically normal cell line this value should be close to 1 for the entire genome. As copy number estimates for sequences increase, this value should reflect the actual sample copy number compared to the reference sequence. The Parzen density estimates are then divided by these values to correct for differences in DNase-seq or ChIP-seq data.

Ploidy directory – This directory must contain formatted .iff tracks which represent estimated ploidy information for the cell line of interest. Again, we will shortly provide the script to generate these tracks.

iff files using 10kb windowing for GM12878: HG18 HG19

iff files using 10kb windowing for K562: HG18 HG19

iff files using 10kb windowing for HelaS3: HG18 HG19

iff files using 10kb windowing for HepG2: HG18 HG19

iff files using 10kb windowing for MCF7: HG18 HG19

iff files using 10kb windowing for H54: HG19

iff files using 10kb windowing for FB8470: HG19

iff files using 10kb windowing for FB0167P: HG19

iff files using 10kb windowing for A549: HG19

For file types for which input data is not available, we have created “generic” iff files based on averages of the above iff files. While these will not correct specifically for copy number changes unique to a particular cell type, it will reduce other biases due to seuqnecing technologies and the DNase experiment.

generic male iff files using 10kb windowing: HG19

generic female iff files using 10kb windowing: HG19

To generate ploidy tracks, a series of wiggle files representing ploidy information for each chromosome should be converted into our iff format using iffBuilder. Each position should be a relative ploidy level (a normal cell line would have a 1 in each position). Ploidy level should be < 65.535.

 

Usage: iffBuilder chr1.wig
Example wig input:
fixedStep chrom=chr22 start=14430371 step=1
0.0032
0.0032
0.0032
0.0032
0.0032

Troubleshooting

A likely cause for errors is an “OutOfMemory” exception.
To increase the available memory to the java virtual machine, edit
‘bin/fseq’ file and change the JAVA_OPTS property to increase the heap size.

One Trackback

  1. [...] F-Seq is a software package that generates a continuous tag sequence density estimation allowing identification of biologically meaningful sites whose output can be displayed directly in the UCSC Genome Browser. [...]