DeFCoM: Analysis and Modeling of Transcription Factor Binding Sites Using a Motif-centric Genomic Footprinter

Motivation: Identifying the locations of transcription factor binding sites is critical for understanding how gene transcription is regulated across different cell types and conditions. Chromatin accessibility experiments such as DNaseI sequencing (DNase-seq) and Assay for Transposase Accessible Chromatin sequencing (ATAC-seq) produce genome-wide data that include distinct “footprint” patterns at binding sites. Nearly all existing computational methods for detecting footprints from these data assume that footprint signals are highly homogeneous across footprint sites. Additionally, no comprehensive and systematic comparison has been performed of how well footprinting methods identify which motif sites for a specific factor are bound.

Results: Using DNase-seq data from the ENCODE project, we show that substantial, previously uncharacterized site-to-site variability exists in the footprint signal across motif sites for a transcription factor. To model this heterogeneity in the data, we introduce a novel supervised learning footprinter called DeFCoM (Detecting Footprints Containing Motifs). We compare DeFCoM to nine existing methods using evaluation sets from four human cell lines and eighteen transcription factors and show that DeFCoM outperforms current methods in distinguishing bound from unbound motif sites. We also analyze the impact of several biological and technical factors on the quality of footprint predictions to highlight important considerations when conducting footprint analyses and assessing the performance of footprint prediction methods. Lastly, we show that DeFCoM can detect footprints using ATAC-seq data with accuracy similar to that achieved with DNase-seq data.

Software
Python code is available on Bitbucket.

Contact
Bryan Quach (bquach@email.unc.edu)

Datasets
The following files contain the locations of motif sites for individual transcription factors in four different cell types. Each site was designated as either active or inactive based on ChIP-seq data for that factor in that cell type. Motif locations were determined using position weight matrices downloaded from http://compbio.mit.edu/encode-motifs/. ChIP-seq data sets are from the ENCODE project. The human reference genome hg19 (GRCh37) was used. All files are compressed tarballs.

README – Description of files.

GM12878: lymphoblastoid cell line (active sites, inactive sites)

K562: leukemia cell line (active sites, inactive sites)

H1 hESC: embryonic stem cell line (active sites, inactive sites)

HepG2: hepatocellular carcinoma cell line (active sites, inactive sites)
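
Because the files above are distributed as compressed tarballs, a short script can read the motif sites without unpacking them to disk. The following is a minimal sketch, not part of DeFCoM. The function name `read_motif_sites` is hypothetical, and the code assumes that each archive member is a tab-delimited text file, which this page does not state. Consult the README for the actual file layout before relying on any column.

```python
import io
import tarfile


def read_motif_sites(tarball_path):
    """Yield (member_name, fields) for every data line in a tarball.

    Assumption: each regular file in the archive is tab-delimited text.
    The real column layout is documented in the README, not here.
    """
    # "r:*" lets tarfile detect the compression type on its own.
    with tarfile.open(tarball_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8")
            for line in text:
                line = line.rstrip("\r\n")
                # Skip blank lines and comment-style header lines.
                if not line or line.startswith("#"):
                    continue
                yield member.name, line.split("\t")
```

Calling `list(read_motif_sites(path))` on one of the downloaded archives, where `path` is wherever the tarball was saved, returns one entry per site, each paired with the name of the file it came from. Keeping the member name makes it possible to tell transcription factors apart if an archive holds one file per factor, which is also an assumption here.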