Allele-Aware ALignments for the Investigation of GeNetic Effects on Regulation (AA-ALIGNER)

AA-ALIGNER: Investigating genetic effects on gene regulation using complete or incomplete genotypes

Martin L. Buchkovich1, Karl Eklund1, Qing Duan1, Yun Li1,2,3, Karen L. Mohlke1 and Terrence S. Furey1,4


Genetic variation can alter transcriptional regulatory activity, contributing to variation in complex traits and risk of disease, but identifying individual variants that affect regulatory activity has been challenging. Quantitative sequence-based experiments such as ChIP-seq and DNase-seq can detect sites where alleles contribute disproportionately to the overall signal, suggesting allelic differences in regulatory activity. We developed the AA-ALIGNER pipeline to accurately determine short-read sequence signal at sample-specific heterozygous sites and to detect allelic imbalance. AA-ALIGNER performs well even when limited or no sample-specific genotypes are available. Compared to using full genotype information, sites of allelic imbalance can be recovered with >95% sensitivity and >90% precision at heterozygous sites identified using imputed genotypes, and nearly as well at common variants when genotypes are unknown. In contrast, predicting additional heterozygous sites and imbalances using the sequence data led to >50% false positive rates. We evaluated the effects of key parameter settings and data characteristics on imbalance detection. Overall, total base coverage and signal dispersion across the genome most affected our ability to identify imbalance. Parameters such as imbalance significance, imputation quality thresholds, and alignment mismatches did not have large effects on accuracy. To assess the accuracy of imbalance predictions, we used electrophoretic mobility shift assays to functionally test for predicted allelic differences in CREB1 binding in the GM12878 lymphoblast cell line. Two variants whose alleles were confirmed to bind CREB1 with different strengths, rs2382818 and rs713875, lie within inflammatory bowel disease-associated loci. These studies provide empirically based guidelines and a robust pipeline to detect genetic variants that drive changes in regulatory activity.
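
To make the notion of allelic imbalance at a single heterozygous site concrete, the following is a minimal sketch, assuming an exact two-sided binomial test of the reference and alternate read counts against an expected 50:50 ratio. The choice of test, the function name `binomial_two_sided_p`, and the read counts are illustrative assumptions and are not taken from the AA-ALIGNER pipeline itself.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for the observed allele read counts.

    Hypothetical helper for illustration only; p is the expected fraction
    of reads carrying the reference allele under no imbalance.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads overlap the site, so there is no evidence of imbalance

    # Probability of every possible reference-allele read count 0..n.
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]

    # Sum the outcomes that are no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Illustrative counts: 10 reads on one allele and none on the other,
# versus an even 5/5 split.
skewed = binomial_two_sided_p(10, 0)
balanced = binomial_two_sided_p(5, 5)
print(f"10 vs 0 reads: p = {skewed:.6f}")
print(f" 5 vs 5 reads: p = {balanced:.6f}")
```

In practice, p-values from many sites would need multiple-testing correction, and a fixed 50:50 expectation ignores reference mapping bias, which is one reason allele-aware alignment matters.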