Statistical Genetics and Computational Biology

The Huang lab develops computational methods to tackle problems arising in human genetics research. Although we often draw inspiration from our work on the genetics of inflammatory bowel diseases and psychiatric disorders, the methods we develop can be applied broadly to most complex human disorders and traits. The Huang lab developed a Bayesian fine-mapping method to resolve known genetic associations to variants with high causal probabilities (Huang et al., Nature, 2017) and extended this method to model the diversity in linkage disequilibrium across ancestries to more precisely isolate causal alleles (Lam and Chen et al., Nature Genetics, 2019). Working with the Ge lab at the Massachusetts General Hospital, we are developing methods to address critical challenges in studies using multiple ancestral populations, including methods to improve polygenic risk prediction across ancestries, to further leverage genomic diversity to improve the resolution of fine-mapping, and to accurately characterize cross-ancestry genetic correlations. Other method development activities in the Huang lab include methods for admixed populations and for rare variant association studies using large-scale sequencing data with population structure.
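To make the idea of "variants with high causal probabilities" concrete, the sketch below turns per-variant log Bayes factors into posterior probabilities and a 95% credible set. It is a simplified illustration, not the lab's published method: it assumes exactly one causal variant per locus and equal prior probability for every variant, and the function names and input values are hypothetical.

```python
import math

def posterior_probabilities(log_bayes_factors):
    """Posterior probability that each variant is causal.

    Assumption (simplification): exactly one causal variant in the locus
    and a flat prior, so each posterior is proportional to the Bayes factor.
    """
    largest = max(log_bayes_factors)
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = [math.exp(x - largest) for x in log_bayes_factors]
    total = sum(weights)
    return [w / total for w in weights]

def credible_set(posteriors, coverage=0.95):
    """Smallest set of variant indices whose posteriors sum to >= coverage."""
    ranked = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    chosen, cumulative = [], 0.0
    for index in ranked:
        chosen.append(index)
        cumulative += posteriors[index]
        if cumulative >= coverage:
            break
    return chosen

# Hypothetical log Bayes factors for four variants in one locus.
log_bfs = [10.0, 9.0, 2.0, 0.5]
posteriors = posterior_probabilities(log_bfs)
variants_in_set = credible_set(posteriors)
```

A smaller credible set means the association has been resolved to fewer candidate variants, which is the sense in which fine-mapping "resolution" improves.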

Selected Publications in Computational Biology

Transcriptome-scale RNase-footprinting of RNA-protein complexes

Zhe Ji, Ruisheng Song, Hailiang Huang, Aviv Regev, and Kevin Struhl. 2016. “Transcriptome-scale RNase-footprinting of RNA-protein complexes.” Nat Biotechnol, 34, 4, Pp. 410-3.

Ribosome profiling is widely used to study translation in vivo, but not all sequence reads correspond to ribosome-protected RNA. Here we describe Rfoot, a computational pipeline that analyzes ribosomal profiling data and identifies native, nonribosomal RNA-protein complexes. We use Rfoot to precisely map RNase-protected regions within small nucleolar RNAs, spliceosomal RNAs, microRNAs, tRNAs, long noncoding (lnc)RNAs and 3′ untranslated regions of mRNAs in human cells. We show that RNAs of the same class can show differential complex association. Although only a subset of lncRNAs show RNase footprints, many of these have multiple footprints, and the protected regions are evolutionarily conserved, suggestive of biological functions.

HistoneHits: a database for histone mutations and their phenotypes

Hailiang Huang, Alexandra M Maertens, Edel M Hyland, Junbiao Dai, Anne Norris, Jef D Boeke, and Joel S Bader. 2009. “HistoneHits: a database for histone mutations and their phenotypes.” Genome Res, 19, 4, Pp. 674-81.

Histones are the basic protein components of nucleosomes. They are among the most conserved proteins and are subject to a plethora of post-translational modifications. Specific histone residues are important in establishing chromatin structure, regulating gene expression and silencing, and responding to DNA damage. Here we present HistoneHits, a database of phenotypes for systematic collections of histone mutants. This database combines assay results (phenotypes) with information about sequences, structures, post-translational modifications, and evolutionary conservation. The web interface presents the information through dynamic tables and figures. It calculates the availability of data for specific mutants and for nucleosome surfaces. The database currently includes 42 assays on 677 mutants multiply covering 405 of the 498 residues across yeast histones H3, H4, H2A, and H2B. We also provide an interface with an extensible controlled vocabulary for research groups to submit new data. Preliminary analyses confirm that mutations at highly conserved residues and modifiable residues are more likely to generate phenotypes. Buried residues and residues on the lateral surface tend to generate more phenotypes, while tail residues generate significantly fewer phenotypes than other residues. Yeast mutants are cross referenced with known human histone variants, identifying a position where a yeast mutant causes loss of ribosomal silencing and a human variant increases breast cancer susceptibility. All data sets are freely available for download.

Precision and recall estimates for two-hybrid screens

Hailiang Huang and Joel S Bader. 2009. “Precision and recall estimates for two-hybrid screens.” Bioinformatics, 25, 3, Pp. 372-8.

MOTIVATION: Yeast two-hybrid screens are an important method to map pairwise protein interactions. This method can generate spurious interactions (false discoveries), and true interactions can be missed (false negatives). Previously, we reported a capture-recapture estimator for bait-specific precision and recall. Here, we present an improved method that better accounts for heterogeneity in bait-specific error rates. RESULT: For yeast, worm and fly screens, we estimate the overall false discovery rates (FDRs) to be 9.9%, 13.2% and 17.0% and the false negative rates (FNRs) to be 51%, 42% and 28%. Bait-specific FDRs and the estimated protein degrees are then used to identify protein categories that yield more (or fewer) false positive interactions and more (or fewer) interaction partners. While membrane proteins have been suggested to have elevated FDRs, the current analysis suggests that intrinsic membrane proteins may actually have reduced FDRs. Hydrophobicity is positively correlated with decreased error rates and fewer interaction partners. These methods will be useful for future two-hybrid screens, which could use ultra-high-throughput sequencing for deeper sampling of interacting bait-prey pairs. AVAILABILITY: All software (C source) and datasets are available as supplemental files and at http://www.baderzone.org under the Lesser GPL v. 3 license.

Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps

Hailiang Huang, Bruno M Jedynak, and Joel S Bader. 2007. “Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps.” PLoS Comput Biol, 3, 11, Pp. e214.

Yeast two-hybrid screens are an important method for mapping pairwise physical interactions between proteins. The fraction of interactions detected in independent screens can be very small, and an outstanding challenge is to determine the reason for the low overlap. Low overlap can arise from either a high false-discovery rate (interaction sets have low overlap because each set is contaminated by a large number of stochastic false-positive interactions) or a high false-negative rate (interaction sets have low overlap because each misses many true interactions). We extend capture-recapture theory to provide the first unified model for false-positive and false-negative rates for two-hybrid screens. Analysis of yeast, worm, and fly data indicates that 25% to 45% of the reported interactions are likely false positives. Membrane proteins have higher false-discovery rates on average, and signal transduction proteins have lower rates. The overall false-negative rate ranges from 75% for worm to 90% for fly, which arises from a roughly 50% false-negative rate due to statistical undersampling and a 55% to 85% false-negative rate due to proteins that appear to be systematically lost from the assays. Finally, statistical model selection conclusively rejects the Erdős-Rényi network model in favor of the power law model for yeast and the truncated power law for worm and fly degree distributions. Much as genome sequencing coverage estimates were essential for planning the human genome sequencing project, the coverage estimates developed here will be valuable for guiding future proteomic screens. All software and datasets are available with the published article and from our Web site, http://www.baderzone.org.
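The link between low overlap and low coverage can be illustrated with the classic two-sample capture-recapture (Lincoln-Petersen) estimator. This is only a sketch of the underlying idea, not the unified model described in the abstract: it assumes every reported interaction is true, so it captures the false-negative side only, and the screen sizes used here are hypothetical.

```python
def lincoln_petersen(n_first, n_second, n_both):
    """Estimate the total number of true interactions from two screens.

    n_first  -- interactions reported by the first screen
    n_second -- interactions reported by the second screen
    n_both   -- interactions reported by both screens

    Assumption: both screens sample independently from the same set of
    true interactions and report no false positives.
    """
    if n_both == 0:
        raise ValueError("no overlap: the total cannot be estimated")
    return n_first * n_second / n_both

def coverage(n_observed, n_total):
    """Fraction of the estimated total that one screen detected."""
    return n_observed / n_total

# Hypothetical counts: two screens with a small shared set of interactions.
estimated_total = lincoln_petersen(n_first=900, n_second=800, n_both=120)
first_screen_coverage = coverage(900, estimated_total)
```

The smaller the overlap relative to the size of each screen, the larger the estimated total and the lower each screen's coverage. False positives also shrink the overlap, which is why a model that treats both error types jointly is needed to separate the two explanations.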

Selected Publications in Statistical Genetics

RICOPILI: Rapid Imputation for COnsortias PIpeLIne

Lam M, Awasthi S, Watson HJ, Goldstein J, Panagiotaropoulou G, Trubetskoy V, Karlsson R, Frei O, Fan CC, De Witte W, Mota NR, Mullins N, Brügger K, Lee H, Wray N, Skarabis N, Huang H, Neale B, Daly M, Mattheissen M, Walters R, Ripke S. RICOPILI: Rapid Imputation for COnsortias PIpeLIne. Bioinformatics. 2019 Aug 8. pii: btz633. Epub ahead of print.

BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters

Hailiang Huang, Sandeep Tata, and Robert J Prill. 2013. “BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters.” Bioinformatics, 29, 1, Pp. 135-6.

SUMMARY: Computational workloads for genome-wide association studies (GWAS) are growing in scale and complexity, outpacing the capabilities of single-threaded software designed for personal computers. The BlueSNP R package implements GWAS statistical tests in the R programming language and executes the calculations across computer clusters configured with Apache Hadoop, a de facto standard framework for distributed data processing using the MapReduce formalism. BlueSNP makes computationally intensive analyses, such as estimating empirical p-values via data permutation and searching for expression quantitative trait loci over thousands of genes, feasible for large genotype-phenotype datasets. AVAILABILITY AND IMPLEMENTATION: http://github.com/ibm-bioinformatics/bluesnp
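The following sketch shows what "estimating empirical p-values via data permutation" involves for a single variant, and why it becomes expensive when repeated genome-wide. It is a single-machine Python illustration, not BlueSNP's R interface; the test statistic (difference in mean phenotype between carriers and non-carriers), the function names, and the data are all assumptions made for the example.

```python
import random
import statistics

def mean_difference(phenotypes, genotypes):
    """Mean phenotype of carriers (genotype > 0) minus that of non-carriers."""
    carriers = [p for p, g in zip(phenotypes, genotypes) if g > 0]
    non_carriers = [p for p, g in zip(phenotypes, genotypes) if g == 0]
    return statistics.mean(carriers) - statistics.mean(non_carriers)

def empirical_p_value(phenotypes, genotypes, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for one variant.

    Phenotypes are shuffled relative to genotypes to simulate the null
    hypothesis of no association. Adding 1 to numerator and denominator
    keeps the estimate away from an impossible p-value of zero.
    """
    rng = random.Random(seed)
    observed = abs(mean_difference(phenotypes, genotypes))
    shuffled = list(phenotypes)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(shuffled, genotypes)) >= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_permutations + 1)

# Hypothetical data: allele counts (0, 1, or 2) and a quantitative trait.
genotypes = [0, 0, 0, 0, 1, 1, 2, 2]
phenotypes = [1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 2.2, 1.9]
p_value = empirical_p_value(phenotypes, genotypes)
```

Each variant requires its own set of permutations, so the cost grows with the number of variants times the number of permutations. Because the permutations are independent of one another, the work divides naturally across a cluster.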

Fast association tests for genes with FAST

Pritam Chanda, Hailiang Huang, Dan E Arking, and Joel S Bader. 2013. “Fast association tests for genes with FAST.” PLoS One, 8, 7, Pp. e68585.

Gene-based tests of association can increase the power of a genome-wide association study by aggregating multiple independent effects across a gene or locus into a single stronger signal. Recent gene-based tests have distinct approaches to selecting which variants to aggregate within a locus, modeling the effects of linkage disequilibrium, representing fractional allele counts from imputation, and managing permutation tests for p-values. Implementing these tests in a single, efficient framework has great practical value. Fast ASsociation Tests (Fast) addresses this need by implementing leading gene-based association tests together with conventional SNP-based univariate tests and providing a consolidated, easily interpreted report. Fast scales readily to genome-wide SNP data with millions of SNPs and tens of thousands of individuals, provides implementations that are orders of magnitude faster than original literature reports, and provides a unified framework for performing several gene-based association tests concurrently and efficiently on the same data. AVAILABILITY: https://bitbucket.org/baderlab/fast/downloads/FAST.tar.gz, with documentation at https://bitbucket.org/baderlab/fast/wiki/Home.

Gene-based tests of association

Hailiang Huang, Pritam Chanda, Alvaro Alonso, Joel S Bader, and Dan E Arking. 2011. “Gene-based tests of association.” PLoS Genet, 7, 7, Pp. e1002177.

Genome-wide association studies (GWAS) are now used routinely to identify SNPs associated with complex human phenotypes. In several cases, multiple variants within a gene contribute independently to disease risk. Here we introduce a novel Gene-Wide Significance (GWiS) test that uses greedy Bayesian model selection to identify the independent effects within a gene, which are combined to generate a stronger statistical signal. Permutation tests provide p-values that correct for the number of independent tests genome-wide and within each genetic locus. When applied to a dataset comprising 2.5 million SNPs in up to 8,000 individuals measured for various electrocardiography (ECG) parameters, this method identifies more validated associations than conventional GWAS approaches. The method also provides, for the first time, systematic assessments of the number of independent effects within a gene and the fraction of disease-associated genes housing multiple independent effects, observed at 35%-50% of loci in our study. This method can be generalized to other study designs, retains power for low-frequency alleles, and provides gene-based p-values that are directly compatible for pathway-based meta-analysis.