BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters
Hailiang Huang, Sandeep Tata, and Robert J Prill. 2013. “BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters.” Bioinformatics, 29, 1, Pp. 135-6.
Abstract
SUMMARY: Computational workloads for genome-wide association studies (GWAS) are growing in scale and complexity, outpacing the capabilities of single-threaded software designed for personal computers. The BlueSNP R package implements GWAS statistical tests in the R programming language and executes the calculations across computer clusters configured with Apache Hadoop, a de facto standard framework for distributed data processing using the MapReduce formalism. BlueSNP makes computationally intensive analyses, such as estimating empirical p-values via data permutation and searching for expression quantitative trait loci over thousands of genes, feasible for large genotype-phenotype datasets.
AVAILABILITY AND IMPLEMENTATION: http://github.com/ibm-bioinformatics/bluesnp
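
To illustrate what estimating an empirical p-value via data permutation involves, the following is a minimal single-machine sketch in Python. It is not BlueSNP code and does not use the BlueSNP API: the function names (`abs_correlation`, `empirical_p_value`), the choice of absolute Pearson correlation as the test statistic, and the toy data are all assumptions made for illustration. The idea is to shuffle the phenotype labels many times and count how often the shuffled statistic is at least as extreme as the observed one.

```python
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant vector carries no association signal.
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def empirical_p_value(genotypes, phenotypes, n_permutations=1000, seed=0):
    """Permutation-based p-value for one SNP (hypothetical helper).

    Shuffling the phenotypes breaks any genotype-phenotype association,
    so the shuffled statistics approximate the null distribution.
    """
    rng = random.Random(seed)
    observed = abs_correlation(genotypes, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs_correlation(genotypes, shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data (made up): allele counts for 12 individuals and a trait value.
genotypes = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
phenotypes = [1.0, 1.1, 0.9, 1.0, 2.0, 2.1, 1.9, 2.0, 3.0, 3.1, 2.9, 3.0]
p_value = empirical_p_value(genotypes, phenotypes)
```

Each permutation, and each SNP, is evaluated independently of the others, which is why this kind of workload can be split across many workers in a MapReduce-style framework; how BlueSNP actually partitions the work is not described in the abstract.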