All data here are released under a Fort Lauderdale Agreement for the benefit of the wider biomedical community. You can freely download and search the data, and we encourage the use and publication of frequency data for specific targeted sets of variants (for instance, assessing a set of candidate causal variants observed in a collection of rare disease patients). However, we ask that you not publish global (genome-wide) analyses of these data, or of large gene sets, until after the ExAC flagship paper has been published (estimated to be in early 2015).
The data are available under the ODC Open Database License (ODbL) (summary available here): you are free to share and modify the ExAC data so long as you attribute any public use of the database, or works produced from the database; keep the resulting data-sets open; and offer your shared or adapted version of the dataset under the same ODbL license.
If you’re uncertain which category your analyses fall into, please email us.
We request that any use of data obtained from the ExAC browser cite our preprint on bioRxiv.
We also ask that the Consortium be acknowledged as follows:
The current data release (0.2) was initially generated across a combined data set of 91,796 exomes, from which 61,486 were extracted for public release based on consent, consortium permission, exome data quality, and lack of relatedness with other samples. For more information on the consortia that contributed individuals to the public release please see the About page. The full sites data set can be downloaded here.
Exome sequencing data were processed through a pipeline based on Picard, using base quality score recalibration and local realignment at known insertions/deletions (indels). We used the BWA aligner to map reads to human genome build 37 (hg19). The Genome Analysis Toolkit (GATK v3.1) HaplotypeCaller algorithm was then used to generate gVCFs for all 91,796 BAMs across a defined exome interval set, and known sites were annotated with dbSNP135. The gVCFs were combined in 298 groups (~sqrt(n) gVCFs in each group), and joint genotyping of SNPs and indels was then performed on all groups.
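The grouping step can be illustrated with a small Python sketch. It only shows the arithmetic of splitting 91,796 gVCFs into 298 groups of roughly sqrt(n) files each; how samples were actually assigned to groups is not stated here, so the contiguous split, the function name `split_into_groups`, and the file names are all assumptions made for illustration.

```python
import math


def split_into_groups(items, n_groups):
    """Split items into n_groups contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_groups)
    groups = []
    start = 0
    for i in range(n_groups):
        # The first `extra` groups take one additional item each.
        size = base + (1 if i < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups


n_samples = 91_796
# Hypothetical file names; the real gVCF paths are not given in the text.
gvcf_paths = [f"sample_{i:05d}.g.vcf.gz" for i in range(n_samples)]

groups = split_into_groups(gvcf_paths, 298)
sizes = sorted({len(g) for g in groups})  # distinct group sizes
approx_sqrt = math.isqrt(n_samples)       # integer part of sqrt(n)

print(len(groups), sizes, approx_sqrt)
```

With 298 groups, each group holds 308 or 309 gVCFs, close to the integer square root of the sample count (302).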
GATK Variant Quality Score Recalibration (VQSR) was used to filter variants. The SNP VQSR model was trained on HapMap3.3 and 1KG Omni2.5 SNP sites, and a 99.6% sensitivity threshold was used to filter variants; the indel VQSR model was trained on Mills et al. 1KG gold standard and Axiom Exome Plus sites, with a 95.0% sensitivity threshold. For more information on VQSR please refer to link. This resulted in ~80% of bi-allelic singleton SNPs being filtered. Based on analysis of Ti/Tv (the transition/transversion ratio), singleton transmission in trios, and validated sites, the VQSLOD PASS cutoff was adjusted, resulting in filtering of ~90% of bi-allelic singleton SNPs. An additional inbreeding coefficient filter (InbreedingCoeff <= -0.2) was applied to filter sites missed by VQSR filtering. Lastly, an additional filter labelled AC_Adj0_Filter was introduced to indicate that only low-quality genotype calls containing alternate alleles are present in the release subset.
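As a rough illustration of how these site-level filters combine, the following Python sketch assigns filter labels to a single site from its annotations. It is not the pipeline's code. The adjusted VQSLOD cutoff is not given in the text, so `vqslod_cutoff` is a hypothetical parameter and the value 0.0 used below is an arbitrary placeholder. The labels `VQSR` and `InbreedingCoeff_Filter` are likewise assumed names, and `AC_Adj` is assumed to hold the alternate allele count among high-quality genotype calls.

```python
def site_filters(info, vqslod_cutoff):
    """Return the filter labels for one site, or ["PASS"] if no filter applies.

    `info` is a dict of site annotations. The keys "VQSLOD",
    "InbreedingCoeff" and "AC_Adj" are assumed for this sketch.
    """
    labels = []

    # VQSR: sites scoring below the chosen VQSLOD cutoff are filtered.
    if info["VQSLOD"] < vqslod_cutoff:
        labels.append("VQSR")  # hypothetical label

    # Inbreeding coefficient filter from the text: InbreedingCoeff <= -0.2.
    inbreeding = info.get("InbreedingCoeff")
    if inbreeding is not None and inbreeding <= -0.2:
        labels.append("InbreedingCoeff_Filter")  # hypothetical label

    # AC_Adj0_Filter: no high-quality genotype call carries an alternate allele.
    if info["AC_Adj"] == 0:
        labels.append("AC_Adj0_Filter")

    return labels or ["PASS"]


PLACEHOLDER_CUTOFF = 0.0  # arbitrary; the real cutoff is not stated in the text

passing = site_filters(
    {"VQSLOD": 2.1, "InbreedingCoeff": 0.02, "AC_Adj": 5}, PLACEHOLDER_CUTOFF
)
excess_het = site_filters(
    {"VQSLOD": 1.5, "InbreedingCoeff": -0.35, "AC_Adj": 3}, PLACEHOLDER_CUTOFF
)
low_quality_only = site_filters(
    {"VQSLOD": -4.0, "InbreedingCoeff": 0.0, "AC_Adj": 0}, PLACEHOLDER_CUTOFF
)
print(passing, excess_het, low_quality_only)
```

A site can carry more than one label, which is why the function returns a list rather than a single string.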
For more information please refer to the GATK Best Practices.