This site makes available summary-level data from a recent analysis of cis-genetic associations for the plasma proteome in a large (N~9,000) multi-ethnic dataset generated from the Atherosclerosis Risk in Communities (ARIC) study. Researchers can download model coefficients for genetically imputing proteins in European American (EA) and African American (AA) populations and use them to conduct proteome-wide association studies (PWAS) with summary statistics from external genome-wide association studies. Researchers can also visualize and download summary statistics from the pQTL and subsequent fine-mapping analyses for each of the two ethnic groups. The site is currently under construction, so please check back frequently for updates.
For the details of the study, please see our paper published in Nature Genetics.
This site is actively maintained and updated by:
To report errors or ask questions about the methods, results, or visualizations on this site, please contact Jingning Zhang or Diptavo Dutta.
Update log:
05/02/2022: Paper published in Nature Genetics (https://www.nature.com/articles/s41588-022-01051-w).
10/06/2021: Updated summary statistics, PWAS prediction models, and fine-mapping results using +/-500Kb cis window in genotype data, based on the latest revision in the manuscript (https://www.biorxiv.org/content/10.1101/2021.03.15.435533v2). Please download the updated files for your research.
06/11/2021: Updated PWAS prediction models with corrected rsIDs. Added a new tab for bulk download of summary statistics for the genes used in PWAS, i.e. those with significant cis-heritability estimated by GCTA (p-value<0.01).
04/25/2020: Added variant allele information to the display table in the Fine-mapping results.
03/19/2020: Current version of the database includes results for ~2000 proteins (click to view). Please check whether your protein of interest is currently included in our database. We will keep updating the database with further results.
03/17/2020: Initial version of the results released.
We provide summary statistics for the 4,657 plasma proteins in our analysis. Summary statistics were generated with PLINK2 for cis-SNPs within +/-500Kb of the transcription start site of the protein-encoding gene. The column format is described on the PLINK2 file format page. Each file is named by the SeqID of the SOMAmer used to measure the protein level. The table matching SeqID to protein ID (uniprot_id and entrezgenesymbol) can be found here: (click to download). SNP positions are given on human genome assembly GRCh38.
Download plasma proteins' cis summary statistics from European Americans (EA): (click to download)
Download plasma proteins' cis summary statistics from African Americans (AA): (click to download)
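After downloading, a per-protein file can be screened from the command line. The sketch below is not part of the released scripts: it assumes the file is a whitespace-delimited table with a header row whose p-value column is named P (as in PLINK2 association output), and the function name filter_cis_pqtl is hypothetical. It finds the P column by name rather than by position, so it does not depend on the exact column order.

```shell
# filter_cis_pqtl FILE THRESHOLD
# Prints the header plus every row whose p-value is below THRESHOLD.
# Assumption: whitespace-delimited table with a header column named "P".
filter_cis_pqtl() {
  awk -v thr="$2" '
    NR == 1 {
      # locate the p-value column by its header name
      for (i = 1; i <= NF; i++) if ($i == "P") pcol = i
      if (!pcol) { print "no P column found" > "/dev/stderr"; exit 1 }
      print; next
    }
    # skip missing values, keep rows below the threshold
    $pcol != "NA" && ($pcol + 0) < (thr + 0)
  ' "$1"
}
```

Usage: call filter_cis_pqtl with the path to one downloaded summary-statistics file and a threshold of your choice, for example 5e-8.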
Introduction
This tutorial shows how to run a proteome-wide association study (PWAS) and then perform a bivariate conditional analysis to interpret PWAS findings in the context of those from transcriptome-wide association studies (TWAS).
The PWAS pipeline is adapted from TWAS/FUSION with some modifications. The bivariate conditional analysis is described in our recent paper. Please cite both the primary TWAS methods paper and the recent pQTL and PWAS paper:
- Gusev, et al. “Integrative approaches for large-scale transcriptome-wide association studies” 2016 Nature Genetics
- Zhang, Chatterjee, et al. “Plasma proteome analyses in individuals of European and African ancestry identify cis-pQTLs and models for proteome-wide association studies” 2022 Nature Genetics
Note that the trained models and required data are stored separately for the two ethnic groups (PWAS_[EA/AA]). [EA/AA] in file or directory names indicates the ancestry in which the models were trained (EA stands for European Americans and AA for African Americans). We recommend using the models whose ancestry matches that of your GWAS summary data.
Installation
Launch R and install required libraries:
install.packages(c('optparse','RColorBrewer','readr','stringr','dplyr'))
install.packages('plink2R-master/plink2R/',repos=NULL)
Note: Some users have had issues installing plink2R. Please try the command below in R if the one above does not work.
devtools::install_github("carbocation/plink2R/plink2R", ref="carbocation-permit-r361")
Download and unzip the scripts:
mkdir PWAS
cd PWAS
wget https://jh-pwas.s3.amazonaws.com/packages/scripts.zip
unzip scripts.zip && rm scripts.zip
Download and unzip the required data:
wget https://jh-pwas.s3.amazonaws.com/packages/LDref.zip
unzip LDref.zip && rm LDref.zip
## choose to download models for EA / AA
wget https://jh-pwas.s3.amazonaws.com/packages/PWAS_EA.zip
unzip PWAS_EA.zip && rm PWAS_EA.zip
wget https://jh-pwas.s3.amazonaws.com/packages/PWAS_AA.zip
unzip PWAS_AA.zip && rm PWAS_AA.zip
## GTEx V7 list for performing conditional analysis
wget https://jh-pwas.s3.amazonaws.com/packages/GTEx_V7_list.zip
unzip GTEx_V7_list.zip && rm GTEx_V7_list.zip
We also provide some example output:
wget https://jh-pwas.s3.amazonaws.com/packages/Results.zip
unzip Results.zip && rm Results.zip
PWAS – analysis and output
Required data to perform PWAS
LD reference data (LDref in our data directory), weights (e.g. Plasma_Protein_weights_[EA/AA] in our data directory), and weight-loading files (e.g. Plasma_Protein_[EA/AA]_hg19.pos in our data directory).
Input 1: GWAS summary statistics
The summary statistics use the same input format as TWAS/FUSION: a whitespace-separated table with a header row containing SNP (rsID), A1 (effect allele), A2 (other allele), and Z (z-score).
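Many GWAS releases report an effect size and standard error rather than a z-score. The sketch below shows one way to produce the SNP/A1/A2/Z table described above, using Z = BETA / SE. It is not part of the released scripts: the function name make_sumstats and the input column names BETA and SE are assumptions, and it presumes that A1 in your file is the allele the effect size refers to.

```shell
# make_sumstats FILE
# Reads a whitespace-delimited GWAS table whose header includes
# SNP, A1, A2, BETA and SE (assumed input column names) and prints
# a tab-separated SNP/A1/A2/Z table, with Z = BETA / SE.
make_sumstats() {
  awk '
    NR == 1 {
      # map header names to column positions
      for (i = 1; i <= NF; i++) col[$i] = i
      n = split("SNP A1 A2 BETA SE", need, " ")
      for (j = 1; j <= n; j++)
        if (!(need[j] in col)) { print "missing column: " need[j] > "/dev/stderr"; exit 1 }
      print "SNP\tA1\tA2\tZ"
      next
    }
    # rows with a missing or zero standard error are dropped
    ($(col["SE"]) + 0) > 0 {
      printf "%s\t%s\t%s\t%.4f\n", $(col["SNP"]), $(col["A1"]), $(col["A2"]), $(col["BETA"]) / $(col["SE"])
    }
  ' "$1"
}
```

Redirect the output to the file you pass to --sumstats. Check the allele coding of your GWAS before use, because a swapped A1/A2 flips the sign of Z.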
Input 2: Weights
Imputation models are available for 1,318 (EA) and 1,368 (AA) plasma proteins with significant non-zero cis-heritability as estimated by GCTA (p<0.01). Currently, only enet (elastic-net) and top1 (best single SNP) models are available. We recommend using enet models because they have better prediction accuracy.
Inside PWAS_[EA/AA], weights are stored in Plasma_Protein_weights_[EA/AA]. Weights are loaded from Plasma_Protein_[EA/AA]_[hg19/hg38].pos, where [hg19/hg38] is the genome build used to locate the proteins. We recommend the hg19 file, which matches the TWAS models based on GTEx V7 from FUSION.
Input 3: LD reference data
LD reference data for European (EUR) and African (AFR) ancestry from the 1000 Genomes Project (1000G) are stored in LDref.
Example for performing association test
The association test is run one chromosome at a time. Below is an example using data on chromosome 22:
CHR=22
#until [ $CHR -lt 1 ]
#do
Rscript ./scripts/PWAS.assoc_test.R \
--sumstats ./sumstats.txt \
--weights ./PWAS_EA/Plasma_Protein_EA_hg19.pos \
--weights_dir ./PWAS_EA/Plasma_Protein_weights_EA/ \
--ref_ld_chr ./LDref/EUR/chr \
--force_model enet \
--chr ${CHR} \
--out ./Results/Example_PWAS_by_chr/chr${CHR}.out
#let CHR-=1
#done
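To cover all autosomes, the same command can be wrapped in a loop. The following is a minimal sketch rather than part of the released scripts: the function name pwas_all_chr and the DRY_RUN switch are hypothetical conveniences, and the AFR subfolder of LDref is assumed by analogy with the EUR path used in the example.

```shell
# pwas_all_chr POP LDPOP
#   POP   : EA or AA, selects ./PWAS_[EA/AA]
#   LDPOP : LD reference subfolder under ./LDref
#           (EUR as in the example; AFR for AA is an assumption)
# Set DRY_RUN=1 to print the commands instead of running them.
pwas_all_chr() {
  pop=$1
  ldpop=$2
  for chr in $(seq 1 22); do
    cmd="Rscript ./scripts/PWAS.assoc_test.R"
    cmd="$cmd --sumstats ./sumstats.txt"
    cmd="$cmd --weights ./PWAS_${pop}/Plasma_Protein_${pop}_hg19.pos"
    cmd="$cmd --weights_dir ./PWAS_${pop}/Plasma_Protein_weights_${pop}/"
    cmd="$cmd --ref_ld_chr ./LDref/${ldpop}/chr"
    cmd="$cmd --force_model enet --chr ${chr}"
    cmd="$cmd --out ./Results/Example_PWAS_by_chr/chr${chr}.out"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$cmd"            # show what would be run
    else
      $cmd || return 1       # stop at the first failing chromosome
    fi
  done
}
```

For example, DRY_RUN=1 pwas_all_chr EA EUR prints the 22 commands so the paths can be checked before a full run.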
Output: Gene-disease association
The output file has the same format as TWAS/FUSION. (The only differences are eQTL ⟶ pQTL; TWAS ⟶ PWAS)
Bivariate conditional analysis for TWAS and PWAS
We provide a framework for interpreting results from PWAS in conjunction with analogous results from TWAS. As demonstrated in our paper, the genetic correlation between gene expression in tissues and protein levels in plasma due to cis-genetic regulation is moderate. As a result, a bivariate conditional analysis can be used to explore whether PWAS signals could be explained by cis-genetic regulation of the expression of nearby genes (within a 1Mb region), and vice versa.
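As a rough guide to what a conditional z-score means, suppose the PWAS and TWAS z-scores for a protein and a nearby gene, z_P and z_T, are treated as jointly normal under the null with correlation r, where r is the correlation of the cis-regulated components estimated in reference individuals. The textbook result for a bivariate normal then gives the expressions below. Whether PWAS.conditional.R uses exactly this form is an assumption; please refer to the paper and the script for the precise definition.

```latex
z_{P \mid T} = \frac{z_P - r\, z_T}{\sqrt{1 - r^2}}, \qquad
z_{T \mid P} = \frac{z_T - r\, z_P}{\sqrt{1 - r^2}}
```

Under this form, a PWAS signal that is fully explained by a highly correlated TWAS gene yields a conditional z-score near zero, while a signal independent of the TWAS gene (r close to 0) is left essentially unchanged.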
Required data to perform conditional analysis
TWAS output from enet models for significantly cis-heritable gene expression only, PWAS output from our pipeline using enet models, and pre-imputed cis-regulated plasma protein and gene expression levels for 1000G reference individuals (stored in 1000G_imputed_[EA/AA] inside PWAS_[EA/AA]).
Input 1: PWAS and TWAS output tables
Output tables in the FUSION format are required. Both PWAS and TWAS must use enet models (run with the option --force_model enet), because the imputed plasma protein and gene expression levels for reference individuals were pre-computed using enet models (stored in 1000G_imputed_[EA/AA] inside PWAS_[EA/AA]).
The per-chromosome output tables from TWAS/PWAS need to be merged into a single table. Example R code:
library(dplyr)
library(readr)
results <- tibble()
for (chr in 1:22) {
results <- rbind(results, read_tsv(paste0("./Results/Example_PWAS_by_chr/chr", chr, ".out")))
if(chr==6){
results <- rbind(results, read_tsv(paste0("./Results/Example_PWAS_by_chr/chr6.out.MHC")))
}
}
write_tsv(results, "./Results/PWAS.out")
TWAS output tables for different tissues should be stored as separate files named tissue.out in the same folder. For example, the TWAS output for whole blood should be named Whole_Blood.out.
Input 2: list of TWAS tissues to be analyzed
The full list of TWAS tissues to include in the conditional analysis should be stored in a text file, one tissue per line. An example is the list of all GTEx V7 tissues, GTEx_V7_tissue_list.txt.
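Before running the conditional analysis, it can help to confirm that every tissue in the list has a matching tissue.out table. The sketch below is an optional check that is not part of the released scripts, and the function name check_twas_tables is hypothetical.

```shell
# check_twas_tables TISSUE_LIST TWAS_DIR
# Prints the name of every tissue that has no TWAS_DIR/<tissue>.out file
# and returns non-zero if any table is missing.
check_twas_tables() {
  missing=0
  # the "|| [ -n ... ]" part also handles a last line without a newline
  while IFS= read -r tissue || [ -n "$tissue" ]; do
    [ -z "$tissue" ] && continue      # skip blank lines
    if [ ! -f "$2/$tissue.out" ]; then
      echo "missing: $tissue"
      missing=1
    fi
  done < "$1"
  return "$missing"
}
```

With the layout used in this tutorial, the call would be check_twas_tables ./GTEx_V7_tissue_list.txt ./Results/TWAS.out.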
Input 3: imputed plasma protein and gene expression reference data
Cis-genetically regulated plasma protein and gene expression levels were pre-imputed for publicly available 1000G individuals. These data are stored in 1000G_imputed_[EA/AA] inside PWAS_[EA/AA].
Example for performing conditional analysis of TWAS and PWAS
In this example, the PWAS output table is stored in ./Results/PWAS.out, and the TWAS output tables for all tissues are stored in the folder ./Results/TWAS.out/.
Rscript ./scripts/PWAS.conditional.R \
--PWAS ./Results/PWAS.out \
--TWAS ./Results/TWAS.out/ \
--tissue_list ./GTEx_V7_tissue_list.txt \
--tissue_n_gene ./GTEx_V7_n_gene.rds \
--imputed_P ./PWAS_EA/1000G_imputed_EA/1000G_imputed_Plasma_Protein.txt \
--imputed_T ./PWAS_EA/1000G_imputed_EA/1000G_imputed_FUSION/ \
--out ./Results/ConditionalAnalysis/
Output: bivariate conditional analysis of PWAS and TWAS in all tissues
This code will first generate a tissue.RDat file for each tissue. These RDat files contain the following variables (below, T denotes gene expression and P denotes plasma protein):
dat.sentinel.pwas: The regional sentinel PWAS genes (+/- 500Kb).
PcT.z: z-score of P conditional on T.
PcT.p: p-value of P conditional on T.
TcP.z: z-score of T conditional on P.
TcP.p: p-value of T conditional on P.
twas.hit: the most significant nearby TWAS gene (+/- 500Kb) for each regional sentinel PWAS gene.
twas.p: p-value of the most significant nearby TWAS gene for each regional sentinel PWAS gene.
dist: distance (bp) between each sentinel PWAS gene and its most significant nearby TWAS gene.
corr: cis-regulated genetic correlation between each sentinel PWAS gene and its most significant nearby TWAS gene.
It will then generate summary tables for tissue-specific conditional analysis and an all-tissue conditional analysis. The columns are:
PWAS_hit : PWAS significant gene
PWAS_p : p-value of the PWAS significant gene
TWAS_hit : the most significant nearby TWAS gene (+/- 500Kb) for each regional sentinel PWAS gene.
TWAS_p : p-value of the most significant nearby TWAS gene for each regional sentinel PWAS gene.
Dist_of_hits : distance (bp) between each sentinel PWAS gene and its most significant nearby TWAS gene.
Corr_of_hits : cis-regulated genetic correlation between each sentinel PWAS gene and its most significant nearby TWAS gene.
PcT_p : p-value of P conditional on T.
TcP_p : p-value of T conditional on P.
min_TWAS_hit : the most significant nearby TWAS gene (+/- 500Kb) for each regional sentinel PWAS gene in all-tissue analysis.
min_TWAS_Tissue : the tissue with the most significant nearby TWAS gene (+/- 500Kb) for each regional sentinel PWAS gene in all-tissue analysis.
min_TWAS_p : the p-value of the most significant nearby TWAS gene (+/- 500Kb) for each regional sentinel PWAS gene in all-tissue analysis.
N_significant_tissues_in_TWAS : the number of significant TWAS tissues in all-tissue analysis.
all_significant_tissues_in_TWAS : all significant TWAS tissues in all-tissue analysis.
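One common follow-up is to list the PWAS hits that stay significant after conditioning on the nearby TWAS gene. The sketch below assumes the summary table is tab-separated with a header row that uses the column names listed above; the on-disk format of the table, the function name pwas_hits_robust_to_twas, and the threshold are assumptions to adapt to your output.

```shell
# pwas_hits_robust_to_twas TABLE ALPHA
# Prints PWAS_hit and PcT_p for rows whose PcT_p is below ALPHA.
# Assumption: TABLE is tab-separated with PWAS_hit and PcT_p header columns.
pwas_hits_robust_to_twas() {
  awk -F '\t' -v alpha="$2" '
    NR == 1 {
      # map header names to column positions
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("PWAS_hit" in col) || !("PcT_p" in col)) exit 1
      next
    }
    # ignore rows without a conditional p-value
    $(col["PcT_p"]) != "NA" && $(col["PcT_p"]) != "" && ($(col["PcT_p"]) + 0) < (alpha + 0) {
      print $(col["PWAS_hit"]) "\t" $(col["PcT_p"])
    }
  ' "$1"
}
```

Swapping PcT_p for TcP_p in the same pattern gives the reverse question, i.e. which TWAS signals persist after conditioning on the protein.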
To explore pQTL results, use either the Protein View or the SNP View.
Protein View
In this view, individual proteins or corresponding genes can be queried to visualize the pQTL results.
Input: Either the UniProt ID or the gene symbol. Note: Please don't use both. UniProt ID is the default, so when searching by gene symbol, please make sure the UniProt ID box is cleared.
This will produce 4 different plots:
- pQTL: EA : regional pQTL plot for the SNPs within +/- 1Mb of the transcription start site (cis-SNPs) of the gene corresponding to the protein in European Americans (EA). -log10(p-values) are on the vertical axis and the physical positions of SNPs are on the horizontal axis.
- Fine-mapping: EA : Fine-mapping (using SuSiE) for the cis-SNPs of the protein in EA. Posterior inclusion probabilities (PIP) are on the vertical axis and the physical positions of SNPs are on the horizontal axis.
- pQTL: AA : regional pQTL plot for the SNPs within +/- 1Mb of the transcription start site (cis-SNPs) of the gene corresponding to the protein in African Americans (AA). -log10(p-values) are on the vertical axis and the physical positions of SNPs are on the horizontal axis.
- Fine-mapping: AA : Fine-mapping (using SuSiE) for the cis-SNPs of the protein in AA. Posterior inclusion probabilities (PIP) are on the vertical axis and the physical positions of SNPs are on the horizontal axis.
The points are colored by LD with the sentinel SNP in EA and AA.
To view the top SNPs ordered by PIP, choose either EA or AA and click Update View. This displays the summary statistics for the corresponding population in tabular format, where each row provides information on one SNP. The columns are:
- CHR: chromosome
- position (hg38): Position of the SNP in hg38
- rsID: rs ID of the variant
- Tested Allele: The allele of the variant that was tested for association. By convention, the minor allele is tested.
- AF: Minor allele frequency
- Beta: Effect size of the coded allele from cis-pQTL analysis
- SE: standard error of the effect size of the coded allele
- pvalue: p-value of the effect of the coded allele
- PIP: Posterior inclusion probability of the SNP
You can also download the full summary statistics using the download button.
SNP View
In this view, SNPs can be queried to visualize the pQTL results. Please note that this mode is currently fairly slow. We are actively working to improve its query time.
Input : The coordinates of the SNP as chromosome number and position in GRCh38.
This will produce a table of the proteins whose cis-region contains the SNP. The table contains the UniProt ID, Gene ID, chromosome, and transcription start site (TSS) of the gene corresponding to each protein.
Click on a row to produce the plots for that protein. The plots are the same as in the Protein View for the corresponding protein, with the queried SNP marked separately in black. To view the top SNPs ordered by PIP, use the Update View button for the population of interest. You can also download the full summary statistics using the download button.
Important
If you get an error at any stage of visualization, please make sure the protein is currently included in our database. The list of all proteins included in the database can be found in the About tab.
List of Proteins
Current version of the database includes results for ~2000 proteins (click to view).
Before querying the database please check whether the protein of interest is included in the current version of the database.