
    This site makes available summary-level data from a recent analysis of cis-genetic associations for the plasma proteome in a large (N~9,000) multi-ethnic dataset generated from the Atherosclerosis Risk in Communities (ARIC) study. Researchers can download model coefficients for genetically imputing proteins in European American (EA) and African American (AA) populations and use them to conduct proteome-wide association studies (PWAS) with summary statistics from external genome-wide association studies. Researchers can also visualize and download pQTL summary statistics and the subsequent fine-mapping results for each of the two populations. The site is still under construction, so please check back frequently for updates.



    For the details of the study, please see our paper published in Nature Genetics.



    This site is actively maintained and updated by:

    • Benjamin Harvey
    • Jingning Zhang
    • Diptavo Dutta
    • Nilanjan Chatterjee


    To report errors, or if you have questions about the methods, results, or visualizations on this site, please contact Jingning Zhang or Diptavo Dutta.


    Update log:

    05/02/2022: Paper published in Nature Genetics (https://www.nature.com/articles/s41588-022-01051-w).

    10/06/2021: Updated summary statistics, PWAS prediction models, and fine-mapping results using +/-500Kb cis window in genotype data, based on the latest revision in the manuscript (https://www.biorxiv.org/content/10.1101/2021.03.15.435533v2). Please download the updated files for your research.

    06/11/2021: Updated PWAS prediction models with corrected rsIDs. Added a new tab for bulk download of summary statistics for the genes used in PWAS, which have significant cis-heritability estimated by GCTA (p-value < 0.01).

    04/25/2020: Updated details in display table in Fine-mapping results with variant allele information.

    03/19/2020: The current version of the database includes results for ~2,000 proteins (click to view). Please check whether your protein of interest is currently included in our database. We will keep updating the database with further results.

    03/17/2020: Initial version of the results released.

    We provide summary statistics for 4,657 plasma proteins in our analysis. Summary statistics were generated with PLINK2 for cis-SNPs within +/-500Kb of the transcription start site of the protein-encoding gene. The column format is described on the PLINK2 file format page. Files are named by the SeqID of the SOMAmer used to measure the protein level. The table matching SeqID to protein ID (uniprot_id and entrezgenesymbol) can be found here: (click to download). SNP positions are based on human genome assembly GRCh38.



    Download plasma proteins' cis summary statistics from European Americans (EA): (click to download)

    Download plasma proteins' cis summary statistics from African Americans (AA): (click to download)





    Introduction

    This tutorial covers how to perform a proteome-wide association study (PWAS) and then a bivariate conditional analysis to interpret PWAS findings in the context of those from transcriptome-wide association studies (TWAS).

    The PWAS pipeline is adapted from TWAS/FUSION with some modifications. The bivariate conditional analysis is described in our recent paper. Please cite both the paper describing the primary TWAS methods and the recent pQTL and PWAS paper:

    • Gusev, et al. “Integrative approaches for large-scale transcriptome-wide association studies” 2016 Nature Genetics
    • Zhang, Chatterjee, et al. “Plasma proteome analyses in individuals of European and African ancestry identify cis-pQTLs and models for proteome-wide association studies” 2022 Nature Genetics

    Note that the trained models and required data are stored separately for the two ancestry groups (PWAS_[EA/AA]). [EA/AA] in file or directory names indicates the ancestry in which the models were trained (EA stands for European Americans and AA for African Americans). We recommend matching the ancestry of the models to that of your GWAS summary data.

    Installation

    Launch R and install required libraries:

    install.packages(c('optparse','RColorBrewer','readr','stringr','dplyr'))
    install.packages('plink2R-master/plink2R/',repos=NULL)
    

    Note: Some users have reported issues installing plink2R. Please try the command below in R if the one above does not work.

    devtools::install_github("carbocation/plink2R/plink2R", ref="carbocation-permit-r361")
    

    Download and unzip the scripts:

    mkdir PWAS
    cd PWAS
    
    wget https://jh-pwas.s3.amazonaws.com/packages/scripts.zip
    unzip scripts.zip && rm scripts.zip
    

    Download and unzip the required data:

    wget https://jh-pwas.s3.amazonaws.com/packages/LDref.zip
    unzip LDref.zip && rm LDref.zip
    
    ## choose to download models for EA / AA
    wget https://jh-pwas.s3.amazonaws.com/packages/PWAS_EA.zip
    unzip PWAS_EA.zip && rm PWAS_EA.zip
    wget https://jh-pwas.s3.amazonaws.com/packages/PWAS_AA.zip
    unzip PWAS_AA.zip && rm PWAS_AA.zip
    
    ## GTEx V7 list for performing conditional analysis
    wget https://jh-pwas.s3.amazonaws.com/packages/GTEx_V7_list.zip
    unzip GTEx_V7_list.zip && rm GTEx_V7_list.zip
    

    We also provide some example output results:

    wget https://jh-pwas.s3.amazonaws.com/packages/Results.zip
    unzip Results.zip && rm Results.zip
    

    PWAS – analysis and output

    Required data to perform PWAS

    PWAS requires LD reference data (LDref in our data directory), weights (e.g. Plasma_Protein_weights_[EA/AA] in our data directory), and weight-loading files (e.g. Plasma_Protein_[EA/AA]_hg19.pos in our data directory).

    Input 1: GWAS summary statistics

    The summary statistics use the same input format as TWAS/FUSION: a whitespace-separated table with a header row containing SNP (rsID), A1 (effect allele), A2 (other allele), and Z (z-score).
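
    If a GWAS file reports effect sizes and standard errors rather than z-scores, the Z column can be derived as effect size divided by standard error. The sketch below shows one way to do this with awk. The input column names (rsid, effect_allele, other_allele, beta, se) are assumptions chosen for illustration and will differ between GWAS releases, and the function name make_sumstats is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: convert a GWAS table into the SNP / A1 / A2 / Z layout.
# The input column names (rsid, effect_allele, other_allele, beta, se)
# are assumptions for illustration; adjust them to your file's header.
make_sumstats() {
    awk '
        NR == 1 {
            # Map header names to column indices
            for (i = 1; i <= NF; i++) col[$i] = i
            print "SNP", "A1", "A2", "Z"
            next
        }
        # Keep only rows with a positive standard error
        $(col["se"]) + 0 > 0 {
            printf "%s %s %s %.4f\n", $(col["rsid"]), $(col["effect_allele"]), \
                $(col["other_allele"]), $(col["beta"]) / $(col["se"])
        }
    ' "$1"
}

# Small demonstration on made-up rows
demo_gwas=$(mktemp)
printf 'rsid effect_allele other_allele beta se\nrs1 A G 0.10 0.05\nrs2 C T -0.30 0.10\n' > "$demo_gwas"
make_sumstats "$demo_gwas"
rm -f "$demo_gwas"
```

    Redirect the output to a file (for example ./sumstats.txt) to use it as the --sumstats input. Make sure A1 is the allele that the effect size refers to, since the sign of Z depends on it.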

    Input 2: Weights

    Imputation models are available for 1,318 and 1,368 plasma proteins with significant non-zero cis-heritability estimated by GCTA (p < 0.01) in EA and AA, respectively. Currently, only enet (elastic-net) and top1 (best single SNP) models are available. We recommend using enet models because they have better prediction accuracy.

    Inside PWAS_[EA/AA], weights are stored in Plasma_Protein_weights_[EA/AA] and are loaded through Plasma_Protein_[EA/AA]_[hg19/hg38].pos, where [hg19/hg38] is the genome build used to locate the proteins. We recommend using hg19 to match the TWAS models based on GTEx V7 from FUSION.

    Input 3: LD reference data

    LD reference data for European (EUR) and African (AFR) ancestry from 1000 Genomes Project (1000G) are stored in LDref.

    Example for performing association test

    The association test is performed one chromosome at a time. Below is an example using data on chromosome 22; uncomment the four commented loop lines to run chromosomes 22 down to 1:

    CHR=22
    #until [ $CHR -lt 1 ]
    #do
    Rscript ./scripts/PWAS.assoc_test.R \
    --sumstats ./sumstats.txt \
    --weights ./PWAS_EA/Plasma_Protein_EA_hg19.pos \
    --weights_dir ./PWAS_EA/Plasma_Protein_weights_EA/ \
    --ref_ld_chr ./LDref/EUR/chr \
    --force_model enet \
    --chr ${CHR} \
    --out ./Results/Example_PWAS_by_chr/chr${CHR}.out
    #let CHR-=1
    #done
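
    To run every autosome for either ancestry, the same call can be wrapped in a small helper. The sketch below only prints the commands so they can be inspected before anything is run. It assumes that the AA files mirror the EA naming shown above and that the African-ancestry LD reference sits in ./LDref/AFR/ next to ./LDref/EUR/; neither is confirmed here, so check your unzipped directories. The function name pwas_cmd is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the association-test command for one chromosome.
# Assumptions: AA files mirror the EA naming, and ./LDref/AFR/chr exists
# alongside ./LDref/EUR/chr.
pwas_cmd() {
    anc=$1
    chr=$2
    case "$anc" in
        EA) ld=EUR ;;
        AA) ld=AFR ;;
        *)  echo "unknown ancestry: $anc" >&2; return 1 ;;
    esac
    echo "Rscript ./scripts/PWAS.assoc_test.R" \
        "--sumstats ./sumstats.txt" \
        "--weights ./PWAS_${anc}/Plasma_Protein_${anc}_hg19.pos" \
        "--weights_dir ./PWAS_${anc}/Plasma_Protein_weights_${anc}/" \
        "--ref_ld_chr ./LDref/${ld}/chr" \
        "--force_model enet" \
        "--chr ${chr}" \
        "--out ./Results/Example_PWAS_by_chr/chr${chr}.out"
}

# Print the commands for chromosomes 1 to 22
chr=1
while [ "$chr" -le 22 ]; do
    pwas_cmd EA "$chr"
    chr=$((chr + 1))
done
```

    Piping the printed lines to a shell runs them one after another. Printing first makes it easy to confirm that the ancestry of the weights matches the ancestry of the LD reference.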
    

    Output: Gene-disease association

    The output file has the same format as TWAS/FUSION output. The only differences are in naming: eQTL ⟶ pQTL and TWAS ⟶ PWAS.
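
    A common next step is to keep only the proteins that pass a multiple-testing threshold. The sketch below applies a Bonferroni cutoff (0.05 divided by the number of tested rows), which is a conventional choice for TWAS-style analyses and not necessarily the threshold used in the paper. It assumes a tab-separated table whose p-value column is named PWAS.P (that is, FUSION's TWAS.P after the renaming described above); pass another column name as the second argument if your header differs. The function name bonferroni_hits is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: keep rows of a FUSION-style table whose p-value passes a
# Bonferroni threshold (0.05 / number of rows with a p-value).
# Assumption: the p-value column is called PWAS.P; override with $2.
bonferroni_hits() {
    awk -v pcol="${2:-PWAS.P}" '
        BEGIN { FS = OFS = "\t" }
        NR == 1 {
            # Find the p-value column by its header name
            for (i = 1; i <= NF; i++) if ($i == pcol) p = i
            if (!p) { print "column not found: " pcol > "/dev/stderr"; exit 1 }
            header = $0
            next
        }
        # Store only rows that have a p-value (skips NA and empty cells)
        $p != "NA" && $p != "" { n++; row[n] = $0; pv[n] = $p + 0 }
        END {
            if (!p) exit 1
            print header
            for (j = 1; j <= n; j++) if (pv[j] < 0.05 / n) print row[j]
        }
    ' "$1"
}

# Small demonstration on made-up rows
demo_out=$(mktemp)
printf 'ID\tCHR\tPWAS.Z\tPWAS.P\nGENE_A\t22\t5.2\t2e-07\nGENE_B\t22\t1.1\t0.27\nGENE_C\t22\tNA\tNA\n' > "$demo_out"
bonferroni_hits "$demo_out"
rm -f "$demo_out"
```

    Run it on the merged table across all chromosomes rather than on a single chromosome file, so that the number of tests reflects the whole analysis.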

    Bivariate conditional analysis for TWAS and PWAS

    We provide a framework for interpreting PWAS results in conjunction with analogous results from TWAS. As demonstrated in our paper, the genetic correlation between gene expression in tissues and protein levels in plasma due to cis-genetic regulation is moderate. As a result, a bivariate conditional analysis can be used to explore whether PWAS signals could be explained by cis-genetic regulation of the expression of nearby genes (within a 1Mb region around the gene), and vice versa.

    Required data to perform conditional analysis

    The conditional analysis requires TWAS output for significant cis-heritable gene expression only, generated using enet models; PWAS output from our pipeline, generated using enet models; and pre-imputed cis-regulated plasma proteins and gene expression for 1000G reference individuals (stored in 1000G_imputed_[EA/AA] inside PWAS_[EA/AA]).

    Input 1: PWAS and TWAS output tables

    The output tables must be in the FUSION format. Both PWAS and TWAS must also have been run with enet models (using the option --force_model enet), because the imputed plasma protein and gene expression levels for reference individuals were pre-computed using enet models (stored in 1000G_imputed_[EA/AA] inside PWAS_[EA/AA]).

    Output tables from TWAS/PWAS for all chromosomes need to be merged into a single table. Example R code:

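    library(dplyr)
    library(readr)
    # Stack the per-chromosome output tables into one table
    results <- tibble()
    for (chr in 1:22) {
        results <- rbind(results, read_tsv(paste0("./Results/Example_PWAS_by_chr/chr", chr, ".out")))
        # Chromosome 6 may have an additional output file for the MHC region
        mhc_file <- "./Results/Example_PWAS_by_chr/chr6.out.MHC"
        if (chr == 6 && file.exists(mhc_file)) {
            results <- rbind(results, read_tsv(mhc_file))
        }
    }
    write_tsv(results, "./Results/PWAS.out")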
    

    TWAS output tables for different tissues should be stored as separate files in the same folder, each named [tissue].out. For example, the TWAS output for whole blood should be named Whole_Blood.out.

    Input 2: list of TWAS tissues to be analyzed

    The full list of TWAS tissues to be analyzed for conditional analysis should be stored in a text file. One tissue per line. An example is the full list of all GTEx V7 tissues, GTEx_V7_tissue_list.txt.
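
    Before starting the conditional analysis, it can help to confirm that every tissue in the list has a matching output file. The sketch below prints the tissues for which no [tissue].out file exists in the TWAS folder and returns a non-zero status if any are missing. The function name check_tissue_files is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report tissues from the list that lack a matching <tissue>.out
# file in the TWAS folder. Prints the missing tissue names and returns
# non-zero if any are missing.
check_tissue_files() {
    list=$1
    dir=$2
    missing=0
    # The "|| [ -n ... ]" part also handles a last line without a newline
    while IFS= read -r tissue || [ -n "$tissue" ]; do
        # Ignore blank lines in the list
        [ -z "$tissue" ] && continue
        if [ ! -f "$dir/$tissue.out" ]; then
            echo "$tissue"
            missing=1
        fi
    done < "$list"
    return "$missing"
}

# Example call with the paths used in this tutorial:
# check_tissue_files ./GTEx_V7_tissue_list.txt ./Results/TWAS.out/
```

    Tissue names are compared exactly, so the lines in the list must match the file names character for character (for example Whole_Blood, not Whole Blood).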

    Input 3: imputed plasma protein and gene expression reference data

    Cis-genetic regulated plasma proteins and gene expressions were pre-imputed on publicly available 1000G individuals. This data is stored in 1000G_imputed_[EA/AA] inside PWAS_[EA/AA].

    Example for performing conditional analysis of TWAS and PWAS

    Here we give an example for performing the conditional analysis of TWAS and PWAS. In this example, we stored the PWAS output table in ./Results/PWAS.out, and all-tissue TWAS output tables inside the folder of ./Results/TWAS.out/.

    Rscript ./scripts/PWAS.conditional.R \
    --PWAS ./Results/PWAS.out \
    --TWAS ./Results/TWAS.out/ \
    --tissue_list ./GTEx_V7_tissue_list.txt \
    --tissue_n_gene ./GTEx_V7_n_gene.rds \
    --imputed_P ./PWAS_EA/1000G_imputed_EA/1000G_imputed_Plasma_Protein.txt \
    --imputed_T ./PWAS_EA/1000G_imputed_EA/1000G_imputed_FUSION/ \
    --out ./Results/ConditionalAnalysis/
    

    Output: bivariate conditional analysis of PWAS and TWAS in all tissues

    This script first generates a [tissue].RDat file for each tissue. These RDat files contain the following variables (below, T denotes gene expression and P denotes plasma proteins):

    • dat.sentinel.pwas: The regional sentinel PWAS genes (+/- 500Kb).

    • PcT.z: z-score of P conditional on T.

    • PcT.p: p-value of P conditional on T.

    • TcP.z: z-score of T conditional on P.

    • TcP.p: p-value of T conditional on P.

    • twas.hit: the most significant nearby TWAS gene (+/- 500Kb) for each regional sentinel PWAS gene.

    • twas.p: p-value of the most significant nearby TWAS gene for each regional sentinel PWAS gene.

    • dist: distance (bp) between each sentinel PWAS gene and its most significant nearby TWAS gene.

    • corr: cis-regulated genetic correlation between each sentinel PWAS gene and its most significant nearby TWAS gene.
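
    To give a feel for how the conditional z-scores relate to the marginal ones, the sketch below uses the textbook conditional z-score for two test statistics that are jointly normal with correlation r: z(P given T) = (zP - r * zT) / sqrt(1 - r^2). Whether PWAS.conditional.R uses exactly this expression is an assumption; the function name cond_z and the input values are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: conditional z-score of the first statistic given the second,
# assuming the two z-scores are bivariate normal with correlation r.
# This is the standard textbook formula, not code taken from the pipeline.
cond_z() {
    awk -v z1="$1" -v z2="$2" -v r="$3" 'BEGIN {
        # The formula is undefined when the two statistics are perfectly correlated
        if (r * r >= 1) { print "NA"; exit }
        printf "%.4f\n", (z1 - r * z2) / sqrt(1 - r * r)
    }'
}

# Illustrative values: PWAS z = 5, TWAS z = 3, correlation = 0.6
cond_z 5 3 0.6   # P conditional on T
cond_z 3 5 0.6   # T conditional on P
```

    With these inputs the protein signal stays strong after conditioning while the expression signal largely disappears, which is the kind of pattern the PcT and TcP columns are meant to reveal.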

    It will then generate summary tables for tissue-specific conditional analysis and an all-tissue conditional analysis. The columns are:

    • PWAS_hit : PWAS significant gene

    • PWAS_p : p-value of the PWAS significant gene

    • TWAS_hit : the most significant nearby TWAS gene (+/- 500Kb) for each regional sentinel PWAS gene.

    • TWAS_p : p-value of the most significant nearby TWAS gene for each regional sentinel PWAS gene.

    • Dist_of_hits : distance (bp) between each sentinel PWAS gene and its most significant nearby TWAS gene.

    • Corr_of_hits : cis-regulated genetic correlation between each sentinel PWAS gene and its most significant nearby TWAS gene.

    • PcT_p : p-value of P conditional on T.

    • TcP_p : p-value of T conditional on P.

    • min_TWAS_hit : the most significant nearby TWAS gene (+/- 500Kb) for each regional sentinel PWAS gene in all-tissue analysis.

    • min_TWAS_Tissue : the tissue with the most significant nearby TWAS gene (+/- 500Kb) for each regional sentinel PWAS gene in all-tissue analysis.

    • min_TWAS_p : the p-value of the most significant nearby TWAS gene (+/- 500Kb) for each regional sentinel PWAS gene in all-tissue analysis.

    • N_significant_tissues_in_TWAS : the number of significant TWAS tissues in all-tissue analysis.

    • all_significant_tissues_in_TWAS : all significant TWAS tissues in all-tissue analysis.
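
    The summary tables can be screened for PWAS hits whose signal remains after conditioning on the nearby TWAS gene while the TWAS signal does not remain after conditioning on the protein. The sketch below assumes the summary table is tab-separated with a header row that uses the column names listed above; the file layout, the 0.05 cutoff, and the function name pwas_robust_hits are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list PWAS hits with PcT_p below a cutoff and TcP_p at or above it.
# Assumptions: tab-separated table, header row with PWAS_hit, PcT_p, TcP_p;
# the default cutoff of 0.05 is illustrative only.
pwas_robust_hits() {
    awk -v thr="${2:-0.05}" '
        BEGIN { FS = "\t" }
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("PWAS_hit" in col) || !("PcT_p" in col) || !("TcP_p" in col)) {
                print "missing required column" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Skip rows without conditional p-values, then apply the two cutoffs
        $(col["PcT_p"]) != "NA" && $(col["TcP_p"]) != "NA" &&
        $(col["PcT_p"]) + 0 < thr + 0 && $(col["TcP_p"]) + 0 >= thr + 0 {
            print $(col["PWAS_hit"])
        }
    ' "$1"
}

# Small demonstration on made-up rows
demo_tab=$(mktemp)
printf 'PWAS_hit\tPWAS_p\tPcT_p\tTcP_p\nGENE_A\t1e-9\t1e-4\t0.6\nGENE_B\t1e-8\t0.3\t1e-5\n' > "$demo_tab"
pwas_robust_hits "$demo_tab"
rm -f "$demo_tab"
```

    Swapping the two comparisons gives the opposite list, that is, hits whose PWAS signal could be explained by the nearby TWAS gene.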


    To explore pQTL results, use either the Protein View or the SNP View.

    Protein View

    In this view, individual proteins or corresponding genes can be queried to visualize the pQTL results.

    Input: Either the UniProt ID or the gene symbol. Note: Please don't use both. UniProt ID is the default, so when searching by gene symbol, please make sure the UniProt ID box is cleared.

    This will produce 4 different plots:

    • pQTL: EA : regional pQTL plot for the SNPs within +/- 1Mb of the transcription start site (cis-SNPs) of the gene corresponding to the protein in European Americans (EA). -log10(p-values) are on the vertical axis and the physical positions of the SNPs are on the horizontal axis.
    • Fine-mapping: EA : Fine mapping (using SuSiE) for the cis-SNPs of the protein in EA. Posterior inclusion probabilities (PIP) are on the vertical axis and the physical positions of the SNPs are on the horizontal axis.
    • pQTL: AA : regional pQTL plot for the SNPs within +/- 1Mb of the transcription start site (cis-SNPs) of the gene corresponding to the protein in African Americans (AA). -log10(p-values) are on the vertical axis and the physical positions of the SNPs are on the horizontal axis.
    • Fine-mapping: AA : Fine mapping (using SuSiE) for the cis-SNPs of the protein in AA. Posterior inclusion probabilities (PIP) are on the vertical axis and the physical positions of the SNPs are on the horizontal axis.

    The points are colored by LD with the sentinel SNP in EA and AA.

    To view the top SNPs ordered by PIP, please choose either EA or AA and click Update View. This will display the summary statistics for the corresponding population in tabular format, where each row provides information on one SNP. The columns are:

    • CHR: chromosome
    • position (hg38): Position of the SNP in hg38
    • rsID: rs ID of the variant
    • Tested Allele: The allele of the variant that was tested for association. By convention, the minor allele is tested.
    • AF: Minor allele frequency
    • Beta: Effect size of the coded allele from cis-pQTL analysis
    • SE: standard error of the effect size of the coded allele
    • pvalue: p-value of the effect of the coded allele
    • PIP: Posterior inclusion probability of the SNP
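
    The same ordering can be reproduced offline on a downloaded table. The sketch below sorts rows by the PIP column in descending order and keeps the first few. It assumes a tab-separated file with a header row containing a column named exactly PIP, which may not match the actual download format, and it relies on the general-numeric option (-g) of GNU or BSD sort to handle values in scientific notation. The function name top_pip is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the header plus the N rows with the highest PIP.
# Assumptions: tab-separated input with a header column named PIP;
# sort supports -g (general numeric), as GNU and BSD sort do.
top_pip() {
    file=$1
    n=${2:-10}
    tab=$(printf '\t')
    # Locate the PIP column from the header line
    pip_col=$(head -n 1 "$file" | tr '\t' '\n' | grep -n -x 'PIP' | cut -d: -f1)
    [ -n "$pip_col" ] || { echo "no PIP column found" >&2; return 1; }
    head -n 1 "$file"
    tail -n +2 "$file" | LC_ALL=C sort -t "$tab" -k "${pip_col},${pip_col}gr" | head -n "$n"
}

# Small demonstration on made-up rows
demo_snps=$(mktemp)
printf 'rsID\tpvalue\tPIP\nrs1\t1e-5\t0.02\nrs2\t1e-30\t0.91\nrs3\t1e-8\t3e-4\n' > "$demo_snps"
top_pip "$demo_snps" 2
rm -f "$demo_snps"
```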

    You can also download the full summary statistics using the download button.

    SNP View

    In this view, SNPs can be queried to visualize pQTL results. Please note that queries in this mode are currently fairly slow. We are actively working to improve the query time for this view.

    Input : The coordinates of the SNP as chromosome number and position in GRCh38.

    This will produce a table of the proteins for which the SNP lies in the cis-region. The table contains the UniProt ID, Gene ID, chromosome, and transcription start site (TSS) of the gene corresponding to each protein.

    Click on a row to produce the plots for that protein. The plots are the same as in Protein View for the corresponding protein, with the queried SNP marked separately in black. To view the top SNPs ordered by PIP, use the Update View button for the population of interest. You can also download the full summary statistics using the download button.

    Important

    If you get an error at any stage of visualization, please make sure the protein is currently included in our database. The list of all proteins included in the database can be found in the About tab.

    List of Proteins

    The current version of the database includes results for ~2,000 proteins (click to view).

    Before querying the database please check whether the protein of interest is included in the current version of the database.


    Note: The data view shows only the specified number of observations, but the summary is still based on the full dataset. p-values < 1e-320 have been collapsed to 1e-320 for ease of viewing.
