Last updated: 2024-09-21

Checks: 7 0

Knit directory: cross_ancestory_LCL/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200813) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 6c612dd. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.RData
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/11.24.2020.2.png
    Ignored:    analysis/11.24.2020.3.png
    Ignored:    analysis/11.24.2020.4.png
    Ignored:    analysis/11.24.2020.5.png
    Ignored:    analysis/11.24.2020.6.png
    Ignored:    analysis/11.24.2020.7.png
    Ignored:    analysis/11.24.2020.8.png
    Ignored:    analysis/11.24.2020.9.png
    Ignored:    genotype/

Untracked files:
    Untracked:  analysis/1.21.WIP.Rmd
    Untracked:  analysis/12.WIP.Rmd
    Untracked:  analysis/first-analysis.Rmd alias
    Untracked:  getfastqtest2.csv

Unstaged changes:
    Modified:   analysis/correl.Rmd
    Modified:   analysis/manuscript3.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/manuscript2.Rmd) and HTML (docs/manuscript2.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 6c612dd mariesaitou 2024-09-21 workflowr::wflow_publish("analysis/manuscript2.Rmd")
html 9474f66 mariesaitou 2022-04-11 Build site.
Rmd c6a488f mariesaitou 2022-04-11 Update the files for myproject
html d23b1be mariesaitou 2021-07-02 Build site.
html 64421f6 mariesaitou 2021-07-02 Build site.
html d2e1bc0 mariesaitou 2021-07-02 Build site.
html 2abb7f5 mariesaitou 2021-07-02 Build site.
Rmd 90217b9 mariesaitou 2021-07-02 Publish the files for myproject

Fine Mapping

Cut the 100k up/downstream regions from the genotype file

module load vcftools
module load htslib 
csvfile=/project2/xuanyao/marie/E-GEUV-1/finemap/genelist.location1.csv
for line in `cat ${csvfile} | grep -v ^#`
do
  gene=`echo ${line} | cut -d ',' -f 1`
  chr=`echo ${line} | cut -d ',' -f 3`
  up=`echo ${line} | cut -d ',' -f 6`
  down=`echo ${line} | cut -d ',' -f 7`
  vcftools --gzvcf /project2/xuanyao/marie/E-GEUV-1/genotype/phase3/chr${chr}.1000gphase3.EUR.0.01.biallelic.recode.vcf.gz --chr ${chr} --from-bp ${up} --to-bp ${down} --recode --out ${gene}.EUR.genotype

done

run SuSiER

setwd("/project2/xuanyao/marie/E-GEUV-1/finemap")
library(susieR)
library(data.table)
genelist <- read.csv("YRI.eGenes.csv", stringsAsFactors = F)
gene.expression.YRI <- fread("/project2/xuanyao/marie/E-GEUV-1/FastQTL/GEUV/Yoruba.TPM.scaled.gene.csv")


## read a gene from the list
filenameYRI<-paste("YRI/",genelist[,"gene"], ".Yoruba.genotype.recode.vcf", sep="")
genotype.YRI<- lapply(filenameYRI, FUN=read.table, header = FALSE, stringsAsFactors = F)
names(genotype.YRI) <- genelist[,"gene"]
genotype.YRI.df <- rbindlist(genotype.YRI, fill=T, idcol = T)
genotype.YRI.df<-genotype.YRI.df[!duplicated(genotype.YRI.df[,c(".id","V3" )])&!duplicated(genotype.YRI.df[,c(".id","V3" )], fromLast = T),]


## remove the rows with all 0|0 or 1|1 (V10:ncol(genotype.YRI.df))
## remove the rows with all 0|1 or 1|0 (V10:ncol(genotype.YRI.df))
allsame <- function(vector){
  length(unique(unlist(vector[11:ncol(genotype.YRI.df)]))) == 1
}
all0110 <- function(vector){
  len <- length(unlist(vector[11:ncol(genotype.YRI.df)]))
  sum(unlist(vector[11:ncol(genotype.YRI.df)]) %in% c("0|1", "1|0")) == len
}

genotype.YRI.poly <- genotype.YRI.df[!(apply(genotype.YRI.df, 1, allsame) | apply(genotype.YRI.df, 1, all0110)), ]

## convert the vcf format as input dataset for SuSIE
genotype.YRI.data <- genotype.YRI.poly[,11:length(genotype.YRI.poly[1,])] 


genotype.YRI.data[genotype.YRI.data=="0|0"]<- 0L
genotype.YRI.data[genotype.YRI.data=="0|1"]<- 1L
genotype.YRI.data[genotype.YRI.data=="1|0"]<- 1L
genotype.YRI.data[genotype.YRI.data=="1|1"]<- 2L
genotype.YRI.data1<- as.matrix(genotype.YRI.data)
genotype.YRI.data<- matrix(as.numeric(genotype.YRI.data1), nrow = nrow(genotype.YRI.data))


## Scale the genotypes
scale.gen.YRI <- scale(t(genotype.YRI.data))
scale.gen.YRI.matrix <- matrix(scale.gen.YRI, ncol = 87, byrow = TRUE)



## Extract genes from the gene expression list
test.expression.YRI <- gene.expression.YRI[unlist(lapply(genelist$gene, grep, gene.expression.YRI$ID)),]
expression.YRI <- t(test.expression.YRI[,-(1:4)])

## Covariates
Zmat = read.table("Yoruba.cov.txt",stringsAsFactors = F)
Zmat.t <- t(Zmat[-(1),-(1)])
Zmat.t <-as.numeric(Zmat.t)
Zmat1<-matrix(as.numeric(Zmat.t), nrow=87)

covtest<-  as.list(NULL)
for(i in 1:nrow(genelist)){
  y=expression.YRI[,i]
  y.res = residuals(lm(y~Zmat1, na.action=na.exclude))
  covtest[[i]] <- stack(y.res)
}


## Make a list for susieR  
fitted.test.YRI <- as.list(NULL)

## Run SucieR ... Parameters are changeable
  for(i in 1:nrow(genelist)){
    fitted.test.YRI[[i]] <-   susie(scale.gen.YRI[,(genotype.YRI.poly$.id==genelist$gene[i]),drop=F], covtest[[i]][["values"]],
                                    L = 10, 
                                    estimate_residual_variance = TRUE, 
                                    estimate_prior_variance = FALSE,
                                    scaled_prior_variance = 0.95,
                                    verbose = TRUE)
  }



## Attach gene names to the result
#### original PIP
fitted.YRI <-  list(NULL)
for(i in 1:nrow(genelist)){
  if(length(fitted.test.YRI[[i]]$sets$cs) != 0){
    fitted.YRI[[i]] <- cbind(stack(fitted.test.YRI[[i]]$sets$cs),
                             fitted.test.YRI[[i]][["pip"]][unlist(fitted.test.YRI[[i]]$sets$cs)])
   # print(fitted.YRI[[i]])
  }
  
}

check.YRI <- rbindlist(fitted.YRI, idcol = T) 
check.YRI$name <- genelist[check.YRI$.id, "gene"]
names(check.YRI) <- c(".id", "values", "ind", "pip", "name")

result.temp <- data.frame(row.names = c("V1", "V2", "V3"))
for(i in 1:length(check.YRI$.id)){
  result.temp <- rbind(result.temp, genotype.YRI.poly[genotype.YRI.poly$.id == check.YRI$name[i], ][check.YRI$values[i], 2:4])
}
result.YRI <- data.frame(check.YRI$name, check.YRI$ind, check.YRI$values, check.YRI$pip, result.temp)
names(result.YRI)<-c("gene","L","SNP","PIP","chr","loc","rs")


## Format the finemapped SNP file
library(dplyr)
result.YRI<-result.YRI %>% as.data.frame() %>% mutate(gene.SNP = paste(!!!rlang::syms(c("gene", "rs")), sep="."))
write.csv(result.YRI, file = "result.YRI.PIP.cov.02172022.csv")

Classification of the finemapping results

library(dplyr)
grou12<-read.csv("result.EUR.cluster.csv")
group2<-read.csv("result.YRI.cluster.csv")


#(1) group1_only finemapped genes
group1_only <- unique(group1$gene[!(group1$gene %in% group2$gene)])
#(2) group2_onlyfinemapped genes
group2_only <- unique(group2$gene[!(group2$gene %in% group1$gene)])
#(3) both_overlapped
#gene.SNP matched genes
both_overlapped_SNP <- unique(group2$gene[(group2$gene.SNP %in% group1$gene.SNP)])
#(4) both_non_overlapped
#gene.SNP non-matched genes
both_overlapped <- unique(group2$gene[(group2$gene %in% group1$gene)])
both_non_overlapped <- unique(both_overlapped[!(both_overlapped %in% both_overlapped_SNP)])

df1 <- data.frame(gene=both_overlapped_SNP, status = "both_overlapped")
df2 <- data.frame(gene=both_non_overlapped, status = "both_non_overlapped")
df3 <- data.frame(gene=group1_only, status = "EUR_only")
df4 <- data.frame(gene=group2_only, status = "YRI_only")

DF4<-rbind(df1,df2,df3,df4)
write.csv(DF4, file = "geneclass.csv")

Obtain the frequency of the genetic variants

module load vcftools
module load bcftools
module load htslib

module load vcftools
for i in `seq 1 22`
do
   vcftools --gzvcf chr$i.1000gphase3.EUR.0.01.biallelic.snp.recode.vcf.gz --snps SNPs.EURgroup.txt --keep group1_indivi.csv --recode --recode-INFO-all --out chr$i.specific_snp_group1
done

for i in `seq 1 22`
do
   vcftools --gzvcf chr$i.1000gphase3.EUR.0.01.biallelic.snp.recode.vcf.gz --snps SNPs.EURgroup.txt --remove group1_indivi.csv --recode --recode-INFO-all --out chr$i.specific_snp_group2
done
c

bcftools concat chr{1..22}.specific_snp_group1.recode.vcf -o  specific_snp_group1.vcf
bcftools concat chr{1..22}.specific_snp_group2.recode.vcf -o  specific_snp_group2.vcf


vcftools --vcf specific_snp_group1.vcf --freq --out group1
vcftools --vcf specific_snp_group2.vcf --freq --out group2
done

VarLD Analysis

module load vcftools
module load htslib
module load python
module load java

#sbatch varLD.slurm
#cd /project2/xuanyao/marie/E-GEUV-1/LDSC/varLD

#csvfile=/project2/xuanyao/marie/E-GEUV-1/finemap/genelist.location1.csv
csvfile=/project2/xuanyao/marie/E-GEUV-1/finemap/varLD/result720/genelist.location2.csv
for line in `cat ${csvfile} | grep -v ^#`

do
gene=`echo ${line} | cut -d ',' -f 1`
srun --exclusive -N1 -n1 java -jar /project2/xuanyao/marie/E-GEUV-1/LDSC/varLD/rgenetics-1.0.jar -p VarLD /project2/xuanyao/marie/E-GEUV-1/finemap/varLD/${gene}.varLDgenotype.Yoruba.csv /project2/xuanyao/marie/E-GEUV-1/finemap/varLD/${gene}.varLDgenotype.EUR.csv -n 200 -o /project2/xuanyao/marie/E-GEUV-1/finemap/varLD/result720/varLD${gene}.L200.txt


done
wait

Genetic Correlation Analysis

Prepare input files

setwd("/Users/saitoumarie/Dropbox/Chicago/RCC/eQTL.practice/LDSC/phen")
genelist=read.csv("LDSC/test3.location.csv", header=T,stringsAsFactors = F)
EUR = read.csv("EUR.TPM.scaled.csv", header=T,stringsAsFactors = F)
Yoruba = read.csv("Yoruba.TPM.scaled.csv", header=T,stringsAsFactors = F)

## extract genes from the gene expression list

loc = read.csv("genelist.location1.csv", header=T,stringsAsFactors = F)
Yorubalist<-Yoruba[Yoruba[,1]%in%genelist[,1],]
EURlist <- EUR[EUR[,1]%in%genelist[,1],]

comlist <- Yorubalist$X[Yorubalist$X%in% EURlist$gene]

for(i in 1:length(comlist)){
  hako <- data.frame(cbind(c(names(EURlist), names(Yorubalist)), c(names(EURlist), names(Yorubalist))))
  hako$EUR <- c(as.numeric(EURlist[EURlist$gene==comlist[i],]), rep(NA, length(Yorubalist)))
  hako$Yoruba <-  c(rep(NA, length(EURlist)), as.numeric(Yorubalist[Yorubalist$X==comlist[i],]))
  hako<-subset(hako,hako$X1!="gene")
  hako<-subset(hako,hako$X1!="X")
  colnames(hako) <- c('ID', 'ID', 'EUR', 'Yoruba')
 # write.csv(hako, comlist[i])
  write.table(hako, file = paste(comlist[i], ".pre"), sep = "\t", row.names = FALSE,
              col.names = FALSE,quote=FALSE)
}

Run analysis

#cd /project2/xuanyao/marie/E-GEUV-1/LDSC/GCTA/noconstraint_EUR87
export PATH="$PATH:/home/maries1/gcta_1.93.2beta"
#./gcta64

## Genetic correlation
csvfile=allgenes.list.csv
for line in `cat ${csvfile} | grep -v ^#`
do
  gene=`echo ${line} | cut -d ',' -f 1`
  gcta64 --reml-bivar --reml-no-constrain --reml-bivar-no-constrain --reml-maxit 100 --grm  /project2/xuanyao/marie/E-GEUV-1/LDSC/GCTA/${gene}  --pheno /project2/xuanyao/marie/E-GEUV-1/LDSC/GCTA/nonconstraint_EUR87.2/phen/${gene}.EUR87.2.phen  --out results/${gene}_EUR87 
done

## General reml
csvfile=allgenes.list.csv
for line in `cat ${csvfile} | grep -v ^#`
do
  gene=`echo ${line} | cut -d ',' -f 1`
  gcta64  --reml --reml-no-constrain --grm /project2/xuanyao/marie/E-GEUV-1/LDSC/GCTA/${gene} --reml-maxit 100 --pheno /project2/xuanyao/marie/E-GEUV-1/LDSC/GCTA/noconstraint_reml/material/EUR87.1/${gene}.EUR87.1.phen  --out /project2/xuanyao/marie/E-GEUV-1/LDSC/GCTA/noconstraint_reml/results/EUR87.1/${gene}_87.1
done

Regression Analysis on Allele Frequency and Gene Expression

ggplot(data = x, 
       aes(x = -(difference),
           y = log2FoldChange,color=DEG,shape=DEG, alpha=0.7)) + 
  geom_smooth(method=lm) + 
  geom_point(size = 2) +
  ggtitle("causal SNP frequency and gene expression") +
  xlab("allele frequency difference in two pops") + theme_bw()+ ylab("log2FoldChange")+ stat_summary(fun = "mean", geom = "crossbar",  width = 0.5)

ggscatter(x, "difference", "log2FoldChange", alpha=0.6, add = "reg.line", conf.int = TRUE,color = "DEG",shape="DEG")+ stat_cor(aes(color = "DEG",shape="DEG"))

ggplot(x, aes(x=difference, y=log2FoldChange)) +geom_point()+theme_bw(base_size = 12) 
+  stat_summary(fun = "mean",
               geom = "crossbar", 
               width = 0.5,
               colour = "")

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Oslo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       httr_1.4.7        cli_3.6.2         knitr_1.45       
 [5] rlang_1.1.3       xfun_0.42         stringi_1.8.3     processx_3.8.3   
 [9] promises_1.3.0    jsonlite_1.8.8    glue_1.7.0        rprojroot_2.0.4  
[13] git2r_0.33.0      htmltools_0.5.7   httpuv_1.6.15     ps_1.7.6         
[17] sass_0.4.8        fansi_1.0.6       rmarkdown_2.26    jquerylib_0.1.4  
[21] tibble_3.2.1      evaluate_0.23     fastmap_1.1.1     yaml_2.3.8       
[25] lifecycle_1.0.4   whisker_0.4.1     stringr_1.5.1     compiler_4.3.1   
[29] fs_1.6.3          pkgconfig_2.0.3   Rcpp_1.0.12       rstudioapi_0.15.0
[33] later_1.3.2       digest_0.6.34     R6_2.5.1          utf8_1.2.4       
[37] pillar_1.9.0      callr_3.7.5       magrittr_2.0.3    bslib_0.6.1      
[41] tools_4.3.1       cachem_1.0.8      getPass_0.2-4