nanocompore
Sharkfin plot has too many data points and is difficult to visualize
Hi, I was trying to make the sharkfin plot as you discussed in another issue. However, the shape of my plot doesn't match the one you showed in the paper; it looks like the plot shown below. There are about 35K data points. Would you recommend splitting the result dataframe by transcript ID (i.e. the ref_id column)? This is SARS-CoV-2.
library(ggplot2)
## `file` is the Nanocompore results table, already loaded as a data frame
## Because ggplot doesn't like NAs
df <- tidyr::drop_na(file[, c("ref_id", "ref_kmer", "GMM_logit_pvalue_context_2", "Logit_LOR")])
## Use the absolute log odds ratio so both directions of change share one axis
df$Logit_LOR <- abs(df$Logit_LOR)
df <- df[order(df$GMM_logit_pvalue_context_2, df$Logit_LOR), ]
## Label sites before transforming the p-value, so the 0.05 cutoff applies to raw p-values
df$color <- ifelse(df$GMM_logit_pvalue_context_2 < 0.05 & df$Logit_LOR > 0.5, "Significant", "Not-significant")
df$GMM_logit_pvalue_context_2 <- -log10(df$GMM_logit_pvalue_context_2)
ggplot(df, aes(x = Logit_LOR, y = GMM_logit_pvalue_context_2, color = color)) +
  geom_point() + theme_minimal() +
  xlab("Logistic regression odds ratio") + ylab("Nanocompore p-value (-log10)")
It does seem strange that you have a lot of sites with significant p-values but low absolute values for the log odds ratio. Without knowing anything about your experimental design, it is challenging to give good advice on what this might mean. Maybe start by looking through the methods and supplementary information in this paper, where we used Nanocompore on SARS-CoV-2 RNA?
https://doi.org/10.1016/j.omtn.2023.102052
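As a rough way to check how many sites fall into the "significant p-value but low absolute log odds ratio" group, and whether they cluster in particular transcripts, something like the sketch below might help. It uses the column names and the 0.05 and 0.5 cutoffs from the plotting code above. The data frame here is made-up toy data, and the class labels (`sig_high_LOR`, `sig_low_LOR`, `not_sig`) are just illustrative names, not anything Nanocompore produces.

```r
# Toy data only: these values are invented for illustration and are
# NOT real Nanocompore output. Replace `df` with your own results table.
df <- data.frame(
  ref_id = c("tx1", "tx1", "tx1", "tx2", "tx2", "tx2"),
  GMM_logit_pvalue_context_2 = c(0.001, 0.20, 0.03, 0.04, 0.0005, 0.60),
  Logit_LOR = c(-1.2, 0.1, 0.2, -0.3, 0.9, 0.05)
)

# Cutoffs copied from the plotting code in the question.
p_cutoff <- 0.05
lor_cutoff <- 0.5

# Apply the cutoffs to the raw p-value and the absolute log odds ratio.
sig <- df$GMM_logit_pvalue_context_2 < p_cutoff
strong <- abs(df$Logit_LOR) > lor_cutoff

# Hypothetical labels: significant with a large effect, significant with a
# small effect, and everything else.
df$class <- ifelse(sig & strong, "sig_high_LOR",
                   ifelse(sig, "sig_low_LOR", "not_sig"))

# Cross-tabulate transcript (rows) against class (columns).
counts <- table(df$ref_id, df$class)
print(counts)
```

If most of the `sig_low_LOR` sites come from one or two `ref_id` values, plotting each transcript separately could make the picture easier to read than a single plot of all 35K points.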