pySCENIC
pySCENIC copied to clipboard
Empty file output from pyscenic ctx using original/modified databases
Hi pySCENIC team,
Thank you for providing a great tool. I am running pyscenic using my original zebrafish data. However, I got an empty file output (reg.csv) from pyscenic ctx
, which is a very similar situation in #177. After execution of the pyscenic ctx
function, a lot of warnings were shown up like:
2021-08-30 10:12:59,909 - pyscenic.transform - WARNING - Less than 80% of the genes in tfdp1a could be mapped to zf1.genes_vs_motifs.rankings. Skipping this module.
[ ] | 0% Completed | 33.4s
or
2021-08-30 10:13:04,246 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for foxp2 could be mapped to zf1.genes_vs_motifs.rankings. Skipping this module.
[ ] | 0% Completed | 33.7s
There is no error message, and the empty output reg.csv means this:
Could you please provide me a solution to fix this issue?
More details are described below.
I ran this code:
!pyscenic ctx adj.tsv \
{f_db_names} \
--annotations_fname {f_motif_path} \
--expression_mtx_fname {f_loom_path_scenic} \
--output reg.csv \
--mask_dropouts \
--num_workers 20
zebrafish cisTarget databases were created by following this repo using zebrafish 5'UTR's up and downstream 10k bp genomic regions and motif information from JASPAR. The feather file I used for pyscenic ctx
is zf1.genes_vs_motifs.rankings.feather.
motifs ABCA7 ABCD2 ABR ACADSB ACAP2 ACBD3 ACSF3 ACTC1 ACVR1C AK6 AL627305.1
1 MA0002.2 9891 20179 10742 2567 6236 6282 13648 5347 10663 16867 644
2 MA0003.4 7236 18069 10906 14978 5419 16347 16718 4296 12922 5931 233
3 MA0004.1 19676 6688 14874 8093 20146 18805 13125 18632 7730 19189 15629
4 MA0006.1 10872 17790 6219 50 11655 20120 14065 1775 17164 19464 6347
5 MA0007.3 10583 17451 3597 12343 9659 19054 5885 9608 5872 14260 14541
The head of the adjacencies matrix (adj.tsv):
TF target importance
0 tfcp2l1 prl 253.933308
1 tfcp2l1 nppal 234.960120
2 ybx1 rpl8 158.685731
3 ybx1 rps15a 156.212955
4 ybx1 rpl11 156.103158
According to this comment, I have created zebrafish Motif2TF database by the following step: 1. find zebrafish orthologs by orthofinder of mouse genes; 2. substitute mouse genes to zebrafish ortholog genes in motifs-v9nr_clust-nr.mgi-m0.001-o0.0.tbl. The created db looks like this:
#motif_id gene_name motif_name motif_description source_name source_version motif_similarity_qvalue similar_motif_id similar_motif_description orthologous_identity orthologous_gene_name orthologous_species description
metacluster_89.24 ran RAN RAN hdpi 2009 0 None None 0.925926 ENSG00000132341 H. sapiens gene is orthologous to ENSG00000132341 in H. sapiens (identity = 92%) which is directly annotated for motif
metacluster_14.17 si:ch73-238c9.1 C19orf25 C19orf25 hdpi 2009 0 None None 0.669725 ENSG00000119559 H. sapiens gene is orthologous to ENSG00000119559 in H. sapiens (identity = 66%) which is directly annotated for motif
hdpi__ACF a1cf ACF ACF hdpi 2009 0 None None 0.892437 ENSG00000148584 H. sapiens gene is orthologous to ENSG00000148584 in H. sapiens (identity = 89%) which is directly annotated for motif
metacluster_28.13 abcf2a ABCF2 ABCF2 hdpi 2009 0 None None 0.982484 ENSG00000033050 H. sapiens gene is orthologous to ENSG00000033050 in H. sapiens (identity = 98%) which is directly annotated for motif
pyscenic version 0.11.2 running in conda environment
Any suggestions or comments would be really appreciated.
I believe that using the updated pySCENIC >=0.11.3 will fix this.
Hi @cflerin ,
I truly appreciate your answer. I may misunderstand but the latest version is 0.11.2, right? Did you mean that that issue is unable to resolve until the newer version is released? I have tried updating pyscenic but it's still 0.11.2.
Thank you for your help.
Sorry @stanaka6! 0.11.2 is indeed the current version, I must have gotten mixed up with another issue.
Here it seems that all of your modules are dropped due to low overlap with the database. I notice that your adj.tsv has lower case gene names vs upper case in the database (at least the few that I can see). So you should check to make sure you're using the same gene symbols across both, otherwise there's no way that an overlap can be found.
Hi @cflerin,
Thank you very much for your quick response.
Zebrafish gene names (symbols) are consist of both upper and lower cases. For example, ABR (ENSDARG00000059587; Chr5:62374092-62545913), abr (ENSDARG00000095180; Chr5:62611400-62642718) and MDFIC (ENSDARG00000074223; Chromosome 4: 6,339,572-6,416,414), etc. Therefore, my input data contains both upper and lower-case gene symbols. I am not sure if this causes the issues.
I manually checked overlaps and could see them (for example, TF "tbx3a" in both adj.tsv and Motif2TF db; tbx3a's target gene "dio2" in .feather cisTarget ranking database).
Also, I found there are only 71 warning messages when running pyscenic ctx
function (described in my 1 st post), but I do not understand this is relevant to my problem.
Here is the link for my input files (except for our original expression matrix loom file): https://umbc.box.com/s/0dez16fxn4nwvvhojr3y9hz43dq5442n I would really appreciate it if you have a chance to look at them if possible.
One of my concerns is whether the zebrafish Motif2TF database is good. As described above, I identified zebrafish orthologs of mouse genes by Orthofinder, and substituted mouse genes to zebrafish ortholog genes in motifs-v9nr_clust-nr.mgi-m0.001-o0.0.tbl. A mouse gene often matches several zebrafish genes -- and vise versa(see below pic; this is Orthofinder's output).
data:image/s3,"s3://crabby-images/289e0/289e0f98e4c659d94c8fb15a8b4ca16a2b319ada" alt="Screen Shot 2021-12-13 at 5 30 05 PM"
Therefore, I made a list of mouse and zebrafish genes from the above data, and separate the rows by commas. Also, I removed the genes from the Motif2TF database if there are no matched zebrafish genes. I am not sure the way I performed was correct.
The below is the partial code that I used for creating the zebrafish Motif2TF database.
## import motifs-v9nr_clust-nr.mgi-m0.001-o0.0.tbl as csv ----
tbl <- read.csv("motifs-v9nr_clust-nr.mgi-m0.001-o0.0.csv", header = TRUE)
## Make zebrafish-mouse gene substitute list ----
# Extract gene symbol from "motifs-v9nr_clust-nr.mgi-m0.001-o0.0."
mm.genes <- tbl$gene_name
# Annotate ensemble ID
library(AnnotationDbi)
library(org.Mm.eg.db)
df <- data.frame(gene = mm.genes)
df$EnsembleID <- mapIds(org.Mm.eg.db, keys = df$gene, column = "ENSEMBL", keytype = "SYMBOL")
# Leave unique values
unique.df <- df[!duplicated(df$gene), ]
colnames(unique.df) <- c("gene","mouse") # change colnames
## Load Orthofinder output (above picture) ----
orthofinder <- read.csv("Mus_musculus.GRCm39.pep.all__v__Danio_rerio.GRCz11.pep.all.csv")
colnames(orthofinder) <- c("orthogroup", "mouse", "zebrafish") # change col names
## Separate comma-separated values (mouse gene)
separate.df <- orthofinder %>%
tidyr::separate_rows(zebrafish, sep = ",") %>%
tidyr::separate_rows(mouse, sep = ",") %>%
dplyr::group_by(mouse) %>%
dplyr::summarise(zebrafish = paste0(sort(unique(na.omit(zebrafish))), collapse = ','))
# remove version numbers in mouse EnsembleID
separate.df$mouse <- sub("\\..*", "", separate.df$mouse)
# Merge with the gene list from "motifs-v9nr_clust-nr.mgi-m0.001-o0.0"
dfA <- unique.df %>%
dplyr::left_join(separate.df)
## Add zebrafish ID to Motif2TF table ---
dfA <- dfA[, -which(names(dfA) %in% "mouse")] # Remove mouse ensemble ID
colnames(dfA) <- c("gene_name", "zebrafish") # Change col names
merge <- merge(dfA, tbl, by = "gene_name", all = TRUE)
## separate comma-separated rows in zebrafish ID ----
merge2 <- tidyr::separate_rows(merge, zebrafish, sep = ",")
# remove version numbers in zebrafish EnsembleID
merge2$zebrafish <- sub("\\..*", "", merge2$zebrafish)
## Remove NA containing rows
completeFun <- function(data, desiredCols) {
completeVec <- complete.cases(data[, desiredCols])
return(data[completeVec, ])
}
merge2 <- completeFun(merge2, "zebrafish")
## Then, get the zebrafish gene name (symbols) and add them as a column...
Any suggestions would be truly appreciated.
I am having a similar issue. I would appreciate it if a solution is given to this. Thank you!
I am having a similar issue. I would appreciate it if a solution is given to this. Thank you!
Hi, I face the same problem. Do you have some solution for this? Really appreciate you if you have some comments. Thanks in advance,
Hi Frucelee, I have solved my own issue. I found that the problem was that I had incorrectly formatted my initial loom file. In your loom file, you can check to ensure that the Var and Obs are correctly set. You can also check that you're using the right organism and database. Best wishes.
Hi everyone, did someone manage to solve this issue? @Arinze-BioX it would be great to share on how to check on the right format of the loom file. thanks
Hi @MubasherMohammed , I think what I did was basic wrangling to check that I have the right number of genes and cells, and that I haven't wrongly transformed my expression matrix. That is, if I remember correctly, the Var are the genes and the Obs are the Cells. Then I checked that I used the right specie, genome build (e.g. mm9 vs mm10) and databases based on the explanations here: http://htmlpreview.github.io/?https://github.com/aertslab/SCENIC/blob/master/inst/doc/SCENIC_Setup.html
You can choose the right database by browsing this list of databases: https://resources.aertslab.org/cistarget/
I have moved on from this project, so I vaguely remember the details, and I hope my brief explanation here helps.
If you keep having issues after trying this, please share details of your issue and I can dig deeper and try to help.
Thank you @Arinze-BioX for your explanation. I also went through check for my files for pyscenic ctx regulons enrichment seems all good. I open this issue #389 hence and hope to get some insights on fixing it Thanks again
Hi @MubasherMohammed , I think what I did was basic wrangling to check that I have the right number of genes and cells, and that I haven't wrongly transformed my expression matrix. That is, if I remember correctly, the Var are the genes and the Obs are the Cells. Then I checked that I used the right specie, genome build (e.g. mm9 vs mm10) and databases based on the explanations here: http://htmlpreview.github.io/?https://github.com/aertslab/SCENIC/blob/master/inst/doc/SCENIC_Setup.html
You can choose the right database by browsing this list of databases: https://resources.aertslab.org/cistarget/
I have moved on from this project, so I vaguely remember the details, and I hope my brief explanation here helps.
If you keep having issues after trying this, please share details of your issue and I can dig deeper and try to help.
@Arinze-BioX @stanaka6 @MubasherMohammed Would you please explain how did you solve your problem regarding pyscenic ctx step empty output issue? I have raised one issue here: https://github.com/aertslab/pySCENIC/issues/407#issue-1301714888