pySCENIC

Missing TFs in resources

Open nasjr08 opened this issue 2 years ago • 9 comments

I have a few TFs that appear to be missing from the TF list in hs_hgnc_tfs.txt, which is available in the resources folder. The TFs are: ZFHX4, AEBP1, CXXC5, TSHZ2.

They are also missing in all other lists in the resources folder. Does anyone know alternative resources for pyscenic that might include these TFs?

I assume I can't just add them to the txt file, right?

Finally, how has the TF list been compiled? Is it just based on available databases for these TFs?

Many thanks, Naseer Basma

nasjr08 avatar Apr 05 '22 10:04 nasjr08

Hi @nasjr08

Great remark!! You are right, we have to update the list of TFs.

You are also right that you can just add them to the file.
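A minimal sketch of adding the missing TFs, assuming the TF list is a plain text file with one gene symbol per line. The helper name `add_tfs` and the file name `my_tfs.txt` are hypothetical; working on a copy of hs_hgnc_tfs.txt keeps the original resource untouched.

```python
from pathlib import Path


def add_tfs(tf_list_path, new_tfs):
    """Append gene symbols that are not yet in a one-symbol-per-line TF list.

    Returns the symbols that were actually added.
    """
    path = Path(tf_list_path)
    lines = path.read_text().splitlines() if path.exists() else []
    existing = [line.strip() for line in lines if line.strip()]
    known = set(existing)
    # dict.fromkeys removes duplicates in new_tfs while keeping their order.
    missing = [tf for tf in dict.fromkeys(new_tfs) if tf not in known]
    if missing:
        path.write_text("\n".join(existing + missing) + "\n")
    return missing


# Example call (hypothetical file name, a copy of hs_hgnc_tfs.txt):
# add_tfs("my_tfs.txt", ["ZFHX4", "AEBP1", "CXXC5", "TSHZ2"])
```

Note that adding a TF to the list only lets it through the co-expression step; it can still be dropped later if no motif is annotated to it.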

As to how these lists were compiled, see https://github.com/aertslab/pySCENIC/commit/c74a6ebdbcd3b63be3166dbaf63f2ebbb6b218b6

Best,

S

SeppeDeWinter avatar Apr 05 '22 13:04 SeppeDeWinter

Great! I'll do that then. Thank you!

Separate question, but how are the motif annotations compiled? I am currently using motifs-v9-nr.hgnc-m0.001-o0.0.tbl in my code, but I feel the output is missing a few TFs. I am fully aware that the last thing I want to do is introduce my own biases, but there are 2 TFs that correlate beautifully with genes we expect them to correlate with (genes involved in biological processes that the TFs are associated with), yet these TFs are pruned at the AUCell stage.

I was thinking maybe their motifs are not well defined/aren't in the above motif annotation table. What parameters do you recommend playing around with in the code for RcisTarget and how do I call these arguments on the terminal? Current code is:

singularity run pyscenic_0.11.2.simg pyscenic ctx adj.csv hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather --annotations_fname motifs-v9-nr.hgnc-m0.001-o0.0.tbl --expression_mtx_fname MYFILE.loom --mode "dask_multiprocessing" -o reg_win.csv --num_workers 32

nasjr08 avatar Apr 05 '22 16:04 nasjr08

Hi

First of all, would you mind sharing which 2 TFs you are interested in? That way I can also check whether they are missing from our databases.

The motif-to-TF annotation is based on three things:

  1. A motif can be directly annotated to a TF (for example, the motif comes from a ChIP-seq experiment for that particular TF).
  2. A motif can be annotated to a TF based on motif similarity: the motif is similar (to a certain degree) to another motif that has a direct annotation, so it inherits that annotation.
  3. A motif can be annotated to a TF based on orthology: the motif comes from an experiment with a TF in a different species that has an orthologous TF in the species you are analysing, so it inherits the orthologous annotation.

Thus, the only way of adding extra TFs is by either:

  1. Adding more motifs with a direct annotation
  2. Relaxing the thresholds for considering a motif as similar or a TF as orthologous.

These last thresholds can be changed by setting --min_orthologous_identity and --max_similarity_fdr. Unfortunately for you, they are already set at the least strict threshold (i.e. you can only make them more strict).
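To illustrate what these two thresholds do, here is a rough sketch that filters the rows of a motif annotation table. It assumes a tab-separated file whose header contains the columns `gene_name`, `motif_similarity_qvalue` and `orthologous_identity`, and it assumes that the `m0.001-o0.0` part of the file name reflects the thresholds already applied when the table was built. The function names are hypothetical; this is not the pySCENIC implementation.

```python
import csv


def keep_row(row, max_similarity_fdr=0.001, min_orthologous_identity=0.0):
    """Return True if an annotation row passes both thresholds."""
    q_value = float(row["motif_similarity_qvalue"])
    identity = float(row["orthologous_identity"])
    return q_value <= max_similarity_fdr and identity >= min_orthologous_identity


def annotated_tfs(tbl_path, **thresholds):
    """Collect the TFs that still have at least one motif after filtering."""
    with open(tbl_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["gene_name"] for row in reader if keep_row(row, **thresholds)}
```

Raising `min_orthologous_identity` or lowering `max_similarity_fdr` can only shrink the returned set, which is why stricter settings never recover a missing TF.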

What I would advise you to do is:

  1. Check whether your TF of interest is annotated to a motif. This can be done by checking whether it is present in the motifs-v9-nr.hgnc-m0.001-o0.0.tbl file (a simple grep is enough, for example cat motifs-v9-nr.hgnc-m0.001-o0.0.tbl | grep <TF_of_interest>)
  2. We also have a cistarget database based on ChIP-seq enrichment (rather than motif enrichment). For example see: encode_20190621__ChIP_seq_transcription_factor.hg38__refseq-r80__10kb_up_and_down_tss.max on https://resources.aertslab.org/cistarget/. With this you might have better luck. Make sure to download the correct TF annotation for this one if you decide to use it: https://resources.aertslab.org/cistarget/track2tf/encode_project_20190621__ChIP-seq_transcription_factor.homo_sapiens.hg38.bigwig_signal_pvalue.track_to_tf_in_motif_to_tf_format.tsv
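The check in step 1 can also be done for several TFs at once. The sketch below counts annotation rows per TF, assuming the table is tab-separated with a `gene_name` column; the function name `motif_counts` is hypothetical.

```python
import csv


def motif_counts(tbl_path, tfs, gene_column="gene_name"):
    """Count annotation rows per TF; a count of 0 means no motif is annotated."""
    counts = {tf: 0 for tf in tfs}
    with open(tbl_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            gene = row[gene_column]
            if gene in counts:
                counts[gene] += 1
    return counts


# Example call:
# motif_counts("motifs-v9-nr.hgnc-m0.001-o0.0.tbl",
#              ["PRRX1", "RUNX2", "ZFHX4", "AEBP1", "CXXC5", "TSHZ2"])
```

Unlike a plain grep, this compares the whole gene symbol, so a query for RUNX2 does not also match a longer symbol or a mention of the name in another column.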

Finally, we are working on a new motif database, which should be released fairly soon; this one might contain more motifs for your TF of interest.

SeppeDeWinter avatar Apr 06 '22 13:04 SeppeDeWinter

Ah I see, that's massively helpful. I have actually checked the TF motifs in the motifs-v9-nr.hgnc-m0.001-o0.0.tbl and they all appear to be there. I am looking at PRRX1 and RUNX2 in this case, so I don't think what I initially suggested about vaguely defined motifs as the reason they didn't turn up is true.

Question regarding the methodology: as the motif enrichment relies on genomic regions around the TSS, isn't there a good chance that you might have drop-outs, since a lot of TFs bind to enhancers further away? I guess the ChIP-enrichment database might help in this regard, but when I use the TSS Ref data, would this be an issue?

I was in the process of repeating with the ChIP enrichment database but missed the fact I needed a different annotation table, so thanks for that!

I will also add that all 4 TFs mentioned in my first post (ZFHX4, AEBP1, CXXC5, TSHZ2) don't have any present motifs in the motif annotation table, which might explain why they are not in the TF list. I don't think they are likely to show up in the analysis because of this reason. Let me know if they are in your updated motif database!

nasjr08 avatar Apr 06 '22 16:04 nasjr08

Hi

> Ah I see, that's massively helpful. I have actually checked the TF motifs in the motifs-v9-nr.hgnc-m0.001-o0.0.tbl and they all appear to be there. I am looking at PRRX1 and RUNX2 in this case, so I don't think what I initially suggested about vaguely defined motifs as the reason they didn't turn up is true.

Indeed, I have definitely seen PRRX1 and RUNX2 before in SCENIC outputs.

> Question regarding the methodology: as the motif enrichment relies on genomic regions around the TSS, isn't there a good chance that you might have drop-outs, since a lot of TFs bind to enhancers further away? I guess the ChIP-enrichment database might help in this regard, but when I use the TSS Ref data, would this be an issue?

This is very true! We certainly miss a lot by only looking at the regions surrounding the TSS. With scRNA-seq alone, however, it is the best we can do (how would you know which regions are enhancers, and how would you link these enhancers to their target genes?). Integrating information on chromatin accessibility would help with this. ChIP-seq data could in theory also help, but even there we don't look too far up- or downstream of the gene, because we don't know which ChIP-seq peaks are or are not regulating the gene.

> I will also add that all 4 TFs mentioned in my first post (ZFHX4, AEBP1, CXXC5, TSHZ2) don't have any present motifs in the motif annotation table, which might explain why they are not in the TF list. I don't think they are likely to show up in the analysis because of this reason. Let me know if they are in your updated motif database!

Thanks for mentioning these 4 TFs. We will keep them in mind while generating the new database. If they don't have any motif annotation, they will indeed be lost at the pruning step, unfortunately. We will keep you posted about the new database. You can, however, still predict their target genes by skipping the pruning step, and afterwards run cisTarget on these genes to do the pruning more manually. Let me know if you need help with this.
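As a rough sketch of the "skip the pruning step" idea, the snippet below pulls the strongest co-expression targets of one TF straight from the adjacency file. It assumes adj.csv is a comma-separated file with the columns `TF`, `target` and `importance`; the function name `top_targets` and the cut-off of 50 genes are arbitrary choices for illustration, not a recommendation.

```python
import csv


def top_targets(adj_path, tf, n=50):
    """Return the n targets of a TF with the highest importance score."""
    targets = []
    with open(adj_path, newline="") as handle:
        for row in csv.DictReader(handle):
            if row["TF"] == tf:
                targets.append((row["target"], float(row["importance"])))
    # Highest importance first.
    targets.sort(key=lambda pair: pair[1], reverse=True)
    return [gene for gene, _ in targets[:n]]


# Example call:
# zfhx4_targets = top_targets("adj.csv", "ZFHX4")
```

A gene list obtained this way has no motif or ChIP support yet, so it should be treated as an unpruned co-expression module.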

SeppeDeWinter avatar Apr 11 '22 07:04 SeppeDeWinter

I also ran into missing-TF problems, so I might try my luck with these ChIP-seq-derived databases, right?

encode_20190621__ChIP_seq_transcription_factor.hg38__refseq-r80__10kb_up_and_down_tss.max encode_20190621__ChIP_seq_transcription_factor.hg38__refseq-r80__500bp_up_and_100bp_down_tss.max

I have always used these two files on my human data: hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr, hg38__refseq-r80__10kb_up_and_down_tss.mc9nr

Can I use a mixture of all four cisTarget databases for the SCENIC ctx analysis?

Thank you for your answer.

chansigit avatar Apr 14 '22 07:04 chansigit

Thank you Seppe for the help. Since cisTarget on the gene list is normally run automatically, can I please confirm that, if I were to do it manually with ChIP data for the missing TFs (e.g. ZFHX4), the steps below are correct:

  • Acquire ChIP data for the TF of interest.
  • Perform peak calling, as you normally would with ChIP data.
  • Identify and rank coordinates that lie within x bp of the TSS of genes (based on pileup value or enrichment scores).
  • Use this ChIP ranking as the basis for AUCell (gene list from GENIE3 on the single-cell data, and genes ranked based on the ChIP data).
  • Use this for manual pruning, followed by regulon activity scoring using AUCell for each cell individually as normal.
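The ranking step in the list above could be sketched roughly as follows. This assumes peaks are already called and available as (chromosome, summit position, score) tuples, and that a TSS position is known per gene. The function name `rank_genes_by_chip`, the 10 kb default window and the "best peak score per gene" rule are all illustrative assumptions, not how the cisTarget databases are built.

```python
def rank_genes_by_chip(peaks, tss, window=10_000):
    """Rank genes by the best ChIP peak score found near their TSS.

    peaks:  list of (chrom, summit, score) tuples
    tss:    dict mapping gene symbol -> (chrom, position)
    window: maximum distance in bp between peak summit and TSS
    """
    best = {gene: 0.0 for gene in tss}
    for gene, (chrom, position) in tss.items():
        for peak_chrom, summit, score in peaks:
            if peak_chrom == chrom and abs(summit - position) <= window:
                best[gene] = max(best[gene], score)
    # Genes with the strongest nearby peak come first; genes without a peak keep score 0.
    return sorted(best, key=lambda gene: best[gene], reverse=True)


# Example call with toy data:
# rank_genes_by_chip([("chr1", 1000, 5.0)], {"GENE_A": ("chr1", 1200)})
```

The nested loop is fine for a toy example but scales poorly; for genome-wide data an interval-based tool would be the more practical choice.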

Let me know if this sounds good, as I will try to find publicly available ChIP datasets for the missing TFs.

Thank you, Nas

nasjr08 avatar Apr 18 '22 15:04 nasjr08

Just to follow up on this, Seppe: when I originally performed the analysis, I ran it on a subcluster of my cell types. There, RUNX2 and PRRX1 got pruned, as mentioned before. When I repeated it on all my cells with all cell types, they were not pruned and appeared in the final analysis, with their activity being primarily in the subcluster that I mentioned before (as I would expect). How would you interpret these results, and why do you think these 2 were pruned in the former analysis?

Many thanks, Nas

nasjr08 avatar Apr 21 '22 18:04 nasjr08

New TF lists are available at: https://resources.aertslab.org/cistarget/tf_lists/

ghuls avatar Sep 14 '22 09:09 ghuls

> New TF lists are available at: https://resources.aertslab.org/cistarget/tf_lists/

I noticed that NOTCH1/4 are missing from this TF list. I am going to go ahead and add them to my personal TF list; however, will I need to do anything to the file (motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl)?

sdettle avatar Sep 14 '22 21:09 sdettle