create_cisTarget_databases
create_cisTarget_databases copied to clipboard
What is the correct workflow for creating tracks DBs?
First off, thanks for putting this together for the community.
I am trying to create a cisTarget track DB for mouse to use with pySCENIC, using a number of ChIP-seq (bigWig) files. I am aware of the multiple (still open) issues on related topics, and a tutorial e.g. PBMC10k_SCENIC-protocol-CLI-tracks, etc., but I must say, despite all that and the instructions given here, I am still a bit unsure about what is a correct workflow.
-
First step, score-all-tracks-at-once-and-create-rankings. Here, I am using
--tracksto pass a list of bigWig files, and for--bedI am using the files that you provide under regions, e.g. mm10-limited-upstream500-tss-downstream100-full-transcript.bed. This steps outputs the files described in the instructions, except *.tracks_vs_regions.rankings.feather. I have 1 question for this step in particular:- Here https://github.com/aertslab/create_cisTarget_databases/blob/fef07aeaf92ebb8ec3bc4baf26b674102fd38d08/create_cistarget_track_databases.py#L414 writing to disk of *.tracks_vs_regions.rankings.feather is commented out, so I guess we don't need this file, right?
-
Second step, I think from here instructions are a bit unclear for track-based annotations... According to my understanding, I need to run convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py using the output *.tracks_vs_regions.scores.feather from step 1. But it looks like it's not even reading in the file (same whether we use regions or genes with
--genes):
Traceback (most recent call last):
File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py", line 175, in <module>
main()
File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py", line 110, in main
ct_scores_db_motifs_or_tracks_vs_regions_or_genes = CisTargetDatabase.read_db(
File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/cistarget_db.py", line 980, in read_db
all_column_names = pa_table.column_names
UnboundLocalError: local variable 'pa_table' referenced before assignment
-
If Step 1 (and Step 2) above complete successfully, this would give me a "ranking database" to run pySCENIC. Here I have a few quetions:
-
For the "ranking database", actually which feather file should I use? (regions_vs_tracks.rankings.feather`? tracks_vs_tracks.rankings.feather? I don't know what would be the output of Step 2?). And actually do you recommend using "regions" or "genes" (e.g regions_vs_tracks.rankings vs. genes_vs_tracks.rankings) ?
-
Most importantly, how do I generate a matching "motif database" to use with
--annotations_fname( files like those under track2tf )?
-
-
Finally, a more general questions: How does this compare with region-based databases that you provide? Are these only usable with pycisTarget/SCENIC+?
Thanks for taking time to clarify, I'm sure this will help others as well.
I am running under: SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
and that's my environment:
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
arrow-cpp 10.0.1 ha770c72_6_cpu conda-forge
aws-c-auth 0.6.22 hd93a3ba_0 conda-forge
aws-c-cal 0.5.20 hff2c3d7_3 conda-forge
aws-c-common 0.8.5 h166bdaf_0 conda-forge
aws-c-compression 0.2.16 hf5f93bc_0 conda-forge
aws-c-event-stream 0.2.18 h57874a7_0 conda-forge
aws-c-http 0.7.0 h96ef541_0 conda-forge
aws-c-io 0.13.12 h57ca295_1 conda-forge
aws-c-mqtt 0.7.13 h0b5698f_12 conda-forge
aws-c-s3 0.2.3 hc08f4d7_1 conda-forge
aws-c-sdkutils 0.1.7 hf5f93bc_0 conda-forge
aws-checksums 0.1.14 h6027aba_0 conda-forge
aws-crt-cpp 0.18.16 h20a6995_11 conda-forge
aws-sdk-cpp 1.10.57 ha834a50_1 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2022.12.7 ha878542_0 conda-forge
gflags 2.2.2 he1b5a44_1004 conda-forge
glog 0.6.0 h6f12383_0 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
krb5 1.20.1 h81ceb04_0 conda-forge
ld_impl_linux-64 2.40 h41732ed_0 conda-forge
libabseil 20220623.0 cxx17_h05df665_6 conda-forge
libarrow 10.0.1 hf9c26a6_6_cpu conda-forge
libblas 3.9.0 16_linux64_openblas conda-forge
libbrotlicommon 1.0.9 h166bdaf_8 conda-forge
libbrotlidec 1.0.9 h166bdaf_8 conda-forge
libbrotlienc 1.0.9 h166bdaf_8 conda-forge
libcblas 3.9.0 16_linux64_openblas conda-forge
libcrc32c 1.1.2 h9c3ff4c_0 conda-forge
libcurl 7.87.0 hdc1c0ab_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libevent 2.1.10 h28343ad_4 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 12.2.0 h65d4601_19 conda-forge
libgfortran-ng 12.2.0 h69a702a_19 conda-forge
libgfortran5 12.2.0 h337968e_19 conda-forge
libgomp 12.2.0 h65d4601_19 conda-forge
libgoogle-cloud 2.5.0 h21dfe5b_1 conda-forge
libgrpc 1.51.1 h30feacc_0 conda-forge
liblapack 3.9.0 16_linux64_openblas conda-forge
libllvm11 11.1.0 he0ac6c6_5 conda-forge
libnghttp2 1.51.0 hff17c54_0 conda-forge
libnsl 2.0.0 h7f98852_0 conda-forge
libopenblas 0.3.21 pthreads_h78a6416_3 conda-forge
libprotobuf 3.21.12 h3eb15da_0 conda-forge
libsqlite 3.40.0 h753d276_0 conda-forge
libssh2 1.10.0 hf14f497_3 conda-forge
libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
libthrift 0.16.0 he500d00_2 conda-forge
libutf8proc 2.8.0 h166bdaf_0 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libzlib 1.2.13 h166bdaf_4 conda-forge
llvmlite 0.39.1 py310h58363a5_1 conda-forge
lz4-c 1.9.4 hcb278e6_0 conda-forge
ncurses 6.3 h27087fc_1 conda-forge
numba 0.56.4 py310ha5257ce_0 conda-forge
numpy 1.21.6 py310h45f3432_0 conda-forge
openssl 3.0.7 h0b41bf4_2 conda-forge
orc 1.8.2 hfdbbad2_0 conda-forge
pandas 1.5.3 py310h9b08913_0 conda-forge
parquet-cpp 1.5.1 2 conda-forge
pip 23.0 pyhd8ed1ab_0 conda-forge
pyarrow 10.0.1 py310h633f555_6_cpu conda-forge
python 3.10.8 h4a9ceb5_0_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-flatbuffers 23.1.21 pyhd8ed1ab_0 conda-forge
python_abi 3.10 3_cp310 conda-forge
pytz 2022.7.1 pyhd8ed1ab_0 conda-forge
re2 2022.06.01 h27087fc_1 conda-forge
readline 8.1.2 h0f457ee_0 conda-forge
s2n 1.3.31 h3358134_0 conda-forge
setuptools 66.1.1 pyhd8ed1ab_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
snappy 1.1.9 hbd366e4_2 conda-forge
tk 8.6.12 h27826a3_0 conda-forge
tzdata 2022g h191b570_0 conda-forge
wheel 0.38.4 pyhd8ed1ab_0 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
zlib 1.2.13 h166bdaf_4 conda-forge
zstd 1.5.2 h3eb15da_6 conda-forge
writing to disk of
*.tracks_vs_regions.rankings.featheris commented out, so I guess we don't need this file, right?
No, this file does not need to be saved. It is only temporarily needed in memory. Writing this file takes quite a bit of time (if you have a lot of regions).
The second step would only be needed if iyou created the the first step in multiple steps (when using --partial CURRENT_PART NBR_TOTAL_PARTS). If you have *.regions_vs_tracks.rankings.feather , you are fine.
@eboileau What was the full command that you ran? I suspect that your input feather file you used was not in the proper format.
Traceback (most recent call last):
File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py", line 175, in <module>
main()
File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py", line 110, in main
ct_scores_db_motifs_or_tracks_vs_regions_or_genes = CisTargetDatabase.read_db(
File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/cistarget_db.py", line 980, in read_db
all_column_names = pa_table.column_names
UnboundLocalError: local variable 'pa_table' referenced before assignment
Thanks for your reply.
I ran the first step create_cistarget_track_databases.py, as explained above, using a small dataset example of 10 tracks. It's been some time, but I think this is the right log output:
Initialize dataframe (92634 regions x 10 tracks) for storing track scores for each regions per track.
Adding bigWigAverageOverBed track scores (1 of 10) for track "ERX3374317" took 0.103590 seconds.
Adding bigWigAverageOverBed track scores (2 of 10) for track "ERX3374325" took 0.066980 seconds.
Adding bigWigAverageOverBed track scores (3 of 10) for track "ERX3374319" took 0.054182 seconds.
Adding bigWigAverageOverBed track scores (4 of 10) for track "ERX3374323" took 0.061743 seconds.
Adding bigWigAverageOverBed track scores (5 of 10) for track "ERX3374321" took 0.045412 seconds.
Adding bigWigAverageOverBed track scores (6 of 10) for track "ERX3374320" took 0.027367 seconds.
Adding bigWigAverageOverBed track scores (7 of 10) for track "ERX3374324" took 0.056523 seconds.
Adding bigWigAverageOverBed track scores (8 of 10) for track "ERX3374318" took 0.048575 seconds.
Adding bigWigAverageOverBed track scores (9 of 10) for track "ERX3374322" took 0.035871 seconds.
Adding bigWigAverageOverBed track scores (10 of 10) for track "ERX3374326" took 0.035685 seconds.
Create rankings from "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.tracks_vs_regions.scores.feather" with random seed set to 1984.
Writing cisTarget regions vs tracks scores db: "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.tracks_vs_regions.scores.feather"
Writing cisTarget regions vs tracks scores db took: 0.106667 seconds
Writing cisTarget tracks vs regions scores db: "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.regions_vs_tracks.scores.feather"
Writing cisTarget tracks vs regions scores db took: 19.852678 seconds
Creating cisTarget rankings db from cisTarget scores db took: 0.174052 seconds
Writing cisTarget tracks vs regions rankings db: "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.regions_vs_tracks.rankings.feather"
Writing cisTarget tracks vs regions rankings db took: 7.912627 seconds
Using the output of step 1, I then ran step 2
./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py \
-i /prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.tracks_vs_regions.scores.feather \
--seed 1984
which resulted in the above error.
The files are here: https://data.dieterichlab.org/s/8Npf6r5bGdfaASM
Thanks for providing the test files.
In case you would need to run 1convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py in the future, the bug is fixed (missing 1 character): https://github.com/aertslab/create_cisTarget_databases/commit/9572f4147bcd9853571a687a2b3833241bff641d#diff-09ae4856d6de6fde66c2e4398454f6d41a333c12d018f20b0f0969b2f5b95cffR508
Hi @ghuls thanks for taking time to fix this. I'd appreciate your comments on some more general questions:
If Step 1 (and Step 2) above complete successfully, this would give me a "ranking database" to run pySCENIC. Here I have a few questions:
- For the "ranking database", actually which feather file should I use? (regions_vs_tracks.rankings.feather`? tracks_vs_tracks.rankings.feather? I don't know what would be the output of Step 2?). And actually do you recommend using "regions" or "genes" (e.g regions_vs_tracks.rankings vs. genes_vs_tracks.rankings) ?
Ok, so now I have my *regions_vs_tracks.rankings.feather to run pySCENIC. What about the difference between "regions" or "genes", i.e. more in terms of overall performance, do you have any experience?
- Most importantly, how do I generate a matching "motif database" to use with
--annotations_fname( files like those under track2tf )?
Do I have to generate this from scratch? Are you aware of any resource for mouse?
- Finally, a more general questions: How does this compare with region-based databases that you provide? Are these only usable with pycisTarget/SCENIC+?
?
Ok, so now I have my *regions_vs_tracks.rankings.feather to run pySCENIC. What about the difference between "regions" or "genes", i.e. more in terms of overall performance, do you have any experience? Finally, a more general questions: How does this compare with region-based databases that you provide? Are these only usable with pycisTarget/SCENIC+?
pySCENIC will only work with gene based rankings databases as it will also use the expression data. Region based databases are used in SCENIC+ as you would have access to ATAC data.
But in general region databases are "better" than gene databases as they have a higher resolution (way more number of regions than number of genes). The disadvantage of region based databases is that you would need to associate a region with a downstream affected gene. In the gene based databases we assume that a e.g 10kb area around the TSS is correlated with the expression of the gene, but this dismisses distal enhancers.
Do I have to generate this from scratch? Are you aware of any resource for mouse? Yes you will have to generate them from scratch. Here are some old examples from files in that format:
https://resources.aertslab.org/cistarget/track2tf/
You will need a file with the following header: #motif_id\tgene_name\tmotif_similarity_qvalue\torthologous_identity\tdescription\n.
The format was originally only meant for motifs and not tracks, so the header is a bit weird:
motif_idyou use your track name (ChIP-seq names in your database).gene_nameyou use the TF name associated with this dataset.motif_similarity_qvalue:0(no motif similarity used as we have tracks).orthologous_identity:1(no orthologous annotation from another species as we assume you don't use ChIP-seq data from another species for scoring)description: whatever metadata you like (not used)
| #motif_id | gene_name | motif_similarity_qvalue | orthologous_identity | description |
|---|---|---|---|---|
| accession | TF | 0 | 1 | description |
| ENCFF522XXJ | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) |
| ENCFF355QJK | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) |
| ENCFF963FLO | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) |
| ENCFF013PDC | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) |
| ENCFF508BVS | kmg | 0 | 1 | eGFP-CG5204 (adult: unknown ) |
| ENCFF560OIN | kmg | 0 | 1 | eGFP-CG5204 (adult: unknown ) |
| ENCFF819BQU | kmg | 0 | 1 | eGFP-CG5204 (adult: unknown ) |
BTW, our SCENIC+ public motif collection is now available: https://resources.aertslab.org/cistarget/motif_collections/
moitf2TF snapshots for that collection can be found at: https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/snapshots/
To create annotation for other species (if you would make one for non-human,mouse,fly) try to find orthologs for your species for the gene name mentioned in the gene_name (column 6) column and replace it with the gene name of your species. If you don't have an ortholog, for that gene, delete the line.
Thanks @ghuls this is really useful. Let me take a few days to digest all this and go back to my project, then I'll mark this issue as resolved.