PPanGGOLiN icon indicating copy to clipboard operation
PPanGGOLiN copied to clipboard

Searching for spots using the gene as a query

Open GuilhemRoyer opened this issue 2 years ago • 7 comments

Hi Guillaume, Adelme and Jerome!

I am trying to find all the spots which contain a specific gene. Is there a way to do it before drawing the resulting spots?

Best,

Guilhem

GuilhemRoyer avatar Oct 07 '22 14:10 GuilhemRoyer

I have found a way to do this but I am not sure I get all the spots: I search for my gene family in all the projection files and then get the column which, if I understand correctly, is the list of spots where the gene can be found.

Am I right to do it this way?

GuilhemRoyer avatar Oct 07 '22 14:10 GuilhemRoyer

Hi !

I'm pretty sure there is no spot-related information in the projection files. However, there is a useful command line that you can use that will tell you everything that is related to a gene:

ppanggolin align -p pangenome.h5 -o MYOUTPUTDIR --sequences MY_SEQUENCES_OF_INTEREST.fasta --getinfo

This will give you the exact list of spots, and RGPs, where the genes are found, in a tsv file called 'info_input_seq.tsv'. It should write one line for each gene in your .fasta file.

The format is described in the last table of the Align wiki page here

You can use any fasta sequence for this, and it will get assigned to its closest member gene family within the pangenome, then extract all info related to this gene family, including the list of spots (and RGPs) in which is it found !

Hopefully that helps !

Adelme

axbazin avatar Oct 07 '22 15:10 axbazin

Thank you Adelme! In fact there is spot-related information in the projection files (column 13). So I was able to get the information but it was more tricky than the dedicated solution you propose.

Unfortunately I have still a problem while drawing spots. It returns an error that seems to be related to the size of some spots ("RecursionError: maximum recursion depth exceeded while getting the str of an object"). Indeed it always freeze to the same spot that contains 2003 identical_rgp. Did you already experienced it?

Guilhem

GuilhemRoyer avatar Oct 07 '22 16:10 GuilhemRoyer

aha, that's a newish addition I was not aware (I probably forgot haha), and the doc is not up-to-date, I will fix this. As I see from the code, what is in which row can change depending on what you ask as output.

I have never met this error in that context, can you tell me which command line you used exactly, and show me the log ? There is probably a bug to fix there.

axbazin avatar Oct 07 '22 18:10 axbazin

Thank you Adelme. Here is the command that I used:

ppanggolin align -c 12 -p pangenome.h5 -o iroN_related_spots --sequences iroN.fasta --draw_related

I can get a few spots, then it crashes with the following error:

2022-10-08 12:20:22 main.py:l214 INFO Command: /pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/bin/ppanggolin align -c 12 -p pangenome.h5 -o iroN_related_spots --sequences iroN.fasta --draw_related 2022-10-08 12:20:22 main.py:l215 INFO PPanGGOLiN version: 1.2.74 2022-10-08 12:20:22 readBinaries.py:l39 INFO Getting the current pangenome's status 2022-10-08 12:20:23 readBinaries.py:l389 INFO Reading pangenome annotations... 100%|██████████| 11004557/11004557 [00:53<00:00, 205145.75gene/s] 100%|████��2022-10-08 12:25:38 readBinaries.py:l403 INFO] Reading pangenome gene families... 100%|██████████| 2302/2302 [04:19<00:00, 8.88organism/s] 100%|██████████| 10746569/10746569 [01:17<00:00, 139480.95gene/s] 2022-10-08 12:26:56 readBinaries.py:l418 INFO Reading the RGP... 100%|██████████| 52908/52908 [00:00<00:00, 53671.87gene family/s] 94%|█████████▍| 2890752/3060047 [00:12<00:00,2022-10-08 12:27:10 readBinaries.py:l425 INFO Reading the spots... 100%|██████████| 3060047/3060047 [00:13<00:00, 226261.49gene/s] 99%|█████████▉| 163003/164541 [06:57<00:05, 2022-10-08 12:34:10 alignOnPang.py:l46 INFO Aligning sequences to cluster representatives... 2022-10-08 12:34:12 alignOnPang.py:l51 INFO Extracting alignments... 2022-10-08 12:34:13 alignOnPang.py:l145 INFO Writing RGP and spot information related to hits in the pangenome 2022-10-08 12:35:44 alignOnPang.py:l165 INFO Drawing the 9 spots with more than 1 organization related to hits of the input sequences... 2022-10-08 12:35:44 draw_spot.py:l448 INFO Ordering genes among regions, and drawing spots... 100%|██████████| 164541/164541 [06:57<00:00, 393.79region/s] 11%|█ | 1/9 [00:02<00:22, 2.84s/spot] Traceback (most recent call last): File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/bin/ppanggolin", line 10, in sys.exit(main()) File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/ppanggolin/main.py", line 241, in main ppanggolin.align.launch(args) File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/ppanggolin/align/alignOnPang.py", line 264, in launch align(pangenome=pangenome, sequenceFile=args.sequences, output=args.output, tmpdir=args.tmpdir, cpu=args.cpu, File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/ppanggolin/align/alignOnPang.py", line 247, in align getSeqInfo(seq2pang, pangenome, output, draw_related, disable_bar=disable_bar) File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/ppanggolin/align/alignOnPang.py", line 168, in getSeqInfo drawSelectedSpots(drawn_spots, pangenome, output, pangenome.parameters["spots"]["overlapping_match"], File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/ppanggolin/figures/draw_spot.py", line 492, in drawSelectedSpots GeneLists = orderGeneLists(GeneLists, overlapping_match, exact_match, set_size) File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/ppanggolin/figures/draw_spot.py", line 50, in orderGeneLists return rowOrderGeneLists(geneLists) File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/ppanggolin/figures/draw_spot.py", line 72, in rowOrderGeneLists dendro = dendrogram(hc, no_plot=True) File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py", line 3349, in dendrogram _dendrogram_calculate_info( File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py", line 3652, in _dendrogram_calculate_info _dendrogram_calculate_info( File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py", line 3652, in _dendrogram_calculate_info _dendrogram_calculate_info( File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py", line 3619, in _dendrogram_calculate_info _dendrogram_calculate_info( File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py", line 3652, in _dendrogram_calculate_info _dendrogram_calculate_info( File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py", line 3652, in _dendrogram_calculate_info _dendrogram_calculate_info( File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py", line 3652, in _dendrogram_calculate_info _dendrogram_calculate_info( [Previous line repeated 981 more times] File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py", line 3619, in _dendrogram_calculate_info _dendrogram_calculate_info( File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py", line 3549, in _dendrogram_calculate_info _append_singleton_leaf_node(Z, p, n, level, lvs, ivl, File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py", line 3427, in _append_singleton_leaf_node ivl.append(str(int(i))) RecursionError: maximum recursion depth exceeded while getting the str of an object

Or is there another way to get the representative of each RGP without drawing spots (i.e. the file "spot_XX_identical_rgps.tsv") ?

Guilhem

GuilhemRoyer avatar Oct 08 '22 10:10 GuilhemRoyer

Hi,

I finally fix the problem by increasing the recursion limit of Python in the script "Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py"

""" import sys print(sys.getrecursionlimit(2500)) """

But, not sure if it is the best way to solve the problem...

Guilhem

GuilhemRoyer avatar Oct 08 '22 14:10 GuilhemRoyer

Hi ! This was indeed the best way to solve that problem, if it indeed got you to the files in the end. Having to draw the spots to get the files is not super practical I admit, it could probably be better as a “standard” output file. Adelme

axbazin avatar Oct 09 '22 13:10 axbazin

Closing as this is supposedly fixed in master and will be in the next release (fixed it exactly the way you did, but in ppanggolin's code)

I don't have the means to test that it actually works though :(

axbazin avatar Feb 15 '23 17:02 axbazin