PPanGGOLiN
PPanGGOLiN copied to clipboard
Searching for spots using the gene as a query
Hi Guillaume, Adelme and Jerome!
I am trying to find all the spots which contain a specific gene. Is there a way to do it before drawing the resulting spots?
Best,
Guilhem
I have found a way to do this but I am not sure I get all the spots: I search for my gene family in all the projection files and then get the column which, if I understand correctly, is the list of spots where the gene can be found.
Am I right to do it this way?
Hi !
I'm pretty sure there is no spot-related information in the projection files. However, there is a useful command line that you can use that will tell you everything that is related to a gene:
ppanggolin align -p pangenome.h5 -o MYOUTPUTDIR --sequences MY_SEQUENCES_OF_INTEREST.fasta --getinfo
This will give you the exact list of spots, and RGPs, where the genes are found, in a tsv file called 'info_input_seq.tsv'. It should write one line for each gene in your .fasta file.
The format is described in the last table of the Align wiki page here
You can use any fasta sequence for this, and it will get assigned to its closest member gene family within the pangenome, then extract all info related to this gene family, including the list of spots (and RGPs) in which is it found !
Hopefully that helps !
Adelme
Thank you Adelme! In fact there is spot-related information in the projection files (column 13). So I was able to get the information but it was more tricky than the dedicated solution you propose.
Unfortunately I have still a problem while drawing spots. It returns an error that seems to be related to the size of some spots ("RecursionError: maximum recursion depth exceeded while getting the str of an object"). Indeed it always freeze to the same spot that contains 2003 identical_rgp. Did you already experienced it?
Guilhem
aha, that's a newish addition I was not aware (I probably forgot haha), and the doc is not up-to-date, I will fix this. As I see from the code, what is in which row can change depending on what you ask as output.
I have never met this error in that context, can you tell me which command line you used exactly, and show me the log ? There is probably a bug to fix there.
Thank you Adelme. Here is the command that I used:
ppanggolin align -c 12 -p pangenome.h5 -o iroN_related_spots --sequences iroN.fasta --draw_related
I can get a few spots, then it crashes with the following error:
2022-10-08 12:20:22 main.py:l214 INFO Command: /pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/bin/ppanggolin align -c 12 -p pangenome.h5 -o iroN_related_spots --sequences iroN.fasta --draw_related
2022-10-08 12:20:22 main.py:l215 INFO PPanGGOLiN version: 1.2.74
2022-10-08 12:20:22 readBinaries.py:l39 INFO Getting the current pangenome's status
2022-10-08 12:20:23 readBinaries.py:l389 INFO Reading pangenome annotations...
100%|██████████| 11004557/11004557 [00:53<00:00, 205145.75gene/s]
100%|████��2022-10-08 12:25:38 readBinaries.py:l403 INFO] Reading pangenome gene families...
100%|██████████| 2302/2302 [04:19<00:00, 8.88organism/s]
100%|██████████| 10746569/10746569 [01:17<00:00, 139480.95gene/s]
2022-10-08 12:26:56 readBinaries.py:l418 INFO Reading the RGP...
100%|██████████| 52908/52908 [00:00<00:00, 53671.87gene family/s]
94%|█████████▍| 2890752/3060047 [00:12<00:00,2022-10-08 12:27:10 readBinaries.py:l425 INFO Reading the spots...
100%|██████████| 3060047/3060047 [00:13<00:00, 226261.49gene/s]
99%|█████████▉| 163003/164541 [06:57<00:05, 2022-10-08 12:34:10 alignOnPang.py:l46 INFO Aligning sequences to cluster representatives...
2022-10-08 12:34:12 alignOnPang.py:l51 INFO Extracting alignments...
2022-10-08 12:34:13 alignOnPang.py:l145 INFO Writing RGP and spot information related to hits in the pangenome
2022-10-08 12:35:44 alignOnPang.py:l165 INFO Drawing the 9 spots with more than 1 organization related to hits of the input sequences...
2022-10-08 12:35:44 draw_spot.py:l448 INFO Ordering genes among regions, and drawing spots...
100%|██████████| 164541/164541 [06:57<00:00, 393.79region/s]
11%|█ | 1/9 [00:02<00:22, 2.84s/spot]
Traceback (most recent call last):
File "/pasteur/appa/homes/groyer/anaconda3/envs/Ppanggolin/bin/ppanggolin", line 10, in
Or is there another way to get the representative of each RGP without drawing spots (i.e. the file "spot_XX_identical_rgps.tsv") ?
Guilhem
Hi,
I finally fix the problem by increasing the recursion limit of Python in the script "Ppanggolin/lib/python3.8/site-packages/scipy/cluster/hierarchy.py"
""" import sys print(sys.getrecursionlimit(2500)) """
But, not sure if it is the best way to solve the problem...
Guilhem
Hi ! This was indeed the best way to solve that problem, if it indeed got you to the files in the end. Having to draw the spots to get the files is not super practical I admit, it could probably be better as a “standard” output file. Adelme
Closing as this is supposedly fixed in master and will be in the next release (fixed it exactly the way you did, but in ppanggolin's code)
I don't have the means to test that it actually works though :(