
PIRATE.pangenome_summary.txt

Open haruosuz opened this issue 11 months ago • 9 comments

I ran PIRATE with 322 genomes (gff files) as input, but the <PIRATE.pangenome_summary.txt> file indicates 321 genomes. Is there any way to investigate why the count decreased from 322 to 321? The confirmation details are as follows:

cat PIRATE.log | grep "32. "

 - 322 files in input directory.
 - 322 gff files passed QC and will be analysed by PIRATE - completed in: 2s
 - Loci file contains 21143 loci from 322 genomes.
322 genomes processed.
 - 322 samples found in file headers
 - 1415 genes of 1415 total genes were present in 322 isolates
 - 1415 clusters from 322 genomes (dosage threshold <= 1.1) used for graphing.
# 1415 gene families in 321 genomes.

haruosuz avatar Mar 02 '24 01:03 haruosuz

Hi @haruosuz, that is a little odd. Typically, poorly formatted GFF files are removed at the initial stage. Were your samples annotated with Prokka? I would suggest you check the GFF file for the removed sample to ensure it is formatted correctly and contains CDS/genes (i.e. is not empty). If it looks normal, feel free to email me the file and I will check whether anything odd is going on (perhaps include a handful of the files that worked as contrasts).

SionBayliss avatar Mar 07 '24 10:03 SionBayliss

Thank you for your reply. The 322 genomes were annotated with DFAST, and none of the 322 GFF files is empty. The <PIRATE.gene_families.ordered.tsv> file has 1415 rows and 344 columns (i.e., 344 - 22 = 322 genomes). Is there any way to identify which of the 322 GFF files was excluded from the <PIRATE.pangenome_summary.txt> file? The exclusion is suggested by the line "# 1415 gene families in 321 genomes." in that file.

haruosuz avatar Mar 07 '24 11:03 haruosuz

You can check the headers in the PIRATE.gene_families.tsv file and compare them to your input sample list.

SionBayliss avatar Mar 08 '24 11:03 SionBayliss

Thank you for your reply.

The following command produced no output, indicating that there is no difference between the genomes listed in the header of the PIRATE.gene_families.tsv file and the input sample list in the "genome_list.txt" file:

diff <(head -n 1 PIRATE.gene_families.tsv | tr "\t" "\n" | tail +21) <(cat genome_list.txt | sort)
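The same comparison can also be made independent of column order by sorting both sides before comparing. The following is only a minimal sketch: `compare_samples` is a hypothetical helper name, and the sample columns are assumed to start at column 21, as in the command above.

```shell
# Hypothetical helper: list sample names that occur in only one of the two sources.
#   $1 = gene families table (tab-separated, sample names in the header row)
#   $2 = sample list, one name per line
#   $3 = first sample column (assumed to be 21 if not given)
compare_samples() {
    table=$1
    list=$2
    first_col=${3:-21}
    tmp_hdr=$(mktemp)
    tmp_lst=$(mktemp)
    # Header row -> one name per line -> drop the metadata columns -> sort.
    head -n 1 "$table" | tr '\t' '\n' | tail -n +"$first_col" | sort > "$tmp_hdr"
    sort "$list" > "$tmp_lst"
    # comm -3 hides names common to both inputs:
    #   unindented lines   = only in the table header
    #   tab-indented lines = only in the sample list
    comm -3 "$tmp_hdr" "$tmp_lst"
    rm -f "$tmp_hdr" "$tmp_lst"
}
```

Running `compare_samples PIRATE.gene_families.tsv genome_list.txt 21` and getting no output would mean that both sources contain exactly the same names.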

The discrepancy in the numbers (322 vs. 321 genomes) remains unclear. Here are the commands and their outputs:

$ wc -l genome_list.txt
     322 genome_list.txt

$ head -n 1 PIRATE.pangenome_summary.txt
# 1415 gene families in 321 genomes.

haruosuz avatar Mar 20 '24 06:03 haruosuz

So it found all your input genome files but is saying there is an additional one at one internal step? Are you sure you don't have a line including just whitespace in the genome_list.txt file?

SionBayliss avatar Mar 20 '24 10:03 SionBayliss

I ran PIRATE with 322 genomes (gff files) as input. While the <PIRATE.gene_families.tsv> file contains 322 genomes, the <PIRATE.pangenome_summary.txt> file indicates only 321 genomes.

The <genome_list.txt> file, generated by PIRATE, contains no empty lines, as shown below:

$ cat genome_list.txt | wc -l
     322
$ cat genome_list.txt | grep -v "^$" | wc -l
     322
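Note that `grep -v "^$"` only filters out completely empty lines; a line holding just spaces or tabs would still be counted. A slightly broader check could look like the sketch below. `check_list` is a hypothetical helper name, and the duplicate-name count is an extra check not requested in this thread.

```shell
# Hypothetical helper: summarise a one-name-per-line sample list.
#   $1 = sample list file
check_list() {
    list=$1
    # Lines that are empty or contain only whitespace.
    # grep -c exits non-zero when the count is 0, hence "|| true".
    blank=$(grep -c '^[[:space:]]*$' "$list" || true)
    # Number of names that occur more than once.
    dups=$(sort "$list" | uniq -d | wc -l | tr -d ' ')
    # Number of distinct names, ignoring blank lines.
    distinct=$(grep -v '^[[:space:]]*$' "$list" | sort -u | wc -l | tr -d ' ')
    printf 'blank=%s duplicated=%s distinct=%s\n' "$blank" "$dups" "$distinct"
}
```

For a clean list of 322 samples, `check_list genome_list.txt` would be expected to report no blank lines, no duplicated names, and 322 distinct names.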

haruosuz avatar Mar 20 '24 11:03 haruosuz