
split results

Open Proginski opened this issue 1 year ago • 7 comments

Hi,

Is your feature request related to a problem? Please describe. The latest release of the human genome, with its ~145k CDS, produces 630 GB of results. Since the help of v1.4.0 says that one needs around 200 GB of RAM for 180 GB of results, it seems one would need ~700 GB of RAM to complete the analysis with the -F option.

Describe the solution you'd like Once step 1 (+/- 2) is completed, would it be possible to manually split the input FASTA and the Diamond results to improve performance on each chunk? (I'm not saying it would not also require a lot of RAM ;) )

Describe alternatives you've considered I just tried something like

faSplit sequence cds_from_genomic.faa 10 cds_from_genomic
grep ">" cds_from_genomic04.fa | sed -E "s/>(.*)/^\1\t/" > cds_from_genomic04.txt
grep -f cds_from_genomic04.txt tmp_9606_18134/9606_Diamond_results.bout > cds_from_genomic04.bout # This step of course is "expensive"
genEra \
-t 9606 \
-q cds_from_genomic04.fa \
-n 40 \
-p cds_from_genomic04.bout \
-c 9606_ncbi_lineages.csv \
-r ncbi_lineages_2023-07-12.csv

The chunk has 87 CDS and, of course, it went turbo-fast. The ages assigned to the CDS were the same as when the entire original FASTA was used. So is it valid to proceed this way, and could it be of any interest?
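
Since the grep -f step has to be repeated for every chunk, a single pass over the Diamond results might be cheaper. Below is a minimal sketch of that idea, not something GenEra provides: the function name split_bout_by_chunks is hypothetical, and it assumes that column 1 of the .bout file is the query ID and that this ID equals the first word of each FASTA header. Each chunk's hits are written next to the chunk, with the extension replaced by .bout.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GenEra): split Diamond tabular results
# into one .bout file per FASTA chunk, reading the results file only once.
# Assumptions: column 1 of the results is the query ID, and that ID is the
# first whitespace-delimited word of the FASTA header.
split_bout_by_chunks() {
    local bout="$1"   # Diamond results table (tab-separated)
    shift             # remaining arguments: the chunk FASTA files
    awk -v bout="$bout" '
        # FASTA chunks are read first: remember which chunk owns each ID.
        FILENAME != bout {
            if (substr($0, 1, 1) == ">") {
                id = substr($1, 2)               # header word without ">"
                out = FILENAME
                sub(/\.[^.]*$/, ".bout", out)    # chunk04.fa -> chunk04.bout
                chunk[id] = out
            }
            next
        }
        # Results are read last: route each hit to the file of its chunk.
        ($1 in chunk) { print > (chunk[$1]) }
    ' "$@" "$bout"
}

# Example call, using the file names from above:
# split_bout_by_chunks tmp_9606_18134/9606_Diamond_results.bout cds_from_genomic??.fa
```

This keeps one output file open per chunk, which is fine for 10 chunks; with many more chunks, some awk implementations may hit the open-file limit. Hits whose query ID is not found in any chunk are silently dropped.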

Paul

Proginski, Sep 26 '23 15:09