Non-deterministic behaviour
Hello,
Firstly, thanks for your work with Pyrodigal! I wasn't able to determine if Pyrodigal should be deterministic, if it is then ignore this ticket.
I have an extremely rare case that took quite some time to identify, but Pyrodigal will occasionally give a different result when running via Shell vs. Python subprocess. The strange part is that the likelihood of Pyrodigal giving a different result is higher when running via a Python subprocess though.
To replicate the issue:
Dockerfile
FROM python:3.10-slim
RUN apt-get update && apt-get install -y \
curl \
unzip \
&& rm -rf /var/lib/apt/lists/*
RUN python -m pip install pyrodigal==2.0.4
RUN mkdir -p /data /results /tmp/download
WORKDIR /tmp/download
RUN curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_009700405.1/download?include_annotation_type=GENOME_FASTA&filename=GCA_009700405.1.zip" -H "Accept: application/zip" && \
unzip GCA_009700405.1.zip && \
rm GCA_009700405.1.zip && \
mv ncbi_dataset/data/GCA_009700405.1/GCA_009700405.1_ASM970040v1_genomic.fna /data/genome.fna && \
rm -rf /tmp/download
WORKDIR /data
Entering container
docker build -t pyrodigal_test . && docker run -it pyrodigal_test /bin/bash
Running Pyrodigal
#!/bin/bash
for i in {1..100}
do
python -c "import os; os.system('pyrodigal -m -i /data/genome.fna -g 11 -o /dev/null -a /results/python_$i.faa -d /dev/null -p single')";
pyrodigal -m -i /data/genome.fna -g 11 -o /dev/null -a /results/shell_$i.faa -d /dev/null -p single;
done
Results
Over 200 trials I get the following results:
| Hash | Command Line (count) | Python os.system (count) |
|---|---|---|
| a9f114 | 192 | 178 |
| 597610 | 8 | 22 |
The differences between the two hashes are:
7082,7090c7082,7090
< >WLMD01000046.1_10 # 9095 # 10405 # -1 # ID=101_10;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.487
< MSADDQLRKQQEFVLRTIEERNIRFVRLWFTDVLGFLKSVAIAPAELANAFDEGIGFDGS
< AIEGFARITESDMLAKPDPSTFSVLPWRTEAPGAARMFCDIVMPDGSASHADPRHVLRRI
< LNKAATMGYTCYTHPEIEFFLFKDRPEIGKRPTPVDQGGYFDHTPAVVGHDFRRTAITML
< EAMGISVEFSHHEGAPGQQEIDLRYADALTTADNIMTFRHVVKEVALDQGFHASFIPKPF
< TDHPGSGMHTHVSLFQGEKNAFYDAKAEYNLSKVGRSFIAGLLRHAPEITAVTNQWVNSY
< KRLHGGGEAPALVNWGHNNRGALVRVPMYKPNNENSTRVEFRSPDSACNPYLAYAVMIAA
< GLKGVEEGYELADSSDATVLPSNLNEAIIAMEKSALVRETLGEHVFEYVLRNKRAEWNDY
< SRQVTAYELDRYLPIL*
---
> >WLMD01000046.1_10 # 9095 # 10414 # -1 # ID=101_10;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.487
> YVPMSADDQLRKQQEFVLRTIEERNIRFVRLWFTDVLGFLKSVAIAPAELANAFDEGIGF
> DGSAIEGFARITESDMLAKPDPSTFSVLPWRTEAPGAARMFCDIVMPDGSASHADPRHVL
> RRILNKAATMGYTCYTHPEIEFFLFKDRPEIGKRPTPVDQGGYFDHTPAVVGHDFRRTAI
> TMLEAMGISVEFSHHEGAPGQQEIDLRYADALTTADNIMTFRHVVKEVALDQGFHASFIP
> KPFTDHPGSGMHTHVSLFQGEKNAFYDAKAEYNLSKVGRSFIAGLLRHAPEITAVTNQWV
> NSYKRLHGGGEAPALVNWGHNNRGALVRVPMYKPNNENSTRVEFRSPDSACNPYLAYAVM
> IAAGLKGVEEGYELADSSDATVLPSNLNEAIIAMEKSALVRETLGEHVFEYVLRNKRAEW
> NDYSRQVTAYELDRYLPIL*
i.e. the 597610 hash starts with Y instead of M.
Hi @aaronmussig,
Normally Pyrodigal should indeed be deterministic. Non-determinism is often caused when some parts of the program are reading undefined memory, often out of bounds.
This may be linked to https://github.com/hyattpd/Prodigal/pull/100 as I'm seeing the error is happening next to the edge on the reverse strand. I'll check if the fixed version still causes the issue.
Hi @althonos,
Re-ran this using v3.5.2 and both execution methods are returning the same result. Thanks for looking into it!
Good to hear, thanks!