`datasets summary genome taxon` should have an exit code of `0` when there are no genomes for a valid taxon
When looking up genomes for a taxon that exists but has no genomes on NCBI, the following error is given:
root@b9976d6ef296:/data# datasets summary genome taxon 97225
Error: The taxonomy ID '97225' is valid for 'Gomphonemataceae', but no genome data is currently available for this taxon.
Use datasets summary genome taxon <command> --help for detailed help about a command.
root@b9976d6ef296:/data# echo $?
1
The message is helpful, but the command returns a "failure" exit code of 1. It seems to me that this is not actually a failed API call: the tool understood the request and returned what was asked for, which in this case is nothing. I think the exit code should be 0 in this case. This matters in automated pipelines because the exit code is used to check whether a process failed. In our Nextflow pipeline, I had to do these bash gymnastics to bypass this behavior:
# NOTE: This command errors when a taxon is found but has no data rather than just outputting an empty file,
# so the code below forces it not to fail and then fails if any other error occurs
datasets summary genome taxon ${args} ${taxon.toLowerCase()} 1> ${output_path} 2> >(tee error.txt >&2) || true
if [ -s error.txt ] && ! grep -q 'no genome data is currently available for this taxon.' error.txt; then
exit 1
fi
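For pipelines that call the command in several places, the same idea can be wrapped in a small function. This is only a sketch under assumptions: `tolerate_empty_taxon` is a hypothetical helper name (not part of the datasets CLI), and it matches on the exact message text shown above, which could change between releases. Unlike the snippet above, it keeps the command's original non-zero exit code for any other error instead of always exiting with 1.

```shell
# Sketch: run a command, pass its stdout through, and treat the
# "no genome data" message on stderr as a successful (empty) result.
# `tolerate_empty_taxon` is a hypothetical helper, not part of the datasets CLI.
tolerate_empty_taxon() {
    local err_file status=0
    err_file=$(mktemp)
    # Run the command given as arguments; capture stderr and the exit code.
    "$@" 2> "$err_file" || status=$?
    # Replay the captured stderr so the message still shows up in logs.
    cat "$err_file" >&2
    # Downgrade the failure only when the known message is present.
    if [ "$status" -ne 0 ] && grep -q 'no genome data is currently available for this taxon' "$err_file"; then
        status=0
    fi
    rm -f "$err_file"
    return "$status"
}
```

It could then be called as `tolerate_empty_taxon datasets summary genome taxon 97225 > summary.json`, where `summary.json` is a placeholder output path.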
Thanks for the suggestion and for providing the example! We agree that this could be improved. We'll let you know when we make an update.
Thanks for helping improve the tool! Nuala
Wow, what a timely issue to see right when I run into the same thing! Edit: After updating from v17.1.0 to v18.1.0, I get the exact behavior I would want! See below.
~~+1 to the suggestion, and to go even further:~~
~~I wanted to get genomes for a list of roughly 1700 taxonomy IDs, so to explore how many genomes existed for each of these many taxIDs, I ran~~
datasets summary genome taxon --inputfile taxids.txt > genome_info.json
~~But was given similar output to @zachary-foster:~~
Error: The taxonomy ID '1001349' is valid for 'Streptomyces sp. Acta 2897', but no genome data is currently available for this taxon.
Use datasets summary genome taxon <command> --help for detailed help about a command.
~~So, while simply returning an exit code of 0 partially helps my case, IMO the desired behavior would be:~~
- ~~Send the error text to `stderr`~~
- ~~Simply omit any results for that taxon in the output, or perhaps an empty JSON object?~~
- ~~Either way, continue processing the rest of the taxIDs from the file, so I can get whatever genomes do exist for all of my many taxIDs.~~
- ~~(Minor issue) IMO there's also no need to print the help text either, but that's not a huge deal as long as it goes to `stderr` anyways~~
Thanks for all your work, and continuing to improve this tool!
EDIT:
For anyone coming across this thread in the future -- I was on ncbi-datasets-cli version 17.1.0, and after updating to version 18.1.0, I am able to get the expected output:
$ datasets summary genome taxon --inputfile ~/tmp/taxids_test.txt > genome_info_test.json
The taxonomy ID '1001349' is valid for 'Streptomyces sp. Acta 2897', but no genome data is currently available for this taxon.
Confirm the exit code
$ echo $?
0
And all of the following are true:
☑ The error text is sent to stderr
☑ Any TaxIDs lacking data are simply omitted from the results/output
☑ The rest of the TaxIDs are processed as normal
☑ The usage help text is not printed
Hi @dtdoering,
Thank you for your suggestions. I took a look at the items in your bulleted list and I believe the CLI already behaves this way, although as you point out, we are printing an error message as well as help text. Please let me know if I'm misunderstanding something or better yet, please share an example so we can better understand where we are going wrong.
I'll respond to each bullet point:
- Send the error text to `stderr`
The error text is currently sent to stderr.
- Simply omit any results for that taxon in the output, or perhaps an empty JSON object?
We do currently omit results for taxa without genome data.
- Either way, continue processing the rest of the taxIDs from the file, so I can get whatever genomes do exist for all of my many taxIDs.
We recently (in March) updated the CLI to work this way in response to another GitHub issue https://github.com/ncbi/datasets/issues/450#issuecomment-2751604779
- (Minor issue) IMO there's also no need to print the help text either, but that's not a huge deal as long as it goes to stderr anyways
Yes, I agree that we don't need to print the help text in this case, but the help text is going to stderr.
Thanks again for your feedback.
Best, Eric
Hi @ericcox1,
Thank you for the response -- I usually take care to ensure I'm on the latest version before posting anything, but apparently not in this case 😅 I'll edit my post to avoid misleading anyone coming across this post in the future, and to not derail this discussion any further.
That said, to re-focus back on @zachary-foster's original issue, I am able to reproduce the behavior and look forward to any fixes! Aside from the posted bash redirection workaround, the only other workaround option I can think of would be to collect TaxIDs in a file and make use of the --inputfile flag as in my example, though I could see that being difficult to switch to, depending on the workflow.
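A rough sketch of that `--inputfile` workaround, for anyone who wants to try it. It assumes ncbi-datasets-cli v18.1.0 or later behaves as described in my edited post, and that the input file takes one TaxID per line (an assumption on my part). `write_taxid_file` is a hypothetical helper and the file names are placeholders.

```shell
# Sketch of the --inputfile workaround; relies on the v18.1.0 behavior reported
# in this thread (taxa without genomes are skipped and the exit code is 0).
# `write_taxid_file` is a hypothetical helper, not part of the datasets CLI.
write_taxid_file() {
    local out_file=$1
    shift
    # Assumption: the input file holds one TaxID per line.
    printf '%s\n' "$@" > "$out_file"
}

# Example usage (commented out so the sketch can be sourced without the CLI installed):
# write_taxid_file taxids.txt 97225 1001349
# datasets summary genome taxon --inputfile taxids.txt > genome_info.json
```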
Just wanted to post an update since it has been a little while since this issue was opened. We're currently discussing how to approach this internally. Changing the return value in this case probably isn't the best answer, but the underlying problem of interpretability is absolutely a problem we need to address to support pipeline usage. I'll try to update with a proposed solution once we've landed on it for additional feedback before implementing it.