acc-tax-map flag in lambda3 misidentifies extension
Using lambda3, passing a file (e.g. nucl.accession2taxid.gz) to the -m/--acc-tax-map flag fails with this error:
Validation failed for option -m/--acc-tax-map: Expected one of the following valid extensions: [accession2taxid,dat,accession2taxid.gz,accession2taxid.bgzf,dat.gz,dat.bgzf,accession2taxid.bz2,dat.bz2]! Got gz instead
This works properly in lambda2.
Further, passing the --tax-dump-dir flag with the unzipped taxdmp from NCBI returns the following error:
...
Parsing acc-to-tax-map file... done.
Parsing nodes.dmp... done.
Thinning and flattening Tree... done.
Parsing names.dmp...
ERROR: The following unspecified exception was thrown:
"Error: Expected taxonomical ID in first column, but couldn't find it."
This also works properly in lambda2.
$ head taxdump/names.dmp
1 | all | | synonym |
1 | root | | scientific name |
2 | Bacteria | Bacteria <bacteria> | scientific name |
...
Thanks for alerting us to this problem! lambda3 is still very much in development, but of course we plan to have these features working again!
FYI-- I had some time to look into these issues myself.
1.) the acc-to-tax-map issue:
[seqan3] argument_parser/validators.hpp:374 calls path.extension() from std::filesystem::path. For a file name whose extension has two dot-separated parts (e.g. .accession2taxid.gz), this returns only the last part, ".gz". That is correct behavior for extension(), but not what the validator needs.
The path type also provides stem(), which returns "foo.accession2taxid" here. One can substr it and prepend the result to the extension in this check:
// Drop the dot.
std::string tmp_str = path.extension().string();
// @TODO: make this a `view`
if (path.has_stem()) {
if (auto lpos = path.stem().string().find_last_of("."); lpos != std::string::npos) {
tmp_str = path.stem().string().substr(lpos + 1) + tmp_str;
}
}
...
In that case, "foo.accession2taxid.gz" yields "accession2taxid.gz" as the drop_less_ext value for the cmp_lambda check.
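For anyone who wants to try the idea outside of seqan3, here is a minimal standalone sketch using only the standard library. The helper name compound_extension is made up for illustration and is not part of seqan3; it also drops the leading dot, which is an assumption based on the extension list in the error message above (entries like accession2taxid.gz have no leading dot).

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper (not part of seqan3): returns the last two
// dot-separated components of the file name without a leading dot,
// or the single extension if the stem contains no further dot.
std::string compound_extension(std::filesystem::path const & path)
{
    // extension() yields only the last component, e.g. ".gz".
    std::string ext = path.extension().string();
    if (ext.empty())
        return ext;

    ext.erase(0, 1); // drop the leading dot: ".gz" -> "gz"

    // stem() is the file name minus that last extension,
    // e.g. "nucl.accession2taxid".
    std::string const stem = path.stem().string();
    if (auto const pos = stem.find_last_of('.'); pos != std::string::npos)
        return stem.substr(pos + 1) + "." + ext; // "accession2taxid.gz"

    return ext; // no second component, e.g. "foo.gz" -> "gz"
}
```

With this, compound_extension("nucl.accession2taxid.gz") gives "accession2taxid.gz" and compound_extension("foo.gz") gives "gz". A real validator would probably need to try both the compound and the plain extension, since a name like "sample.v2.dat" has the compound form "v2.dat" but should still match "dat".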
2.) the tax-dump-dir issue:
[lambda] mkindex_algo.hpp:571:
while (std::ranges::begin(file_view2) != std::ranges::end(file_view2))
{
// read line
buf = file_view | seqan3::views::take_line | seqan3::views::to<std::string>;
...
I noticed that buf was empty on every iteration: the loop condition checks file_view2, but the line is read from file_view, which is not the range being iterated. Reading from file_view2 instead was a quick fix :)
I'm able to fully build an index with taxonomy information with lambda3.
Cheers, ~B
Can this be closed or are there any more open problems concerning the taxonomy files?
The acc-to-tax-map issue still exists (in the seqan3 repo: https://github.com/seqan/seqan3/blob/master/include/seqan3/argument_parser/validators.hpp#L382-L384). I had not created a PR out of it yet since it seemed a bit more of a "hack" to get it to work for my testing.
This is something we need to investigate before the release of lambda3.
This should be fixed now. Please re-open if you can reproduce on the current lambda3-branch.