acc-tax-map flag in lambda3 misidentifies extension

Open bfowle opened this issue 5 years ago • 6 comments

Using lambda3, passing a file (e.g. nucl.accession2taxid.gz) to the -m/--acc-tax-map flag fails with an error:

Validation failed for option -m/--acc-tax-map: Expected one of the following valid extensions: [accession2taxid,dat,accession2taxid.gz,accession2taxid.bgzf,dat.gz,dat.bgzf,accession2taxid.bz2,dat.bz2]! Got gz instead

This works properly in lambda2.

Feb 19 '20 13:02 bfowle

Further, passing the --tax-dump-dir flag with the unzipped taxdmp from NCBI returns the following error:

...
Parsing acc-to-tax-map file... done.
Parsing nodes.dmp... done.
Thinning and flattening Tree... done.
Parsing names.dmp...

ERROR: The following unspecified exception was thrown:
       "Error: Expected taxonomical ID in first column, but couldn't find it."

This also works properly in lambda2.

$ head taxdump/names.dmp
1       |       all     |               |       synonym |
1       |       root    |               |       scientific name |
2       |       Bacteria        |       Bacteria <bacteria>     |       scientific name |
...

Feb 19 '20 14:02 bfowle

Thanks for alerting us to this problem! lambda3 is still very much in development, but of course we plan to have these features working again!

Feb 19 '20 14:02 h-2

FYI, I had some time to look into these issues myself.

1.) The acc-to-tax-map issue: [seqan3] argument_parser/validators.hpp:374 calls path.extension() on a std::filesystem::path. For a file name whose stem itself contains a dot (e.g. foo.accession2taxid.gz), this (correctly) returns only ".gz". The path type also provides stem(), which returns "foo.accession2taxid" here; its last dot-separated part can be extracted with substr and prepended to the extension before the check:

// Last extension only, including its leading dot (e.g. ".gz").
std::string tmp_str = path.extension().string();
// @TODO: make this a `view`
if (path.has_stem()) {
    // If the stem contains a dot (e.g. "foo.accession2taxid"), prepend its last part.
    if (auto lpos = path.stem().string().find_last_of("."); lpos != std::string::npos) {
        tmp_str = path.stem().string().substr(lpos + 1) + tmp_str;
    }
}
...

In that case, "foo.accession2taxid.gz" gives "accession2taxid.gz" (rather than "gz") as the drop_less_ext for the cmp_lambda check.
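
For reference, here is a minimal standalone sketch of the idea using only the C++17 standard library. It is not the seqan3 code: compound_extension and has_valid_suffix are hypothetical helper names, and has_valid_suffix shows an alternative approach (matching the whole file name against "." + extension) rather than the stem-splitting hack above. Unlike the real validator, the comparison here is case-sensitive.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical helper mirroring the stem-based approach: returns the last two
// dot-separated parts of the file name without a leading dot,
// e.g. "nucl.accession2taxid.gz" -> "accession2taxid.gz", "names.dmp" -> "dmp".
std::string compound_extension(std::filesystem::path const & path)
{
    std::string ext = path.extension().string(); // e.g. ".gz"
    if (ext.empty())
        return ext;
    ext.erase(0, 1);                             // drop the leading dot

    std::string const stem = path.stem().string(); // e.g. "nucl.accession2taxid"
    if (auto const pos = stem.find_last_of('.'); pos != std::string::npos)
        ext = stem.substr(pos + 1) + "." + ext;
    return ext;
}

// Hypothetical alternative: accept the file if its name ends with "." + ext
// for any allowed extension, so multi-part extensions need no special casing.
bool has_valid_suffix(std::filesystem::path const & path,
                      std::vector<std::string> const & valid_extensions)
{
    std::string const name = path.filename().string();
    return std::any_of(valid_extensions.begin(), valid_extensions.end(),
                       [&name](std::string const & ext)
                       {
                           std::string const suffix = "." + ext;
                           return name.size() > suffix.size() &&
                                  name.compare(name.size() - suffix.size(),
                                               suffix.size(), suffix) == 0;
                       });
}
```

Note that compound_extension always glues the last two parts together, so a name like "nucl.v2.accession2taxid" would produce "v2.accession2taxid"; the suffix-based check does not have that limitation, which is one reason the stem approach feels like a hack.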

2.) the tax-dump-dir issue: [lambda] mkindex_algo.hpp:571:

while (std::ranges::begin(file_view2) != std::ranges::end(file_view2))
{
    // read line
    buf = file_view | seqan3::views::take_line | seqan3::views::to<std::string>;
...

I noticed that buf was empty while iterating over the file: the loop condition iterates file_view2, but the line is read from file_view. This was a quick fix :)

I'm able to fully build an index with taxonomy information with lambda3.

Cheers, ~B

Feb 26 '20 15:02 bfowle

Can this be closed or are there any more open problems concerning the taxonomy files?

Mar 16 '20 14:03 sarahet

The acc-to-tax-map issue still exists (in the seqan3 repo: https://github.com/seqan/seqan3/blob/master/include/seqan3/argument_parser/validators.hpp#L382-L384). I had not created a PR out of it yet since it seemed a bit more of a "hack" to get it to work for my testing.

Mar 18 '20 12:03 bfowle

This is something we need to investigate before the release of lambda3.

Aug 15 '22 13:08 h-2

This should be fixed now. Please re-open if you can reproduce on the current lambda3-branch.

Jul 13 '23 15:07 h-2