
Required attribute table type not checked consistently

Open peterjc opened this issue 1 year ago • 6 comments

Reproducible example

Example code based on https://biom-format.org/documentation/table_objects.html#examples

import numpy as np
from biom.table import Table

data = np.arange(40).reshape(10, 4)
sample_ids = ["S%d" % i for i in range(4)]
observ_ids = ["O%d" % i for i in range(10)]
table = Table(
    data,
    observ_ids,
    sample_ids,
    # observ_metadata,
    # sample_metadata,
    # table_id='Example Table'
)

from biom.util import biom_open

with biom_open("example-hdf5.biom", "w") as handle:
    table.to_hdf5(handle, generated_by="BIOM Pycode", compress=True)

with open("example-json.biom", "w") as handle:
    handle.write(table.to_json(generated_by="BIOM Pycode"))

Follow this by command line validation of the output:

$ biom validate-table -i example-hdf5.biom 
$ biom validate-table -i example-json.biom 

Actual behaviour:

Python script runs without error (bad).

HDF5 file passes validation (bad):

$ biom validate-table -i example-hdf5.biom 

The input file is a valid BIOM-formatted file.

JSON file fails validation (good):

$ biom validate-table -i example-json.biom 
Unknown table type, however that is likely okay.
The input file is not a valid BIOM-formatted file.

Expected behaviour:

Runtime error during Table.__init__, since the defaults are type=None and validate=True, and for all BIOM formats to date, type is a required top-level attribute.

https://biom-format.org/documentation/format_versions/biom-1.0.html
https://biom-format.org/documentation/format_versions/biom-2.0.html
https://biom-format.org/documentation/format_versions/biom-2.1.html

Furthermore, using the example BIOM files as generated without the table type, both the JSON and the HDF5 ought to fail consistently.
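To illustrate what a consistent check could look like, here is a minimal sketch that applies one rule to the top-level type attribute, shown for the JSON side using only the standard library. The names check_table_type and KNOWN_TABLE_TYPES are hypothetical and not part of the biom-format API. The vocabulary is the list of table types given in the BIOM format documentation. The JSON fragment is an assumption about what a file written without a table type contains. Flagging unknown types as problems is one possible policy and is stricter than the current validator output suggests.

```python
import json

# Controlled vocabulary for the top-level "type" attribute,
# as listed in the BIOM format documentation.
KNOWN_TABLE_TYPES = {
    "OTU table",
    "Pathway table",
    "Function table",
    "Ortholog table",
    "Gene table",
    "Metabolite table",
    "Taxon table",
}


def check_table_type(value):
    """Hypothetical helper: return a list of problems with a table type."""
    if value is None:
        return ["table type is missing"]
    if value not in KNOWN_TABLE_TYPES:
        return ["unknown table type: %r" % (value,)]
    return []


# Assumed shape of a JSON table written without a table type.
doc = json.loads('{"id": null, "type": null}')
problems = check_table_type(doc.get("type"))
print(problems)
```

The same helper could be fed the type attribute read from an HDF5 file, so both formats would pass or fail together.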

peterjc avatar Mar 09 '23 16:03 peterjc

Hm, this is going to be a delicate one. If type=None raises, it at a minimum triggers a minor release, as this is breaking behavior for the current API, and it will create headaches for many users of the library. If type is None, I'm inclined to default to "OTU table" and raise a warning with a deprecation notice that this behavior will be unsupported in the future.
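A rough sketch of that fallback, written as a standalone function rather than a patch to Table.__init__. The names resolve_table_type and DEFAULT_TABLE_TYPE are hypothetical, and the warning text is only illustrative; none of this is the library's actual code.

```python
import warnings

# Hypothetical default, taken from the proposal above.
DEFAULT_TABLE_TYPE = "OTU table"


def resolve_table_type(type_=None):
    """Hypothetical helper: fall back to a default type with a warning."""
    if type_ is None:
        warnings.warn(
            "type=None is deprecated; defaulting to %r. "
            "A future release may raise an error instead." % DEFAULT_TABLE_TYPE,
            DeprecationWarning,
            stacklevel=2,
        )
        return DEFAULT_TABLE_TYPE
    return type_


# Capture the warning so the behavior can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = resolve_table_type()

print(resolved, len(caught))
```

An explicit type is passed through unchanged and emits no warning, so callers who already set the type see no difference.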

That is correct, the validation should be consistent -- good catch.

wasade avatar Mar 09 '23 17:03 wasade

I agree that's a good plan - add a deprecation warning in a revision release of the tool, and upgrade to an exception in a minor release.

At least fixing the HDF5 validation to be stricter shouldn't be so complicated 😃

I found this bug because (following the examples), I'd not set the table type myself. In my case "OUT table" is the best match from the options defined.

peterjc avatar Mar 09 '23 18:03 peterjc

Ah yes the classic "OUT table", which coincidentally is a common interpretation by Word :)

wasade avatar Mar 09 '23 18:03 wasade

In some offline discussion with @wasade, he suggested that maybe we could grandfather type=None in as allowable by the format. I'm in favor of that as it hasn't caused any problems to date (that I'm aware of). Changing the behavior would cause other tools to need updating (e.g., QIIME 2, which doesn't set this value - see below), and would make old biom-formatted tables not work with new versions of the software.

In [1]: import qiime2
In [2]: import biom
In [3]: t = qiime2.Artifact.load('table.qza').view(biom.Table)
In [5]: print(t.type)
None

gregcaporaso avatar Apr 13 '23 22:04 gregcaporaso

Thank you, @gregcaporaso!

wasade avatar Apr 13 '23 22:04 wasade

Also, biom.Table.from_tsv(...) does not have a type= argument to set the table type.

peterjc avatar Mar 08 '24 23:03 peterjc