biom-format
biom-format copied to clipboard
Required attribute table type not checked consistently
Reproducible example
Example code based on https://biom-format.org/documentation/table_objects.html#examples
import numpy as np
from biom.table import Table
data = np.arange(40).reshape(10, 4)
sample_ids = ["S%d" % i for i in range(4)]
observ_ids = ["O%d" % i for i in range(10)]
table = Table(
data,
observ_ids,
sample_ids,
# observ_metadata,
# sample_metadata,
# table_id='Example Table'
)
from biom.util import biom_open
with biom_open("example-hdf5.biom", "w") as handle:
table.to_hdf5(handle, generated_by="BIOM Pycode", compress=True)
with open("example-json.biom", "w") as handle:
handle.write(table.to_json(generated_by="BIOM Pycode"))
Follow this by command line validation of the output:
$ biom validate-table -i example-hdf5.biom
$ biom validate-table -i example-json.biom
Actual behaviour:
Python script runs without error (bad).
HDF5 file passess validation (bad):
$ biom validate-table -i example-hdf5.biom
The input file is a valid BIOM-formatted file.
JSON file fails validation (good):
$ biom validate-table -i example-json.biom
Unknown table type, however that is likely okay.
The input file is not a valid BIOM-formatted file.
Expected behaviour:
Runtime error during Table.__init__
since defaults include type=None
and validate=True
by default, and for all BIOM formats to date, type is a required top level attribute.
https://biom-format.org/documentation/format_versions/biom-1.0.html https://biom-format.org/documentation/format_versions/biom-2.0.html https://biom-format.org/documentation/format_versions/biom-2.1.html
Furthermore, using the example BIOM files as generated without the table type, both the JSON and the HDF5 ought to fail consistently.
Hm, this is going to be a delicate one. If type=None
raises, it at a minimum triggers a minor release as this is breaking behavior for the current API and it will create headaches for many users of the library. I'm inclined to, if None
, to default to "OTU table" and raise a warning with a deprecation notice that this behavior will be unsupported in the future.
That is correct the validation should be consistent -- good catch.
I agree that's a good plan - add a deprecation warning in a revision release of the tool, and upgrade to an exception in a minor release.
At least fixing the HDF5 validation to be stricter shouldn't be so complicated 😃
I found this bug because (following the examples), I'd not set the table type myself. In my case "OUT table" is the best match from the options defined.
Ah yes the classic "OUT table", which coincidentally is a common interpretation by Word :)
In some offline discussion with @wasade, he suggested that maybe we could grandfather type=None
in as allowable by the format. I'm in favor of that as it hasn't caused any problems to-date (that I'm aware of). Changing the behavior would cause other tools to need updating (e.g., QIIME 2, which doesn't set this value - see below), and would make old biom-formatted tables not work with new versions of the software.
In [1]: import qiime2
In [2]: import biom
In [3]: t = qiime2.Artifact.load('table.qza').view(biom.Table)
In [5]: print(t.type)
None
Thank you, @gregcaporaso!
Also using biom.Table.from_tsv(...)
does not have a type=
argument to set the table type.