scirpy

AIRR validation fails when converting the demo dataset with to_airr_cells

Open grst opened this issue 2 years ago • 1 comments

Description of the bug

Converting the demo dataset with ir.io.to_airr_cells fails with an AIRR validation error.

Minimal reproducible example

import scirpy as ir
adata = ir.datasets.wu2020_3k()
ir.io.to_airr_cells(adata)

The error message produced by the code above

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/airr/schema.py:193, in Schema.to_int(self, value, validate)
    192 try:
--> 193     return int(value)
    194 except ValueError:

ValueError: invalid literal for int() with base 10: '2.0'

During handling of the above exception, another exception occurred:

ValidationError                           Traceback (most recent call last)
File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/airr/schema.py:275, in Schema.validate_row(self, row)
    274 if spec == 'boolean':  self.to_bool(row[f], validate=True)
--> 275 if spec == 'integer':  self.to_int(row[f], validate=True)
    276 if spec == 'number':  self.to_float(row[f], validate=True)

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/airr/schema.py:196, in Schema.to_int(self, value, validate)
    195 if validate:
--> 196     raise ValidationError('invalid int %s'% value)
    197 else:

ValidationError: invalid int 2.0

During handling of the above exception, another exception occurred:

ValidationError                           Traceback (most recent call last)
Input In [76], in <cell line: 1>()
----> 1 airr_cells = ir.io.to_airr_cells(adata)

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/scirpy/io/_util.py:67, in _check_upgrade_schema.<locals>.check_upgrade_schema_decorator.<locals>.check_wrapper(*args, **kwargs)
     65 for i in check_args:
     66     _check_anndata_upgrade_schema(args[i])
---> 67 return f(*args, **kwargs)

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/scirpy/io/_convert_anndata.py:133, in to_airr_cells(adata)
    130 for tmp_chain in chains.values():
    131     # Don't add empty chains!
    132     if not all([_is_na2(x) for x in tmp_chain.values()]):
--> 133         tmp_ir_cell.add_chain(tmp_chain)
    135 try:
    136     tmp_ir_cell.add_serialized_chains(row["extra_chains"])

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/scirpy/io/_datastructures.py:134, in AirrCell.add_chain(self, chain)
    131 # TODO this should be `.validate_obj` but currently does not work
    132 # because of https://github.com/airr-community/airr-standards/issues/508
    133 RearrangementSchema.validate_header(chain.keys())
--> 134 RearrangementSchema.validate_row(chain)
    136 for tmp_field in self._cell_attribute_fields:
    137     # It is ok if a field specified as cell attribute is not present in the chain
    138     try:

File ~/anaconda3/envs/test_awkward/lib/python3.9/site-packages/airr/schema.py:278, in Schema.validate_row(self, row)
    276         if spec == 'number':  self.to_float(row[f], validate=True)
    277     except ValidationError as e:
--> 278         raise ValidationError('field %s has %s' %(f, e))
    280 return True

ValidationError: field duplicate_count has invalid int 2.0

Version information

-----
anndata     0.8.0rc2.dev27+ge524389
scanpy      1.9.1
-----
Levenshtein                 NA
PIL                         9.1.1
adjustText                  NA
airr                        1.3.1
asttokens                   NA
awkward                     1.8.0
backcall                    0.2.0
beta_ufunc                  NA
binom_ufunc                 NA
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.8.2
debugpy                     1.6.0
decorator                   5.1.1
entrypoints                 0.4
executing                   0.8.3
h5py                        3.7.0
hypergeom_ufunc             NA
igraph                      0.9.11
ipykernel                   6.15.0
jedi                        0.18.1
joblib                      1.1.0
kiwisolver                  1.4.3
llvmlite                    0.38.1
matplotlib                  3.5.2
mpl_toolkits                NA
natsort                     8.1.0
nbinom_ufunc                NA
networkx                    2.8.4
numba                       0.55.2
numpy                       1.22.4
packaging                   21.3
pandas                      1.4.2
parasail                    1.2.4
parso                       0.8.3
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
prompt_toolkit              3.0.29
psutil                      5.9.1
ptyprocess                  0.7.0
pure_eval                   0.2.2
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.8.0
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.12.0
pyparsing                   3.0.9
pytoml                      NA
pytz                        2022.1
scipy                       1.8.1
scirpy                      0.10.1
seaborn                     0.11.2
session_info                1.0.0
setuptools                  62.5.0
setuptools_scm              NA
six                         1.16.0
sklearn                     1.1.1
stack_data                  0.3.0
statsmodels                 0.13.2
tabulate                    0.8.9
texttable                   1.6.4
threadpoolctl               3.1.0
tornado                     6.1
tqdm                        4.64.0
tracerlib                   NA
traitlets                   5.3.0
wcwidth                     0.2.5
yaml                        6.0
yamlordereddictloader       NA
zmq                         23.1.0
-----
IPython             8.4.0
jupyter_client      7.3.4
jupyter_core        4.10.0
-----
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0]
Linux-5.18.2-arch1-1-x86_64-with-glibc2.35
-----
Session information updated at 2022-06-20 19:01

grst • Jun 20 '22 17:06

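The problem is that IR_VDJ_2_duplicate_count is of type str because it contains the string "None". As a result, the counts reach the validator as strings such as "2.0", which the AIRR schema's integer check (int(value)) rejects.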
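To illustrate the diagnosis, here is a minimal sketch in plain Python. It reproduces the int() failure from the traceback and shows one way such values could be normalized before validation. The helper normalize_count and the sample values are hypothetical and are not part of scirpy or the airr library.

```python
from typing import Optional


def normalize_count(value) -> Optional[int]:
    """Hypothetical helper: coerce a count read from a str column to int or None."""
    if value is None:
        return None
    text = str(value).strip()
    # Treat common missing-value spellings as "no count".
    if text in ("", "None", "nan", "NaN"):
        return None
    number = float(text)  # accepts "2", "2.0", 2 and 2.0
    if not number.is_integer():
        raise ValueError(f"not an integer count: {value!r}")
    return int(number)


# The failure from the traceback: int() does not parse a float-formatted string.
try:
    int("2.0")
except ValueError as err:
    print(f"int('2.0') fails: {err}")

# Illustrative values only, not taken from the demo dataset.
column = ["2.0", "None", "1.0", "13"]
print([normalize_count(v) for v in column])
```

Values normalized this way are real integers or None, so a plain int() conversion would no longer fail on them. Whether scirpy should apply such a coercion when writing AIRR cells is an assumption here, not something the issue states.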

grst • Jun 20 '22 17:06

Solved with the new data structure and the new example dataset.

grst • Apr 13 '23 06:04