Scipy sparse matrices not handled correctly by TPOT and autosklearn

Open sebhrusen opened this issue 3 years ago • 7 comments

Failing datasets: https://openml.org/t/360932 https://openml.org/t/360932

  • serialization of sparse matrices was not applied correctly.
  • once fixed, the frameworks still fail with the following errors:
# TPOT
  File "/Users/seb/repos/ml/automlbenchmark/frameworks/TPOT/venv/lib/python3.7/site-packages/tpot/base.py", line 1359, in _check_dataset
    self.config_dict
ValueError: Not all operators in None supports sparse matrix. Please use "TPOT sparse" for sparse matrix.
# autosklearn
  File "/Users/seb/repos/ml/automlbenchmark/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/utils/multiclass.py", line 288, in type_of_target
    if y.ndim > 2 or (y.dtype == object and len(y) and
TypeError: len() of unsized object
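
The autosklearn traceback above is consistent with NumPy wrapping a SciPy sparse matrix in a 0-dimensional object array, which has no length. The sketch below reproduces that behaviour outside of any framework; the data and variable names are illustrative assumptions, not taken from the failing tasks.

```python
import numpy as np
from scipy import sparse

# Illustrative sparse column vector standing in for a sparse target y.
y_sparse = sparse.csr_matrix(np.array([[0], [1], [2], [1]]))

# NumPy cannot unpack a SciPy sparse matrix, so np.asarray wraps the whole
# object in a 0-dimensional array of dtype object.
y_wrapped = np.asarray(y_sparse)
wrapped_ndim = y_wrapped.ndim
wrapped_dtype = y_wrapped.dtype

# Calling len() on a 0-dimensional array raises a TypeError.
try:
    len(y_wrapped)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(wrapped_ndim, wrapped_dtype, error_message)
```

Any check of the form `y.dtype == object and len(y)` would fail in the same way on such a wrapped array.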

We'll improve support for sparse data in a future version: for now, we can simply deserialize the sparse matrices as dense matrices for the frameworks that don't use pandas.
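
A minimal sketch of that workaround, assuming the data arrives as SciPy sparse matrices or array-likes. The helper name `to_dense` is hypothetical and is not the benchmark's actual deserialization code.

```python
import numpy as np
from scipy import sparse


def to_dense(data):
    """Return a dense ndarray: sparse input is expanded, other input is passed to np.asarray."""
    if sparse.issparse(data):
        return data.toarray()
    return np.asarray(data)


# Illustrative data only.
X_sparse = sparse.csr_matrix(np.eye(3))
X_dense = to_dense(X_sparse)
already_dense = to_dense([[1, 2], [3, 4]])
```

Densifying can use much more memory on large, very sparse datasets, which is one reason to apply it only where a framework needs it.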

sebhrusen avatar Jul 28 '21 19:07 sebhrusen

Just checking - are these sparse target matrices y? We might indeed not have tests for that.

CC @eddiebergman

mfeurer avatar Jul 29 '21 07:07 mfeurer

@mfeurer in this case both X and y are indeed sparse, though I'm not sure a sparse y makes sense. For now I worked around this by converting both to dense arrays, since I assumed the problem was X, but it's quite possible that some frameworks only need this for y.

sebhrusen avatar Jul 30 '21 15:07 sebhrusen

Thanks for the clarification. Auto-sklearn should support sparse X, but we'll check, and will also check what the behavior for sparse y values is.

mfeurer avatar Jul 30 '21 16:07 mfeurer

@mfeurer for autosklearn, sparse X with dense y seems to work fine (and faster), meaning that in your case, sparse y was the issue. Thanks for noticing this: ideally we'd like frameworks to use sparse data whenever possible, so I'll probably just make y dense by default and decide framework by framework what to do with X. cc: @PGijsbers
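
A rough sketch of that split, assuming a single-target y stored as an (n, 1) or (1, n) sparse matrix. The function name `prepare_xy` is hypothetical and only illustrates the idea of keeping X sparse while densifying y.

```python
import numpy as np
from scipy import sparse


def prepare_xy(X, y):
    """Leave X untouched (sparse or dense) and return y as a dense array."""
    if sparse.issparse(y):
        y = y.toarray()
    y = np.asarray(y)
    # Assumption: a 2-D y with a singleton dimension is a single target
    # column or row, so it is flattened to 1-D.
    if y.ndim == 2 and 1 in y.shape:
        y = y.ravel()
    return X, y


# Illustrative data only.
X_in = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 0.0], [3.0, 4.0]]))
y_in = sparse.csr_matrix(np.array([[0], [1], [0], [1]]))
X_out, y_out = prepare_xy(X_in, y_in)
```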

sebhrusen avatar Jul 30 '21 16:07 sebhrusen

@sebhrusen It's probably in the interest of autosklearn to handle sparse y correctly in this case. I'll look into it.

eddiebergman avatar Jul 30 '21 16:07 eddiebergman

@eddiebergman Sure, just mentioning that we have a workaround on our side for now that also seems to work for other frameworks. Thanks for fixing it on your side too.

sebhrusen avatar Jul 30 '21 16:07 sebhrusen

Hi @sebhrusen,

Just letting you know that the fix should be in the next release. I also tracked down the problem a little further and wrote a brief synopsis, in case it helps identify the problem for other libraries.

eddiebergman avatar Aug 07 '21 13:08 eddiebergman