skops
Support more array-like data types for tabular data and list-like data types for text data
This pull request is an attempt to address #65. Note that this is my first ever public contribution, so don't be too hard on me 😉
For tabular-classification and tabular-regression tasks, we're currently supporting the following data types:
- pandas.DataFrame
- numpy.ndarray

Should this PR get merged, we will support any other data type that can be converted to a 2D numpy.ndarray using np.asarray.
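For illustration, a minimal sketch of what this would enable (the variable names are mine, not from the PR): any nested array-like, such as a list of lists, would be accepted because it round-trips through np.asarray into a 2D array.

```python
import numpy as np

# A plain list of lists is not a DataFrame or ndarray, but np.asarray
# converts it into a 2D array, which is all the PR would require.
X_list = [[1.0, 2.0], [3.0, 4.0]]
X = np.asarray(X_list)

print(X.shape)  # (2, 2)
print(X.ndim)   # 2
```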
For text-classification and text-regression tasks, we're currently supporting the following data types:
- List of strings

Should this PR get merged, we will support any iterable of strings.
xref: https://github.com/skops-dev/skops/pull/45/files#r933221139
Thanks a lot @anferico for taking this.
I have only taken a quick look so far and noticed that you haven't added any unit tests for the changes. Could you please do that? Let us know if you need any help with that.
Also, please add an entry to the changes doc.
I haven't added any unit tests for the 2 small helper functions I added, though I've modified existing ones in accordance with the changes I made. Not sure what your policy is here, but if needed I can add unit tests for those as well. I also added some docstrings and modified existing ones.
Everything should be fine since all unit tests were passing on my local machine, but let's wait until the pytest workflow runs successfully.
@BenjaminBossan I've tried addressing your comments, see if it's good now. One minor detail: I renamed the function _is_iterable_of_strings to _is_sequence_of_strings because I'm no longer expecting generic iterables as inputs to that function, which led me to remove the logic that prevents iterators and generators from being exhausted when iterated over. Without that logic, calling it _is_iterable_of_strings just felt wrong, so I went for _is_sequence_of_strings.
@anferico are you still working on this?
> @anferico are you still working on this?
I am, it's just that I haven't had much free time recently. I'm happy to continue working on this, but if it's really high priority please feel free to let someone else take over. Sorry about that
Hey, no worries, take the time you need. I just asked because sometimes people just forget or are waiting on some input from the maintainers.
@BenjaminBossan I think I've addressed all the points above, however now I'm facing another problem.
I noticed that when calling the test_get_column_names_pandas_not_installed function, I run into a weird recursion error. Obviously, as a result, the test fails.
Here's the command I ran:
python -m pytest -s -v --cov-report=xml skops/hub_utils/tests/test_hf_hub.py::test_get_column_names_pandas_not_installed
and here is part of the output I get (see log_test.txt for the full output):
...
File "/home/anfri/MyProjects/skops/skops/conftest.py", line 13, in mock_import
File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1103, in __call__
File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1111, in _increment_mock_call
File "/home/anfri/MyProjects/skops/skops/conftest.py", line 13, in mock_import
File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1103, in __call__
File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1111, in _increment_mock_call
RecursionError: maximum recursion depth exceeded
I'm not sure what's happening here, because I haven't touched conftest.py. My suspicion is that the line `from sklearn.utils import check_array` that I added to _hf_hub.py could be the culprit: there are some pandas imports in the source file containing that function, so us patching the import builtin in such a way that it throws an ImportError while trying to import pandas could be a problem, but I'm really not sure.
> My suspicion is that the line `from sklearn.utils import check_array` that I added to _hf_hub.py could be the culprit [...]
It seems likely that this is the explanation, but I'm not sure why that would lead to an infinite recursion. I'd have to dig deeper. @adrinjalali any ideas?
Meanwhile, since some time has passed since this PR was opened, changes.rst has been updated. You would need to move your entry to the current version. Could you please do that?
No idea about the infinite recursion, but looking at this, two things come to my mind:
- it might be a good idea for us to have a CI job where we have minimal dependencies installed, instead of patching load_module.
- same issues as #211 would apply here: we should check if what we do would work on the inference API side.
It might be a good idea to reduce the scope of this PR to list of strings only, to make it easier to merge and tackle other things in a separate PR.
@anferico I figured out the recursion error. Honestly, I'm a bit surprised that it didn't occur previously, but patching can be tricky. Please change the pandas_not_installed fixture like so:
```python
@pytest.fixture
def pandas_not_installed():
    # patch import so that it raises an ImportError when trying to import
    # pandas. This works because pandas is only imported lazily.
    orig_import = __import__

    def mock_import(name, *args, **kwargs):
        if name == "pandas":
            raise ImportError
        return orig_import(name, *args, **kwargs)

    with patch("builtins.__import__", side_effect=mock_import):
        yield
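The key point is that orig_import is bound to the real builtin before the patch is applied; if mock_import instead looked up `__import__` at call time, it would hit the mock again and recurse. A standalone sketch of the same mechanism outside pytest (my own demo, not the fixture itself):

```python
import builtins
from unittest.mock import patch

# Capture the real builtin BEFORE patching; capturing it inside
# mock_import would return the mock and recurse infinitely.
orig_import = builtins.__import__

def mock_import(name, *args, **kwargs):
    if name == "pandas":
        raise ImportError(f"mocked: no module named {name}")
    return orig_import(name, *args, **kwargs)

with patch("builtins.__import__", side_effect=mock_import):
    try:
        import pandas  # noqa: F401 -- raises because of the patch
        pandas_available = True
    except ImportError:
        pandas_available = False
    import math  # other imports still go through orig_import

print(pandas_available)  # False
print(math.sqrt(4))      # 2.0
```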
> - it might be a good idea for us to have a CI job where we have minimal dependencies installed, instead of patching load_module.
You mean a CI job where, for instance, pandas is not installed? Yes, that could be useful, but I'd still like to have a test that patches import, so that I can e.g. quickly test the functionality locally without the need to juggle multiple environments.
The extra CI environment would come in handy in case the test itself is buggy or incomplete. However, having such a test environment would probably require adjusting a bunch of other tests too that currently assume all these libraries to be installed.
> - same issues as #211 would apply here: we should check if what we do would work on the inference API side.
Would it be possible to add this in a separate PR?
> Would it be possible to add this in a separate PR?
what does "this" here refer to?
> I can e.g. quickly test the functionality locally without the need to juggle multiple environments.
Sure, I personally don't bother and let the CI do that for me.
> what does "this" here refer to?
I mean checking if what we do would work on the inference API side.
> I mean checking if what we do would work on the inference API side.
Yes and no 😁 we should at least check if we train a model with these data types, scikit-learn understands them (in terms of feature names etc), and that we can predict using these types and don't get errors/warnings from sklearn.
Right now sklearn supports feature names mostly through pandas only.
@BenjaminBossan thanks, the recursion error just got fixed. I also fixed the changes.rst file.
A couple of comments regarding my latest commit:
- I removed support for generators as valid containers for text-based tasks. The reason is that I couldn't find a way to peek at a generator without consuming it. Let me know if this is ok
- The function get_example_input_from_text_data does not fail on empty containers. Is that desirable or not?
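On the generator point above: a common trick for peeking at an iterator without losing its first element is to pull it and re-chain it (a sketch, not the PR's code; `peek_first` is a name I made up). The original generator object is still partially consumed, which is presumably why dropping generator support was the simpler choice.

```python
import itertools

def peek_first(iterable):
    """Return (first_item, replay_iterator) without dropping the item.

    The original iterator is advanced by one, so callers must use the
    returned replay iterator instead of the original object.
    """
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        return None, iter(())
    return first, itertools.chain([first], it)

gen = (s for s in ["a", "b", "c"])
first, replay = peek_first(gen)
replayed = list(replay)

print(first)     # a
print(replayed)  # ['a', 'b', 'c']
```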
@BenjaminBossan I've addressed the latest comments, let me know if this can be merged.
Done, thanks again for your support!