skops
Support more array-like data types for tabular data and list-like data types for text data
This pull request is an attempt to address #65. Note that this is my first ever public contribution, so don't be too hard on me 😉
For tabular-classification and tabular-regression tasks, we're currently supporting the following data types:
- pandas.DataFrame
- numpy.ndarray

Should this PR get merged, we will support any other data type that can be converted to a 2D numpy.ndarray using np.asarray.
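For illustration, a minimal sketch of what this would enable (the variable names are mine, not from the PR): any nested array-like, such as a list of lists, would be accepted because it round-trips through np.asarray into a 2D array.

```python
import numpy as np

# A plain list of lists is not a DataFrame or ndarray, but np.asarray
# converts it into a 2D array, which is all the PR would require.
X_list = [[1.0, 2.0], [3.0, 4.0]]
X = np.asarray(X_list)

print(X.shape)  # (2, 2)
print(X.ndim)   # 2
```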
For text-classification and text-regression tasks, we're currently supporting the following data types:
- List of strings

Should this PR get merged, we will support any iterable of strings.
xref: https://github.com/skops-dev/skops/pull/45/files#r933221139
Thanks a lot @anferico for taking this.
I have only taken a quick look so far and noticed that you haven't added any unit tests for the changes. Could you please do that? Let us know if you need any help with that.
Also, please add an entry to the changes doc.
I haven't added any unit tests for the 2 small helper functions I added, though I've modified existing ones in accordance with the changes I made. Not sure what your policy is here, but if needed I can add unit tests for those as well. I also added some docstrings and modified existing ones.
Everything should be fine since all unit tests were passing on my local machine, but let's wait until the pytest workflow runs successfully.
@BenjaminBossan I've tried addressing your comments, see if it's good now. One minor detail: I renamed the function _is_iterable_of_strings to _is_sequence_of_strings because I'm no longer expecting generic iterables as inputs to that function, which led me to remove the logic that prevents iterators and generators from being exhausted when iterated over. Without that logic, calling it _is_iterable_of_strings just felt wrong, so I went for _is_sequence_of_strings.
@anferico are you still working on this?
> @anferico are you still working on this?
I am, it's just that I haven't had much free time recently. I'm happy to continue working on this, but if it's really high priority please feel free to let someone else take over. Sorry about that
Hey, no worries, take the time you need. I just asked because sometimes people just forget or are waiting on some input from the maintainers.
@BenjaminBossan I think I've addressed all the points above, however now I'm facing another problem.
I noticed that when calling the test_get_column_names_pandas_not_installed function, I run into a weird recursion error. Obviously, as a result, the test fails.
Here's the command I ran:
python -m pytest -s -v --cov-report=xml skops/hub_utils/tests/test_hf_hub.py::test_get_column_names_pandas_not_installed
and here is part of the output I get (see log_test.txt for the full output):
...
File "/home/anfri/MyProjects/skops/skops/conftest.py", line 13, in mock_import
File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1103, in __call__
File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1111, in _increment_mock_call
File "/home/anfri/MyProjects/skops/skops/conftest.py", line 13, in mock_import
File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1103, in __call__
File "/home/anfri/miniconda3/envs/skops/lib/python3.10/unittest/mock.py", line 1111, in _increment_mock_call
RecursionError: maximum recursion depth exceeded
I'm not sure what's happening here, because I haven't touched conftest.py. My suspicion is that the line `from sklearn.utils import check_array` that I added to _hf_hub.py could be the culprit: there are some pandas imports in the source file containing that function, so us patching the import builtin in such a way that it throws an ImportError while trying to import pandas could be a problem, but I'm really not sure.
> My suspicion is that the line `from sklearn.utils import check_array` that I added to _hf_hub.py could be the culprit [...]
It seems likely that this is the explanation, but I'm not sure why that would lead to an infinite recursion. I'd have to dig deeper. @adrinjalali any ideas?
Meanwhile, since some time has passed since this PR was opened, changes.rst has been updated. You would need to move your entry to the current version. Could you please do that?
No idea about the infinite recursion, but looking at this, two things come to my mind:
- it might be a good idea for us to have a CI job where we have minimal dependencies installed, instead of patching load_module.
- same issues as #211 would apply here: we should check if what we do would work on the inference API side.
It might be a good idea to reduce the scope of this PR to list of strings only, to make it easier to merge and tackle other things in a separate PR.
@anferico I figured out the recursion error. Honestly, I'm a bit surprised that it didn't occur previously, but patching can be tricky. Please change the pandas_not_installed fixture like so:
```python
@pytest.fixture
def pandas_not_installed():
    # patch import so that it raises an ImportError when trying to import
    # pandas. This works because pandas is only imported lazily.
    orig_import = __import__

    def mock_import(name, *args, **kwargs):
        if name == "pandas":
            raise ImportError
        return orig_import(name, *args, **kwargs)

    with patch("builtins.__import__", side_effect=mock_import):
        yield
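The key point is that orig_import is bound to the real builtin before the patch is applied; if mock_import instead looked up `__import__` at call time, it would hit the mock again and recurse. A standalone sketch of the same mechanism outside pytest (my own demo, not the fixture itself):

```python
import builtins
from unittest.mock import patch

# Capture the real builtin BEFORE patching; capturing it inside
# mock_import would return the mock and recurse infinitely.
orig_import = builtins.__import__

def mock_import(name, *args, **kwargs):
    if name == "pandas":
        raise ImportError(f"mocked: no module named {name}")
    return orig_import(name, *args, **kwargs)

with patch("builtins.__import__", side_effect=mock_import):
    try:
        import pandas  # noqa: F401 -- raises because of the patch
        pandas_available = True
    except ImportError:
        pandas_available = False
    import math  # other imports still go through orig_import

print(pandas_available)  # False
print(math.sqrt(4))      # 2.0
```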
> - it might be a good idea for us to have a CI job where we have minimal dependencies installed, instead of patching load_module.
You mean a CI job where, for instance, pandas is not installed? Yes, that could be useful, but I'd still like to have a test that patches import, so that I can e.g. quickly test the functionality locally without the need to juggle multiple environments.
The extra CI environment would come in handy in case the test itself is buggy or incomplete. However, having such a test environment would probably require adjusting a bunch of other tests too that currently assume all these libraries to be installed.
> - same issues as #211 would apply here: we should check if what we do would work on the inference API side.
Would it be possible to add this in a separate PR?
> Would it be possible to add this in a separate PR?
what does "this" here refer to?
> I can e.g. quickly test the functionality locally without the need to juggle multiple environments.
Sure, I personally don't bother and let the CI do that for me.
> what does "this" here refer to?
I mean checking if what we do would work on the inference API side.
> I mean checking if what we do would work on the inference API side.
Yes and no 😁 we should at least check if we train a model with these data types, scikit-learn understands them (in terms of feature names etc), and that we can predict using these types and don't get errors/warnings from sklearn.
Right now sklearn supports feature names mostly through pandas only.
@BenjaminBossan thanks, the recursion error just got fixed. I also fixed the changes.rst file.
A couple of comments regarding my latest commit:
- I removed support for generators as valid containers for text-based tasks. The reason is that I couldn't find a way to peek at a generator without consuming it. Let me know if this is ok
- The function get_example_input_from_text_data does not fail on empty containers. Is that desirable or not?
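On the generator point above: a common trick for peeking at an iterator without losing its first element is to pull it and re-chain it (a sketch, not the PR's code; `peek_first` is a name I made up). The original generator object is still partially consumed, which is presumably why dropping generator support was the simpler choice.

```python
import itertools

def peek_first(iterable):
    """Return (first_item, replay_iterator) without dropping the item.

    The original iterator is advanced by one, so callers must use the
    returned replay iterator instead of the original object.
    """
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        return None, iter(())
    return first, itertools.chain([first], it)

gen = (s for s in ["a", "b", "c"])
first, replay = peek_first(gen)
replayed = list(replay)

print(first)     # a
print(replayed)  # ['a', 'b', 'c']
```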
@BenjaminBossan I've addressed the latest comments, let me know if this can be merged.
Done, thanks again for your support!