skops
skops copied to clipboard
Generate minimal README.md file in repository initialization (issue #113)
This PR references issue #113
Implemented the function _create_readme
which is called in hub_utils.init
. It creates a minimal README.md file. The input to this function is the same as the input to _create_config
function.
At the moment, the function _create_readme
only supports a tabular
task for the widget. I haven't added the text
task since other functions such as metadata_from_config
in _model_card.py
don't give support for the text
task yet. Let me know if this can be added otherwise.
I have manually tested the initialization of a repository with the example script plot_hf_hub.py
and it seemed to work as expected.
@BenjaminBossan any idea why the CI fails here? It's really odd to me.
any idea why the CI fails here? It's really odd to me.
I didn't follow this PR along. Is the expectation here that the error should be caught by the try
?
Oh I see where the issue comes from. The error is raised inside this test: test_init_empty_model_file_warns
So far, we have allowed people to pass an empty model file, we just raise a warning if they do so.
Now, this PR tries to actually load the model file and do things with it, which means it expects it to be a valid model file.
So, @BenjaminBossan , the question is: do we allow people to init a repo with an invalid model file? I'm leaning towards raising if the model file doesn't contain a valid model. WDYT?
do we allow people to init a repo with an invalid model file? I'm leaning towards raising if the model file doesn't contain a valid model. WDYT?
I would be okay with that, don't really see a use case where you'd start out without a model.
https://github.com/skops-dev/skops/pull/214 should fix the issue.
@jucamohedano note that I've merge with latest main
, you'd need to do a git pull
before applying more changes to have this change applied locally on your machine.
@jucamohedano would you be able to debug the issue?
@jucamohedano if you can't debug the issue, let us know. We can let another person continue this work :)
@adrinjalali I am actually debugging it now. I will some feedback later about my progress.
So the problem comes due to the fact that some tests are not using pickle
, and instead use skops.io
to dump and load models. So in hub_utils.init
I assumed that the model passed is a pickle file, and if you pass a skops model then it will fail to execute. Should I account for both formats or should I get rid of the pickle format and only accept the skops format in hub_utils.init
?
This issue comes from the test test_code_autogeneration_skops
Then I'd wait for #242 to be merged, since it'll fix this issue as well.
@jucamohedano Since #242 is merged, I think there is no blocker for this PR. Do you still intend on working on it?
@BenjaminBossan No, I was waiting for #242 to be merged. I have resolved some small conflicts in my last commit.
@jucamohedano Thanks for continuing to work on this. The tests are failing with:
NameError: name 'load' is not defined
That's probably easy to fix, I assume after merging an import statement was changed.
I actually had to modified the functionality I added to hub_utils.init
so that it checks for file format when loading the model file, previously I assumed it was just a pickle file, but now I also checked for the skops format.
Linter is failing here. You can consider enabling pre-commit hooks: https://github.com/skops-dev/skops/blob/main/CONTRIBUTING.rst#using-condamamba
Sorry! I did a clean installation, and forgot to install pre-commit hooks. Thank you for reminding.
I wanted to submit an empty commit to double check the codecov, but you're using your main
branch to submit the PR therefore I can't do that, you should always create a branch for each contribution on your repo.
Running the tests locally, I get the same result that those lines are not tested:
---------- coverage: platform linux, python 3.10.6-final-0 -----------
Name Stmts Miss Cover Missing
---------------------------------------------------------------
skops/__init__.py 8 1 88% 31
skops/card/__init__.py 2 0 100%
skops/card/_model_card.py 377 0 100%
skops/card/_templates.py 7 0 100%
skops/cli/__init__.py 0 0 100%
skops/cli/_convert.py 39 0 100%
skops/cli/_utils.py 8 2 75% 11, 13
skops/cli/entrypoint.py 12 0 100%
skops/hub_utils/__init__.py 2 0 100%
skops/hub_utils/_hf_hub.py 182 31 83% 435-436, 438, 586-597, 696-712, 752-775
skops/hub_utils/tests/common.py 1 0 100%
skops/io/__init__.py 2 0 100%
skops/io/_audit.py 110 12 89% 183, 220, 239, 270, 275-282, 300-305, 311
skops/io/_general.py 230 5 98% 36, 172, 216, 469, 472
skops/io/_numpy.py 105 3 97% 43, 74, 101
skops/io/_persist.py 66 1 98% 207
skops/io/_scipy.py 29 1 97% 46
skops/io/_sklearn.py 90 9 90% 10-11, 77, 79, 84, 118, 124-128
skops/io/_trusted_types.py 5 0 100%
skops/io/_utils.py 75 3 96% 16, 50, 152
skops/io/exceptions.py 6 0 100%
skops/io/tests/_utils.py 111 4 96% 33, 37, 109, 134
skops/utils/__init__.py 0 0 100%
---------------------------------------------------------------
TOTAL 1467 72 95%
Could you please investigate?
Sorry, I know I should made a new branch for each feature. I was excited to contribute to the project, and completely forgot about doing it.
I have fixed the tests. I ran them locally, and the lines stated above within the init
function aren't missing anymore.
Doc build is raising this issue which is relevant:
Unexpected failing examples:
/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_tabular_regression.py failed leaving traceback:
Traceback (most recent call last):
File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_tabular_regression.py", line 73, in <module>
hub_utils.init(
File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/skops/hub_utils/_hf_hub.py", line 437, in init
model = pikle_load(f)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.
Is there a way to check if the model file that's being loaded is in the skops persistent format? Because I could catch the error that you highlighted and in consequence load the model file using skops.io.load
, but I want to know if there's a cleaner way of doing it than catching the error.
since https://github.com/skops-dev/skops/pull/272 , the model_format
gives you that information
Great, thanks! I have updated the logic in the code to account for model format and the extension of the file if model_format
is auto
. However, I believe that the error that you highlighted will still occur because the example plot_tabular_regression.py
creates a skops persistent model which is saved with the .pkl
extension. This makes the _hf_hub.init
function fail when resolving the type of file to read. This can be easily fix by setting the model_format
argument to skops
or correcting the suffix given.
Hi, sorry that this PR is going for too long. I was looking at the output from the doc build, in particular this one
Unexpected failing examples:
/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_intelex.py failed leaving traceback:
Traceback (most recent call last):
File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_intelex.py", line 156, in <module>
hub_utils.init(
File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/skops/hub_utils/_hf_hub.py", line 444, in init
model = pikle_load(f)
EOFError: Ran out of input
This error happens because examples/plot_intelex.py
creates and opens temporary files that are also being opened in hub_utils.init
function through the use of the model
parameter. Maybe we can create those temporary files without using the wrapper that takes care of closing and deleting them, and instead we delete them explicitly. There might be a better solution that my knowledge can't reach.
This error happens because examples/plot_intelex.py creates and opens temporary files that are also being opened in hub_utils.init function through the use of the model parameter.
Hmm, if that's really the reason, doesn't it mean that this code is breaking backwards compatibility? I'd much prefer if it didn't, e.g. by making the creation of the README.md opt-in. WDYT @adrinjalali
Also, from a quick glance at the implementation, I think it can be greatly simplified. It is possible to instantiate the Card
class with a path to the model file (perhaps that was added after this PR started), no need to load the model inside of hub_utils.init
.
I'm quite confused by this. This code is a minimal reproducible for the issue:
from tempfile import NamedTemporaryFile
import pickle
with NamedTemporaryFile(mode="bw", prefix="stock-", suffix=".pkl") as fp:
pickle.dump("test", file=fp)
print(fp.name)
with open(fp.name, "rb") as f:
pickle.load(f)
Now if we add a seek(0)
for fp
, it works:
from tempfile import NamedTemporaryFile
import pickle
with NamedTemporaryFile(mode="bw", prefix="stock-", suffix=".pkl") as fp:
pickle.dump("test", file=fp)
print(fp.name)
fp.seek(0)
with open(fp.name, "rb") as f:
pickle.load(f)
To me, that makes 0 sense. Does it make sense to you @BenjaminBossan ?
To me, that makes 0 sense. Does it make sense to you @BenjaminBossan ?
I can't say I would have predicted it, but it doesn't seem too surprising. They way I visualize this to myself (probably technically quite incorrect) is that when I open a file, there is a "cursor" at position 0 for the file object. After writing to the file, it is at byte X, where X is the file size. When I try to read from the same file object, since the "cursor" is at the end, there is nothing to read. Therefore, seek(0)
(putting the cursor to position 0) solves the issue. Normally, a file object is only read from/written to once, so it doesn't become an issue, but here, it is used twice.
For this PR specifically, as mentioned earlier, I would prefer if the default behavior is not changed, i.e. creating a model card should be opt in. IIUC this would solve the issue. When it comes to the intelex example, if it would be changed to use temporary files the same way we do in other examples (i.e. no context manager), it should solve issue too, but context manager has the advantage of immediately cleaning up the temp file.
I'm still happy to create the README by default, and I think the issue we have here is a python bug. Your intuition about seek
would be correct if we weren't opening the file and just accessing the safe file descriptor. The issue might be that internally python is using the same fd, and therefore the need for seek
. The fix might be to do the seek after opening before passing it to pickle (although we probably then need to do that for our own load
as well).
Reading the python docs:
This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object. Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows). If delete is true (the default), the file is deleted as soon as it is closed. The returned object is always a file-like object whose file attribute is the underlying true file object. This file-like object can be used in a with statement, just like a normal file.
When passing the file to init
, we should expect it to be read, therefore it should be something which can be opened, and therefore I think we need to fix the intelex example accordingly.
I think the issue we have here is a python bug
Not sure about this :) At the very least, it's not specific to NamedTemporaryFile
:
with open('foo', 'w') as f:
f.write("hello")
# f.seek(0) fixes it
with open(f.name, 'r') as g:
bla = g.read()
print(bla) # prints nothing
therefore I think we need to fix the intelex example accordingly
Yes, we can certainly do that, but there could still be code out there that would break in the same fashion.