skops icon indicating copy to clipboard operation
skops copied to clipboard

Generate minimal README.md file in repository initialization (issue #113)

Open jucamohedano opened this issue 1 year ago • 53 comments

This PR references issue #113

Implemented the function _create_readme which is called in hub_utils.init. It creates a minimal README.md file. The input to this function is the same as the input to _create_config function. At the moment, the function _create_readme only supports a tabular task for the widget. I haven't added the text task since other functions such as metadata_from_config in _model_card.py don't give support for the text task yet. Let me know if this can be added otherwise.

I have manually tested the initialization of a repository with the example script plot_hf_hub.py and it seemed to work as expected.

jucamohedano avatar Nov 01 '22 21:11 jucamohedano

@BenjaminBossan any idea why the CI fails here? It's really odd to me.

adrinjalali avatar Nov 18 '22 17:11 adrinjalali

any idea why the CI fails here? It's really odd to me.

I didn't follow this PR along. Is the expectation here that the error should be caught by the try?

BenjaminBossan avatar Nov 18 '22 18:11 BenjaminBossan

Oh I see where the issue comes from. The error is raised inside this test: test_init_empty_model_file_warns

So far, we have allowed people to pass an empty model file, we just raise a warning if they do so.

Now, this PR tries to actually load the model file and do things with it, which means it expects it to be a valid model file.

So, @BenjaminBossan , the question is: do we allow people to init a repo with an invalid model file? I'm leaning towards raising if the model file doesn't contain a valid model. WDYT?

adrinjalali avatar Nov 21 '22 12:11 adrinjalali

do we allow people to init a repo with an invalid model file? I'm leaning towards raising if the model file doesn't contain a valid model. WDYT?

I would be okay with that, don't really see a use case where you'd start out without a model.

BenjaminBossan avatar Nov 21 '22 12:11 BenjaminBossan

https://github.com/skops-dev/skops/pull/214 should fix the issue.

adrinjalali avatar Nov 21 '22 13:11 adrinjalali

@jucamohedano note that I've merge with latest main, you'd need to do a git pull before applying more changes to have this change applied locally on your machine.

adrinjalali avatar Nov 22 '22 15:11 adrinjalali

@jucamohedano would you be able to debug the issue?

adrinjalali avatar Nov 29 '22 15:11 adrinjalali

@jucamohedano if you can't debug the issue, let us know. We can let another person continue this work :)

adrinjalali avatar Dec 06 '22 09:12 adrinjalali

@adrinjalali I am actually debugging it now. I will some feedback later about my progress.

jucamohedano avatar Dec 06 '22 09:12 jucamohedano

So the problem comes due to the fact that some tests are not using pickle, and instead use skops.io to dump and load models. So in hub_utils.init I assumed that the model passed is a pickle file, and if you pass a skops model then it will fail to execute. Should I account for both formats or should I get rid of the pickle format and only accept the skops format in hub_utils.init ?

This issue comes from the test test_code_autogeneration_skops

jucamohedano avatar Dec 06 '22 10:12 jucamohedano

Then I'd wait for #242 to be merged, since it'll fix this issue as well.

adrinjalali avatar Dec 06 '22 11:12 adrinjalali

@jucamohedano Since #242 is merged, I think there is no blocker for this PR. Do you still intend on working on it?

BenjaminBossan avatar Jan 18 '23 14:01 BenjaminBossan

@BenjaminBossan No, I was waiting for #242 to be merged. I have resolved some small conflicts in my last commit.

jucamohedano avatar Jan 19 '23 20:01 jucamohedano

@jucamohedano Thanks for continuing to work on this. The tests are failing with:

NameError: name 'load' is not defined

That's probably easy to fix, I assume after merging an import statement was changed.

BenjaminBossan avatar Jan 20 '23 11:01 BenjaminBossan

I actually had to modified the functionality I added to hub_utils.init so that it checks for file format when loading the model file, previously I assumed it was just a pickle file, but now I also checked for the skops format.

jucamohedano avatar Jan 21 '23 16:01 jucamohedano

Linter is failing here. You can consider enabling pre-commit hooks: https://github.com/skops-dev/skops/blob/main/CONTRIBUTING.rst#using-condamamba

adrinjalali avatar Jan 23 '23 12:01 adrinjalali

Sorry! I did a clean installation, and forgot to install pre-commit hooks. Thank you for reminding.

jucamohedano avatar Jan 23 '23 21:01 jucamohedano

I wanted to submit an empty commit to double check the codecov, but you're using your main branch to submit the PR therefore I can't do that, you should always create a branch for each contribution on your repo.

Running the tests locally, I get the same result that those lines are not tested:

---------- coverage: platform linux, python 3.10.6-final-0 -----------
Name                              Stmts   Miss  Cover   Missing
---------------------------------------------------------------
skops/__init__.py                     8      1    88%   31
skops/card/__init__.py                2      0   100%
skops/card/_model_card.py           377      0   100%
skops/card/_templates.py              7      0   100%
skops/cli/__init__.py                 0      0   100%
skops/cli/_convert.py                39      0   100%
skops/cli/_utils.py                   8      2    75%   11, 13
skops/cli/entrypoint.py              12      0   100%
skops/hub_utils/__init__.py           2      0   100%
skops/hub_utils/_hf_hub.py          182     31    83%   435-436, 438, 586-597, 696-712, 752-775
skops/hub_utils/tests/common.py       1      0   100%
skops/io/__init__.py                  2      0   100%
skops/io/_audit.py                  110     12    89%   183, 220, 239, 270, 275-282, 300-305, 311
skops/io/_general.py                230      5    98%   36, 172, 216, 469, 472
skops/io/_numpy.py                  105      3    97%   43, 74, 101
skops/io/_persist.py                 66      1    98%   207
skops/io/_scipy.py                   29      1    97%   46
skops/io/_sklearn.py                 90      9    90%   10-11, 77, 79, 84, 118, 124-128
skops/io/_trusted_types.py            5      0   100%
skops/io/_utils.py                   75      3    96%   16, 50, 152
skops/io/exceptions.py                6      0   100%
skops/io/tests/_utils.py            111      4    96%   33, 37, 109, 134
skops/utils/__init__.py               0      0   100%
---------------------------------------------------------------
TOTAL                              1467     72    95%

Could you please investigate?

adrinjalali avatar Jan 30 '23 12:01 adrinjalali

Sorry, I know I should made a new branch for each feature. I was excited to contribute to the project, and completely forgot about doing it.

I have fixed the tests. I ran them locally, and the lines stated above within the init function aren't missing anymore.

jucamohedano avatar Jan 31 '23 23:01 jucamohedano

Doc build is raising this issue which is relevant:

Unexpected failing examples:
/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_tabular_regression.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_tabular_regression.py", line 73, in <module>
    hub_utils.init(
  File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/skops/hub_utils/_hf_hub.py", line 437, in init
    model = pikle_load(f)
_pickle.UnpicklingError: A load persistent id instruction was encountered,
but no persistent_load function was specified.

adrinjalali avatar Feb 27 '23 12:02 adrinjalali

Is there a way to check if the model file that's being loaded is in the skops persistent format? Because I could catch the error that you highlighted and in consequence load the model file using skops.io.load, but I want to know if there's a cleaner way of doing it than catching the error.

jucamohedano avatar Feb 28 '23 22:02 jucamohedano

since https://github.com/skops-dev/skops/pull/272 , the model_format gives you that information

adrinjalali avatar Mar 01 '23 08:03 adrinjalali

Great, thanks! I have updated the logic in the code to account for model format and the extension of the file if model_format is auto. However, I believe that the error that you highlighted will still occur because the example plot_tabular_regression.py creates a skops persistent model which is saved with the .pkl extension. This makes the _hf_hub.init function fail when resolving the type of file to read. This can be easily fix by setting the model_format argument to skops or correcting the suffix given.

jucamohedano avatar Mar 01 '23 22:03 jucamohedano

Hi, sorry that this PR is going for too long. I was looking at the output from the doc build, in particular this one

Unexpected failing examples:
/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_intelex.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/examples/plot_intelex.py", line 156, in <module>
    hub_utils.init(
  File "/home/docs/checkouts/readthedocs.org/user_builds/skops/checkouts/207/skops/hub_utils/_hf_hub.py", line 444, in init
    model = pikle_load(f)
EOFError: Ran out of input

This error happens because examples/plot_intelex.py creates and opens temporary files that are also being opened in hub_utils.init function through the use of the model parameter. Maybe we can create those temporary files without using the wrapper that takes care of closing and deleting them, and instead we delete them explicitly. There might be a better solution that my knowledge can't reach.

jucamohedano avatar Mar 09 '23 21:03 jucamohedano

This error happens because examples/plot_intelex.py creates and opens temporary files that are also being opened in hub_utils.init function through the use of the model parameter.

Hmm, if that's really the reason, doesn't it mean that this code is breaking backwards compatibility? I'd much prefer if it didn't, e.g. by making the creation of the README.md opt-in. WDYT @adrinjalali

Also, from a quick glance at the implementation, I think it can be greatly simplified. It is possible to instantiate the Card class with a path to the model file (perhaps that was added after this PR started), no need to load the model inside of hub_utils.init.

BenjaminBossan avatar Mar 10 '23 11:03 BenjaminBossan

I'm quite confused by this. This code is a minimal reproducible for the issue:

from tempfile import NamedTemporaryFile
import pickle

with NamedTemporaryFile(mode="bw", prefix="stock-", suffix=".pkl") as fp:
    pickle.dump("test", file=fp)
    print(fp.name)
    with open(fp.name, "rb") as f:
        pickle.load(f)

Now if we add a seek(0) for fp, it works:

from tempfile import NamedTemporaryFile
import pickle

with NamedTemporaryFile(mode="bw", prefix="stock-", suffix=".pkl") as fp:
    pickle.dump("test", file=fp)
    print(fp.name)
    fp.seek(0)
    with open(fp.name, "rb") as f:
        pickle.load(f)

To me, that makes 0 sense. Does it make sense to you @BenjaminBossan ?

adrinjalali avatar Mar 13 '23 09:03 adrinjalali

To me, that makes 0 sense. Does it make sense to you @BenjaminBossan ?

I can't say I would have predicted it, but it doesn't seem too surprising. They way I visualize this to myself (probably technically quite incorrect) is that when I open a file, there is a "cursor" at position 0 for the file object. After writing to the file, it is at byte X, where X is the file size. When I try to read from the same file object, since the "cursor" is at the end, there is nothing to read. Therefore, seek(0) (putting the cursor to position 0) solves the issue. Normally, a file object is only read from/written to once, so it doesn't become an issue, but here, it is used twice.

For this PR specifically, as mentioned earlier, I would prefer if the default behavior is not changed, i.e. creating a model card should be opt in. IIUC this would solve the issue. When it comes to the intelex example, if it would be changed to use temporary files the same way we do in other examples (i.e. no context manager), it should solve issue too, but context manager has the advantage of immediately cleaning up the temp file.

BenjaminBossan avatar Mar 13 '23 10:03 BenjaminBossan

I'm still happy to create the README by default, and I think the issue we have here is a python bug. Your intuition about seek would be correct if we weren't opening the file and just accessing the safe file descriptor. The issue might be that internally python is using the same fd, and therefore the need for seek. The fix might be to do the seek after opening before passing it to pickle (although we probably then need to do that for our own load as well).

adrinjalali avatar Mar 13 '23 15:03 adrinjalali

Reading the python docs:

This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object. Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows). If delete is true (the default), the file is deleted as soon as it is closed. The returned object is always a file-like object whose file attribute is the underlying true file object. This file-like object can be used in a with statement, just like a normal file.

When passing the file to init, we should expect it to be read, therefore it should be something which can be opened, and therefore I think we need to fix the intelex example accordingly.

adrinjalali avatar Mar 13 '23 15:03 adrinjalali

I think the issue we have here is a python bug

Not sure about this :) At the very least, it's not specific to NamedTemporaryFile:

with open('foo', 'w') as f:
    f.write("hello")
    # f.seek(0) fixes it
    with open(f.name, 'r') as g:
        bla = g.read()
        print(bla)  # prints nothing

therefore I think we need to fix the intelex example accordingly

Yes, we can certainly do that, but there could still be code out there that would break in the same fashion.

BenjaminBossan avatar Mar 13 '23 16:03 BenjaminBossan