sklearn-xarray icon indicating copy to clipboard operation
sklearn-xarray copied to clipboard

add wrap_gridsearch_results function

Open aspitarl opened this issue 3 years ago • 1 comments

This is the initial version of the wrapper to convert gridsearchCV results per the thread in the discussion.

Codewise, I use a multindexed pandas to create the Dataset. Perhaps there is a more elegant xarray only way of doing this, but I see pandas is in the requirements.

I'm new to contributing to projects and not sure what other things are needed for this pull request, so let me know. Is the comment in the function definition the "documentation" that is being referred to? I'm also unfamiliar with implementing tests and not sure how one would be implemented in this case without adding a Dataset.

aspitarl avatar Mar 15 '21 05:03 aspitarl

Thanks for your contribution @aspitarl !

Using pandas to create the Dataset is fine, pandas is a dependency for xarray anyway. I believe it might make more sense to return a DataArray instead of a Dataset with a single data variable. Or is there a case where we would have multiple variables in the result?

Re: documentation, the comment after the function definition (called the "docstring") is definitely a big part of it. This will be included in the API Reference part of the documentation, all you have to do is add the function at the appropriate place in the source document which is here, in the model selection section under the .. autosummary:: directive.

However, most users will look at the user guide part of the documentation first, as that usually provides step-by-step examples on how to use a certain functionality. A good place for an example for your function would be here. You could just pass the results of the grid search example into the new function to show how it works.

The way the documentation is set up with the >>> prompt (and actually the .. doctest:: directive that isn't visible in the rendered preview) means that the code examples are actually executed when the documentation is built and the results are compared to the specified output. This way we make sure that the user will get the same results when they run the examples themselves.

Which brings me to the next point, testing. Unit testing is a way of making sure that your function works as expected with a range of inputs that you would expect users to pass to it. The tests for the model selection module are here. Instead of setting up a complete example, you "mock" the function input by constructing an artificial fitted GridSearchCV instance and verifying the output of your function. Something like:

def test_wrap_gridsearch_results():

    gs = GridSearchCV(...)

    # mock results without fitting
    gs.cv_results_ = ...

    # check output
    da = wrap_gridsearch_results(gs)
    assert isinstance(da, xr.DataArray)
    assert da.coords == ...
    # you can also use xr.testing.assert_equal

At that point, you might want to think about what happens when the user calls your function with an incorrect argument, e.g. an unfitted GridSearchCV instance. Do they get a concise error message that tells them what went wrong and how to fix it? This helper function might be useful here.

Last but not least, I really liked the minimal example that you provided in the initial discussion. You could add that to the examples with minimal modifications (have a look at this other example as a template).

I know this is a lot at once. Let me know if you need help at any point (e.g. setting up a development environment, I realize now that there's no guide on that). There's also the contribution guide in the wiki; unfortunately it's a little outdated and not extremely helpful.

phausamann avatar Mar 15 '21 09:03 phausamann