PySR [WIP] Alternative method to set PySRRegressor parameters

PySRRegressor currently has 74 parameters. That is way too many for a single method. It can be hard to figure out what a single parameter from the long list actually does, especially for new users - who are likely overwhelmed by the large number of options.

What this work-in-progress PR does is create an alternative way to initialize a PySRRegressor object. Rather than passing all parameters at initialization (which would still be allowed), you could instead call a series of different functions to configure various parts of the regressor.

Here's an example of what this might look like:

model = (
    PySRRegressor()
        .set_space(binary_operators=["+", "-"], unary_operators=["cos"], maxsize=30)
        .set_size(niterations=100, populations=50)
        .set_parallelism(procs=8)
)

Each of these functions would define only a subset of parameters according to the groups shown in the new API page: https://astroautomata.com/PySR/api/.

This makes for much more readable code, and hopefully a model that is easier for new users to explore, rather than having all parameters dumped in a flat list. It also presents an opportunity for better documentation: these set_... functions can have more specific explanations of, e.g., what an "equation space" is (for set_space), rather than giving a general overview in the __init__ function.

There are a couple issues with this:

Re-used documentation strings. You would have the docstring both at __init__ and at the set_... functions. Or, we could remove the docstring from __init__ altogether, and simply point the user to the various set_... functions they can use.
Some default parameters are None. How do we also make parameters optional in the set_... functions? My solution to this was to define a global NONE="none" constant that is sort of like an alternative definition of None, so we can differentiate between the user setting a variable to None or not.
A bit different than normal Scikit-learn initializations. However, I think most scikit-learn style APIs only have max ~15 parameters, so we are in a different regime.

Thanks @dfm for the idea! Would be interested in hearing your comments when you have time.

@patrick-kidger @tttc3 @JayWadekar I am eager to hear what you think as well.

Nov 02 '22 21:11 MilesCranmer

This sounds like a good idea to me.

Another option here is to define intermediate helper objects:

@dataclass
class Sizes:
    niterations: int
    populations: int

@dataclass
class Space:
    binary_operations: Sequence[Callable]
    unary_operations: Sequence[Callable]
    maxsize: int

class PySRRegressor:
    def __init__(self, sizes: Sizes, space: Space, ...)

Which can likewise help to split things up by grouping less-frequently-used parameters.

This is what I do for diffrax.diffeqsolve, for example. This has a very similar feel: differential equation solvers also take a lot of parameters, many of which are only used in particular use cases (nonlinear solvers only used for implicit problems; delays only used for DDEs; tolerances only used if you're stepping adaptively/etc.etc.)

Nov 02 '22 22:11 patrick-kidger

Thanks for the ideas! I am on the fence about which way to go: passing a hierarchy of objects, or declaratively setting parameters.

One one hand, I think that if a user wants to figure out the API just by using hover-help and tab completion in their IDE, then the functional method is better, and the typing method would require they keep flipping back to the docs page, and importing various objects.

But on the other hand, the typing method seems more readable. I can definitely see the advantage of a hierarchy of types when only certain parameter combinations are compatible with eachother. And it's also more modular in case a user wants to add their own configuration, or share optimal parameter configs. (Though perhaps this would be better suited for multiple-dispatch)

Maybe you could rule out certain param combos with the declarative method too, and have the functions return different subclasses of the main regressor object. Sort of like how ggplot or [altair](https://altair-viz.github.io/) creates plots. e.g., maybe .sequence_regressor() would return a subclass that could take sequences as input, rather than tabular data. And similar for tensor operations in the future. But for that perhaps it's just better to have a separate class suited for the task... Not sure!

Nov 06 '22 00:11 MilesCranmer

I don't have any other ideas, but personally, I think the hierarchy of types is more straightforward. It avoids the duplicated docstrings, and I think is easier to understand/follow if the user hasn't read the docs beforehand. Whichever option you decide, I think it will be a nice improvement!

Nov 13 '22 11:11 tttc3

Thanks for the feedback @tttc3!

I’m also wondering if it might be possible to get tab completion to work by exposing a “config” module that acts to store references to each type. Then you could write, for example, pysr.config.<tab> and have it show you what types are available. @patrick-kidger what is your experience with tab completions/parameter discoverability with the type-based API?

Maybe I should just mock up a few designs and conduct a Twitter poll…

Nov 14 '22 02:11 MilesCranmer

I've not tried using a config namespace to hold them - but otherwise yup, tab completion works well.

Nov 14 '22 06:11 patrick-kidger

Thanks! How do you use tab completion? i.e., what string of text are you actually hitting tab on?

Nov 14 '22 06:11 MilesCranmer

Typically:

-type-> diffrax.Kv
-tab-> diffrax.Kvaerno5
-type-> diffrax.Kvaerno5(no
-tab-> diffrax.Kvaerno5(nonlinear_solver=

As I have it set to prompt after two characters. (FWIW my current editors are Helix and NvChad.)

If you're thinking in terms of discoverability then that should work just the same, I think. Probably easier if you configure prompting to happen after zero characters though.

Nov 14 '22 07:11 patrick-kidger

Thanks. I guess if I do <tab> after typing diffrax, there are too many options for me to know what to select, right? i.e., if I don't know to type Kv, then one might have to go to the docs?

I wonder if it would be enough to have a prefix on all config parameters, like ConfigSpace, ConfigSize, etc. Then when you hit Config<tab> , it would only show you the small set of available configs, so you could step through more easily (since Config is easy enough to memorize). Or just have a pysr.config submodule...

Nov 14 '22 07:11 MilesCranmer

PySR PySR copied to clipboard

[WIP] Alternative method to set PySRRegressor parameters

PySR
PySR copied to clipboard