Feature Request: Double-space adds a period (full stop)
1. Pretty much depends on how often users use hall_of_fame separately:
hall_of_fame, state = equation_search(dataset; options, return_state=true)
dominating = calculate_pareto_curve(hall_of_fame, dataset, options)
According to you:
"So it would be easier for the user to query. More importantly, I would only return the dominating pareto curve, rather than the entire hall of fame (I doubt anybody wants the entire curve anyways)."
So it seems we should just have:
dataframe, state = equation_search(dataset; options, return_state=true)
Side note: I don't like return_state=true; it feels like a Python thing where the number of returned objects depends on a run-time value. Given it's an end-user function it doesn't have much performance implication, so it's just a matter of taste.
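For illustration, a minimal sketch (with a hypothetical internal helper, not the package's actual internals) of how a Val-typed flag would keep the return shape known at compile time while still offering both behaviors:

function search_maybe_state(dataset, options, ::Val{RS}) where {RS}
    hof, state = _run_search(dataset, options)  # hypothetical internal search
    # RS is a compile-time Bool, so the branch (and return type) is resolved statically:
    return RS ? (hof, state) : hof
end

hof, state = search_maybe_state(dataset, options, Val(true))
hof = search_maybe_state(dataset, options, Val(false))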
On DataFrame
I strongly recommend you NOT depend on DataFrames.jl; it's unnecessarily heavy. Returning a Dict, a NamedTuple, or something like Tables.jl / StructArrays.jl would be fine; it's trivial to pack into a DataFrame later.
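To illustrate the suggestion: a Vector of NamedTuples already satisfies the Tables.jl row-table interface, so conversion stays on the user's side (the column names here are hypothetical):

rows = [
    (complexity = 3, loss = 0.12, equation = "x1 + cos(x2)"),
    (complexity = 5, loss = 0.07, equation = "x1 + cos(x2) * x3"),
]
# Only users who want a DataFrame pay for the dependency:
# using DataFrames; df = DataFrame(rows)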
DataDrivenDiffEq and MLJ
Both are good ideas. For MLJ you just want to make an interface package and register it with the MLJ ecosystem; I don't know about the SciML convention.
Good tips, thanks! Yes perhaps that is best.
e.g., could return a single object result::ResultType that includes everything: .equations would be a Tables.jl object of the Pareto frontier, .state would be the search state, .options would be a copy of the search options, .best would be the best expression found (using a similar default as in PySR, combining accuracy and complexity). Perhaps you could call result(X, 5) to compute the predictions of the 5th expression on the dataset. And plot(result, X, y) to generate some nice default plots of the Pareto frontier.
More importantly, printing result would indicate these different fields in a nicely formatted output, so the user doesn’t need to read the API page.
Then, one could pass either result or result.state back to equation_search to continue where it left off. (And perhaps it could just read the options from there, or accept new options if the hyperparameters are compatible)
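A rough sketch of what such a type could look like (every name below is hypothetical, reconstructed from the description above rather than from an existing API):

struct SearchResult{E,S,O,B}
    equations::E   # Tables.jl-compatible Pareto frontier
    state::S       # search state, so it can be passed back to equation_search
    options::O     # copy of the search options used
    best::B        # best expression by a PySR-like accuracy/complexity criterion
end

# Calling the result evaluates the i-th frontier expression on new data:
(r::SearchResult)(X, i::Integer) = evaluate_expression(r.equations[i], X)  # hypothetical helper

# Discoverable printing, so the user doesn't need the API page:
function Base.show(io::IO, ::MIME"text/plain", r::SearchResult)
    println(io, "SearchResult with fields:")
    println(io, "  .equations :: Pareto frontier (Tables.jl-compatible)")
    println(io, "  .state     :: search state (pass back to resume)")
    println(io, "  .options   :: copy of the search options")
    println(io, "  .best      :: best expression found")
end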
Then, there could be new lightweight frontends for MLJ and SciML.
I am leaning towards an MLJ-style interface. I think the statefulness of the Regressor objects is nice for warm starts, and would be nice for plotting diagnostic info.
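For concreteness, this is roughly how that statefulness plays out under MLJ's machine protocol (SRRegressor is a hypothetical model name here):

using MLJ  # assumed available
model = SRRegressor(niterations = 20)  # hypothetical model wrapping equation_search
mach = machine(model, X, y)
fit!(mach)               # initial search
model.niterations = 40
fit!(mach)               # warm start: MLJ calls `update`, reusing the machine's stored state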
This might take the form of some sort of extension package that would load if users also import MLJ.jl.
I wonder if it should come with both a SciML interface (via ModelingToolkit.jl?), and an MLJ one. And the base interface defines internal types for an MLJ-style model setup.
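A rough sketch of the extension wiring on Julia ≥ 1.9 (module and file names are hypothetical). In Project.toml:

[weakdeps]
MLJModelInterface = "e80e1ace-859a-464e-9ed9-23947d8ae3ea"

[extensions]
SymbolicRegressionMLJExt = "MLJModelInterface"

And in ext/SymbolicRegressionMLJExt.jl:

module SymbolicRegressionMLJExt
using SymbolicRegression
using MLJModelInterface  # loading this weak dependency activates the extension
# MLJ model definition and fit/predict methods would live here
end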
Drafted the following Base.show method for Options. I think it looks much better:
Options:
├── Search Space:
│   ├── Unary operators: [cos, sin]                        # unary_operators
│   ├── Binary operators: [+, *, /, -]                     # binary_operators
│   ├── Max size of equations: 20                          # maxsize
│   └── Max depth of equations: 20                         # maxdepth
├── Search Size:
│   ├── Cycles per iteration: 550                          # ncycles_per_iteration
│   ├── Number of populations: 15                          # npopulations
│   └── Size of each population: 33                        # npop
├── The Objective:
│   ├── Elementwise loss function: L2DistLoss              # elementwise_loss
│   └── Full loss function (if any): nothing               # loss_function
├── Selection:
│   ├── Expressions per tournament: 12                     # tournament_selection_n
│   └── p(tournament winner=best expression): 0.86         # tournament_selection_p
├── Migration:
│   ├── Migrate equations: true                            # migration
│   ├── Migrate hall of fame equations: true               # hof_migration
│   ├── p(replaced) during migration: 0.00036              # fraction_replaced
│   ├── p(replaced) during hof migration: 0.035            # fraction_replaced_hof
│   └── Migration candidates per population: 12            # topn
├── Complexities:
│   ├── Parsimony factor: 0.0032                           # parsimony
│   ├── Complexity of each operator: [+=>1, *=>1, /=>1, -=>1, cos=>1, sin=>5]  # complexity_of_operators
│   ├── Complexity of constants: [1]                       # complexity_of_constants
│   ├── Complexity of variables: [1]                       # complexity_of_variables
│   ├── Slowly increase max size: 0.0                      # warmup_maxsize_by
│   ├── Use adaptive parsimony: true                       # use_frequency
│   ├── Use adaptive parsimony in tournament: true         # use_frequency_in_tournament
│   ├── Adaptive parsimony scaling factor: 20.0            # adaptive_parsimony_scaling
│   └── Simplify equations: true                           # should_simplify
When you see this in a REPL, the comments are printed in a light grey color.
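A condensed sketch of the printing trick: the trailing field-name comment is written with printstyled, so it renders light grey in a color-capable REPL (the helper name is hypothetical; the real method would walk all the fields):

function print_option(io::IO, prefix::String, label::String, value, fieldname::Symbol)
    print(io, prefix, label, ": ", value)
    printstyled(io, "  # ", fieldname; color = :light_black)
    println(io)
end

print_option(stdout, "│   ├── ", "Max size of equations", 20, :maxsize)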
My 2c:
1/2
A 2-style interface using a Tables.jl-compatible form (e.g. Vector{NamedTuple}) would be my preference.
Regarding returning the Pareto curve or the entire HoF, how about having something like this:
hall_of_fame, state = equation_search_full(dataset, options)
dominating, state = equation_search(hall_of_fame, dataset, options)
dominating, state = equation_search(dataset, options)
With two equation_search methods, one of which is simply
equation_search(dataset, options) =
equation_search(equation_search_full(dataset, options), dataset, options)
Alternatively, as has already come up, x, state could be replaced with some sort of Result structure. Then one could have the very simple:
full_result = equation_search_full(dataset, options)
dominating_result = equation_search(full_result)
dominating_result = equation_search(dataset, options)
Actually, if you made the Result structure iterable, you could support both of these usage patterns simultaneously.
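A minimal sketch of that, assuming a hypothetical Result type holding the equations and state; defining iterate is what makes destructuring like dominating, state = result keep working:

struct Result{E,S}
    equations::E
    state::S
end

# `dominating, state = result` lowers to `iterate` calls, so yielding the
# equations first and the state second supports both patterns at once:
Base.iterate(r::Result) = (r.equations, Val(:state))
Base.iterate(r::Result, ::Val{:state}) = (r.state, nothing)
Base.iterate(r::Result, ::Nothing) = nothing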
3
This sounds like it could be a good fit as a package extension to DataDrivenDiffEq.
4
This sounds like it could be a good package extension to have here.
I don't think MLJ interface packages make as much sense now that we have package extensions.
Speaking of 4., I have an attempt here: #226. Indeed I think it makes the most sense to put it in an extension.
I like your ideas for 1-2. I’ll think more about this.
Moving to mid-importance now that the MLJ interface has matured. Remaining API changes would be to improve the low-level interface.
(Finished a while ago)
Nice!