SymbolicRegression.jl
Questions on large datasets, best settings, and scaling
I switched my code from Python to Julia (to generate the datasets). I'm starting with a "small" dataset of 1,257,984 items, so inputX is a Matrix{Float32}(10, 1257984) and outputY is a Vector{Float32}(1257984). (The actual data is integers, but I cast them to Float32.) outputY is either 0 or 1, so it's a binary problem: essentially, I'm looking for an equation that separates two sets of vectors.
http://sirisian.com/randomfiles/symbolicregression/inputX.bin http://sirisian.com/randomfiles/symbolicregression/outputY.bin
using SymbolicRegression
using SymbolicUtils
using Serialization
inputX = deserialize("inputX.bin")
outputY = deserialize("outputY.bin")
options = SymbolicRegression.Options(
    binary_operators=(+, *, /, -, ^),
    unary_operators=(sqrt,),
    constraints=((^) => (-1, 3),),
    # target == 0: we want the predicted value to be < 0
    # target == 1: we want the predicted value to be >= 0
    loss=(t, p) -> t == 0 ? (p < 0 ? 0 : p^2 + 0.0001) : (p >= 0 ? 0 : p^2 + 0.0001),
    maxsize=48,
    maxdepth=8,
    npopulations=20,
    batching=true,
    batchSize=1000,
    progress=true
)
hallOfFame = EquationSearch(inputX, outputY, niterations=1000, options=options, numprocs=0, multithreading=true)
dominating = calculateParetoFrontier(inputX, outputY, hallOfFame, options)
eqn = node_to_symbolic(dominating[end].tree, options)
println(simplify(eqn))
for member in dominating
    size = countNodes(member.tree)
    score = member.score
    string = stringTree(member.tree, options)
    println("$(size)\t$(score)\t$(string)")
end
I have a lot of questions. I don't know what's feasible, so I think it's easier to ask them in one place.
- What settings should I be using for best results given my problem? Say I have an i9-12900k, so 24 threads to work with. I'm having trouble reasoning about the batch size, or whether I need to use other settings to speed things up.
- The loss function I have defined does not appear to work. My thinking was to drive the system to return values that are negative when the output is 0 and positive when the output is 1. For some reason, when I let the above run, it returns things like:

1	0.000e+00	Inf	-0.7851169

So I'm fundamentally not understanding something. Surely a constant equation like -0.5 would return a large loss? Say the target was 1 for an element being evaluated. Then:

1 == 0 ? (-0.5 < 0 ? 0 : (-0.5)^2 + 0.0001) : (-0.5 >= 0 ? 0 : (-0.5)^2 + 0.0001) = 0.2501

I noticed there are loss functions for binary classification problems; I see cross-entropy and hinge loss mentioned a lot, but I don't see them in the listed loss functions. I also noticed that in other places the target is expected to be -1 or 1, not 0 or 1. Any tips for using such loss functions with this library?
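Regarding the -1/1 target convention mentioned above: one option (a sketch, not necessarily how SymbolicRegression.jl's built-in losses expect the data) is to remap the 0/1 labels before searching and use a standard L1 hinge loss. The hinge definition below is the textbook one, max(0, 1 - target * prediction), written out by hand:

```julia
# Remap 0/1 labels to the -1/+1 convention used by margin-based losses.
outputY01 = Float32[0, 1, 1, 0]
outputYpm = 2 .* outputY01 .- 1          # 0 -> -1, 1 -> +1

# Textbook L1 hinge loss: max(0, 1 - target * prediction).
hinge(p, t) = max(zero(p), one(p) - t * p)

println(hinge(2.0f0, 1.0f0))    # confidently correct -> 0
println(hinge(-0.5f0, 1.0f0))   # wrong side of zero  -> 1.5
```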
- Is it possible to force the algorithm to include a minimum complexity, say the 10 variables + 9 operators, so 19? The simplest possible equation I would ever expect to work is x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, and I can write a cubic function that gets close to the result.
- Is it possible to define custom expression subtrees that use variables with lower complexity? Ideally I'd expect it to use varMap. For example, say I have varMap=["x0", "y0", "x1", "y1", "x2", "y2", "x3", "y3", "cx", "cy"]. While I'm not sure (x0 - x1) will definitely be in my equation, the likelihood is much higher than for other subtrees. So I might have a list like ["x0 - x1", "x1 - x0", "y0 - y1", "y1 - y0", ...] (or however it would best be stored), and instead of complexity 3 each would count as complexity 2. In my case this comes up because they're control points on a curve, where (x0, y0), (x1, y1), (x2, y2), (x3, y3) are sequential and vector subtraction is expected to show up.
- My complete dataset is ~20,000,000 trillion elements: the same structure with 10 integers and a 0-or-1 binary result, but at different resolutions. Obviously I can't compute perfect equations, or even loop over all of it, within a lifetime. It is a fixed set, though, that I can randomly sample (i.e. Monte Carlo) to verify the quality of a result. My question is whether it's possible to support a kind of randomized-window dataset: rather than supplying a fixed dataset, one would supply a function (or generator) that returns a random record. Let's call this getRandomRecord. Calling this method might be slightly expensive, so you wouldn't want to call it and throw away the result after using it once.
I assume this would be an extension of the batching system. The system could generate data (probably multithreaded) by calling getRandomRecord() to populate a few GB, then batching would create a random array view on this data for each of its batches. The algorithm would never be able to loop over all of the real data, but it could create another random view and calculate a fuzzy loss. Periodically, the whole random dataset would be regenerated.
Even with my 1,257,984-record example above, I don't think it's statistically necessary to iterate over the whole dataset at the beginning. It would definitely be useful to sample it quickly. Maybe if the total loss is low it would sample more data: the closer the total loss (what's the term for this: the sum of all losses? total error?) is to 0, the more of the whole dataset it would attempt to sample. So starting out it might be fine using a 1K view, then slowly ramp up to 100K?
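The buffer-plus-views idea above could be sketched in plain Julia as follows. getRandomRecord here is a hypothetical stand-in for the real (expensive) generator, and the view mechanics are independent of how SymbolicRegression.jl's batching would actually hook in:

```julia
using Random

# Hypothetical record generator; expensive in practice, trivial here.
getRandomRecord() = (rand(Float32, 10), rand(Bool) ? 1.0f0 : 0.0f0)

# Fill a reusable buffer of records in place.
function fill_buffer!(X, y)
    for i in axes(X, 2)
        rec, label = getRandomRecord()
        X[:, i] .= rec
        y[i] = label
    end
end

n = 10_000
X = Matrix{Float32}(undef, 10, n)
y = Vector{Float32}(undef, n)
fill_buffer!(X, y)

# A "batch" is just a random view into the buffer -- no copying.
idx = rand(1:n, 1_000)
Xbatch = view(X, :, idx)
ybatch = view(y, idx)

# Periodically call fill_buffer! again to refresh the pool.
```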
- In addition to the above, could this be scaled over hundreds of computers in the cloud? I couldn't get numprocs to work without it crashing, so I'm not sure how stable it is. The concept would be that each computer generates its own randomized dataset, and they only talk to ensure they don't process the same equations. My experience with distributed computing is limited to C++ MPI work years ago, so I'm not familiar with Julia's features yet. I don't know your algorithm, but is it in your scope to support such a distributed cloud setup in an easy way?
(Also, as a side note, I wouldn't be against sponsoring some of these changes for beer money, say 500 dollars. I'm interested in the potential of these kinds of projects even if they don't solve my specific problems.)
Hi @sirisian, sorry for the delay in responding. Am on travel for the week so won't be able to answer all of these now.
The first thing I want to note is that symbolic regression works well with very small datasets. I usually use ~3,000 datapoints max. More datapoints generally won't help find a better solution; they will just slow the search down significantly.
In addition, you should limit the number of features. 10 is pushing it; I would try to find the ~5 most important features beforehand and only input those. These combinatorial searches scale with O(n!), where n is the number of features, so the fewer features the better.
After training with 5 features, you could try to fit the error with the additional 5 features. You could also check https://github.com/MilesCranmer/symbolic_deep_learning/ for how to do symbolic regression on high-dimensional datasets, though it's not a robust method and will require some work to adapt to your problem.
- With 24 threads available, I would try using 24 procs, or 48 if you use multithreading=true.
- The printout shows "Inf" for your loss, so I think it hasn't completed a search yet. If you slim the dataset down to 3,000 points, it should be much easier to get some results out. But you are right that your prediction problem might be closer to a classification problem, so do check out the losses page: https://astroautomata.com/SymbolicRegression.jl/dev/losses/.
- Minimum complexity, no (complexity is built up iteratively, so it would be hard to search when starting from a large equation), but maximum complexity, yes. Your current maxsize=48 is a bit high and might hurt the search; maybe start with 30? You could also try the warmupMaxsizeBy option (https://github.com/MilesCranmer/SymbolicRegression.jl/blob/8eceff3807456d7547122b40171747a4c128377b/src/Options.jl#L126), which warms up the maxsize slowly over the given fraction of training time (e.g., 0.5 would reach max size 50% of the way through).
- This is an interesting area. Your suggestion is what I would start with: feed in the features you think will matter most. The other option is again to try the symbolic deep learning method, where you learn a hierarchy of features via deep learning, then fit each level of the hierarchy with symbolic regression. Again, it's not a robust method, but for very complex equations it might be the way to go.
- This is very interesting. I'm not sure how fast it would be; the equation evaluation part is extremely fast because the data is a fixed size, but loading or slicing data from a large array might be expensive. Maybe one could sample the data and do convergence testing until an accurate loss estimate is found? Certainly one wouldn't need the entire dataset to get a reasonable estimate.
- Yes, I've used it over a cluster before. Instead of passing numprocs (which probably crashed due to the dataset size; it duplicates the data for every proc), you could define a set of processes outside of EquationSearch and then feed them in with procs=myprocs. Depending on what sort of computing you are using, check out https://github.com/JuliaParallel/ClusterManagers.jl for some options. These would let you create a set of processes, which you can feed into EquationSearch. To use this, you will need to declare things with the @everywhere macro, which will let you set up a search over a cluster: see e.g. https://docs.julialang.org/en/v1/manual/distributed-computing/#ClusterManagers.
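Based on the advice above, the setup might look roughly like this. This is a sketch: myprocs and the procs= keyword follow the description in this comment, so check the current EquationSearch signature before using it, and note that inputX/outputY/options come from the earlier snippets.

```julia
using Distributed

# Create worker processes first. On a real cluster, ClusterManagers.jl
# provides SLURM/PBS/etc. equivalents of addprocs.
myprocs = addprocs(24)

# Everything the workers need must be loaded on every process.
@everywhere using SymbolicRegression

# Hand the workers to the search instead of passing numprocs.
hallOfFame = EquationSearch(inputX, outputY,
                            niterations=1000,
                            options=options,
                            procs=myprocs)
```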
And thanks for the offer! I can't take honorariums/sponsorship because of my current US visa and unfortunately must turn it down, but I appreciate the thoughtfulness! If anything I'd recommend sponsoring the Julia project, they're awesome (and the better funding they have, the faster Julia will get!).
Cheers, Miles
- With 24 threads available, I would try using 24 procs, or 48 if you use multithreading=true.
When I try to set multithreading=true and also set either procs or numprocs, it results in:
ERROR: LoadError: AssertionError: procs == nothing && numprocs in [0, nothing]
https://github.com/MilesCranmer/SymbolicRegression.jl/blob/d4458bd8137105c8210fa8a65dfb0bdcf91bd461/src/SymbolicRegression.jl#L173
Doesn't that mean that if multithreading is enabled, then procs must be nothing and numprocs must be 0 or nothing? How do I set procs to 24 or 48? I must be reading that wrong or using the wrong approach.
I have a smaller dataset now I'm testing with. It has 6 symbols and 1 output and 39,168 data points.
http://sirisian.com/randomfiles/symbolicregression/inputX2.bin http://sirisian.com/randomfiles/symbolicregression/outputY2.bin
Hi @sirisian,
Yes, sorry, the error message should be clearer. Distributed (procs) and multithreaded modes cannot be used at the same time. The number of threads is determined when the Julia binary is launched (e.g., julia --threads=24 ... would set up 24 threads), so you cannot set it when calling EquationSearch (I would if I could, but it's not possible yet). Further, you cannot pass procs, since that is for distributed (not multithreaded) mode.
40k points is actually still a bit large. Can you try 1k? You should only move to the larger dataset if it starts to overfit to the 1k points (which would presumably be hard to do). 6 symbols is great!
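One way to cut a dataset down to ~1k points while keeping the class balance is a stratified random subsample. This is a sketch with toy data standing in for the real inputX2/outputY2 matrices:

```julia
using Random

Random.seed!(0)

# Toy stand-ins for the real 6 x 39,168 dataset.
X = rand(Float32, 6, 39_168)
y = Float32.(rand(0:1, 39_168))

# Keep `k` random columns per class so the subsample preserves the balance.
function stratified_sample(X, y, k)
    idx = reduce(vcat, [shuffle(findall(==(c), y))[1:k] for c in sort(unique(y))])
    return X[:, idx], y[idx]
end

Xs, ys = stratified_sample(X, y, 500)
println(size(Xs))              # (6, 1000)
println(count(==(1.0f0), ys))  # 500
```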
Cheers, Miles
Okay, I've lowered it to 480 points, with 334 outputs that are 1 and 146 that are -1.
http://sirisian.com/randomfiles/symbolicregression/inputX3.bin
http://sirisian.com/randomfiles/symbolicregression/outputY3.bin
using SymbolicUtils
using Serialization
using SymbolicRegression
inputX = deserialize("inputX3.bin")
outputY = deserialize("outputY3.bin")
#display(view(inputX, :, 1:10))
#println()
#display(view(outputY, 1:50))
#display(typeof(outputY))
options = SymbolicRegression.Options(
    binary_operators=(+, *, /, -, ^),
    unary_operators=(sqrt,),
    constraints=((^) => (-1, 3),),
    loss=L1HingeLoss(),
    npopulations=20,
    progress=true
)
hallOfFame = EquationSearch(inputX, outputY, niterations=1000, options=options, multithreading=true)
dominating = calculateParetoFrontier(inputX, outputY, hallOfFame, options)
eqn = node_to_symbolic(dominating[end].tree, options)
println(simplify(eqn))
for member in dominating
    size = countNodes(member.tree)
    score = member.score
    string = stringTree(member.tree, options)
    println("$(size)\t$(score)\t$(string)")
end
Running with julia --threads=16 problem.jl
Using ZeroOneLoss() I see basically:
Complexity Loss Score Equation
1 0.000e+00 Inf 0.0
With L1HingeLoss() I see basically:
Complexity Loss Score Equation
1 6.083e-01 3.305e-01 1.0000001
I think I'm fundamentally not understanding how this should be working. With the equation 0.0 and ZeroOneLoss, every data point would need to be >= 0 to score zero. My data is filled with 1 and -1, though, so shouldn't the loss be exactly 146 with that equation, since that's how many data points are -1?
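One possible contributor to the zero-loss printout, assuming ZeroOneLoss follows the margin-based definition used by LossFunctions.jl (an error is counted only when prediction * target is strictly negative): a constant prediction of exactly 0.0 has zero margin against every target, so it is never counted as wrong. This is a hand-written sketch of that definition, not the library's actual code:

```julia
# Margin-based 0-1 loss as (assumed here) LossFunctions.jl defines it:
# an "agreement" of prediction * target < 0 counts as one error.
zero_one(p, t) = p * t < 0 ? 1.0 : 0.0

# With a constant prediction of exactly 0.0, the agreement is always 0,
# which is not < 0 -- so every point scores 0 loss, even targets of -1.
println(zero_one(0.0, -1.0))  # 0.0, not 1.0
println(zero_one(0.0, 1.0))   # 0.0

# A positive prediction against a -1 target does get penalized:
println(zero_one(0.5, -1.0))  # 1.0
```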
Hey @sirisian,
Sorry, I forgot about this thread. The issue you were seeing should be fixed. There was an internal loss normalized by the standard deviation of the y-data, and that is what used to be printed out. Now the internal loss is re-scaled before printing, so you shouldn't see this anymore.
Cheers, Miles