QuasiMonteCarlo.jl icon indicating copy to clipboard operation
QuasiMonteCarlo.jl copied to clipboard

Better Sampling Interface

Open ParadaCarleton opened this issue 3 years ago • 15 comments

The sampling interface is a bit of a mess at the moment. The way users are asked to specify sample sizes is clunky and unnatural, demanding that they perform unusual calculations to avoid shooting themselves in the foot. Sobol nets, for example, only exist for powers of two, but users can request any sample size. At the moment we just give users these sample sizes by truncating the sequence, even though these truncated sequences are not guaranteed to converge to the correct answer, and will typically perform worse than Monte Carlo point sets. For Faure nets, the situation is safer (we error if the sample size is inappropriate) but even more annoying for users (they have to calculate multiples of powers of prime numbers).

For digital nets, a better approach is outlined here, and used by Art Owen in his QMC packages. Rather than request a sample size, users can provide parameters for a point set's stratification properties (m), how many independent replications they'd like (λ), and a base (b), before providing them λ*b^m points. I'd also suggest a feature where users can set an approximate sample size by using a keyword, then receive a net with the smallest possible net larger than the requested sample size.

This would be a breaking change for sure.

ParadaCarleton avatar Oct 21 '22 03:10 ParadaCarleton

This makes a lot of sense.

ChrisRackauckas avatar Oct 22 '22 05:10 ChrisRackauckas

About making the API evolve, I had few of questions/comments:

  • Could we change the name LowDiscrepancySample for HaltonSample? I find it very confusing, as Low Discrepency is a generic term used to define all QMC sequences.
  • Could we allow the method sample(n, d::Integer, SamplingAlgorithm()) for [0,1]^d sampling without arbitrary bounds. This is for people only working in [0,1]^d make more sense. Plus, randomization methods are defined in this interval.
  • I wondered if some sequence like Faure, Halton, Sobol could have an option to be produced as Rational numbers (since they are often exactly defined with fraction //). On performance, I am not sure of the result, but it would for example allow exact binary expansion in base b).
  • Following this post (and ref there), could we allow alternative Sobol sample? Like S0b0lSample (with zero instead of o to highlight hat the zero is in the sequence, contrary to the SobolSample). The name is probably a bad idea! I have a piece of code from Owen website doing that.

dmetivie avatar Nov 24 '22 14:11 dmetivie

About making the API evolve, I had few of questions/comments:

  • Could we change the name LowDiscrepancySample for HaltonSample? I find it very confusing, as Low Discrepency is a generic term used to define all QMC sequences.
  • Could we allow the method sample(n, d::Integer, SamplingAlgorithm()) for [0,1]^d sampling without arbitrary bounds. This is for people only working in [0,1]^d make more sense. Plus, randomization methods are defined in this interval.
  • I wondered if some sequence like Faure, Halton, Sobol could have an option to be produced as Rational numbers (since they are often exactly defined with fraction //). On performance, I am not sure of the result, but it would for example allow exact binary expansion in base b).
  • Following this post (and ref there), could we allow alternative Sobol sample? Like S0b0lSample (with zero instead of o to highlight hat the zero is in the sequence, contrary to the SobolSample). The name is probably a bad idea! I have a piece of code from Owen website doing that.

I agree with all of these, and I think they'd be great additions! I was planning on making PRs for most of these myself. I think the Sobol Sample should just be centered, though--every point should be placed in the middle of the appropriate box, rather than at the start. So the first two points would be 1/4 and 3/4 instead of 0 and 1/2, for example. This is what FaureSample currently does.

ParadaCarleton avatar Nov 24 '22 15:11 ParadaCarleton

Could we change the name LowDiscrepancySample for HaltonSample? I find it very confusing, as Low Discrepency is a generic term used to define all QMC sequences.

100% agreed. It's something one of the students did that I never ended up correcting. It's one of the reasons I wouldn't make a v1.0 on this library yet 😅.

Could we allow the method sample(n, d::Integer, SamplingAlgorithm()) for [0,1]^d sampling without arbitrary bounds. This is for people only working in [0,1]^d make more sense. Plus, randomization methods are defined in this interval.

Yes, though the sample(n, lb, ub, sampler) is still going to be the core algorithm that most people use. We can document sample(n, d::Integer, SamplingAlgorithm()) though, and it would serve as a nice abstraction to implement sample(n, lb, ub, sampler) generically just by rescaling the outputs of sample(n, d::Integer, SamplingAlgorithm()).

I wondered if some sequence like Faure, Halton, Sobol could have an option to be produced as Rational numbers (since they are often exactly defined with fraction //). On performance, I am not sure of the result, but it would for example allow exact binary expansion in base b).

The interface should be type-matching. If you use Rational for your lb and ub, then you should get Rational values for the points. This is one reason why a shorthand for sample(n, d::Integer, SamplingAlgorithm()) is scary: it's necessarily information dropping.

Following this https://github.com/stevengj/Sobol.jl/issues/31#issue-1328916662 (and ref there), could we allow alternative Sobol sample? Like S0b0lSample (with zero instead of o to highlight hat the zero is in the sequence, contrary to the SobolSample). The name is probably a bad idea!

Or just make it an option?

ChrisRackauckas avatar Nov 27 '22 01:11 ChrisRackauckas

Could we change the name LowDiscrepancySample for HaltonSample? I find it very confusing, as Low Discrepency is a generic term used to define all QMC sequences.

100% agreed. It's something one of the students did that I never ended up correcting. It's one of the reasons I wouldn't make a v1.0 on this library yet sweat_smile.

With regards to names, I actually think we should drop Sample from all the types; it's 6 extra letters but contains 0 new information.

Could we allow the method sample(n, d::Integer, SamplingAlgorithm()) for [0,1]^d sampling without arbitrary bounds. This is for people only working in [0,1]^d make more sense. Plus, randomization methods are defined in this interval.

Yes, though the sample(n, lb, ub, sampler) is still going to be the core algorithm that most people use. We can document sample(n, d::Integer, SamplingAlgorithm()) though, and it would serve as a nice abstraction to implement sample(n, lb, ub, sampler) generically just by rescaling the outputs of sample(n, d::Integer, SamplingAlgorithm()).

Yep, that makes sense! Although actually, it makes me think of a possibly-better interface. We might want users to specify what kind of region they want (e.g. Box(lb, ub), or Box(d, type=Float64) for the unit box), then map to said region if it's not a box; by default we can have a method that assumes the unit box.

I wondered if some sequence like Faure, Halton, Sobol could have an option to be produced as Rational numbers (since they are often exactly defined with fraction //). On performance, I am not sure of the result, but it would for example allow exact binary expansion in base b).

The interface should be type-matching. If you use Rational for your lb and ub, then you should get Rational values for the points. This is one reason why a shorthand for sample(n, d::Integer, SamplingAlgorithm()) is scary: it's necessarily information dropping.

I don't think that's a huge deal; we can add an optional type argument that defaults to Float64.

Following this stevengj/Sobol.jl#31 (comment) (and ref there), could we allow alternative Sobol sample? Like S0b0lSample (with zero instead of o to highlight hat the zero is in the sequence, contrary to the SobolSample). The name is probably a bad idea!

Or just make it an option?

I'm not sure we should include an option--there's really no reason to drop the initial 0 when we can use the centered Sobol samples instead, which have better discrepancy and don't have the problem of an initial 0. Including an option would just be a "shoot self in foot" argument.

ParadaCarleton avatar Nov 27 '22 02:11 ParadaCarleton

With regards to names, I actually think we should drop Sample from all the types; it's 6 extra letters but contains 0 new information.

It can cause namespacing issues when exported. And Sobol is already the module name of Sobol.jl

Box(d, type=Float64)

Passing DataTypes is almost always a bad idea because of how that plays with specialization heuristics. I mean, you can do it, but any function that isn't careful will lose inference.

Yep, that makes sense! Although actually, it makes me think of a possibly-better interface. We might want users to specify what kind of region they want (e.g. Box(lb, ub), or Box(d, type=Float64) for the unit box), then map to said region if it's not a box; by default we can have a method that assumes the unit box.

sample(n,region,sampler)?

ChrisRackauckas avatar Nov 27 '22 02:11 ChrisRackauckas

With regards to names, I actually think we should drop Sample from all the types; it's 6 extra letters but contains 0 new information.

It can cause namespacing issues when exported. And Sobol is already the module name of Sobol.jl

Hmm, maybe, but would that come up all that often? The majority of users probably don't have 2 separate packages for doing QMC. And using import QuasiMonteCarlo as QMC instead of using QuasiMonteCarlo isn't too much of a problem--QMC.Faure is still shorter than FaureSample.

As for Sobol, that shouldn't be a problem because we don't reexport it, right?

Box(d, type=Float64)

Passing DataTypes is almost always a bad idea because of how that plays with specialization heuristics. I mean, you can do it, but any function that isn't careful will lose inference.

Hmm, that's strange; I haven't had any problems with it before. Is Box{Float64}(d) any better?

Yep, that makes sense! Although actually, it makes me think of a possibly-better interface. We might want users to specify what kind of region they want (e.g. Box(lb, ub), or Box(d, type=Float64) for the unit box), then map to said region if it's not a box; by default we can have a method that assumes the unit box.

sample(n,region,sampler)?

Yep!

ParadaCarleton avatar Nov 27 '22 20:11 ParadaCarleton

Another suggestion--maybe it's better to use pointset or a similar name instead of sample? Partly this is so it doesn't conflict with sample, since not exporting sample is annoying for users. But mostly this is because I think sample gives the wrong impression that QMC is a drop-in replacement for Monte Carlo. Latin hypercube and Faure samples work like that, but not much else; Sobol samples in particular assume only the first few dimensions are relevant to your problem, so you have to choose your parametrization carefully or you'll get bad results (much worse than LHS).

ParadaCarleton avatar Nov 30 '22 19:11 ParadaCarleton

I agree pointset is a good name. I think what we want to do is update the interface and then claim it to be the v1.0 release, so now would be the time to break things if they need to break.

ChrisRackauckas avatar Dec 02 '22 14:12 ChrisRackauckas

@dcarrera Just wanted to ask about this because you opened an issue about it in Sobol.jl. Would you be interested in making a pull request implementing your suggested interface here in QMC.jl?

ParadaCarleton avatar Dec 15 '22 20:12 ParadaCarleton

@ParadaCarleton Hey! I'm not sure I feel qualified to implement this. As I noted in that other issue on Sobol.jl, I looked at the implementations scipy.stats.qmc.Sobol and Art Owen's R package. Superficially it didn't LOOK complicated, but I could not figure out what the code was doing in either case. I think I just don't understand the Sobol algorithm ell enough. I mentioned in that other issue that the R package is BSD licensed. So someone that understands it could translate it to Julia. I just had another look at the R code:

https://artowen.su.domains/code/rsobol.R

I don't know R but I can see that there are two ways to scrambe the points: A "nested uniform" ("nestu") and "Matousek's linear scramble" ("mato"). A minimal Julia version only needs one way to scramble the points. I had a look, but I don't really understand how they work.

dcarrera avatar Dec 16 '22 18:12 dcarrera

@ChrisRackauckas do you think you or someone else could make a draft PR to Surrogates.jl to swap to the "iterator over points" interface? Now that eachcol makes it easy to return an iterator over points, I'd like to switch over sooner rather than later to avoid breaking more code than we have to.

ParadaCarleton avatar Jul 21 '23 00:07 ParadaCarleton

Yes sorry JuliaCon delays. I talked with @sharanry about it at JuliaCon. @sharanry did my description of the issue make sense? I'd like to get this prioritized a bit because right now it's the only big major bound on any SciML library (Surrogates.jl still doesn't allow the latest major on QuasiMonteCarlo, 6 months later) and it's somewhat a ticking timebomb.

ChrisRackauckas avatar Aug 08 '23 23:08 ChrisRackauckas

I meant @thazhemadam

ChrisRackauckas avatar Aug 09 '23 00:08 ChrisRackauckas

If @thazhemadam wants to talk on Slack about this I’d be happy to.

ParadaCarleton avatar Aug 14 '23 18:08 ParadaCarleton