Update random allocation in DataContainer allocate

Open hrobarts opened this issue 11 months ago • 0 comments

Description

We want to move away from using numpy.random.random(), especially in DataContainer.allocate('random'), because it always generates the array as float64.
There are alternative random number generator methods in numpy (numpy.random.Generator objects, created with numpy.random.default_rng) which would allow us to sample the generator directly in the data type we want, e.g. float32. However, these methods don’t allow you to set a global seed.
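To make the difference concrete, here is a minimal sketch comparing the legacy call with the Generator API. It only uses NumPy; the shape and seed value are arbitrary choices for illustration.

```python
import numpy as np

shape = (4, 3)

# Legacy API: draws from NumPy's global state and always returns float64.
np.random.seed(5)
legacy = np.random.random(shape)

# Generator API: the seed belongs to this Generator object, and the
# output dtype can be requested directly (float32 or float64).
rng = np.random.default_rng(5)
modern = rng.random(shape, dtype=np.float32)

# Reseeding the global state has no effect on an existing Generator:
# it simply continues its own stream.
np.random.seed(5)
after_global_reseed = rng.random(shape, dtype=np.float32)

print(legacy.dtype, modern.dtype)
```

Because the Generator ignores np.random.seed, any reproducibility has to come from the seed passed to default_rng, which is what motivates the two options below.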

We discussed two options:

1. Create a CIL global parameter that stores the seed value, which would try to replicate the existing behaviour with the new random generator method. Current behaviour: you set a seed value globally and numpy advances the generator each time it is used. For example, if we set a seed and then run ig.allocate('random') twice, the two arrays will be different, but if you reset the seed to the same value between the two calls to allocate, the arrays will be the same.
- Pros: replicates existing behaviour
- Cons: keeping track of the seed increment internally will be complicated. If we have a global seed, users might expect it to be used everywhere a random seed could be used. It is also potentially confusing because the seed is consumed internally in some places, so behaviour could be unexpected, e.g.

global_seed.set_seed(5)
data1 = ig.allocate('random')
data2 = ig.allocate('random')

will give a different output to

global_seed.set_seed(5)
data1 = ig.allocate('random')
operator.norm() # any function that calls random unexpectedly
data2 = ig.allocate('random')

This is the behaviour that the generators try to avoid by not having a global seed, so it's probably not best practice to implement it ourselves.
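A rough sketch of how option 1 could behave, assuming a single shared Generator behind a hypothetical GlobalSeed object. None of these names exist in CIL; allocate_random and norm_like_call are stand-ins for ig.allocate('random') and for any function that draws random numbers internally.

```python
import numpy as np


class GlobalSeed:
    """Hypothetical module-level seed holder; not an existing CIL API."""

    def __init__(self):
        self._rng = np.random.default_rng()

    def set_seed(self, seed):
        # Resetting the seed replaces the shared generator.
        self._rng = np.random.default_rng(seed)

    def random(self, shape, dtype=np.float32):
        # Every call advances the shared generator state.
        return self._rng.random(shape, dtype=dtype)


global_seed = GlobalSeed()


def allocate_random(shape):
    """Stand-in for ig.allocate('random')."""
    return global_seed.random(shape)


def norm_like_call():
    """Stand-in for any function that draws random numbers internally."""
    global_seed.random((8,))


global_seed.set_seed(5)
a1 = allocate_random((2, 2))
a2 = allocate_random((2, 2))

global_seed.set_seed(5)
b1 = allocate_random((2, 2))
norm_like_call()  # hidden draw shifts the shared stream
b2 = allocate_random((2, 2))

print(np.array_equal(a1, b1), np.array_equal(a2, b2))
```

In this sketch the first arrays match after reseeding, but the second ones do not, because the hidden draw moved the shared generator forward.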

2. Switch to using a random number generator but do not have a global seed. Wherever possible, we could pass a seed directly to the function. We think this is possible everywhere that uses DataContainer.allocate('random') (with this update to the power method: https://github.com/TomographicImaging/CIL/pull/1585/files). However, operator norm methods sometimes call the power method, and we don't want to allow a seed argument in norm because the result is sometimes cached.
- Pros: simpler, no confusing behaviour caused by global seed
- Cons: setting the numpy global seed will no longer fix the behaviour of algorithms that call norm. We could update our advice: if you want to use a fixed seed in an algorithm, calculate the norm yourself and pass it in with set_norm.
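As an illustration of option 2, here is a sketch under stated assumptions: allocate_random with a seed argument and the ToyOperator class are hypothetical stand-ins written for this issue, not the CIL implementation. The toy norm uses a simple power method on a small matrix.

```python
import numpy as np


def allocate_random(shape, seed=None, dtype=np.float32):
    """Hypothetical stand-in for allocate('random', seed=...)."""
    # A fresh Generator per call: the same seed always gives the same array.
    rng = np.random.default_rng(seed)
    return rng.random(shape, dtype=dtype)


class ToyOperator:
    """Toy matrix operator with a cached norm; not a CIL class."""

    def __init__(self, matrix):
        self.matrix = np.asarray(matrix, dtype=np.float64)
        self._norm = None

    def calculate_norm(self, seed=None, iterations=50):
        # Power method on A^T A, started from a seeded random vector.
        x = allocate_random(self.matrix.shape[1], seed=seed, dtype=np.float64)
        for _ in range(iterations):
            x = self.matrix.T @ (self.matrix @ x)
            x /= np.linalg.norm(x)
        return float(np.linalg.norm(self.matrix @ x))

    def set_norm(self, value):
        # Store a user-supplied norm so norm() does not recompute it.
        self._norm = float(value)

    def norm(self):
        # No seed argument here: the result is cached after the first call.
        if self._norm is None:
            self._norm = self.calculate_norm()
        return self._norm


op = ToyOperator([[3.0, 0.0], [0.0, 1.0]])
# Reproducible workflow: compute the norm with a fixed seed, then set it.
op.set_norm(op.calculate_norm(seed=5))
print(op.norm())
```

The point of the design is that norm() stays seed-free and cacheable, while anyone who needs reproducibility does the seeded calculation explicitly and stores it with set_norm.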

We also discussed whether, if we use the global seed, it should be updated elsewhere in CIL. These are known places that use random:
- Sampler
- Partitioner
- LSVRG and SVRG
- Operator dot test method
- Dataexample scikit random noise
- Noise
- Lots of tests

Environment

import cil, sys
print(cil.version.version, cil.version.commit_hash, sys.version, sys.platform)

Jan 09 '25 14:01 hrobarts