
The first version of unit test for HistoQC

nanli-emory opened this issue 1 year ago · 25 comments

nanli-emory avatar Oct 18 '23 19:10 nanli-emory

Could be nice to have some coverage report though. A few previous bugs (e.g., the one to parse the image_work_size) occurred in certain branches that are rarely invoked by users - wonder if they could be caught within full-coverage tests.

CielAl avatar Oct 25 '23 03:10 CielAl

Could be nice to have some coverage report though. A few previous bugs (e.g., the one to parse the image_work_size) occurred in certain branches that are rarely invoked by users - wonder if they could be caught within full-coverage tests.

this is an interesting idea, i think it should be pretty easy to implement with the coverage or pytest-cov plugins. others' thoughts?

choosehappy avatar Oct 25 '23 08:10 choosehappy

i love that, and it should be expanded a bit into the main workflow

user should be able to specify a seed as a command line parameter which sets the seed for the entire system, enabling true reproducibility

seed should as well then be saved in the results.tsv file

thoughts? seems fairly straightforward?



@jacksonjacobs1 commented on this pull request.


In histoqc/tests/test_images_tsv_results.py (https://github.com/choosehappy/HistoQC/pull/266#discussion_r1371781298):

content2 = file2.read()

# comparing labels
for label in self.tsv_labels[:-1]:
    label_value1 = tu.parseLabel(label, content1)
    label_value2 = tu.parseLabel(label, content2)
    self.assertEqual(label_value1, label_value2)
    print(f'label {label} in tsv results comparison pass')

# comparing dataset
dataset_label = self.tsv_labels[-1]
dataset1 = tu.parseDataset(dataset_label, content1)
dataset2 = tu.parseDataset(dataset_label, content2)
for field_name in self.tsv_dataset_fields[:10]:

That's correct, there is an assertAlmostEqual() function. However, we don't want to use this broadly as other modules should be exactly reproducible.

One option would be to keep a list of modules which have random behavior, and apply assertAlmostEqual() only to the output from those modules (tsv and mask output).

However, I would argue that all modules with random behavior should also accept seeds to ensure that results are reproducible. The random seed could be passed as a configuration variable. What do you think about that?
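The option above could be sketched as follows. This is only an illustration, not code from the PR: the module list and the tolerance are assumptions, and `compare_value` is a hypothetical helper that the tsv-comparison test would call per field.

```python
# Sketch: exact comparison by default, approximate comparison only for
# modules assumed to have random behavior (this module list is hypothetical).
import unittest

RANDOM_MODULES = {'ClassificationModule', 'LocalTextureEstimationModule'}

class TsvComparisonMixin(unittest.TestCase):
    def runTest(self):
        # dummy test method so the class can be instantiated standalone
        pass

    def compare_value(self, module, value1, value2):
        if module in RANDOM_MODULES:
            # tolerate small floating-point jitter from random behavior
            self.assertAlmostEqual(value1, value2, places=5)
        else:
            # all other modules should be exactly reproducible
            self.assertEqual(value1, value2)
```

A test walking two results.tsv files would call `compare_value` for each field, keyed by the module that produced it.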


choosehappy avatar Oct 25 '23 13:10 choosehappy

Agree with @CielAl that a coverage report would be good to have. I'll open an issue to track this feature request.

Regarding the random seed, @choosehappy 's plan makes sense to me. Nan, how long would you expect it would take to add the random seed to this PR? Then your assert statements can be expanded to all output.

jacksonjacobs1 avatar Oct 25 '23 14:10 jacksonjacobs1

Almost there. Can the seed parameter be located at the top of the file to indicate it is applied globally (to all modules)?

jacksonjacobs1 avatar Nov 07 '23 15:11 jacksonjacobs1

Almost there. Can the seed parameter be located at the top of the file to indicate it is applied globally (to all modules)?

Do you mean at the top of the config files? If you want to make the seed parameter a global variable, let's put it into **BasicModule.getBasicStats** and treat the seed the same as image_work_size. Every module can read it from s['seed']. Any thoughts?

nanli-emory avatar Nov 07 '23 16:11 nanli-emory

Yes, that makes sense to me.

Can you also run pytest-cov and include the report in the readme?

jacksonjacobs1 avatar Nov 09 '23 17:11 jacksonjacobs1

Looks like there is no such thing as a global config parameter, since each module is only passed the parameters from its section: https://github.com/choosehappy/HistoQC/blob/3653ffb50a4a17c44fd5daacd357ddd4227e8884/histoqc/_pipeline.py#L365

def load_pipeline(config):
    """load functions and parameters from config

    Parameters
    ----------
    config : configparser.ConfigParser
    """
    steps = config.get(section='pipeline', option='steps').splitlines()

    process_queue = []
    for process in steps:
        mod_name, func_name = process.split('.')

        try:
            mod = import_module(f"histoqc.{mod_name}")
        except ImportError:
            raise NameError(f"Unknown module in pipeline from config file:\t {mod_name}")

        func_name = func_name.split(":")[0]  # take base of function name
        try:
            func = getattr(mod, func_name)
        except AttributeError:
            raise NameError(f"Unknown function from module in pipeline from config file:\t {mod_name}.{func_name}")

        if config.has_section(process):
            params = dict(config.items(section=process))
        else:
            params = {}

        process_queue.append((func, params))
    return process_queue

That's why image_work_size must be specified at the module level.

We can either keep the current setup and specify a seed separately in each section of the config, or add a new section to the config file which gets appended to every params dict during load_pipeline().

The config section would look something like this:

[global]
seed = 123
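The second option could look roughly like this. The helper name `merged_params` and the section names are illustrative, not part of the PR; the idea is just that `load_pipeline()` would overlay a `[global]` section onto each step's own section, with step-level values winning.

```python
# Sketch: merge a hypothetical [global] section into every step's params
# dict, so a single seed entry applies to all modules.
from configparser import ConfigParser

def merged_params(config, process):
    global_params = dict(config.items('global')) if config.has_section('global') else {}
    step_params = dict(config.items(process)) if config.has_section(process) else {}
    # step-level values override global ones
    return {**global_params, **step_params}

config = ConfigParser()
config.read_string("""
[global]
seed = 123
[ClassificationModule.byExampleWithFeatures]
threshold = 0.5
""")
params = merged_params(config, 'ClassificationModule.byExampleWithFeatures')
# params now contains both the global seed and the step's own threshold
```

Note that configparser already merges a special `[DEFAULT]` section into every section's items, which would be an alternative way to get the same effect without custom merging.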

@choosehappy any comments?

jacksonjacobs1 avatar Dec 04 '23 17:12 jacksonjacobs1

Looks like there is no such thing as a global config parameter, since each module is only passed the parameters from its section [...] @choosehappy any comments?

Hi @jacksonjacobs1, I think we can put the seed as global variable into BaseImage since every pipeline has to use it.

[BaseImage.BaseImage]
seed = 123

nanli-emory avatar Dec 04 '23 17:12 nanli-emory

Ah my mistake. Yes, that should work.

jacksonjacobs1 avatar Dec 04 '23 17:12 jacksonjacobs1

I guess my question is - why do we need to pass this variable around? Once the seed is set at application launch -- all subsequent modules should become deterministic, should they not? it's not clear to me why individual modules would need to know the seed - they should just be inheriting from the parent app? at least that's how i understood it works, but could be wrong?

or, is there another use case you folks are thinking about?

choosehappy avatar Dec 05 '23 13:12 choosehappy

I guess my question is - why do we need to pass this variable around? Once the seed is set at application launch -- all subsequent modules should become deterministic, should they not? its not clear to me why individual modules would need to know the seed - they should just be inheriting from the parent app? at least that's how i understood it works, but could be wrong?

or, is there another use case you folks are thinking about?

Got it. We can add an optional argument, -s / --seed, to the command line at application launch.
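A minimal sketch of that option, assuming argparse (the flag names come from this discussion; the surrounding parser is hypothetical):

```python
# Sketch of the proposed -s/--seed command-line option.
import argparse

parser = argparse.ArgumentParser(prog='histoqc')
parser.add_argument('-s', '--seed', type=int, default=None,
                    help='seed for random number generators; '
                         'None leaves the RNG state unseeded')

# parsing an explicit seed vs. relying on the default
args = parser.parse_args(['--seed', '123'])
```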

nanli-emory avatar Dec 05 '23 15:12 nanli-emory

individual modules would need to know the seed - they should just be inheriting from the parent app? at least that's how i understood it works, but could be wrong?

If I understand it correctly: certain libraries may implement/invoke their own RNG. In that case, if you want the parent app/process to take care of the random state, you probably need to centralize the import/seed-setting of all libraries across all modules, but that sort of defeats the purpose of making modules a somewhat "detachable" component that users can customize without modifying the core, e.g., BaseImage.

Therefore, the only difference here between doing it in the config or in BaseImage is which one semantically fulfills the role of carrying the seed. I would say that Jack's suggestion might be better, since you don't want an overly redundant BaseImage, and technically the program does compile the config first to construct the pipeline (BaseImage + params + shared states of multiprocessing), so it wouldn't be too troublesome to do it this way.

CielAl avatar Dec 05 '23 22:12 CielAl

certain libraries may implement/invoke their own RNG,

i think this is to be tested - i remember an issue with pytorch + numpy previously using different RNGs, but eventually they were merged. I'm unsure if we actually have different RNG approaches - almost all of the stuff within histoqc is going to be deterministic. i would think that we could use the tests to verify this: run it a couple of times with the same seed and with different seeds, and (a) note if/where the results are the same or different, (b) construct (and note down) a list of the modules which are non-deterministic so folks know

in the end, we should be using a default seed, with the option to change it on the command line, so that things are both "automatically" reproducible, with the option to add randomness at the user's discretion.

does that make sense?

choosehappy avatar Dec 06 '23 18:12 choosehappy

in the end, we should be using a default seed, with the option to change it on the command line, so that things are both "automatically" reproducible, with the option to add randomness at the users discretion.

Definitely not trying to dispute that, but I think it's worth stating that, since Modules are executed by worker processes, whether the RNG state is inherited may depend on how the subprocesses are created. If it is by fork (e.g., on UNIX-like systems) then the child process inherits the RNG state naturally, but that won't happen on Windows, which uses spawn.

Therefore, when you execute the pipeline in your worker function, the seed should be either in the config or in the shared_dict in order to set the seed in subprocesses. What I said in previous replies is that, since you need to let the subprocess/Module be aware of the seed, the seed is better placed in the config/shared_dict rather than in BaseImage, as the config/shared_dict are the more relevant arguments when you call the worker function.

I asked chatgpt to write a simple draft to prove this (lol):

# run it in Windows
import multiprocessing
import random
import os

def generate_random_number():
    # uncomment this line for comparison
    # random.seed(42)
    print(f"Process ID: {os.getpid()}, Random Number: {random.random()}")

def main():
    # Set random seed in the main process
    random.seed(42)
    print(f"Main Process ID: {os.getpid()}, Random Seed Set: 42")

    # Ensure the 'spawn' context is used for creating new processes
    multiprocessing.set_start_method('spawn')

    # Create and start multiple subprocesses
    processes = []
    for _ in range(4):
        p = multiprocessing.Process(target=generate_random_number)
        p.start()
        processes.append(p)

    # Wait for all subprocesses to finish
    for p in processes:
        p.join()

if __name__ == "__main__":
    main()

CielAl avatar Dec 07 '23 16:12 CielAl

PS: if you execute this on Windows without setting the random seed inside the worker function, you will most likely get a different printed random number in each subprocess (there is only a very small chance that subprocesses end up with the same seed without explicitly setting it).

CielAl avatar Dec 07 '23 16:12 CielAl

Definitely not trying to dispute that but I think it's worth stating that, since Modules are executed by worker processes, whether the RNG state is inherited may depend on how you created the subprocesses. [...]

Hi @CielAl, @choosehappy and @jacksonjacobs1, I added '-s' / '--seed' as an optional command-line argument. Please review ./histoqc/_worker.py and ./histoqc/main.py and provide some feedback. Thanks.

nanli-emory avatar Dec 07 '23 16:12 nanli-emory

thanks @CielAl , i agree.

i was as well able to confirm that the code provided produces different results in windows when using spawn, meaning special actions will have to be taken here

in line with what you said, i think this is the exact purpose of the mpm (multiprocessing.Manager()), created here:

https://github.com/choosehappy/HistoQC/blob/79943ccd8f07e9b9a75460866a1f43a10d616c34/histoqc/main.py#L69C11-L69C36

and that the shared_dict could hold the random seed, to be used to set each worker process internally to ensure reproducibility

@jacksonjacobs1 thoughts?

choosehappy avatar Dec 11 '23 10:12 choosehappy

the shared_dict could hold the random seed, to be used to set each worker process internally to ensure reproducibility

I agree that this seems to be the best solution. I think this is essentially what Nan implemented, no?

https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/main.py#L165 ^ set seed in _shared_state

https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/_worker.py#L40 ^ _shared_state passes the seed to the worker.

And the random seed is set at the command line.

Nan, did you get the chance to test this out?



jacksonjacobs1 avatar Dec 11 '23 15:12 jacksonjacobs1

I don't see where the seed actually gets set in the worker?

also, shouldn't it be set in the worker initializer?

https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/_worker.py#L12

would that carry it into the processes indefinitely? -- this actually wouldn't work: imagine you want something deterministic at the slide level. If a slide sits in spot 1 versus spot 10 and you set the seed once at time t=0, you'll be pulling different numbers by the time you reach spot 1 versus spot 10, so one would indeed want to set the seed before processing each individual image.
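A minimal illustration of that point (the function name and seed value are hypothetical): re-seeding immediately before each image makes its result independent of its position in the queue, whereas seeding once at startup would not.

```python
# Sketch: per-image seeding keeps results independent of slide order.
import random

def process_image(name, seed=123):
    # re-seed immediately before processing each image
    random.seed(seed)
    return random.random()

first = process_image("slide_01")
random.random()  # simulate draws consumed by earlier slides in the queue
later = process_image("slide_01")
# identical output regardless of how many images ran before this one
assert first == later
```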

choosehappy avatar Dec 11 '23 16:12 choosehappy

Okay, so we agree that the seed is being passed down far enough (each worker gets the same seed) but isn't actually set anywhere, correct?

I think the appropriate place to set the seed would then be here instead of passing it further into the modules: https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/_worker.py#L40

I would note though that doing so would only guarantee reproducibility if the pipeline is identical between runs. If modules are re-ordered in the pipeline, or new random calls are added to one module, then other modules may behave differently.

Is that acceptable?

jacksonjacobs1 avatar Dec 13 '23 22:12 jacksonjacobs1

Okay, so we agree that the seed is being passed down far enough (each worker gets the same seed) but isn't actually set anywhere, correct? [...]

I think how we expect reproducibility itself answers that: if we are dealing with newly published code from a piece of research, we expect to "reproduce" the results by setting the seed (assuming the same kind of hardware/software RNG, of course) and executing their exact scripts. It wouldn't be "reproducing" if you manually shuffled a few exchangeable procedures that involve random number generation. So my opinion is that it is fine.

As for where the seed goes: since shared_state in the current commit already has the seed, it is probably best to leave it there. If a user-defined Module has anything specific to do with seeds (e.g., a third-party library invoking its own RNG), this offers the flexibility. It is also coherent with how other modules handle multiprocessing-related information (e.g., context and locks).

CielAl avatar Dec 13 '23 23:12 CielAl

I think the appropriate place to set the seed would then be here instead of passing it further into the modules: https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/_worker.py#L40

do you mean putting it in between line 39 and line 40?

i would probably put it as the first line in the worker function, here on line 21: https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/_worker.py#L21

the reasoning is that ultimately it impacts the workings of the entire module, and if someone eventually writes code between lines 21 and 40 and doesn't realize that a seed is being set on line 40, you end up with unintended consequences. putting things which impact the entire function at the top makes code maintenance and long-term sustainability easier

I would note though that doing so would only guarantee reproducibility if the pipeline is identical between runs. If modules are re-ordered in the pipeline, or new random calls are added to one module, then other modules may behave differently.

yes, that is correct

one can imagine this in e.g., a clinical setting. they lock down the config file, and then the next 1m WSI will be subjected to that config over the year, but at different times (e.g., 1x per day run over all the WSI). in this case, the results should be directly comparable between runs -- this is the type of reproducibility I'm thinking of.

similarly, i think this concept trickles down into the unit tests, to ensure that the same results are produced all the time for verification purposes?

choosehappy avatar Dec 14 '23 08:12 choosehappy

Hi @choosehappy, @jacksonjacobs1 and @CielAl,

Currently, the seed is an optional command-line argument with a default value of None: https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/main.py#L57

As you see, I passed the seed into _shared_state https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/main.py#L165

and the worker gets the seed and passes it into BaseImage: https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/_worker.py#L19 https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/_worker.py#L40

In BaseImage, I save the seed into s["seed"] as a global variable: https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/BaseImage.py#L34

There are 2 modules - ClassificationModule and LocalTextureEstimationModule - that use the seed to set the RNG. I let each module handle setting the seed on its own: https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/ClassificationModule.py#L199 https://github.com/nanli-emory/HistoQC/blob/12dca26383b5b826e683027b91d80d880eaf7cee/histoqc/LocalTextureEstimationModule.py#L24

I have 3 questions: First, should we use a non-None seed as the default, such as seed = 123, which ensures the entire system has the same seed and guarantees reproducibility?

Second, should we give users the ability to turn RNG seeding on and off? For example, seeding is turned off if the user gives the seed as 'None' or 'NaN'.

Third, should we set the seed in the worker, or leave it for each module to handle on its own?

nanli-emory avatar Dec 14 '23 15:12 nanli-emory

I have 3 questions: First, should we give a not-None-value seed as the default [...] Third, should we set the seed in the worker or leave it for each module to handle on its own? [...]

Hi @nanli-emory, @choosehappy , and @jacksonjacobs1 ,

I think what Andrew mentioned above, besides where the seed goes, is that the seed should be set at the worker-initialization level. Here we have two scopes to take care of: (1) the parent app that manages and invokes the pipeline of modules, and (2) the modules themselves.

We can take pytorch's multiprocessing DataLoader as an example. There are two places where random states are involved. One is the creation of the base iterator (link above), where their own RNG (the torch.Generator class) is initialized (multiprocessing version here), which is plain. The second is in the worker_init_fn, which is recommended but not enforced (see here), mainly because users can optionally destroy each worker after a round of iteration (persistent_workers=False), and the previously set random state may therefore be lost.

Now if we go back to the HistoQC pipeline, we know three factors:

(1) As Andrew suggested, the seed is an argument set at the parent-app level, meaning that it should overwrite most other random-seed settings in both the parent app and the modules. Therefore it makes sense to simply set the random state using the seed in shared_dict in the worker function here, because the worker function is invoked when a worker is created and assigned a job - similar to how pytorch's DataLoader sets the seed for each worker's iterator in the multiprocess iterator subclass.

(2) The majority of the default modules use the numpy RNG (perhaps some use the Python built-in one), so it is feasible to implement a fix_seed function that sets the random states globally at the worker level for all RNGs known to be used by the default modules.
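A possible shape for such a helper (the name fix_seed comes from the point above; which RNGs the default modules actually use is an assumption here):

```python
# Sketch of a hypothetical fix_seed helper covering the RNGs assumed
# to be used by the default modules (stdlib random and, if present, numpy).
import random

def fix_seed(seed):
    if seed is None:
        return  # leave RNG state untouched when no seed is given
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # assumed: most default modules use numpy's RNG
    except ImportError:
        pass  # numpy-free environments still get the stdlib RNG seeded
```

Calling fix_seed once at the top of the worker function would then cover every default module in one place.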

(3) It is possible that future user-defined modules (or even our new modules) may use an RNG other than those (pytorch itself is an example). This means the worker-level seed fixing may not set the state for these modules, as we cannot anticipate which RNG they use. Since the seed is already passed to the modules in the shared_dict, modules can directly set their seeds from the shared_dict, similar to how they use locks ---- analogous to how pytorch lets users fix their seeds in their own worker_init_fn if they choose to destroy the workers, or if their dataset implementation uses a different RNG.

In short, it indeed seems reasonable to set the seeds at the worker level, meaning the parent app mostly controls the random states, since the seed is a parent-app-level argument rather than a module-level config entry (also the reason it should be in shared_dict rather than in the configparser). And if an individual user-defined module needs more control of the RNG (e.g., a third-party library with a different RNG), it can access the seed setting (e.g., whether None/system default or user-defined) from the shared_dict.

I personally think that while putting the seed as a global variable of BaseImage is functionally no different from leaving it in shared_dict, the former is probably not ideal: essentially you could put everything into BaseImage, since it is used by all modules --- but we definitely don't want that, given the coupling and redundancy it brings - so why not just leave the seed in shared_dict, as it's more worker-relevant and is already used to pass worker state to the modules.

CielAl avatar Dec 14 '23 16:12 CielAl

Revisiting this issue now.

I agree that the seed should be set in the _shared_state dict.

@nanli-emory Is there anything that still needs to be done here or is this branch ready to be merged in?

jacksonjacobs1 avatar May 01 '24 20:05 jacksonjacobs1

@nanli-emory Is there anything that still needs to be done here or is this branch ready to be merged in? [...]

It's done. Please merge it.

nanli-emory avatar May 03 '24 15:05 nanli-emory