Tyler/mode/batchsolver2
@prashkh I took the changes from your branch (minus all of the black formatting) and put them into this branch.
If you'd like, I can edit your branch to put it in this state so you can wrap it up?
btw the basic idea was:
- I checked out a new branch off of prash/mode/batchsolver2
- I rebased this branch against pre/2.7 and "dropped" the 2nd commit (the one with the changes and also the black formatting)
- then I did git checkout prash/mode/batchsolver2 -- file1 file2 ... for all of the files related to the mode solver that you changed
- committed this and pushed it here
you could either try something similar on your branch (roughly the sequence sketched below) or I could try to do it manually for you. hopefully that makes sense, and let me know how you want to proceed
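For reference, the rough command sequence (branch names as above; the file list is just a placeholder for the mode solver files you changed):
git checkout -b tyler/mode/batchsolver2 prash/mode/batchsolver2
git rebase -i pre/2.7        # drop the 2nd commit (the one mixing real changes with black formatting)
git checkout prash/mode/batchsolver2 -- file1 file2 ...
git commit -m "port mode solver batch changes"
git push origin tyler/mode/batchsolver2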
(and no worries about this, I've done it many times). Usually before I run black, I make sure I have all of my actual changes committed; that way, if this ends up happening, I can simply drop the black commit.
Thanks for explaining the detailed process and taking care of this. Next time it will be a lot smoother :)
No problem. I think I forgot to check out the CHANGELOG. Should be all set now. Can you do one more check to make sure I got everything? Then maybe we can simply merge this one.
Hey @prashkh, I noticed the mode solver test is failing on this branch. Any idea why?
It seems like it is failing in this step. Not sure why, since this works fine on my local machine.
assert all(isinstance(x, ModeSolverData) for x in results)
E assert False
E + where False = all(<generator object test_mode_solver_web_run_batch.
Could it be that we need to do import ...web as msweb so it references the local web.py file, rather than doing import tidy3d.plugins.mode.web as msweb in the test file?
That's weird. It also passed for me, but there are 125 warnings, so not sure about that. Let's see if it passes this time? Hopefully it doesn't indicate some issue.
I wonder if something is wrong with the as_completed behavior in that it doesn't actually wait for the result on some systems? Just a random guess... It seems like the result is not yet returned, as the print statement right above the failing assert shows NoneType for all of the results. Maybe worth a try to see if an explicit .result() fixes this?
In the test_mode_solver_web_run_batch function we set verbose=False in the run_batch function so the as_completed behavior used for the progress bar is not even triggered. Also, I don't see any error coming from the run function. Maybe the ThreadPoolExecutor for multithreading does not behave well for different execution environments during testing?
Another idea is to try to do this same thing with joblib. I can implement this today.
@prashkh One other thing I noticed is that the default folder handling here is probably not ideal. It automatically creates a directory and separate folder on the web UI when running batch. I'd say maybe it's better to leave the defaults? just my preference / impression though, maybe some users would like this.
I think if as_completed doesn't get triggered then there's nothing to make the threads wait for the result before the assert statement runs? Surprising that this test would then not fail for you locally, but there could be some differences I don't understand. But yeah, either joblib or making sure that you explicitly wait for the result (through as_completed or in some other way) should be good to try.
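To illustrate the two waiting strategies being discussed, here is a minimal generic sketch (not code from this PR; function and variable names are made up):
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(jobs, max_workers=4):
    # Generic sketch: submit callables and explicitly block on each result,
    # so nothing is still pending (or None) when the function returns.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(job) for job in jobs]
        # Alternatively, iterating as_completed(futures) also waits for every
        # future and is handy for driving a progress bar as results arrive.
        return [future.result() for future in futures]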
@momchil-flex Converted to joblib and used 'threading' as the backend. It works fine locally, so let's see if this fixes it. @tylerflex I also left the save directory at the default as per your suggestion. Could you please help me check whether this looks reasonable and whether it passes the tests now during merging? Only this function and the import need to be changed.
from joblib import Parallel, delayed
def run_batch(
    mode_solvers: List[ModeSolver],
    task_name: str = "BatchModeSolver",
    folder_name: str = "BatchModeSolvers",
    results_files: List[str] = None,
    verbose: bool = True,
    max_workers: int = DEFAULT_NUM_WORKERS,
    max_retries: int = DEFAULT_MAX_RETRIES,
    retry_delay: float = DEFAULT_RETRY_DELAY,
    progress_callback_upload: Callable[[float], None] = None,
    progress_callback_download: Callable[[float], None] = None,
) -> List[ModeSolverData]:
    """
    Submits a batch of ModeSolver objects to the server concurrently, manages progress, and retrieves results.

    Parameters
    ----------
    mode_solvers : List[ModeSolver]
        List of mode solvers to be submitted to the server.
    task_name : str
        Base name for tasks. Each task in the batch will have a unique index appended to this base name.
    folder_name : str
        Name of the folder where tasks are stored on the server's web UI.
    results_files : List[str], optional
        List of file paths where the results for each ModeSolver should be downloaded. If None, a default path based on the folder name and index is used.
    verbose : bool
        If True, displays a progress bar. If False, runs silently.
    max_workers : int
        Maximum number of concurrent workers to use for processing the batch of simulations.
    max_retries : int
        Maximum number of retries for each simulation in case of failure before giving up.
    retry_delay : float
        Delay in seconds between retries when a simulation fails.
    progress_callback_upload : Callable[[float], None], optional
        Optional callback function called when uploading a file, with ``bytes_in_chunk`` as argument.
    progress_callback_download : Callable[[float], None], optional
        Optional callback function called when downloading a file, with ``bytes_in_chunk`` as argument.

    Returns
    -------
    List[ModeSolverData]
        A list of ModeSolverData objects containing the results from each simulation in the batch. ``None`` is placed in the list for simulations that fail after all retries.
    """
    console = get_logging_console()

    if not all(isinstance(x, ModeSolver) for x in mode_solvers):
        console.log(
            "Validation Error: All items in `mode_solvers` must be instances of `ModeSolver`."
        )
        return []

    num_mode_solvers = len(mode_solvers)

    if results_files is None:
        results_files = [f"mode_solver_batch_results_{i}.hdf5" for i in range(num_mode_solvers)]

    def handle_mode_solver(index, progress, pbar):
        """Run a single mode solver, retrying on failure up to max_retries times."""
        retries = 0
        while retries <= max_retries:
            try:
                result = run(
                    mode_solver=mode_solvers[index],
                    task_name=f"{task_name}_{index}",
                    mode_solver_name=f"mode_solver_batch_{index}",
                    folder_name=folder_name,
                    results_file=results_files[index],
                    verbose=False,
                    progress_callback_upload=progress_callback_upload,
                    progress_callback_download=progress_callback_download,
                )
                progress.update(pbar, advance=1)
                return result
            except Exception as e:
                console.log(f"Error in mode solver {index}: {str(e)}")
                if retries < max_retries:
                    time.sleep(retry_delay)
                    retries += 1
                else:
                    console.log(f"The {index}-th mode solver failed after {max_retries} tries.")
                    progress.update(pbar, advance=1)
                    return None

    if verbose:
        console.log(f"[cyan]Running a batch of [deep_pink4]{num_mode_solvers} mode solvers.\n")
        with Progress(console=console) as progress:
            pbar = progress.add_task("Status:", total=num_mode_solvers)
            results = Parallel(n_jobs=max_workers, backend='threading')(
                delayed(handle_mode_solver)(i, progress, pbar) for i in range(num_mode_solvers)
            )
            # Make sure the progress bar is complete
            progress.update(pbar, completed=num_mode_solvers, refresh=True)
            console.log("[green]A batch of `ModeSolver` tasks completed successfully!")
    else:
        results = Parallel(n_jobs=max_workers, backend='threading')(
            delayed(handle_mode_solver)(i, None, None) for i in range(num_mode_solvers)
        )

    return results
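For context, calling it might look roughly like this (the ModeSolver objects are assumed to be constructed elsewhere; the names and parameter values are illustrative):
# Hypothetical usage sketch: mode_solver_a / mode_solver_b are ModeSolver objects built elsewhere.
results = run_batch(
    mode_solvers=[mode_solver_a, mode_solver_b],
    task_name="wg_sweep",
    max_workers=2,
    verbose=True,
)
# Failed runs come back as None after all retries, so filter before post-processing.
data = [res for res in results if res is not None]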
Thanks @prashkh , I implemented the changes in https://github.com/flexcompute/tidy3d/pull/1685/commits/7b7384abe0a020957be33315a87475029b4bb33b. Unfortunately I get the same error about None when running locally.
Feel free to check out this branch tyler/mode/batchsolver2 and add whatever commits you need.
@tylerflex I found the problem. When verbose=False, we don't need to update the pbar since it is None :). Committed this change after checking that it runs fine locally. Could you please try to merge and see if it passes? Thank you!
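For context, the fix described here presumably amounts to guarding the progress-bar updates inside handle_mode_solver, something like this (a sketch, not necessarily the exact committed diff):
# only touch the progress bar if one was actually created (i.e. verbose=True)
if progress is not None:
    progress.update(pbar, advance=1)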
Thanks @prashkh ! Yeah, these rich errors can be a bit hard to debug.
I rebased this branch against the latest pre/2.7, updated the changelog, ran black and ruff, and then squashed all of the commits into one. Should be ready to merge once all of the tests pass, so I'll check in later.
@prashkh Shoot, looks like it's still failing. Could you look into it?
Important note: I force pushed to this branch, so to get the changes you'd need to
git checkout tyler/mode/batchsolver2
git reset --hard origin/tyler/mode/batchsolver2
to reset your local copy to the one I pushed here.
@tylerflex I reset the local copy and ran it locally, and it works fine. I think the problem is that when running pytest it is asking for API access to run the mode solver. How does this work for pytest?
Error in mode solver 0: API key not found. To get your API key,
sign into 'https://tidy3d.simulation.cloud/' and copy it from your
'Account' page. Then you can configure tidy3d through command line
'tidy3d configure' and enter your API key when prompted.
Alternatively, especially if using windows, you can manually create
the configuration file by creating a file at their home directory
'~/.tidy3d/config' (unix) or '.tidy3d/config' (windows) containing
the following line: apikey = 'XXX'. Here XXX is your API key copied
from your account page within quotes.
@dbochkov-flexcompute @lucas-flexcompute
Sorry to bother you, but we're trying to get this mode solver run_batch functionality in and the web tests aren't working because the web API mocking is not kicking in for the batch mode solvers.
I spent a good amount of time trying to fix it, but I think there's something I'm not understanding. It looks like both of you worked on these mocked tests for the regular mode solver run. Do you mind taking a look and modifying them to work with the batch mode solver if you get a chance?
https://github.com/flexcompute/tidy3d/pull/1685/commits/18a26d1c177c25fbacf032187a5ab7c773554369
I've added the missing response to the tests, but I'm not sure whether you'll want to keep it separate from the existing one or change the run_batch parameters to match the old one (if that is even possible).
I also don't see an explicit test function for the run function under test_mode_solver? If the mode solver run function works, then the only thing different about run_batch is that it uses joblib to call the run function in parallel. run_batch is quite simple, and in my opinion it doesn't make sense to complicate it just so the test works with the mock API. Wondering if it is better to actually test the web API rather than the mock API in this case.
thanks @lucas-flexcompute !
@prashkh there are a set of tests for run and run_local, I think. See here for example.
run_batch is quite simple, and in my opinion it doesn't make sense to complicate it just so the test works with the mock API. Wondering if it is better to actually test the web API rather than the mock API in this case.
Unfortunately, in my experience, you never really know: e.g. if joblib upgrades and changes behavior slightly, things could still break. So while it's a bit of extra effort, mocking the web API at least helps us isolate an issue to either the run_batch internals or the web API. That being said, it would be nice if one of the notebooks used the new run_batch feature so we can test it with the real web API from time to time.
Another consideration is testing speed: even with mock responses and data for everything, these tests take almost 10 minutes to finish. We also need to run them probably a dozen times a day, pretty much whenever a PR changes. If we ran the real web API, it would probably take hours to finish and use a ton of credits.
That makes sense! So, if test_mode_solver_simple is testing the run function, could joblib parallelization that just calls this function be enough to test run_batch? I don't quite understand how mock_remote_api inside test_mode_solver_simple(mock_remote_api, local) is related to msweb.run() to write this test.
That makes sense! So, if test_mode_solver_simple is testing the run function, could joblib parallelization that just calls this function be enough to test run_batch?
So test_mode_solver_simple is also using the mock web API to test the run function. Maybe I wasn't super clear about that. It's mostly testing that the mode solver Python front-end interface logic works as expected given some inputs and outputs from the web API call.
In principle it would be enough to just test the joblib parallelization! Except that we need to ensure we're calling joblib in the exact same way as the source code does. In the end it just makes sense to test run_batch for consistency and automation.
I don't quite understand how mock_remote_api inside test_mode_solver_simple(mock_remote_api, local) is related to msweb.run() to write this test.
So when we call the web API, we basically send a dictionary to some URL and get a dictionary back from the server. The web API mocking basically interrupts this process and so whenever you send a specific dictionary, it just immediately returns some "mock" data without actually making the call. The mode solver mocking does this for all of the sub-steps of msweb.run() for example, upload, downloading, etc.
I think the problem was that the dictionary sent out by the batched mode solver had different fields from the single one (e.g. the folder name), so the mocking was getting confused. Basically we couldn't use the same mocking function for both the single and batch mode solvers. This is what I was trying to figure out, unsuccessfully, yesterday.
Basically you can think of the two mocks as the same, but they're matching on the different data sent out by the batched mode solver and the regular one. Does that clear it up?
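As a generic illustration of the idea (not the actual tidy3d test fixture; the URL, payload field, and canned responses below are made up):
from unittest import mock

import requests

def fake_post(url, json=None, **kwargs):
    """Stand-in for the HTTP call: return canned data keyed off the request payload."""
    if json and json.get("folderName") == "BatchModeSolvers":
        return {"taskId": "fake-batch-task"}  # canned response matched to the batched request
    return {"taskId": "fake-single-task"}  # canned response matched to the single-solver request

# While patched, anything that goes through requests.post gets the canned data above
# instead of reaching the real server (the real tests patch a different target).
with mock.patch.object(requests, "post", side_effect=fake_post):
    ...  # call run() / run_batch() here; no web traffic happens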
@prashkh I'm going to try to get this feature in soon. Could you take one last look over the files changed and verify it's looking ok and then assuming all tests pass, I'll merge it.
Thanks for the explanation of the mock API. It makes sense! Also, it seems like the tests passed! Let me check it one more time locally and then we are all good :)
@tylerflex I just tested it locally and the code looks good. Seems like all checks passed as well, so all good now. Thank you!
Awesome, congrats on your first contribution! 🎉 Sorry it was a little rocky with the testing there.
Thanks for all the help!