Tyler/mode/batchsolver2
@prashkh I took the changes from your branch (minus all of the black formatting) and put them into this branch.
If you'd like, I can edit your branch to put it in this state so you can wrap it up?
btw the basic idea was:
- I checked out a new branch off of prash/mode/batchsolver2
- I rebased this branch against pre/2.7 and "dropped" the 2nd commit (the one with the changes and also the black formatting)
- then I did git checkout prash/mode/batchsolver2 -- file1 file2 ... for all of the files related to the mode solver that you changed
- committed this and pushed it here
you could either try something similar on your branch (roughly the sequence sketched below) or I could try to do it manually for you. hopefully that makes sense, and let me know how you want to proceed
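For reference, the rough command sequence (branch names as above; the file list is just a placeholder for the mode solver files you changed):
git checkout -b tyler/mode/batchsolver2 prash/mode/batchsolver2
git rebase -i pre/2.7        # drop the 2nd commit (the one mixing real changes with black formatting)
git checkout prash/mode/batchsolver2 -- file1 file2 ...
git commit -m "port mode solver batch changes"
git push origin tyler/mode/batchsolver2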
(and no worries about this, I've done it many times). Usually before I run black, I make sure I have all of my actual changes committed; that way, if this ends up happening, I can simply drop the black commit.
Thanks for explaining the detailed process and taking care of this. Next time it will be a lot smoother :)
No problem. I think I forgot to check out the CHANGELOG. Should be all set now. Can you do one more check to make sure I got everything? Then maybe we can simply merge this one.
Hey @prashkh, I noticed the mode solver test is failing on this branch. Any idea why?
It seems like it is failing in this step. Not sure why, since this works fine on my local machine.
assert all(isinstance(x, ModeSolverData) for x in results)
E assert False
E + where False = all(<generator object test_mode_solver_web_run_batch.
Could it be that we need to do import ...web as msweb so it references the local web.py file, rather than doing import tidy3d.plugins.mode.web as msweb in the test file?
That's weird. It also passed for me, but there are 125 warnings, so not sure about that. Let's see if it passes this time? Hopefully it doesn't indicate some issue.
I wonder if something is wrong with the as_completed behavior in that it doesn't actually wait for the result on some systems? Just a random guess... It seems like the result is not yet returned, as the print statement right above the failing assert shows NoneType for all of the results. Maybe worth a try to see if an explicit .result() fixes this?
In the test_mode_solver_web_run_batch function we set verbose=False in the run_batch function so the as_completed behavior used for the progress bar is not even triggered. Also, I don't see any error coming from the run function. Maybe the ThreadPoolExecutor for multithreading does not behave well for different execution environments during testing?
Another idea is to try to do this same thing with joblib. I can implement this today.
@prashkh One other thing I noticed is that the default folder handling here is probably not ideal. It automatically creates a directory and separate folder on the web UI when running batch. I'd say maybe it's better to leave the defaults? just my preference / impression though, maybe some users would like this.
I think if as_completed doesn't get triggered then there's nothing to make the threads wait for the result before the assert statement runs? Surprising that this test would then not fail for you locally, but there could be some differences I don't understand. But yeah, either joblib or making sure that you explicitly wait for the result (through as_completed or in some other way) should be good to try.
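To illustrate the two waiting strategies being discussed, here is a minimal generic sketch (not code from this PR; function and variable names are made up):
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(jobs, max_workers=4):
    # Generic sketch: submit callables and explicitly block on each result,
    # so nothing is still pending (or None) when the function returns.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(job) for job in jobs]
        # Alternatively, iterating as_completed(futures) also waits for every
        # future and is handy for driving a progress bar as results arrive.
        return [future.result() for future in futures]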
@momchil-flex Converted to joblib and used 'threading' as the backend. It works fine locally, so let's see if this fixes it. @tylerflex I also left the save directory at the default as per your suggestion. Could you please help me check whether this looks reasonable and whether it passes the tests now during merging? Only this function and the import need to be changed.
from joblib import Parallel, delayed
def run_batch(
    mode_solvers: List[ModeSolver],
    task_name: str = "BatchModeSolver",
    folder_name: str = "BatchModeSolvers",
    results_files: List[str] = None,
    verbose: bool = True,
    max_workers: int = DEFAULT_NUM_WORKERS,
    max_retries: int = DEFAULT_MAX_RETRIES,
    retry_delay: float = DEFAULT_RETRY_DELAY,
    progress_callback_upload: Callable[[float], None] = None,
    progress_callback_download: Callable[[float], None] = None,
) -> List[ModeSolverData]:
    """
    Submits a batch of ModeSolver objects to the server concurrently, manages progress, and retrieves results.

    Parameters
    ----------
    mode_solvers : List[ModeSolver]
        List of mode solvers to be submitted to the server.
    task_name : str
        Base name for tasks. Each task in the batch will have a unique index appended to this base name.
    folder_name : str
        Name of the folder where tasks are stored on the server's web UI.
    results_files : List[str], optional
        List of file paths where the results for each ModeSolver should be downloaded. If None, a default path based on the folder name and index is used.
    verbose : bool
        If True, displays a progress bar. If False, runs silently.
    max_workers : int
        Maximum number of concurrent workers to use for processing the batch of simulations.
    max_retries : int
        Maximum number of retries for each simulation in case of failure before giving up.
    retry_delay : float
        Delay in seconds between retries when a simulation fails.
    progress_callback_upload : Callable[[float], None], optional
        Optional callback function called when uploading a file, with ``bytes_in_chunk`` as argument.
    progress_callback_download : Callable[[float], None], optional
        Optional callback function called when downloading a file, with ``bytes_in_chunk`` as argument.

    Returns
    -------
    List[ModeSolverData]
        A list of ModeSolverData objects containing the results from each simulation in the batch. ``None`` is placed in the list for simulations that fail after all retries.
    """
    console = get_logging_console()

    if not all(isinstance(x, ModeSolver) for x in mode_solvers):
        console.log(
            "Validation Error: All items in `mode_solvers` must be instances of `ModeSolver`."
        )
        return []

    num_mode_solvers = len(mode_solvers)

    if results_files is None:
        results_files = [f"mode_solver_batch_results_{i}.hdf5" for i in range(num_mode_solvers)]

    def handle_mode_solver(index, progress, pbar):
        """Run a single mode solver, retrying on failure up to max_retries times."""
        retries = 0
        while retries <= max_retries:
            try:
                result = run(
                    mode_solver=mode_solvers[index],
                    task_name=f"{task_name}_{index}",
                    mode_solver_name=f"mode_solver_batch_{index}",
                    folder_name=folder_name,
                    results_file=results_files[index],
                    verbose=False,
                    progress_callback_upload=progress_callback_upload,
                    progress_callback_download=progress_callback_download,
                )
                progress.update(pbar, advance=1)
                return result
            except Exception as e:
                console.log(f"Error in mode solver {index}: {str(e)}")
                if retries < max_retries:
                    time.sleep(retry_delay)
                    retries += 1
                else:
                    console.log(f"The {index}-th mode solver failed after {max_retries} tries.")
                    progress.update(pbar, advance=1)
                    return None

    if verbose:
        console.log(f"[cyan]Running a batch of [deep_pink4]{num_mode_solvers} mode solvers.\n")
        with Progress(console=console) as progress:
            pbar = progress.add_task("Status:", total=num_mode_solvers)
            results = Parallel(n_jobs=max_workers, backend='threading')(
                delayed(handle_mode_solver)(i, progress, pbar) for i in range(num_mode_solvers)
            )
            # Make sure the progress bar is complete
            progress.update(pbar, completed=num_mode_solvers, refresh=True)
            console.log("[green]A batch of `ModeSolver` tasks completed successfully!")
    else:
        results = Parallel(n_jobs=max_workers, backend='threading')(
            delayed(handle_mode_solver)(i, None, None) for i in range(num_mode_solvers)
        )

    return results
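For context, calling it might look roughly like this (the ModeSolver objects are assumed to be constructed elsewhere; the names and parameter values are illustrative):
# Hypothetical usage sketch: mode_solver_a / mode_solver_b are ModeSolver objects built elsewhere.
results = run_batch(
    mode_solvers=[mode_solver_a, mode_solver_b],
    task_name="wg_sweep",
    max_workers=2,
    verbose=True,
)
# Failed runs come back as None after all retries, so filter before post-processing.
data = [res for res in results if res is not None]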
Thanks @prashkh , I implemented the changes in https://github.com/flexcompute/tidy3d/pull/1685/commits/7b7384abe0a020957be33315a87475029b4bb33b. Unfortunately I get the same error about None when running locally.
Feel free to check out this branch tyler/mode/batchsolver2 and add whatever commits you need.
@tylerflex I found the problem. When verbose=False, we don't need to update the pbar since it is None :). Committed this change after checking that it runs fine locally. Could you please try to merge and see if it passes? Thank you!
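For context, the fix described here presumably amounts to guarding the progress-bar updates inside handle_mode_solver, something like this (a sketch, not necessarily the exact committed diff):
# only touch the progress bar if one was actually created (i.e. verbose=True)
if progress is not None:
    progress.update(pbar, advance=1)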
Thanks @prashkh ! Yeah, these rich errors can be a bit hard to debug.
I rebased this branch against the latest pre/2.7, updated the changelog, ran black and ruff, and then squashed all of the commits into one. Should be ready to merge once all of the tests pass, so I'll check in later.
@prashkh Shoot, looks like it's still failing. Could you look into it?
Important note: I force pushed to this branch, so to get the changes you'd need to
git checkout tyler/mode/batchsolver2
git reset --hard origin/tyler/mode/batchsolver2
to reset your local copy to the one I pushed here.
@tylerflex I reset the local copy and ran it locally, and it works fine. I think the problem is that when running pytest it is asking for API access to run the mode solver. How does this work for pytest?
Error in mode solver 0: API key not found. To get your API key,
sign into 'https://tidy3d.simulation.cloud/' and copy it from your
'Account' page. Then you can configure tidy3d through command line
'tidy3d configure' and enter your API key when prompted.
Alternatively, especially if using windows, you can manually create
the configuration file by creating a file at their home directory
'~/.tidy3d/config' (unix) or '.tidy3d/config' (windows) containing
the following line: apikey = 'XXX'. Here XXX is your API key copied
from your account page within quotes.
@dbochkov-flexcompute @lucas-flexcompute
Sorry to bother you, but we're trying to get this mode solver run_batch functionality in and the web tests aren't working because the web API mocking is not kicking in for the batch mode solvers.
I spent a good amount of time trying to fix it, but I think there's something I'm not understanding. It looks like both of you worked on these mocked tests for the regular mode solver run. Do you mind taking a look and modifying them to work with the batch mode solver if you get a chance?
https://github.com/flexcompute/tidy3d/pull/1685/commits/18a26d1c177c25fbacf032187a5ab7c773554369
I've added the missing response to the tests, but I'm not sure whether you'll want to keep it separate from the existing one or change the run_batch parameters to match the old one (if that is even possible).
I also don't see an explicit test function for the run function under test_mode_solver? If the mode solver run function works, then the only thing different about run_batch is that it uses joblib to call the run function in parallel. run_batch is quite simple, and in my opinion it doesn't make sense to complicate it just so the test works with the mock API. Wondering if it is better to actually test the web API rather than the mock API in this case.
thanks @lucas-flexcompute !
@prashkh there are a set of tests for run and run_local, I think. See here for example.
run_batch is quite simple, and in my opinion it doesn't make sense to complicate it just so the test works with the mock API. Wondering if it is better to actually test the web API rather than the mock API in this case.
Unfortunately, in my experience, you never really know: e.g. if joblib upgrades and changes behavior slightly, things could still break. So while it's a bit of extra effort, mocking the web API at least helps us isolate an issue to either the run_batch internals or the web API. That being said, it would be nice if one of the notebooks used the new run_batch feature so we can test it with the real web API from time to time.
Another consideration is testing speed: even with mock responses and data for everything, these tests take almost 10 minutes to finish. We also need to run them probably a dozen times a day, pretty much whenever a PR changes. If we ran the real web API, it would probably take hours to finish and use a ton of credits.
That makes sense! So, if test_mode_solver_simple is testing the run function, could joblib parallelization that just calls this function be enough to test run_batch? I don't quite understand how mock_remote_api inside test_mode_solver_simple(mock_remote_api, local) is related to msweb.run() to write this test.
That makes sense! So, if test_mode_solver_simple is testing the run function, could joblib parallelization that just calls this function be enough to test run_batch?
So test_mode_solver_simple is also using the mock web API to test the run function. Maybe I wasn't super clear about that. It's mostly testing that the mode solver Python front-end interface logic works as expected given some inputs and outputs from the web API call.
In principle it would be enough to just test the joblib parallelization! Except that we need to ensure we're calling joblib in the exact same way as the source code does. In the end it just makes sense to test run_batch for consistency and automation.
I don't quite understand how mock_remote_api inside test_mode_solver_simple(mock_remote_api, local) is related to msweb.run() to write this test.
So when we call the web API, we basically send a dictionary to some URL and get a dictionary back from the server. The web API mocking basically interrupts this process and so whenever you send a specific dictionary, it just immediately returns some "mock" data without actually making the call. The mode solver mocking does this for all of the sub-steps of msweb.run() for example, upload, downloading, etc.
I think the problem was that the dictionary sent out by the batched mode solver had different fields from the single one (e.g. the folder name), so the mocking was getting confused. Basically we couldn't use the same mocking function for both the single and batch mode solvers. This is what I was trying to figure out, unsuccessfully, yesterday.
Basically you can think of the two mocks as the same, but they're matching on the different data sent out by the batched mode solver and the regular one. Does that clear it up?
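As a generic illustration of the idea (not the actual tidy3d test fixture; the URL, payload field, and canned responses below are made up):
from unittest import mock

import requests

def fake_post(url, json=None, **kwargs):
    """Stand-in for the HTTP call: return canned data keyed off the request payload."""
    if json and json.get("folderName") == "BatchModeSolvers":
        return {"taskId": "fake-batch-task"}  # canned response matched to the batched request
    return {"taskId": "fake-single-task"}  # canned response matched to the single-solver request

# While patched, anything that goes through requests.post gets the canned data above
# instead of reaching the real server (the real tests patch a different target).
with mock.patch.object(requests, "post", side_effect=fake_post):
    ...  # call run() / run_batch() here; no web traffic happens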
@prashkh I'm going to try to get this feature in soon. Could you take one last look over the files changed and verify it's looking ok and then assuming all tests pass, I'll merge it.
Thanks for the explanation of the mock API. It makes sense! Also, it seems like the tests passed! Let me check it one more time locally and then we are all good :)
@tylerflex I just tested it locally and the code looks good. Seems like all checks passed as well, so all good now. Thank you!
Awesome, congrats on your first contribution! 🎉 Sorry it was a little rocky with the testing there.
Thanks for all the help!