OpenWPM
OpenWPM copied to clipboard
Refactor how browsers handle failures and restarts
The Browser.restart_browser_manager(),Browser.reset(), and Browser.launch_browser_manager() methods are needlessly complex and confusing. This had led to a multitude of bugs, particularly around profile handling (e.g. the simple one in 6e384db5a06ecad4cf1b3aa2369b519a757026cb). This section of the infrastructure needs to be refactored in the following way:
- Have a single pipeline for restarts, which handles restarts triggered from timeouts, excessive memory usage, crashes, and restarts for stateless crawls.
- Infer whether or not the profile should be cleared by the flag on the command. No part of the infrastructure should be hard-coding this.
- Make the logic of
launch_browser_manager()clear. It should be obvious when a launch is post-crash, is loading a clean profile, or is loading an external (user-specified) profile. Right now we condition on a bunch of nested logic (i.e. is a directoryNoneor not).
Right now we overload browser_params['profile_tar'] to both track the profile directory provided by the user and to load the profile of a crashed browser. Thus when clearing a profile (reset = True) we must also clear browser_params['profile_tar']. Thus, crawl which loads a profile with profile_tar and subsequently calls a command with reset = True will fail to load the profile tar again on all subsequent restarts. This is probably not the desired functionality.
Profile loading and saving needs to be refactored alongside restarting. A user should not be required to issue a command to dump a profile. Instead, a user should specify if they want a stateful crawl, and if they want to save the resulting profile. We can keep the ability to dump a command around if there is a need for it. The way I see it is:
is_statefuldetermines whether or not to restart the browser after each command (orCommandSequence), and whether or not to preserve a profile through a crashprofile_archivespecifies the directory a profile will be saved to when the TaskManager closes for any reason (e.g. an exception, a manager.close() command, etc).dump_profilecommand should act as it currently does
Thus all of the reset flags on commands should be removed, and instead the restarting of browsers should occur when is_stateful is set to False. These settings can still be set on a per-browser basis. If we want to keep support for sending multiple commands to the same browser under this model, we'll need to wait until #27 lands.