OpenWPM icon indicating copy to clipboard operation
OpenWPM copied to clipboard

Refactor how browsers handle failures and restarts

Open englehardt opened this issue 10 years ago • 2 comments
trafficstars

The Browser.restart_browser_manager(),Browser.reset(), and Browser.launch_browser_manager() methods are needlessly complex and confusing. This had led to a multitude of bugs, particularly around profile handling (e.g. the simple one in 6e384db5a06ecad4cf1b3aa2369b519a757026cb). This section of the infrastructure needs to be refactored in the following way:

  • Have a single pipeline for restarts, which handles restarts triggered from timeouts, excessive memory usage, crashes, and restarts for stateless crawls.
  • Infer whether or not the profile should be cleared by the flag on the command. No part of the infrastructure should be hard-coding this.
  • Make the logic of launch_browser_manager() clear. It should be obvious when a launch is post-crash, is loading a clean profile, or is loading an external (user-specified) profile. Right now we condition on a bunch of nested logic (i.e. is a directory None or not).

englehardt avatar Sep 17 '15 03:09 englehardt

Right now we overload browser_params['profile_tar'] to both track the profile directory provided by the user and to load the profile of a crashed browser. Thus when clearing a profile (reset = True) we must also clear browser_params['profile_tar']. Thus, crawl which loads a profile with profile_tar and subsequently calls a command with reset = True will fail to load the profile tar again on all subsequent restarts. This is probably not the desired functionality.

englehardt avatar Sep 17 '15 21:09 englehardt

Profile loading and saving needs to be refactored alongside restarting. A user should not be required to issue a command to dump a profile. Instead, a user should specify if they want a stateful crawl, and if they want to save the resulting profile. We can keep the ability to dump a command around if there is a need for it. The way I see it is:

  • is_stateful determines whether or not to restart the browser after each command (or CommandSequence), and whether or not to preserve a profile through a crash
  • profile_archive specifies the directory a profile will be saved to when the TaskManager closes for any reason (e.g. an exception, a manager.close() command, etc).
  • dump_profile command should act as it currently does

Thus all of the reset flags on commands should be removed, and instead the restarting of browsers should occur when is_stateful is set to False. These settings can still be set on a per-browser basis. If we want to keep support for sending multiple commands to the same browser under this model, we'll need to wait until #27 lands.

englehardt avatar Nov 05 '15 23:11 englehardt