signac-flow icon indicating copy to clipboard operation
signac-flow copied to clipboard

Speed up parameter formatting in status

Open vyasr opened this issue 5 years ago • 4 comments

Feature description

I have a somewhat large signac project (~500) jobs with some complicated conditions. I've profiled and cleaned up the conditions, but I was still finding the status updates to be slow. I was able to isolate the problem to the part of the code that handles gathering parameter information (if you run python project.py status -d -p). This part of the code is taking ~250 s (about 4 minutes). We should be able to do it faster than that, so I would like to optimize this part of the code.

Proposed solution

My very rough benchmark was done by just printing the amount of time spent in the code between line 1873 (if parameters is self.PRINT_STATUS_ALL_VARYING_PARAMETERS) and line 1902 (parameters[i] = shorten(self._alias(str(para)), param_max_width)) in project.py. Offhand I'd guess that the issue either comes from opening many jobs and performing the conversion of the statepoint to a dictionary (line 1882), or from the fact that the get function there is implemented as splitting a string and then rejoining it on each recursive call. Since my schema is flat, the former seems more likely. I would recommend creating a large project with a simple schema (something like 5 keys with 4 values each for 1024 total jobs) and then running a profile specifically on that segment of the code. Running signac update-cache beforehand has effectively no impact on the runtime.

vyasr avatar Apr 04 '20 16:04 vyasr

@vyasr won't this will speed up once the parallelization for fetching status is implemented?

kidrahahjo avatar Apr 05 '20 16:04 kidrahahjo

Adding a permalink to this code section on the current commit of master, for posterity and convenience: https://github.com/glotzerlab/signac-flow/blob/215b252bbda798d2fcb9f0b44917c4fd70e996c5/flow/project.py#L1872-L1902

bdice avatar Apr 05 '20 16:04 bdice

@kidrahahjo no, the parallelization will speed up the call to _fetch_status, but the code that I'm referring to here is run after _fetch_status. You can see the difference by running a status update with and without the -p option, which triggers the parameter printing but does not change the process of determining the current status of each job.

vyasr avatar Apr 05 '20 18:04 vyasr

This code is also the only place that _to_hashable is used (see PR #384). It seems like this portion of the code could use a thorough refactoring. I have a couple questions. Note that this code is refactored some in #374, but its performance characteristics won't change.

  1. Potentially redundant code: Is the _to_hashable helper function really necessary? It may be possible to use signac's schema detection here instead of this logic, which seems redundant with signac schema. This is the only place _to_hashable is used and it could be removed from signac-flow if this use case is eliminated in favor of signac's schema detection. I would guess that signac's schema detection will be smarter/faster but if not, it should be optimized in one place (in signac's schema detection logic) instead of two places (both signac-flow and signac schema detection).

  2. Documentation/naming: Is the name "parameter" the right word for a state point key? It seems like we use "state point key" more frequently in the docs but it might be a good idea for us to be more consistent in the docs for signac and signac-flow.

bdice avatar Nov 30 '20 02:11 bdice