
Tracking Child Processes (tree of child processes)

Open · CMCDragonkai opened this issue on Mar 18 '19 · 15 comments

I've noticed that when using subprocess.run in my dask tasks, the workers still report low CPU usage even though the child process is in fact maxing out my CPU. Does Dask track forked/child process CPU usage under the worker's CPU usage? If not, I think it should, since child processes come up whenever dask workers are interfaced with programs written in languages other than Python.
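
For reference, a minimal sketch of the scenario (the "crunch" binary is a hypothetical stand-in for any CPU-heavy external program):

import subprocess
from dask.distributed import Client

def run_external(arg):
    # The CPU burn happens in the child process, not in the worker
    # itself, so the dashboard reports the worker as nearly idle.
    subprocess.run(["crunch", str(arg)], check=True)  # hypothetical binary
    return arg

if __name__ == "__main__":
    client = Client()  # local cluster
    futures = client.map(run_external, range(8))
    client.gather(futures)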

CMCDragonkai avatar Mar 18 '19 07:03 CMCDragonkai

The task stream isn't able to keep track of the child process either, so it just looks like the task is blocked and there's no progress report.

CMCDragonkai avatar Mar 18 '19 07:03 CMCDragonkai

Does Dask track the forked/child process CPU usage under the worker CPU usage?

No, it doesn't

If not, I think it should since child processes can occur when interfacing dask workers against programs written in non-python.

Do you have recommendations on how to do this well? Currently we use the psutil module. If there is something else we should be doing then suggestions would be welcome.

The relevant code is here:

https://github.com/dask/distributed/blob/fb30c33562862f30864456766424b44a3e91aa5b/distributed/system_monitor.py#L50
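
For context, the update loop at that link boils down to something like this (a paraphrased sketch, not the verbatim source; note that the monitor samples only the worker's own process):

import psutil
from collections import deque
from time import time

class SystemMonitor:
    def __init__(self, n=10000):
        self.proc = psutil.Process()  # the worker process itself
        self.time = deque(maxlen=n)
        self.cpu = deque(maxlen=n)
        self.memory = deque(maxlen=n)

    def update(self):
        with self.proc.oneshot():  # batch the underlying system reads
            cpu = self.proc.cpu_percent()          # this process only;
            memory = self.proc.memory_info().rss   # children are invisible
        now = time()
        self.cpu.append(cpu)
        self.memory.append(memory)
        self.time.append(now)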

mrocklin avatar Mar 18 '19 15:03 mrocklin

It looks like you would need to find the pid tree of the worker process and then acquire stats for each pid. This is transitive: child processes that spawn further child processes would all fall under the pid tree.

CMCDragonkai avatar Mar 19 '19 01:03 CMCDragonkai

Here's an example of doing this in Node.js: https://github.com/soyuka/pidusage-tree

CMCDragonkai avatar Mar 19 '19 01:03 CMCDragonkai

This is natively supported in the psutil package: https://unix.stackexchange.com/a/339071/56970

pid=2235; python3 -c "import psutil
for c in psutil.Process($pid).children(True):
  print(c.pid)"

Passing True makes the lookup recursive (it is the recursive argument of children).

https://psutil.readthedocs.io/en/latest/#psutil.Process.children
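
Building on that, a sketch of aggregating usage over the whole tree (tree_usage is a hypothetical helper, not part of psutil):

import psutil

def tree_usage(pid):
    # Sum CPU percent and RSS across a process and all of its recursive
    # children. Caveat: cpu_percent() returns 0.0 on the first call for
    # a fresh Process object, so for meaningful numbers the Process
    # objects must be cached between samples (see the discussion below).
    parent = psutil.Process(pid)
    cpu, rss = 0.0, 0
    for p in [parent] + parent.children(recursive=True):
        try:
            with p.oneshot():
                cpu += p.cpu_percent()
                rss += p.memory_info().rss
        except (psutil.NoSuchProcess, psutil.ZombieProcess):
            pass  # the process exited between enumeration and sampling
    return cpu, rss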

Another way is to realise that a process group on Linux will always encapsulate the entire process tree when launched by a shell. See: https://en.wikipedia.org/wiki/Process_group This means that on Linux, if you start each worker in its own process group, you can just query the PGID and get all processes that are part of that group. This is also useful for distributing signals. However, this may not be portable to Windows or other operating systems, whereas walking the process tree via direct descendants is more standard across operating systems.

I wrote a gist about process groups: https://gist.github.com/CMCDragonkai/f58afb7e39fcc422097849b853caa140
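
A minimal POSIX-only sketch of the process-group idea (pgid_of is a hypothetical helper):

import os
import signal
import subprocess
import psutil

def pgid_of(pid):
    try:
        return os.getpgid(pid)
    except (ProcessLookupError, PermissionError):
        return None

# start_new_session=True puts the child (and all of its descendants)
# in a fresh session and process group, separate from the parent's.
child = subprocess.Popen(["sleep", "60"], start_new_session=True)
pgid = os.getpgid(child.pid)

# Enumerating the group's members still requires scanning all pids.
members = [p for p in psutil.process_iter() if pgid_of(p.pid) == pgid]

# Signals can be delivered to the whole group at once.
os.killpg(pgid, signal.SIGTERM)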

CMCDragonkai avatar Mar 19 '19 01:03 CMCDragonkai

Perhaps you'd like to submit a PR? Also, will this affect performance? We run these checks relatively frequently, and they already show up in overhead profiling.

mrocklin avatar Mar 19 '19 01:03 mrocklin

It should only affect performance for workers that create a tree of subprocesses; otherwise it would just be checking one process. In that case the cost should be linear in the total number of subprocesses that exist. I haven't checked psutil's internals to see whether it has to iterate over the entire process table each time children is called; if it does, that would not be ideal, and process groups would be the better approach. Windows actually does have process groups too.

CMCDragonkai avatar Mar 19 '19 01:03 CMCDragonkai

If it doesn't affect the case where people aren't making many subprocesses, then I'm not too concerned.

mrocklin avatar Mar 19 '19 01:03 mrocklin

@CMCDragonkai do you have any interest in implementing and testing this?

mrocklin avatar Mar 29 '19 06:03 mrocklin

Yes, however it's in the queue of things to do. I don't mind if somebody else takes the lead on this.

CMCDragonkai avatar Apr 01 '19 07:04 CMCDragonkai

See https://github.com/dask/distributed/pull/2390. If you can find a way to make this use less CPU, then I'm sure the developers would add it.

Currently it takes about 3-10% of the CPU to run these checks, so your options would be to poll for CPU/memory only a subset of the time, or to use a different module and figure it out in some other fashion.

The problem lies in the fact that you have to hold onto the child Process objects and ask them for resource usage more than once: a single call will give you 0 usage.
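
That warm-up behaviour is easy to demonstrate (POSIX, using yes as a CPU-bound child):

import subprocess
import time
import psutil

child = subprocess.Popen(["yes"], stdout=subprocess.DEVNULL)
p = psutil.Process(child.pid)

print(p.cpu_percent())  # first call on a fresh Process object: always 0.0
time.sleep(1)
print(p.cpu_percent())  # now ~100.0, measured since the previous call
child.kill()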

danpf avatar Sep 13 '19 01:09 danpf

If it matters for correctness, allow the user of the system to decide whether the overhead is worth it.
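
One way to expose that choice would be a configuration flag; the key below is hypothetical and does not exist in dask's config schema:

import dask

# Hypothetical opt-in knob, illustrating "let the user decide".
TRACK_CHILDREN = dask.config.get(
    "distributed.admin.system-monitor.track-children", False
)
# The SystemMonitor could then skip the children() scan entirely
# unless this flag is set.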

CMCDragonkai avatar Jan 06 '20 01:01 CMCDragonkai

I tried to find a solution for following the workers' child processes.

To do this, the class handling the system monitoring must track the children; otherwise the CPU usage is always 0 for every child (no previous sample exists from which to calculate the usage). Initially the set of children is empty; on each update we add new children and drop the ones that have exited. In short, it adds a little bit of computation. In my test case I didn't see any impact on monitoring performance.

--- a/distributed/system_monitor.py
+++ b/distributed/system_monitor.py
@@ -8,6 +8,7 @@
 class SystemMonitor:
     def __init__(self, n=10000):
         self.proc = psutil.Process()
+        self.children = set()
 
         self.time = deque(maxlen=n)
         self.cpu = deque(maxlen=n)
@@ -45,6 +46,15 @@ class SystemMonitor:
         with self.proc.oneshot():
             cpu = self.proc.cpu_percent()
             memory = self.proc.memory_info().rss
+            children = set(self.proc.children(True)) - self.children
+            if children:
+                self.children.update(children)
+            self.children = set(item for item in self.children
+                                if item.is_running())
+            for item in self.children:
+                with item.oneshot():
+                    cpu += item.cpu_percent()
+                    memory += item.memory_info().rss
         now = time()
 
         self.cpu.append(cpu)

I can open a PR to insert this change.

fbriol avatar Nov 19 '20 21:11 fbriol

Any traction on this? I'd pay the extra CPU penalty to be able to get these stats, or even more granular stats.

zbarr avatar Jul 26 '23 07:07 zbarr

Ditto here; has any switch been added to allow this? I use dask primarily to task-farm a Fortran program where each execution takes minutes or more, so a few percent of extra overhead per task would be no problem for me. On the other hand, knowing the CPU and memory headroom per node would be invaluable for optimizing my use of the nodes.

rasmus98 avatar Apr 02 '24 15:04 rasmus98