
MSUnmerged: Add multithreading for remote operations


Impact of the new feature MSUnmerged

Is your feature request related to a problem? Please describe. This issue is a request for feature development, following the findings from the previous issue related to the service slowness: https://github.com/dmwm/WMCore/issues/11701#issuecomment-1791072045

Long story short, the WebDAV protocol is clearly slower than SRMv2 in communicating remote data-control commands (like listdir and stat) between the server and the client. One way to mitigate this slowness is to avoid recursive remote operations as best we can, which is what we do in this PR: https://github.com/dmwm/WMCore/pull/11781 Unfortunately, this won't work for every site (especially those using WebDAV as a proxy protocol to XRootD, though I may be wrong here; this needs to be double-checked).

So the proper way to overcome the situation is to implement some sort of parallelization of the remote operations, because they impose the constraint here. And since the delays are mostly dependent on remote communication, it is natural to think about increasing the number of communication sessions locally, inside a single service instance (K8s pod).

Describe the solution you'd like

There are two places in MSUnmerged where we can do that, each reflecting a different granularity:

  • We can try to spread multiple threads on a per RSE basis, here: https://github.com/dmwm/WMCore/blob/21342f5af95dfaa6af30aa92a714739799ed1160/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L281-L284

In this approach we should create N pipelines and assign them to N threads. By sending one RSE at a time through each pipeline, we should scale up N times, because we will run N RSEs simultaneously and expect to reduce the I/O wait N-fold (a sketch of this approach follows the list below).

  • We can try to spread multiple threads on a per baseDirectory basis, here: https://github.com/dmwm/WMCore/blob/21342f5af95dfaa6af30aa92a714739799ed1160/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L325

In this approach we should create N threads in the background and then simply slice the rse['files']['toDelete'] data structure into N subsets, feeding the N threads with the resulting slices. This basically creates N FIFO queues, which should again reduce the I/O wait N-fold (a second sketch follows the restrictions below).
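
For illustration, here is a minimal sketch of the first (per-RSE) approach. It assumes a hypothetical pipelineFactory that builds one independent MSUnmerged-style pipeline (with its own gfal2 context) per thread; this is not the actual WMCore code:

import queue
import threading

def runPerRSE(rseList, pipelineFactory, numThreads=4):
    """Run N independent pipelines in parallel, one RSE at a time through each."""
    workQueue = queue.Queue()
    for rse in rseList:
        workQueue.put(rse)

    def worker():
        # One pipeline (and hence one gfal2 context/session) per thread,
        # so nothing is shared between the communication sessions.
        pipeline = pipelineFactory()
        while True:
            try:
                rse = workQueue.get_nowait()
            except queue.Empty:
                return
            pipeline.run(rse)

    threads = [threading.Thread(target=worker) for _ in range(numThreads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()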

Restrictions:

  • In the first approach we will hit some memory limits, because loading N RSEs in memory would be quite expensive.
  • In the second approach we already have everything in memory, so in that sense it is much more promising; but, as will be seen from my small-scale test below, it is actually CPU-limited.
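
And here is a minimal sketch of the second (per-baseDirectory) approach; statFile is a hypothetical stand-in for the real gfal2-based remote operation and is passed in by the caller:

from concurrent.futures import ThreadPoolExecutor

def processToDelete(rse, statFile, numThreads=10):
    """Slice rse['files']['toDelete'] into N subsets and feed them to N threads."""
    toDelete = list(rse['files']['toDelete'])
    # Round-robin slicing: thread i gets elements i, i+N, i+2N, ...
    # which effectively builds N FIFO queues of roughly equal length.
    slices = [toDelete[i::numThreads] for i in range(numThreads)]

    def worker(fileSlice):
        for filePath in fileSlice:
            statFile(filePath)  # the remote, I/O-bound call
        return len(fileSlice)

    with ThreadPoolExecutor(max_workers=numThreads) as pool:
        return sum(pool.map(worker, slices))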

Here is how I already implemented a proof of concept based on the second approach suggested (parallelization of remote operations inside the RSE): https://github.com/todor-ivanov/WMCoreAux/pull/8/commits/190eab1d647f4dd4fdafdae967414c39c7934fc9

This follows one-to-one the method of splitting the data structure inside the RSE and feeding it into N FIFO queues, as explained above. I used this code to see if we can actually gain something; here are the results:

  • Remote Operation used: stat
  • Remote Site: T2_IT_Legnaro
  • Hardware: Single core VM - OpenStack profile small
  • Number of threads: 1 vs. 10

Single Thread:

(WMCore.MSUnmergedStandalone) [@ WMCore.MSUnmergedStandalone]$ time python  $WMCORE_SERVICE_ROOT/WMCoreAux/bin/MSStandalone/MSUnmerged/init.py >  scaleTest-1_threads-T2_IT_Legnaro.log  2>&1

real	14m1.500s
user	10m15.514s
sys	0m4.998s

10 Threads:

(WMCore.MSUnmergedStandalone) [@ WMCore.MSUnmergedStandalone]$ time python  $WMCORE_SERVICE_ROOT/WMCoreAux/bin/MSStandalone/MSUnmerged/init.py >  scaleTest-10_threads-T2_IT_Legnaro.log  2>&1

real	8m37.584s
user	8m4.143s
sys	0m7.261s

  • In the first case the machine was running at 70% CPU and never exceeded it.
  • In the second run it was constantly hitting 100%, and all 10 threads were evenly sharing around 10% CPU time each.

The increase in CPU usage from 70% to 100% roughly matches the runtime gain as well. This CPU-time increase could also be due to thread-management overhead; that last guess of mine may be a stretch, but we should not overlook the possibility either. Even so, it looks like we do gain some benefit from splitting those I/O operations across multiple threads. For a better scale test, though, we will need bigger hardware.

Describe alternatives you've considered None

Additional context none

todor-ivanov avatar Nov 10 '23 14:11 todor-ivanov

@amaltaro @vkuznet , please take a look at the explanation provided in this issue's description.

todor-ivanov avatar Nov 10 '23 15:11 todor-ivanov

@todor-ivanov , obviously I back your way of thinking that the speed-up can be accomplished via parallelism. That said, I think we should explore which way provides the best performance. In Python there are multiple options: threads, multiprocessing, asyncio, etc. In my view, threads are the most expensive among them. What I'm interested in is knowing how much time a SINGLE operation takes. If we have N operations, the right parallelism should bring only about 10% overhead over a single operation, not more. In your results that is not the case: you brought in 10 threads and reduced the time only by a factor of two. I am rather interested in whether we can explore asyncio instead of threads and see how much of a boost it gives us.
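
For example, a minimal sketch of what the asyncio route could look like, assuming the site speaks plain WebDAV over HTTPS; the endpoint URL and the authentication handling are placeholders (a real client would need X.509 or token auth), not actual WMCore code:

import asyncio
import aiohttp

async def statAll(baseUrl, filePaths, maxInFlight=50):
    """Issue WebDAV PROPFIND requests (the stat equivalent) concurrently."""
    semaphore = asyncio.Semaphore(maxInFlight)  # cap in-flight requests

    async def statOne(session, path):
        async with semaphore:
            async with session.request('PROPFIND', baseUrl + path,
                                       headers={'Depth': '0'}) as resp:
                return path, resp.status

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(statOne(session, p) for p in filePaths))

# Usage (hypothetical endpoint):
# results = asyncio.run(statAll('https://webdav.example.site:2880', paths))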

vkuznet avatar Nov 10 '23 15:11 vkuznet

Hi @vkuznet, I've brought in 10 threads, but I was CPU-constrained and hit the CPU limit quite early... so beyond the second new process, all the rest were mostly splitting the same resource. If we want real results we should redo the test on a better machine. I agree with you, though, that we should explore all the other options; I'll take a look at them as well.

todor-ivanov avatar Nov 10 '23 20:11 todor-ivanov