MSUnmerged: Add multithreading for remote operations
Impact of the new feature
MSUnmerged
Is your feature request related to a problem? Please describe.
This issue is a request for feature development, following the findings from the previous issue on service slowness: https://github.com/dmwm/WMCore/issues/11701#issuecomment-1791072045
Long story short, the WebDAV protocol is clearly slower than SRMv2 at exchanging remote data-control commands (like `listdir` and `stat`) between the server and the client. One way to mitigate this slowness is to avoid recursive remote operations as much as we can, which is what we do in this PR: https://github.com/dmwm/WMCore/pull/11781 Unfortunately, this won't work for every site (especially for those which use WebDAV as a proxy protocol to `xrootd`, but I may be wrong here - this needs to be double-checked).
So the proper way to overcome the situation is to implement some sort of parallelization of the remote operations, because they impose the constraint here. And since the delays are mostly dominated by remote communication, it is natural to think about increasing the number of communication sessions locally - inside a single service instance (K8s pod).
Describe the solution you'd like
There are two places in `MSUnmerged` where we can do that, each reflecting a different granularity:
- We can spread multiple threads on a per-`RSE` basis, here: https://github.com/dmwm/WMCore/blob/21342f5af95dfaa6af30aa92a714739799ed1160/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L281-L284
  In this approach we should create N pipelines and assign each of them to its own thread. By sending one `RSE` at a time through each pipeline we should scale up N times, because we will run N `RSE`s simultaneously, and we expect to reduce the I/O wait N-fold.
- We can spread multiple threads on a per-`baseDirectory` basis, here: https://github.com/dmwm/WMCore/blob/21342f5af95dfaa6af30aa92a714739799ed1160/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L325
  In this approach we should create N threads in the background, split the data structure `rse['files']['toDelete']` into N slices, and feed the N threads with the resulting subsets of the data. This basically creates N FIFO queues, which should again supposedly reduce the I/O wait N-fold (see the sketch right after this list).
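For illustration, here is a minimal sketch (not the actual MSUnmerged code) of the second approach, assuming `rse['files']['toDelete']` can be treated as a flat iterable of entries and that the remote operation is wrapped in a helper such as the hypothetical `removeRemoteEntry` below. A `ThreadPoolExecutor` already maintains the FIFO work queue described above, so explicit per-thread queues are not strictly needed:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

NUM_THREADS = 10  # number of parallel remote sessions per RSE

def removeRemoteEntry(lfn):
    """Placeholder for the real remote call (e.g. a gfal2 stat/unlink on the PFN)."""
    # In the real service this would likely need one gfal2 context per thread.
    return lfn

def cleanRSEInParallel(rse):
    """Run the remote deletions for a single RSE through a small thread pool."""
    toDelete = list(rse['files']['toDelete'])
    failed = []
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        futures = {pool.submit(removeRemoteEntry, lfn): lfn for lfn in toDelete}
        for future in as_completed(futures):
            if future.exception() is not None:
                failed.append(futures[future])
    return failed
```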
Restrictions:
- In the first approach we will hit some memory limits, because loading N `RSE`s in memory would be quite expensive.
- In the second approach we already have everything in memory, so in that sense it is much more promising, but as can be seen from my small-scale test below, it is actually CPU limited.
Here is how I implemented a proof of concept based on the second approach suggested (parallelization of remote operations inside the RSE): https://github.com/todor-ivanov/WMCoreAux/pull/8/commits/190eab1d647f4dd4fdafdae967414c39c7934fc9
It follows one-to-one the method of splitting the data structure inside the RSE and feeding it into N FIFO queues, as explained above. I used this code to see whether we can actually gain something; here are the results:
- Remote operation used: `stat`
- Remote site: `T2_IT_Legnaro`
- Hardware: single-core VM - OpenStack profile `small`
- Number of threads: 1 vs. 10
Single thread:
```
(WMCore.MSUnmergedStandalone) [@ WMCore.MSUnmergedStandalone]$ time python $WMCORE_SERVICE_ROOT/WMCoreAux/bin/MSStandalone/MSUnmerged/init.py > scaleTest-1_threads-T2_IT_Legnaro.log 2>&1

real    14m1.500s
user    10m15.514s
sys     0m4.998s
```
10 threads:
```
(WMCore.MSUnmergedStandalone) [@ WMCore.MSUnmergedStandalone]$ time python $WMCORE_SERVICE_ROOT/WMCoreAux/bin/MSStandalone/MSUnmerged/init.py > scaleTest-10_threads-T2_IT_Legnaro.log 2>&1

real    8m37.584s
user    8m4.143s
sys     0m7.261s
```
- In the first case the machine was running at 70% CPU and never exceeded it.
- In the second run it was constantly hitting 100%, with all 10 threads evenly sharing the CPU at around 10% each.
The increase of CPU usage from 70% to 100% almost matches the gain we get in runtime. Part of this extra CPU time could also be due to thread-management overhead; this last guess of mine sounds like a stretch, but we should not overlook the possibility either. Even so, it looks like we do gain some benefit from splitting those I/O operations across multiple threads, but for a better scale test we will need bigger hardware.
Describe alternatives you've considered
None
Additional context
None
@amaltaro @vkuznet, please take a look at the explanation provided in this issue's description.
@todor-ivanov, obviously I back up your way of thinking that the speed-up can be accomplished via parallelism. That said, I think we should explore which way would provide the best performance. In Python there are multiple options: threads, multiprocessing, asyncio, etc. In my view threads are the most expensive of them. What I'm interested in is how much time a SINGLE operation takes. If we have N operations, the right parallelism should only bring ~10% overhead over a single operation, and not more. In your results that is not the case: you brought in 10 threads and reduced the time only by a factor of 2. I would rather explore asyncio instead of threads and see how much of a boost it gives us.
Hi @vkuznet, I've brought in 10 threads, but I was CPU constrained and hit the CPU limit quite early... so beyond the second new process all the rest were mostly splitting the same resource. If we want real results we should redo the test on a better machine. I agree with you, though, that we should explore all the remaining options; I'll take a look at them as well.
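For reference, here is a minimal sketch of the asyncio route mentioned above, under the assumption that the underlying remote call (gfal2/WebDAV) remains a blocking function: it then has to be offloaded with `asyncio.to_thread` (Python 3.9+), so a real win over plain threads would only come from a genuinely asynchronous HTTP/WebDAV client. The helper name `statRemoteEntry` is hypothetical:

```python
import asyncio

MAX_CONCURRENCY = 10  # cap on simultaneous remote sessions

def statRemoteEntry(lfn):
    """Placeholder for a blocking remote stat (gfal2/WebDAV) on one entry."""
    return lfn

async def statAll(lfns):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def worker(lfn):
        async with sem:
            # to_thread keeps the event loop responsive while the blocking call runs
            return await asyncio.to_thread(statRemoteEntry, lfn)

    return await asyncio.gather(*(worker(lfn) for lfn in lfns))

# results = asyncio.run(statAll(listOfLfns))
```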