Peter Andreas Entschev
> Just to be clear, it is doubling the memory usage per object being transmitted during its transmission. So it is not as simple as doubling all memory or for...
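In case it helps make that concrete, here is a minimal sketch of what "doubling per object" means, assuming the serialized copy coexists with the original until the send completes (plain `pickle` is used purely for illustration, not ucx-py's actual serialization path):

```python
import pickle
import sys

# ~100 MiB payload standing in for one object being transmitted
obj = bytearray(100 * 1024 * 1024)

# Serializing for the wire produces a second copy of the data;
# until the send completes, both copies are alive at once.
frames = pickle.dumps(obj)

# Peak usage is roughly 2x the object size, but only per object
# currently in flight -- not 2x the whole process footprint.
peak_bytes = sys.getsizeof(obj) + sys.getsizeof(frames)
print(f"peak ~= {peak_bytes / 2**20:.0f} MiB for a 100 MiB object")
```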
> I think we are on the same page. I'd like people to try it and report feedback before we consider merging.

Awesome, please keep us posted. Let me know...
> Thus far have tried the MRE from issue (rapidsai/ucx-py#402) where it seems to help.

Could you elaborate on what you mean by "seems to help"? Another question:...
My tests show an improvement with this PR over the current master branch, so definitely +1 from that perspective. I'm not able to evaluate the memory footprint right now, but I'm...
> I hadn't confirmed this yet. Though NVLink was enabled when I ran in all cases before. Of course that isn't confirmation that it works 😉

I forgot to mention...
And of course, thanks for the nice work, @jakirkham!
I'm now seeing the following errors just as workers connect to the scheduler. Errors on scheduler:

```python-traceback
ucp.exceptions.UCXMsgTruncated: Comm Error "[Recv #002] ep: 0x7fac27641380, tag: 0xf2597f095b80a8c, nbytes: 1179, type: ":...
```
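For context, `UCXMsgTruncated` is raised when a posted receive buffer is smaller than the incoming message. A minimal standalone sketch of that failure mode, assuming plain ucx-py outside of Dask (the port and buffer sizes are illustrative, with the 1179 bytes borrowed from the log above):

```python
import asyncio
import numpy as np
import ucp

PORT = 13337  # hypothetical port, chosen for illustration


async def main():
    async def handler(ep):
        # Receiver posts a buffer smaller than the incoming message,
        # which is what UCXMsgTruncated reports (nbytes > buffer size).
        small = np.empty(512, dtype=np.uint8)
        await ep.recv(small)  # raises ucp.exceptions.UCXMsgTruncated

    listener = ucp.create_listener(handler, PORT)
    ep = await ucp.create_endpoint(ucp.get_address(), PORT)
    await ep.send(np.zeros(1179, dtype=np.uint8))  # 1179 bytes, as in the log
    await asyncio.sleep(1)  # give the handler a chance to fail visibly
    listener.close()


asyncio.run(main())
```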
@jakirkham performance-wise, I'd say this is a good improvement. I did some runs with 4 DGX-1 nodes using the code from https://github.com/rapidsai/ucx-py/issues/402#issue-556594147, please see details below:

```
IB Create time:...
```
> Thanks Peter! Can you please share a bit about where this was run?

This was run on a small cluster of 4 DGX-1 nodes; I updated my post above...
From the discussion in https://github.com/rapidsai/ucxx/pull/61, it seems that `cuda-nvcc` is required even when linking only with the host compiler. @jakirkham wrote:

> Essentially the CUDA compiler package is needed to...
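If that's right, a conda recipe would need to carry `cuda-nvcc` as a build dependency even when only the host compiler does the linking. A hypothetical `meta.yaml` fragment illustrating the idea (package names follow conda-forge conventions; the surrounding recipe is invented):

```yaml
# Hypothetical recipe fragment; only the cuda-nvcc line reflects the
# takeaway above, the rest is illustrative scaffolding.
requirements:
  build:
    - {{ compiler('cxx') }}
    - cuda-nvcc        # needed so the build system can locate the CUDA
                       # Toolkit, even though nvcc compiles nothing here
  host:
    - cuda-cudart-dev
```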