roryqi
cc @colinmjj. We don't seem to have such cases in our production environment, but I think the analysis is correct. What do you think?
> It looks like taskAttemptId has been modified from 2^20 to 2^21, but the comment for org.apache.uniffle.client.util.ClientUtils has not been modified. org.apache.uniffle.common.util.Constants `public static final int TASK_ATTEMPT_ID_MAX_LENGTH = 21;` `public...
> Currently we can support 2^20 tasks, which is not a small number. If spark.rss.writer.buffer.size is set to the default value of 3m, then the data written by a taskAttempt...
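For reference, the difference between the 2^20 and 2^21 figures quoted above can be checked with a few lines of Java. This is only a sketch: `TASK_ATTEMPT_ID_MAX_LENGTH = 21` is taken from the quoted `Constants` line, while the class `TaskAttemptIdBits` and its methods are hypothetical helpers for illustration, not Uniffle code.

```java
// Sketch only: the constant mirrors the value quoted above; the class and
// method names are hypothetical and are not part of Uniffle.
final class TaskAttemptIdBits {
    static final int TASK_ATTEMPT_ID_MAX_LENGTH = 21;

    // Number of distinct IDs representable with the given bit width.
    static long capacity(int bits) {
        return 1L << bits;
    }

    // Largest non-negative ID that fits in the given bit width.
    static long maxId(int bits) {
        return (1L << bits) - 1;
    }

    // True if the ID can be stored without overflowing the field.
    static boolean fits(long taskAttemptId, int bits) {
        return taskAttemptId >= 0 && taskAttemptId <= maxId(bits);
    }

    public static void main(String[] args) {
        System.out.println(capacity(20));                         // 1048576
        System.out.println(capacity(TASK_ATTEMPT_ID_MAX_LENGTH)); // 2097152
        System.out.println(fits(1_500_000L, 20));                 // false
        System.out.println(fits(1_500_000L, 21));                 // true
    }
}
```

A comment that still says 2^20 while the constant is 21 understates the supported range by a factor of two, which is why the comment and the constant should be kept in sync.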
Could you provide more detailed information? Could you add some logs to help us solve this problem?
> No logs, we just found this phenomenon. Maybe `org.apache.uniffle.common.rpc.MonitoringServerCall#close` is sometimes not called. I tried calling `decCounter` in `MonitoringServerCallListener#onComplete/onCancel/onComplete` and it works. But I don't know the real reason...
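If the counter ends up being decremented from more than one callback (for example both `close` and a listener callback), one way to avoid double-decrementing is to make the release idempotent. The following is a minimal sketch in plain Java, assuming nothing about the real classes: `CallCounterGuard` and `release` are hypothetical names, and the code does not use gRPC's or Uniffle's API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a per-call metric guard; not Uniffle's or gRPC's API.
final class CallCounterGuard {
    private final AtomicLong counter;
    private final AtomicBoolean released = new AtomicBoolean(false);

    // Increment the shared in-flight counter when the call starts.
    CallCounterGuard(AtomicLong counter) {
        this.counter = counter;
        counter.incrementAndGet();
    }

    // Safe to invoke from several callbacks: only the first invocation decrements.
    void release() {
        if (released.compareAndSet(false, true)) {
            counter.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        AtomicLong inFlight = new AtomicLong();
        CallCounterGuard guard = new CallCounterGuard(inFlight);
        guard.release(); // e.g. from a close-like callback
        guard.release(); // e.g. from a cancel/complete callback; no second decrement
        System.out.println(inFlight.get()); // 0
    }
}
```

With this shape, it does not matter which callback fires first or whether several fire for the same call; the counter returns to its starting value exactly once per call.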
cc @colinmjj, do you remember our flaky metric test? I suspect it is caused by this issue.
I understand that you need a `rolling upgrade` feature. Our plan is to implement this feature through the k8s operator. For standalone mode, we don't have plans,....
Could you write a design doc (in Google Docs)? This issue is a little complex.
If we want to add an interface to control the shuffle server's behavior, we should have a complete design first, and we think it needs detailed discussion. We once had a similar idea...