
Enable Metric collection service in Dolphin on ET

Open · yunseong opened this issue 7 years ago · 1 comment

We need to collect metrics (on both servers and workers) in Dolphin on ET. We may be able to reuse the worker-side code and change only the Driver-side message handler (for metrics). On the server side, we also need to implement a message sender, since we have a different implementation for updating parameters instead of reusing ps.

yunseong, Mar 26 '17 03:03

I was considering reusing most of the Evaluator-side code, but I've found that this would result in many ad-hoc workarounds, especially 1) for obtaining the number of table blocks, which is used for a validity check (i.e., it should equal the number in the Driver), and 2) for collecting server-side metrics, for which I had thought of executing a dummy ServerTask.

Instead, after discussing with @wynot12, we think a better solution would be for ET to take charge of metric collection (then we don't have to worry about the problems above).

One tricky part is stitching together the configuration to support different levels of metrics (e.g., ET-specific metrics, Worker/Server-specific metrics) that are collected by different mechanisms: server-side metrics are collected periodically, whereas worker-side metrics are collected on demand.

But since we have a clear view of the structure of metrics (as changed in #1070), we can tackle the issue pretty easily. I'll send a PR hopefully tonight.
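
To make the periodic vs. on-demand distinction concrete, here is a minimal sketch of one collector that supports both modes. It is illustrative only: `MetricCollector`, `flush`, `start_periodic`, and the `source`/`sink` callables are hypothetical names, not the Dolphin or ET API, and it is written in Python for brevity rather than in the project's own language.

```python
import threading
from typing import Callable, Dict, Optional

Metrics = Dict[str, float]


class MetricCollector:
    """Hypothetical collector: reads metrics from `source`, forwards them to `sink`."""

    def __init__(self, source: Callable[[], Metrics],
                 sink: Callable[[Metrics], None]) -> None:
        self._source = source          # assumed: returns the current metrics
        self._sink = sink              # assumed: e.g. sends a message to the Driver
        self._stopped = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def flush(self) -> None:
        """On-demand collection (worker-style): collect and send once."""
        self._sink(self._source())

    def start_periodic(self, period_sec: float) -> None:
        """Periodic collection (server-style): flush every `period_sec` seconds."""
        def loop() -> None:
            # Event.wait returns False on timeout and True once stop() is called.
            while not self._stopped.wait(period_sec):
                self.flush()

        self._thread = threading.Thread(target=loop, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        """Stop the periodic loop, if one is running, and wait for it to exit."""
        self._stopped.set()
        if self._thread is not None:
            self._thread.join()
```

In this sketch a worker would call `flush()` whenever it has something to report, while a server would call `start_periodic(...)` once at startup and `stop()` at shutdown. The configuration only has to choose which of the two entry points is wired up for each role.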

yunseong, Apr 09 '17 06:04