Yunseong Lee comments

Results: 16 comments of Yunseong Lee

Version problem in Jenkins REEF

The main reason for this failure is that our code depends on REEF's latest `SNAPSHOT`. Whenever we need new features in REEF, we have to manually build REEF and include...

Version problem in Jenkins REEF

IMHO, we'd better find a way to rebuild REEF automatically somehow. I think we can add another project to Jenkins, which tests & builds Cay periodically (e.g., daily), and there...

Out of memory: unable to create new native thread

Thanks for the report! We are probably creating too many threads; I think it's time to look into the problem and come up with better thread management. I'll...
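One common way to keep the thread count bounded is to run tasks on a fixed-size pool instead of starting a thread per task. The sketch below is only an illustration under that assumption; `BoundedThreads` and `distinctThreadsUsed` are hypothetical names, not code from the project.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs many tasks on at most maxThreads threads. */
final class BoundedThreads {
  /** Returns how many distinct threads actually executed the tasks. */
  static int distinctThreadsUsed(final int numTasks, final int maxThreads) {
    final Set<String> threadNames = ConcurrentHashMap.newKeySet();
    // The pool never holds more than maxThreads worker threads,
    // no matter how many tasks are queued.
    final ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
    for (int i = 0; i < numTasks; i++) {
      pool.execute(() -> threadNames.add(Thread.currentThread().getName()));
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
        throw new IllegalStateException("tasks did not finish in time");
      }
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return threadNames.size();
  }
}
```

With this shape, 1000 tasks on a pool of 4 never use more than 4 native threads; extra tasks wait in the pool's queue.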

Implement multi-threaded Trainer.

When we implement multi-threaded Trainer, we can consider two versions for threads to write their gradient updates: 1) synchronized and 2) Hogwild-style (lock-free). We should build both versions and compare...
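A minimal sketch of the two write paths, assuming a dense weight vector shared by all trainer threads. `SharedModel` and `TrainerSketch` are hypothetical names for illustration and do not reflect the actual Trainer code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical shared model; not the project's API. */
final class SharedModel {
  final double[] weights;

  SharedModel(final int dim) {
    this.weights = new double[dim];
  }

  /** 1) Synchronized: only one thread at a time applies a whole gradient. */
  synchronized void applySynchronized(final double[] gradient, final double stepSize) {
    for (int i = 0; i < weights.length; i++) {
      weights[i] -= stepSize * gradient[i];
    }
  }

  /** 2) Hogwild-style: no lock, so concurrent writes can race and lose updates. */
  void applyHogwild(final double[] gradient, final double stepSize) {
    for (int i = 0; i < weights.length; i++) {
      weights[i] -= stepSize * gradient[i];
    }
  }
}

final class TrainerSketch {
  /** Runs numThreads threads, each applying updatesPerThread unit gradients. */
  static SharedModel run(final int numThreads, final int updatesPerThread, final boolean hogwild) {
    final SharedModel model = new SharedModel(4);
    final double[] gradient = {1.0, 1.0, 1.0, 1.0};
    final ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    for (int t = 0; t < numThreads; t++) {
      pool.execute(() -> {
        for (int u = 0; u < updatesPerThread; u++) {
          if (hogwild) {
            model.applyHogwild(gradient, 1.0);
          } else {
            model.applySynchronized(gradient, 1.0);
          }
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
        throw new IllegalStateException("trainers did not finish in time");
      }
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return model;
  }
}
```

In the synchronized variant every update is applied exactly once, at the cost of serializing writers. In the Hogwild variant threads never wait, but the final weights can differ from run to run when more than one thread writes.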

Implement multi-threaded Trainer.

I'll start by sending a PR that enables multi-threading in MLR, with the consistent (i.e., non-Hogwild) version first.

Debug Dolphin's algorithm correctness

@gyeongin Yes, both results were very similar.

Collect metrics in Dolphin on ET

Totally agreed! I'll prepare a draft and share it in this thread. Thanks for the great suggestion!

Collect metrics in Dolphin on ET

I'm sharing the draft: ![image](https://cloud.githubusercontent.com/assets/1748276/25069621/82ff090c-22c1-11e7-98f2-26a43fb0574a.png) You can find the original file [here](https://docs.google.com/presentation/d/1j3X9bWRjzarhjwlOI1S7mHkAhUQ1uhQCORKyioRYdfc/edit#slide=id.p), and I would appreciate any comments or feedback. Thanks!

Introduce a worker-side component that aggregates partial updates

We can first implement the simplest version, which keeps all the updates until it is asked to aggregate them. Then we can improve it by aggregating beforehand; for example, when the number...
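A rough sketch of both variants, assuming updates are per-key `double` deltas and that aggregation means summing them. `UpdateBuffer` and `EagerUpdateBuffer` are hypothetical names. The eager variant simply merges on every arrival, which is one possible reading of "aggregating beforehand" and not necessarily the condition meant above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplest version (hypothetical): keeps every partial update until asked. */
final class UpdateBuffer {
  private final Map<Integer, List<Double>> pending = new HashMap<>();

  /** Stores the update as-is; no aggregation happens here. */
  synchronized void add(final int key, final double delta) {
    pending.computeIfAbsent(key, k -> new ArrayList<>()).add(delta);
  }

  /** Sums the buffered updates per key, then clears the buffer. */
  synchronized Map<Integer, Double> aggregate() {
    final Map<Integer, Double> result = new HashMap<>();
    for (final Map.Entry<Integer, List<Double>> entry : pending.entrySet()) {
      double sum = 0.0;
      for (final double delta : entry.getValue()) {
        sum += delta;
      }
      result.put(entry.getKey(), sum);
    }
    pending.clear();
    return result;
  }
}

/** Eager variant (assumption): merges each update into a running sum on arrival. */
final class EagerUpdateBuffer {
  private final Map<Integer, Double> sums = new HashMap<>();

  synchronized void add(final int key, final double delta) {
    sums.merge(key, delta, Double::sum);
  }

  /** Returns the running sums and resets them. */
  synchronized Map<Integer, Double> aggregate() {
    final Map<Integer, Double> result = new HashMap<>(sums);
    sums.clear();
    return result;
  }
}
```

The first version uses memory proportional to the number of buffered updates; the eager one keeps a single value per key but does the merge work on the update path.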

Implement Checkpoint in Dolphin

Two strategies are possible for checkpointing: 1) stop-the-world and 2) asynchronous. There is a clear trade-off between performance and correctness.
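To make the trade-off concrete, here is a small sketch that assumes "asynchronous" means copying the table while updates keep running. `CheckpointSketch` and its methods are hypothetical and only illustrate the idea; they are not the Dolphin implementation.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical table whose updates touch every cell under one lock. */
final class CheckpointSketch {
  private final AtomicLongArray table;
  private final Object updateLock = new Object();

  CheckpointSketch(final int size) {
    this.table = new AtomicLongArray(size);
  }

  /** An update adds delta to every cell, as one logical step. */
  void update(final long delta) {
    synchronized (updateLock) {
      for (int i = 0; i < table.length(); i++) {
        table.addAndGet(i, delta);
      }
    }
  }

  /** 1) Stop-the-world: blocks updates, so the snapshot is consistent. */
  long[] checkpointStopTheWorld() {
    synchronized (updateLock) {
      return copy();
    }
  }

  /** 2) Asynchronous: never blocks updates, but may capture a half-applied update. */
  long[] checkpointAsync() {
    return copy();
  }

  private long[] copy() {
    final long[] snapshot = new long[table.length()];
    for (int i = 0; i < snapshot.length; i++) {
      snapshot[i] = table.get(i);
    }
    return snapshot;
  }
}
```

Under concurrent updates, the stop-the-world snapshot always has equal cells, while the asynchronous one can mix cells from before and after an update; in exchange, the asynchronous path never makes updaters wait.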