Aurick Qiao

Results: 10 issues by Aurick Qiao

The AdaptDLJob controller sits on the critical path of every job's lifecycle, so it can be subject to unexpected side effects from interactions with diverse Kubernetes environments, failures, race conditions, etc...

This might be implemented by adding an additional state, "Suspended". A job can be suspended by setting `spec.suspended=true`, and unsuspended later by setting `spec.suspended=false` or `spec.suspended=nil`.

enhancement
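The suspend/resume behavior described above can be sketched as a small phase-transition function. This is a hypothetical illustration: the phase names, the `next_phase` helper, and the field layout are assumptions for this sketch, not AdaptDL's actual controller API.

```python
# Hypothetical sketch of how a reconcile step might honor spec.suspended.
# Phase names and the helper itself are illustrative, not AdaptDL's real API.

def next_phase(current_phase: str, suspended) -> str:
    """Return the phase an AdaptDLJob should transition to.

    suspended=True moves any non-terminal job to "Suspended";
    suspended=False or None resumes a suspended job to "Pending"
    so the scheduler can re-admit it.
    """
    terminal = {"Succeeded", "Failed"}
    if current_phase in terminal:
        return current_phase  # never resurrect finished jobs
    if suspended:
        return "Suspended"
    if current_phase == "Suspended":
        return "Pending"  # resume: re-enter the scheduling queue
    return current_phase
```

One design point worth noting: treating `nil` the same as `false` (as the issue proposes) keeps existing AdaptDLJob manifests, which omit the field entirely, running unchanged.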

Make it easy to install AdaptDL with KubeFlow and run AdaptDLJobs as part of DL workflows on KubeFlow.

enhancement

torchtext (as of 0.4.0) adopts `torch.utils.data.DataLoader`, and its older iterator interface is deprecated. Ensure AdaptDL's `AdaptiveDataLoader` supports this new torchtext data-loading interface, and port the example transformer code...

enhancement
good first issue
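In the `DataLoader`-based torchtext interface, batching of variable-length text is handled by a `collate_fn` rather than by the old `BucketIterator`. A minimal pure-Python illustration of such a padding collate (using lists of token ids in place of tensors for clarity; a real `collate_fn` would return `torch` tensors, and `PAD_ID` is an assumed padding token id):

```python
# Illustrative padding collate for variable-length token-id sequences.
# Pure Python for clarity; a real collate_fn would build torch tensors.

PAD_ID = 0  # assumed padding token id

def pad_collate(batch):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
```

A function like this would be passed as `collate_fn=pad_collate` when constructing the `DataLoader`; if `AdaptiveDataLoader` mirrors the `torch.utils.data.DataLoader` signature (an assumption here), the same argument should carry over.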

The AdaptDL scheduler currently assumes that no resource quotas are set on namespaces running AdaptDLJobs, and its behavior is undefined when this assumption does not hold. Since resource quotas are a common...

enhancement
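One way a quota-aware scheduler could behave is to clamp a job's replica count so its aggregate request fits within the namespace's remaining quota. The following is a minimal sketch under that assumption; the function name and dict-based resource representation are invented for illustration, not AdaptDL's actual scheduler code.

```python
# Hypothetical sketch: clamp a job's replica count so that its total
# resource request fits inside the namespace's remaining quota.

def max_replicas_under_quota(per_replica: dict, remaining_quota: dict,
                             desired: int) -> int:
    """Return the largest replica count <= desired whose aggregate request
    fits in remaining_quota, checked per resource (e.g. cpu/gpu/memory)."""
    feasible = desired
    for resource, request in per_replica.items():
        if request <= 0:
            continue  # zero request never constrains the count
        cap = int(remaining_quota.get(resource, 0) // request)
        feasible = min(feasible, cap)
    return max(feasible, 0)
```

Returning 0 (rather than raising) when even one replica cannot fit lets the caller keep the job queued until quota frees up.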

Write a documentation page describing usage of the AdaptDLJob custom resource. Include information on what is automatically configured by the AdaptDL CLI and what needs to be manually configured if...

documentation

Currently the AdaptDL-PyTorch document is a sequence of 6-7 steps; it would be more readable reorganized into 3 sections. The first section would cover "simple" usage of only `AdaptiveDataParallel`, `AdaptiveDataLoader`, and...

documentation


This PR implements a new model loader that directly loads the sharded states of each worker when using `DistributedGPUExecutor`. When using tensor parallelism, this avoids each worker reading the full...
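The core idea can be illustrated with a toy version of per-rank shard loading: under tensor parallelism, each worker slices out only its own partition of each sharded weight instead of reading the whole tensor. This is a minimal sketch, not vLLM's actual loader; plain lists stand in for tensors and the even-split-along-dim-0 layout is an assumption.

```python
# Minimal illustration (not vLLM's actual loader): each tensor-parallel
# rank extracts only its own shard of each sharded weight, so no worker
# ever materializes the full tensor.

def load_shard(state: dict, rank: int, world_size: int, sharded_keys):
    """Return rank's view of `state`: keys in sharded_keys are split
    evenly along dim 0 (lists stand in for tensors); other keys are
    replicated unchanged on every rank."""
    out = {}
    for key, value in state.items():
        if key in sharded_keys:
            n = len(value)
            assert n % world_size == 0, "shardable dim must divide evenly"
            step = n // world_size
            out[key] = value[rank * step:(rank + 1) * step]
        else:
            out[key] = value
    return out
```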

Make LoRA work with chunked prefill, taking the same approach as https://github.com/vllm-project/vllm/pull/4994 but updated to the latest code. This also required modifying how the scheduler sorts requests so that prefill sequences...
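One plausible shape for the sorting change is to group waiting prefill requests by LoRA adapter, so each adapter's weights are activated once per group rather than swapped on every request. The sketch below is a guess at that intent, not vLLM's actual scheduler code; requests are modeled as `(arrival_order, lora_id)` tuples, with `lora_id=None` meaning the base model.

```python
# Hypothetical sketch: stable-sort waiting prefill requests so that
# requests sharing a LoRA adapter are adjacent, with base-model
# (lora_id=None) requests first and arrival order preserved per group.

def sort_prefills(requests):
    """requests: list of (arrival_order, lora_id) tuples, in arrival order.
    sorted() is stable, so ties within an adapter keep arrival order."""
    return sorted(requests, key=lambda r: (r[1] is not None, r[1] or 0))
```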