Aurick Qiao

Results: 10 issues by Aurick Qiao

The AdaptDLJob controller sits on the critical path of every job's lifecycle, so it can be subject to unexpected side effects from interactions with diverse Kubernetes environments, failures, race conditions, etc...

This might be implemented by adding an additional state, "Suspended". A job can be suspended by setting `spec.suspended=true`, and unsuspended later by setting `spec.suspended=false` or `spec.suspended=nil`.

enhancement
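The suspend/resume behavior described above can be sketched as a small phase-transition function. This is a hypothetical illustration: the phase names, the `next_phase` helper, and the field layout are assumptions for this sketch, not AdaptDL's actual controller API.

```python
# Hypothetical sketch of how a reconcile step might honor spec.suspended.
# Phase names and the helper itself are illustrative, not AdaptDL's real API.

def next_phase(current_phase: str, suspended) -> str:
    """Return the phase an AdaptDLJob should transition to.

    suspended=True moves any non-terminal job to "Suspended";
    suspended=False or None resumes a suspended job to "Pending"
    so the scheduler can re-admit it.
    """
    terminal = {"Succeeded", "Failed"}
    if current_phase in terminal:
        return current_phase  # never resurrect finished jobs
    if suspended:
        return "Suspended"
    if current_phase == "Suspended":
        return "Pending"  # resume: re-enter the scheduling queue
    return current_phase
```

One design point worth noting: treating `nil` the same as `false` (as the issue proposes) keeps existing AdaptDLJob manifests, which omit the field entirely, running unchanged.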

Make it easy to install AdaptDL with KubeFlow and run AdaptDLJobs as part of DL workflows on KubeFlow.

enhancement

torchtext (as of 0.4.0) adopts `torch.utils.data.DataLoader`, and its older iterator interface is deprecated. Ensure AdaptDL's `AdaptiveDataLoader` supports this new torchtext data-loading interface, and port the example transformer code...

enhancement
good first issue
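In the `DataLoader`-based torchtext interface, batching of variable-length text is handled by a `collate_fn` rather than by the old `BucketIterator`. A minimal pure-Python illustration of such a padding collate (using lists of token ids in place of tensors for clarity; a real `collate_fn` would return `torch` tensors, and `PAD_ID` is an assumed padding token id):

```python
# Illustrative padding collate for variable-length token-id sequences.
# Pure Python for clarity; a real collate_fn would build torch tensors.

PAD_ID = 0  # assumed padding token id

def pad_collate(batch):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
```

A function like this would be passed as `collate_fn=pad_collate` when constructing the `DataLoader`; if `AdaptiveDataLoader` mirrors the `torch.utils.data.DataLoader` signature (an assumption here), the same argument should carry over.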

The AdaptDL scheduler currently assumes that no resource quotas are set on namespaces running AdaptDLJobs, and its behavior is undefined when this assumption does not hold. Since resource quotas are a common...

enhancement
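One way a quota-aware scheduler could behave is to clamp a job's replica count so its aggregate request fits within the namespace's remaining quota. The following is a minimal sketch under that assumption; the function name and dict-based resource representation are invented for illustration, not AdaptDL's actual scheduler code.

```python
# Hypothetical sketch: clamp a job's replica count so that its total
# resource request fits inside the namespace's remaining quota.

def max_replicas_under_quota(per_replica: dict, remaining_quota: dict,
                             desired: int) -> int:
    """Return the largest replica count <= desired whose aggregate request
    fits in remaining_quota, checked per resource (e.g. cpu/gpu/memory)."""
    feasible = desired
    for resource, request in per_replica.items():
        if request <= 0:
            continue  # zero request never constrains the count
        cap = int(remaining_quota.get(resource, 0) // request)
        feasible = min(feasible, cap)
    return max(feasible, 0)
```

Returning 0 (rather than raising) when even one replica cannot fit lets the caller keep the job queued until quota frees up.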

Write a documentation page describing usage of the AdaptDLJob custom resource. Include information on what is automatically configured by the AdaptDL CLI and what needs to be manually configured if...

documentation

Currently the AdaptDL-PyTorch document is a sequence of 6-7 steps; it would be more readable reorganized into 3 sections. The first section would cover "simple" usage of only `AdaptiveDataParallel`, `AdaptiveDataLoader`, and...

documentation


This PR implements a new model loader that directly loads the sharded states of each worker when using `DistributedGPUExecutor`. When using tensor parallelism, this avoids each worker reading the full...
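The core idea can be illustrated with a toy version of per-rank shard loading: under tensor parallelism, each worker slices out only its own partition of each sharded weight instead of reading the whole tensor. This is a minimal sketch, not vLLM's actual loader; plain lists stand in for tensors and the even-split-along-dim-0 layout is an assumption.

```python
# Minimal illustration (not vLLM's actual loader): each tensor-parallel
# rank extracts only its own shard of each sharded weight, so no worker
# ever materializes the full tensor.

def load_shard(state: dict, rank: int, world_size: int, sharded_keys):
    """Return rank's view of `state`: keys in sharded_keys are split
    evenly along dim 0 (lists stand in for tensors); other keys are
    replicated unchanged on every rank."""
    out = {}
    for key, value in state.items():
        if key in sharded_keys:
            n = len(value)
            assert n % world_size == 0, "shardable dim must divide evenly"
            step = n // world_size
            out[key] = value[rank * step:(rank + 1) * step]
        else:
            out[key] = value
    return out
```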

Make LoRA work with chunked prefill, taking the same approach as https://github.com/vllm-project/vllm/pull/4994 but updated to the latest code. This also required modifying how the scheduler sorts requests so that prefill sequences...
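One plausible shape for the sorting change is to group waiting prefill requests by LoRA adapter, so each adapter's weights are activated once per group rather than swapped on every request. The sketch below is a guess at that intent, not vLLM's actual scheduler code; requests are modeled as `(arrival_order, lora_id)` tuples, with `lora_id=None` meaning the base model.

```python
# Hypothetical sketch: stable-sort waiting prefill requests so that
# requests sharing a LoRA adapter are adjacent, with base-model
# (lora_id=None) requests first and arrival order preserved per group.

def sort_prefills(requests):
    """requests: list of (arrival_order, lora_id) tuples, in arrival order.
    sorted() is stable, so ties within an adapter keep arrival order."""
    return sorted(requests, key=lambda r: (r[1] is not None, r[1] or 0))
```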