cortex
Production infrastructure for machine learning at scale
#### Description
This has to be done in `debug.sh`, for example by running something like `eksctl get nodegroup --cluster=$CORTEX_CLUSTER_NAME --region=$CORTEX_REGION -o json > nodegroups.json`. Inspired by [this commit](https://github.com/cortexlabs/cortex/commit/010221f43bc1775581353b4b8a61ffe6343243e7#diff-1fbad9bd9765a163aa3a48dee78a86ffb1d260e73c86b33bae623a2ac692b33b) where `debug.sh`...
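A minimal sketch of how the command above might be wrapped for `debug.sh`. The function name `dump_nodegroups` and its output-directory argument are hypothetical and not taken from the existing script; only the `eksctl get nodegroup` invocation comes from the description.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the name and the output-directory argument are
# assumptions, not part of the existing debug.sh.
dump_nodegroups() {
  out_dir="${1:-.}"

  # Fail early if the cluster name or region is missing.
  : "${CORTEX_CLUSTER_NAME:?CORTEX_CLUSTER_NAME must be set}"
  : "${CORTEX_REGION:?CORTEX_REGION must be set}"

  mkdir -p "$out_dir"

  # Write the node group description as JSON next to the other debug output.
  eksctl get nodegroup \
    --cluster="$CORTEX_CLUSTER_NAME" \
    --region="$CORTEX_REGION" \
    -o json > "$out_dir/nodegroups.json"
}
```

Calling `dump_nodegroups ./debug-output` would then leave `nodegroups.json` in that directory, assuming `eksctl` is installed and has credentials for the cluster.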
### Description
Consider using [NLB IP mode](https://kubernetes-sigs.github.io/aws-load-balancer-controller/guide/service/nlb_ip_mode/) with `externalTrafficPolicy: Local`. This would eliminate an extra network hop.
Related tickets:
* https://github.com/kubernetes/ingress-nginx/issues/6828
* https://github.com/kubernetes/cloud-provider-aws/issues/87
#### Description
As part of this ticket, it is also necessary to check that the provided volume type is available in the selected region (e.g. us-west-2). More on io2 volume type here...
#### Description
Given the following list of node groups from a cluster config:
```yaml
node_groups:
  - name: A
    instance_type: t3.medium
  - name: B
    instance_type: c5.xlarge
  - name: C
    instance_type: c5.4xlarge
...
```
Allow users to perform a get operation for a specific API kind. This will reduce the burden on the operator.
Notes:
- `apiKind` can be passed as a filter flag in the CLI...
#### Implementation notes
* It will require installing the [AMD GPU device plugin](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#deploying-amd-gpu-device-plugin).
* The user will need to specify that they are requesting AMD GPUs (e.g. `gpu_amd` in the `compute`...
Investigate the possibility of relying on the CRI to retry pulling Docker images from different hosts (e.g. try Quay and then Docker Hub). Determine which CRI is used by Amazon Linux 2 AMIs....
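One way to check which container runtime the nodes are running is to read the runtime version that each kubelet reports. The sketch below assumes `kubectl` is already configured for the cluster; the function name `list_node_runtimes` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints each node's name and the container runtime
# reported by its kubelet (a value such as docker://... or containerd://...).
list_node_runtimes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
}
```

The same information appears in the `CONTAINER-RUNTIME` column of `kubectl get nodes -o wide`.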
### Description
Add the ability (ideally supported out of the box) to retain metrics data from Prometheus for long periods of time. It would probably be best to make the retention period configurable....
### Implementation notes
This could be achieved by adding a field to the TaskAPI (e.g. `cron`), which would result in a task being submitted on the specified schedule. Another alternative...