
Production infrastructure for machine learning at scale

Results 121 cortex issues

#### Description Has to be done in `debug.sh`. Could be done by running something like `eksctl get nodegroup --cluster=$CORTEX_CLUSTER_NAME --region=$CORTEX_REGION -o json > nodegroups.json`. Inspired by [this commit](https://github.com/cortexlabs/cortex/commit/010221f43bc1775581353b4b8a61ffe6343243e7#diff-1fbad9bd9765a163aa3a48dee78a86ffb1d260e73c86b33bae623a2ac692b33b) where `debug.sh`...

enhancement
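As a rough illustration of the node group dump described in the ticket above, the following Python sketch builds the same `eksctl get nodegroup` invocation and writes its output to `nodegroups.json`. The ticket asks for the change in `debug.sh`, so this is only a stand-in for the shell version; the function names `nodegroup_command` and `dump_nodegroups` are hypothetical, and the sketch assumes `eksctl` is on the `PATH` and that `CORTEX_CLUSTER_NAME` and `CORTEX_REGION` are set.

```python
import json
import os
import subprocess
from typing import List


def nodegroup_command(cluster_name: str, region: str) -> List[str]:
    """Build the eksctl invocation quoted in the ticket as an argument list."""
    return [
        "eksctl", "get", "nodegroup",
        f"--cluster={cluster_name}",
        f"--region={region}",
        "-o", "json",
    ]


def dump_nodegroups(path: str = "nodegroups.json") -> None:
    """Run eksctl and save its JSON output (needs eksctl and AWS credentials)."""
    cmd = nodegroup_command(
        os.environ["CORTEX_CLUSTER_NAME"],
        os.environ["CORTEX_REGION"],
    )
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    json.loads(result.stdout)  # fail early if the output is not valid JSON
    with open(path, "w") as f:
        f.write(result.stdout)
```

Passing the command as an argument list avoids shell quoting issues if the cluster name contains unusual characters.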

### Description Consider using [NLB IP mode](https://kubernetes-sigs.github.io/aws-load-balancer-controller/guide/service/nlb_ip_mode/) with `externalTrafficPolicy: Local`. This would remove an extra network hop. Related tickets: * https://github.com/kubernetes/ingress-nginx/issues/6828 * https://github.com/kubernetes/cloud-provider-aws/issues/87

enhancement

#### Description As part of this ticket, checking that the provided volume type is available in the selected region (e.g. us-west-2) is also required. More on io2 volume type here...

enhancement

#### Description Given the following list of node groups from a cluster config:

```yaml
node_groups:
  - name: A
    instance_type: t3.medium
  - name: B
    instance_type: c5.xlarge
  - name: C
    instance_type: c5.4xlarge
...
```

bug
performance
research

Allow users to perform a get operation for a specific API kind. This will reduce the burden on the operator. Notes: - apiKind can be a filter flag in the CLI...
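The kind filter in the request above can be pictured with a minimal Python sketch. Everything here is an assumption for illustration: the `APISummary` type, the `filter_by_kind` helper, and the sample API names are hypothetical and not taken from the Cortex codebase. To actually lighten the operator's load, the filter would need to be applied before the operator gathers per-API status, not after the full list has been built.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class APISummary:
    """Hypothetical summary record for a deployed API."""
    name: str
    kind: str


def filter_by_kind(apis: List[APISummary], kind: Optional[str] = None) -> List[APISummary]:
    """Return only APIs of the requested kind; no kind means no filtering."""
    if kind is None:
        return list(apis)
    wanted = kind.lower()  # compare case-insensitively so `taskapi` also matches
    return [api for api in apis if api.kind.lower() == wanted]


# Sample data, invented for this example.
apis = [
    APISummary("classifier", "RealtimeAPI"),
    APISummary("nightly-export", "TaskAPI"),
    APISummary("bulk-scorer", "BatchAPI"),
]
task_apis = filter_by_kind(apis, "TaskAPI")
```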

#### Implementation notes * It will require installing the [AMD GPU device plugin](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#deploying-amd-gpu-device-plugin). * The user will need to specify that they are requesting AMD GPUs (e.g. `gpu_amd` in the `compute`...

enhancement

Investigate the possibility of relying on the CRI to retry pulling Docker images from different hosts (e.g. try Quay and then Docker Hub). Determine which CRI is used by Amazon Linux 2 AMIs....

blocked
research
ux

### Description Allow (or support out-of-the-box) the ability to retain metrics data from Prometheus for long periods of time. It would probably be best to make the retention period configurable....

enhancement
metrics
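One way to picture the configurable retention period from the ticket above is a helper that turns a user-supplied value into Prometheus's `--storage.tsdb.retention.time` flag. This is a sketch under stated assumptions: the `retention_args` function is hypothetical, the default of `15d` mirrors Prometheus's own default, and the validation accepts only a single number-plus-unit value, which is narrower than what Prometheus itself parses. How Cortex would wire this into its cluster configuration is not stated in the ticket.

```python
import re
from typing import List

# Simplified check: one positive integer followed by one Prometheus duration unit.
_DURATION = re.compile(r"^[1-9][0-9]*(ms|s|m|h|d|w|y)$")


def retention_args(retention: str = "15d") -> List[str]:
    """Build the Prometheus retention flag from a configurable duration string."""
    if not _DURATION.match(retention):
        raise ValueError(f"invalid retention period: {retention!r}")
    return [f"--storage.tsdb.retention.time={retention}"]
```

Rejecting malformed values at config time is cheaper than discovering them when the Prometheus container fails to start.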

### Implementation notes This could be achieved by adding a field to the TaskAPI (e.g. `cron`), which would result in a task being submitted on the specified schedule. Another alternative...

enhancement
TaskAPI