Junsang Yoo issues

Results: 10 issues by Junsang Yoo

Is ConnectX-6 supported?

Hello, I'm currently using a Mellanox ConnectX-6 adapter (HPE InfiniBand HDR/Ethernet 200Gb 2-port QSFP56 PCIe4 x16 MCX653106A-HDAT Adapter) and trying to use sriov-network-metrics-exporter in a Kubernetes cluster, but the sriov-network-metrics-exporter Pods can't...

Capacity and Allocatable number shows wrong if sriov-network-device-plugin restarts

### What happened? The node's `Capacity` and `Allocatable` values are reported incorrectly after `sriov-network-device-plugin` restarts while any pods have SR-IOV IB VFs attached. * Before the restart ``` Capacity: ..... openshift.io/gpu_mlnx_ib0: 8...

bug

Sidecar: After a block upload was delayed, all subsequent block uploads were delayed as well

**Thanos, Prometheus and Golang version used**: 1. Thanos: `thanos:0.35.1-debian-12-r1` (from `thanos-15.7.10` chart) 2. Prometheus: `2.53` (from `kube-prometheus-stack-60.3.0` chart) **Object Storage Provider**: AWS **What happened**: I found that some blocks weren't...

[BUG] CNIs are attaching very slow while deploying large scale deployment

**What happened**: CNIs attach very slowly when rolling out a large-scale deployment, such as 2,400 Pods across 300 Nodes. I configured 8 Pods per node, and each Pod...

node-feature-discovery sends excessive LIST requests to the API server

**What happened**: node-feature-discovery of gpu-operator sends excessive LIST requests to the API server **What you expected to happen**: Recently I received several alerts from the K8s cluster reporting that the API...

kind/bug

[BUG] Attach CNI is very slow in large scale deployment

**Describe the bug** Hello, whereabouts team. I'm facing an issue when deploying a large-scale workload (e.g. a DaemonSet or Deployment with 600 Pods or more) with SR-IOV VFs....

stale

GPU resources are not recovered even XID error is resolved

Hello, NVIDIA team. I recently ran into an issue where the GPU resources (`nvidia.com/gpu`) reported by `kubelet` are not recovered (e.g. 7 -> 8) even after the XID error is resolved....

node-feature-discovery of gpu-operator sends excessive LIST requests to the API server

### 1. Quick Debug Information * OS/Version(e.g. RHEL8.6, Ubuntu22.04): **Ubuntu 20.04.4 LTS** * Kernel Version: **5.4.0-113-generic** * Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): **containerd://1.5.8** * K8s Flavor/Version(e.g. K8s, OCP, Rancher,...

GPU resources are not recovered even XID error is resolved

Hello, NVIDIA team. I recently ran into an issue where the GPU resources (`nvidia.com/gpu`) reported by `kubelet` are not recovered (e.g. 7 -> 8) even after the XID error is resolved....

[Feature Request] Add hostNetwork mode for dcgmExporter

Hello, NVIDIA Team. I'm facing an issue while configuring `dcgm-exporter` from `gpu-operator`. I have 2 Kubernetes clusters - one is a cluster where GPU jobs run, and the other is...

good-first-issue