
Underutilized Instances Recommendation

Open nadeem-nasir opened this issue 2 months ago • 6 comments

I’ve gone through the documentation, but I’m still unclear on the exact requirements. I’m encountering the same situation as mentioned in issue #445.

Question 1: What permissions are required for these service credentials to work correctly?

Question 2: My Azure subscriptions span multiple regions. 👉 Will OptScale fetch data from all regions, or is it limited to one region only? I went through the documentation, but it does not specify which role to assign in Azure. The documentation states:

  • Pay attention to the service_credentials parameter, as OptScale uses it to retrieve cloud pricing data for recommendations calculation.

  • Service credentials are required to fetch pricing information from different clouds.

  • Recommendations will not work without this configuration.

  • For recommendations to function, service_credentials must be set correctly.
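For context, the service_credentials parameter lives in overlay/user_template.yml. A rough sketch of the Azure part is shown below; the exact key names are an assumption (mirroring the fields an Azure data source asks for), so the user_template.yml shipped with the release is authoritative:

service_credentials:
  azure:
    subscription_id: <subscription id>       # assumed field names; check the shipped template
    client_id: <app registration client id>
    tenant: <tenant id>
    secret: <client secret>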

nadeem-nasir avatar Nov 07 '25 12:11 nadeem-nasir

Hello @nadeem-nasir

  1. You need to assign the Reader role (see the example below).
  2. OptScale extracts data from all regions.
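
For illustration, assigning Reader at subscription scope with the Azure CLI looks roughly like this (all IDs are placeholders):

az role assignment create \
  --assignee <client-id-of-the-app-registration> \
  --role Reader \
  --scope /subscriptions/<subscription-id>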

VR-Hystax avatar Nov 10 '25 04:11 VR-Hystax

Hi @VR-Hystax

Thank you for the reply.

I followed these steps, and the overlay has been updated successfully along with the service credentials:

virtualenv -p python3 .venv
source .venv/bin/activate

source ~/.profile

nano overlay/user_template.yml   # edited the YAML file

./runkube.py --no-pull -o overlay/user_template.yml -- nadeem-optscale-deployment latest


Deployment output:


% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 76359    0 76359    0     0   242k      0 --:--:-- --:--:-- --:--:--  242k
09:39:20.039: Latest release tag: 2025102901-public
09:39:20.063: Connecting to ctd daemon 10.1.0.4:2376
09:39:20.063: Comparing local images for 10.1.0.4
09:39:25.911: Generating base overlay...
09:39:25.916: Connecting to ctd daemon 10.1.0.4:2376
09:39:27.905: Creating component_versions.yaml file to insert it into configmap
09:39:27.911: Deleting /configured key
09:39:28.005: Removing old job pre-configurator...
09:39:28.012: Waiting for job deletion...
09:39:28.418: Starting helm chart optscale with name nadeem-optscale-deployment on k8s cluster 10.1.0.4
Release "nadeem-optscale-deployment" has been upgraded. Happy Helming!
NAME: nadeem-optscale-deployment
LAST DEPLOYED: Mon Nov 10 09:39:28 2025
NAMESPACE: default
STATUS: deployed
REVISION: 2
TEST SUITE: None

I performed a force check and also ran kubectl rollout restart, but I still don’t see any Underutilized Instances recommendations.
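(The restart was roughly along these lines; the deployment name is an assumption inferred from the bumiworker pod name:)

kubectl -n default rollout restart deployment bumiworker
kubectl -n default rollout status deployment bumiworker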

Resources Details: [screenshot]

Data Sources: [screenshot]

Underutilized Instances: [screenshot]

Are there any additional steps required? When I first deployed the cluster, I didn’t update the service credentials. They have now been updated and are using the same configuration, with the Reader role assigned on the subscriptions. The same applies both for the data source and for the service credentials.

The "Instances eligible for generation upgrade", "Not attached Volumes", and "Obsolete IPs" recommendations are working.

nadeem-nasir avatar Nov 10 '25 12:11 nadeem-nasir

Hello @nadeem-nasir Please try adjusting the settings of this recommendation and check the bumiworker logs.

VR-Hystax avatar Nov 11 '25 07:11 VR-Hystax

@VR-Hystax Thank you for the help.

I modified the Rightsizing strategy, updated the cluster, triggered a force check, removed the data sources and added them again, but the issue still persists. The logs for bumiworker are attached. I retrieved them using: kubectl logs -n default bumiworker-6d9b7c6679-vvxhq

logs.txt

[screenshot]

nadeem-nasir avatar Nov 11 '25 11:11 nadeem-nasir

Hello @nadeem-nasir We found this entry in the logs:

Rightsizing_instances statistics for 33bf7b4d-7ce0-4372-930b-db9af8a77ee6 (azure_cnr): {'no_recommended_flavor': 5, 'unable_to_get_current_flavor': 100}

This means that the Insider service couldn't find a price for 100 machines. Please check that the service credentials exist and look at the insider-worker logs. If everything is ok there, then let's use the API:

GET https://<optscale_ip>/insider/v2/flavors (it uses the cluster secret)

{
  'cloud_type': 'azure_cnr',
  'resource_type': 'instance',
  'region': <region>,                              # example: 'Germany West Central'
  'family_specs': {'source_flavor_id': <flavor>},  # example: 'Standard_B1ms'
  'mode': 'current'
}
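
(If it helps, a hedged curl sketch of this call is below; the cluster-secret header name is an assumption and may differ in your deployment:)

curl -k -X GET "https://<optscale_ip>/insider/v2/flavors" \
  -H "Secret: <cluster secret>" \
  -H "Content-Type: application/json" \
  -d '{"cloud_type": "azure_cnr", "resource_type": "instance", "region": "Germany West Central", "family_specs": {"source_flavor_id": "Standard_B1ms"}, "mode": "current"}'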

After that, send us both the body of the request and the response from the insider.

VR-Hystax avatar Nov 12 '25 06:11 VR-Hystax

@VR-Hystax

I currently have about 140 VMs across my subscriptions. I verified the configuration using kubectl describe configmaps optscale-etcd and confirmed that the service credentials are present.

I also tested sending requests as suggested — it works with POST but not with GET. For certain family_specs I’m receiving empty results, while for others the data is returned correctly.

  • Getting results with data:

Standard_D2ads_v6 
Standard_E4-2s_v5
Standard_F32als_v6
Standard_F32als_v6
Standard_D2ads_v6
Standard_D4ads_v5
Standard_E8d_v4
Standard_D4as_v4
Standard_E4s_v3
Standard_E4s_v3
Standard_E4s_v4
Standard_E16s_v3
Standard_E4s_v3
[screenshot]

  • All of these family_specs are returning empty results:
Standard_B1m
Standard_B2s
Standard_F4
Standard_F8
Standard_F8s
Standard_DS12_v2
Standard_F4s_v2
Standard_DS12_v2
Standard_B12ms
Standard_B4ms
Standard_F8s
Standard_B12ms

[screenshot]
  • insider-worker logs attached

insider-worker logs.txt

I also found this entry: Rightsizing_instances statistics for 33bf7b4d-7ce0-4372-930b-db9af8a77ee6 (azure_cnr): {'no_recommended_cpu': 2, 'no_recommended_flavor': 3, 'unable_to_get_current_flavor': 100}. So the statistics now show 'no_recommended_flavor': 3, whereas the earlier entry showed 'no_recommended_flavor': 5.

nadeem-nasir avatar Nov 12 '25 11:11 nadeem-nasir

Hi @nadeem-nasir We will investigate this issue. As soon as we have any conclusions, we will let you know immediately.

dsup-hystax avatar Nov 20 '25 15:11 dsup-hystax

Hi @nadeem-nasir!

I’ve investigated the issue and can confirm that there are no problems on our side. The traceback file attached to this issue clarifies the root cause:

pymongo.errors.AutoReconnect: mongo-0.mongo-discovery.default.svc.cluster.local:27017: [Errno -3] Temporary failure in name resolution (configured timeouts: connectTimeoutMS: 20000.0ms)

This indicates a deployment-level problem. The Insider worker container is unable to resolve the DNS name, which prevents it from communicating with the MongoDB service. As a result, Insider doesn’t have the necessary information to correctly display the recommendation.

Please check your cluster’s DNS resolution and service configuration to ensure the worker container can reach the MongoDB endpoint.
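
For example, a quick generic check from inside the cluster (standard Kubernetes DNS debugging, nothing OptScale-specific) would be:

kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup mongo-0.mongo-discovery.default.svc.cluster.local
kubectl -n default get svc mongo-discovery
kubectl -n kube-system get pods -l k8s-app=kube-dns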

ida-mn avatar Nov 25 '25 12:11 ida-mn

@ida-mn Thank you for the update. I see that OptScale is giving recommendations for "Not attached Volumes" and "Obsolete IPs." Are these recommendations stored in the Mongo database? I usually keep the default settings and don't change much during installation. I can either re-deploy or update the DNS settings. Could you explain how to check the cluster's DNS resolution and service configuration? Since I don't have much data, I could also delete the VMs and redeploy everything. Please let me know which option would be best.

nadeem-nasir avatar Nov 26 '25 07:11 nadeem-nasir

Hello @nadeem-nasir I have forwarded your response to our engineering team. I will consult with the team and get back to you with recommendations as soon as possible.

VR-Hystax avatar Nov 27 '25 03:11 VR-Hystax

Hello @nadeem-nasir Please provide the output of these commands:

kubectl -n kube-system get pods | grep weave
kubectl -n kube-system logs weave-*** | grep -i error
kubectl -n kube-system get pods | grep coredns
kubectl -n kube-system logs coredns-*** | grep -i error

VR-Hystax avatar Nov 27 '25 22:11 VR-Hystax

@VR-Hystax Thank you for the pointers; here are the logs.

  • [screenshot]
  • [screenshot]
  • Weave logs:

for pod in $(kubectl -n kube-system get pods -o name | grep weave); do kubectl -n kube-system logs $pod | grep -i error; done

Defaulted container "weave" out of: weave, weave-npc, weave-init (init)
INFO: 2025/11/27 11:17:51.393264 Error checking version: Get "https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=6.14.0-1014-azure&os=linux&signature=0TmZpkEyUuEMAg9YHZAmzQxzarUiW%2BnrR2PjYtwkyOI%3D&version=2.8.1": EOF
INFO: 2025/11/27 17:09:36.138043 Error checking version: Get "https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=6.14.0-1014-azure&os=linux&signature=0TmZpkEyUuEMAg9YHZAmzQxzarUiW%2BnrR2PjYtwkyOI%3D&version=2.8.1": EOF
INFO: 2025/11/28 00:29:26.987583 Error checking version: Get "https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=6.14.0-1014-azure&os=linux&signature=0TmZpkEyUuEMAg9YHZAmzQxzarUiW%2BnrR2PjYtwkyOI%3D&version=2.8.1": EOF
INFO: 2025/11/28 06:02:27.216499 Error checking version: Get "https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=6.14.0-1014-azure&os=linux&signature=0TmZpkEyUuEMAg9YHZAmzQxzarUiW%2BnrR2PjYtwkyOI%3D&version=2.8.1": EOF

  • CoreDNS logs:

for pod in $(kubectl -n kube-system get pods -o name | grep coredns); do kubectl -n kube-system logs $pod | grep -i error; done

[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[INFO] plugin/kubernetes: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Namespace ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
[ERROR] plugin/errors: 2 checkpoint-api.weave.works. A: read udp 10.254.0.33:58290->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 checkpoint-api.weave.works. A: read udp 10.254.0.33:40593->168.63.129.16:53: i/o timeout

Please note that I changed kubectl -n kube-system logs weave-*** | grep -i error and kubectl -n kube-system logs coredns-*** | grep -i error, because they returned "error: error from server (NotFound): pods "coredns-***" not found in namespace "kube-system"". Instead I used for pod in $(kubectl -n kube-system get pods -o name | grep weave); do kubectl -n kube-system logs $pod | grep -i error; done (and the equivalent loop for coredns).

nadeem-nasir avatar Nov 28 '25 08:11 nadeem-nasir

@nadeem-nasir Thank you! It looks like your weave plugin is not working properly. Try restarting it and check whether the errors appear again:

kubectl -n kube-system delete pod -l name=weave-net
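
(After the restart, it is worth confirming the pods come back Ready and re-running the in-cluster DNS check, e.g.:)

kubectl -n kube-system get pods -l name=weave-net
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup mongo-0.mongo-discovery.default.svc.cluster.local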

sd-hystax avatar Nov 28 '25 08:11 sd-hystax

@sd-hystax I deleted the pods as you suggested and collected the logs again with the same loop:

for pod in $(kubectl -n kube-system get pods -o name | grep weave); do kubectl -n kube-system logs $pod | grep -i error; done

Defaulted container "weave" out of: weave, weave-npc, weave-init (init)
INFO: 2025/11/30 12:48:20.782229 Error checking version: Get "https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=6.14.0-1014-azure&os=linux&signature=0TmZpkEyUuEMAg9YHZAmzQxzarUiW%2BnrR2PjYtwkyOI%3D&version=2.8.1": EOF

The weave pod is still logging this error.

nadeem-nasir avatar Nov 30 '25 12:11 nadeem-nasir

Hello @nadeem-nasir We will investigate your issue. I'll let you know as soon as I get any conclusions.

VR-Hystax avatar Dec 01 '25 01:12 VR-Hystax