clouddriver icon indicating copy to clipboard operation
clouddriver copied to clipboard

feat(aws): Improving AWS EC2 instance types API integration and caching, feat(aws): Adding archi type to images API and caching

Open pdk27 opened this issue 3 years ago • 15 comments

Changes:

  • Improving AWS EC2 instance types API integration and caching:
    • Replaced AWS EC2 pricing docs integration with AWS EC2 describe-instance-types API. Pricing docs precedes the API. Instance types retrieved via API are up-to-date and include other associated metadata.
    • Added instance type information to the cache.
  • Adding architecture type to images API and caching

Clouddriver API response with changes:

Use cases / reasoning for the changes:

  • Deck - filtering out incompatible instance types
  • Deck - display instance type info in instance type selector drop down and support filtering by metadata (PR to follow) See draft PR and demo in https://github.com/spinnaker/deck/pull/9793

Instructions to be included in release (preview) notes:

  • As part of upgrading to a Spinnaker version that includes these changes, repopulate caches by running hal deploy apply --flush-infrastructure-caches.

pdk27 avatar Dec 21 '21 21:12 pdk27

Reviewers, please also review the Instructions to be included in release (preview) notes in PR overview.

pdk27 avatar Dec 21 '21 21:12 pdk27

Haven't had a chance to look at the code yet, but for the release note - not everybody deploys Spinnaker using Halyard. Are there instructions for other installation methods (i.e. what is that flag actually doing?)

mattgogerly avatar Dec 21 '21 23:12 mattgogerly

Haven't had a chance to look at the code yet, but for the release note - not everybody deploys Spinnaker using Halyard. Are there instructions for other installation methods (i.e. what is that flag actually doing?)

Thanks for the callout @mattgogerly. That is where I need help :) I believe hal deploy apply --flush-infrastructure-caches refreshes the caches. Someone familiar with it can add more details. I will post about this PR and ask for help in the Slack channel too.

pdk27 avatar Dec 23 '21 17:12 pdk27

Haven't had a chance to look at the code yet, but for the release note - not everybody deploys Spinnaker using Halyard. Are there instructions for other installation methods (i.e. what is that flag actually doing?)

Thanks for the callout @mattgogerly. That is where I need help :) I believe hal deploy apply --flush-infrastructure-caches refreshes the caches. Someone familiar with it can add more details. I will post about this PR and ask for help in the Slack channel too.

All I can find is that it wipes the Redis instance, but that would only work if you're using the built in Redis. Why is it necessary to do that?

mattgogerly avatar Jan 20 '22 20:01 mattgogerly

Hi @pdk27, just wanted to follow up on this PR, I saw the clouddriver feature was merged (https://github.com/spinnaker/clouddriver/pull/5610) and want to ensure this gets in as well 👍

link108 avatar Mar 09 '22 18:03 link108

Hi @pdk27, just wanted to follow up on this PR, I saw the clouddriver feature was merged (#5610) and want to ensure this gets in as well 👍

Thanks for the heads-up @link108. Will address the comments here soon.

pdk27 avatar Mar 28 '22 22:03 pdk27

Haven't had a chance to look at the code yet, but for the release note - not everybody deploys Spinnaker using Halyard. Are there instructions for other installation methods (i.e. what is that flag actually doing?)

Thanks for the callout @mattgogerly. That is where I need help :) I believe hal deploy apply --flush-infrastructure-caches refreshes the caches. Someone familiar with it can add more details. I will post about this PR and ask for help in the Slack channel too.

All I can find is that it wipes the Redis instance, but that would only work if you're using the built in Redis. Why is it necessary to do that?

@mattgogerly Its needed to refresh the cache, and to be able to use the features in this PR.

pdk27 avatar Apr 04 '22 23:04 pdk27

Haven't had a chance to look at the code yet, but for the release note - not everybody deploys Spinnaker using Halyard. Are there instructions for other installation methods (i.e. what is that flag actually doing?)

Thanks for the callout @mattgogerly. That is where I need help :) I believe hal deploy apply --flush-infrastructure-caches refreshes the caches. Someone familiar with it can add more details. I will post about this PR and ask for help in the Slack channel too.

All I can find is that it wipes the Redis instance, but that would only work if you're using the built in Redis. Why is it necessary to do that?

@mattgogerly Its needed to refresh the cache, and to be able to use the features in this PR.

Does the cache not get updated by normal clouddriver caching? The equivalent to the hal command for SQL/non-Halyard would be to wipe the database, at least of AWS data, which isn't ideal operationally.

mattgogerly avatar Apr 05 '22 14:04 mattgogerly

Haven't had a chance to look at the code yet, but for the release note - not everybody deploys Spinnaker using Halyard. Are there instructions for other installation methods (i.e. what is that flag actually doing?)

Thanks for the callout @mattgogerly. That is where I need help :) I believe hal deploy apply --flush-infrastructure-caches refreshes the caches. Someone familiar with it can add more details. I will post about this PR and ask for help in the Slack channel too.

All I can find is that it wipes the Redis instance, but that would only work if you're using the built in Redis. Why is it necessary to do that?

@mattgogerly Its needed to refresh the cache, and to be able to use the features in this PR.

Does the cache not get updated by normal clouddriver caching? The equivalent to the hal command for SQL/non-Halyard would be to wipe the database, at least of AWS data, which isn't ideal operationally.

Gotcha! That doesn't sound ideal. As part of testing these changes, I couldn't see the new cache fields without flushing caches in my setup. Any ideas on how to confirm if flushing caches can be avoided? I will try a few things on my setup and update here.

pdk27 avatar Apr 05 '22 15:04 pdk27

Haven't had a chance to look at the code yet, but for the release note - not everybody deploys Spinnaker using Halyard. Are there instructions for other installation methods (i.e. what is that flag actually doing?)

Thanks for the callout @mattgogerly. That is where I need help :) I believe hal deploy apply --flush-infrastructure-caches refreshes the caches. Someone familiar with it can add more details. I will post about this PR and ask for help in the Slack channel too.

All I can find is that it wipes the Redis instance, but that would only work if you're using the built in Redis. Why is it necessary to do that?

@mattgogerly Its needed to refresh the cache, and to be able to use the features in this PR.

Does the cache not get updated by normal clouddriver caching? The equivalent to the hal command for SQL/non-Halyard would be to wipe the database, at least of AWS data, which isn't ideal operationally.

Gotcha! That doesn't sound ideal. As part of testing these changes, I couldn't see the new cache fields without flushing caches in my setup. Any ideas on how to confirm if flushing caches can be avoided? I will try a few things on my setup and update here.

I was able to verify in my setup that flushing caches is required in order to see the changes to the cached data. May be that is too extreme? Do you know if there is a softer option like a refresh? Will reach on Slack again to see if anyone has a solution - https://spinnakerteam.slack.com/archives/C091CCWRJ/p1649183109750109

Without running hal deploy apply --flush-infrastructure-caches, the images API response just didn't include the new architecture field and the response looked identical to one from before the change. I think the question here is related to cache refresh instructions (no matter what the change to the cache is / how cache is setup) and it doesn't need to block this PR from getting merged. I can followup and a note to release notes. Do you agree?

pdk27 avatar Apr 05 '22 18:04 pdk27

@pdk27

Hm, that seems weird (or a bug?) Tagging @german-muzquiz who might be more familiar with cache data schema updates

mattgogerly avatar Apr 07 '22 08:04 mattgogerly

I'm not too familiar with the AWS provider, but as a general principle the cache in clouddriver should be able to automatically rebuild itself if the underlying datastore is deleted (sql or redis), which according to the comments is working fine.

Then what I'm hearing is that the cache for instance types only gets populated once and then never gets updated. Probably that was expected in the beginning since instance types is not something that changes regularly, but then I think ideally this PR needs to automatically refresh that cache. If there are no caching agents for instance types, maybe refresh it upon startup or use something like last_modified fields to know that it should be refreshed?

german-muzquiz avatar Apr 13 '22 14:04 german-muzquiz

I'm not too familiar with the AWS provider, but as a general principle the cache in clouddriver should be able to automatically rebuild itself if the underlying datastore is deleted (sql or redis), which according to the comments is working fine.

Then what I'm hearing is that the cache for instance types only gets populated once and then never gets updated. Probably that was expected in the beginning since instance types is not something that changes regularly, but then I think ideally this PR needs to automatically refresh that cache. If there are no caching agents for instance types, maybe refresh it upon startup or use something like last_modified fields to know that it should be refreshed?

Thanks for the pointer @german-muzquiz. I will dig into this further and get back to you.

pdk27 avatar Apr 25 '22 21:04 pdk27

I'm not too familiar with the AWS provider, but as a general principle the cache in clouddriver should be able to automatically rebuild itself if the underlying datastore is deleted (sql or redis), which according to the comments is working fine. Then what I'm hearing is that the cache for instance types only gets populated once and then never gets updated. Probably that was expected in the beginning since instance types is not something that changes regularly, but then I think ideally this PR needs to automatically refresh that cache. If there are no caching agents for instance types, maybe refresh it upon startup or use something like last_modified fields to know that it should be refreshed?

Thanks for the pointer @german-muzquiz. I will dig into this further and get back to you.

@german-muzquiz Just to give you a brief context, there are 2 API related changes in this PR: instance type API, and images API

Within images API, there are 2 kinds: (1) find by name like http://localhost:8084/images/find?q=hello* (2) find by ID like http://localhost:8084/images/find?q=ami-23751835 The change in this PR includes adding a new attribute architecture to both the images APIs. I can see the new field in the API response for (2) without flushing the caches, but not in the API response for (1) which comes from namedImages, built in ImageCachingAgent. Flushing caches is non-ideal for certain types of cache setups as @mattgogerly pointed out that flushing caches means wiping the SQL/ non-Halyard database.

My observations:

  • Flushing infrastructure caches via hal deploy apply --service-names clouddriver --flush-infrastructure-caches is the only way I could get the public images in the caches to reload with the new architecture field. Re-deploying clouddriver didn't help.

Some questions we need help with:

  • Is this a problem in all types of cache setups? (I could verify the behavior only for Halyard + Redis cache) May be there are other ways to reload images cache in other setups, without the need for wiping the whole cache?
  • How are new attributes added to cached data in general?
  • With the given findings, how do we proceed?

pdk27 avatar Apr 27 '22 20:04 pdk27

@pdk27 I don't see anything wrong in the code leading to the issue of the architecture not showing up in http://localhost:8084/images/find?q=hello without flushing caches, the ImageCachingAgent should be rebuilding the cache each time for both named images and ami images.

My suggestion is to investigate as follows:

  • Start with an empty redis
  • Run clouddriver without your changes, so that images are initially cached without the architecture field.
  • Verify in redis that images don't have the architecture field. This can be done through redis-cli with a command like this:
mget "com.netflix.spinnaker.clouddriver.aws.provider.AwsProvider:namedImages:attributes:aws:namedImages:{account}:{image name}"

Where you replace {account} and {image name} for their respective values.

  • Verify that the request to clouddriver http://localhost:7002/aws/images/find?q={image name} doesn't include the architecture field
  • Stop clouddriver and then run it with your changes, without flushing the caches
  • Check again redis, this time the architecture field should have been stored
  • Check again the response of clouddriver REST API, this time the architecture field should be included

The best approach is to not having to delete the cache (redis or sql) in order for this change to work, because the images cache is rebuilt every 30 seconds anyway.

german-muzquiz avatar May 18 '22 20:05 german-muzquiz

Apologies for the delay following-up on this.

@mattgogerly @dbyron-sf I tried the instructions provided by @german-muzquiz and it worked. i.e. I didn't need to flush infrastructure caches to see the architecture type in response.

There is no other pending action from my side. Can you please help me identify the right reviewers for this PR?

pdk27 avatar Aug 22 '22 19:08 pdk27

Apologies for the delay following-up on this.

@mattgogerly @dbyron-sf I tried the instructions provided by @german-muzquiz and it worked. i.e. I didn't need to flush infrastructure caches to see the architecture type in response.

There is no other pending action from my side. Can you please help me identify the right reviewers for this PR?

Here is a demo of the testing done:

https://user-images.githubusercontent.com/3614196/186512114-f287d071-e397-4c83-acab-ec0501be70e9.mov

pdk27 avatar Aug 24 '22 20:08 pdk27

@dbyron-sf ✅ Updated release preview PR.

pdk27 avatar Aug 24 '22 20:08 pdk27