Use official endpoint for EC2 prices
EC2 prices scraping script upgrade
Use the official endpoint to get EC2 prices, so that we fetch as many prices as possible and they are as accurate as possible. This makes it possible to add prices for instances with SUSE, RHEL, and RHEL with High Availability.
Like the previous script, this one targets only OnDemand instance prices.
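For context, a minimal sketch of what fetching OnDemand prices from the official bulk offer file can look like, assuming the public AmazonEC2 `index.json` URL and streaming the multi-GB response with ijson instead of loading it at once (the helper names below are illustrative, not the script's actual API):

```python
# Hypothetical sketch: stream the official AWS bulk pricing offer file and
# iterate over OnDemand terms without holding the whole JSON in memory.
import ijson
import requests

EC2_OFFER_URL = (
    "https://pricing.us-east-1.amazonaws.com"
    "/offers/v1.0/aws/AmazonEC2/current/index.json"
)

def iter_ondemand_terms(url=EC2_OFFER_URL):
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        resp.raw.decode_content = True
        # "terms.OnDemand" holds the per-SKU OnDemand price dimensions
        for sku, term in ijson.kvitems(resp.raw, "terms.OnDemand"):
            yield sku, term
```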
Status
- done, ready for review
Checklist (tick everything that applies)
- [x] Code linting (required, can be done after the PR checks)
- [x] Documentation
- [ ] Tests
- [x] ICLA (required for bigger changes)
Great, thanks!
Just curious, is this official pricing endpoint any smaller, or is it perhaps even bigger? I'm asking since the previous JSON we used to download was massive (multiple GB), which meant the download could take a while (but we already downloaded it in chunks to avoid excessive memory usage and added a progress bar).
EDIT: I can answer that myself - it appears the file is 3 GB.
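As a reference for the chunked-download approach mentioned above, a minimal sketch (not the script's actual downloader, and the destination filename is made up): fetch the ~3 GB offer file in chunks so it never sits in memory all at once, printing a dot per chunk as crude progress feedback.

```python
import requests

def download_offer_file(url, dest="ec2-offer-index.json", chunk_size=1024 * 1024):
    # Stream the response and write it to disk chunk by chunk.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fp:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fp.write(chunk)
                print(".", end="", flush=True)
    print()
    return dest
```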
I've added some inline comments / questions.
I've pushed some small lint and tox.ini fixes, but it looks like the script still fails with an empty cache:
Traceback (most recent call last):
File "contrib/scrape-ec2-prices.py", line 301, in <module>
main()
File "contrib/scrape-ec2-prices.py", line 294, in main
pricing_data = scrape_ec2_pricing()
File "contrib/scrape-ec2-prices.py", line 137, in scrape_ec2_pricing
prices = get_all_prices()
File "contrib/scrape-ec2-prices.py", line 104, in get_all_prices
with open(json_file, 'r') as f:
TypeError: expected str, bytes or os.PathLike object, not tuple
I pushed a small fix for that, but didn't have time to dig into whether there are other issues present.
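As an aside, the traceback suggests a tuple was passed where a single path was expected; one common source of that pattern is tempfile.mkstemp(), which returns an (fd, path) pair. The snippet below is only an illustrative guess at the shape of such a fix, not the actual change that was pushed:

```python
import tempfile

# Buggy pattern (illustrative): tempfile.mkstemp() returns an (fd, path)
# tuple, so passing its result straight to open() raises exactly this
# "expected str, bytes or os.PathLike object, not tuple" TypeError:
#
#     json_file = tempfile.mkstemp(suffix=".json")
#     with open(json_file, "r") as f: ...
#
# Fixed pattern: unpack the tuple and open the path component.
fd, json_path = tempfile.mkstemp(suffix=".json")
with open(json_path, "r") as f:
    data = f.read()  # parse the cached pricing data here
```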
On top of that, it looks like the script takes a very long time to complete even when the cache is available (I never got it to finish - not sure if it just takes a really long time or there is an infinite loop or something similar).
What kind of run times have you seen on your computer?
If it takes a really long time, it might also not be a bad idea to add some kind of basic progress bar (e.g. print "." every now and then to give some feedback that the script didn't get stuck, or even better, periodically print how many items we have processed so far out of the total number).
EDIT: It looks like ijson.parse() already takes a very long time, and then the iteration over the parsed content does as well.
I don't think it's possible to add a progress bar for the ijson.parse() step itself (we would need to modify ijson or similar), but we might be able to add one to the code where we process the parsed response.
I pushed some additional changes to increase the buffer size, which speeds up parsing a bit, and to print a progress bar. Since ijson parses incrementally and returns a generator, there is no easy way for us to determine how many more items we need to parse / how far along we are, so we can only show that parsing is being performed, not how many items are left.
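For reference, a rough sketch of that approach with assumed names (the buffer size, progress interval, and event handling below are illustrative, not the script's actual values):

```python
import ijson

def parse_with_progress(json_path, every=100_000):
    processed = 0
    with open(json_path, "rb") as fp:
        # A larger-than-default buf_size reduces the number of reads and
        # speeds up incremental parsing of the very large file a bit.
        for prefix, event, value in ijson.parse(fp, buf_size=8 * 1024 * 1024):
            processed += 1
            if processed % every == 0:
                # We cannot know the total up front, so just show activity.
                print(".", end="", flush=True)
            # ... handle (prefix, event, value) here ...
    print()
```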
@Eis-D-Z Script seems to be working now and other checks passing, but unit tests one is still failing - https://github.com/apache/libcloud/runs/7481787304?check_suite_focus=true
It appears it's failing because the pricing.json file format has changed. Or is something else going on? Can you please have a look when you get a chance? Thanks!
@Eis-D-Z How is this PR looking? Would be great if we could get it wrapped up and merged. Thanks.
@Kami Ready, sorry for the delay.
Codecov Report
Merging #1715 (f44038b) into trunk (77967bf) will not change coverage. The diff coverage is n/a.
Coverage Diff:

| | trunk | #1715 | +/- |
| --- | --- | --- | --- |
| Coverage | 83.28% | 83.28% | |
| Files | 400 | 400 | |
| Lines | 87862 | 87862 | |
| Branches | 10708 | 10708 | |
| Hits | 73180 | 73180 | |
| Misses | 11495 | 11495 | |
| Partials | 3187 | 3187 | |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77967bf...f44038b.
@Eis-D-Z Thanks! Since this change increased the size of the pricing file by about 50% (from 2 MB to 3 MB), I will take some time to verify that it has no negative implications for memory usage and similar. We had similar issues with the sizing file in the past, and to avoid them we split some of the data into separate files and made some imports lazy, so they are only exercised when the EC2 driver is used.
I had a look, and by default we only cache pricing data in memory for the drivers for which it is requested by the user, so this will only affect people using the EC2 driver.
I will still add a note about that to the changelog.
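To illustrate the caching behaviour described above (the names below are made up and do not match libcloud's actual internals), the idea is that pricing data is only cached per driver on first use, so only EC2 users pay the extra memory cost of the larger EC2 section:

```python
import json

_PRICING_CACHE = {}

def get_pricing(driver_name, pricing_file="pricing.json"):
    # Parse the pricing file lazily and cache only the requested driver's
    # section, so other drivers never keep the EC2 data in memory.
    if driver_name not in _PRICING_CACHE:
        with open(pricing_file) as fp:
            data = json.load(fp)
        _PRICING_CACHE[driver_name] = data.get(driver_name, {})
    return _PRICING_CACHE[driver_name]
```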
I added a note on pricing file size + potential small memory usage increase (ce5b74479f4d21f2928d2d2b56d1cf734c4ee38a) and also updated the script to include progress bar when parsing sku pricing data (7bffd095b151d451bf6f723e89a88745c52399a8).
Merged, thanks!