Use official endpoint for EC2 prices
EC2 prices scraping script upgrade
Use the official endpoint to get EC2 prices, so that we fetch as many prices as possible and they are as accurate as possible. This makes it possible to add prices for instances with SUSE, RHEL, and RHEL with High Availability.
Like the previous script, this one targets only OnDemand instance prices.
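For context, a minimal sketch of what fetching OnDemand prices from the official bulk offer file can look like, assuming the public AmazonEC2 `index.json` URL and streaming the multi-GB response with ijson instead of loading it at once (the helper names below are illustrative, not the script's actual API):

```python
# Hypothetical sketch: stream the official AWS bulk pricing offer file and
# iterate over OnDemand terms without holding the whole JSON in memory.
import ijson
import requests

EC2_OFFER_URL = (
    "https://pricing.us-east-1.amazonaws.com"
    "/offers/v1.0/aws/AmazonEC2/current/index.json"
)

def iter_ondemand_terms(url=EC2_OFFER_URL):
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        resp.raw.decode_content = True
        # "terms.OnDemand" holds the per-SKU OnDemand price dimensions
        for sku, term in ijson.kvitems(resp.raw, "terms.OnDemand"):
            yield sku, term
```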
Status
- done, ready for review
Checklist (tick everything that applies)
- [x] Code linting (required, can be done after the PR checks)
- [x] Documentation
- [ ] Tests
- [x] ICLA (required for bigger changes)
Great, thanks!
Just curious, is this official pricing endpoint any smaller, or is it perhaps even bigger? I'm asking since the previous JSON we used to download was massive (multiple GB), which meant the download could take a while (but we already downloaded it in chunks to avoid excessive memory usage and added a progress bar).
EDIT: I can answer that myself - it appears the file is 3 GB.
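As a reference for the chunked-download approach mentioned above, a minimal sketch (not the script's actual downloader, and the destination filename is made up): fetch the ~3 GB offer file in chunks so it never sits in memory all at once, printing a dot per chunk as crude progress feedback.

```python
import requests

def download_offer_file(url, dest="ec2-offer-index.json", chunk_size=1024 * 1024):
    # Stream the response and write it to disk chunk by chunk.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fp:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fp.write(chunk)
                print(".", end="", flush=True)
    print()
    return dest
```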
I've added some inline comments / questions.
I've pushed some small lint and tox.ini fixes, but it looks like the script still fails with an empty cache:
Traceback (most recent call last):
File "contrib/scrape-ec2-prices.py", line 301, in <module>
main()
File "contrib/scrape-ec2-prices.py", line 294, in main
pricing_data = scrape_ec2_pricing()
File "contrib/scrape-ec2-prices.py", line 137, in scrape_ec2_pricing
prices = get_all_prices()
File "contrib/scrape-ec2-prices.py", line 104, in get_all_prices
with open(json_file, 'r') as f:
TypeError: expected str, bytes or os.PathLike object, not tuple
I pushed a small fix for that, but didn't have time to dig into whether there are other issues present.
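As an aside, the traceback suggests a tuple was passed where a single path was expected; one common source of that pattern is tempfile.mkstemp(), which returns an (fd, path) pair. The snippet below is only an illustrative guess at the shape of such a fix, not the actual change that was pushed:

```python
import tempfile

# Buggy pattern (illustrative): tempfile.mkstemp() returns an (fd, path)
# tuple, so passing its result straight to open() raises exactly this
# "expected str, bytes or os.PathLike object, not tuple" TypeError:
#
#     json_file = tempfile.mkstemp(suffix=".json")
#     with open(json_file, "r") as f: ...
#
# Fixed pattern: unpack the tuple and open the path component.
fd, json_path = tempfile.mkstemp(suffix=".json")
with open(json_path, "r") as f:
    data = f.read()  # parse the cached pricing data here
```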
On top of that, it looks like the script takes a very long time to complete even when the cache is available (I never got it to finish - not sure if it just takes a really long time or there is an infinite loop or something similar).
What kind of run times have you seen on your computer?
If it takes a really long time, it might also not be a bad idea to add some kind of basic progress bar (e.g. print "." every now and then to give some feedback that the script didn't get stuck, or even better, periodically print how many items we have processed so far out of the total number).
EDIT: It looks like ijson.parse() already takes a very long time, and then the iteration over the parsed content does as well.
I don't think it's possible to add a progress bar for the ijson.parse() step itself (we would need to modify ijson or similar), but we might be able to add one to the code where we process the parsed response.
I pushed some additional changes to increase the buffer size, which speeds up parsing a bit, and to print a progress bar. Since ijson parses incrementally and returns a generator, there is no easy way for us to determine how many more items we need to parse / how far along we are, so we can only show that parsing is being performed, not how many items are left.
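For reference, a rough sketch of that approach with assumed names (the buffer size, progress interval, and event handling below are illustrative, not the script's actual values):

```python
import ijson

def parse_with_progress(json_path, every=100_000):
    processed = 0
    with open(json_path, "rb") as fp:
        # A larger-than-default buf_size reduces the number of reads and
        # speeds up incremental parsing of the very large file a bit.
        for prefix, event, value in ijson.parse(fp, buf_size=8 * 1024 * 1024):
            processed += 1
            if processed % every == 0:
                # We cannot know the total up front, so just show activity.
                print(".", end="", flush=True)
            # ... handle (prefix, event, value) here ...
    print()
```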
@Eis-D-Z Script seems to be working now and other checks passing, but unit tests one is still failing - https://github.com/apache/libcloud/runs/7481787304?check_suite_focus=true
It appears it's failing because the pricing.json file format has changed. Or is something else going on? Can you please have a look when you get a chance? Thanks!
@Eis-D-Z How is this PR looking? Would be great if we could get it wrapped up and merged. Thanks.
@Kami Ready, sorry for the delay.
Codecov Report
Merging #1715 (f44038b) into trunk (77967bf) will not change coverage. The diff coverage is n/a.
Coverage Diff:

| | trunk | #1715 | +/- |
| --- | --- | --- | --- |
| Coverage | 83.28% | 83.28% | |
| Files | 400 | 400 | |
| Lines | 87862 | 87862 | |
| Branches | 10708 | 10708 | |
| Hits | 73180 | 73180 | |
| Misses | 11495 | 11495 | |
| Partials | 3187 | 3187 | |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77967bf...f44038b.
@Eis-D-Z Thanks! Since this change increased the size of the pricing file by about 50% (from 2 MB to 3 MB), I will take some time to verify that it has no negative implications for memory usage and similar. We had similar issues with the sizing file in the past, and to avoid them we split some of the data into separate files and made some imports lazy, so they are only exercised when the EC2 driver is used.
I had a look, and by default we only cache pricing data in memory for the drivers for which it is requested by the user, so this will only affect people using the EC2 driver.
I will still add a note about that to the changelog.
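To illustrate the caching behaviour described above (the names below are made up and do not match libcloud's actual internals), the idea is that pricing data is only cached per driver on first use, so only EC2 users pay the extra memory cost of the larger EC2 section:

```python
import json

_PRICING_CACHE = {}

def get_pricing(driver_name, pricing_file="pricing.json"):
    # Parse the pricing file lazily and cache only the requested driver's
    # section, so other drivers never keep the EC2 data in memory.
    if driver_name not in _PRICING_CACHE:
        with open(pricing_file) as fp:
            data = json.load(fp)
        _PRICING_CACHE[driver_name] = data.get(driver_name, {})
    return _PRICING_CACHE[driver_name]
```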
I added a note on pricing file size + potential small memory usage increase (ce5b74479f4d21f2928d2d2b56d1cf734c4ee38a) and also updated the script to include progress bar when parsing sku pricing data (7bffd095b151d451bf6f723e89a88745c52399a8).
Merged, thanks!