smart_open icon indicating copy to clipboard operation
smart_open copied to clipboard

[WIP] Fixes #599 - use gcs blob interface.

Open cadnce opened this issue 3 years ago • 5 comments
trafficstars

Fixes #599 - Swap to using GCS native blob open under the hood.

This should reduce the amount of custom code to maintain, I have tried to keep the interfaces identical so there is no API breaking changes. Though this does mean there is still lots of code that can be trimmed down.

I think it might be worth re-thinking the test coverage and if the test suites like FakeAuthorizedSessionTest are still valid/useful.

What do you think? @petedannemann

cadnce avatar Oct 01 '22 02:10 cadnce

First off, @cadnce this is great and thank you for your efforts.

Something I am wondering is if we need these Reader and Writer classes at all now. Do you think there is a solution that just uses the Blob.open function and nothing more? It seems like almost all of our Reader and Writer methods just defer to the the raw reader returned by Blob.open. If this change breaks the smart-open API in some small way like removing the infrequently used terminate method I think that is fine. We can just do a major version bump.

I agree most of these tests are probably useless now, especially if we are able to accomplish what I stated above.

petedannemann avatar Oct 01 '22 13:10 petedannemann

Could we do something like this? We would need to do a bit of refactoring for handling kwargs but this should eliminate a lot more of our code

def open(
        bucket_id,
        blob_id,
        mode,
        client=None,  # type: google.cloud.storage.Client
        blob_properties=None,
        blob_open_kwargs=None,
        ):
    if client is None:
        client = google.cloud.storage.Client()

    bucket = client.bucket(bucket_id)

    if mode in (constants.READ_BINARY, 'r', 'rt'):
        blob = client.bucket.get_blob(blob_id)
        if blob is None:
            raise google.cloud.exceptions.NotFound('blob %s not found in %s' % (blob_id, bucket_id))
    elif mode in (constants.WRITE_BINARY, 'w', 'wt'):
        blob = client.bucket.blob(blob_id)
        if blob_properties:
            for k, v in blob_properties.items():
                setattr(blob, k, v)
    else:
        raise NotImplementedError('GCS support for mode %r not implemented' % mode)

    return blob.open(mode, **blob_open_kwargs)

petedannemann avatar Oct 01 '22 19:10 petedannemann

Cool!

If you’re not opposed to the idea of losing compatibility I’ll swap it to something much simpler like you suggest.

Ironically that’s where I started out before I spent ages ensuring it all matched - I guess I should have just asked

cadnce avatar Oct 01 '22 21:10 cadnce

Considering those changes would reduce the maintenance of the GCS module drastically I think it more than warrants a breaking change / major version bump. We could just redirect most limitations to the underlying python-storage repo. I don't think any meaningful breaking changes will really occur regardless though since the blob.open is fairly similar to our implementation here

petedannemann avatar Oct 01 '22 21:10 petedannemann

👍

I’ll fix this up/replace it next week.

cadnce avatar Oct 02 '22 00:10 cadnce

Hmm, so this has caused me a little more thinking than I was expecting and would be keen to get input on a few things

  1. Testing! GCS doesnt have a moto esque library, hence you've written your own. Cool! But since we're now deferring lots to the google-storage library i'm not sure many of the tests actually test much except the Mock/Fakes. I'd be keen for your opinion, but In my head the main things to test are:
  • That the smart_open interface works with gcs ( The existing OpenTest)
  • Using a real reader/writer works (the existing integration tests I suppose)
  • That we set the correct properties on the blobs (TODO)
  • That we passthrough the correct open arguments (TODO)
  • Error handling for blob or bucket not existing ( test_nonexisting_bucket and test_read_nonexisting_key )
  1. The google-storage blob module handles the case close() being called twice differently to the way that the previous reader did. i.e. in google storage it raises an exception if you close it twice https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/fileio.py#L426 wheras most other file type objects ignore this. I've raised a question on the upstream, but I'm not entirely familiar the underlying memory handling that is referred to in the failing test and if it is still relevant in python 3.x? - https://github.com/RaRe-Technologies/smart_open/blob/develop/smart_open/tests/test_gcs.py#L831-L832

  2. Should i be trying to handle the kwarg arguments that are just getting passed through to the blob open or blob properties? It looks like it would be possible to validate against the valid upload/download kwargs from the google library but that seems like i'm just adding bloat for the sake of it as thats the first thing the google library checks.

cadnce avatar Oct 10 '22 08:10 cadnce

If you need to maintain the functionality in 2, we could do something super disgusting and monkeypatch the google BlobWriter object so that it allows close to be called multiple times. But that seems like a recipe for disaster in the future!

cadnce avatar Oct 10 '22 08:10 cadnce

Hmm, so this has caused me a little more thinking than I was expecting and would be keen to get input on a few things

  1. Testing! GCS doesnt have a moto esque library, hence you've written your own. Cool! But since we're now deferring lots to the google-storage library i'm not sure many of the tests actually test much except the Mock/Fakes. I'd be keen for your opinion, but In my head the main things to test are:
  • That the smart_open interface works with gcs ( The existing OpenTest)
  • Using a real reader/writer works (the existing integration tests I suppose)
  • That we set the correct properties on the blobs (TODO)
  • That we passthrough the correct open arguments (TODO)
  • Error handling for blob or bucket not existing ( test_nonexisting_bucket and test_read_nonexisting_key )
  1. The google-storage blob module handles the case close() being called twice differently to the way that the previous reader did. i.e. in google storage it raises an exception if you close it twice https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/fileio.py#L426 wheras most other file type objects ignore this. I've raised a question on the upstream, but I'm not entirely familiar the underlying memory handling that is referred to in the failing test and if it is still relevant in python 3.x? - https://github.com/RaRe-Technologies/smart_open/blob/develop/smart_open/tests/test_gcs.py#L831-L832
  2. Should i be trying to handle the kwarg arguments that are just getting passed through to the blob open or blob properties? It looks like it would be possible to validate against the valid upload/download kwargs from the google library but that seems like i'm just adding bloat for the sake of it as thats the first thing the google library checks.
  1. That all sounds right to me. The majority of those unit tests no longer serve their purpose
  2. Oh, that is unfortunate about the double close. I'm not really sure if that requires a monkeypatch, I would probably be ok with the new behavior as long as we document these changes and possibly provide advice on how to migrate to the new version of smart-open. I'm pretty sure I copied that failing test from another test and I wouldn't worry about that. We can delete it.
  3. Let's defer to validity of those to the google library

petedannemann avatar Oct 10 '22 12:10 petedannemann

@mpenkov @piskvorky what is smart-open's stance on breaking API changes within the library?

The opinion I have been expressing so far is that they are ok if we perform a major version bump and the benefits are worth it. In this case, we defer almost all maintenance of the gcs integration to the google-cloud-storage library so the benefits seem worthwhile.

I think so far the breaking changes we have identified are:

  1. Removal of the terminate method, which is likely very infrequently used
  2. close can no longer be called more than once

petedannemann avatar Oct 10 '22 13:10 petedannemann

On the breaking changes front also

  • the parameters to gcs.open have changed (no more control of buffer_size)
  • no-longer has a gcs.Reader or gcs.Writer class

If we’re reasonably happy with the concept of it being a breaking change, I’ll have a look at ensuring those test cases are covered off (and probably strip out a few that are no-longer testing the right things)

cadnce avatar Oct 10 '22 21:10 cadnce

Btw thanks for your input @petedannemann always hard knowing the right way to contribute

cadnce avatar Oct 10 '22 21:10 cadnce

Ok, so of the changes that were breakign originally in this PR:

  1. teminate() previously this method removed an incomplete upload. However now the google-python-storage library under the hood uses ResumableUploads the behaviour with these is differnt. "Only a completed resumable upload appears in your bucket and, if applicable, replaces an existing object with the same name". Though i've added a terminate() feature for backwards compatibility which is no-op.

  2. The double close - this was something I wasnt happy with so I raised a ticket on the google library and that will hopefully be included soon - https://github.com/googleapis/python-storage/issues/884

  3. Removal of buffer_size (I added this back in and now warn about using it) 👍

  4. gcs.Reader gcs.Writer I added them back with the same interfaces they now just act as a passthrough 👍

Oh, and I updated the test coverage to cover the things I thought we were missing.

Which means its just a case of waiting for the next release of the google-cloud-storage library and the interface will remain identical.

cadnce avatar Oct 13 '22 05:10 cadnce

Its definitely not scientific, but I ran the integration/performance tests

use-gcs-open

-------------------------------------------------------------------------------------- benchmark: 7 tests -------------------------------------------------------------------------------------
Name (time in s)                        Min               Max              Mean            StdDev            Median               IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_gcs_performance_small_reads     2.6681 (1.0)      4.1263 (1.0)      3.2295 (1.0)      0.7510 (7.96)     2.7039 (1.0)      1.3380 (9.16)          1;0  0.3096 (1.0)           5           1
test_gcs_readwrite_text              4.9159 (1.84)     5.8216 (1.41)     5.2937 (1.64)     0.3342 (3.54)     5.2788 (1.95)     0.3598 (2.46)          2;0  0.1889 (0.61)          5           1
test_gcs_readwrite_binary_gzip       5.0102 (1.88)     5.2816 (1.28)     5.1368 (1.59)     0.1090 (1.16)     5.0950 (1.88)     0.1656 (1.13)          2;0  0.1947 (0.63)          5           1
test_gcs_readwrite_binary            5.0349 (1.89)     5.4729 (1.33)     5.1708 (1.60)     0.1791 (1.90)     5.1131 (1.89)     0.2120 (1.45)          1;0  0.1934 (0.62)          5           1
test_gcs_performance_gz              5.0807 (1.90)     6.0795 (1.47)     5.4019 (1.67)     0.4065 (4.31)     5.3494 (1.98)     0.4864 (3.33)          1;0  0.1851 (0.60)          5           1
test_gcs_readwrite_text_gzip         5.0951 (1.91)     5.3273 (1.29)     5.2043 (1.61)     0.0943 (1.0)      5.2325 (1.94)     0.1460 (1.0)           2;0  0.1921 (0.62)          5           1
test_gcs_performance                 5.6067 (2.10)     6.4920 (1.57)     5.9715 (1.85)     0.3289 (3.49)     5.9022 (2.18)     0.3725 (2.55)          2;0  0.1675 (0.54)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

develop

-------------------------------------------------------------------------------------- benchmark: 7 tests -------------------------------------------------------------------------------------
Name (time in s)                        Min               Max              Mean            StdDev            Median               IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_gcs_performance_small_reads     4.8196 (1.0)      4.9168 (1.0)      4.8616 (1.0)      0.0484 (1.0)      4.8348 (1.0)      0.0893 (1.0)           2;0  0.2057 (1.0)           5           1
test_gcs_readwrite_binary            4.8830 (1.01)     5.7310 (1.17)     5.1998 (1.07)     0.3588 (7.41)     5.0778 (1.05)     0.5689 (6.37)          1;0  0.1923 (0.93)          5           1
test_gcs_readwrite_text_gzip         4.9546 (1.03)     5.6261 (1.14)     5.1468 (1.06)     0.2742 (5.66)     5.0464 (1.04)     0.2550 (2.85)          1;1  0.1943 (0.94)          5           1
test_gcs_performance_gz              5.0127 (1.04)     5.2060 (1.06)     5.1066 (1.05)     0.0762 (1.57)     5.1100 (1.06)     0.1191 (1.33)          2;0  0.1958 (0.95)          5           1
test_gcs_readwrite_binary_gzip       5.3487 (1.11)     5.8019 (1.18)     5.5354 (1.14)     0.1659 (3.43)     5.5166 (1.14)     0.1612 (1.80)          2;0  0.1807 (0.88)          5           1
test_gcs_readwrite_text              5.4528 (1.13)     6.0964 (1.24)     5.8176 (1.20)     0.2517 (5.20)     5.7647 (1.19)     0.3452 (3.86)          2;0  0.1719 (0.84)          5           1
test_gcs_performance                 5.7934 (1.20)     7.7706 (1.58)     6.4189 (1.32)     0.7946 (16.41)    6.3065 (1.30)     0.8402 (9.41)          1;0  0.1558 (0.76)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The results pretty much being too similar to compare on my connection

cadnce avatar Oct 13 '22 05:10 cadnce

Do we need to author any documentation that helps people adjust to the breaking changes?

mpenkov avatar Oct 16 '22 03:10 mpenkov

If we wait for the google-python-storage library version 2.6.0 https://github.com/googleapis/python-storage/pull/838

I think there will be no breaking changes 😁

Though, I’ll add an update to the documentation about the new parameter passthrough.

the slightly unfortunate thing is it will break all the rest of the open PR’s which modify the gcs module, but they will all be easier to add in the future.

cadnce avatar Oct 16 '22 09:10 cadnce

fyi another PR waiting for gcs 2.6.0: https://github.com/RaRe-Technologies/smart_open/pull/727

would be awesome if that PR can be closed in favour of this one (just that acl cant be specified using setattr)

edit: yes this PR would cover it (source):

blob_open_kwargs={'predefined_acl': 'publicRead'}

if it's not too much to ask: could you copy-paste the corresponding test from my PR? its tiny :)

ddelange avatar Oct 16 '22 11:10 ddelange

@ddelange Yup, it should be covered by the blob_open_kwargs (why I made this pr instead of adding all the small gcs functionality additions)

I’ll add that as a use case in the docs update to explain how to use it?

I was trying to trim functionality out of the fakes (and removed the test you updated) because I was deferring a lot of that to the google library. But could add that in if you felt it was adding a lot?

cadnce avatar Oct 16 '22 12:10 cadnce

awesome, docs sounds good! 💥

yeah, all the fakes. still a mystery to me why there's no testing framework for gcs like moto for boto, or a local dev server for testing like azurite for azure blob...

ddelange avatar Oct 16 '22 12:10 ddelange

nvm, spoke too soon https://github.com/fsouza/fake-gcs-server

ddelange avatar Oct 16 '22 12:10 ddelange

here an example of switching to a dockerized azurite server for tests, was a quite satisfying refactor as you just point to fake server and run real assertions on the server's responses

ddelange avatar Oct 16 '22 12:10 ddelange

nvm, spoke too soon https://github.com/fsouza/fake-gcs-server

here an example of switching to a dockerized azurite server for tests, was a quite satisfying refactor as you just point to fake server and run real assertions on the server's responses

ah, that felt good again :recycle: https://github.com/ddelange/pypicloud/pull/4

thanks a lot for the efforts here, looking very forward to smart_open 7.0.0! 🚀

ddelange avatar Oct 17 '22 07:10 ddelange

is this PR ready for review?

ddelange avatar Oct 31 '22 17:10 ddelange

Sorry about the ambiguity.

We’re waiting on the release of the google cloud storage python module. That said, the code for this will remain the same except for the dependency version will be bumped. So in theory you can review now and just assume the double close issue is not an issue?

cadnce avatar Oct 31 '22 22:10 cadnce

gcs 2.6.0 was released 💥

ddelange avatar Nov 08 '22 06:11 ddelange

Awesome! I’ll make the changes requested and bump the version and test this evening.

cadnce avatar Nov 08 '22 06:11 cadnce

@cadnce could we squeeze this into the next release, v6.3.0? We're planning to release soon (~in a week or so).

piskvorky avatar Nov 19 '22 13:11 piskvorky

@cadnce Do you think you could finish this in the next couple of days? We're thinking of releasing 6.3.0 soon.

If not, then no rush, we can include this in the next release.

Please let me know.

mpenkov avatar Dec 05 '22 06:12 mpenkov

Hi 👋 Looking at @cadnce's contribution graph, I think he hasn't been active since. I took the liberty to add a commit which I think addresses the outstanding comments, see #744.

ddelange avatar Dec 05 '22 09:12 ddelange

Closing in favor of #744 due to inactivity.

mpenkov avatar Dec 09 '22 06:12 mpenkov