gcsfs icon indicating copy to clipboard operation
gcsfs copied to clipboard

Added support for the # character in filenames

Open Eugeny opened this issue 3 years ago • 15 comments
trafficstars

Google requires # to remain unescaped in the URL, but aiohttp considers it the start of the fragment and strips it out.

Eugeny avatar Nov 24 '22 15:11 Eugeny

@efiop , how does this interact with versioning?

@Eugeny , GCS recommends strongly against using "#" in a key name.

martindurant avatar Nov 24 '22 15:11 martindurant

According to https://cloud.google.com/storage/docs/objects it's the gcloud CLI that interprets it in a special way, not the actual storage. Anyway, currently valid files on GCS with # in the name are returned by .ls() but then can't be found by .info() in gcsfs

Eugeny avatar Nov 24 '22 16:11 Eugeny

I've originally tested this change against 2021.11.0, but just noticed that on the main branch, urllib's quote_plus is now used instead of the old custom implementation, and quote_plus would escape # which is likely going to break this again (unfortunately unable to test it myself right now)

Eugeny avatar Nov 24 '22 16:11 Eugeny

@Eugeny gcloud CLI is an important part of the ecosystem and it is a bit odd to try to use names that are not supported by it.

There was some discussion around it in https://github.com/fsspec/gcsfs/pull/504 CC @pmrowla

efiop avatar Nov 24 '22 18:11 efiop

Sure, I'm not going out of my way to play the system - just have to work with already-existing files that have all sorts of names and gcsfs tripped on a file like this when I was crawling the bucket with ls() + info()

Eugeny avatar Nov 24 '22 19:11 Eugeny

@Eugeny Makes sense. Would it be an option to just go through those files through API and rename them to comply with limitations? Merging this PR will break versioning support and it seems not trivial to have both (i suppose gcloud cli had similar reasons).

efiop avatar Nov 24 '22 19:11 efiop

I'm not that familiar with the versioning API, but it looks like the version is not actually stored in the name with a #, it's just how gcloud, an external application, displays it, so nothing should break

Eugeny avatar Nov 24 '22 21:11 Eugeny

it looks like the version is not actually stored in the name with a #

Unfortunately it does get included in the URL when accessing the different versions of the object (for which you do not pass a nicely formatted JSON like the one linked)

martindurant avatar Nov 24 '22 21:11 martindurant

Not sure what you mean by this.

  • In actual URLs, generation is chosen via a GET param:

@cloudshell:~ (tabby-320916)$ gsutil signurl  1.json 'gs://eugenetest/Screenshot 2022-11-22 at 14.43.30.png#1669325016814501'
URL     HTTP Method     Expiration      Signed URL
gs://eugenetest/Screenshot 2022-11-22 at 14.43.30.png#1669325016814501  GET     2022-11-24 22:26:57     https://storage.googleapis.com/eugenetest/Screenshot%202022-11-22%20at%2014.43.30.png?x-goog-signature=xxxx&generation=1669325016814501&x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=universal-test%40tabby-320916.iam.gserviceaccount.com%2F20221124%2Fus-east1%2Fstorage%2Fgoog4_request&x-goog-date=20221124T212657Z&x-goog-expires=3600&x-goog-signedheaders=host
  • In gs:// URIs in gcsfs, # was chopped off by aiohttp anyway and was lost.

Eugeny avatar Nov 24 '22 21:11 Eugeny

@Eugeny You are correct, # is chopped if it makes it all the way down to _request, but if fs is version_aware, it gets interpreted as generation in _split_path. I didn't test this myself, but if you create a version_aware instance and try to open a file with fragment in the url, it will probably break. Maybe you could make this work by wrapping your code into if not self.version_aware so urls with fragments only work when fs is not version_aware.

efiop avatar Nov 24 '22 22:11 efiop

I don't know your situation, but overall, having # in filenames is probably going to cause you more problems later on, so to me, if possible, just renaming them sounds like the simplest solution.

efiop avatar Nov 24 '22 22:11 efiop

Thanks, that makes total sense, I didn't know that gcsfs had a version aware mode.

I'm building an app that needs to index user-supplied cloud storage, so I have no influence on what's going to be there :)

Eugeny avatar Nov 24 '22 22:11 Eugeny

According to https://cloud.google.com/storage/docs/objects it's the gcloud CLI that interprets it in a special way, not the actual storage. Anyway, currently valid files on GCS with # in the name are returned by .ls() but then can't be found by .info() in gcsfs

# is technically allowed in object names, however the same page you linked also states

It is strongly recommended that you avoid the following in your object names:

  • The "#" character

When we implemented versioning support in gcsfs, the decision we made was essentially that if google says not to use # in object names, and google's own tools for using GCS do not support # in object names, then gcsfs should not overcomplicate things either (and does not need to try supporting both versioning and # in object names)

As you noted, it is possible to use the google API directly if you absolutely need support for #, but I don't think that it should be required in gcsfs

pmrowla avatar Nov 25 '22 02:11 pmrowla

Just to be clear, we totally understand that this is coming from the need and don't want to discourage you. We might even run into the same problem ourselves later on, as we also don't have control over what users already have in their cloud, but just didn't run into that yet.

Seems like supporting # with version_aware turned off is somewhat straightforward. Would that be enough for your application? If you need both at the same time, then I guess this is a little bit more complicated, but probably still doable. In both cases, I really hope fake-gcs-server that gcsfs uses for testing supports such object names, as that would at least make it all automatically testable.

efiop avatar Nov 25 '22 08:11 efiop

Just to follow up on this, I think supporting # in object names when version_aware is false is probably the best way to handle this situation. The assumptions about how google cli tools handle # in object names technically apply outside of versioned contexts, but I would not be opposed to allowing # (and ?) in non versioned contexts for gcsfs purposes

pmrowla avatar Dec 01 '22 13:12 pmrowla

This seems to have become lost. @Eugeny , do you have all the information you need to finish this off? There is now a conflict and linting is failing; and following the discussion, we would want to allow "#" only in absence of versioning.

martindurant avatar Jan 23 '23 14:01 martindurant

Unfortunately I don't think I'll have time to work on it - we've decided to blacklist # in paths internally to avoid UX issues with gcloud accessing the same files - feel free to close unless someone else is interested

Eugeny avatar Jan 24 '23 10:01 Eugeny

OK, closing for now, and we'll see if it emerges again. Thanks for looking into the issue, @Eugeny

martindurant avatar Jan 24 '23 13:01 martindurant