filesystem_spec icon indicating copy to clipboard operation
filesystem_spec copied to clipboard

Strip protocol from paths in fsspec.generic.GenericFileSystem._copy

Open ryaminal opened this issue 10 months ago • 8 comments

Seems to fix a problem I was having. Can't do it upstream in the rsync method because at least one file needs the protocol to resolve the proper filesystem.

ryaminal avatar Apr 17 '24 02:04 ryaminal

This is probably reasonable (even given a fix in sshfs), but it would be nice if we could engineer a case that failed before this fix and now passes, as a test against regression.

martindurant avatar Apr 17 '24 14:04 martindurant

You are referencing this fix in sshfs for the _info/isdir stuff, right? Yeah, that fix got around one problem I was having with GenericFileSystem when it checks for a dir.

This fix is to get around another problem i was having after I resolved the sshfs one above.

And yeah, a breaking/fixing test case is a grand idea. I will work on that. I imagine I can use a mockssh server for this, perhaps? hmmm.

fs = fsspec.url_to_fs("sftp://user@host:1234")[0]

fsspec.generic.rsync(
    "sftp:///stuff",
    "gs://dbucket/dir1/dir2",
    inst_kwargs={"default_method": "current"},
)

is what i was using locally to test this... i wonder if the existing test_rsync isn't catching this here because it doesn't use the default_method=current? the other tests, however, do but not rsync.

ryaminal avatar Apr 17 '24 14:04 ryaminal

I imagine I can use a mockssh server for this, perhaps?

memory, local and ftp FSs are the typical ones used in tests (in that order). You could make a mock, but if any of those show the problem, it would be easier.

martindurant avatar Apr 18 '24 14:04 martindurant

hmmm, seems like any filesystem that doesn't strip in their _info or info calls has a similar problem with GenericFileSystem.

I decided to try this with the built in implementation of sftp instead of sshfs and sure enough, same issue with isdir failing and the same fix.

also was able to reproduce this in the test by commenting out the strip in the memory impl here.

with that line commented out the test fails in the same way as sshfs and the sftp impl.

i don't think the fix can be done in AbstractFileSystem or AsyncAbstractFileSystem so every impl needs to make sure they strip the protocol in their info and _info methods?

ryaminal avatar Apr 19 '24 15:04 ryaminal

Yes I think all backends should be able to cope with paths with or without protocol specifiers.

martindurant avatar Apr 19 '24 16:04 martindurant

bleh, this is becoming quite tricky to test. not certain how to do it.

would it be better to move this out of fsspec and just add another fix to sshfs? in sshfs i think this will have to be done in almost all locations that use a path.

ryaminal avatar Apr 19 '24 19:04 ryaminal

If you can state the problem as "sshfs should always use _strip_protocol, but doesn't", then yes.

martindurant avatar Apr 19 '24 19:04 martindurant

The logic in Generic changed recently, not to update the "name" field of the dircache of its target filesystems (part of https://github.com/fsspec/filesystem_spec/pull/1633 ). That might be enough to fix the issue here.

martindurant avatar Jun 25 '24 13:06 martindurant