
When walking an osfs:// tree containing a symlink to an ancestor directory, pyfilesystem2 loops forever

Open dstromberg opened this issue 7 years ago • 7 comments

If I create a symlink like `./c/d/2 -> ..` then pyfilesystem 2.0.20 gives:

```
Traceback (most recent call last):
  File "/home/dstromberg/src/pyfilesystem-tests/lib/python3.6/site-packages/fs/osfs.py", line 468, in _scandir
    "is_dir": dir_entry.is_dir()
OSError: [Errno 40] Too many levels of symbolic links: b'/home/dstromberg/src/home-svn/backshift/trunk/tests/57-symlinks/to-be-saved/a/b/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1'
```

It probably should not try to traverse symlinks.
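For comparison, the standard library's `os.walk` already behaves this way: its `followlinks` parameter defaults to `False`, so a back-pointing symlink is listed but never descended into. A minimal stdlib reproduction (hypothetical temp paths, not the reporter's tree):

```python
import os
import tempfile

# Build a tree with a symlink that points back to an ancestor
# directory, mirroring the ./c/d/2 -> .. case from the report.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "c", "d"))
os.symlink("..", os.path.join(root, "c", "d", "2"))

# os.walk defaults to followlinks=False, so the traversal terminates:
# the symlink "2" shows up in dirnames but is never entered.
visited = [dirpath for dirpath, dirnames, filenames in os.walk(root)]
```

With `followlinks=True`, the same walk would recurse until the OS raises ELOOP.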

dstromberg avatar May 09 '18 17:05 dstromberg

It would probably be better to detect recursive symlinks.

willmcgugan avatar May 12 '18 12:05 willmcgugan

I can see doing that with an OSFS instance (using readlink), but how would you do that for remote filesystems (such as an FTP) where the symlinks are traversed server-side?

geoffjukes avatar Jun 25 '18 17:06 geoffjukes

It may not be possible on such filesystems if the link information isn't exposed.

I can get the inode number from a stat call on OS-backed filesystems, but not for all filesystems. I was thinking of adding some kind of 'identity' info namespace, so the implementation can generate a unique key for files which I can use to figure out if it has been visited before.

willmcgugan avatar Jun 25 '18 17:06 willmcgugan

The identity hash is an interesting idea. With very few data points, you could reasonably ensure uniqueness without adding much overhead.

I do a similar thing with a JSON data feed I deliver. I pop out anything that is always different (in my case, a 'sent' date) and then MD5 `json.dumps(thing, sort_keys=True)`.
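That approach can be sketched as a small helper. The `sent` field name is the example from the comment; the function name and signature are made up for illustration:

```python
import hashlib
import json

def stable_digest(record, volatile=("sent",)):
    """MD5 a JSON-able record after dropping fields that always change,
    so two records with the same stable content hash identically."""
    pruned = {k: v for k, v in record.items() if k not in volatile}
    payload = json.dumps(pruned, sort_keys=True)  # sort_keys => stable text
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```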

Maybe there is a simple hash that can be created without adding too much overhead, since it's only really a problem in the walker, and a walker already collects sizes etc.

geoffjukes avatar Jun 25 '18 18:06 geoffjukes

My concern about that is that it is entirely possible to have folders with identical metadata. I'm tempted to say that if we can't tell for certain that a folder is a symlink, we shouldn't try to guess.

So for the ftp example, the only protection against a recursive symlink directory would be setting the max_depth parameter on the walker.

willmcgugan avatar Jun 27 '18 15:06 willmcgugan

Within a given tree (which is where I would expect this to cause issues) I'd wager it's highly unlikely that 2 branches would share the exact same md5sum(name+atime+ctime+mtime+size) for every file and folder - especially if it were possible to somehow tag-in the 'parent' hash.

So have an identity hash for each element, then hash the hashes :)

Of course, it's total overkill if you just have a follow_symlinks option that can be set to false, and rely on the OS to throw "too many levels" if it goes too deep :)
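"Relying on the OS" here means letting the kernel raise ELOOP when it resolves too many links. A self-referential symlink triggers it immediately (illustrative temp paths):

```python
import errno
import os
import tempfile

# A symlink whose target is itself: resolving it can never terminate,
# so the kernel gives up with ELOOP ("Too many levels of symbolic links").
d = tempfile.mkdtemp()
loop = os.path.join(d, "loop")
os.symlink(loop, loop)

try:
    os.stat(loop)  # follows the link and hits the kernel's loop limit
    raised = None
except OSError as exc:
    raised = exc.errno
```

This is the same errno 40 seen in the traceback at the top of the issue; there the loop was indirect (via `..`), so the kernel only gave up after ~40 levels of expansion.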

geoffjukes avatar Jun 27 '18 20:06 geoffjukes

On Wed, Jun 27, 2018 at 1:42 PM, Geoff Jukes [email protected] wrote:

> Within a given tree (which is where I would expect this to cause issues) I'd wager it's highly unlikely that 2 branches would share the exact same md5sum(name+atime+ctime+mtime+size) for every file and folder - especially if it were possible to somehow tag-in the 'parent' hash.

Perhaps a non-default option recursion_heuristic=False? I probably wouldn't use it though.

-- Dan Stromberg

dstromberg avatar Jun 27 '18 20:06 dstromberg