pyfilesystem2
When walking an osfs:// with a symlink to a prior directory, pyfilesystem2 loops the circuit
If I create a symlink like `./c/d/2 -> ..` then pyfilesystem 2.0.20 gives:

```
Traceback (most recent call last):
  File "/home/dstromberg/src/pyfilesystem-tests/lib/python3.6/site-packages/fs/osfs.py", line 468, in _scandir
    "is_dir": dir_entry.is_dir()
OSError: [Errno 40] Too many levels of symbolic links: b'/home/dstromberg/src/home-svn/backshift/trunk/tests/57-symlinks/to-be-saved/a/b/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1'
```
It probably should not try to traverse symlinks.
It would probably be better to detect recursive symlinks.
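One way to avoid the loop entirely, on a local filesystem, is to never descend into symlinked directories at all. A minimal stdlib sketch (the `walk_no_symlinks` helper is hypothetical, not pyfilesystem2 API), which reproduces the reporter's `c/d/2 -> ..` layout and walks it safely:

```python
import os
import tempfile

def walk_no_symlinks(top):
    """Yield (dirpath, names) pairs, never descending into symlinked
    directories, so a symlink pointing back up the tree cannot loop."""
    with os.scandir(top) as it:
        entries = list(it)
    yield top, [e.name for e in entries]
    for entry in entries:
        # follow_symlinks=False: a symlink to a directory reports False here,
        # so we only recurse into real directories.
        if entry.is_dir(follow_symlinks=False):
            yield from walk_no_symlinks(entry.path)

# Demo: build c/d with a symlink c/d/2 -> .. (a loop back to c)
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "c", "d"))
os.symlink("..", os.path.join(root, "c", "d", "2"))
visited = [path for path, _ in walk_no_symlinks(root)]
# The walk terminates instead of recursing through c/d/2 forever.
```

The trade-off is that legitimate symlinks to out-of-tree directories are also skipped, which is why the discussion below turns to detecting loops rather than refusing symlinks wholesale.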
I can see doing that with an OSFS instance (using readlink), but how would you do that for remote filesystems (such as an FTP) where the symlinks are traversed server-side?
It may not be possible on such filesystems if it's not exposed.
I can get the inode number from a stat call from the OS, but not for all filesystems. I was thinking of adding some kind of 'identity' info namespace, so the implementation can generate a unique key for files which I can use to figure out if it has been visited before.
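Where stat *is* available, the identity key could simply be the (device, inode) pair, which names the real underlying object however many symlink paths reach it. A sketch of a loop-safe walk on that basis (the helper names are illustrative, not a proposed API):

```python
import os
import tempfile

def identity_key(path):
    """A unique key for a filesystem object: device number plus inode
    identifies the real target even when several symlink paths reach it."""
    st = os.stat(path)  # follows symlinks, stats the target
    return (st.st_dev, st.st_ino)

def walk_once(top, seen=None):
    """Walk a tree, but skip any directory whose identity was seen before."""
    if seen is None:
        seen = set()
    key = identity_key(top)
    if key in seen:
        return  # already visited: a symlink loop (or hard-linked dir)
    seen.add(key)
    yield top
    with os.scandir(top) as it:
        for entry in it:
            if entry.is_dir():  # follows symlinks, so loops are reachable
                yield from walk_once(entry.path, seen)

# Demo: a loop a/b -> a is walked exactly once per real directory
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a"))
os.symlink(os.path.join(root, "a"), os.path.join(root, "a", "b"))
paths = list(walk_once(root))
```

This is exactly the information an FTP backend typically cannot provide, which is the gap the 'identity' namespace idea is trying to fill.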
The identity hash is an interesting idea. With very few datapoints, you could reasonably ensure uniqueness, without adding much overhead.
I do a similar thing with a JSON data feed I deliver. I pop out anything that is always different (in my example, a 'sent' date) and I MD5 `json.dumps(thing, sort_keys=True)`.
Maybe there is a simple hash that can be created without adding too much overhead. As it's only really a problem in the walker, and a walker already collects sizes etc.
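A minimal sketch of that kind of metadata hash, along the lines of the `json.dumps` approach above. The particular field choice is illustrative; the point is a stable, sorted serialization of metadata the walker already collects:

```python
import hashlib
import json
import os
import tempfile

def info_hash(path):
    """MD5 over walker-visible metadata.  Field choice is illustrative;
    any stable, sorted serialization of the same data would do."""
    st = os.lstat(path)  # lstat: hash the link itself, not its target
    info = {
        "name": os.path.basename(path),
        "size": st.st_size,
        "mtime": st.st_mtime,
        "ctime": st.st_ctime,
    }
    blob = json.dumps(info, sort_keys=True).encode("utf-8")
    return hashlib.md5(blob).hexdigest()

root = tempfile.mkdtemp()
p = os.path.join(root, "example.txt")
with open(p, "w") as f:
    f.write("hello")
h = info_hash(p)
```

The objection raised next still applies: two distinct directories can legitimately share all of these fields, so this is a heuristic, not an identity.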
My concern about that is that it is entirely possible to have folders with identical metadata. I'm tempted to say that if we can't tell for certain that a folder is a symlink, we shouldn't try to guess.
So for the FTP example, the only protection against a recursive symlink directory would be setting the max_depth parameter on the walker.
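A depth cap is easy to bolt onto any walker, and it bounds the damage even when the loop is entirely server-side and invisible to the client. A stdlib sketch of the idea (pyfilesystem2's own Walker exposes it as the `max_depth` argument):

```python
import os
import tempfile

def walk_capped(top, max_depth, depth=0):
    """Walk a tree, refusing to descend past max_depth levels, so even
    an undetectable symlink loop costs only a bounded amount of work."""
    yield top
    if depth >= max_depth:
        return
    with os.scandir(top) as it:
        for entry in it:
            if entry.is_dir():
                yield from walk_capped(entry.path, max_depth, depth + 1)

# Demo: a deep chain a/a/a/... is cut off at the cap
root = tempfile.mkdtemp()
d = root
for _ in range(10):
    d = os.path.join(d, "a")
    os.mkdir(d)
paths = list(walk_capped(root, max_depth=3))
```

The obvious downside is that the cap also truncates legitimately deep trees, so it is damage control rather than loop detection.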
Within a given tree (which is where I would expect this to cause issues) I'd wager it highly unlikely that 2 branches would share the exact same md5sum(name+atime+ctime+mtime+size) for every file and folder - especially if it were possible to somehow tag-in the 'parent' hash.
So have an identity hash for each element, then hash the hashes :)
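That "hash the hashes" idea is essentially a Merkle tree: each directory's hash covers its own metadata plus the sorted hashes of its children, so two branches only collide if their entire subtrees match. A minimal sketch (helper name and field choice are illustrative):

```python
import hashlib
import os
import tempfile

def tree_hash(path):
    """Merkle-style identity hash: MD5 of this entry's name and size
    plus the sorted hashes of all children (for real directories)."""
    st = os.lstat(path)
    parts = ["%s:%d" % (os.path.basename(path), st.st_size)]
    if os.path.isdir(path) and not os.path.islink(path):
        parts.extend(sorted(tree_hash(os.path.join(path, name))
                            for name in os.listdir(path)))
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "x"))
with open(os.path.join(root, "x", "f"), "w") as fh:
    fh.write("data")
h1 = tree_hash(root)
# Any change below propagates up through the parent hashes
with open(os.path.join(root, "x", "f"), "w") as fh:
    fh.write("DATA!")
h2 = tree_hash(root)
```

Computing this requires visiting the whole subtree first, of course, which is the "total overkill" the next comment is getting at.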
Of course, total overkill if you just have a follow_symlinks options that can be set to false and rely on the OS to throw too many levels if it goes too deep :)
On Wed, Jun 27, 2018 at 1:42 PM, Geoff Jukes [email protected] wrote:

> Within a given tree (which is where I would expect this to cause issues) I'd wager it highly unlikely that 2 branches would share the exact same md5sum(name+atime+ctime+mtime+size) for every file and folder - especially if it were possible to somehow tag-in the 'parent' hash.
Perhaps a non-default option, `recursion_heuristic=False`? I probably wouldn't use it myself, though.
-- Dan Stromberg