goofys
Add generation of inode numbers based on the file path.
Pretend we have a filesystem.
`/a` `/a/b` `/a/c` `/a/d`

`/a` will always be inode 1. If I access `/a/b` second, it will have inode 2; if I access `/a/c` second instead, it will have inode 2, and so on.
Now it generates a unique ID for each entry under this mount point, based on the file path.
what's the motivation for this? things like rename will still mess things up anyway
motivation is to have a persistent inode so that a scanner operating on the filesystem will always get the same inodes for the same objects no matter the order. rename/move changing things is a livable condition (not dissimilar to moving a file outside of a filesystem).
What is this filesystem scanner trying to ascertain by looking at the inode number?
So this PR creates a "strange" side-effect:
after `rm a && touch a`, the inode number will stay the same
And you still can have unstable inode numbers if:
- rename mentioned above
- hash collision (64 bit seems large but don't forget the birthday paradox)
(also apologies for never getting around to answering your email @dougstarfish)
I have a scanner which uses a change in inode (associated with a path), among other things, as an indication that the file has changed. A change in path with a change in inode is fine, since either of those would constitute a "fairly" determined new version of the file. Consider an editor like vi or emacs: it changes the inode of a file while leaving the path the same, but it's a new version of the object. A change in mtime, size, or owner would also constitute a new version of the file for purposes of keeping history. Since we keep a history of changes, inodes changing on every mount is a "bad thing" (for us) because it results in constant churn in the database depending on the access order during a crawl.

Re hash collision: yes, but 2^64 is a really, really large number. Even with the birthday paradox (sqrt(2^64)) it's still quite a bit beyond the number of files (or objects) that most people have in a filesystem/bucket (~4 billion objects). It's worth the risk. And even if there is a collision, having a different path with the same inode will result in weirdness, but not in fatality. We might lose a file from a bucket (and log it) in such a scenario and have to come up with a workaround.

This is a big step in the right direction for our use case and lets us index large buckets of things as if they were normal POSIX filesystems.

PS - this is super fast and I love the implementation. This one little enhancement makes it really useful for scanning a bucket and eliminating version churn in subsequent scans.

PPS - we're all busy, but thanks for acknowledging. :) I added the hash -> modulo inode generator algorithm and SBoy integrated it into Fuse/Go for me, with some tweaks.

PPPS - (I) might want to change the hash algo to xxHash, Blake2, or HighwayHash, but the md5 module was easy and available. (Haven't benchmarked the impact on scanning speed yet.)
I think a really good way to use this might be to have a --stable-inode option, by the way. Just a thought.
what do you think about the approach suggested in https://github.com/kahing/goofys/issues/419 ? From my perspective both of these asks are valid but also strange, and I'd rather have one strange way than 2 strange ways (not that the current inode allocation isn't strange)
419 is going to have the same problem with persistency between mounts and access order. I'm not sure if he realizes that. I had started thinking about using Etag, but then, as you said, things with the same content would have the same inode. Maybe that wouldn't be so bad, but you'd then probably have to keep track of inode use to increase hardlink count for it to be sensible?
This solution using an fnv hash might meet his needs, since he explicitly mentions he doesn't want things with the same content to have the same inode (much as I thought). I don't know if there is a way to both keep the inode after a move and have a unique, persistent inode at the same time. There just isn't enough information from S3 to make that practical, I think.

> Alternatively, maybe we can introduce a flag that enables inodeId creation based on hashing a (Etag, size) tuple, rather than incrementing. This would effectively hardlink any files with same (etag, size) - which for some would be a bad idea, but for others, it may be fine.

That would be easy and fine for some use cases, but I think you'd then have to increment the link count, which means allocating a lot of live memory to keep track of the collisions? You're spot on with du, which would make it impossible to infer any reasonable billing, if that's a goal.
playing devil's advocate a bit:
- why would #419 have the same problem with access order? it would be based on etags, which would stay the same
- it's not clear if maintaining a correct link count is necessary; either way it should not be more expensive than keeping the inodes anyway
Good question. I think I read the top part wrong the first time and missed the part about caching the dynamically generated inode after the first execution. Seems like a sizeable cache to maintain. sqlite, perhaps?
I wonder if he could live without a move keeping the inode? If not, I wonder if a hybrid hash approach would work: a move results in populating a hash (DB) like `retain[newpath] = originode`; then you could check whether the path is in `retain` and use that originode, otherwise the inode would be `hash % 2^63 + 1`. That would at least constrain the size of the DB by quite a bit, since only moved objects would need a lookup. Thoughts?
I don't understand why a cache is needed, could you not hash(etag) instead of hash(filename) like this PR?
if you use the etag then all of the things with the same contents have the same inode. It's worth considering. The questions I would have about that are:
- what happens if two identical content objects with different names end up in the same directory? What does ls do? Is it normal?
- what standard unix tools might get confused (du? don't know)
- what happens when you do find with `-inum`? I suppose it would work, but it'd be interesting to confirm.
For my use case I'm not sure how it would behave if the hardlink count was always 1 but the same inode showed up elsewhere in the same filesystem. It's certainly testable.
you can have a hard link in the same directory but yes the link count would be wrong. I imagine that's fine most of the time since goofys doesn't return correct link count for directories anyway :-P
Got a chance to test with the hash-based inode generation. It doesn't seem to impact performance appreciably. I got 18732 stats/sec on a 25 million file bucket, over a latency of 17ms to the bucket, with a parallel walk. Quite acceptable! (Faster than any Isilon I've measured so far, fwiw.) Thought you might be interested.
Here's a thought on 419 I just had after playing around with things. How would you do directories? Directories do not have an etag, so you'd need a different mechanism that is persistent but based on some other criteria (a cache? a local DB?). The path-based mechanism works with directories. (Though I have a small issue with the atime/mtime/ctime on directories changing between mounts; not sure there's anything I can do about it, since they aren't real.)
I am not quite convinced that having a static inode number based only on the file path is the right thing to do. There's some notion of "sameness" that people expect when two files have the same inode number, which this PR breaks (`rm a && touch a`). It is true that the status quo also doesn't have that semantic, but it is currently quite obvious that the inode numbers are not stable.
because of that, if we want to make inode numbers stable, I prefer hashing the etag over the file name, and falling back to hashing the filename for directories (and perhaps also for empty files, since they also always have the same etag, and that is another common case where this "sameness" may break down).
For my purposes at least, treating 2 files with different names but the same content as hardlinked would suffice. This is essentially what you would get if the inode is based on the Etag hash.
> 419 is going to have the same problem with persistency between mounts and access order. I'm not sure if he realizes that.
My specific application does not need to keep track of files across remounts. But I recognize that as a significant shortcoming that could catch many people off guard.
> Seems like a sizeable cache to maintain
My purpose for the cache is specifically to detect renaming of files. For that purpose, it would not be necessary to keep deleted inodes around for very long (~5 min). goofys already has a cache of all files listed by the user.
> what happens if two identical content objects with different names end up in the same directory? What does ls do? Is it normal?

They simply exist as 2 hardlinked files in the same dir. Apart from hardlinking being unusual nowadays, this does not seem too exotic.
> what standard unix tools might get confused (du? don't know)

Let's see:
```
user@localhost:~/inode-test$ ls -lhi
total 1.0M
3426614 -rw-r--r-- 2 user user 512K Jun 11 15:13 file-a
3426615 lrwxrwxrwx 1 user user    6 Jun 11 15:13 file-b -> file-a
3426614 -rw-r--r-- 2 user user 512K Jun 11 15:13 file-c
user@localhost:~/inode-test$ du -h
516K    .
```
du reports 512K for file-a/file-c which are hardlinked and 6 bytes for the symlink.
> How would you do directories? Directories do not have an etag, so you'd have to have a different mechanism that is both persistent but also based upon some other criteria to maintain persistence.
Hardlinking directories is not allowed in POSIX, so there is no need for different directories to have the same inode. Here, hashing the absolute path would be enough to ensure persistence. We do run into the issue of renaming causing the inode to change, but I consider that a minor issue. Possibly we could reserve inode > 8xxxxxx for directories and <= 8xxxxxx for files to avoid collisions.
I think etag will probably work for my purposes, pending how we deal with multiple things having the same inode number in the same filesystem but st_nlink = 1. Might just be fine.
is this re-born as a result of killing the other etag-based one? (I have confirmed a variation of this also works with Azure backend)
I do intend to revisit this, but my current free time is going to be dedicated to releasing the multiple backend work first. Everything is mostly there, but figuring out how to do things like CI takes time.
@kahing , just getting back to this patch. I think it would be worth doing it with the --stable-inode type of flag. We can name it something else if you want. We have been running with this for quite some time and it is stable and appears to work. (At least for our needs).