goofys
Add generation of inode numbers based on the file path.
Pretend we have a filesystem.
`/a` `/a/b` `/a/c` `/a/d`

`/a` will always be inode 1. If I access `/a/b` second, it will have inode 2; if I access `/a/c` second instead, it will have inode 2, and so on.
Now it generates a unique ID for each entry under this mount point, based on the file path.
what's the motivation for this? things like rename will still mess things up anyway
motivation is to have a persistent inode so that a scanner operating on the filesystem will always get the same inodes for the same objects no matter the order. rename/move changing things is a livable condition (not dissimilar to moving a file outside of a filesystem).
What is this filesystem scanner trying to ascertain by looking at the inode number?
So this PR creates a "strange" side-effect:
after `rm a && touch a`, the inode number will stay the same
And you still can have unstable inode numbers if:
- rename mentioned above
- hash collision (64 bit seems large but don't forget the birthday paradox)
(also apologies for never getting around to answering your email @dougstarfish)
I have a scanner which uses a change in inode (associated with a path), among other things, as an indication that the file has changed. A change in path with a change in inode is fine, since either of those would constitute a "fairly" determined new version of the file. Consider an editor like vi or emacs: it changes the inode of a file while leaving the path the same, but it's a new version of the object. A change in mtime, size, or owner would also constitute a new version of the file for purposes of keeping history. Since we keep a history of changes, inodes changing on every mount is a "bad thing" (for us) because it results in constant churn in the database depending on the access order during a crawl.

Re hash collision: yes, but 2^64 is a really, really large number. Even with the birthday paradox (sqrt(2^64)) it's still quite a bit beyond the number of files (or objects) that most people have in a filesystem/bucket (~4 billion objects). It's worth the risk. And even if there is a collision, having a different path with the same inode will result in weirdness, but not in fatality. We might lose a file from a bucket (and log it) in such a scenario and have to come up with a workaround.

This is a big step in the right direction for our use case and lets us index large buckets of things as if they were normal POSIX filesystems.

PS - this is super fast and I love the implementation. This one little enhancement makes it really useful for scanning a bucket and eliminating version churn in subsequent scans.

PPS - we're all busy, but thanks for acknowledging. :) I added the hash -> modulo inode generator algorithm and SBoy integrated it into Fuse/Go for me, with some tweaks.

PPPS - (I) might want to change the hash algo to xxHash, Blake2, or HighwayHash, but the md5 module was easy and available. (Haven't benchmarked the impact on scanning speed yet.)
I think a really good way to use this might be to have a --stable-inode option, by the way. Just a thought.
what do you think about the approach suggested in https://github.com/kahing/goofys/issues/419 ? From my perspective both of these asks are valid but also strange, and I'd rather have one strange way than 2 strange ways (not that the current inode allocation isn't strange)
419 is going to have the same problem with persistency between mounts and access order. I'm not sure if he realizes that. I had started thinking about using Etag, but then, as you said, things with the same content would have the same inode. Maybe that wouldn't be so bad, but you'd then probably have to keep track of inode use to increase hardlink count for it to be sensible?
This solution using an fnv hash might meet his needs, since he explicitly mentions he doesn't want things with the same content to have the same inode (much as I thought). I don't know if there is a way to both keep the inode after a move and have a unique, persistent inode at the same time. There just isn't enough information from S3 to make that practical, I think.

> Alternatively, maybe we can introduce a flag that enables inodeId creation based on hashing a (Etag, size) tuple, rather than incrementing. This would effectively hardlink any files with same (etag, size) - which for some would be a bad idea, but for others, it may be fine.

That would be easy and fine for some use cases, but I think you'd then have to increment the link count, which means allocating a lot of live memory to keep track of the collisions? You're spot on with du, which would make it impossible to infer any reasonable billing, if that's a goal.
playing devil's advocate a bit:
- why would #419 have the same problem with access order? it would be based on etags, which would stay the same
- it's not clear if maintaining a correct link count is necessary; either way it should not be more expensive than keeping the inodes anyway
Good question. I think I read the top part wrong the first time and missed the part about caching the dynamically generated inode after the first execution. Seems like a sizeable cache to maintain. sqlite, perhaps?
I wonder if he could live without a move keeping the inode? If not, I wonder if a hybrid hash approach would work: a move results in populating a hash (DB) like `retain[newpath] = originode`; then you could check whether the path is in `retain` and use that originode, otherwise the inode would be `hash % 2^63 + 1`. That would at least constrain the size of the DB by quite a bit, since only moved objects would need a lookup. Thoughts?
I don't understand why a cache is needed, could you not hash(etag) instead of hash(filename) like this PR?
if you use the etag then all of the things with the same contents have the same inode. It's worth considering. The questions I would have about that are:
- what happens if two identical content objects with different names end up in the same directory? What does ls do? Is it normal?
- what standard unix tools might get confused (du? don't know)
- what happens when you do find with `-inum`? I suppose it would work, but it'd be interesting to confirm.
For my use case I'm not sure how it would behave if the hardlink count was always 1 but the same inode showed up elsewhere in the same filesystem. It's certainly testable.
you can have a hard link in the same directory but yes the link count would be wrong. I imagine that's fine most of the time since goofys doesn't return correct link count for directories anyway :-P
Got a chance to test with the hash-based inode generation. It doesn't seem to impact performance appreciably. I got 18732 stats/sec on a 25 million file bucket, over a latency of 17ms to the bucket, with a parallel walk. Quite acceptable! (Faster than any Isilon I've measured so far, fwiw.) Thought you might be interested.
Here's a thought on 419 I just had after playing around with things. How would you do directories? Directories do not have an etag, so you'd need a different mechanism that is persistent but based on some other criteria (a cache? a local DB?). The path-based mechanism works with directories. (Though I have a small issue with the atime/mtime/ctime on directories changing between mounts; not sure there's anything I can do about it, since they aren't real.)
I am not quite convinced that having a static inode number based only on the file path is the right thing to do. There's some notion of "sameness" that people expect when two files have the same inode number, which this PR breaks (`rm a && touch a`). It is true that the status quo also doesn't have that semantic, but it is currently quite obvious that the inode numbers are not stable.
because of that, if we want to make inode numbers stable, I prefer hashing the etag over the file name, and falling back to hashing the filename for directories (and perhaps also for empty files, since they also always have the same etag, and that is another common case where this "sameness" may break down).
For my purposes at least, treating 2 files with different names but the same content as hardlinked would suffice. This is essentially what you would get if the inode is based on the Etag hash.
> 419 is going to have the same problem with persistency between mounts and access order. I'm not sure if he realizes that.
My specific application does not need to keep track of files across remounts. But I recognize that as a significant shortcoming that could catch many people off guard.
> Seems like a sizeable cache to maintain
My purpose for the cache is specifically to detect renaming of files. For that purpose, it would not be necessary to keep deleted inodes around for very long (~5 min). goofys already has a cache of all files listed by the user.
> what happens if two identical content objects with different names end up in the same directory? What does ls do? Is it normal?

They simply exist as 2 hardlinked files in the same dir. Apart from hardlinking being unusual nowadays, this does not seem too exotic.
> what standard unix tools might get confused (du? don't know)

Let's see:
```
user@localhost:~/inode-test$ ls -lhi
total 1.0M
3426614 -rw-r--r-- 2 user user 512K Jun 11 15:13 file-a
3426615 lrwxrwxrwx 1 user user    6 Jun 11 15:13 file-b -> file-a
3426614 -rw-r--r-- 2 user user 512K Jun 11 15:13 file-c
user@localhost:~/inode-test$ du -h
516K    .
```
du reports 512K for file-a/file-c which are hardlinked and 6 bytes for the symlink.
> How would you do directories? Directories do not have an etag, so you'd have to have a different mechanism that is both persistent but also based upon some other criteria to maintain persistence.
Hardlinking directories is not allowed in POSIX, so there is no need for different directories to have the same inode. Here, hashing the absolute path would be enough to ensure persistence. We do run into the issue of renaming causing the inode to change, but I consider that a minor issue. Possibly we could reserve inode > 8xxxxxx for directories and <= 8xxxxxx for files to avoid collisions.
I think etag will probably work for my purposes, pending how we deal with multiple things having the same inode number in the same filesystem but st_nlink = 1. Might just be fine.
is this re-born as a result of killing the other etag-based one? (I have confirmed a variation of this also works with Azure backend)
I do intend to revisit this, but my current free time is going to be dedicated to releasing the multiple backend work first. Everything is mostly there, but figuring out how to do things like CI takes time.
@kahing , just getting back to this patch. I think it would be worth doing it with the --stable-inode type of flag. We can name it something else if you want. We have been running with this for quite some time and it is stable and appears to work. (At least for our needs).