
Enhance `mkdwarfs` to support specifying a list of files to include (similar to `cpio`)

cipriancraciun opened this issue on Nov 29 '20

A very nice feature of cpio (where it is actually the only mode of operation) and of tar (via --files-from) is the option to specify a list of files to include, instead of recursing through the root folder.

Such a feature would allow one to easily exclude certain files from the source without having to resort to, for example, rsync to build a temporary tree.

This could work in conjunction with -i as follows: any file in the list is treated as relative to the -i folder, regardless of whether it starts with /, ./, or a plain path. Also warn if one tries to traverse outside the -i folder. For example, given that -i source is used (a sketch of this normalization follows the list):

  • whatever is actually source/whatever;
  • ./whatever is the same as above;
  • /whatever is the same as above;
  • ../whatever would issue an error as it tries to escape the source;
  • a/b/../../c is actually source/c, although it could issue a warning;
  • /some-folder (given it is a folder) would not be recursed into; only the folder itself is created within the resulting image (it is assumed that one would add other files beneath it afterwards);
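
A minimal sketch of this normalization in C++ with std::filesystem (illustrative only, not dwarfs code):

#include <filesystem>
#include <iostream>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Resolve a list entry against the -i root: drop a leading '/',
// fold '.' and '..' segments, and reject anything escaping the root.
fs::path resolve_entry(const fs::path& root, const std::string& entry) {
  fs::path p = fs::path(entry).relative_path().lexically_normal();
  if (p.empty() || *p.begin() == "..") {
    throw std::runtime_error("path escapes the -i folder: " + entry);
  }
  return root / p;
}

int main() {
  const fs::path root{"source"};
  std::cout << resolve_entry(root, "/whatever") << '\n';   // "source/whatever"
  std::cout << resolve_entry(root, "./whatever") << '\n';  // "source/whatever"
  std::cout << resolve_entry(root, "a/b/../../c") << '\n'; // "source/c"
  // resolve_entry(root, "../whatever");                   // would throw
}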

Also, it would be nice to have an option to accept a zero-terminated file list instead of a newline-separated one.


The above could be quite simple to implement; however, an even more useful option would be something like this:

  • in the Linux kernel there is a small tool, gen_init_cpio.c (https://github.com/torvalds/linux/blob/master/usr/gen_init_cpio.c#L452), which takes a file describing how a cpio archive (to be used for the initramfs) should be created (see the source code at the linked line for the file syntax); thus, in addition to the previous feature of file-lists, such a "file-system" descriptor would allow one to create, without root privileges on one's machine, a file-system with any layout; (an example of the descriptor syntax is sketched after this list;)
  • as an extension to the above, perhaps JSON would be a better choice; :)
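
For reference, the descriptor syntax understood by gen_init_cpio looks roughly like this (adapted from the usage text in the kernel source; the left-hand paths are locations inside the image):

dir /dev 755 0 0
nod /dev/console 644 0 0 c 5 1
slink /init /sbin/init 777 0 0
file /sbin/init usr/init 755 0 0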

cipriancraciun · Nov 29 '20, 22:11

I'll take that as an item for the wishlist :)

What you can already do (although it's undocumented): you can build the whole thing with cmake -DWITH_LUA=1. This will give you the option of supplying a script where you can implement a filter function:

$ cat perl.lua
-- keep every entry unless it is a regular file named "libperl.a"
function filter(f)
  return f.type ~= 'file' or f.name ~= "libperl.a"
end

This is really experimental and it currently depends on luabind, so I'd better leave it undocumented for now... ;)

I originally added Lua support to be able to supply an ordering function, but in the long run it turned out that similarity based ordering was beating every attempt I made at coming up with a better manual ordering algorithm.

mhx · Nov 29 '20, 23:11

I originally added Lua support to be able to supply an ordering function, but in the long run it turned out that similarity based ordering was beating every attempt I made at coming up with a better manual ordering algorithm.

Hmm... This is a good point. In fact, if one uses none as the ordering, the tool should just use the order given in the file-list, thus allowing one to fine-tune the ordering to best fit their use-case.

(Something similar to what I propose in #8. For example, put first all the .py files, then the .html / .css / .js files, then images, etc.)
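
For example, assuming a hypothetical --input-list option and an --order=none mode (both names are illustrative at this point), the ordering could be fully controlled from the outside:

$ (cd source && find . -type f | my-sort-by-extension) > files.txt   # my-sort-by-extension is a hypothetical helper
$ mkdwarfs -i source --input-list=files.txt --order=none -o image.dwarfs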

cipriancraciun · Nov 29 '20, 23:11

It would be great to have a simple exclude option as well.

tpwrules · Sep 24 '22, 20:09

It's been a while...

The wip branch now includes support for simple rsync-like filter options. The rules are not compatible with rsync's, but they operate in a similar manner and are hopefully a bit easier to use. E.g., to exclude all *.bak files:

-F '- *.bak'

That's simple enough. To include only *.so:

-F '+ *.so' -F '- *'

That's a bit easier than with rsync, as far as I'm aware.

The rules are described here. There's also --debug-filter so you can optimize your rule set without going through a full file system build every time. Reading rules from files is also supported via -F '. rules'. This works recursively.
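
For example, a rules file for the *.so case above might look like this (assuming one rule per line; the file name is arbitrary):

$ cat rules
+ *.so
- *
$ mkdwarfs -i lib -o lib.dwarfs -F '. rules'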

This is completely experimental and not covered by any tests yet. :)

I'd appreciate your feedback on whether this helps with your use cases.

mhx · Oct 29 '22, 13:10

Thanks for getting back on this.

Indeed rsync-style filters would be extremely useful.

However, for some complex cases it might not be enough to express the inclusion / exclusion logic, and a way to explicitly say which files to include would be better. (For example, say I want to exclude files based on properties other than file names, such as ownership, size, or date.)

Moreover, if one is able to state explicitly which files to use, the user can easily see (and debug) their inclusion / exclusion logic. (I use this often in backup scenarios to make sure I've included / excluded everything I wanted.)
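
With an explicit file list, such property-based selection could be delegated entirely to existing tools, e.g. GNU find (again assuming the hypothetical file-list option, with paths relative to the -i folder):

$ (cd source && find . -type f -uid 1000 -size -1M) > files.txt
$ mkdwarfs -i source --input-list=files.txt -o image.dwarfs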


Granted, one could generate that list and transform it into a long list of + /some-path/some-file trailed by - **, however that would be tedious (one must also escape paths that might contain * or other special characters; is that possible in today's rules?).

Also, given the rules description (linear iteration over the rules until a matching one is found), for lots of files this will result in n^2 complexity (one generated rule per file, checked against n files).


And as a final point, if mkdwarfs were to read the given list of files without any additional sorting, the user could fine-tune the list either for faster IO read rates (sorting input files by their location on disk, which can easily be done on Ext4), or for better compression (sorting similar files together, for example by extension).

cipriancraciun · Nov 01 '22, 07:11

However, for some complex cases it might not be enough to express the inclusion / exclusion logic, and a way to explicitly say which files to include would be better. (For example, say I want to exclude files based on properties other than file names, such as ownership, size, or date.)

I was actually already thinking of extending the rules to support this, e.g. something like:

+size<1m,uid=1000|2000 *
-type=file|link foo/bar

While I like the idea, I'm not convinced yet that this is exactly how I'd want it to look.

Granted, one could generate that list and transform it into a long list of + /some-path/some-file trailed by - **, however that would be tedious

Hah, true, one could definitely do that. However, the order of the files would not be affected by the order of the list.

one must also escape paths that might contain * or other special characters; is that possible in today's rules?

Yes, this is possible.

Also, given the rules description (linear iteration over the rules until a matching one is found), for lots of files this will result in n^2 complexity (one generated rule per file, checked against n files).

Yeah, this is probably suboptimal.

And as a final point, if mkdwarfs were to read the given list of files without any additional sorting, the user could fine-tune the list either for faster IO read rates (sorting input files by their location on disk, which can easily be done on Ext4), or for better compression (sorting similar files together, for example by extension).

Yep, I'm not saying this wouldn't be useful. It just requires some different processing in a few places to work and I haven't fully wrapped my head around it yet. I'm trying very hard to not have too many alternative branches in the core logic, some of which would (I assume) be only rarely executed.

Right now input file discovery is based on a simple top-down recursive directory scan. This means the remainder of the code can rely on a) each parent directory will be visited before its children, b) each entry will be visited exactly once, c) each entry exists (assuming it's not deleted/moved/renamed during the scan). All these guarantees are unaffected by filtering.

With a list of input files, a few more sanity checks are needed, and parent directories need to be created on the fly if they don't exist yet.

mhx · Nov 01 '22, 09:11

I was actually already thinking of extending the rules to support this, e.g. something like:

+size<1m,uid=1000|2000 *
-type=file|link foo/bar

I believe supporting such a syntax would be quite involved on your part, and I bet it will be very hard to cover all corner-cases or even come close to the flexibility of a real programming language.

(So my suggestion is to keep the inclusion / exclusion rules simple, and do provide a way to explicitly state which files should be archived.)


Right now input file discovery is based on a simple top-down recursive directory scan.

If you manage to decouple the scanning from the archiving, in the end you can even parallelize the scanning. (In my own experiments, especially with network-based file-systems, if you implement parallelization for file reading the new bottleneck becomes the file-system scanning.)

This means the remainder of the code can rely on a) each parent directory will be visited before its children,

This check could be easily implemented with a hash-set of previously visited paths: just check whether the parent is present, and exit with a failure otherwise. (I.e., by specification of the input file list, directories must always be present, and must always come before their children; a sketch of both checks follows below.)

b) each entry will be visited exactly once,

The same hash-set as above could be used; fail hard otherwise.
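
A minimal sketch of both checks in C++ (names are illustrative; this is not the actual dwarfs code):

#include <filesystem>
#include <stdexcept>
#include <string>
#include <unordered_set>

namespace fs = std::filesystem;

// Enforce the two invariants on a list entry:
// a) its parent directory must have been listed earlier,
// b) no entry may be listed twice.
void check_entry(std::unordered_set<std::string>& seen, const fs::path& entry) {
  const auto parent = entry.parent_path();
  if (!parent.empty() && !seen.count(parent.string())) {
    throw std::runtime_error("parent not listed before child: " + entry.string());
  }
  if (!seen.insert(entry.string()).second) {
    throw std::runtime_error("duplicate entry: " + entry.string());
  }
}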

c) each entry exists (assuming it's not deleted/moved/renamed during the scan).

This guarantee doesn't exist even today, because as you mention it could have been deleted from the file-system after the scanning and before the reading.


However, I do understand that implementing such a feature is more complex than implementing the simple filter rules.

That's why, in my initial comment, I also added:

The above could be quite simple to implement, however an even more useful option would be something like this:

  • in the Linux kernel there is a small tool, gen_init_cpio.c (https://github.com/torvalds/linux/blob/8f71a2b3f435f29b787537d1abedaa7d8ebe6647/usr/gen_init_cpio.c#L496), which takes a file describing how a cpio archive (to be used for the initramfs) should be created (see the source code at the linked line for the file syntax); thus, in addition to the previous feature of file-lists, such a "file-system" descriptor would allow one to create, without root privileges on one's machine, a file-system with any layout;
  • as an extension to the above, perhaps JSON would be a better choice; :)

So, perhaps in the end it would be more flexible to implement something as hinted above:

  • currently, after scanning the file-system, I'm assuming you are building some internal "file records" that state what the file-metadata is (owner, permissions, timestamps, etc.) and its path;
  • how about extending that file-metadata to include both the input path and the output path, which could differ; (currently the output path is a suffix of the input path;)
  • then one could deserialize these file-records from, say, JSON; (a sketch follows after this list;)
  • the user could now provide you with these file-records, and thus "synthesize" a source for mkdwarfs without actually having it on the disk (the actual data must still be present);
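
A hypothetical file-record in JSON could look like this (all field names are invented for illustration):

{
  "output": "/app/lib/module.py",
  "input": "/home/user/build/module.py",
  "type": "file",
  "mode": "0644",
  "uid": 0,
  "gid": 0,
  "mtime": 1667347200
}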

Where could one use such a feature? For example to easily build container-like backing-file-systems. Or file-systems that contain the same data in different "views" (imagine files having multiple categories, and each category would be transformed into a folder).

cipriancraciun · Nov 02 '22, 08:11

I believe supporting such a syntax would be quite involved on your part, and I bet it will be very hard to cover all corner-cases or even come close to the flexibility of a real programming language.

(So my suggestion is to keep the inclusion / exclusion rules simple, and do provide a way to explicitly state which files should be archived.)

Agreed, especially since I don't see a way to make this nice/intuitive.

Right now input file discovery is based on a simple top-down recursive directory scan.

If you manage to decouple the scanning from the archiving, in the end you can even parallelize the scanning. (In my own experiments, especially with network-based file-systems, if you implement parallelization for file reading the new bottleneck becomes the file-system scanning.)

This is pretty much already the way it's implemented.

This means the remainder of the code can rely on a) each parent directory will be visited before its children, b) each entry will be visited exactly once, c) each entry exists (assuming it's not deleted/moved/renamed during the scan).

Okay, scratch all that. It turns out my code was way more flexible than I expected. :)

I added a bit of code to "autovivify" directories and everything else just seems to fall right into place. Even keeping the order of the input files "just works". ~~As some sort of weird bonus, you can now even combine this with filters.~~ I've not yet tested all of this properly, but something like this seems to work just fine:

$ find fmtlib | sort -R | ./mkdwarfs -i . --input-list=- -o /dev/null -f
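
Conceptually, the autovivification just walks each entry's parent chain and creates any directories that haven't been seen yet; a minimal sketch (not the actual dwarfs code):

#include <filesystem>
#include <set>

namespace fs = std::filesystem;

// Insert an entry into an in-memory tree, creating ("autovivifying")
// any missing ancestor directories along the way.
void add_with_ancestors(std::set<fs::path>& tree, const fs::path& entry) {
  for (fs::path p = entry.parent_path(); !p.empty(); p = p.parent_path()) {
    tree.insert(p); // no-op if the directory is already known
  }
  tree.insert(entry);
}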

However, I do understand that implementing such a feature is more complex than implementing the simple filter rules.

Surprisingly, it's not. Filters were probably more complicated overall.

So, perhaps in the end it would be more flexible to implement something as hinted above:

  • currently, after scanning the file-system, I'm assuming you are building some internal "file records" that state what the file-metadata is (owner, permissions, timestamps, etc.) and its path;
  • how about extending that file-metadata to include both the input path and the output path, which could differ; (currently the output path is a suffix of the input path;)
  • then one could deserialize these file-records from, say, JSON;
  • the user could now provide you with these file-records, and thus "synthesize" a source for mkdwarfs without actually having it on the disk (the actual data must still be present);

Where could one use such a feature? For example to easily build container-like backing-file-systems. Or file-systems that contain the same data in different "views" (imagine files having multiple categories, and each category would be transformed into a folder).

This is definitely a nice idea, but I think it's much more involved than just passing in a list of files.

mhx · Nov 06 '22, 17:11

Love this feature for testing, btw.

find wikipedia -type f \
    | perl -nle'rand() < 0.01 and print' | rev | sort | rev \
    | mkdwarfs --input-list=- -o sample.dwarfs
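
(That is: randomly sample roughly 1% of the files, then group files with similar suffixes together by sorting on the reversed paths before feeding the list to mkdwarfs.)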

mhx · Nov 07 '22, 10:11

Please check out the v0.7.0-RC1 release candidate: https://github.com/mhx/dwarfs/releases/tag/v0.7.0-RC1

mhx · Nov 08 '22, 13:11

I guess the main issue (allowing a list of files to be specified) is resolved now.

The idea of providing a way to describe how to build the file system (like gen_init_cpio.c) is great and I actually even like the syntax. However, this is much more of a special use case and not as trivial to implement as a simple input list.

I'll close this issue, but I'll open a discussion tracking the gen_init_cpio.c idea.

mhx · Nov 20 '22, 13:11