
dfind: integrate into io-500

Open gonsie opened this issue 6 years ago • 5 comments

The IO-500 allows for the use of any find utility, which is the perfect opportunity to show off dfind. We just need to make sure it can emulate the following find arguments:

find $data_dir -newer $timestamp_file -size 3901c -name *01*

gonsie, Jan 30 '19 22:01

In theory, dfind should support that as

dfind --newer $timestamp_file --size 3901 --name '*01*' $data_dir

adammoody, Jan 30 '19 23:01

I agree that getting a good parallel find included in IO-500 would be very useful. The current approach is moving toward "implement a special-purpose program that just runs the IO-500 specific find command", which I don't think benefits end users the way making dfind perform well would.

If dfind.c is changed to use getopt_long_only() instead of getopt_long() to parse the command-line options, it will accept options with a single dash, like -newer, -size, and -name. That makes it easier to use dfind as a drop-in replacement for find. Also, GNU getopt already allows options and non-option arguments to be given in a non-standard order, so specifying the command like:

dfind $data_dir -newer $timestamp_file -size 3901 -name '*01*'

should already work today. GNU getopt will parse all of the - options first, then put any non-options ($data_dir in this case) at the end for further processing.
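As a rough illustration of the behavior described above, here is a minimal sketch of parsing the three IO-500 tests with getopt_long_only(). It is not the actual dfind.c code: the struct find_args type, the parse_find_args() function, and the option codes 'N', 's', and 'n' are made up for this example. It also assumes GNU-style argument permutation, which is disabled when POSIXLY_CORRECT is set.

```c
#include <getopt.h>
#include <stddef.h>

/* Hypothetical container for the three tests in the IO-500 find command. */
struct find_args {
    const char *newer;  /* argument of -newer / --newer */
    const char *size;   /* argument of -size  / --size  */
    const char *name;   /* argument of -name  / --name  */
    const char *path;   /* first non-option argument    */
};

/* Parse argv with getopt_long_only(); returns 0 on success, -1 on error. */
int parse_find_args(int argc, char **argv, struct find_args *out)
{
    static const struct option long_opts[] = {
        { "newer", required_argument, NULL, 'N' },
        { "size",  required_argument, NULL, 's' },
        { "name",  required_argument, NULL, 'n' },
        { NULL, 0, NULL, 0 }
    };
    int c;

    out->newer = out->size = out->name = out->path = NULL;

    /* Empty short-option string: only the long names above are recognized,
     * and getopt_long_only() accepts them with either "-" or "--". */
    while ((c = getopt_long_only(argc, argv, "", long_opts, NULL)) != -1) {
        switch (c) {
        case 'N': out->newer = optarg; break;
        case 's': out->size  = optarg; break;
        case 'n': out->name  = optarg; break;
        default:  return -1;  /* unknown option or missing argument */
        }
    }

    /* After scanning, the permuted non-options start at argv[optind]. */
    if (optind < argc)
        out->path = argv[optind];
    return 0;
}
```

Called with the arguments `dfind /data -newer ts --size 3901 -name '*01*'`, this sketch should fill in all four fields even though the path comes first and the dash styles are mixed.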

The one potential issue of getopt_long_only() is that it doesn't allow compound short options (e.g. dfind -amc). However, that doesn't make sense for find anyway, because all options need an argument except -p, so there is no real drawback in this case.

adilger, Feb 08 '19 22:02

To make it as easy as possible for users to start using dfind, I think dfind should use options that are as similar as possible to those of find. Thus, the following option style looks better to me:

dfind $data_dir -newer $timestamp_file -size 3901c -name *01*

It would be nice if people could just replace "find" with "dfind" in all find commands and get better performance without any change to the command options.

If we read a little of the findutils code, we see that the option parsing of find is slightly different from a simple getopt_long_only() call. It first parses all leading options ([-H] [-L] [-P] [-D debugopts] [-Olevel]). In the second step, it collects all arguments that look like directories by checking whether each one starts with '-' or not. The final step is parsing the expression options.

To keep the options similar, dfind could follow a similar parsing process. The options of find are well known to Unix users, so it would benefit dfind to follow the same style.

It would also be nice if dfind could support the -exec option. I think it would be very useful.

I am not sure whether implementing support for "expr1 -and expr2" and "expr1 -or expr2" is an urgent requirement. It would be nice to have, but parsing the options and building the expression structure is complex. In my personal experience, not many people use complex expressions with AND and OR operations. Most of the time, I only use the implicit logical AND, just like in the example dfind $data_dir -newer $timestamp_file -size 3901c -name *01*. So my feeling is that implementing implicit logical AND support would be enough for most use cases.
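To make the implicit logical AND concrete, here is a minimal sketch of how the three tests in the example command could be combined. It is not taken from dfind: struct file_info, struct filter, and filter_matches() are hypothetical names, and a real tool would fill the file record from stat() rather than by hand. It assumes -size 3901c means exactly 3901 bytes and that -newer compares modification times.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-file record; a real tool would fill it from stat(). */
struct file_info {
    const char *name;   /* base name of the file  */
    long long   size;   /* size in bytes          */
    time_t      mtime;  /* last modification time */
};

/* Hypothetical filter; a test is skipped when its has_* flag is false
 * or, for the name test, when name_glob is NULL. */
struct filter {
    bool        has_newer;
    time_t      newer_than;  /* mtime of the -newer reference file */
    bool        has_size;
    long long   size_bytes;  /* exact size, as in -size 3901c      */
    const char *name_glob;   /* shell pattern, as in -name '*01*'  */
};

/* Implicit AND: the file matches only if every enabled test passes. */
bool filter_matches(const struct filter *f, const struct file_info *fi)
{
    if (f->has_newer && !(fi->mtime > f->newer_than))
        return false;
    if (f->has_size && fi->size != f->size_bytes)
        return false;
    if (f->name_glob && fnmatch(f->name_glob, fi->name, 0) != 0)
        return false;
    return true;
}
```

Because every test is only a short-circuit check, no expression tree is needed; supporting -or or parentheses is what would require building one.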

LiXi-storage, Feb 09 '19 16:02

Li Xi, it isn't clear what additional benefits would be added from implementing a more complex parsing vs. just using getopt_long_only()? That change is trivial to make and gives 95% of the compatibility with common find usage.

I agree that -o is potentially useful, but we've lived with only getopt_long_only() and implicit AND for years in lfs find (though, e.g., --ost implicitly uses OR), so I don't think it is critical for a first approximation of the normal find option parsing.

adilger, Feb 10 '19 04:02

Sorry for the confusion. I totally agree that getopt_long_only() can be used for dfind. We should not reinvent the wheel, since getopt_long_only() works well. What I was suggesting is that, instead of using the "--" option style, it would be easier for users if dfind followed the same option pattern as find, i.e.:

dfind [leading_options] [path...] [expression]

To do so, dfind can follow the same argument parsing process as find, i.e. 1) get the leading options, 2) get the file/directory paths, 3) parse the expression. In the final step of parsing the expression, getopt_long_only() may well be the best function to use.
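The three steps above can be sketched as a small function that splits argv before any expression parsing happens. This is only an illustration under simplifying assumptions, not code from findutils or dfind: struct argv_split, is_leading_option(), and split_find_argv() are hypothetical names, and the expression is assumed to start at the first argument beginning with '-' (real find also treats tokens such as "(" and "!" as expression starts).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result: index ranges into argv for the three parts. */
struct argv_split {
    int paths_begin;  /* leading options are argv[1 .. paths_begin-1]   */
    int expr_begin;   /* paths are argv[paths_begin .. expr_begin-1];
                         the expression is argv[expr_begin .. argc-1]   */
};

/* Leading options from find's synopsis: -H, -L, -P, -D debugopts, -Olevel. */
static int is_leading_option(const char *arg, int *takes_arg)
{
    *takes_arg = 0;
    if (!strcmp(arg, "-H") || !strcmp(arg, "-L") || !strcmp(arg, "-P"))
        return 1;
    if (!strcmp(arg, "-D")) {
        *takes_arg = 1;  /* debugopts is a separate argument */
        return 1;
    }
    if (!strncmp(arg, "-O", 2) && arg[2] != '\0')
        return 1;        /* level is attached, e.g. -O2 */
    return 0;
}

void split_find_argv(int argc, char **argv, struct argv_split *out)
{
    int i = 1;
    int takes_arg;

    /* Step 1: skip the leading options. */
    while (i < argc && is_leading_option(argv[i], &takes_arg))
        i += 1 + takes_arg;
    if (i > argc)
        i = argc;        /* "-D" was the last argument, with nothing after it */
    out->paths_begin = i;

    /* Step 2: paths run up to the first argument starting with '-'. */
    while (i < argc && argv[i][0] != '-')
        i++;

    /* Step 3: everything from here on is the expression, parsed separately. */
    out->expr_begin = i;
}
```

The expression slice could then be handed to getopt_long_only() or to a dedicated expression parser, depending on how much of find's syntax needs to be supported.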

Agreed that -o and ( expr ) expressions supported by find are not the most urgent thing to implement.

LiXi-storage, Feb 10 '19 05:02