Access file timestamps with subsecond resolution as integer

Open rdiez opened this issue 5 years ago • 23 comments

I have written a script that compares filesystem timestamps, like GNU Make does. The trouble is, a resolution of 1 second is too low for my liking.

The Linux stat syscall and its glibc wrapper provide nanosecond resolution. I believe that Windows' NTFS stores timestamps with 100-nanosecond resolution. I think it is safe to say that sub-microsecond resolution in filesystems is mature and here to stay.

In Perl, you have Time::HiRes::stat(), which has 2 issues:

  1. It unconditionally goes through a localtime conversion. This is problematic because some local times are ambiguous due to summer time changes.

  2. You get a floating-point value back, which is problematic due to precision issues when comparing with other timestamps. And you do not know what resolution the time actually has.

It would be nice if Perl provided access to the timestamps: a) As UTC. b) In nanosecond resolution. c) As an integer.
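The precision concern in point 2 can be illustrated with a small sketch. It is written in Python rather than Perl, purely for illustration; the timestamp value is a made-up example, and the only assumption is that the floating-point value is an IEEE 754 double, which is what Python floats (and typical Perl NVs) are.

```python
# Hypothetical timestamp: 2020-06-26 16:06:00 UTC plus 123,456,789 ns.
sec, nsec = 1_593_187_560, 123_456_789

exact_ns = sec * 1_000_000_000 + nsec   # exact integer nanoseconds
as_float = sec + nsec / 1e9             # what a floating-point API hands back

# Converting the double back to nanoseconds does not recover the original:
# near 1.6e9 seconds, adjacent doubles are roughly 238 ns apart.
recovered_ns = round(as_float * 1e9)
print(exact_ns, recovered_ns, exact_ns == recovered_ns)

# Two timestamps 10 ns apart collapse into the same double here.
later_float = sec + (nsec + 10) / 1e9
print(as_float == later_float)
```

With integers, the two timestamps in the second half would stay distinct; with doubles, whether they do depends on where the values happen to fall between representable numbers.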

rdiez avatar Jun 26 '20 16:06 rdiez

Yes, this would be nice. I'm not quite sure how to present it to the user though.

Perhaps a dualvar of a double and string? (with the latter containing the full precision).

Leont avatar Jun 26 '20 16:06 Leont

I think it would be useful to provide such an API in some way certainly - though it would have to keep in mind that only some filesystems support nanosecond resolution, and some don't even support subsecond.

  1. It unconditionally goes through a localtime conversion. This is problematic because some local times are ambiguous due to summer time changes.

I can't reproduce this. The value of mtime, from either the core stat function or the one provided by Time::HiRes, is an epoch timestamp. This is a format that does not have a time zone; it is the number of seconds since January 1st, 1970 at midnight UTC (adjusted for leap seconds). (Regardless, I think this is a good candidate for a separate issue if you have a reproduction case.)

Grinnz avatar Jun 26 '20 16:06 Grinnz

I can't reproduce this. The value of mtime, from either the core stat function or the one provided by Time::HiRes, is an epoch timestamp.

You are right, I am sorry I got this mixed up. So we only have 2 points left: (b) nanosecond resolution, and (c) as an integer.

If a filesystem does not support nanoseconds, it's no problem: Linux just returns a value of 0 nanoseconds if the filesystem is FAT32.

A "dualvar of a double and string", whatever that is, does not sound appealing. A floating-point number is always problematic to compare, and unnecessarily slow anyway. Think about adding time offsets: with every floating-point operation you lose precision. Depending on the overall value and the underlying platform, a floating-point value may not actually reach nanosecond precision.

And a string is also inconvenient and slow to handle. It would hopefully be zero-padded on the left. Otherwise, if you got 123456, do you assume that you only have microsecond precision, and not nanosecond? Would it not be easier for everybody to just return the nanosecond value as an integer? Like Linux does:

    struct timespec {
        time_t tv_sec;   /* seconds */
        long   tv_nsec;  /* nanoseconds */
    };
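For comparison, here is a minimal sketch of the same two-integer shape in Python, whose `os.stat` already exposes an integer `st_mtime_ns` field. The helper name `mtime_timespec` is hypothetical and only used for this illustration; nothing like it exists in Perl today.

```python
import os
import tempfile

def mtime_timespec(path):
    """Return the modification time as a (seconds, nanoseconds) integer pair."""
    st = os.stat(path)
    # st_mtime_ns is an exact integer count of nanoseconds since the epoch.
    return divmod(st.st_mtime_ns, 1_000_000_000)

with tempfile.NamedTemporaryFile() as tmp:
    sec, nsec = mtime_timespec(tmp.name)
    # Zero-padding to 9 digits avoids the "is 123456 micro- or nanoseconds?" ambiguity.
    print(f"{sec}.{nsec:09d}")
```

On a filesystem without subsecond support, the second element simply comes back as 0.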

rdiez avatar Jun 26 '20 20:06 rdiez

Yea I think if this were to be done it'd have to be done by returning multiple values for the same reason that the Linux kernel does it. The nanoseconds themselves already take 32 bits minimum to represent, and we're quickly reaching the limits of a 32-bit epoch. Those combined mean that you won't be able to represent the time very well in a single 64-bit integer. That's also ignoring what will happen with 32-bit perls that don't have 64-bit integer support. I don't think it could be done without a different function to get that representation.

Maybe something like lstat_ns(), stat_ns() or nanostat(), nanolstat()? They'd have to have an extra value returned in the list so they wouldn't be fully compatible with stat() and lstat() no matter what. Putting the nanoseconds on the end of the list would likely make them the least likely to cause issues but wouldn't make much immediate sense as to the ordering of the elements. Another option would be to return something more than just an integer (say an array ref) for atime/ctime/mtime but that might be more trouble than it's worth too.
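As a rough check of the bit budget discussed in this comment, the arithmetic can be sketched as follows. This is plain integer arithmetic in Python and says nothing about what a 32-bit perl can actually store; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

NS_PER_SEC = 1_000_000_000

# Bits needed for the largest nanosecond field value, 999,999,999.
nsec_bits = (NS_PER_SEC - 1).bit_length()

# Bits left for whole seconds if the nanoseconds are packed into the
# low bits of a 64-bit word.
sec_bits_left = 64 - nsec_bits

# Alternative layout: one signed 64-bit count of nanoseconds since the epoch.
# This is the last whole second such a counter can represent.
max_sec = (2**63 - 1) // NS_PER_SEC
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last = epoch + timedelta(seconds=max_sec)

print(nsec_bits, sec_bits_left, last.isoformat())
```

Either layout needs native 64-bit integers, so the concern about perls built without 64-bit integer support applies to both.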

simcop2387 avatar Jun 26 '20 22:06 simcop2387

This can easily be implemented as a CPAN module, and that is probably the right place to explore such a thing.

Leont avatar Jun 27 '20 10:06 Leont

A CPAN module has an important drawback: it is not available everywhere. With the current Perl, my users only have to run the script. With a CPAN module, you have to teach your users how to install Perl modules. It is a barrier of entry.

Besides, this kind of 'stat' belongs next to the standard one in Perl, either the built-in stat, or the Time::HiRes variant. You should not really need extra modules for this, because the 'stat' functionality is already standard. Only substandard.

In my opinion, building a stat variant in Time::HiRes that uses floating-point is actually a mistake. Floating point in this scenario is a trap for the unwary. You do not really know what precision to expect anyway (it is not documented). You cannot store such a floating-point value in a file, and be certain that another system will read it and will have exactly the same value. Only Java gives you such floating-point guarantees.

In the case of Time::HiRes::stat, I had to arbitrarily cut the precision to milliseconds in the hope that it would work on all systems. Perl's "portability" is reduced to testing all over the place.

I would place the new integer-based, nanosecond-precision 'stat' next to the old one, and make it the recommended one. Or at least warn about floating-point issues and unknown precision in the Time::HiRes variant.

rdiez avatar Jun 27 '20 12:06 rdiez

Next to the standard one in core is well out of scope IMO, but adding it to Time::HiRes would work.

What about using bignums for the time? That way one can retain arbitrary precision.

Leont avatar Jun 28 '20 14:06 Leont

I never had a need for bignum. It sounds slow, both at compilation/start up, and at runtime. Besides, what I just saw makes me a little worried:

All operators (including basic math operations) are overloaded.

What is the advantage of bignum? Will it avoid the need for a new call? I mean, can Time::HiRes::stat transparently return a bignum?

Say you get your number of seconds as an integer, and a separate number of nanoseconds as yet another integer. If you want to compare them for equality, like I need in my script, there is no need for any extra overhead.

Say I do want to compare them like timestamp1 < timestamp2, or to add a time delta to one of them. If I need a bignum, I just need to convert the number of seconds to a bignum, multiply it by 1,000,000,000, and then add the number of nanoseconds. Then I can add my time delta (or whatever) to that bignum. I would rather have this inconvenience than having to always use bignum, whether I need it or not.
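The steps described here (compare the pair directly, and only combine into one large integer when arithmetic is needed) can be sketched as below. The sketch is in Python, where integers are arbitrary precision by default, so the "convert to bignum" step is implicit; the helper names `to_ns` and `add_ns` are hypothetical.

```python
NS_PER_SEC = 1_000_000_000

def to_ns(sec, nsec):
    """Combine a (seconds, nanoseconds) pair into one integer nanosecond count."""
    return sec * NS_PER_SEC + nsec

def add_ns(ts, delta_ns):
    """Add a signed nanosecond delta to a (sec, nsec) pair and return a new pair."""
    sec, nsec = ts
    return divmod(to_ns(sec, nsec) + delta_ns, NS_PER_SEC)

a = (1_593_187_560, 999_999_999)
b = (1_593_187_561, 0)

print(a == b)             # equality needs no conversion at all
print(a < b)              # tuples order by seconds first, then nanoseconds
print(add_ns(a, 1) == b)  # adding 1 ns carries into the seconds field
```

Equality and ordering never touch the combined value; only the delta arithmetic pays for the wider integer.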

rdiez avatar Jun 28 '20 16:06 rdiez

Time::HiRes::stat should definitely not start returning a bignum, as you alluded it is much slower and can be dangerous for that reason.

But I think an alternate function to Time::HiRes::stat that returned Math::BigFloat objects for these values would be sensible.

Grinnz avatar Jun 28 '20 16:06 Grinnz

I am writing a tool that will process hundreds of thousands of files. Why do I need to worry whether Math::BigFloat could impose a significant penalty when processing the timestamps?

I do not understand why I even have to wonder whether using floating point arithmetic with Math::BigFloat could have the same precision issues as normal floating-point arithmetic or not.

Life can be so simple. Just return 2 integers. Like everybody else. If you really want, follow the super-easy steps "make a Math::BigFloat with the number of seconds, multiply by 1,000,000,000 and add nanoseconds, then enjoy your excellent Math::BigFloat". Or bignum. Or whatever.

I am getting desperate. I wish I had started my tool in Java.

rdiez avatar Jun 28 '20 17:06 rdiez

The entire point of Math::BigFloat is that it is arbitrary precision and does not use IEEE 754 floats. That said, there is no reason both variants can't be considered.

Grinnz avatar Jun 28 '20 17:06 Grinnz

What is the advantage of bignum?

It allows one to have the full nanosecond precision while also providing the familiar single value interface.

Why do I need to worry whether Math::BigFloat could impose a significant penalty when processing the timestamps?

You're doing a stat() system call. I rather doubt the cost of dealing with a bignum will be significant compared to that (but only experimental data can verify that).
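One way to gather that kind of experimental data is a micro-benchmark along the following lines. This is only a sketch in Python: `decimal.Decimal` stands in for an arbitrary-precision type and is not Math::BigFloat, so the numbers would not transfer to Perl; the function name `measure` and the iteration count are arbitrary choices.

```python
import os
import tempfile
import timeit
from decimal import Decimal

def measure(path, number=20_000):
    """Return (seconds for plain stat, seconds for stat plus Decimal conversion)."""
    def plain():
        return os.stat(path).st_mtime_ns

    def with_decimal():
        # scaleb(-9) shifts the decimal point exactly; no binary float is involved.
        return Decimal(os.stat(path).st_mtime_ns).scaleb(-9)

    return (timeit.timeit(plain, number=number),
            timeit.timeit(with_decimal, number=number))

with tempfile.NamedTemporaryFile() as tmp:
    t_plain, t_decimal = measure(tmp.name)
    print(f"stat only: {t_plain:.4f}s  stat + Decimal: {t_decimal:.4f}s")
```

The same shape of test, with `Time::HiRes::stat` and `Math::BigFloat->new` in place of the Python calls, would be needed to settle the question for Perl.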

Leont avatar Jun 28 '20 18:06 Leont

I cannot believe you are seriously considering a fat bignum/bigfloat object for a new stat call, just because it is more familiar or more convenient to return a single "scalar". My bet is, the moment my Perl script touches it, starting a JVM will probably feel faster. And afterwards, Perl has lost anyway.

About performance: The Linux stat syscall is pretty fast because it operates mainly on cached information. Nevertheless, it has been recognised that stat can be a bottleneck when you scan many disk files on big iron. I am talking about a serious number of files. That's partly the reason why there is a newish statx call where you can request just the information you need, in order to save time:

https://tech.feedyourhead.at/content/using-the-new-statx-system-call

I can imagine a future where Perl might return, in a single operation, an array with all file stat information of a given directory, but just the selected fields by virtue of statx, like only size and the 'last modified' timestamp (that is what my script needs at the moment). Something similar to what readdir can already do, if the caller expects an array back.
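To make the idea concrete, here is a sketch of such a "directory scan returning only size and mtime" call, using Python's `os.scandir`. Note the assumption: this does not use statx or request a field subset from the kernel; it only shows the calling shape, and the function name `scan_sizes_and_mtimes` is hypothetical.

```python
import os
import tempfile

def scan_sizes_and_mtimes(directory):
    """Map each regular file name to (size in bytes, mtime in integer nanoseconds)."""
    result = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                st = entry.stat(follow_symlinks=False)
                result[entry.name] = (st.st_size, st.st_mtime_ns)
    return result

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example.txt"), "w") as fh:
        fh.write("hello")
    print(scan_sizes_and_mtimes(tmp))
```

Each entry carries two plain integers, which is the lightweight per-file footprint argued for above.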

If you standardise on Math::BigFloat for such a timestamp, and the user requests all 3 timestamps, each file stat element will have 3 of those fat objects. When the Python guys then laugh about our performance, I'll come back here and remind you about your design decisions.

rdiez avatar Jun 28 '20 19:06 rdiez

I think you'd have a much more productive conversation if you avoided insulting language. You're talking to volunteers who are trying to help you.

karenetheridge avatar Jun 28 '20 23:06 karenetheridge

Please do not escalate terms. There is no "insulting language" above.

rdiez avatar Jun 29 '20 07:06 rdiez

I do not understand why your tone in the tickets in these perl5 issues is so hostile. I personally have reached a point that it annoys me so much that I just don't read them anymore.

Even if you have "good" points to make, please realize that the tone you use is not the one that will motivate people to give you the answers you hope for. Even if I was able to "fix" every issue, I would not put my free time and effort into these tickets. I would rather work on tickets that communicate in a tone that makes me happy.

I bet that also works for other people.

Just my ¢2

Tux avatar Jun 29 '20 07:06 Tux

I also find frustrating the way most tickets have been handled. I think most people will understand my frustration when they read the ticket history, and if not, that is my risk to take. I personally find it OK to show frustration, even to resort to sarcasm after a while. We are not machines.

But there is a difference between a "tone" or an attitude you dislike, and qualifying with very negative terms like "insulting language" or "so hostile". When I see those kinds of exaggerated adjectives, I believe that you are actually the one being hostile. After all, if you really wanted to help me understand how being "nicer" would help me, you would have sent me a private e-mail, instead of dropping those terms in public.

I am also an unpaid, free-time volunteer writing free software for other people to use. I understand if you do not like me, the issues I raise, the solutions I propose, etc. I understand if you do not feel like helping. Or if you implement the fixes some other way. You are also free to ignore me. But there is no need to go over the line.

rdiez avatar Jun 29 '20 12:06 rdiez

Tux is correct. You are the one who wants changes, so the easiest way to not get anything done is to continue expressing your frustration at us.

Grinnz avatar Jun 29 '20 16:06 Grinnz

You are mistaken:

  1. Tux is not correct. I am not being hostile. Frustrated, yes, but nothing much higher than that. This is an important difference, see below.

  2. I do not want any changes for me. I have been contributing (gathering information and communicating is an effort) in the hope that this will make Perl better and will help other people. I do not need the changes myself, for I have identified the issues and worked around them.

All the issues I reported are real and/or reasonable. But I am not making any demands: the Perl community is of course free to take action or not. I do not mind (I know they are volunteers like I am), as long as I am not attacked personally (outside the technical realm) by getting labelled with terms like "hostile" or "insulting language".

Well, yes, there is generally the expectation that maintainers will want to improve their project, whether users reporting issues are nice, well-behaved, or not. But I am not going to pursue this, other than contributing with information, tests or technical opinions in the particular issues I reported, to the best of my knowledge.

rdiez avatar Jun 29 '20 16:06 rdiez

I don't get how I can be incorrect in perceiving you as being hostile. That is how I feel it. I cannot be wrong in how I feel something. But you don't have to care. I'll just do something else. Anyway, you've lost me as a possible help.

Tux avatar Jun 29 '20 18:06 Tux

rather than change the default stat function, can we at least fix File::stat ? that returns a structure with named fields, so adding e.g. mtime_ns should be trivial & backwards compatible, and aligns with the underlying structure that the function exposes.

feel free to bikeshed the actual name. it doesn't matter to me as long as it can be accessed easily & without loss in precision.

vapier avatar Jan 15 '23 00:01 vapier

feel free to bikeshed the actual name. it doesn't matter to me as long as it can be accessed easily & without loss in precision.

If your bikeshed is linux-colored, the extra nanosecond information is made available via names like st_mtimensec on systems which don't define appropriate values of _POSIX_C_SOURCE or _XOPEN_SOURCE

djerius avatar Dec 30 '23 01:12 djerius

feel free to bikeshed the actual name. it doesn't matter to me as long as it can be accessed easily & without loss in precision.

If your bikeshed is linux-colored, the extra nanosecond information is made available via names like st_mtimensec on systems which don't define appropriate values of _POSIX_C_SOURCE or _XOPEN_SOURCE

For NetBSD-hued sheds:

stat(), lstat(), and fstat() conform to IEEE Std 1003.1-2004

which doesn't support sub-second resolution, but:

  NetBSD Extensions
     The following additional NetBSD specific fields are present:

           Type        Entry               Description
           long        st_atimensec        last access (nanoseconds)
           long        st_mtimensec        last modification (nanoseconds)
           long        st_ctimensec        last status change (nanoseconds)

FreeBSD doesn't provide alternate access to the nanosecond component of the st_?tim entries; you need to delve into their timespec structures directly.

OpenBSD has this:

For compatibility with previous standards, st_atime, st_mtime, and st_ctime macros are provided that expand to the tv_secs member of their respective struct timespec member. Deprecated macros are also provided for some transitional names: st_atimensec, st_mtimensec, st_ctimensec, st_atimespec, st_mtimespec, and st_ctimespec.

I'm unable to find any recent macOS manpages online, so I can't help there. I haven't looked at any other *nix flavors, and I've forgotten more than I ever knew about VMS.

Adding this capability (regardless of bikeshedding the name) may (will?) require configuration time probing. I don't see a path to determining what a platform supports solely through compile time macro definition checks. (And there may be systems which use timeval and not timespec, and so only support microseconds and not nanoseconds).

A quick peek at configure.ac for Python (which supports sub-second resolution in their os.stat implementation) shows they are using a configuration time check.
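A configuration-time probe of this kind can be sketched roughly as follows: try to compile a tiny C program for each known spelling of the nanosecond member and keep the first one that builds. The sketch is in Python for brevity; the candidate list, the function name `probe_mtime_nsec_member`, and the compiler lookup are assumptions for illustration, not what Perl's Configure or Python's configure.ac actually do.

```python
import os
import shutil
import subprocess
import tempfile

# Candidate spellings of the mtime nanosecond member of struct stat
# (an illustrative list, not an exhaustive survey of platforms).
CANDIDATES = ["st_mtim.tv_nsec", "st_mtimespec.tv_nsec", "st_mtimensec"]

TEMPLATE = """
#include <sys/types.h>
#include <sys/stat.h>
int main(void) {{ struct stat st; st.{member} = 0; return (int)st.{member}; }}
"""

def probe_mtime_nsec_member(cc=None):
    """Return the first candidate that compiles, or None if none does
    or no C compiler can be found."""
    cc = cc or shutil.which("cc") or shutil.which("gcc") or shutil.which("clang")
    if cc is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        obj = os.path.join(tmp, "probe.o")
        for member in CANDIDATES:
            with open(src, "w") as fh:
                fh.write(TEMPLATE.format(member=member))
            # Compile only; a zero exit status means the member exists.
            proc = subprocess.run([cc, "-c", src, "-o", obj], capture_output=True)
            if proc.returncode == 0:
                return member
    return None

print(probe_mtime_nsec_member())
```

A platform where none of the candidates compiles would fall back to whole-second timestamps, and a timeval-based system would need its own microsecond candidate.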

djerius avatar Dec 30 '23 04:12 djerius