tac icon indicating copy to clipboard operation
tac copied to clipboard

Question on justification

Open 89z opened this issue 4 years ago • 5 comments
trafficstars

On the main page, you give the example similar to:

tac large-file | grep something

But doesn't that run counter to the classic filter, sort idiom? What you're getting from grep is likely to be (much) smaller than the original input, so aren't you pointlessly reversing all of the input that grep discards?

89z avatar Aug 31 '21 04:08 89z

Ok, I need to clean up the README file a bit. From somewhere else in the doc:

This gives you "live" streaming of results and lets you terminate much sooner if you're only looking for the first n matches. e.g. tac --line-buffered access.log | grep foo will print its first match much, much sooner [...]

The idea being that if you might abort (^C) the pipeline after you find the result you are looking for, you can do so much sooner if the file is quickly reversed then searched in the desired order rather than if it's slowly searched in its entirety before the results are reversed and printed in the desired order.

That said, you shouldn't think of tac as a full-on sort (where the bottleneck isn't the actual sorting per se but rather the need to first buffer all the lines before you are able to sort them), but rather an "actually useful instance of using cat" (where the argument isn't really that it's slow so much as it is just that it's unnecessary).

mqudsi avatar Aug 31 '21 15:08 mqudsi

You're skipping over the important part - or maybe I didn't clarify sufficiently. If you run the pipeline to completion, tac | grep isn't faster. If you abort after finding the result you are looking for (and you are obviously looking for it in the last matches in the original file or you wouldn't be using tac in the first place), then tac | grep is faster.

To speak in more technical terms, I am comparing tac | grep --line-buffered | head -nX vs grep --line-buffered | tac --line-buffered | head -nX. Regardless of the tool in question (but especially for grep), what tac does (searching for one fixed byte) is fundamentally cheaper and faster than what grep does (probably searching for a regex). This isn't a small difference - it's a massive one. And it's not because this tac is better written than grep or rg or whatever but just because it is literally doing less work. But that's not even why this is faster.

Here's a real-world example: I have an access log that is hundreds of GiBs in size. I want the most recent 50 hits to a particular page. As a separate observation, these are closer to the end of the file than the start of the file (because I asked for most recent and the access log is appended to, therefore inherently in reverse chronological order). tac file | grep foo is going to find the results I want (the most recent 50) after reading a few gigabytes. I can abort it now (or use head to abort it automatically). grep foo file | tac is going to need to grep the entire file, hundreds of gigabytes, before its output can be reversed. Even if tac were 100x slower than grep, it would still be slower to do it this way. Even if I didn't care what order the results were printed in but I still wanted the most recent 50 - go ahead and take tac out of the picture altogether and pipe into a magical command that keeps only the last 50 lines: grep foo file | keep_last_50 is still going to have to read through the entire file to get the desired result.

FYI: I pushed an update to the README after my previous post. It now reads

... and if you are going to reverse the output, you will benefit most if you reverse it from the start, unless you are always going to run the command to completion ...

mqudsi avatar Aug 31 '21 16:08 mqudsi

It seems were at an impasse, as my position is that a claim of significant performance increase should be backed up by benchmarks, or at least reproducible tests. You dont provide either of those, either in the current README, or in this thread.

All thats provided are vague statements like "if you pipe some 100 GB file, then tac | grep is totally faster!". That may be true, but all anyone currently has, is your word thats its actually true. People could do their own tests of course, but without a specified reproducible test, who knows what result they will get. For a couple of examples of input, you could use the Linux source [1], Subtitle dump [2] or Wikipedia [3], or any number of other options.

  1. https://github.com/torvalds/linux
  2. http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.en.gz
  3. https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

89z avatar Aug 31 '21 16:08 89z

There is nothing that I said that's "my word" - it's objectively true by design (i.e. mathematically). I wanted to give you a benchmark and then gave up because tac first completed in 5s but grep first didn't finish 10 minutes later. Maybe I'll do it with one of the links you provided but that has nothing to do with whether or not the theory is sound even without the example.

mqudsi avatar Aug 31 '21 16:08 mqudsi

I wish you the best of luck. You can keep saying "this is mathematically true!", but unless people can reproduce your result, it doesnt mean anything. And for people to be able to do that, you have to specify the parameters you used for testing. That includes the exact input (either uploading your exact file that you used, or using public input, like examples I gave). This is why in the scientific community, they have peer review [1], as in general, it doesnt matter what one specific person was able to do, unless other people can reliably reproduce the result.

  1. https://wikipedia.org/wiki/Peer_review

89z avatar Aug 31 '21 16:08 89z