loki Promtail: Add compressed files support

What this PR does / why we need it: Adds to Promtail the ability to read compressed files. It works by:

Infer which compression format to use based on the file extension
Uncompress the file with the native golang/compress packages
Iterate over uncompressed lines and send them to Loki

Its usage is the same as our current file tailing. In the example below all files under logfiles folder are parsed/processed, regardless of being compressed or not. file1 and file2 are first uncompressed with Gunzip; file3 is parsed as it is.

scrape_configs:
  - job_name: simplejob
    static_configs:
      - targets: [localhost]
        labels:
          job: compressedfolder
          __path__: /logfiles/**.*

$ ls /logfiles
     file1.tar.gz file2.tar.gz file3.txt

Which issue(s) this PR fixes: Fixes #5956

Special notes for your reviewer: Another approach is to infer the compression format based on the header bytes but this approach is much simpler; in the future we can improve it if we think the community needs it.

Checklist

[ ] Documentation added
[ ] Tests updated
[ ] Is this an important fix or new feature? Add an entry in the CHANGELOG.md.
[ ] Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

Jul 18 '22 15:07 DylanGuedes

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Jul 19 '22 14:07 grafanabot

@dannykopping @frittentheke thank you both for the fantastic reviews, I'll be working on those suggestions and other ones that might come along the week and I plan to try testing a few things with this too.

Just one thing, I think even if this doesn't add decent support for compressed log rotations, we should still proceed to at least have this, as a simpler implementation that is to be improved in a follow-up PR (I can start working on it exactly after this gets merged). My argument is that as of now Loki isn't supporting even the most straightforward compression scenarios, so this would at least add some support without a lot of burdens. For instance, I liked that with this PR you can now tell Promtail to scrape from a folder that has tons of compressed files and that you can keep appending new compressed files to it and all of them will be scraped.

Jul 25 '22 13:07 DylanGuedes

For instance, I liked that with this PR you can now tell Promtail to scrape from a folder that has tons of compressed files and that you can keep appending new compressed files to it and all of them will be scraped.

Yes. Question is though, in what order those files will be picked up? Does Promtail sort them by modification date? That would then at least keep as much order as possible.

Jul 25 '22 13:07 frittentheke

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Jul 25 '22 13:07 grafanabot

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Jul 25 '22 13:07 grafanabot

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Jul 25 '22 13:07 grafanabot

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Jul 25 '22 13:07 grafanabot

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Jul 25 '22 14:07 grafanabot

Yes. Question is though, in what order those files will be picked up? Does Promtail sort them by modification date? That would then at least keep as much order as possible.

Good call, thanks. I'll add this scenario to the list of things I'll be testing this week. Of course, even if I implement some kind of sorting there will be still edge cases because the managers will be checking things async but it would for sure be more stable than not having any sorting (which I'm not sure we have as of now).

Jul 25 '22 14:07 DylanGuedes

FYI: I did generate a file with ~400 MB, and compressed it with gunzip which reduced it to ~1 MB, and apparently the gunzip reader didn't really expand the whole file before iterating over it, which is great. I'll try one more time but this time with a bigger file (a few gbs).

Jul 26 '22 14:07 DylanGuedes

Yes. Question is though, in what order those files will be picked up? Does Promtail sort them by modification date? That would then at least keep as much order as possible.

Good call, thanks. I'll add this scenario to the list of things I'll be testing this week. Of course, even if I implement some kind of sorting there will be still edge cases because the managers will be checking things async but it would for sure be more stable than not having any sorting (which I'm not sure we have as of now).

I've tested/investigated it and no, Promtail doesn't follow any order. I tried to look for a quick solution to it but tbh it won't be that easy because:

The discovery is based on the filename/glob
Once the files to be tailed are discovered, the target manager iterates over them and sends them to a channel
They're processed as they're received on the channel

We can try sorting them by name before publishing them on the channel, but ordering them by creation date looks to be a challenge.

EDIT: nvm, I can actually read the creation/modification date from the name and sort things before hand

EDIT2: even after sorting the targets, they might be synced out of order because tailing is completely async. I implemented a solution and wrote a few unit tests but it will fail/is failing because tailing itself is async, so even if I start tailing on the right order, they might still process things in a different order.

Aug 01 '22 12:08 DylanGuedes

@dannykopping fyi: I finished the tests I was interested, anything else you'd like me to test/improve?

Aug 01 '22 14:08 DylanGuedes

We can try sorting them by name before publishing them on the channel, but ordering them by creation date looks to be a challenge.

EDIT: nvm, I can actually read the creation/modification date from the name and sort things before hand

EDIT2: even after sorting the targets, they might be synced out of order because tailing is completely async. I implemented a solution and wrote a few unit tests but it will fail/is failing because tailing itself is async, so even if I start tailing on the right order, they might still process things in a different order.

This seems to be a larger issue of only using file names and not the abstract inodes number and the modification date. See me comment about this at https://github.com/grafana/loki/issues/6503#issuecomment-1191957798

So this might be a fix / improvement for another day. But since Loki does not really loves older timestamps being ingested Promtail should do what is can to read from files targeting the same labels in proper order ...

Aug 04 '22 09:08 frittentheke

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Aug 15 '22 12:08 grafanabot

FYI: I did generate a file with ~400 MB, and compressed it with gunzip which reduced it to ~1 MB, and apparently the gunzip reader didn't really expand the whole file before iterating over it, which is great. I'll try one more time but this time with a bigger file (a few gbs).

Can we add a benchmark for these so the tests are repeatable, and we can see what the CPU/memory impact is?

Aug 24 '22 09:08 dannykopping

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Aug 24 '22 21:08 grafanabot

FYI: I did generate a file with ~400 MB, and compressed it with gunzip which reduced it to ~1 MB, and apparently the gunzip reader didn't really expand the whole file before iterating over it, which is great. I'll try one more time but this time with a bigger file (a few gbs).

Can we add a benchmark for these so the tests are repeatable, and we can see what the CPU/memory impact is?

I'm looking at it but it is more complicated than I was thinking. Out of curiosity, would you say I should generate a big file, compress it and bench everything programmatically, or were you thinking something different?

Aug 24 '22 21:08 DylanGuedes

FYI: I did generate a file with ~400 MB, and compressed it with gunzip which reduced it to ~1 MB, and apparently the gunzip reader didn't really expand the whole file before iterating over it, which is great. I'll try one more time but this time with a bigger file (a few gbs).

Can we add a benchmark for these so the tests are repeatable, and we can see what the CPU/memory impact is?

I'm looking at it but it is more complicated than I was thinking. Out of curiosity, would you say I should generate a big file, compress it and bench everything programmatically, or were you thinking something different?

Out of curiosity, would you say I should generate a big file, compress it and bench everything programmatically

Yeah this is what I'd love to see.

As an aside, I went looking in the Go source, and I was surprised to not find benchmarks in the GZIP pkg, although there are a couple:

compress [tags/go1.18.5] $ grep -rn Benchmark
flate/writer_test.go:16:func BenchmarkEncode(b *testing.B) {
flate/reader_test.go:34:func BenchmarkDecode(b *testing.B) {
lzw/writer_test.go:199:func BenchmarkEncoder(b *testing.B) {
lzw/reader_test.go:267:func BenchmarkDecoder(b *testing.B) {
bzip2/bzip2_test.go:238:func BenchmarkDecodeDigits(b *testing.B) { benchmarkDecode(b, digits) }
bzip2/bzip2_test.go:239:func BenchmarkDecodeNewton(b *testing.B) { benchmarkDecode(b, newton) }
bzip2/bzip2_test.go:240:func BenchmarkDecodeRand(b *testing.B)   { benchmarkDecode(b, random) }

What I was thinking we could do is:

add a text file to the test fixtures which is sufficiently big but highly compressible (a couple MB of server logs?)
in the benchmark, compress it and save to a file
start the benchmark timer and begin uncompressing the file
take note of allocs, MB/s, etc

We may see that one or two compression formats cannot be partially uncompressed and lead to unbounded memory growth, in which case we'll need to find an alternative to the stdlib packages, or put a big warning in the docs for that format.

Aug 25 '22 06:08 dannykopping

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0.4%
+               loki	0%

Sep 02 '22 20:09 grafanabot

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Sep 19 '22 13:09 grafanabot

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Sep 21 '22 00:09 grafanabot

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Sep 21 '22 10:09 grafanabot

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Sep 21 '22 12:09 grafanabot

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Sep 27 '22 13:09 grafanabot

Benchmark results with 3MB buffer size for decompression:

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -tags integration,requires_docker -bench ^BenchmarkReadlines$ github.com/grafana/loki/clients/pkg/promtail/targets/file

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/clients/pkg/promtail/targets/file
cpu: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
BenchmarkReadlines/2000_lines_of_log_.tar.gz_compressed-12         	     768	   1559020 ns/op	 3063606 B/op	     239 allocs/op
BenchmarkReadlines/100000_lines_of_log_.gz_compressed-12           	      18	  63992058 ns/op	 5666200 B/op	   21157 allocs/op
PASS
ok  	github.com/grafana/loki/clients/pkg/promtail/targets/file	3.221s

Sep 27 '22 13:09 DylanGuedes

Benchmark results with 4096 buffer size:

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -tags integration,requires_docker -bench ^BenchmarkReadlines$ github.com/grafana/loki/clients/pkg/promtail/targets/file

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/clients/pkg/promtail/targets/file
cpu: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
BenchmarkReadlines/2000_lines_of_log_.tar.gz_compressed-12         	     933	   1261038 ns/op	   60287 B/op	     236 allocs/op
BenchmarkReadlines/100000_lines_of_log_.gz_compressed-12           	      19	  58614711 ns/op	 2536763 B/op	   20279 allocs/op
PASS
ok  	github.com/grafana/loki/clients/pkg/promtail/targets/file	3.167s

Sep 27 '22 13:09 DylanGuedes

@dannykopping Merging as I addressed all your great suggestions. I'll be working on the zip compression on a follow-up PR :smile:

Sep 27 '22 14:09 DylanGuedes

loki loki copied to clipboard

Promtail: Add compressed files support

loki
loki copied to clipboard