
Goaccess silently discards log data

Open fgoepel opened this issue 2 years ago • 11 comments

As a test, I tried generating a report for a month's worth of logs (744 files totaling 6.1 GB) using the latest version like this:

goaccess --log-format W3C --no-query-string --anonymize-ip --ignore-crawlers -o test.html u_ex2105*.log

For some reason the report only goes up to the 22nd of this particular month and is missing everything after that. Indeed the last output printed is this:

[RENDERING u_ex21052207.log] {15536} @ {29/s}

There is no error or warning message output. I tried setting --max-items and --keep-last to something high, to see if it's running into some limit that would need to be raised, but this didn't change anything.

Is this a bug, or is it hitting some kind of internal limit? If it is a limit, it would be good to get some indication of that, and ideally an option to raise it.

It does appear to work when doing the report incrementally, e.g. like this:

for x in *.log; do echo -en "\r$x"; goaccess "$x" --log-format W3C --no-query-string --anonymize-ip --ignore-crawlers --restore --persist --db-path ./may --process-and-exit ; done
goaccess --log-format W3C --no-query-string --anonymize-ip --ignore-crawlers --restore --db-path ./may -o test.html

But it's quite a lot slower and also produces a 1 GB database, which seems excessive. (Is there a way to reduce that?)

Am I missing something here?

fgoepel avatar Dec 10 '21 16:12 fgoepel

Hi @shado23

There are a number of tests I have in mind that might help you.

First, check whether the --keep-last option is set in your goaccess.conf. This option restricts the number of days that will be stored. If it is unset or set to zero, that's fine.

Second... You can select just one of the files, process it with and without the --ignore-crawlers option, and check the difference between the results. Is the difference significant or negligible? Maybe the crawlers account for a lot of the traffic to your server, and it's a bad idea to remove them from your statistics. Maybe that's your problem.

Third... Use the --invalid-requests= option and see whether a large number of requests end up in that output. If they do, then you either have a parsing problem (an incorrect --log-format) or the log file is getting mixed up. This number can also be checked against the value of the Failed Requests field in your HTML report.

And finally... I assume you are aware of why you would use disk persistence. It only makes sense if you process the data cumulatively, day by day. So, for this accumulation to be correct, you must remove any old database files before starting a new set of statistics. The --keep-last option also affects these statistics. The persistence format is already pretty compact. It has been re-evaluated and reduced in version 1.5.

I hope I helped you.

Feel free to add any information from new tests you run. One tip: compare the fields that appear at the top of the report; they will point you toward the origin of your problem.

0bi-w6n-K3nobi avatar Dec 20 '21 12:12 0bi-w6n-K3nobi

Thank you for your detailed suggestions and please excuse the delayed reply.

First, check whether the --keep-last option is set in your goaccess.conf. This option restricts the number of days that will be stored. If it is unset or set to zero, that's fine.

The config file was the stock one from the Docker container, which doesn't have keep-last set. I've now tried explicitly setting it to 'none', but that doesn't change anything.

Second... You can select just one of the files, process it with and without the --ignore-crawlers option, and check the difference between the results. Is the difference significant or negligible? Maybe the crawlers account for a lot of the traffic to your server, and it's a bad idea to remove them from your statistics. Maybe that's your problem.

I've tried running without this setting, but it doesn't change anything either.

Third... Use the --invalid-requests= option and see whether a large number of requests end up in that output. If they do, then you either have a parsing problem (an incorrect --log-format) or the log file is getting mixed up.

I've tried this now and it comes up empty, so that's not it either.

And finally... I assume you are aware of why you would use disk persistence. It only makes sense if you process the data cumulatively, day by day.

Well, I was just trying to see if that would be a viable alternative. If it worked without the disk persistence that would be easier to handle for us, I suspect. The disk persistence might be an alternative to storing the logs, but I'm not sure if the data format is guaranteed to stay stable. Nevertheless it's instructive that processing the same data incrementally works, while it gets truncated when processed directly.

The persistence format is already pretty compact. It has been re-evaluated and reduced in version 1.5.

Fair enough, I just thought that it was still quite large at 1/6th of the size of the original log files. We could compress the db, but then it might be easier to just compress the original log files instead. I think I was expecting the db to just contain aggregated counters, similar to an RRD database, but it appears to store all individual unaggregated requests (presumably to allow for reprocessing the same log file without double counting).

I've now tried something else:

This works:

cat u_ex2105*.log | goaccess --log-format W3C --no-query-string --anonymize-ip -o test.html -

This doesn't:

goaccess --log-format W3C --no-query-string --anonymize-ip -o test.html u_ex2105*.log

This leads me to believe that the issue must be somewhere in the file handling code.

Taking a cursory look at the code, I noticed that there is an undocumented limit on the number of filename arguments (MAX_FILENAMES), which was set to 512 and has since been raised to 3072 in b383da55dc76af52f48842d4955c70ae12692fce. I would strongly recommend adding some error handling to print a warning and return an error code when this limit is hit, because silently throwing away data is not acceptable behaviour in my book. It would also be a good idea to mention this limit in the documentation.
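
For illustration, something along these lines is what I have in mind. This is only a minimal sketch, assuming a hypothetical add_filename() helper; the actual goaccess sources are organized differently:

```c
#include <stdio.h>

#define MAX_FILENAMES 3072  /* current limit in the goaccess source */

static const char *filenames[MAX_FILENAMES];
static int nfiles = 0;

/* Hypothetical helper: refuse extra files loudly instead of dropping them. */
static int
add_filename (const char *path)
{
  if (nfiles >= MAX_FILENAMES) {
    fprintf (stderr,
             "FATAL: more than %d log files given; '%s' and any further files "
             "would be silently ignored. Raise MAX_FILENAMES or concatenate "
             "the logs.\n", MAX_FILENAMES, path);
    return -1;                  /* caller should exit with a non-zero status */
  }
  filenames[nfiles++] = path;
  return 0;
}
```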

Edit: It seems I was mistaken about using the latest version; I was actually using 1.5 (via :latest), which Docker didn't automatically update for some reason. I can confirm that it doesn't truncate with version 1.5.4, but it looks like, even though the limit was raised, it would still silently truncate the result once the limit is exceeded.

fgoepel avatar Jan 06 '22 17:01 fgoepel

Hi @shado23

Thank you for your tests. I understand your point of view.

Well... compressing text files can get you down to somewhere between 7% and 13% of the original size. However, I believe you are forgetting the time needed to tokenize and process those files. Storage space may not be a problem, but the time to process, organize, and sort the data is. That is the idea behind on-disk persistence.

Yeah. It may be possible to add some error handling for what you mentioned. Hi @allinurl, could you maybe check this? I mean the warning mentioned above.

0bi-w6n-K3nobi avatar Jan 06 '22 22:01 0bi-w6n-K3nobi

@0bi-w6n-K3nobi Thanks for those tips and @shado23 for the update. Let me look into displaying a message when the limit is reached. Also, how did you end up updating to v1.5.4? You said it didn't update through Docker, so I wonder if there's something going on with the latest build...

allinurl avatar Jan 07 '22 02:01 allinurl

Hi @shado23

Some tips for you:

You can use SquashFS to store your logs. OK... it may require some work on your part: you have to compress the logs into a SquashFS file and use a loop-back device (or FUSE) to mount it.

For the on-disk persistence, you could use a Btrfs filesystem with compression and gain some space that way.

0bi-w6n-K3nobi avatar Jan 07 '22 17:01 0bi-w6n-K3nobi

I agree. I've done some testing to compress certain data that goaccess stores, and I can say that I haven't found a good way of compressing URLs. I'll need to look further and see if there's something out there that can help. @0bi-w6n-K3nobi, have you seen anything that may help to compress such data, e.g. URLs and requests (long and short)?

allinurl avatar Jan 07 '22 17:01 allinurl

Hi @allinurl

Humm... I can try some ideas here.

First, @allinurl... have you looked into URL-shortener algorithms? They may amount to nothing more than a key into a hash table... in which case they wouldn't help us.

Second... some kind of run-length encoding? Maybe progressive RLE, from the longest URLs to the shortest, the same way text compressors work? The processing could happen at the save-to-disk phase.

Or in the same order as they appear in the hash table... that way you could compress the URLs as if they were one continuous text, and then load and decompress them in the same order.

LZO is great because of its simplicity... Remember that it was used on a space probe, implemented on a microcontroller.

And an even simpler algorithm could just reduce each character to something like 6 bits (as in the Windows 3.1 PCX image file format).
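
Just to illustrate that last idea, here is a rough sketch of 6-bit packing. It assumes a fixed 64-character alphabet and a hypothetical pack6() helper; real URLs would need an escape for characters outside the alphabet:

```c
#include <stdint.h>
#include <string.h>

/* Assumed 64-character alphabet; anything else would need an escape. */
static const char alphabet[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/-";

/* Pack each input character into 6 bits. 'out' must hold at least
 * ceil(6 * strlen(in) / 8) bytes. Returns bytes written, or 0 if a
 * character falls outside the alphabet. */
static size_t
pack6 (const char *in, uint8_t *out)
{
  size_t bits = 0, nbytes = 0;
  uint32_t acc = 0;

  for (; *in; in++) {
    const char *p = strchr (alphabet, *in);
    if (!p)
      return 0;                            /* not representable in 6 bits */
    acc = (acc << 6) | (uint32_t) (p - alphabet);
    bits += 6;
    while (bits >= 8) {                    /* emit every full byte */
      bits -= 8;
      out[nbytes++] = (uint8_t) (acc >> bits);
    }
  }
  if (bits)                                /* flush the last partial byte */
    out[nbytes++] = (uint8_t) (acc << (8 - bits));
  return nbytes;
}
```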

0bi-w6n-K3nobi avatar Jan 07 '22 17:01 0bi-w6n-K3nobi

I have not looked at LZO; do you have some docs on it that might help? I was looking at this post, which seems interesting too, with a C++ implementation. Ideally we would implement this when processing the log, but I guess worst-case scenario, when data is persisted and restored?

allinurl avatar Jan 08 '22 03:01 allinurl

Hi @allinurl

The post that you mentioned seems great!

You can of course implement it at the log-processing phase... It will certainly reduce memory consumption... but it may also add processing time. You will probably need to implement and test it, and verify whether the memory savings (and therefore faster hash-table access and updates) compensate for the extra URL-shortening work.

If I understood correctly, the algorithm/function will have direct access to the values, so there is no need for any further hash or calculation for indexing. It may be possible to chain it directly to the hash table that counts URLs, or to access a second (lookup) table that points to the URL count table. That could compensate for any extra time spent in the URL-shortening function.
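
To make this more concrete, here is a very rough sketch. It assumes a hypothetical shorten() step has already produced a stable integer key for the URL, and it leaves out collision handling entirely:

```c
#include <stdint.h>

#define URL_BUCKETS 4096          /* hypothetical table size */

struct url_slot {
  uint64_t key;                   /* key returned by the hypothetical shorten() */
  uint64_t hits;                  /* request count for that URL */
};

static struct url_slot url_table[URL_BUCKETS];

/* Use the shortened key directly as the index: no second hash is needed. */
static void
count_url (uint64_t short_key)
{
  struct url_slot *slot = &url_table[short_key % URL_BUCKETS];
  /* Collision handling (open addressing, chaining) omitted in this sketch. */
  slot->key = short_key;
  slot->hits++;
}
```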

0bi-w6n-K3nobi avatar Jan 08 '22 21:01 0bi-w6n-K3nobi

Well... the LZO homepage is here.

miniLZO is the best bet in my opinion... But now, having looked over the code, I'm sorry I suggested it.

0bi-w6n-K3nobi avatar Jan 08 '22 21:01 0bi-w6n-K3nobi

@0bi-w6n-K3nobi Thanks for those tips and @shado23 for the update. Let me look into displaying a message when the limit is reached.

Thanks. That's much appreciated.

Also, how did you end up updating to v1.5.4? You said it didn't update through Docker, so I wonder if there's something going on with the latest build...

I think that was most likely user error on my part. Issuing a docker pull allinurl/goaccess:latest fixed it in any case. I'm not really sure when Docker decides to automatically re-pull the tag and when it doesn't.

fgoepel avatar Jan 10 '22 00:01 fgoepel

Hi @allinurl

This topic is somewhat old, and so is my subject... But I found something interesting about compression... I think it could be used under critical conditions, i.e. for large volumes of data/logs.

This post has content about what they call compression for short text strings.

Basically, they build a dictionary of words. It is very reminiscent of a kind of LZW, but done in a simpler way, by querying tables.

Well, I think this may have some relevance for you.

See you soon.

0bi-w6n-K3nobi avatar Dec 06 '23 10:12 0bi-w6n-K3nobi

@0bi-w6n-K3nobi Thanks for sharing this info. I vaguely remember giving it a shot before, but I can't quite recall why it didn't stick. Let me give it another go and see what I discover.

allinurl avatar Dec 08 '23 00:12 allinurl

PS:

Well... I think the best approach is to build a small fixed dictionary, e.g. of words from the User-Agent. In that case, also try to compress the version number, using one byte for each number.

But not everything is rosy... You will need a prefix byte (or a few bits) to signal whether the next piece of text is in the dictionary or is written literally.
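
A tiny sketch of what I mean, with a hypothetical fixed dictionary and the high bit of a prefix byte marking dictionary tokens versus literals:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed dictionary of common User-Agent words. */
static const char *ua_dict[] = { "Mozilla", "Windows", "Chrome", "Safari", "AppleWebKit" };
#define UA_DICT_LEN (sizeof ua_dict / sizeof ua_dict[0])

/* Encode one token: bytes with the high bit set are dictionary references,
 * everything else is a length-prefixed literal. A version component such as
 * 11 or 97 would be stored the same way as a single byte. Returns bytes
 * written to 'out'. */
static size_t
encode_token (const char *word, uint8_t *out)
{
  for (size_t i = 0; i < UA_DICT_LEN; i++)
    if (strcmp (word, ua_dict[i]) == 0) {
      out[0] = (uint8_t) (0x80 | i);        /* high bit = "in dictionary" */
      return 1;
    }

  size_t len = strlen (word);
  if (len > 0x7f)
    len = 0x7f;                             /* sketch: truncate long literals */
  out[0] = (uint8_t) len;                   /* high bit clear = literal */
  memcpy (out + 1, word, len);
  return len + 1;
}
```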

0bi-w6n-K3nobi avatar Dec 10 '23 14:12 0bi-w6n-K3nobi

@0bi-w6n-K3nobi Are you suggesting using something similar for rapid browser/OS searches or string matching? For instance, perhaps utilizing bsearch()?
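
Something like this toy lookup is what I had in mind. It assumes an exact-match table kept sorted by name, so it's only an illustration; the actual user-agent matching is substring-based:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Table must stay sorted by name for bsearch() to work. */
static const char *browsers[] = {
  "Chrome", "Edge", "Firefox", "MSIE", "Opera", "Safari"
};

static int
cmp_str (const void *a, const void *b)
{
  return strcmp (*(const char *const *) a, *(const char *const *) b);
}

int
main (void)
{
  const char *key = "Firefox";
  const char **hit = bsearch (&key, browsers,
                              sizeof browsers / sizeof browsers[0],
                              sizeof browsers[0], cmp_str);
  printf ("%s\n", hit ? "found" : "not found");
  return 0;
}
```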

allinurl avatar Dec 12 '23 02:12 allinurl

Hi @allinurl

Yep. Maybe it is possible.

Or, at the moment the OS and browser are parsed, use additional variables holding a number or enum value corresponding to the detected OS/browser. That way you perform two jobs for the cost of one.

Well... You can imagine the rest.

Hmmm... You already work with pointers and copied strings... Perhaps duplicate the User-Agent string and reduce/compress it right away with an enum/integer, as well as reducing the browser version number.

Again, I think you can partially reuse the OS/browser parsing for this as well.
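
For example, a sketch of that "two jobs for the cost of one" idea, with a hypothetical browser_t enum and parse function; the real goaccess parser is organized differently:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical compact representation stored instead of (or alongside)
 * the raw User-Agent string. */
typedef enum { BROWSER_UNKNOWN, BROWSER_FIREFOX, BROWSER_CHROME, BROWSER_SAFARI } browser_t;

typedef struct {
  browser_t browser;        /* enum instead of a copied substring */
  unsigned char major;      /* major version packed into one byte */
} ua_info_t;

/* While detecting the browser for the report, also fill the compact
 * record, so no second pass over the string is needed. */
static ua_info_t
parse_user_agent (const char *ua)
{
  ua_info_t info = { BROWSER_UNKNOWN, 0 };
  const char *p = NULL;

  if ((p = strstr (ua, "Firefox/")) != NULL)
    info.browser = BROWSER_FIREFOX;
  else if ((p = strstr (ua, "Chrome/")) != NULL)
    info.browser = BROWSER_CHROME;
  else if ((p = strstr (ua, "Safari/")) != NULL)
    info.browser = BROWSER_SAFARI;

  if (p) {                                   /* crude major-version grab */
    p = strchr (p, '/') + 1;
    info.major = (unsigned char) atoi (p);
  }
  return info;
}
```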

0bi-w6n-K3nobi avatar Dec 13 '23 12:12 0bi-w6n-K3nobi