matomo-log-analytics
Add an option to only process log entries that haven't been processed before
I want to use Matomo with log analytics only. My Nginx logs are rotated every week, but I want my reports to be updated much more often, e.g. every hour. If I just feed the same log file, with its already-reported visits, to the importer, I will get duplicated entries, so I need to either rotate logs every hour (very inconvenient) or somehow prevent log entries from being imported twice. Based on what I could find, there is currently no easy way to do this.
This pull request solves this by tracking the latest visit timestamp found in an imported log file and saving it to the file specified by the --timestamp-file option. On the next run, this timestamp is loaded at startup and all visits before or on this timestamp are ignored (like --exclude-older-than, but inclusive, since the log entry with the equal timestamp was already parsed).
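The load, filter, and save cycle described above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the function names, the `(timestamp, line)` hit representation, and the on-disk format (borrowed here from the way the summary prints dates) are all assumptions.

```python
import os
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple

# Assumed file format; the format actually written by the PR may differ.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def load_timestamp(path: str) -> Optional[datetime]:
    """Return the timestamp saved by a previous run, or None on the first run."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    return datetime.strptime(text, TIMESTAMP_FORMAT) if text else None


def save_timestamp(path: str, timestamp: datetime) -> None:
    """Overwrite the timestamp file with the latest timestamp seen."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(timestamp.strftime(TIMESTAMP_FORMAT))


def run_import(hits: Iterable[Tuple[datetime, str]], timestamp_file: str) -> List[str]:
    """Hypothetical driver: return only the lines not covered by an earlier run."""
    initial = load_timestamp(timestamp_file)
    latest = initial
    imported = []
    for when, line in hits:
        # Inclusive comparison: an entry with an equal timestamp was
        # already parsed by the previous run, so it is skipped as well.
        if initial is not None and when <= initial:
            continue
        imported.append(line)
        if latest is None or when > latest:
            latest = when
    if latest is not None:
        save_timestamp(timestamp_file, latest)
    return imported
```

Feeding the same weekly log to `run_import` every hour would then import only the entries that are newer than the saved timestamp.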
This kind of solves https://github.com/matomo-org/matomo-log-analytics/issues/144.
I've put initial_timestamp (loaded from the file at the beginning) in the config and latest_timestamp (updated after every log record) in the stats. This can be moved elsewhere if it's not the best place.
I've also added some lines to the summary to print the status of the timestamp-based filtering, and included the older/newer than filtering too since it's related:
Logs import summary
-------------------
    85 requests imported successfully
    36 requests were downloads
    10627 requests ignored:
        73 HTTP errors
        2 HTTP redirects
        0 invalid log lines
        345 filtered log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        153 requests done by bots, search engines...
        10054 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

    Processed logs since: 2018-11-06 19:05:12 +0000
    Saved last timestamp: 2018-11-07 11:59:42 +0000
I also tweaked the printing there to remove extra empty lines (more than 2 newlines are compacted into 2). This was already a problem before, as the space between the 2nd and 3rd sections was bigger than between the 1st/2nd and 3rd/4th because of %(sites_ignored)s, but the added date filtering section made it more visible.
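For illustration, the newline compaction can be expressed as a single regular-expression substitution. This is a sketch of the idea only; the helper name `compact_blank_lines` is hypothetical and the PR may implement it differently.

```python
import re


def compact_blank_lines(text: str) -> str:
    """Collapse any run of three or more newlines into exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)
```

Runs of one or two newlines are left untouched, so at most one empty line remains between summary sections even when a placeholder such as %(sites_ignored)s expands to an empty string.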
Works fine here.
The only problem I see is that the timestamp file gets updated even when using --dry-run.
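One possible way to address this would be to guard the write with the dry-run flag. The sketch below is hypothetical: the function name, its parameters, and the ISO 8601 file format are assumptions, not code from this pull request.

```python
import os
from datetime import datetime, timezone


def maybe_save_timestamp(path: str, timestamp: datetime, dry_run: bool) -> bool:
    """Write the timestamp file unless this is a dry run.

    Returns True if the file was written, False if the write was skipped.
    """
    if dry_run:
        # A dry run should leave no state behind, so the next real run
        # still starts from the previously saved timestamp.
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(timestamp.isoformat())
    return True
```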
This is definitely a useful enhancement. I would certainly love to see this in the Matomo Log Importer 👍
I resolved merge conflicts with 4.x-dev (commit 6f66f962fcc985e998c9c6f1bf4522067bc5c07a) here: https://github.com/strager/matomo-log-analytics/tree/timestamp
ping. I've been using my version of this patch for a while and I've been happy with it.