Add an option to only process log entries that haven't been processed before

Open • mackuba opened this issue 6 years ago • 4 comments

I want to use Matomo with log analytics only. My Nginx logs are rotated every week, but I want my reports to be updated much more often, e.g. every hour. If I just feed the same log file, with visits that were already reported, to the importer, I will get duplicate entries, so I need to either rotate logs every hour (very inconvenient) or somehow prevent log entries from being imported twice. Based on what I could find, there is currently no easy way to do this.

This pull request solves this by tracking the latest visit timestamp found in an imported log file and saving it to a file specified with a --timestamp-file option. On the next run, this timestamp is loaded at startup and all visits at or before it are ignored (like --exclude-older-than, but inclusive, since the log entry with the equal timestamp was already parsed).
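As a rough illustration of the idea described above (not the code from this pull request), the filtering could look something like the sketch below. The helper names, the on-disk timestamp format, and the representation of hits as (datetime, request) tuples are all assumptions made for this example.

```python
import datetime
import os

# Assumed on-disk format; the actual patch may store the timestamp differently.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def load_timestamp(path):
    """Return the saved timestamp, or None if no timestamp was saved yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        text = f.read().strip()
    if not text:
        return None
    return datetime.datetime.strptime(text, TIMESTAMP_FORMAT)


def save_timestamp(path, timestamp):
    """Write a timezone-aware timestamp to the file."""
    with open(path, "w") as f:
        f.write(timestamp.strftime(TIMESTAMP_FORMAT))


def process_new_hits(hits, timestamp_file):
    """Return only hits newer than the saved timestamp, then save the newest one.

    `hits` is a hypothetical list of (timezone-aware datetime, request) tuples.
    """
    initial = load_timestamp(timestamp_file)
    latest = initial
    imported = []
    for when, request in hits:
        # Inclusive comparison: a hit with an equal timestamp was already imported.
        if initial is not None and when <= initial:
            continue
        imported.append((when, request))
        if latest is None or when > latest:
            latest = when
    if latest is not None and latest != initial:
        save_timestamp(timestamp_file, latest)
    return imported
```

Calling process_new_hits twice with the same hits and the same timestamp file should import everything on the first call and nothing on the second.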

This kind of solves https://github.com/matomo-org/matomo-log-analytics/issues/144.

I've put initial_timestamp (loaded from the file at the beginning) in the config and latest_timestamp (updated after every log record) in the stats. This can be moved elsewhere if it's not the best place.

I've also added some lines to the summary to print the status of the timestamp-based filtering, and included the older/newer than filtering too since it's related:

Logs import summary
-------------------

    85 requests imported successfully
    36 requests were downloads
    10627 requests ignored:
        73 HTTP errors
        2 HTTP redirects
        0 invalid log lines
        345 filtered log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        153 requests done by bots, search engines...
        10054 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

    Processed logs since: 2018-11-06 19:05:12 +0000
    Saved last timestamp: 2018-11-07 11:59:42 +0000

I also tweaked the printing there to remove extra empty lines (runs of more than 2 newlines are compacted into 2). This was already a problem before, as the space between the 2nd and 3rd sections was bigger than between the 1st/2nd and 3rd/4th because of %(sites_ignored)s, but it became more visible once the date filtering section was added.
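The newline compaction can be sketched with a single regular expression substitution. This is only an illustration of the technique; the function name is made up for the example and may not match the patch.

```python
import re


def compact_blank_lines(text):
    """Collapse any run of three or more newlines into exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)
```

With this, an empty %(sites_ignored)s placeholder no longer leaves a wider gap between sections, because the surrounding newlines collapse to a single blank line.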

mackuba, Nov 17 '18 15:11

Works fine here. The only problem I see is that the timestamp file gets updated even when using --dry-run.
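One possible way to address this, sketched here with a hypothetical helper rather than the importer's actual code, would be to skip the write whenever the dry-run flag is set:

```python
def save_timestamp_unless_dry_run(path, timestamp_text, dry_run):
    """Write the timestamp file only on a real run; return True if it was written.

    Hypothetical helper: `dry_run` stands in for the value of --dry-run.
    """
    if dry_run:
        # A dry run must not change state that affects the next real import.
        return False
    with open(path, "w") as f:
        f.write(timestamp_text)
    return True
```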

cweiske, Oct 05 '19 11:10

This is definitely a useful enhancement. I would certainly love to see it in the Matomo Log Importer 👍.

DevDavido, Dec 31 '21 00:12

I resolved merge conflicts with 4.x-dev (commit 6f66f962fcc985e998c9c6f1bf4522067bc5c07a) here: https://github.com/strager/matomo-log-analytics/tree/timestamp

strager, Feb 10 '22 03:02

Ping. I've been using my version of this patch for a while and I've been happy with it.

strager, Jun 28 '22 23:06