
Scalyr can't keep up with our logs

Open mo-gr opened this issue 6 years ago • 9 comments

One of our applications is creating a lot of logs. They are mostly access logs, so there isn't much we can do to reduce the volume. We create so many logs that the Scalyr agent has trouble processing them. We very frequently see lines like:

10:53:03.131 skipper /var/log/scalyr-agent-2/agent.log  2018-11-14 09:53:03.131Z WARNING [core] [log_processing.py:1734] [error="skipForTooFarBehind"] Skipped copying 105580980 bytes in '/var/log/application.log' due to: Too far behind end of log.  Num of bytes to end is 105580980. 

The way I understand this, Scalyr is frequently dropping many MB of logs. This results in very frequent gaps of 1-2 minutes in the Scalyr logs. Is there anything we (or you) can do about this?
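To put a number on how much is being dropped, the agent log can be scanned for these warnings. The following is a minimal Python sketch, assuming every skip warning has the same shape as the line quoted above; the regular expression and the `summarize_skips` helper are illustrative and not part of the Scalyr agent.

```python
import re

# Pattern derived only from the single warning line quoted in this issue.
# Other agent versions may word the message differently (assumption).
SKIP_RE = re.compile(
    r'\[error="skipForTooFarBehind"\] Skipped copying (?P<bytes>\d+) bytes in '
    r"'(?P<path>[^']+)'"
)


def summarize_skips(lines):
    """Return {log_path: (number_of_skip_events, total_bytes_skipped)}."""
    totals = {}
    for line in lines:
        match = SKIP_RE.search(line)
        if match is None:
            continue  # not a skip warning
        path = match.group("path")
        count, skipped = totals.get(path, (0, 0))
        totals[path] = (count + 1, skipped + int(match.group("bytes")))
    return totals


# The warning from this issue, used as sample input.
SAMPLE = (
    '2018-11-14 09:53:03.131Z WARNING [core] [log_processing.py:1734] '
    '[error="skipForTooFarBehind"] Skipped copying 105580980 bytes in '
    "'/var/log/application.log' due to: Too far behind end of log.  "
    "Num of bytes to end is 105580980."
)

if __name__ == "__main__":
    for path, (count, skipped) in summarize_skips([SAMPLE]).items():
        print(f"{path}: {count} skip event(s), {skipped / 1e6:.1f} MB dropped")
```

To run it against the real file, pass an open handle for `/var/log/scalyr-agent-2/agent.log` to `summarize_skips` instead of the sample list.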

mo-gr avatar Nov 14 '18 10:11 mo-gr

/cc @femueller @vwiessner @christianberg

mikkeloscar avatar Nov 14 '18 11:11 mikkeloscar

We currently observe this on Taupage-AMI-20181101-120344 (ami-0c8c1409048d397a5)

mo-gr avatar Nov 14 '18 11:11 mo-gr

@mo-gr IMHO you should reduce the log volume, because you create a lot of I/O in a latency-critical application that affects the whole business. You can use https://opensource.zalando.com/skipper/reference/filters/#disableaccesslog to disable access logging on some of the routes.

szuecs avatar Nov 14 '18 12:11 szuecs

I think these access logs all have to be collected for compliance or similar reasons. Sampling is not an option in this case.

aryszka avatar Nov 14 '18 13:11 aryszka

@aryszka who said that? And "I think" is not a good base for a decision like this ;)

@ChristianLohmann @mrandi do you know if this is true, or can you find someone responsible who can answer whether team pathfinder has to log all access logs in the main shop HTTP router? This creates a lot of I/O and costs a lot of money. If it is required, there is now also a technical challenge that has to be solved.

szuecs avatar Nov 14 '18 14:11 szuecs

@szuecs I'd argue that this is a bug or misconfiguration that should be fixed in Scalyr agent. What's the point of a logging service that can't keep up with a reasonable logging volume for one of our most important applications? And why do the users now have to babysit something as simple as logging and try to customise it on a per-route basis?

aermakov-zalando avatar Nov 14 '18 16:11 aermakov-zalando

@aermakov-zalando I'll let team logging answer your question. In the end, technically you want to sample high-volume logs if that is possible; whether compliance allows it is a different question.

szuecs avatar Nov 14 '18 16:11 szuecs

I'd suggest not having this discussion in a public GitHub repo. I'll follow up via email.

christianberg avatar Nov 14 '18 21:11 christianberg

@szuecs plz create an internal ticket for this. Thx!

mrandi avatar Nov 19 '18 15:11 mrandi