
Investigate logging refactor

Open · dfaller opened this issue 6 years ago · 5 comments

Investigate refactoring the current logging system:

  1. Should work as part of a "whole cluster" logging system.
  2. It must be possible to isolate logs for system components/tasks from those of algorithm tasks.
  3. Needs to allow us to attach specific metadata fields per Mesos task.
  4. Needs to provide a good way for the Scale UI to display portions of the log and support paging in and out.

dfaller · Jan 02 '18 20:01

  1. The solution needs to support hot reload of logging levels for Scale system components.
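
To make that requirement concrete, here is a minimal sketch of what hot reload could look like in a Python component; the logger names and the mechanism that delivers the new level (DB setting, API call, config watch) are assumptions, not a design decision.

```python
import logging

# Hypothetical helper: re-apply a log level to a running Scale system
# component's loggers without a restart. The logger names below are
# placeholders; how the new level is delivered is left open.
def apply_log_level(level_name, logger_names=('scale', 'scheduler', 'messaging')):
    level = getattr(logging, level_name.upper(), logging.INFO)
    for name in logger_names:
        logging.getLogger(name).setLevel(level)
```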

Fizz11 · Jan 09 '18 17:01

It sounds like we can use the offset within the file stream to maintain message ordering with sub-millisecond precision using Filebeat alone. We do need to ensure that log rotation at the Mesos sandbox level doesn't cause issues here.

This could easily be tested by hooking Filebeat up to a task that sprays incrementing numbers as fast as possible to force a size-based file rotation. We could then perform our ordered log retrieval from Elasticsearch to ensure order was maintained.
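
A rough sketch of that test, assuming Filebeat 6.x field names (`source`, `offset`, `message`) and the elasticsearch-py client; the index name and sandbox path are placeholders:

```python
from elasticsearch import Elasticsearch

def spray_numbers(path, count=1000000):
    """Write incrementing integers as fast as possible to force size-based rotation."""
    with open(path, 'w') as out:
        for i in range(count):
            out.write('%d\n' % i)

def verify_order(es, index='filebeat-*', source_path='/path/to/sandbox/stdout'):
    """Pull the harvested lines back, sorted by file offset, and check they are monotonic."""
    # Note: rotated files may be harvested under a different source path, so
    # this filter may need to be loosened to cover the rotation case.
    result = es.search(index=index, body={
        'query': {'term': {'source': source_path}},
        'sort': [{'offset': {'order': 'asc'}}],
        'size': 10000,
    })
    values = [int(hit['_source']['message']) for hit in result['hits']['hits']]
    assert values == sorted(values), 'ordering was lost across rotation'

if __name__ == '__main__':
    verify_order(Elasticsearch(['http://localhost:9200']))
```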

gisjedi · Jan 18 '18 17:01

  1. We should investigate whether Filebeat supports injection of arbitrary metadata via key-value pairs stored in a sidecar file. If we could just write a file into the sandbox that provides additional static metadata appended to each Filebeat message, that would be much preferred over munging the task IDs to extract the desired metadata.

gisjedi · Jan 18 '18 17:01

Unfortunately, there does not appear to be any way to add contextual metadata directly to the Filebeat stream beyond the basic values (host, source, etc.). Based on my current understanding, my recommendation is to retain Logstash within our architecture for the purpose of streaming message manipulation. We MUST at least be able to tie the Mesos Task ID to the message, which we can extract from the source field.
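
Illustrative only, the sort of extraction the Logstash filter would perform on the `source` field; the sandbox path layout follows the usual Mesos agent work directory structure, and the assumption that the executor directory name carries the Task ID is not settled:

```python
import re

# Assumed sandbox path shape:
#   .../executors/<task_id>/runs/<container_id>/stdout|stderr
SOURCE_PATTERN = re.compile(
    r'/executors/(?P<task_id>[^/]+)/runs/[^/]+/(?P<stream>stdout|stderr)'
)

def extract_task_metadata(source):
    """Pull the Mesos Task ID and output stream out of a sandbox file path."""
    match = SOURCE_PATTERN.search(source)
    return match.groupdict() if match else {}
```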

gisjedi · Jan 22 '18 17:01

Phase 1:

  1. Document Filebeat prerequisites for Scale
  2. Update Logstash to a highly available deployment in bootstrap
  3. Update Scale Task IDs to include the metadata currently in the Syslog tag
  4. Add metadata extracted from the Task ID via Logstash to messages sent to Elasticsearch
  5. Plan the Elasticsearch upgrade from 2.x to 6.x
  6. Update UI to support tailing logs directly from Elasticsearch

Phase 2:

  1. Deploy Filebeat automatically as part of Scale deployment.

Metadata:

  • Job Execution
  • Job Type
  • System / Algorithm Task
  • STDERR / STDOUT

gisjedi · Jan 22 '18 20:01