Optimise file checkpoints by deleting checkpoints for old stopped monitored files

Open Brabalawuka opened this issue 6 months ago • 1 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

If monitored files are deleted, their checkpoints are kept in case the same files are renamed or moved back. However, once a file has been deleted for some length of time, such as 7 days, its checkpoint should be tidied up to reduce the size of the checkpoint file.

Use Cases

When Vector is deployed to monitor a fixed path where logs are generated and deleted continuously, I observed that the checkpoints for the files Vector monitors grow without bound. This is especially true when it is deployed as a DaemonSet where a volume mount is used to collect the logs of all pods, which share the same hostPath (with different sub-paths). The files are generated quickly and deleted after one day. Those old logs are already deleted, but there is no way to remove their checkpoints from Vector.

Attempted Solutions

There is no temporary workaround. I could use a script to tidy up the checkpoint file, but the checkpoint file contains no information other than the checksums of the monitored and no-longer-monitored files. So I could not even implement a customised optimiser to resolve this issue.

Proposal

Add an option to delete checkpoints in the checkpoint file that refer to non-existent files, provided the checkpoint's updated date has not changed in the last xxx days.

e.g. After Vector is launched, or while it is traversing checkpoints, if it finds a checkpoint that refers to a file that does not exist in the monitored path, it removes that checkpoint if its last modified time is more than 7 days ago.
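
To make the proposed rule concrete, here is a minimal sketch of the pruning logic in Python. It is only an illustration, not Vector's implementation: the `Checkpoint` record, the `prune_checkpoints` function and the `live_fingerprints` set are all hypothetical names, and the sketch assumes each checkpoint carries a last-updated timestamp. It also assumes the pruning runs inside the process that knows which fingerprints still match files on disk, since the checkpoint file alone does not say so.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Checkpoint:
    # Hypothetical in-memory record; not Vector's actual on-disk schema.
    fingerprint: str    # checksum identifying the file
    position: int       # byte offset already read
    modified: datetime  # last time this checkpoint was updated


def prune_checkpoints(checkpoints, live_fingerprints, now, max_age):
    """Keep a checkpoint if its file still exists or it was updated recently."""
    cutoff = now - max_age
    return [
        cp for cp in checkpoints
        if cp.fingerprint in live_fingerprints or cp.modified >= cutoff
    ]


now = datetime(2024, 1, 29, tzinfo=timezone.utc)
checkpoints = [
    Checkpoint("aaa", 1024, now - timedelta(days=10)),  # file gone, stale: dropped
    Checkpoint("bbb", 2048, now - timedelta(days=2)),   # file gone, recent: kept
    Checkpoint("ccc", 4096, now - timedelta(days=30)),  # file still present: kept
]
kept = prune_checkpoints(checkpoints, {"ccc"}, now, timedelta(days=7))
print([cp.fingerprint for cp in kept])  # ['bbb', 'ccc']
```

Keeping recently updated checkpoints even when the file is missing preserves the current behaviour for files that are renamed or moved back shortly after disappearing.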

References

No response

Version

vector 0.34.0

Brabalawuka avatar Jan 29 '24 09:01 Brabalawuka

Thanks for this, @Brabalawuka. I was aware of this issue, but didn't realize we didn't have a GitHub issue tracking it.

jszwedko avatar Jan 29 '24 15:01 jszwedko

I'd love to see something implemented for this as well.

Our stop-gap architecture as we rework our logging stack is to rsync access logs from hundreds of edge nodes to central Vector 'ingest' nodes. The files are usually small, on the order of a few megabytes each, and the name of each file is unique. Vector deletes the files one second after processing them, so keeping days' worth of entries in checkpoints.json is just hurting us. Over the course of even a few days, checkpoints.json on each ingest node grows to be ~300MB.

As you can imagine, this generated significant amounts of disk IO as the file was rewritten over and over. Moving it to a tmpfs helped in the short term, but obviously the file is going to continue to grow. Short of just stopping Vector, removing the file and then starting Vector again, is there a less invasive short-term solution that can be implemented on our end?

zdykstra avatar May 20 '24 15:05 zdykstra