
Database optimizations

Open matthiasbeyer opened this issue 4 years ago • 1 comments

The database can be slimmed a bit:

  • [ ] ENV can be fetched from the git history of the package repository
  • [ ] The script can be recomputed from the git history of the package repository
  • [ ] The package can be fetched from the git history of the package repository

This information is stored in the DB right now, but does not have to be.

Backwards compatibility might be an issue, though.

matthiasbeyer avatar Oct 19 '21 07:10 matthiasbeyer

After some critical thinking about this:

ENV

ENV is not that important anyways, as it is only a tiny bit of data that we store in the DB. Nothing to worry about IMO.

Job Script and Package

Recomputing the script is expensive. Not because of the computing of the actual script, but because we need to walk the git revisions and then load the packages from an old revision. That wouldn't be too hard either, as long as the package data type layout doesn't change. As soon as it changes, we need to provide several parser backends for each version of the package data type layout ... and that would just increase the code complexity of butido itself (by a LOT).

Same goes for the package itself, of course.

But what would be easily doable, and (IMO) a far better solution, is to have another service available (next to the database) that is responsible for storing the logs. Some kind of object store that can deduplicate and compress text very well: for example, if we store logs line-wise, building the same package 10 times would not store the log 10 times, but only once plus the differences to the other logs (basically what git does with trees).
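To make the line-wise deduplication idea a bit more concrete, here is a minimal in-memory sketch. It is not butido code: `LineStore`, `put_log`, `get_log` and `unique_lines` are made-up names for illustration, and a real service would persist and compress the data instead of keeping it in memory.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical line-wise log store: every distinct line is kept once,
/// and each log is only a sequence of references (ids) to those lines.
#[derive(Default)]
pub struct LineStore {
    /// All distinct lines, addressed by their position.
    lines: Vec<Rc<str>>,
    /// Line text -> id; shares the text with `lines` via `Rc`.
    index: HashMap<Rc<str>, usize>,
    /// Job id (assumed to be a string here) -> ordered line ids.
    logs: HashMap<String, Vec<usize>>,
}

impl LineStore {
    /// Store a log for `job_id`, reusing lines that are already known.
    pub fn put_log(&mut self, job_id: &str, log: &str) {
        let mut ids = Vec::new();
        for line in log.lines() {
            let id = match self.index.get(line) {
                Some(&id) => id,
                None => {
                    let text: Rc<str> = Rc::from(line);
                    let id = self.lines.len();
                    self.lines.push(Rc::clone(&text));
                    self.index.insert(text, id);
                    id
                }
            };
            ids.push(id);
        }
        self.logs.insert(job_id.to_owned(), ids);
    }

    /// Rebuild the log text for `job_id` (without a trailing newline).
    pub fn get_log(&self, job_id: &str) -> Option<String> {
        let ids = self.logs.get(job_id)?;
        let parts: Vec<&str> = ids.iter().map(|&i| &*self.lines[i]).collect();
        Some(parts.join("\n"))
    }

    /// Number of distinct lines actually stored.
    pub fn unique_lines(&self) -> usize {
        self.lines.len()
    }
}
```

With this layout, storing three logs that share most of their lines only adds the lines that differ; the per-log cost is the list of ids.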

But those are details! The basic idea is to have an object store. The three implementations I know of are: the ELK stack, which is Elastic... not open source anymore AFAICT, so not an option I guess; OpenSearch, which is the Amazon fork of Elastic... not sure whether that's an option or whether it is actually open source at all; and Graylog, but it uses Elasticsearch itself... so maybe not a viable option either. But that's the general direction to go in, I guess.

Maybe there are better alternatives to this... I don't know.


It would also be good to keep in mind that logs should be structured in the future. Not only text lines: an attached timestamp, for example, would be beneficial in the long run, too!
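A structured log line could look roughly like the following sketch. The type and field names (`LogLine`, `Stream`, `timestamp_ms`) are assumptions for illustration, not an existing butido type, and the timestamp is simply milliseconds since the Unix epoch.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Which output stream a line came from (hypothetical field).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Stream {
    Stdout,
    Stderr,
}

/// One structured log line: the text plus metadata attached to it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LogLine {
    /// Milliseconds since the Unix epoch.
    pub timestamp_ms: u128,
    pub stream: Stream,
    pub message: String,
}

impl LogLine {
    /// Build a line with an explicit timestamp (useful for imports and tests).
    pub fn at(timestamp_ms: u128, stream: Stream, message: &str) -> Self {
        LogLine {
            timestamp_ms,
            stream,
            message: message.to_owned(),
        }
    }

    /// Build a line stamped with the current system time.
    /// Falls back to 0 if the clock is set before the Unix epoch.
    pub fn now(stream: Stream, message: &str) -> Self {
        let ts = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_millis())
            .unwrap_or(0);
        Self::at(ts, stream, message)
    }

    /// Render the line back into plain text for display.
    pub fn render(&self) -> String {
        format!("{} [{:?}] {}", self.timestamp_ms, self.stream, self.message)
    }
}
```

Keeping the message separate from the metadata means the plain text can still be deduplicated, while timestamps stay available for filtering or ordering.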

matthiasbeyer avatar Nov 17 '21 13:11 matthiasbeyer

Our database keeps growing, but it's fine to keep the data this way. Changing it would likely cause more trouble than benefit here. Maybe it would be better to clean the database of the log files of entries that are N years old; the rest does not take that much space.

christophprokop avatar Oct 01 '24 21:10 christophprokop