autoclean performance
The autoclean-once command is very slow and uses a lot of memory. The memory consumption in particular is concerning and could become a serious problem for larger nodes.
We are running a CLN node with a Postgres database. After a month of operation, I wanted to clean failed forwards from the database, because that data isn't relevant for us.
I ran autoclean-once and the command took more than an hour and allocated quite a bit of memory that was not released after it finished. It also appears that autoclean removes rows from the database either individually or in very small batches, which is not optimal for nodes running Postgres with synchronous replication.
Can confirm the relatively high memory usage. This is with an SQLite database.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
448811 nil 20 0 2619072 2.4g 0 S 6.3 3.9 1:29.69 autoclean
autoclean-once did not take an hour+ in my case, though I do have a beefy machine.
"cleaned": 1802687,
though I do have a beefy machine.
So do we. But I think Postgres with synchronous replication adds a lot of latency when you delete one entry, commit, and then have to wait for the network round trip that confirms the replica is in sync. It's not bottlenecked by CPU, RAM, or disk.
My personal CLN node with SQLite was much faster at running an autoclean.
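A back-of-envelope calculation illustrates why per-row commits dominate under synchronous replication. The row count is the "cleaned" figure reported above; the 2 ms commit round trip is an assumed number, not a measurement:

```python
# Back-of-envelope sketch with assumed numbers: under synchronous
# replication, each commit waits at least one network round trip to the
# replica, so committing once per deleted row multiplies that latency by
# the number of rows.
rows = 1_802_687      # the "cleaned" count reported above
rtt_s = 0.002         # assumed 2 ms commit round trip to the replica

per_row_hours = rows * rtt_s / 3600           # one commit per row
batched_hours = (rows / 1000) * rtt_s / 3600  # one commit per 1000-row batch
print(f"per-row: {per_row_hours:.2f} h, batched: {batched_hours:.4f} h")
```

With those assumptions, per-row commits alone cost about an hour of pure replication latency, consistent with the runtime reported above, while batching commits cuts that by three orders of magnitude.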
Autoclean is @rustyrussell's baby, maybe he has an idea of how to speed this up. I can take a look at the memory usage, btw.
So, autoclean specifically does delete rows individually, but it also disables transactions opportunistically (they'll still occur if other things need transactions, but back-to-back autoclean deletes will be merged). This allows for a much more controlled deletion than if we restricted it to simple database queries, though it's slower.
However, it will happily throw a million simultaneous delete commands at CLN, if it finds that many records! That's going to take a lot of memory. We should stage it into (say) no more than 1000 at once, then wait until those are processed. This will require a rework of the infrastructure inside the plugin, however.
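The staging idea above can be sketched roughly as follows. This is a hypothetical illustration using a simplified "forwards" table, not CLN's actual schema or plugin code:

```python
import sqlite3

# Hypothetical sketch of staged deletion: rather than issuing one delete
# per record all at once, fetch candidate row ids in bounded batches and
# delete each batch before fetching the next, so memory use is
# proportional to the batch size instead of the total row count.
# ("forwards" and its columns are a simplified stand-in, not CLN's
# actual tables.)
BATCH_SIZE = 1000

def staged_delete(conn, cutoff):
    """Delete failed forwards older than cutoff, BATCH_SIZE rows at a time."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM forwards WHERE state = 'failed' "
            "AND resolved_time < ? LIMIT ?",
            (cutoff, BATCH_SIZE)).fetchall()
        if not rows:
            break
        conn.executemany("DELETE FROM forwards WHERE id = ?", rows)
        conn.commit()  # one commit per batch, not per row
        total += len(rows)
    return total
```

Bounding the batch at 1000 rows keeps peak memory roughly constant regardless of how many records qualify, while still committing far less often than once per row.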
This allows for a much more controlled deletion than if we restricted it to simple database queries, though it's slower.
I am not exactly sure what you mean by that. When I tell autoclean to delete, for example, all failed forwards older than one day, no control is needed beyond WHERE state = ? AND resolved_time < ?.
However, it will happily throw a million simultaneous delete commands at CLN, if it finds that many records! That's going to take a lot of memory.
This implies that the memory consumption grows linearly with the number of records to delete, which could make it unusable for big nodes with lots of entries that aren't removed regularly.
This implies that the memory consumption grows linearly with the number of records to delete, which could make it unusable for big nodes with lots of entries that aren't removed regularly
Correct, but this is a bug in our implementation, not a limitation of the solution. We just need to iterate over the entries in batches: load the first 1k, run the code on them, then move on to the next batch.
I will put it on my todo :)
This allows for a much more controlled deletion than if we restricted it to simple database queries, though it's slower.
What criteria are you imagining that you can't express in the WHERE clause of a DELETE statement? Pulling every record from the table, only to turn around and issue millions of individual DELETE statements, is simply not the way to do it.
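The commenter's point can be sketched as a single statement: the cleaning criteria fit entirely in the WHERE clause, so the database removes matching rows itself and the client never loads them. Again, "forwards" and its columns are a simplified stand-in schema, not CLN's actual tables:

```python
import sqlite3

# Hypothetical single-statement version: the server-side WHERE clause
# does all the filtering, so client memory use is constant regardless of
# how many rows match.
def delete_failed_forwards(conn, cutoff):
    cur = conn.execute(
        "DELETE FROM forwards WHERE state = 'failed' AND resolved_time < ?",
        (cutoff,))
    conn.commit()
    return cur.rowcount  # rows removed, as reported by the database
```

The trade-off Rusty alludes to is control: a single bulk DELETE runs as one long operation inside the database, whereas individual deletes let the plugin interleave with other work.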
which could make it unusable for big nodes with lots of entries that aren't removed regularly
Indeed. I cannot use the autoclean plugin because it OOMs my machine.
Any updates? @vincenzopalazzo
Was this fixed by https://github.com/ElementsProject/lightning/pull/7298?
Yes, this was fixed by #7298 and released in v24.08.