Spline - automatic data retention?
Is there any easy way to configure automatic retention for Spline data? For example, automatically delete entries older than 3 months?
No, not yet. It could be added in the future though; it has been briefly mentioned a couple of times in local discussions.
Ok, I think it might be useful in big setups. We have hundreds of jobs running every day and dozens of jobs running every hour, so I'm pretty sure we would need to scale ArangoDB quite often without retention.
Any idea if retention can somehow be added in ArangoDB itself?
Well... not in ArangoDB natively (AFAIK), but you could use a custom script. However, this approach is rather fragile and dirty, and I wouldn't recommend it, especially in a busy environment. To do it correctly and avoid breaking database consistency you need to respect the graph structure and cascade the deletion properly.
On the other hand, if you could find a logical time gap in your data pipelines that is big enough (e.g. one hour), you could try to cut the DB in the middle, or even closer to the end, of that time gap: simply delete all documents with `_created < my_timestamp` from all collections except `dataSource`, all in one transaction.
This is still a hacky approach, but with a large enough idle time frame it practically should not create any problems.
Also keep in mind that the described approach might not work in Spline 0.6+
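For illustration, a minimal arangosh sketch of that idea, as an assumption rather than an official tool. The collection names below are guesses based on the ones mentioned in this thread; the real Spline schema contains more collections that would also have to be included.

```js
// Minimal sketch only - collection names are assumptions; adjust to the actual schema.
var cutoffTimestamp = Date.now() - 90 * 24 * 60 * 60 * 1000; // e.g. keep ~3 months

var collectionsToPrune = ["progress", "progressOf", "executionPlan", "operation", "uses", "affects", "depends"];

db._executeTransaction({
  collections: { write: collectionsToPrune }, // every collection written to must be declared
  params: { cutoff: cutoffTimestamp, cols: collectionsToPrune },
  action: function (params) {
    var db = require("@arangodb").db;
    // delete everything older than the cutoff from all collections except dataSource
    params.cols.forEach(function (col) {
      db._query(
        "FOR doc IN @@col FILTER doc._created < @cutoff REMOVE doc IN @@col",
        { "@col": col, cutoff: params.cutoff }
      );
    });
  }
});
```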
Thanks for the detailed answer :)
There is no way to find any time gap on our Hadoop cluster; it's processing something 24/7/365 :)
Awesome. I can see your case will be a really great test case for Spline :))
Thanks for responding. Do we have any tentative timeline for its release?
We'll try to implement something simple in 0.5.4 which should be out in the course of the month. We are addressing performance bottlenecks in this release, so this feature logically fits here.
Solution:
- Add a command to the Admin CLI to remove lineage data older than a given date.
- ~~Add an option to the Gateway to do the same thing periodically based on a provided cron expression.~~

On second thought, I realized that having a CLI command for removing old data makes it unnecessary to make this a Spline Server responsibility. System cron jobs can be used instead to simply run `java -jar admin-cli.jar db-prune ...`
The issue is blocked by arangodb/arangodb#12693
Moving it to 0.6.0 as it most likely needs to wait until the ArangoDB issue is resolved.
@wajda Do we have any workaround as of now? Our lineage is now growing at ~2 GB per day and it's getting difficult to sustain this rate. Ours is a 24x7 workload running every day. We are fine if lineage older than 20 days is broken; to handle that we can limit the pagination query to go back only 20 days at most.
Yes, the blocking ArangoDB issue has been resolved and we'll return to this ticket soon.
hi @wajda
We are working on a Foxx service which will take care of cleaning up Spline's Arango collections; the pseudocode for it is as follows:

1. Based on a threshold value (say, in days), remove all 'progress' documents (and 'progressOf' edges) older than the threshold.
2. Identify all orphan execution plans (to which no 'progressOf' edge points):
   a. Delete all documents and edges belonging to the orphan execution plans
   b. Delete all orphan execution plans
3. Identify all orphan data sources (to which no 'affects' or 'depends' edge points):
   a. Delete all such orphan data sources
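For example, step 1 could look roughly like the following untested sketch; the `progressOf` edge direction and the threshold arithmetic are assumptions:

```js
// Hypothetical arangosh sketch of step 1. Assumes progressOf edges point from
// progress to executionPlan, i.e. edge._from == progress._id.
var thresholdDays = 30; // example value
var purgeLimitTimestamp = Date.now() - thresholdDays * 24 * 60 * 60 * 1000;

// remove the edges first, while the old progress documents are still queryable
db._query(
  "FOR prog IN progress FILTER prog._created < @t " +
  "  FOR edge IN progressOf FILTER edge._from == prog._id " +
  "    REMOVE edge IN progressOf",
  { t: purgeLimitTimestamp }
);

// then remove the progress documents themselves
db._query(
  "FOR prog IN progress FILTER prog._created < @t REMOVE prog IN progress",
  { t: purgeLimitTimestamp }
);
```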
We will update you on how our testing goes and whether any challenges are discovered along the way.
Kindly let us know if you have any other requirements for us to consider.
Thank you!
cc: @pratapmmmec @vishalag001
@Aditya-Sood, thank you for your efforts! Yes, that is roughly how it should work. There are, however, a few pitfalls that you need to be aware of:

- Identify all orphan data sources.
  Remember that a DataSource has a many-to-many relationship with the Read/Write operations and is inserted before the operations that use it, so there is a risk of deleting a data source that has just been created while the operation connected to it has yet to be inserted. Basically, we could potentially be dealing with partially committed transactions and a low isolation level, both of which can unfortunately occur in an ArangoDB cluster setup. If such a rare case occurs you could leave the database in a logically inconsistent state, as there are no foreign key constraints in ArangoDB and nothing stops you from adding an edge that points to a non-existing node. So be careful with removing data source nodes.
- Identify all orphan execution plans.
  The same as above, but with respect to the execution event -> execution plan relationship. Depending on how you use Spline, especially if there are standing jobs that are executed once in a while and generate the same execution plan (with the same hash-based UUID), you might need to account for late events.
- In the upcoming Spline 1.0.0 GA we are planning to support application-level transactions to remediate the lack of full ACID (especially A and I) in the ArangoDB cluster. To accomplish that we will introduce a certain protocol that every producer and consumer must follow to guarantee proper Tx isolation. This will involve special metadata associated with every node and edge in the database. So the script that you write will either stop working after the upgrade, or it will work but not be properly isolated, making the above-mentioned problem more likely, as we are abandoning Arango's native Tx support and will rely only on our own application-level transactions.
Here's an old branch where I started adding data retention commands to the Spline Admin CLI: https://github.com/AbsaOSS/spline/tree/feature/spline-684-data-retention. Unfortunately, due to a critical bug that was discovered in ArangoDB while working on that functionality, the work was suspended. The bug has since been resolved, so I believe we will soon resume work on the feature. This is just FYI, but of course any help and feedback is more than welcome! :)
hi @wajda Thank you for highlighting the issue with the lack of transaction isolation!
As a workaround we have decided to extend the threshold filter from the progress events to both the `executionPlan` and `dataSource` collections as well - this should allow the cleanup script to ignore any pseudo-orphan objects that were created recently, and pick only those objects which are at least `thresholdDays` old.
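A rough sketch of the adjusted `dataSource` check, untested; the `affects`/`depends` edge direction is an assumption, so both ends are checked:

```js
// A dataSource is removed only if it is older than the threshold AND no
// affects/depends edge is attached to it.
var purgeLimitTimestamp = Date.now() - 20 * 24 * 60 * 60 * 1000; // e.g. keep 20 days

db._query(
  "FOR ds IN dataSource " +
  "  FILTER ds._created < @t " +
  "  LET refs = LENGTH(FOR a IN affects FILTER a._from == ds._id OR a._to == ds._id RETURN 1) " +
  "           + LENGTH(FOR d IN depends FILTER d._from == ds._id OR d._to == ds._id RETURN 1) " +
  "  FILTER refs == 0 " +
  "  REMOVE ds IN dataSource",
  { t: purgeLimitTimestamp }
);
```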
Kindly let us know if you foresee any other concerns with this approach.
Also, the JS program is now ready and we will be testing it on our Arango DB this week; will update you after that.
Thanks
cc: @pratapmmmec @vishalag001
Awesome! Thanks a million.
hi @wajda We developed a straightforward script to iterate over all the collections, filter out the relevant documents and then remove them, something like the following:
```js
db._query(
  'FOR collectionEle IN @@cols FILTER collectionEle._created < @purgeLimitTimestamp FOR orphanExecPlanID IN @arr FILTER collectionEle._belongsTo == orphanExecPlanID REMOVE collectionEle IN @@cols',
  {
    '@cols': collections[i],
    'arr': orphanExecPlanIDsOlderThanThresholdDaysArray,
    'purgeLimitTimestamp': purgeLimitTimestamp
  }
)
```
However, its execution turned out to be very slow, with arangosh timing out during the processing of large collections such as `operation`.
We tried restricting the purge window to just one day's worth of data, but even then it timed out on the `uses` collection.
This could be explained by the fact that this approach primarily relies on the `_belongsTo` attribute to identify the relevant documents, and since this attribute is not indexed in most collections, the script ends up scanning all the documents in them.
As a workaround we are working on a graph-traversal-based approach to identify the `_key`s of all the edges and vertices linked to the orphan execution plans (since `_key` is an indexed attribute), and then use those keys for removing the documents.
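Roughly, the idea is something like the following untested sketch; the edge collection list and the traversal depth are placeholders, since the real Spline graph has more/other edge collections:

```js
// orphanExecPlanIDsOlderThanThresholdDaysArray is the array of execution plan
// _id values produced by the earlier part of the script.
var linkedDocs = db._query(
  "FOR planId IN @orphanPlanIds " +
  "  FOR v, e IN 1..10 OUTBOUND planId uses, affects, depends " + // placeholder edge collections
  "    RETURN DISTINCT { vertexKey: v._key, edgeKey: e._key }",
  { orphanPlanIds: orphanExecPlanIDsOlderThanThresholdDaysArray }
).toArray();
```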
We will update you once we have had a chance to test the new approach. Hopefully the graph traversal will offer the performance gain required for the deletion to go through without arangosh timing out.
Thanks!
cc: @pratapmmmec @vishalag001
Did you try creating indices on `_belongsTo` in all collections? Usually traversals are much slower than document-style queries, especially ones on indexed attributes. That's why `_belongsTo` was created.
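For reference, a persistent index on `_belongsTo` can be added from arangosh roughly like this; the collection list here is only illustrative:

```js
// Add a persistent (sparse) index on _belongsTo to every collection the cleanup
// script filters on (adjust the list to your deployment).
["operation", "uses"].forEach(function (name) {
  db._collection(name).ensureIndex({
    type: "persistent",
    fields: ["_belongsTo"],
    sparse: true // skip documents that don't carry the attribute at all
  });
});
```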
Oh, I was under the impression that traversals would be much faster, since all the relevant collections (vertices + edges) belonging to a particular plan are picked up together in one go?
Thanks for clarifying; I will retry the original document-query script after creating an index on `_belongsTo` and update you.
> Oh, I was under the impression that traversals would be much faster, since all the relevant collections (vertices + edges) belonging to a particular plan are picked up together in one go?
No, unfortunately it doesn't work like that. Traversals basically go node by node, conditionally jumping over edges. They are good for querying graphs not because they are faster, but because you don't have any other option and just have to travel through edges to find the nodes you are looking for.
hi @wajda
Adding indexes on `_belongsTo` to the remaining collections did the trick; we were able to get all the queries to run within the default 20-minute server timeout threshold.
However, the subsequent query to identify orphan `dataSource`s seems to need much more time than that (despite the query attributes being indexed), so we are tweaking the timeout threshold to see if it can be made to complete in a reasonable amount of time; will update after that.
Just out of curiosity, what's the size of your collections?
Hi @wajda, thanks for your contribution to this issue; I have kept an eye on this feature for a long time. It's quite tough to clean up data on a big dataset, and I think that's the main reason why several people have requested the data retention feature. What's the rough ETA for this feature?
Unfortunately, I cannot give any ETA. On one hand, it's not a priority feature for our company at the moment. But I can say that a lot of things are changing right now, and there is a chance for the project to receive a boost very soon. On the other hand, as I tried to explain above, this feature depends on another one that introduces a set of fundamental changes to the way transactions are managed, and that one has to be finished first. I hope to give some update on the plans, say, next month.
hi @wajda Apologies for the delay in responding, we were caught up with some other tasks.
Our total Spline database size was around ~100 GB on the system that we were trying to clean up.
We dealt with the deletion of `dataSource` objects in two steps:
- Trim down the `affects` and `depends` collections to the target duration that we finally want, before initiating the search for stale data sources (this reduced the overall dataset size that had to be traversed to identify the stale sources) - see the sketch after this list.
- Rewrite the query used to identify stale `dataSource` objects - there was an optimisation opportunity, since using an equality operation on the non-indexed `_id` attribute was performing better than a string match on the indexed `_key` attribute.
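The trimming referenced in the first step is essentially the same threshold-based removal as before, e.g. something like this sketch (the threshold value is just an example):

```js
// Drop affects/depends edges older than the target window before searching
// for stale data sources.
var purgeLimitTimestamp = Date.now() - 20 * 24 * 60 * 60 * 1000; // example: keep 20 days

["affects", "depends"].forEach(function (col) {
  db._query(
    "FOR e IN @@col FILTER e._created < @t REMOVE e IN @@col",
    { "@col": col, t: purgeLimitTimestamp }
  );
});
```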
Combining this with the previous script changes has allowed us to trim a day's worth of stale objects in 1-1.5 hours.
Following is the final cleanup script that we are using: https://gist.github.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce
Could you please review the script and let us know if there are any changes that you would like?
Post that we would like to discuss integrating the cleanup functionality with the project, possibly as a configurable service added to the Admin utility
Thank you @Aditya-Sood. I will review it this week and will get back to you.
@Aditya-Sood, we've reviewed your script and found it generally correct. But I've got a few minor questions/suggestions about it:
- Stage 1:
  - `FILTER progEvent.timestamp` vs. `FILTER progEvent._created` in the `FOR progEvent IN progress` loop. The difference is that `timestamp` denotes the event time, while `_created` denotes the processing time. With a large enough time window there is little to no practical difference, but I guess it would logically be more correct to use the former.
  - Using `_created` on `progressOf`. Again, while it might be highly unlikely, there is still a risk of grabbing an edge that belongs to a living node just because its processing time happened to be 1 ms earlier than that of the adjacent progress node. This could happen because there is no predetermined order in which the two entities are created, and the `_created` time might not be exactly the same for all nodes inserted by a single logical transaction. This is the same reason why we can't just wipe all the docs from all collections in one go by the `_created` predicate alone. I would suggest removing `progressOf` by looking at its relation to the `progress` being removed. Use either a traversal `1 OUTBOUND prog progressOf` or a simple document query `FOR progOf IN progressOf FILTER progOf._from == prog._id`, which could potentially be faster in this case (see the first sketch after this list).
- At stage 2, where you collect the orphan execution plan IDs:
  - Why do you need a second loop? Wouldn't it be better to collect them at stage 1, when removing the progress and progressOf nodes?
  - When checking for node orphanage, why do you use `NOT IN` and require a two-step process? Can't you simply count the number of adjacent edges and compare it with `0`? (See the second sketch after this list.)
- What do you mean by "using an equality operation on the non-indexed `_id` attribute was performing better than a string match on the indexed `_key` attribute"? In ArangoDB the attributes `_id`, `_key`, `_from` and `_to` are always indexed.
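For clarity, the document-query variant suggested for the `progressOf` cleanup (the first sketch) could look roughly like this; it is untested and the threshold value is only an example:

```js
// First sketch: remove progressOf edges via their relation to the progress
// documents being purged (by event time), instead of by the edges' own _created.
var purgeLimitTimestamp = Date.now() - 20 * 24 * 60 * 60 * 1000; // or reuse the value from the reviewed script

db._query(
  "FOR prog IN progress " +
  "  FILTER prog.timestamp < @t " +
  "  FOR progOf IN progressOf " +
  "    FILTER progOf._from == prog._id " +
  "    REMOVE progOf IN progressOf",
  { t: purgeLimitTimestamp }
);
```

And the single-pass orphanage check based on counting adjacent edges (the second sketch) might look like this; again just a sketch, with the extra `_created` guard reflecting the grace period discussed earlier:

```js
// Second sketch: an execution plan is considered orphaned when no progressOf edge
// points to it; plans created within the threshold window are left untouched.
// purgeLimitTimestamp as defined in the first sketch.
var orphanPlanIds = db._query(
  "FOR plan IN executionPlan " +
  "  FILTER plan._created < @t " +
  "  FILTER LENGTH(FOR e IN progressOf FILTER e._to == plan._id RETURN 1) == 0 " +
  "  RETURN plan._id",
  { t: purgeLimitTimestamp }
).toArray();
```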