
Spline - automatic data retention?

Open DaimonPl opened this issue 4 years ago • 30 comments

Is there any easy way to configure automatic retention for Spline data? For example, automatically delete entries older than 3 months?

DaimonPl avatar May 28 '20 10:05 DaimonPl

No, not yet. It could be added in the future though; it has been mentioned briefly a couple of times in local discussions.

wajda avatar May 28 '20 11:05 wajda

OK, I think it might be useful in big setups. We have hundreds of jobs running every day and dozens running every hour, so without retention we would pretty surely need to scale ArangoDB quite often.

Any idea if retention can somehow be added in ArangoDB itself?

DaimonPl avatar May 28 '20 11:05 DaimonPl

Well... not in ArangoDB natively (AFAIK), but you could use a custom script. However, this approach is fragile and dirty, and I wouldn't recommend it, especially in a busy environment. To do it correctly without breaking database consistency, you need to respect the graph structure and cascade the deletion properly.

On the other hand, if you can find a logical time gap in your data pipelines that is big enough (e.g. one hour), you can try to cut the DB in the middle of that gap, or even closer to its end. Simply delete all documents with _created < my_timestamp from all collections except dataSource, all in one transaction. This is still a hack, but with a large enough idle time frame it should not create any practical problems. Also keep in mind that the described approach might not work in Spline 0.6+.
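For illustration only, a minimal arangosh sketch of that idea. The collection list below is just an example assembled from collections mentioned in this thread; the real set depends on your Spline version, and dataSource is deliberately excluded:

// cutoff value and collection list are assumptions for the sketch, not the actual Spline schema
const cutoff = Date.now() - 90 * 24 * 60 * 60 * 1000;  // e.g. keep roughly the last 3 months
const cols = ['progress', 'progressOf', 'executionPlan', 'operation', 'uses', 'affects', 'depends'];

db._executeTransaction({
  collections: { write: cols },
  params: { cols: cols, cutoff: cutoff },
  action: function (params) {
    const db = require('@arangodb').db;
    params.cols.forEach(function (col) {
      // remove everything older than the cutoff from this collection
      db._query('FOR d IN @@col FILTER d._created < @cutoff REMOVE d IN @@col',
                { '@col': col, cutoff: params.cutoff });
    });
  }
});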

wajda avatar May 28 '20 12:05 wajda

Thanks for the detailed answer :)

No way to find any time gap on our Hadoop cluster; it's processing something 24/7/365 :)

DaimonPl avatar May 28 '20 12:05 DaimonPl

Awesome. I see your case will be a really great test case for Spline :))

wajda avatar May 28 '20 12:05 wajda

Thanks for responding. Do we have any tentative timeline for its release?

pratapmmmec avatar Jul 03 '20 07:07 pratapmmmec

We'll try to implement something simple in 0.5.4, which should be out within the month. We are addressing performance bottlenecks in this release, so this feature fits here logically.

wajda avatar Jul 03 '20 08:07 wajda

Solution:

  1. Add a command to the Admin CLI to remove lineage data older than a given date.
  2. ~Add an option to the Gateway to do the same thing periodically based on a provided cron expression.~

wajda avatar Aug 10 '20 09:08 wajda

On second thought, I realized that having a CLI command for removing old data makes it unnecessary to make this a Spline Server responsibility. System cron jobs can be used instead to simply run java -jar admin-cli.jar db-prune ...

wajda avatar Aug 17 '20 17:08 wajda

The issue is blocked by arangodb/arangodb#12693

wajda avatar Sep 23 '20 15:09 wajda

Moving it to 0.6.0 as it most likely needs to wait until the ArangoDB issue is resolved.

wajda avatar Sep 23 '20 16:09 wajda

@wajda Do we have any workaround as of now? Our lineage is now growing at ~2 GB per day and it's getting difficult to sustain this rate. Ours is a 24x7 workload running every day. We are fine if lineage older than 20 days is broken; to handle that we can limit the pagination query to go back only 20 days at most.

pratapmmmec avatar Nov 22 '21 09:11 pratapmmmec

Yes, the blocking ArangoDB issue has been resolved and we'll return to this ticket soon.

wajda avatar Nov 22 '21 09:11 wajda

hi @wajda

We are working on a Foxx service that will take care of cleaning up Spline's Arango collections; the pseudocode for it is as follows:

  1. Based on a threshold value (say, in days), remove all 'progress' documents (and their 'progressOf' edges) older than that threshold

  2. Identify all orphan execution plans (to which no 'progressOf' edge points):
    a. Delete all documents and edges belonging to the orphan execution plans
    b. Delete the orphan execution plans themselves

  3. Identify all orphan data sources (to which no 'affects' or 'depends' edge points):
    a. Delete all such orphan data sources

We will update you on how our testing goes and whether any challenges are discovered along the way

Kindly let us know if you have any other requirements for us to consider

Thank you!

cc: @pratapmmmec @vishalag001

Aditya-Sood avatar Jul 28 '22 14:07 Aditya-Sood

@Aditya-Sood, thank you for your efforts! Yes, that is roughly how it should work. There are, however, a few pitfalls that you need to be aware of:

  1. Identify all orphan data sources.

Remember that data sources have a many-to-many relationship with the read/write operations and are inserted before the operations that use them, so there is a risk of deleting a data source that has just been created while the operation connected to it has yet to be inserted. Basically, we could potentially be dealing with partially committed transactions and a low isolation level, both of which can unfortunately occur in an ArangoDB cluster setup. If such a rare case occurs, you could leave the database in a logically inconsistent state: there are no foreign key constraints in ArangoDB, so nothing stops you from adding an edge pointing to a non-existing node. So be careful with removing data source nodes.

  2. Identify all orphan execution plans

The same as above, but with respect to the execution event -> execution plan relationship. Depending on how you use Spline, especially if there are standing jobs somewhere that are executed once in a while and generate the same execution plan (with the same hash-based UUID), you might need to account for late events.

  3. In the upcoming Spline 1.0.0 GA we are planning to support application-level transactions to remediate the lack of full ACID (especially A and I) in the ArangoDB cluster. To accomplish that we will introduce a certain protocol that every producer and consumer must follow to guarantee proper Tx isolation. This will involve special metadata associated with every node and edge in the database. So the script that you write will either stop working after the upgrade, or it will work but not be properly isolated, making the above-mentioned problem more likely, as we are abandoning Arango native Tx support and will rely only on our own application-level transactions.

wajda avatar Jul 28 '22 15:07 wajda

Here's an old branch where I started adding data retention commands to the Spline Admin CLI: https://github.com/AbsaOSS/spline/tree/feature/spline-684-data-retention. Unfortunately, due to a critical bug that was discovered in ArangoDB while working on that functionality, the work was suspended. The bug has since been resolved, so I believe we will soon resume the work on this feature. This is just FYI, but of course any help and feedback is more than welcome! :)

wajda avatar Jul 28 '22 15:07 wajda

Hi @wajda, thank you for highlighting the issue with the lack of transaction isolation!

As a workaround we have decided to extend the threshold filter from the progress events to the executionPlan and dataSource collections as well - this should allow the cleanup script to ignore any pseudo-orphan objects that have been created recently, and pick only objects that are at least thresholdDays old.
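A small arangosh sketch of that guard applied to the executionPlan side, assuming progressOf edges point from a progress event to its execution plan (variable names are illustrative, purgeLimitTimestamp as used elsewhere in this thread):

// a plan is a purge candidate only if it is old enough AND no progressOf edge references it
const orphanExecPlanIds = db._query(`
  FOR plan IN executionPlan
    FILTER plan._created < @purgeLimitTimestamp
    LET refs = (FOR po IN progressOf FILTER po._to == plan._id LIMIT 1 RETURN 1)
    FILTER LENGTH(refs) == 0
    RETURN plan._id
`, { purgeLimitTimestamp }).toArray();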

Kindly let us know if you foresee any other concerns with the same

Also, the JS program is now ready and we will be testing it on our Arango DB this week; will update you after that.

Thanks

cc: @pratapmmmec @vishalag001

Aditya-Sood avatar Aug 03 '22 05:08 Aditya-Sood

Awesome! Thanks a million.

wajda avatar Aug 03 '22 09:08 wajda

Hi @wajda, we developed a straightforward script to iterate over all the collections, filter out the relevant documents and then remove them; something like the following:

// for each collection, remove documents older than the purge timestamp
// that belong to one of the orphan execution plans
db._query(
    'FOR collectionEle IN @@cols ' +
    'FILTER collectionEle._created < @purgeLimitTimestamp ' +
    'FOR orphanExecPlanID IN @arr ' +
    'FILTER collectionEle._belongsTo == orphanExecPlanID ' +
    'REMOVE collectionEle IN @@cols',
    {
        '@cols': collections[i],                              // collection currently being purged
        'arr': orphanExecPlanIDsOlderThanThresholdDaysArray,  // orphan execution plan IDs
        'purgeLimitTimestamp': purgeLimitTimestamp            // retention cutoff timestamp
    }
)

However, its execution turned out to be very slow, with arangosh timing out during the processing of large collections such as operation.

We tried restricting the purge window to just one day's worth of data, but even then it timed out on the uses collection.

This could be explained by the fact that this approach primarily depends on the _belongsTo attribute for identifying the relevant documents, and since this attribute is not indexed for most collections, the script ends up scanning all the documents in each of them.

As a workaround we are working on a graph-traversal-based approach to identify the _keys of all the edges and vertices linked to the orphan execution plans (since _key is an indexed attribute), and then use those for removing the documents (see the traversal sketch below).
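For illustration, a rough arangosh sketch of such a traversal. The edge collection list and traversal depth are assumptions (only collections already mentioned in this thread are listed); the real Spline graph has more edge collections:

// collect ids of vertices and edges reachable from the orphan execution plans
const doomed = db._query(`
  FOR planId IN @planIds
    FOR v, e IN 1..10 ANY planId uses, affects, depends, progressOf
      RETURN DISTINCT { vertexId: v._id, edgeId: e._id }
`, { planIds: orphanExecPlanIDsOlderThanThresholdDaysArray }).toArray();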

Will update once we have had a chance to test the new approach. Hopefully the graph traversal will offer the performance gain necessary for the deletion to go through without arangosh timing out.

Thanks!

cc: @pratapmmmec @vishalag001

Aditya-Sood avatar Aug 09 '22 12:08 Aditya-Sood

Did you try creating indices on _belongsTo in all collections? Usually traversals are much slower than document-style queries, especially ones on indexed attributes. That's why _belongsTo was created.
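If helpful, a minimal arangosh sketch of creating such indices (the collection list is only an example; persistent, sparse indices are assumed):

// add a persistent index on _belongsTo to speed up the document-style purge queries
['operation', 'uses', 'affects', 'depends'].forEach(function (name) {
  db._collection(name).ensureIndex({ type: 'persistent', fields: ['_belongsTo'], sparse: true });
});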

wajda avatar Aug 09 '22 12:08 wajda

Oh, I was under the impression that traversals would be much faster, since all the relevant documents (vertices + edges) belonging to a particular plan are picked up together in one go?

Thanks for clarifying; will retry the original document-query script after creating an index on _belongsTo and update.

Aditya-Sood avatar Aug 10 '22 04:08 Aditya-Sood

Oh, I was under the impression that traversals would be much faster, since all the relevant documents (vertices + edges) belonging to a particular plan are picked up together in one go?

No, unfortunately it doesn't work like that. Traversals basically go node by node, conditionally hopping across edges. They are good for querying graphs not because they are faster, but because you often have no other option and just have to travel through the edges to find the nodes you are looking for.

wajda avatar Aug 10 '22 12:08 wajda

Hi @wajda, adding indexes on _belongsTo to the remaining collections did the trick; we were able to get all the queries to run within the default 20-minute server timeout threshold.

However, the subsequent query to identify orphan dataSources seems to need much more time than that (despite the query attributes being indexed), so we are tweaking the timeout threshold to see if it can be made to complete in a reasonable amount of time. Will update after that.

Aditya-Sood avatar Aug 14 '22 16:08 Aditya-Sood

Just out of curiosity, what's the size of your collections?

wajda avatar Aug 14 '22 17:08 wajda

Hi @wajda, thanks for your contribution to this issue; I have kept an eye on this feature for a long time. It's quite tough to clean up data in a big dataset, and I think that's the main reason why several people have requested the data retention feature. What's the rough ETA for it?

hugeshi avatar Aug 22 '22 08:08 hugeshi

Unfortunately, I cannot give any ETA. On one hand, it's not a priority feature for our company at the moment, though I can say that a lot of things are changing right now and there is a chance for the project to receive a boost very soon. On the other hand, as I tried to explain above, this feature depends on another feature that introduces a set of fundamental changes to the way transactions are managed, and that one has to be finished first. I hope to give some update on the plans, say, next month.

wajda avatar Aug 22 '22 09:08 wajda

Hi @wajda, apologies for the delay in responding; we were caught up in some other tasks.

Our total Spline database size was around ~100 GB on the system that we were trying to clean up.

We dealt with the deletion of dataSource objects in two steps:

  1. Trim the affects and depends collections down to the target retention duration before initiating the search for stale data sources (this reduced the overall dataset that had to be scanned to identify the stale sources)
  2. Rewrite the query used to identify stale dataSource objects - there was an optimisation opportunity, since using an equality operation on the non-indexed _id attribute was performing better than a string match on the indexed _key attribute (a rough sketch of the resulting query is below)
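For illustration, a hedged arangosh sketch of such a query, assuming affects and depends edges point from an execution plan to a data source and using the equality comparison against _id mentioned in point 2 (this is not the exact script from the gist):

// a dataSource is stale when it is older than the threshold and no affects/depends edge references it
db._query(`
  FOR ds IN dataSource
    FILTER ds._created < @purgeLimitTimestamp
    LET writes = (FOR a IN affects FILTER a._to == ds._id LIMIT 1 RETURN 1)
    LET reads  = (FOR d IN depends FILTER d._to == ds._id LIMIT 1 RETURN 1)
    FILTER LENGTH(writes) == 0 AND LENGTH(reads) == 0
    REMOVE ds IN dataSource
`, { purgeLimitTimestamp });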

Combining this with the previous script changes has allowed us to trim a day's worth of stale objects in 1-1.5 hours.

Aditya-Sood avatar Sep 21 '22 06:09 Aditya-Sood

Following is the final cleanup script that we are using: https://gist.github.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce

Could you please review the script and let us know if there are any changes that you would like?

After that, we would like to discuss integrating the cleanup functionality into the project, possibly as a configurable service added to the Admin utility.

Aditya-Sood avatar Sep 21 '22 06:09 Aditya-Sood

Thank you @Aditya-Sood. I will review it this week and will get back to you.

wajda avatar Sep 26 '22 10:09 wajda

@Aditya-Sood, we've reviewed your script and found it generally correct. But I've got a few minor questions/suggestions about it:

  1. Stage 1:
    • FILTER progEvent.timestamp vs. FILTER progEvent._created in the FOR progEvent IN progress loop. The difference is that timestamp denotes the event time, while _created denotes the processing time. With a large enough time window there is little to no practical difference, but I guess it would logically be more correct to use the former.
    • Using _created on progressOf. Again, while it might be highly unlikely, there is still a risk of grabbing an edge that belongs to a living node just because its processing time happened to be 1 ms earlier than that of the adjacent progress node. This could happen because there is no predetermined order in which the two entities are created, and the _created time might not be exactly the same for all nodes inserted by a single logical transaction. This is the same reason why we can't just wipe all the docs from all collections in one go by the _created predicate alone. I would suggest removing progressOf by looking at its relation to the progress being removed. Use either a traversal 1 OUTBOUND prog progressOf or a simple document query FOR progOf IN progressOf FILTER progOf._from == prog._id, which could potentially be faster in this case (see the sketch after this list).
  2. At stage 2, where you collect orphan execution plan IDs:
    • Why do you need a second loop? Wouldn't it be better to collect them at stage 1, when removing the progress and progressOf nodes?
    • When checking for node orphanage, why do you use NOT IN and require a two-step process? Can't you simply count the number of adjacent edges and compare it with 0?
  3. What do you mean by "using an equality operation on the non-indexed _id attribute was performing better than a string match on the indexed _key attribute"? In ArangoDB, the attributes _id, _key, _from and _to are always indexed.
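To make suggestions 1 and 2 concrete, a hedged arangosh sketch (not Spline's own code) that filters progress events by their event time, removes the related progressOf edges via _from, and collects the candidate execution plan IDs in the same pass:

// remove progressOf edges of expired progress events, remembering which plans they pointed to
const candidatePlanIds = db._query(`
  FOR prog IN progress
    FILTER prog.timestamp < @purgeLimitTimestamp
    FOR progOf IN progressOf
      FILTER progOf._from == prog._id
      REMOVE progOf IN progressOf
      RETURN OLD._to
`, { purgeLimitTimestamp }).toArray();

// then remove the expired progress events themselves
db._query(`
  FOR prog IN progress
    FILTER prog.timestamp < @purgeLimitTimestamp
    REMOVE prog IN progress
`, { purgeLimitTimestamp });

The candidatePlanIds collected here can then feed the stage-2 orphan check (counting adjacent edges and comparing with 0) instead of a separate second scan over progress.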

wajda avatar Oct 13 '22 09:10 wajda