arcadedb icon indicating copy to clipboard operation
arcadedb copied to clipboard

Reuse RID of deleted records

Open lvca opened this issue 2 years ago • 7 comments

Discussed in https://github.com/ArcadeData/arcadedb/discussions/1203

Originally posted by lvca August 12, 2023 ArcadeDB does not reuse the assigned RIDs. Never. Even when the record is deleted. This has the advantage of determining if a link is broken (the RID points to a non-existent record), useful in case of unidirectional edges.

In other cases reusing the RIDs allows reusing all the space in pages. In some use cases that are heavily based on deletes, this feature can save a lot of space and improve overall performance.

I'd like to make this configurable, so the user can decide per bucket/type if the RIDs are reusable.

Phase 1 (scheduled for 24.1.1):

  • [x] After a record delete, the page should be compressed by shifting the records to the left making space for new records
  • [x] Delete records should have pointers = 0 and not valid pointers with zero length buffers. This would save precious bytes and would make the page 100% reusable
  • [x] Update the total record in the page counter when the last record is deleted: cycle left the pointers while the pointers are 0. Then update the total records on the page
  • [x] Consider calling compressPage() at commit time only if a page has been updated or deleted (insert does not create holes)
  • [ ] More random stress tests with heavy multi-threads load to check for bugs on reclaimed space (https://github.com/ArcadeData/arcadedb/issues/1402)
  • [ ] New settings for recycling the RIDs at bucket/type/global levels

Phase 2 (it will be separated into a new issue):

  • [ ] When an insert/update needs a new page, scan the previous pages to find a hole to reuse. This allows to execute this scan only at new pages, so in the best case 1 time every 2048 inserted records
  • [ ] Consider saving holes on a different structure to be quickly reused without a scan
  • [ ] To properly recover the space occupied by shrunk multi-page records (a record spans over multiple pages and then shrinks in size)
  • [ ] Double-check if the space is 100% reclaimed in the linked list used by edges

lvca avatar Dec 19 '23 05:12 lvca

This branch is already in good shape. The records are fully recycled, and I optimized the record management on the page. We have some statements in the documentation that we never recycle RIDs, so we should prepare the docs for this change in 24.1.1.

Still missing one point to make the recycle fully working and the last as optional in this phase to speed up things, but not necessary for the 1st release.

lvca avatar Dec 27 '23 00:12 lvca

@lvca, as I see you're started considering space economy, a very "most-wanted" missing feature of orientdb was to be able to compact files.. Can this be a case in ArcadeDb anytime soon?

tolgaulas avatar Dec 27 '23 06:12 tolgaulas

@tolgaulas Correct, ArcadeDB will be able to reuse 100% of the space over time. This first issue will be able to reclaim all the space used and deleted. Indexes already supported compaction since the beginning. What is still missing (after this issue) is:

  1. To properly recover the space occupied by shrunk multi-page records (a record spans over multiple pages and then shrinks in size)
  2. Double-check if the space is 100% reclaimed in the linked list used by edges

(Updated the main issue with these new sub-tasks)

lvca avatar Dec 27 '23 15:12 lvca

@tolgaulas You're right about OrientDB, some structures didn't allow to reuse the whole allocated space, especially with indexes and ridbags.

lvca avatar Dec 27 '23 15:12 lvca

This PR is in good shape. The missing issue to start thinking about a possible merge (still to test how well works with old legacy databases) is how to manage deleted RIDs in index.

Our LSMTree is highly optimized for ArcadeDB. Any LSMTree has 2 parts (1) the last page which is mutable, and (2) all the other pages are immutable. In ArcadeDB the records that have been deleted have an entry on a more recent page with a negative RID to mark the record as deleted.

Since now record ids are reused, this causes a conflict if a record has been inserted, removed and inserted back.

lvca avatar Apr 11 '24 23:04 lvca

After further testing the index is fine, it already supports multiple entries with the same RID. The problem is with the test case ConsoleAsyncInsertTest.testOrderByAfterDeleteInsert which uses the async API. For some reason, the new records inserted after a delete are not updating one index.

lvca avatar Apr 14 '24 13:04 lvca

Some news. It's nothing related to Asynch, but seems more about how the index manages recycled RIDs when they end up on multiple pages. Still digging into the specific issue. This could be the last issue before doing some porting tests with old databases.

lvca avatar Jun 25 '24 21:06 lvca