How to recover Bookies that completely run out of storage space?

Open RaulGracia opened this issue 3 years ago • 2 comments

QUESTION We recently ran tests to observe how Bookies behave when they exhaust their storage space. In one test, the 3 Bookies (10GB journal, 10GB ledger) went into read-only mode, as expected. After some time (perhaps after a GC cycle, which may require extra storage), we observed that the 3 Bookies were in CrashLoopBackOff (Kubernetes deployment):

# k get po
NAME                                                READY   STATUS             RESTARTS   AGE
bookkeeper-bookie-0                                 0/1     Running            247        20h
bookkeeper-bookie-1                                 0/1     CrashLoopBackOff   253        20h
bookkeeper-bookie-2                                 0/1     CrashLoopBackOff   239        20h

The Bookies cannot start because there is no space left on the device for some of the internal Bookie threads to start; here the journal thread fails while creating a new journal file:

2022-03-21 17:51:11,309 - ERROR - [BookieJournal-3181:Journal@1146] - I/O exception in Journal thread!
java.io.IOException: No space left on device
    at java.base/java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.base/java.io.File.createNewFile(File.java:1035)
    at org.apache.bookkeeper.bookie.JournalChannel.<init>(JournalChannel.java:159)
    at org.apache.bookkeeper.bookie.JournalChannel.<init>(JournalChannel.java:117)
    at org.apache.bookkeeper.bookie.Journal.run(Journal.java:963)
2022-03-21 17:51:11,310 - INFO  - [BookieJournal-3181:Journal@1158] - Journal exited loop!

The question is: is there a suggested recovery procedure if we find Bookies in this situation? One constraint on any potential solution: the data on the impacted Bookies must remain available, because the system that uses BookKeeper needs to read it at least once.

We have considered resizing the Bookie volumes. My understanding is that, if we manage to do this, it would solve the problem and the Bookies would be able to boot. If a volume is not resizable, we have also considered adding new journal/ledger directories to the Bookies, backed by new volumes, but I don't know if this would work (we may need to adjust the Cookie and metadata of the Bookies).

Having a procedure to deal with this problem (if it does not exist) in the documentation would be great as well.

RaulGracia, Mar 28 '22 09:03

@RaulGracia Adding or removing ledger and journal directories is supported; I think we just lack documentation for it. I will open a new PR to document how to update the cookie. The steps are as follows:

  1. Get the bookie server instance ID: bin/bookkeeper shell whatisinstanceid
  2. Generate the cookie VERSION file: bin/bookkeeper shell cookie_generate -i ${instanceid} -j ${all journal directories} -l ${all ledger directories} -o VERSION ${bookie server id}
  3. Copy the VERSION file to all journal and ledger directories.
  4. Update the cookie in ZooKeeper: bin/bookkeeper shell cookie_update --cookie-file /data3/bookkeeper/ledgers/current/VERSION
  5. Restart the bookie server.
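
For illustration, steps 1 to 3 above could be rehearsed with a dry-run sketch like the one below. It only prints the commands and never runs them. The instance ID, bookie ID, and directory paths are hypothetical placeholders. Two things are assumptions, not confirmed in this thread: that -j and -l take comma-separated lists, and that the VERSION file lives under each directory's current/ subdirectory (as the path in step 4 suggests).

```shell
#!/bin/sh
# Dry-run sketch of the cookie update steps: prints commands, runs nothing.
# All values below are hypothetical placeholders.
INSTANCE_ID="${INSTANCE_ID:-example-instance-id}"
BOOKIE_ID="${BOOKIE_ID:-bookie-0.example:3181}"
JOURNAL_DIRS="${JOURNAL_DIRS:-/data1/bookkeeper/journal,/data3/bookkeeper/journal}"
LEDGER_DIRS="${LEDGER_DIRS:-/data2/bookkeeper/ledgers,/data3/bookkeeper/ledgers}"

# Step 2: build the cookie_generate command line from its four inputs.
# Assumption: -j and -l accept comma-separated directory lists.
cookie_generate_cmd() {
  printf 'bin/bookkeeper shell cookie_generate -i %s -j %s -l %s -o VERSION %s\n' \
    "$1" "$2" "$3" "$4"
}

# Step 3: print one VERSION destination per directory in a comma-separated list.
# Assumption: the file belongs under <dir>/current/.
version_targets() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r dir; do
    if [ -n "$dir" ]; then printf '%s/current/VERSION\n' "$dir"; fi
  done
}

# Step 1: the instance ID has to be read from this command's output by hand.
echo "bin/bookkeeper shell whatisinstanceid"
cookie_generate_cmd "$INSTANCE_ID" "$JOURNAL_DIRS" "$LEDGER_DIRS" "$BOOKIE_ID"
version_targets "$JOURNAL_DIRS,$LEDGER_DIRS" | while IFS= read -r target; do
  echo "cp VERSION $target"
done
```

Reviewing the printed plan before running anything by hand seems safer here, since a wrong cookie can keep a bookie from starting.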

gaozhangmin, Apr 07 '22 06:04

Assuming you cannot add a new disk/expand existing disk:

  • Can the bookie start as read-only (forceReadOnlyBookie = true)? I hope it won't try to create a journal in this case.
  • Journal disk out of space: you can clean up older copies of journals, reduce the journal file size to fit into the available space, and reduce journalMaxBackups.
  • Ledger disks out of space: reduce logSizeLimit so that two entry logs fit into the available space (some other params may need to be adjusted too), and maybe disable entryLogFilePreallocationEnabled.
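
As a rough illustration of the ledger-disk idea, the sketch below derives a candidate logSizeLimit from the free space on a ledger disk. The halving rule is only one reading of "fit two entry logs into the available space", and LEDGER_DIR is a hypothetical path (it defaults to / only so the sketch runs anywhere). The other params that may need adjusting are not covered.

```shell
#!/bin/sh
# Sketch: derive a candidate logSizeLimit from the free space on a ledger disk.
# LEDGER_DIR is a hypothetical placeholder; point it at a real ledger directory.

# Free space in bytes on the filesystem holding directory $1 (POSIX df, 1 KiB blocks).
avail_bytes() {
  kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
  echo $((kb * 1024))
}

# Largest logSizeLimit (in bytes) that lets two entry logs fit into $1 free bytes.
suggest_log_size_limit() {
  echo $(($1 / 2))
}

LEDGER_DIR="${LEDGER_DIR:-/}"
free=$(avail_bytes "$LEDGER_DIR")
echo "free bytes on $LEDGER_DIR: $free"
echo "logSizeLimit=$(suggest_log_size_limit "$free")"
echo "entryLogFilePreallocationEnabled=false"
```

If the suggested value comes out at or near zero, the disk is effectively full and some space has to be freed before any setting can help.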

I haven't tried this in a while. I'd have to reproduce it and spend more time than I have now to provide detailed instructions, so treat these as rough ideas.

dlg99, May 04 '22 23:05