Data files in ... differ from in-memory data
This is a repeat of #3386 and #2567. We verified that code changes for those tickets are currently on the running system.
In this case we had two separate tservers complaining about the list of files in-memory vs on disk being different (line 1098 of Tablet.java, Accumulo 2.1.1). On both tservers the difference was the same I file (bulk load file), which was in the metadata table but not in memory. They complained at about the same time, which tells me that this happened when the file was bulk loaded into the system. I scanned one of the tablets for data that was unique to this I file and it did show up, which tells me that the in-memory list eventually became consistent with the metadata table. Hence this is not a critical bug (i.e. no data was lost, and all data is available for query). However, whenever this happens it costs man-hours to verify that.
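For anyone triaging a similar report, the comparison behind the log message can be pictured as a two-way set difference. This is a minimal sketch, not the code in Tablet.java: the class `FileSetDiff`, the method `onlyIn`, and the file names are all made up for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

class FileSetDiff {
    /** Returns the files present in {@code a} but missing from {@code b}, sorted. */
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> diff = new TreeSet<>(a);
        diff.removeAll(b);
        return diff;
    }

    public static void main(String[] args) {
        // Hypothetical file names; the I prefix marks a bulk-imported file.
        Set<String> inMemory = Set.of("A0001.rf", "F0002.rf");
        Set<String> inMetadata = Set.of("A0001.rf", "F0002.rf", "I0003.rf");

        // A file listed only in the metadata table matches the case reported here.
        System.out.println("only in metadata: " + onlyIn(inMetadata, inMemory));
        System.out.println("only in memory:   " + onlyIn(inMemory, inMetadata));
    }
}
```

Reporting both directions separately makes it obvious whether a file is missing from memory (as in this report) or missing from the metadata table.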
Are you using BulkImport v1 or v2?
v1
Opened #3574 about making the tablet log an info message when the consistency check goes from bad to good. Next week I can look at the bulk import code and look for race conditions w.r.t. the consistency check.
The tserver code for adding bulk imports looks good: the counters for the consistency check cover both the metadata table update and the in-memory update. Some of the false positives with this check in #3392 were caused by these counters not covering both the metadata table and in-memory updates, so I double-checked that again. I am not seeing any race conditions so far.
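To make the counter idea concrete, here is a minimal sketch of how such a guard could work. It is an assumption-laden illustration, not the actual tserver code: `GuardedTablet`, `UpdateCount`, `beginUpdate`, `finishUpdate` and `compare` are hypothetical names. The point is that the started counter is bumped before the metadata write and the finished counter only after the in-memory change, so a check that overlaps an update is skipped instead of reporting a false positive.

```java
import java.util.Set;
import java.util.TreeSet;

class GuardedTablet {
    enum Result { CONSISTENT, INCONSISTENT, SKIPPED }

    /** Immutable snapshot of the two counters (hypothetical type). */
    static final class UpdateCount {
        final long started;
        final long finished;

        UpdateCount(long started, long finished) {
            this.started = started;
            this.finished = finished;
        }
    }

    private final Set<String> files = new TreeSet<>();
    private long started = 0;
    private long finished = 0;

    synchronized UpdateCount updateCount() {
        return new UpdateCount(started, finished);
    }

    /** Call before writing the metadata table mutation. */
    synchronized void beginUpdate() {
        started++;
    }

    /** Call after the metadata write; applies the in-memory change and closes the window. */
    synchronized void finishUpdate(Set<String> added, Set<String> removed) {
        files.removeAll(removed);
        files.addAll(added);
        finished++;
    }

    /**
     * Compares the in-memory files with a metadata scan. The snapshot must be taken
     * before the scan starts; if any update was in flight or happened since, the
     * comparison is meaningless and is skipped.
     */
    synchronized Result compare(UpdateCount before, Set<String> metadataFiles) {
        boolean inFlight = before.started != before.finished;
        boolean changedSince = before.started != started || before.finished != finished;
        if (inFlight || changedSince) {
            return Result.SKIPPED;
        }
        return files.equals(metadataFiles) ? Result.CONSISTENT : Result.INCONSISTENT;
    }
}
```

If either the metadata write or the in-memory change happened outside the begin/finish window, the guard would no longer protect the check, which is the kind of gap described for #3392.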
OK, so yesterday we had a very long day. In the end this is what happened. The metadata table had a tserver that minor compacted, and the root metadata got updated but the in-memory state did not for many metadata tablets on that tserver. Subsequently, when the metadata tablets migrated many hours later, the mutations came back into play, which resulted in many non-metadata tablets having the same inconsistency between what was in memory and what was in the metadata. At this point we decided that we needed to restart Accumulo, which we did, and then we had legions of missing files: those which had been, correctly, GC'd before. It took us all afternoon and evening to figure out what actually happened. I now think this issue of in-memory inconsistency needs to be addressed immediately.
Here is a thought. As soon as metadata and in-memory are determined to be out of sync, the tablet is unloaded or in the extreme the tserver is halted.
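One way to picture that proposal is as a small decision function. This is only a sketch under assumptions: nothing here is an existing Accumulo API, and "the extreme" is interpreted, purely for illustration, as the number of inconsistent tablets on the tserver reaching a hypothetical `haltThreshold`.

```java
class InconsistencyPolicy {
    enum Action { NONE, UNLOAD_TABLET, HALT_TSERVER }

    /**
     * Decides what to do after a consistency check on one tablet.
     *
     * @param consistent          result of the check for this tablet
     * @param inconsistentTablets inconsistent tablets on this tserver, including this one
     * @param haltThreshold       hypothetical knob: halt once this many tablets are affected
     */
    static Action decide(boolean consistent, int inconsistentTablets, int haltThreshold) {
        if (consistent) {
            return Action.NONE;
        }
        if (inconsistentTablets >= haltThreshold) {
            return Action.HALT_TSERVER;
        }
        return Action.UNLOAD_TABLET;
    }
}
```

For example, `InconsistencyPolicy.decide(false, 1, 3)` yields `UNLOAD_TABLET`, while `InconsistencyPolicy.decide(false, 3, 3)` yields `HALT_TSERVER`.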
The issue is seen with #3396, #3392, #3386 and #2567 running
I chatted with @ivakegg about this issue and the following is a summary of what I learned.
- metadata tablet major compaction creates file A1 (adding entry to the root tablet)
- root tablet flushes
- metadata tablet minor compaction creates file F2 (adding entry to root tablet)
- metadata tablet major compaction deletes files F2,A1 and adds file A3 (single mutation to root tablet)
- File F2 is still visible when scanning the root tablet and file A1 is not. The consistency check sees this because F2 is no longer in the tablet's in-memory set of files.
- Tablet server serving the metadata tablet dies
- Metadata tablet reloads and now sees file F2, which brings back some deleted data.
- metadata tablet major compacts F2 and some other files. This successfully deletes the tablet's F2 entry from the root tablet
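The sequence above can be replayed with two sets to see why the reload resurrects F2. This is a toy model, not Accumulo code: `CompactionReplay` and its methods are hypothetical, the `lostDelete` parameter simply models the observed symptom (the delete of F2 not being visible in the root tablet) without explaining its cause, and A3 being visible in the root tablet is an assumption.

```java
import java.util.Set;
import java.util.TreeSet;

class CompactionReplay {
    final Set<String> rootEntries = new TreeSet<>(); // what a scan of the root tablet returns
    final Set<String> inMemory = new TreeSet<>();    // the metadata tablet's in-memory file set

    /** A minor or major compaction that only adds a file. */
    void addFile(String file) {
        rootEntries.add(file);
        inMemory.add(file);
    }

    /**
     * A major compaction that replaces inputs with output. The delete of
     * lostDelete is assumed never to become visible in the root tablet.
     */
    void majorCompact(Set<String> inputs, String output, String lostDelete) {
        inMemory.removeAll(inputs);
        inMemory.add(output);
        for (String input : inputs) {
            if (!input.equals(lostDelete)) {
                rootEntries.remove(input);
            }
        }
        rootEntries.add(output);
    }

    /** Tablet reload after the tserver dies: memory is rebuilt from the root tablet. */
    void reload() {
        inMemory.clear();
        inMemory.addAll(rootEntries);
    }

    public static void main(String[] args) {
        CompactionReplay tablet = new CompactionReplay();
        tablet.addFile("A1");
        tablet.addFile("F2");
        tablet.majorCompact(Set.of("A1", "F2"), "A3", "F2");
        System.out.println("root tablet: " + tablet.rootEntries + ", in memory: " + tablet.inMemory);
        tablet.reload();
        System.out.println("after reload, in memory: " + tablet.inMemory);
    }
}
```

Before the reload the two sets disagree only on F2, which is what the consistency check reports; after the reload they agree again, but only because the tablet has picked F2 back up.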
Not sure if it's worth noting, but the creation of F2 and the compaction of F2 and A1 occurred in rapid succession. They had the same timestamp in the log.
I traced through 6 or 7 instances of this phenomenon and they all have the same pattern regardless of whether it's a metadata or non-metadata tablet that is affected.
- Files are added to the tablet via import, minc, or majc.
- The corresponding metadata tablet is flushed.
- A new file is added via import or minc and IMMEDIATELY (same timestamp) major compacted along with some of the files created in the first step.
- The tserver reports the inconsistency 2 to 15 minutes later
- The corresponding metadata tablet is flushed.
The inconsistent file is always the one added immediately before the major compaction, never any of the files added earlier. Every inconsistent file I've looked at follows this pattern; however, some files that follow this pattern do not become inconsistent.
#3721 is a possible cause for this, that was opened based on a chat with @hlgp.
Linking issues #4001 and #4047 which have been contributing to the in-mem diff issues as a result of bulk import operations.
Talked with @ivakegg. The in-mem diff log messages seem to have stopped appearing since #4001 and #4047 were merged into 2.1.3.
Closing this issue