Missing manifest file when releasing the Hive Metastore lock fails
HiveMetastoreTableOperations uses acquireLock and releaseLock while committing to an existing Iceberg table.
releaseLock happens in the finally block; that step can also fail, and that failure results in the missing manifest file.
Steps To Reproduce:
CREATE TABLE tbl (
    id integer,
    name varchar,
    part varchar
)
WITH (
    location = '/path',
    format = 'PARQUET'
);
INSERT INTO tbl VALUES (1,'product', 'part1');
If the lock-release call in HiveMetastoreTableOperations fails to execute at the Metastore level: it can fail because of a Metastore timeout, or, if the lock is missing on the HMS side, it can throw NoSuchLockException, which is a checked Exception and not a RuntimeException.
Then, after the failure, any insert/select or other query on this table will fail, and the table becomes unusable:
INSERT INTO tbl VALUES (2,'product', 'part1');
Query 20220829_214404_00026_j9tvf, FAILED, 3 nodes
Splits: 68 total, 67 done (98.53%)
0.47 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20220829_214404_00026_j9tvf failed: Failed to get status for file: //path/tbl/metadata/snap-5290494244823571353-1-34a427a1-054c-4ecf-a49b-f3e3ef9203a0.avro
The reason is that when the release-lock step fails, Iceberg's commit flow handles only RuntimeException, so in the case of a NoSuchLockException being thrown it does not perform any handling step.
So it looks like a new snapshot was created in the previous step and points to a new manifest file, but that file is missing/not created.
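For illustration, here is a minimal sketch of the commit flow described above; all of the names in it (acquireLock, commitToMetastore, releaseLock) are hypothetical stand-ins rather than the actual Trino or Iceberg APIs:

// Hypothetical, simplified sketch of the commit flow described above.
// None of these names are the real Trino/Iceberg APIs; they only illustrate the ordering problem.
public class CommitFlowSketch
{
    public void commit(String newMetadataLocation)
    {
        long lockId = acquireLock();                 // HMS table-level EXCLUSIVE lock
        try {
            commitToMetastore(newMetadataLocation);  // succeeds: HMS now points at the new metadata file
        }
        finally {
            // If this call fails (for example with NoSuchLockException from HMS), the exception
            // propagates even though the commit above already succeeded. The caller then treats
            // the commit as failed and cleans up the freshly written snapshot/manifest files,
            // leaving the committed table metadata pointing at files that no longer exist.
            releaseLock(lockId);
        }
    }

    private long acquireLock() { return 42L; }                   // placeholder
    private void commitToMetastore(String metadataLocation) {}   // placeholder
    private void releaseLock(long lockId) {}                     // placeholder
}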
related issues: https://github.com/trinodb/trino/issues/14104 https://github.com/trinodb/trino/issues/12581
Thanks for reporting @osscm! I have managed to reproduce this locally and have prepared a fix, for which I will open a PR shortly.
Incidentally, this same issue has already come up in the Iceberg hive-metastore module and was fixed there by a PR. Here's what the code fragment currently looks like.
I would propose keeping it simple and keeping it in sync with the Iceberg hive-metastore solution.
Please let me know what you think.
@osscm I see that the releaseTableLock() operation throws a TrinoException when dealing with a TException (NoSuchLockException extends TException).
@marton-bod the exception is suppressed in https://github.com/apache/iceberg/pull/987, but the table still remains locked, right?
@findinpath In the worst case scenario yes, it's possible that the table remains locked until the metastore housekeeper eventually runs and cleans out the stale lock (which could be up to 5 minutes IIRC). However, that's still preferable to throwing an exception and breaking the table, as it currently happens.
I see a couple options here, instead of throwing an exception:
- just log the error message, like HiveTableOperations does
- check what the exception type is. If it's NoSuchLockException, we can just log it and proceed. If it's an intermittent issue like not reaching the HMS, we could opt to retry the unlock call X times. However, this could make the commit operation take longer.
What do you think?
Yes, the table can still remain locked from the DB perspective; that's why I was also not sure whether just logging it would be OK.
I'm thinking one option is like this:
Step 1.
finally {
    try {
        thriftMetastore.releaseTableLock(lockId);
    }
    catch (Exception e) {
        // NoSuchLockException has already been retried (see step 2)
        // Any other exception will be swallowed, assuming the underlying metastore/db will clean up the acquired lock
        logger.warn("Cannot handle the failure in Iceberg's commit flow, so swallowing it", e);
    }
}
Step 2. Allow Trino to retry on NoSuchLockException by removing it at https://github.com/trinodb/trino/blob/master/plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/thrift/ThriftHiveMetastore.java#L1607-L1613
Not a DB expert, but here are some references on DB locking: MySQL: https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout Postgres: https://pgpedia.info/l/lock_timeout.html
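As an illustration of step 2, here is a hedged sketch of retrying the unlock when HMS reports NoSuchLockException; the LockReleaser interface is a hypothetical stand-in for the Thrift metastore client, not Trino's actual ThriftMetastore API:

import org.apache.hadoop.hive.metastore.api.NoSuchLockException;
import org.apache.thrift.TException;

public final class UnlockRetrySketch
{
    // Hypothetical stand-in for the metastore client; not Trino's real interface.
    public interface LockReleaser
    {
        void releaseTableLock(long lockId) throws TException;
    }

    // Retry the unlock a few times when HMS reports NoSuchLockException, instead of
    // treating the first occurrence as terminal (useful for the distributed-HMS case).
    public static void releaseWithRetry(LockReleaser metastore, long lockId, int maxAttempts)
            throws TException, InterruptedException
    {
        for (int attempt = 1; ; attempt++) {
            try {
                metastore.releaseTableLock(lockId);
                return;
            }
            catch (NoSuchLockException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up; per step 1 the caller may still just log this
                }
                Thread.sleep(1000L * attempt); // simple linear backoff before retrying
            }
        }
    }
}

Whether such a retry budget is worth the extra commit latency is exactly the trade-off discussed above.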
@osscm
Re step 1: I agree, that is in line with the error handling in HiveTableOperations and keeps things simple. Intermittent network issues are retried until we get a response from HMS. No exception should be thrown given that the commit succeeded.
Re step 2: Do you see any situation where NoSuchLockException could be intermittent? With a single-node HMS, I think if you get NoSuchLock once, you'll always get it for subsequent calls, but let me know if you've come across any corner cases in distributed setups. I'm just a bit worried that if in 99% of the cases we don't have such a corner case, then we are retrying fruitlessly and hurting commit performance for no real benefit.
Note: the locking here happens via HMS locks. Iceberg requests a table-level EXCLUSIVE lock from HMS (which blocks other commits to the same table from happening concurrently), and that's what needs to be released from HMS via its API.
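For context, here is a rough sketch of what acquiring and releasing such a table-level EXCLUSIVE lock looks like against the HMS Thrift API; this is a simplified illustration (no WAITING/checkLock polling, no error handling), not the code used by Trino or Iceberg:

import java.util.List;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.LockComponent;
import org.apache.hadoop.hive.metastore.api.LockLevel;
import org.apache.hadoop.hive.metastore.api.LockRequest;
import org.apache.hadoop.hive.metastore.api.LockResponse;
import org.apache.hadoop.hive.metastore.api.LockType;
import org.apache.thrift.TException;

public final class HmsTableLockSketch
{
    // Request a table-level EXCLUSIVE lock; real code also polls checkLock() while the response is WAITING.
    public static long lockTable(IMetaStoreClient client, String database, String table, String user, String host)
            throws TException
    {
        LockComponent component = new LockComponent(LockType.EXCLUSIVE, LockLevel.TABLE, database);
        component.setTablename(table);
        LockRequest request = new LockRequest(List.of(component), user, host);
        LockResponse response = client.lock(request);
        return response.getLockid();
    }

    // Releasing the lock can itself fail, e.g. with NoSuchLockException (a TException)
    // if HMS no longer knows about the lock, which is exactly the case discussed here.
    public static void unlockTable(IMetaStoreClient client, long lockId)
            throws TException
    {
        client.unlock(lockId);
    }
}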
I think the locking pattern in HiveMetastoreTableOperations follows what Iceberg Hive Catalog on Spark does.
@osscm do you know how the unlocking problem is handled there?
@findepi It looks like in the case of Spark, it uses the HiveTableOperations provided by Iceberg, and that code catches the Exception while unlocking and just logs a warning, that's it. This is not 100% foolproof.
https://github.com/apache/iceberg/blob/a7e67d09a7f05821cafdafba740e9efbdbdadf9c/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L704-L710
I think @marton-bod has also mentioned this path.
In Slack, there was a discussion about re-using Iceberg's HiveTableOperations in Trino as well.
Regarding whether NoSuchLockException could be intermittent: with a distributed HMS it might be possible in a few situations, where the lock is acquired through one client but released through a new client. The new client can connect to a different backend node of the HMS and get a NoSuchLockException intermittently, and the lock may only be found after a few retries.
Though I think it will be OK for now not to retry on NoSuchLockException, unless we see it happening quite frequently for different teams.
My understanding is that the lock can be released automatically if the session which acquired the lock is no longer active and some threshold time has passed.
@findepi @alexjo2144 @findinpath Could you please let us know if there are any concerns with following a similar approach of logging the exception, as in Spark, like @marton-bod and @osscm mentioned?
@findepi @alexjo2144 @findinpath If you get time, can you please share your thoughts on the approach we have discussed? Thanks! If we are OK with it, we can get a PR rolling as well.
After handling https://github.com/trinodb/trino/issues/14104, the logic now throws a CommitStateUnknownException when doing the replaceTable operation.
https://github.com/trinodb/trino/blob/15dd728c6d2f865bc9ece9672a32df9314b7a4e7/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/catalog/hms/HiveMetastoreTableOperations.java#L91-L96
So the operation which could potentially corrupt the table succeeded.
Regarding the lock release question: indeed, failing to unlock can produce the unwanted situation where the table gets corrupted as well, because the metadata file gets removed.
In the absence of a more fine-grained mechanism for handling this kind of failure in Iceberg, we probably need to swallow the exception in order to avoid removing a metadata file which is already committed to the metastore.
https://github.com/apache/iceberg/blob/b9bbcfb0e4d8f2605d3c6fb2543cefdd5d09524d/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L406-L414
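For illustration, a hedged sketch of the overall shape the fix could take in the commit path; every name below is a hypothetical stand-in, not the actual HiveMetastoreTableOperations code:

import io.airlift.log.Logger;

public class CommitWithSafeUnlockSketch
{
    private static final Logger log = Logger.get(CommitWithSafeUnlockSketch.class);

    // Illustrative only: the lock/commit helpers are hypothetical stand-ins.
    public void commitToExistingTable(String newMetadataLocation)
    {
        long lockId = acquireExclusiveTableLock();
        try {
            // Point the HMS table to the new metadata file. Once this returns, the commit
            // is durable and the new snapshot/manifest files must not be deleted anymore.
            updateMetastoreTable(newMetadataLocation);
        }
        finally {
            try {
                releaseTableLock(lockId);
            }
            catch (Exception e) {
                // Swallow and log: propagating here would make Iceberg treat the already
                // durable commit as failed and delete the new metadata/manifest files,
                // corrupting the table. The stale HMS lock eventually times out on its own.
                log.warn(e, "Failed to release table lock %s", lockId);
            }
        }
    }

    private long acquireExclusiveTableLock() { return 42L; }        // placeholder
    private void updateMetastoreTable(String metadataLocation) {}   // placeholder
    private void releaseTableLock(long lockId) {}                   // placeholder
}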
@marton-bod do you want to put up a PR in the near future?
cc @electrum
Thanks a lot for your input @findinpath! Let's go with this approach then. I'll let @osscm open a PR for this, hopefully he can put it up very shortly
This issue is rather critical because it renders the Iceberg table corrupt. Do note that this should be handled with a higher priority in order to have the fix available in the next Trino version.
Sure, that makes sense. Let me open a PR now to get the ball rolling