hadoop-lzo

lzo.index.tmp files not deleted

Open gszjulcsi opened this issue 11 years ago • 4 comments

We use the distributed LZO indexer on EMR (Hadoop version: 1.0.3), with files stored on Amazon S3.

Sometimes (observed twice so far) we had the following issue:

All lzo.index files are generated, but some of the lzo.index.tmp files are not deleted and cause problems when we process the directory with Pig. No exception or error is thrown during indexing, and the job is reported as having run successfully.

gszjulcsi • Jan 29 '14 09:01

We have not seen this in our self-hosted environment. It might be due to something EC2-specific. Do you have any theories about the root cause?

dvryaboy • Jan 29 '14 10:01

Meanwhile, we have noticed that these index.tmp files have disappeared. We suspect this was an S3 eventual consistency issue: it took S3 a very long time (ca. 7 hours) to become consistent.


gszjulcsi • Jan 29 '14 10:01

I see. Well, perhaps it would make sense to add a filter to the LZO input formats so they ignore these temp files and you don't get an error. Feel free to send a pull request with such a change; we will be happy to take a look.

dvryaboy • Jan 29 '14 11:01

Excluding .tmp files is a good fix.

There are other subtle issues with S3 because of these delays, e.g. https://github.com/kevinweil/elephant-bird/issues/309

rangadi • Jan 29 '14 16:01
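
The fix suggested in the thread, having the input formats skip leftover temp index files, could look roughly like the sketch below. It is only an illustration of the idea, not code from hadoop-lzo: the class name `TmpIndexFilterSketch`, its methods, and the example bucket paths are all hypothetical. It works on plain strings so it compiles without Hadoop on the classpath. In a real patch the same check would more likely live behind Hadoop's `PathFilter.accept(Path)` or inside the input format's file listing. The `.lzo.index.tmp` suffix is taken from the file names reported above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from hadoop-lzo itself.
class TmpIndexFilterSketch {
    // Suffix of the leftover temp files described in this thread.
    static final String TMP_INDEX_SUFFIX = ".lzo.index.tmp";

    // Same shape as Hadoop's PathFilter.accept(Path), but on plain strings
    // so the sketch does not need Hadoop to compile.
    static boolean accept(String path) {
        return !path.endsWith(TMP_INDEX_SUFFIX);
    }

    // Drops temp index files from a directory listing, preserving order.
    static List<String> filterInputs(List<String> listing) {
        List<String> kept = new ArrayList<>();
        for (String path : listing) {
            if (accept(path)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Example listing with one stale temp index (paths are made up).
        List<String> listing = Arrays.asList(
            "s3://example-bucket/logs/part-00000.lzo",
            "s3://example-bucket/logs/part-00000.lzo.index",
            "s3://example-bucket/logs/part-00000.lzo.index.tmp");
        // Prints the paths that survive the filter.
        System.out.println(filterInputs(listing));
    }
}
```

The sketch only covers the `.tmp` case. How the real input formats treat finished `.lzo.index` files is deliberately left out.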