
Implementing an opaque transactional HdfsState

Open · rangatdt opened this issue 10 years ago · 1 comment

Thanks much for this contrib.

First, could you confirm that the current HdfsState implementation is a non-transactional state, and so there is no guarantee that data is written to HDFS exactly once?

Second, I wanted your opinion on implementing an opaque transactional state for writes to HDFS:

A naive implementation that maintains the state of the file as of the previous batch separately from the current file will likely be expensive without support for file appends. For instance, in such an implementation, every batch of writes ends up in its own file, with no batching efficiencies for downstream consumers.

An alternative implementation could keep the file f and the previous batch b as two separate files, where f is always open while b is written afresh and closed for every batch. The name of the file storing b could itself be the "previous tx id". When the current write arrives with a different tx id, the execute() function reads b, appends its contents to f, and then deletes b. A new file named with the current tx id is created to store the current b.

At the time of rotation, b is read and appended to f, which is then rotated away. The file storing b is emptied rather than deleted, since its name carries the tx id that the next write will need.

When the current write arrives with the same tx id as the previous attempt, b is overwritten with the current batch's data; in particular, f is untouched.
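To make the protocol above concrete, here is a minimal sketch of the f/b scheme. It is illustrative only: it uses the local filesystem via java.nio in place of the HDFS FileSystem API, and the class and method names (OpaqueBatchWriter, writeBatch, rotate) are hypothetical, not part of storm-hdfs.

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch of the opaque-transactional write protocol:
// f is long-lived; b is one file per batch, named after the tx id.
public class OpaqueBatchWriter {
    private final Path dir;  // directory holding f and the batch file b
    private final Path f;    // the long-lived file

    public OpaqueBatchWriter(Path dir) throws IOException {
        this.dir = dir;
        this.f = dir.resolve("f");
        Files.createDirectories(dir);
        if (!Files.exists(f)) Files.createFile(f);
    }

    // Finds the batch file left by the previous attempt, if any.
    private Path findBatchFile() throws IOException {
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "b-*")) {
            for (Path p : ds) return p;
        }
        return null;
    }

    // Writes one batch under the given tx id.
    public void writeBatch(long txId, byte[] data) throws IOException {
        Path prev = findBatchFile();
        Path b = dir.resolve("b-" + txId);
        if (prev != null && !prev.equals(b)) {
            // New tx id: the previous batch committed, so fold it into f
            // and delete it before starting the new batch file.
            Files.write(f, Files.readAllBytes(prev), StandardOpenOption.APPEND);
            Files.delete(prev);
        }
        // Same tx id (a replay) simply overwrites b; f is untouched.
        Files.write(b, data,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING);
    }

    // Rotation: fold the pending batch into f, empty (not delete) b so its
    // name still carries the tx id, then rotate f away.
    public Path rotate() throws IOException {
        Path b = findBatchFile();
        if (b != null) {
            Files.write(f, Files.readAllBytes(b), StandardOpenOption.APPEND);
            Files.write(b, new byte[0], StandardOpenOption.TRUNCATE_EXISTING);
        }
        Path rotated = dir.resolve("f-rotated-" + System.nanoTime());
        Files.move(f, rotated);
        Files.createFile(f);
        return rotated;
    }
}
```

Note how the replay case (same tx id) never touches f, which is what makes the scheme safe to retry: only the arrival of a strictly newer tx id proves the previous batch committed and allows it to be folded in.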

Appreciate your feedback. Thanks, -Ranga

rangatdt · Jan 09 '15 06:01

In case someone else is interested in what came about from this, I have documented our findings and eventual approach over at http://blog.thedatateam.in/2015/02/guaranteeing-exactly-once-load.html

rangatdt · Feb 15 '15 11:02