glusterfs-hadoop

GlusterVolume modification to wrap getPos() functionality in underlying ...

Open jayunit100 opened this issue 12 years ago • 10 comments

This PullRequest wraps calls to getPos() and records the last value, returning that last value in the case that the underlying stream closes. This fix is essential for distributed MapReduce jobs that use the mapred.* API, because that API expects a working call to FSInputStream.getPos() even after the stream has been read. And because RawLocalFileSystem buffers that stream (unlike DFSInputStream, which doesn't wrap the stream), the buffer closes and sets the underlying stream to null. See BZ 1003626 for details.
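A minimal sketch of the idea, with hypothetical names (CachingPosInputStream, wrapped, lastPos are illustrative, not the patch's actual identifiers): cache the position on every successful getPos() call and fall back to the cached value once the underlying stream has been closed.

import java.io.IOException;

import org.apache.hadoop.fs.FSInputStream;

// Sketch only: cache the last known position so getPos() stays valid
// even after the wrapped stream has been closed and nulled out.
public class CachingPosInputStream extends FSInputStream {
    private final FSInputStream wrapped;
    private long lastPos = 0; // last position successfully read from the wrapped stream

    public CachingPosInputStream(FSInputStream wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public long getPos() throws IOException {
        try {
            lastPos = wrapped.getPos();
        } catch (IOException e) {
            // Stream already closed: fall through to the cached value.
        } catch (NullPointerException e) {
            // Buffered wrapper nulled its delegate on close(): same fallback.
        }
        return lastPos;
    }

    @Override
    public void seek(long pos) throws IOException {
        wrapped.seek(pos);
    }

    @Override
    public boolean seekToNewSource(long targetPos) throws IOException {
        return wrapped.seekToNewSource(targetPos);
    }

    @Override
    public int read() throws IOException {
        return wrapped.read();
    }
}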

jayunit100 avatar Sep 16 '13 01:09 jayunit100

A typical file I/O workflow goes: openFile(); fileIOOperations(); closeFile()

Any type of I/O after closeFile() is invalid. The GlusterVolume class is correctly throwing an exception if an attempt is made to access the file's stream position after it's been closed. There is no stream open, thus there is no valid position after the stream is closed, and calling getPos() on a closed stream means the open/read/close flow is broken further up the calling chain.
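As a hypothetical illustration of that broken flow (fs, the path argument, and the buffer size are placeholders), the objectionable pattern looks like this:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BrokenFlowExample {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path(args[0])); // openFile()
        in.read(new byte[4096]);                           // fileIOOperations()
        in.close();                                        // closeFile()
        long pos = in.getPos(); // invalid: the stream is closed, so throwing here is justified
        System.out.println(pos);
    }
}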

I don't like nerfing up the code by catching exceptions and not reporting them. The responsibility is on the caller NOT to attempt file I/O operations after it has closed the file or stream. If the caller attempts to do so, we're justified in throwing an exception.

I would like to see the open(), read(), close(), getPos() code that's exhibiting this incorrect I/O workflow before OKing a guard against the behavior at the file system level.

childsb avatar Sep 18 '13 17:09 childsb

To clarify Brad's comment: it is true that the bug can be thought of as existing in the RR implementation in the mapred version of MultiFileWordCount, and not necessarily as a bug in our shim. This patch allows you to call getPos() so that the behaviour is similar to a DFSClient stream, always returning a valid long. Without the patch, the version of MultiFileWordCount in hadoop-examples will not run. I'm in agreement that the overlying "bug" is in the older MultiFileWordCount impl, but nevertheless, the older mapred.* MultiFileWordCount impl does work on HDFS due to the way the DFSClient is implemented.

jayunit100 avatar Sep 18 '13 18:09 jayunit100

The offending code in the RR for MultiFileWordCount violates the access pattern; here is why (I agree it's not the right way to use a stream). In any case, here is the code path Brad requested to see:

By the way, some of this code is JDK dependent as well; see how streams get set to null when closed: http://grepcode.com/file_/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/io/BufferedReader.java/?v=source
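For reference, the close() implementation at that link is essentially the following (paraphrased from OpenJDK 6 java.io.BufferedReader, not copied verbatim):

// Paraphrase of OpenJDK 6 BufferedReader.close(): the wrapped Reader
// reference is dropped, so later delegating calls have nothing to call.
public void close() throws IOException {
    synchronized (lock) {
        if (in == null)
            return;
        in.close();
        in = null; // the underlying Reader reference is nulled here
        cb = null; // the internal character buffer is dropped too
    }
}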

  1. The RR calls getPos() BEFORE and AFTER reading a record, in the TrackingRecordReader implementation, which is in mapred.

  2. At the first call, before the underlying stream is closed, getPos() returns something valid. Then the stream is closed, and the implementation in the MultiFileWordCount custom RR returns the result of reading getPos() from the "currentStream".

  3. But since currentStream is actually wrapped by a buffered stream, which sets currentStream to null when it's closed, the 2nd call to getPos() fails with an NPE (a paraphrased sketch follows).
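Here is a paraphrased sketch of that before/after pattern from step 1 (this is not the verbatim hadoop 1.x TrackingRecordReader; rawIn and inputByteCounter are stand-in names for its internals):

// Paraphrased fragment: the mapred TrackingRecordReader brackets every
// record read with getPos() calls to count input bytes, so getPos() runs
// again AFTER the read that exhausts and closes the underlying stream.
public synchronized boolean next(K key, V value) throws IOException {
    long beforePos = getPos();            // stream still open: valid offset
    boolean ret = rawIn.next(key, value); // may close the underlying stream
    long afterPos = getPos();             // delegates into the now-closed stream
    inputByteCounter.increment(afterPos - beforePos);
    return ret;
}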

So the code path is:

1. In MapTask in the mapred.* framework (hadoop 1.x), TrackingRecordReader calls MultiFileWordCountRecordReader.getPos().

2. In the MultiFileWordCount MapReduce job (in hadoop-examples for 1.x), MultiFileWordCountRecordReader.getPos() calls currentStream.getPos():

public long getPos() throws IOException {
    long currentOffset = currentStream == null ? 0 : currentStream.getPos();
    return offset + currentOffset;
}

and currentStream has already been set to null by the overlying BufferedReader, since currentReader is closed and currentReader is an instance of BufferedReader:

3. Why is the NPE thrown? Because of the way BufferedReader works. Finally, from src/examples/org/apache/hadoop/examples/MultiFileWordCount.java:

currentReader = new BufferedReader(new InputStreamReader(currentStream));
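A standalone demonstration of this failure mode (PosTrackingBuffer and delegatePos are invented names for illustration): a buffered wrapper nulls its delegate on close(), so any method that reaches through to the delegate afterwards hits an NPE.

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class ClosedBufferDemo {
    // BufferedInputStream.close() sets the inherited FilterInputStream.in
    // field to null, mirroring how the wrapper in this bug drops its
    // delegate; delegatePos() is a hypothetical getPos-style method.
    static class PosTrackingBuffer extends BufferedInputStream {
        PosTrackingBuffer(ByteArrayInputStream in) {
            super(in);
        }

        long delegatePos() throws IOException {
            return in.available(); // NPE once close() has nulled 'in'
        }
    }

    public static void main(String[] args) throws IOException {
        PosTrackingBuffer buf =
            new PosTrackingBuffer(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        buf.read();
        System.out.println(buf.delegatePos()); // fine while the buffer is open
        buf.close();                           // nulls the delegate internally
        System.out.println(buf.delegatePos()); // NullPointerException, as in the bug
    }
}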

jayunit100 avatar Sep 18 '13 18:09 jayunit100

Will you walk through the error/code in TrackingRecordReader by line number?

childsb avatar Sep 18 '13 19:09 childsb

From ./src/examples/org/apache/hadoop/examples/MultiFileWordCount.java in hadoop 1.2:

151 public long getPos() throws IOException {
152     long currentOffset = currentStream == null ? 0 : currentStream.getPos();
153     return offset + currentOffset;
154 }

The error is thrown at getPos(), because currentStream is not null, but the InputStreamReader is null. Makes sense?

jayunit100 avatar Oct 07 '13 16:10 jayunit100

Per the code you pasted in, if the stream is null, "0" is assumed; if it's not, then the position is queried.

This shows that MultiFileWordCount is already guarding somewhat so that a position is returned even if a stream doesn't exist, so I don't see why the GlusterVolume code should be doing the same thing.

childsb avatar Oct 07 '13 17:10 childsb

Alas, you're right! The problem is that currentStream is not null.
However, the InputStreamReader IS set to null.
That is the problem :)!

jayunit100 avatar Oct 07 '13 17:10 jayunit100

Does it make sense now? The currentStream still exists, but the overlying wrapper (the buffer) SETS the InputStreamReader to null.

Thus, when the InputStreamReader is accessed, an NPE is thrown. But the logic in this method only checks the currentStream.

This is NOT a problem in HDFS, because in HDFS the underlying stream doesn't proxy calls to getPos().
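For contrast, a DFSClient-style stream keeps its own position counter, so getPos() never dereferences a delegate that close() may have nulled out (paraphrased, not the actual DFSInputStream source):

// Paraphrased fragment: position is a plain field updated on every
// read/seek, so getPos() returns a valid long even after close().
public synchronized long getPos() throws IOException {
    return pos;
}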

jayunit100 avatar Oct 07 '13 17:10 jayunit100

https://issues.apache.org/jira/browse/MAPREDUCE-5572 <--- corresponding JIRA pointing to the root cause, which is a custom RecordReader that relies on undefined FileSystem semantics

jayunit100 avatar Oct 07 '13 22:10 jayunit100

https://bugzilla.redhat.com/show_bug.cgi?id=1003626

childsb avatar Oct 09 '13 19:10 childsb