hdfs icon indicating copy to clipboard operation
hdfs copied to clipboard

Add an option for determine read/write timeout for namenodes

Open kimtree opened this issue 4 years ago • 10 comments

In most cases for the datanodes, the user can set read/write timeout by using SetDeadline to the Readers. But for namenodes, we can only use a single DialContext when initializing the namenode connection.

The issue that I faced was, I was able to connect to the namenode without any timeout issues, also works fine for write function. But, we had an issue that namenode can't respond to the request. At this point, the client hangs. ref: https://github.com/colinmarc/hdfs/blob/master/internal/rpc/namenode.go#L209

I thought It would be great if we add options for determining read/write timeout value for the namenodes to prevent hanging when the namenode isn't available for responding.

Please let me know if I misunderstood the concept :)

kimtree avatar Sep 14 '20 11:09 kimtree

Hi @kimtree, thanks for the suggestion.

I think this may be the simplest way to solve the problem, but I've also been considering adding context support for some methods, eg RenameContext(ctx context.Context, oldpath, newpath string), with support for cancellation. Would that work in your case?

colinmarc avatar Sep 14 '20 12:09 colinmarc

Hi @colinmarc, It's good to hear that you've been considering adding context support! I also made context support by wrapping the library function in our project like below.

func open(ctx context.Context, client hdfs.Client, path string) (*FileReader, error) {
reader, err := client.Open(path)
if err != nil {
	return nil, err
}

if deadline, ok := ctx.Deadline(); ok {
	if err := reader.SetDeadline(deadline); err != nil {
		return nil, fmt.Errorf("failed to set deadline: %w", err)
	}
}

return reader, nil
}

This way of using context is not complete since client.Open function is out of control. Thus, our problem occurs when executing c.namenode.Execute from client.Open function. Are you going to add context support when calling c.namenode.Execute function? If so, It would solve our problem!

kimtree avatar Sep 14 '20 12:09 kimtree

Exactly. I would change the signature of Execute to take a context.

colinmarc avatar Sep 14 '20 13:09 colinmarc

Yes, That's a good idea. So when do you plan to add this feature?

kimtree avatar Sep 14 '20 13:09 kimtree

Currently my thinking would be to add RenameContext etc for a point release, and then change the old Rename etc for a 3.0 release. I'm not working on context support right now, and therefore can't give you an estimate on when it may land. If you'd like to take a stab at it, that would be more than welcome!

colinmarc avatar Sep 14 '20 13:09 colinmarc

This way of using context is not complete since client.Open function is out of control. Thus, our problem occurs when executing c.namenode.Execute from client.Open function. Are you going to add context support when calling c.namenode.Execute function? If so, It would solve our problem!

That said, this example illustrates where it would be tricky. The call to Execute (for getBlocks) actually happens lazily, in Read. We could move that up, but that might introduce subtle changes in behaviour.

colinmarc avatar Sep 14 '20 13:09 colinmarc

Yeah, that's true. The reason why I specified read/write timeout especially for the datanode is to handle failure from the namenode, not lazily loaded Read or the action afterward. To the users, It might be confused since the timeout occurs both namenode and datanode and couldn't understand the concept of how timeout works at a glance. (Timeout includes actions both namenode and datanode or the only actions for each namenode and datanode)

I'm not sure my approach is still acceptable in the meantime waiting for v3, Is it possible to accept this PR and release a minor version for v2?

kimtree avatar Sep 14 '20 13:09 kimtree

The reason why I specified read/write timeout especially for the datanode is to handle failure from the namenode, not lazily loaded Read or the action afterward.

Hm, maybe Open is actually one of the functions that doesn't need a context parameter. We could just use the read deadline for the getBlocks. Or is that too confusing?

colinmarc avatar Sep 14 '20 13:09 colinmarc

My problem wasn't from the FileReader since we can set deadline when reading the data later. In my case, our namenode failures when executing info, err := c.getFileInfo(name) from Open function and that causes hang while reading the response from the namenode. So, It'll also block the later operations like creating FileReader. That's why I add an option for read/write timeout for namenode specifically.

kimtree avatar Sep 14 '20 13:09 kimtree

Exactly. I would change the signature of Execute to take a context.

really excited about that.

lnn1988 avatar Feb 25 '22 08:02 lnn1988