hdfs
hdfs copied to clipboard
Add an option for determine read/write timeout for namenodes
In most cases for the datanodes, the user can set read/write timeout by using SetDeadline to the Readers. But for namenodes, we can only use a single DialContext when initializing the namenode connection.
The issue that I faced was, I was able to connect to the namenode without any timeout issues, also works fine for write function. But, we had an issue that namenode can't respond to the request. At this point, the client hangs.
ref: https://github.com/colinmarc/hdfs/blob/master/internal/rpc/namenode.go#L209
I thought It would be great if we add options for determining read/write timeout value for the namenodes to prevent hanging when the namenode isn't available for responding.
Please let me know if I misunderstood the concept :)
Hi @kimtree, thanks for the suggestion.
I think this may be the simplest way to solve the problem, but I've also been considering adding context support for some methods, eg RenameContext(ctx context.Context, oldpath, newpath string), with support for cancellation. Would that work in your case?
Hi @colinmarc, It's good to hear that you've been considering adding context support! I also made context support by wrapping the library function in our project like below.
func open(ctx context.Context, client hdfs.Client, path string) (*FileReader, error) {
reader, err := client.Open(path)
if err != nil {
return nil, err
}
if deadline, ok := ctx.Deadline(); ok {
if err := reader.SetDeadline(deadline); err != nil {
return nil, fmt.Errorf("failed to set deadline: %w", err)
}
}
return reader, nil
}
This way of using context is not complete since client.Open function is out of control. Thus, our problem occurs when executing c.namenode.Execute from client.Open function. Are you going to add context support when calling c.namenode.Execute function? If so, It would solve our problem!
Exactly. I would change the signature of Execute to take a context.
Yes, That's a good idea. So when do you plan to add this feature?
Currently my thinking would be to add RenameContext etc for a point release, and then change the old Rename etc for a 3.0 release. I'm not working on context support right now, and therefore can't give you an estimate on when it may land. If you'd like to take a stab at it, that would be more than welcome!
This way of using
contextis not complete sinceclient.Openfunction is out of control. Thus, our problem occurs when executingc.namenode.Executefromclient.Openfunction. Are you going to addcontextsupport when callingc.namenode.Executefunction? If so, It would solve our problem!
That said, this example illustrates where it would be tricky. The call to Execute (for getBlocks) actually happens lazily, in Read. We could move that up, but that might introduce subtle changes in behaviour.
Yeah, that's true. The reason why I specified read/write timeout especially for the datanode is to handle failure from the namenode, not lazily loaded Read or the action afterward. To the users, It might be confused since the timeout occurs both namenode and datanode and couldn't understand the concept of how timeout works at a glance. (Timeout includes actions both namenode and datanode or the only actions for each namenode and datanode)
I'm not sure my approach is still acceptable in the meantime waiting for v3, Is it possible to accept this PR and release a minor version for v2?
The reason why I specified read/write timeout especially for the datanode is to handle failure from the namenode, not lazily loaded Read or the action afterward.
Hm, maybe Open is actually one of the functions that doesn't need a context parameter. We could just use the read deadline for the getBlocks. Or is that too confusing?
My problem wasn't from the FileReader since we can set deadline when reading the data later. In my case, our namenode failures when executing info, err := c.getFileInfo(name) from Open function and that causes hang while reading the response from the namenode. So, It'll also block the later operations like creating FileReader. That's why I add an option for read/write timeout for namenode specifically.
Exactly. I would change the signature of
Executeto take a context.
really excited about that.