When calling nfs_read or nfs_write and passing in a `count` that is too large, the `count` value is quietly changed
Hi Ronnie,
I was using libnfs 4.0.0 and recently wanted to switch to the latest version to see how much zero-copy improves performance. I wrote a simple test program that writes 1 GiB of data to the NFS server. This program works fine with libnfs 4.0.0, but not with master.
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <nfsc/libnfs.h>

#define BUF_SIZE (1024 * 1024)

int main(void)
{
    struct nfs_context *nfs = nfs_init_context();
    if (nfs == NULL)
    {
        printf("failed to init context\n");
        return 1;
    }
    int ret = nfs_mount(nfs, "192.168.3.81", "/kjl");
    if (ret != 0)
    {
        printf("Failed to mount nfs share: %s\n", nfs_get_error(nfs));
        return 1;
    }
    char buf[BUF_SIZE];
    memset(buf, 1, BUF_SIZE);
    struct nfsfh *fd;
    ret = nfs_open(nfs, "file1", O_CREAT | O_TRUNC | O_RDWR, &fd);
    if (ret != 0)
    {
        printf("Failed to open file: %s\n", nfs_get_error(nfs));
        return 1;
    }
    int i;
    for (i = 1; i <= 1024; i++) /* write 1 GiB in 1 MiB chunks */
    {
        ret = nfs_write(nfs, fd, buf, BUF_SIZE);       /* master argument order */
        /* ret = nfs_write(nfs, fd, BUF_SIZE, buf); */ /* libnfs 4.0.0 argument order */
        if (ret != BUF_SIZE)
        {
            printf("Failed to write data: expected to write %d, but actually wrote %d.\n", BUF_SIZE, ret);
            return 1;
        }
    }
    nfs_close(nfs, fd);
    nfs_destroy_context(nfs);
    return 0;
}
Result when running against master: Failed to write data: expected to write 1048576, but actually wrote 524288.
I checked the master branch code and found that when count is greater than writemax, count is silently clamped to writemax; in my environment writemax happens to be 524288.
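In effect, the master code does something like this (my paraphrase of the behaviour, not the exact library source):

/* Paraphrase of the clamping effect described above (not the exact source):
 * a request larger than the negotiated writemax is silently reduced to
 * writemax before the single WRITE PDU is sent. */
size_t writemax = nfs_get_writemax(nfs); /* 524288 in my environment */
if (count > writemax)
{
    count = writemax; /* silently clamped; the caller is never told */
}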
The caller does not know that count has been changed inside nfs_write. When the caller checks the return value and finds that fewer bytes were written than expected, it is left confused, as I was. The README only says that a single NFS WRITE request is limited; it does not mention any limit on nfs_write itself.
The captured packet: (screenshot omitted)
I think using nfs_write should be like using the Linux write system call: when nfs_write is called with a count that is too large, multiple NFS WRITE requests should be sent inside the function, with no single WRITE request exceeding writemax. This would be transparent to the caller, and I see that libnfs 4.0.0 did it this way. What do you think?
Yeah, that was removed since it caused more problems than it solved. A big issue with that implementation was that, for performance reasons, it tried to issue all the individual requests concurrently. That was a huge mis-design: it did not guarantee that the bytes would be accessed sequentially (tripping up the sequential read/write detector in the server), and it could result in not just short I/O but I/O ranges with holes in them :-(
For example, what should be returned to the application if only some of the data could be written? Say a write gets split into (1) bytes 0-999, (2) bytes 1000-1999, and (3) bytes 2000-2999, and only 1 and 3 are written but 2 could not be.
So instead it only does one PDU right now, and it sometimes truncates it to a short I/O in the library (and at other times it might be truncated to a short I/O by the server).
Just like read(2), an application must check how many bytes were actually read and handle a short read when one happens.
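In other words, the caller handles short reads with a loop along these lines (a minimal sketch, assuming master's nfs_read(nfs, fh, buf, count) argument order; application_read is a made-up helper name):

/* Sketch: keep reading until the requested count is satisfied,
 * EOF is reached, or an error occurs. */
int application_read(struct nfs_context *nfs, struct nfsfh *nfsfh, void *buffer, size_t count)
{
    size_t done = 0;
    while (done < count)
    {
        int r = nfs_read(nfs, nfsfh, (char *)buffer + done, count - done);
        if (r < 0) return r; /* error */
        if (r == 0) break;   /* end of file: short read */
        done += (size_t)r;
    }
    return (int)done;
}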
What you said makes sense, but it also raises a problem: the application can no longer issue I/O concurrently, which causes its own performance problems.
The application also has to wrap the read and write functions itself:
int application_write(struct nfs_context *nfs, struct nfsfh *nfsfh, size_t count, void *buffer)
{
    size_t write_max = nfs_get_writemax(nfs); /* writemax, not readmax: this is the write path */
    int written_size = 0;
    while (count > 0)
    {
        size_t write_count = count;
        if (write_count > write_max) write_count = write_max;
        int r = nfs_write(nfs, nfsfh, (char *)buffer + written_size, write_count);
        if (r < 0) return r; /* error occurred */
        if (r == 0) break;   /* no progress; stop instead of spinning */
        written_size += r;
        count -= (size_t)r;
    }
    return written_size;
}
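With that wrapper, the write loop in the test program above becomes (hypothetical usage):

ret = application_write(nfs, fd, BUF_SIZE, buf); /* hypothetical helper from above */
if (ret != BUF_SIZE)
{
    printf("short write or error: %d\n", ret);
    return 1;
}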
Or could you solve it by writing non-concurrently inside nfs_write? That doesn't seem like a perfect solution either; it's a real headache.
I can try to add a wrapper that does large I/O by sending the PDUs one at a time, until either the whole requested size has been fulfilled or it aborts with a short read/write on the first error.
I can try to add that next weekend.
Thank you
Leaving this open until I finish adding it, else I forget.
You can try current master now. nfs_read and nfs_pread should now try to read all the data the app requested.
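A quick way to verify (a sketch; the size is a placeholder, and it assumes master's nfs_pread(nfs, fh, buf, count, offset) argument order):

/* Sketch: request more than readmax in one call and check the return count. */
#define CHECK_SIZE (4 * 1024 * 1024) /* placeholder size, larger than readmax */
static char check_buf[CHECK_SIZE];
int r = nfs_pread(nfs, fd, check_buf, CHECK_SIZE, 0);
printf("requested %d, got %d\n", CHECK_SIZE, r);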
Okay, I'll try it over the weekend and let you know the result.
I tested nfs_read and nfs_pread with large IO and they are ok. Thank you.
I would like to ask whether the write functions will also support large I/O later on.