noobaa-core
Spark runs leave nsfs object "busy"
Environment info
- NooBaa Version: master 8/10 (includes issue 6698)
- Platform: Kubernetes
Actual behavior
- Running the Spark query spark.sql("select count(*) from web_returns where wr_return_ship_cost > 2.0") leaves the backing file busy (open file descriptors in the endpoint pods) after the query completes
Expected behavior
- The file is not busy after Spark finishes
Steps to reproduce
- See "More information" below - possibly caused by aborted reads?
More information - Screenshots / Logs / Other output
I am running Spark, which reads a > 128 MB object as a series of ranged reads. I have narrowed my test down to one file (instead of all 25) and a dead-simple query (select count(*) from web_returns where ...), the where clause being there to force a bypass of any indexes so the whole file has to be read. After the run is complete, lsof shows a bunch of entries like the following (I ran the lsof check again 10 minutes later and got the same values):
COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME
node 3253532 root 35r REG 0,48 1053529104 1056781 /nsfs/fs10/gkafmtest/spark-s100/tpcds/web_returns.dat
The full summary shows this
cl1 - EP wr2fq - read_object 14 abort 14 lsof 154 (154/14 = 11)
cl2 - EP 5vllk - read_object 11 abort 9 lsof 99 ( 99/ 9 = 11)
The first line means that on node cl1 the endpoint ending in wr2fq issued 14 read_objects and 14 aborts, and had 154 lsof entries for the corresponding file; the second line is the same for the other endpoint.
What is really interesting is that for both endpoints the lsof entry count is exactly 11 x the number of aborts issued, so I assume there is some correlation there.
For completeness, that summary data was collected like this (the NooBaa endpoint logs were redirected to noobaa-<endpointname>.log):
[root@clgk-noob-scl-1 ~]# grep -c read_object noobaa*.log
noobaa-endpoint-9cc895c5f-5vllk.log:11
noobaa-endpoint-9cc895c5f-wr2fq.log:14
noobaa-endpoint-lsof-2.log:0
noobaa-endpoint-lsof.log:0
[root@clgk-noob-scl-1 ~]# grep -c abort noobaa*.log
noobaa-endpoint-9cc895c5f-5vllk.log:9
noobaa-endpoint-9cc895c5f-wr2fq.log:14
noobaa-endpoint-lsof-2.log:0
noobaa-endpoint-lsof.log:0
for i in clgk-noob-scl-1 clgk-noob-scl-2 clgk-noob-scl-3 clgk-afm-wk ; do echo $i; ssh $i lsof 2>&1 | grep -E "web_returns" | grep -v "no pwd" | wc -l; echo; done
I rarely see read_object_stream completions. I think this is a race condition between the 'aborted' event and the read_stream = await assignment: at the time of the abort, read_stream is still undefined, so read_stream.close() is not called.
req.on('aborted', () => {
dbg.log0('request aborted:', req.path);
if (read_stream && read_stream.close) read_stream.close();
});
res.on('error', err => {
dbg.log0('response error:', err, req.path);
if (read_stream && read_stream.close) read_stream.close();
});
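// NOTE (illustrating the race described above): if the 'aborted' event fires
// before the await on the next line resolves, read_stream is still undefined
// in the handlers above, so the underlying file is never closed.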
read_stream = await req.object_sdk.read_object_stream(params, res);
if (read_stream) {
read_stream.on('error', err => {
dbg.log0('read stream error:', err, req.path);
res.destroy(err);
});
read_stream.pipe(res);
}
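For discussion, here is a minimal sketch of one way to avoid that race (not a proposed patch; it reuses the variable names and handler shape from the snippet above, which is an assumption about the surrounding code): remember that the request was aborted, and close the stream once the await resolves.
// Sketch only: the same async handler shape as the snippet above is assumed.
let aborted = false;
req.on('aborted', () => {
    dbg.log0('request aborted:', req.path);
    aborted = true;
    if (read_stream && read_stream.close) read_stream.close();
});
read_stream = await req.object_sdk.read_object_stream(params, res);
if (read_stream) {
    if (aborted) {
        // The abort fired while we were still awaiting; close the late stream now.
        if (read_stream.close) read_stream.close();
    } else {
        read_stream.on('error', err => {
            dbg.log0('read stream error:', err, req.path);
            res.destroy(err);
        });
        read_stream.pipe(res);
    }
}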
Spark apparently issues requests in a rather rude way - one that triggers this issue (next post)
Spark will break the processing of a large enough object into manageable chunks and issue ranged reads as shown below
Aug-11 1:32:22.448 [Endpoint/7] [L0] core.sdk.namespace_fs:: NamespaceFS: read_object_stream { file_path: '/nsfs/fs10/gkafmtest/spark-s100/tpcds/web_returns.dat', start: 881436160, end: 1053529104 }
Spark in this case is probably interested in reading from 881436160 to 925507968 bytes. Once that is satisfied, I’m guessing the worker “hangs up” (closes the connection?)
Here is a plot of the read_object_stream requests from Spark. Note they all end at 1053529104 bytes (EOF)
To be fair, it might not be Spark, it might be Spark + HDFS driver (which Spark uses to access s3).
I don't know if this is helpful or not. This is a plot of logged times (when reads were logged and when aborts were logged). Since there is no way to associate an abort with its read_object_stream, the vertical pairs of points are not linked to each other. It is likely, however, that each batch of 8 points is linked with the points above it (Spark is issuing 24 jobs to 8 workers at a time). Note this whole job is done in 3-4 seconds. It also appears that Spark only reads 44,071,808 bytes per request, even though one request is issued for around 1 GB (start=0, end=EOF; entry 16 in the SQL Select plot). That is consistent with the response deltas being flat for each group of 8 requests, so this graph also supports the "read 44MB and disconnect" theory.
It also implies "not all reads were aborted globally at one time"
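For anyone who wants to mimic that access pattern without a full Spark setup, here is a minimal Node.js sketch (my own illustration, not the author's reproduction; the host, port, bucket and key are placeholders, and a real S3 client would also have to sign the request) that issues an open-ended ranged GET and hangs up after roughly 44 MB, which is the behavior described above.
// Illustration only: a ranged GET that is abandoned before the body is fully read.
const http = require('http');
const req = http.get({
    host: 'noobaa-endpoint.example',            // placeholder endpoint address
    port: 6001,                                  // placeholder port
    path: '/mybucket/tpcds/web_returns.dat',     // placeholder bucket/key
    headers: { Range: 'bytes=881436160-' },      // open-ended range, like the logged read
}, res => {
    let received = 0;
    res.on('data', chunk => {
        received += chunk.length;
        // Hang up once ~44 MB have arrived, mimicking "read 44MB and disconnect".
        if (received >= 44 * 1024 * 1024) req.destroy();
    });
});
req.on('error', () => { /* expected after destroy() */ });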
Hi @nimrod-becker, this defect was never screened. Can we have it screened, at least to know its impact? Our test environments aren't using a Spark workload, so such issues (if still present) would only be seen directly at the customer end.
I think this should be deferred. The underlying issue was AFM compatibility: if the object is busy, it will mess up AFM eviction, so this should be looked at when AFM is supported.
Hi @romayalon, please mark it for Enhancements. Thanks