Clean up the alluxioworker directory when an AlluxioRuntime is deleted
Problem description:
After running Alluxio once with JNRFuse and then switching to JNIFuse, a data loss error occurred.
The error observed on the application side is:
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, 2 root error(s) found.
  (0) Data loss: truncated record at 142388333
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[input_processing/IteratorGetNext]]
     [[cluster_5_1/merge_oidx_1/_2655]]
  (1) Data loss: truncated record at 142388333
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[input_processing/IteratorGetNext]]
0 successful operations.
0 derived errors ignored.
I0812 06:10:41.991964 140419573827392 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, 2 root error(s) found.
  (0) Data loss: truncated record at 142388333
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[input_processing/IteratorGetNext]]
     [[cluster_5_1/merge_oidx_1/_2655]]
  (1) Data loss: truncated record at 142388333
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[input_processing/IteratorGetNext]]
0 successful operations.
0 derived errors ignored.
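The alluxio-fuse log, however, shows the following: block ID 3489660931 is expected to be 16 MB (16777216 bytes), but only 7 MB (7341943 bytes) are actually available.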
2020-08-12 06:07:19,191 ERROR AlluxioJniFuseFileSystem - Failed to read /imagenet/train/train-00123-of-01024,131072,57671680:
java.lang.IllegalStateException: Block 3489660931 is expected to be 16777216 bytes, but only 7341943 bytes are available. Please ensure its metadata is consistent between Alluxio and UFS.
    at com.google.common.base.Preconditions.checkState(Preconditions.java:842)
    at alluxio.client.block.stream.BlockInStream.readInternal(BlockInStream.java:275)
    at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:264)
    at alluxio.client.file.AlluxioFileInStream.read(AlluxioFileInStream.java:187)
    at alluxio.fuse.AlluxioJniFuseFileSystem.readInternal(AlluxioJniFuseFileSystem.java:326)
    at alluxio.fuse.AlluxioJniFuseFileSystem.lambda$read$4(AlluxioJniFuseFileSystem.java:298)
    at alluxio.fuse.AlluxioFuseUtils.call(AlluxioFuseUtils.java:245)
    at alluxio.fuse.AlluxioJniFuseFileSystem.read(AlluxioJniFuseFileSystem.java:298)
    at alluxio.jnifuse.AbstractFuseFileSystem.readCallback(AbstractFuseFileSystem.java:150)
2020-08-12 06:07:19,196 ERROR AlluxioJniFuseFileSystem - Failed to read /imagenet/train/train-00123-of-01024,4096,57671680:
java.lang.IllegalStateException: Block 3489660931 is expected to be 16777216 bytes, but only 7341943 bytes are available. Please ensure its metadata is consistent between Alluxio and UFS.
    at com.google.common.base.Preconditions.checkState(Preconditions.java:842)
    at alluxio.client.block.stream.BlockInStream.readInternal(BlockInStream.java:275)
    at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:264)
    at alluxio.client.file.AlluxioFileInStream.read(AlluxioFileInStream.java:187)
    at alluxio.fuse.AlluxioJniFuseFileSystem.readInternal(AlluxioJniFuseFileSystem.java:326)
    at alluxio.fuse.AlluxioJniFuseFileSystem.lambda$read$4(AlluxioJniFuseFileSystem.java:298)
    at alluxio.fuse.AlluxioFuseUtils.call(AlluxioFuseUtils.java:245)
    at alluxio.fuse.AlluxioJniFuseFileSystem.read(AlluxioJniFuseFileSystem.java:298)
    at alluxio.jnifuse.AbstractFuseFileSystem.readCallback(AbstractFuseFileSystem.java:150)
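To make the size mismatch in the log above easier to read, here is a minimal Python sketch that pulls the block ID and the two byte counts out of the `IllegalStateException` message and converts them to MiB. The function names `parse_block_error` and `to_mib` are hypothetical helpers written for this illustration; they are not part of Alluxio.

```python
import re

# Pattern for the IllegalStateException message seen in the alluxio-fuse log.
BLOCK_ERROR = re.compile(
    r"Block (?P<block_id>\d+) is expected to be (?P<expected>\d+) bytes, "
    r"but only (?P<available>\d+) bytes are available"
)


def parse_block_error(line):
    """Return (block_id, expected_bytes, available_bytes), or None if no match."""
    m = BLOCK_ERROR.search(line)
    if m is None:
        return None
    return int(m["block_id"]), int(m["expected"]), int(m["available"])


def to_mib(num_bytes):
    """Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    line = (
        "java.lang.IllegalStateException: Block 3489660931 is expected to be "
        "16777216 bytes, but only 7341943 bytes are available."
    )
    block_id, expected, available = parse_block_error(line)
    print(
        f"block {block_id}: expected {to_mib(expected):.2f} MiB, "
        f"available {to_mib(available):.2f} MiB, "
        f"missing {expected - available} bytes"
    )
```

For the message in this report, the expected size is exactly 16 MiB and the block is 9435273 bytes short.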
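After logging in to the corresponding node, I found:
1. The block file is indeed only 7.1 MB.
2. I suspect the block was created earlier, during the JNRFuse run, and is incomplete.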
[root@iZuf68sywkiky95veylv1yZ alluxio]# cd alluxioworker/
[root@iZuf68sywkiky95veylv1yZ alluxioworker]# ls -ltr |grep 3489660931
-rwxrwxrwx 1 root root 7341943 8月 12 11:37 3489660931
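The manual check above can also be scripted. The sketch below compares the on-disk size of a cached block file with the size reported in the error message. It assumes, based only on the `ls` output in this report, that each block is stored as a file named after its numeric block ID directly inside the worker directory. `check_block_file` is a hypothetical helper, not an Alluxio API.

```python
import os


def check_block_file(worker_dir, block_id, expected_bytes):
    """Compare a cached block file's size on disk with the expected size.

    Returns (actual_bytes, is_complete). actual_bytes is None when the
    file does not exist.
    """
    # Assumption: the block file is named after its numeric block ID.
    path = os.path.join(worker_dir, str(block_id))
    try:
        actual = os.stat(path).st_size
    except FileNotFoundError:
        return None, False
    return actual, actual == expected_bytes
```

On the node from this report, `check_block_file("alluxioworker", 3489660931, 16777216)` should report 7341943 bytes and an incomplete block, matching the `ls` output. Note that the last block of a file can legitimately be smaller than the configured block size, so the comparison is only meaningful against the size Alluxio itself expects for that block.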
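Questions:
1. Under what circumstances can a block be written incompletely? Storage space is not currently exhausted.
2. Does deploying a new Alluxio cluster delete the cached block files left by the previous one?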
Is there any update on this issue? After deleting the AlluxioRuntime, the cache directory does not appear to be cleared.
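Until the runtime cleans up after itself, one possible manual workaround is to remove block files left over from an earlier deployment before the new worker starts. The Python sketch below is only an illustration under stated assumptions: block files are named by numeric block ID (as in the `ls` output earlier in this thread), "stale" means last modified before a cutoff timestamp that you choose (for example, the time the new runtime was created), and no Alluxio worker is using the directory while it runs. `remove_stale_blocks` is a hypothetical helper, and it defaults to a dry run.

```python
import os


def remove_stale_blocks(worker_dir, cutoff_ts, dry_run=True):
    """Find block files in worker_dir last modified before cutoff_ts.

    Returns the sorted list of matching file names. Files are deleted
    only when dry_run is False.
    """
    stale = []
    with os.scandir(worker_dir) as entries:
        for entry in entries:
            # Skip directories, symlinks and anything that is not a plain file.
            if not entry.is_file(follow_symlinks=False):
                continue
            # Assumption: cached block files are named by numeric block ID.
            if not entry.name.isdigit():
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff_ts:
                stale.append((entry.name, entry.path))

    # Delete only after the directory scan has finished.
    if not dry_run:
        for _, path in stale:
            os.remove(path)
    return sorted(name for name, _ in stale)
```

Running it first with `dry_run=True` lists what would be deleted, so the result can be reviewed before anything is removed.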