
Data generation problem with the Spark-based graph_data_parser

Open ZunwenYou opened this issue 5 years ago • 8 comments

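The Spark executors use HDFSWriter to generate the part_x.dat binary files, and reading some of the parts fails with a "data error". We have ruled out a data-format problem: generating the dat files on a single machine from the same generated JSON files works fine. The symptoms are: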

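  1. Every part that fails to read fails while parsing its last few lines.
  2. Some of the failed parts load without error when they are loaded again for training. [image]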

Update: after setting the Spark executor cores to 1, the problem went away. Is something wrong with the Writer's flush?

ZunwenYou avatar May 08 '19 06:05 ZunwenYou

ping @yangsiran

ZunwenYou avatar May 08 '19 06:05 ZunwenYou

This issue can be fixed by adding an hflush function to the HDFSWriter class. You should also call hflush after everything is done.

intoraw avatar May 08 '19 09:05 intoraw

@pgplus1628 As shown in the last post, the writer flushes after every record is written.

ZunwenYou avatar May 08 '19 09:05 ZunwenYou

@ZunwenYou oh, I mean hflush.

intoraw avatar May 08 '19 11:05 intoraw

@pgplus1628 you are right.

ZunwenYou avatar May 08 '19 11:05 ZunwenYou

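@ZunwenYou [Screenshot 2019-06-19 7:50:35 PM] This is my Spark code for writing the dat files, but the file-writing code does not seem to be executed. What could be the reason? Any advice would be appreciated.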

arsenezhang avatar Jun 19 '19 11:06 arsenezhang

@arsenezhang An RDD needs an action to trigger its lazy operations. You have to execute resultRDD.count().
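
The effect of a missing action can be reproduced without a cluster. The sketch below is only an analogy, assuming plain Java streams as a stand-in for an RDD: `LazyWriteDemo` and `writeRecord` are hypothetical names, and `writeRecord` just counts calls instead of writing a dat file. As with RDD transformations, the `map` step does not run until something consumes the pipeline.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyWriteDemo {
    // Counts how many times the (hypothetical) write step actually ran.
    static final AtomicInteger writes = new AtomicInteger();

    // Stand-in for "write one record to a dat file"; it only bumps the counter.
    static String writeRecord(String record) {
        writes.incrementAndGet();
        return record;
    }

    // Builds the pipeline but never consumes it, so writeRecord is never called.
    static int writesWithoutTerminalOp() {
        writes.set(0);
        Stream<String> pipeline =
                Stream.of("a", "b", "c").map(LazyWriteDemo::writeRecord);
        return writes.get();
    }

    // Consumes the pipeline, which forces writeRecord to run for every record.
    // collect(...) is used as the terminal operation here because, since Java 9,
    // Stream.count() may skip map() when the size is known from the source.
    static int writesWithTerminalOp() {
        writes.set(0);
        List<String> written = Stream.of("a", "b", "c")
                .map(LazyWriteDemo::writeRecord)
                .collect(Collectors.toList());
        return writes.get();
    }

    public static void main(String[] args) {
        System.out.println("without terminal op: " + writesWithoutTerminalOp());
        System.out.println("with terminal op: " + writesWithTerminalOp());
    }
}
```

In Spark the equivalent trigger is an action such as `resultRDD.count()`; transformations such as `map` only describe the work to be done.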

ZunwenYou avatar Jun 23 '19 03:06 ZunwenYou

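Hello, I keep running into parsing problems with the training data I generate with Spark. Could you please send me a copy of your Spark code for generating training data? ^_^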

ziyang599 avatar Oct 21 '20 08:10 ziyang599