OverflowError: cannot fit 'int' into an index-sized integer

Open geekhch opened this issue 3 years ago • 10 comments

Command used:

visualdl --logdir output/combine_all_0411131554_paddle/ --host 0.0.0.0

After starting visualdl, it fails with the following error:

VisualDL 2.2.3
Traceback (most recent call last):
  File "/home/hechanghong/miniconda3/envs/paddle2.1/bin/visualdl", line 8, in <module>
    sys.exit(main())
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/server/app.py", line 177, in main
    _run(args)
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/server/app.py", line 156, in _run
    app = create_app(args)
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/server/app.py", line 65, in create_app
    api_call = create_api_call(args.logdir, args.model, args.cache_timeout)
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/server/api.py", line 250, in create_api_call
    api = Api(logdir, model, cache_timeout)
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/server/api.py", line 65, in __init__
    self._reader = LogReader(logdir)
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/reader/reader.py", line 89, in __init__
    self.load_new_data(update=True)
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/reader/reader.py", line 354, in load_new_data
    self.add_remain()
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/reader/reader.py", line 294, in add_remain
    remain = self.reader.get_remain()
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/reader/record_reader.py", line 106, in get_remain
    for item in self._reader:
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/reader/record_reader.py", line 60, in __next__
    self._reader.get_next()
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/reader/record_reader.py", line 40, in get_next
    event_str = self.file_handle.read(header_len)
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/io/bfile.py", line 592, in read
    self.buff, self.continuation_token = self.fs.read(
  File "/home/hechanghong/miniconda3/envs/paddle2.1/lib/python3.8/site-packages/visualdl/io/bfile.py", line 121, in read
    data = fp.read(size)
OverflowError: cannot fit 'int' into an index-sized integer

geekhch avatar Apr 18 '22 04:04 geekhch

It looks like the data in the log is corrupted. Could you send us a log file so we can debug it?

rainyfly avatar Apr 19 '22 08:04 rainyfly

Sure. I have an 8.4 MB log file. How should I send it to you?

geekhch avatar Apr 19 '22 08:04 geekhch

Hello, I have uploaded the problematic log to the QQ Mail file transfer station; it can be downloaded directly.

geekhch avatar Apr 19 '22 08:04 geekhch

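(image) These are the byte lengths I get when parsing each record in the log; the position where the error occurs is shown above. One record has a byte length of 10734638070951275615, and just before it there are several records with a length of 0. My guess is that the written data started to go wrong from that point. What kind of data were you recording?

The writes probably became corrupted starting at the zero-length records, so that during parsing, bytes that are not a length field were interpreted as one. The value 10734638070951275615 can only be represented by an 8-byte unsigned type; it does not fit in an 8-byte signed type, which is probably why the OverflowError was raised.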
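The arithmetic behind that explanation can be checked with a short Python sketch. It assumes the length prefix is an 8-byte unsigned little-endian integer (the thread only states that the value needs 8 unsigned bytes), and it uses an in-memory `io.BytesIO` as a stand-in for the real log file handle, so it illustrates the failure mode rather than VisualDL's actual reader.

```python
import io
import struct
import sys

# Length value reported in this thread.
BAD_LENGTH = 10734638070951275615

# Assumption: the length prefix is an 8-byte unsigned little-endian integer.
header = struct.pack("<Q", BAD_LENGTH)
(length,) = struct.unpack("<Q", header)

fits_unsigned_64 = length < 2**64        # representable as uint64
fits_signed_64 = length < 2**63          # representable as int64
exceeds_index_size = length > sys.maxsize  # larger than any valid read size


def try_read(size):
    """Return the name of the exception raised by read(size), or None."""
    try:
        # Stand-in for the log file handle; any binary stream behaves the same.
        io.BytesIO(b"payload").read(size)
    except OverflowError as exc:
        return type(exc).__name__
    return None


print(fits_unsigned_64, fits_signed_64, exceeds_index_size, try_read(length))
```

A reader that wanted to fail more gracefully could compare the parsed length with the number of bytes left in the file before calling `read`, but that is a design suggestion, not something the thread confirms VisualDL does.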

rainyfly avatar Apr 20 '22 02:04 rainyfly

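But I only call writer.add_scalar(f'{k}_eval_loss', loss_dict[k], global_step[k]); that is the only logging API I use. There is no multi-process write conflict, and the loss values, which I also print to the log at the same time, look fine. The error appears suddenly partway through a run. Could it be caused by something like a caching bug in VisualDL?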

geekhch avatar Apr 20 '22 13:04 geekhch

What values are stored in global_step[k]?

rainyfly avatar Apr 22 '22 09:04 rainyfly

All operations on global_step are shown below; there should be nothing wrong with them.

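from collections import defaultdict

# One step counter per task; a missing key starts at 0.
global_step = defaultdict(int)
for epoch in range(num_epoches):
    for task, data in dataloader:
        # loss comes from the training step, which is omitted in this post.
        writer.add_scalar(f'{task}_train_loss', loss.item(), global_step[task])
        global_step[task] += 1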

geekhch avatar Apr 22 '22 09:04 geekhch

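That looks normal. Is the dataloader one you wrote yourselves, and is task the name of the task? Is this problem 100% reproducible? If you write print(f'{task}_train_loss', loss.item(), global_step[task]) to a file, does anything abnormal show up?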

rainyfly avatar Apr 22 '22 09:04 rainyfly

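You could add a print line just above writer.add_scalar and redirect standard output to a text file when running the program. When the error occurs, you will then know which call was not written successfully by LogWriter. If you find the cause this way, please let us know which statement triggered the problem.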
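One way to apply this suggestion is a small wrapper that prints the arguments, flushes, and then forwards the call. This is only a sketch: `logged_add_scalar` and `FakeWriter` are hypothetical names introduced here, and `FakeWriter` stands in for the real writer so the example runs without VisualDL installed. The positional argument order follows the calls shown earlier in this thread.

```python
import sys


def logged_add_scalar(writer, tag, value, step, out=sys.stdout):
    """Print the arguments, flush, then forward them to writer.add_scalar."""
    # Flushing right away keeps the last printed line on disk even if the
    # process dies during the write that follows.
    print(f"add_scalar tag={tag!r} value={value!r} step={step!r}",
          file=out, flush=True)
    writer.add_scalar(tag, value, step)


class FakeWriter:
    """Hypothetical stand-in for the real writer; it only records calls."""

    def __init__(self):
        self.calls = []

    def add_scalar(self, tag, value, step):
        self.calls.append((tag, value, step))


if __name__ == "__main__":
    writer = FakeWriter()
    for step, loss in enumerate([0.9, 0.7, 0.5]):
        logged_add_scalar(writer, "demo_train_loss", loss, step)
```

Running the training script with standard output redirected, for example `python train.py > add_scalar_trace.txt` (both file names are placeholders), leaves a trace whose last line is the call that was in progress when things went wrong.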

rainyfly avatar Apr 22 '22 09:04 rainyfly

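It is paddle.io.DataLoader. With the recent versions of the project code, the problem can indeed be reproduced almost every time. Thanks for the suggestion; I will rerun the experiments and report back.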

geekhch avatar Apr 23 '22 06:04 geekhch