
Generating the .bin file requires much more memory than expected

enerai opened this issue 5 years ago · 15 comments

For example, a 50G dataset might produce an 80G .bin file. Even with 128G of memory, I cannot actually train on that dataset, because generating the .bin file requires more than 128G of memory (50G + 80G + other overhead) and the process gets interrupted. Would it be possible to further optimize the .bin file generation to improve memory utilization, e.g. by streaming out the 80G .bin file?

enerai avatar Oct 22 '18 02:10 enerai

@enerai Yes, you are right. To speed up file I/O, xLearn needs extra memory space for data reading. I am considering whether xLearn can read the data file line by line, instead of reading the whole file into memory at the beginning.

aksnzhy avatar Oct 22 '18 14:10 aksnzhy
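
For illustration only, here is a minimal C++ sketch (not xLearn's actual code) of the streaming approach described above: with `std::getline`, only the current text line plus the already-parsed rows need to be resident, instead of the whole raw file and the parsed rows at once. `Row` and `ParseLine` are hypothetical placeholders.

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for xLearn's parsed sample type.
struct Row {
  float label;
  // real code would also hold the parsed features
};

// Placeholder parser: real code would tokenize the libsvm/libffm line.
Row ParseLine(const std::string& line) {
  return Row{line.empty() ? 0.0f : static_cast<float>(line[0] - '0')};
}

int main() {
  std::ifstream in("train.txt");
  std::string line;
  std::vector<Row> data;
  // Only the current text line plus the already-parsed rows are in
  // memory, never the whole raw file plus the parsed rows at once.
  while (std::getline(in, line)) {
    data.push_back(ParseLine(line));
  }
  std::cout << "parsed " << data.size() << " rows\n";
  return 0;
}
```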

If xlearn can read data line by line, its usability on large datasets will improve considerably. Hope to see this improvement in the next version!

enerai avatar Oct 22 '18 16:10 enerai

@enerai I will fix this this weekend. Thank you!

aksnzhy avatar Oct 23 '18 12:10 aksnzhy

@enerai Hi, I fixed this problem this weekend. You can rebuild from the latest code and give it a try! Thank you!

aksnzhy avatar Oct 29 '18 04:10 aksnzhy

Thanks for your hard work! I will try it today and make sure that everything works well. @aksnzhy

enerai avatar Oct 29 '18 04:10 enerai

@aksnzhy Following the installation instructions at https://xlearn-doc.readthedocs.io/en/latest/install/index.html, I cannot obtain xlearn_train and xlearn_predict.

enerai avatar Oct 29 '18 04:10 enerai

Hi, that was a mistake; you need to re-clone the code and rebuild it. Make sure the version you get is 0.3.7.

aksnzhy avatar Oct 29 '18 07:10 aksnzhy

I have just fixed this mistake.

aksnzhy avatar Oct 29 '18 07:10 aksnzhy

Hi, I re-downloaded and compiled the source code, which produced the libxlearn_api.so file, and replaced the previous version. When running, the printed version number is 0.3.7, so the replacement should be fine. But memory still gets exhausted and the process eventually hangs. My data file is 8G, and my machine has 8G of memory.

Muniuliuma avatar Oct 31 '18 01:10 Muniuliuma

@Muniuliuma With an 8G data file and 8G of RAM, it is indeed possible to exhaust memory. The reasons: the data after parsing is not necessarily still 8G, and some auxiliary space also has to be allocated, such as memory for the model. Besides that, the machine's currently available memory may be less than 8G, since other system processes also consume memory. So in general I recommend reserving some extra headroom. Before the update for this issue, xLearn needed 16G of memory to process 8G of data; now it needs only about 8G.

aksnzhy avatar Oct 31 '18 08:10 aksnzhy

So the machine's available memory cannot be smaller than the dataset size? When converting to the binary file, could the data be read in batches?

Muniuliuma avatar Oct 31 '18 08:10 Muniuliuma

@Muniuliuma The data is already read batch by batch. But the resulting binary file can still be larger than 8G. For example, a label "1" in the text file is a single char, but after conversion to an int it becomes 4 chars (bytes), so the binary file can end up larger.

aksnzhy avatar Oct 31 '18 10:10 aksnzhy
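
A toy demonstration of this point (the exact field widths in xLearn's .bin format are an assumption here): the same label costs 1 byte as text but 4 bytes as a fixed-width integer.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  const char text_label = '1';  // "1" in the text file: 1 byte
  const int32_t bin_label = 1;  // same label as a fixed-width int: 4 bytes
  std::cout << sizeof(text_label) << " byte vs "
            << sizeof(bin_label) << " bytes\n";  // prints "1 byte vs 4 bytes"
  return 0;
}
```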

@Muniuliuma The benefit of the binary format is not that it saves memory, but that it greatly reduces the overhead of data serialization, which costs far more time than the file I/O itself.

aksnzhy avatar Oct 31 '18 10:10 aksnzhy
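
To make the trade-off concrete, a hypothetical sketch of such a binary cache: parsing (tokenizing and converting numbers) happens once at write time, and every later run reads the fixed-width records back with a plain `fread`. The `Record` layout is invented for illustration and is not xLearn's actual format.

```cpp
#include <cstdio>
#include <vector>

// Invented fixed-width record for illustration; xLearn's real .bin
// layout is more involved.
struct Record { float label; int field; int feature; float value; };

int main() {
  std::vector<Record> rows = {{1.0f, 0, 3, 0.5f}, {0.0f, 1, 7, 1.5f}};

  // Parse the text format once, then dump the parsed records as raw bytes.
  if (FILE* out = std::fopen("cache.bin", "wb")) {
    std::fwrite(rows.data(), sizeof(Record), rows.size(), out);
    std::fclose(out);
  }

  // Later runs read the records back with a single fread per batch:
  // no string splitting or number parsing, which is the expensive part.
  std::vector<Record> back(rows.size());
  if (FILE* in = std::fopen("cache.bin", "rb")) {
    std::fread(back.data(), sizeof(Record), back.size(), in);
    std::fclose(in);
  }
  return 0;
}
```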

Is the data read in batches now? If so, what is the batch size? With my 8G training set, memory still blows up and the conversion to the binary file fails. When I use libffm, it also converts the data to a binary file, but it consumes only about 2G of memory and converts quickly.

Muniuliuma avatar Nov 01 '18 05:11 Muniuliuma

How can I pass in multiple input files?

liyi193328 avatar Mar 01 '20 09:03 liyi193328