
Readable model dump.

Open BaiGang opened this issue 8 years ago • 17 comments

Hi,

Currently, all learning methods in wormhole save the resulting models in a binary format. This works well for machine learning competitions, i.e. when both training and prediction use wormhole components. In more general cases, however, when we train the models offline and want to apply them in an online component (in our case, a server running on the JVM), the binary format is inconvenient. A readable model output in a text format (or another exchangeable format such as protobuf) would be very welcome.

Thanks, Gang

BaiGang avatar Nov 18 '15 02:11 BaiGang

I address the readable dump of the DiFacto model by parsing the binary file saved via SaveModel, i.e. Save in KVStore and IVal AdaGradEntry in DiFacto.

Ideally, we could abstract the Entry data and the internal storage in KVStore using protobuf. This would keep the I/O implementations neat and make our model results exchangeable across languages and platforms.

BaiGang avatar Nov 20 '15 05:11 BaiGang

So my proposal above is mainly related to ps-lite. I'll try it out and make a WIP pull request there.

BaiGang avatar Nov 20 '15 05:11 BaiGang

yeah, that's a good suggestion.

i'll add a tool to convert the binary model into an ascii format.

at the same time, i'm trying to refactor fm into a separate repo called dmlc/difacto, with two major changes

  1. having a single machine multiple threads implementation, which should process data <100GB easily on a single machine. and also will be easy to have python/R bindings
  2. switch to the dev branch of ps-lite, which is a simplified version of the master branch. mxnet is using it now and it works well

i hope to get it done in a week.

mli avatar Nov 28 '15 22:11 mli

Very nice, Look forward to the changes :)

CNevd avatar Nov 29 '15 00:11 CNevd

Thanks and looking forward to the changes. : )

BaiGang avatar Nov 30 '15 13:11 BaiGang

Any update on this?

I'm also interested in the refactor of ps-lite. It has had no updates for two months, so is it finalized?

BaiGang avatar Dec 30 '15 09:12 BaiGang

@BaiGang "I address the readable dump of DiFacto model by parsing the binary file saved via SaveModel". Can you share the parsing method with me? Thanks.

formath avatar Jul 01 '16 10:07 formath

see dump.cc

CNevd avatar Jul 01 '16 12:07 CNevd

@BaiGang @mli
When I dump the model to text format, I find that the original feature ids have been converted into new ids (large numbers). How can I keep the original feature ids in the model? Thanks!

toughJack avatar Aug 24 '16 10:08 toughJack

there is a revert key id function, I guess it is called in the data reader

mli avatar Aug 24 '16 15:08 mli

@toughJack Maybe you could change the code in localizer.h like this:

    else if (sizeof(I) == 8) {
    #pragma omp parallel for num_threads(nt_)
      for (size_t i = 0; i < idx_size; ++i) {
        // Original line, disabled so that the raw feature ids are kept:
        // pair_[i].k = ReverseBytes(blk.index[i]);
        pair_[i].k = blk.index[i];
        pair_[i].i = i;
      }
    }

formath avatar Aug 25 '16 02:08 formath

@formath @toughJack See issues/8 (https://github.com/CNevd/Difacto_DMLC/issues/8). Just commenting out pair_[i].k = ReverseBytes(blk.index[i]); will make the servers' key ranges imbalanced if your max key is small.

CNevd avatar Aug 25 '16 04:08 CNevd

you manually set the max_key, so the servers will only partition that key range

mli avatar Aug 25 '16 04:08 mli

@mli yes:)

CNevd avatar Aug 25 '16 04:08 CNevd

@CNevd Good suggestion. I always generate balanced uint64 feature ids offline, so I missed that. If the max key is small, setting max_key is indeed the right fix.
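
The trade-off discussed above can be sketched as follows. This is only an illustration under stated assumptions: `ReverseBytes64` is a stand-in for the ReverseBytes helper named in the thread, and `ServerOf` is a hypothetical even split of the key range `[0, max_key)` across servers, not the partitioning code ps-lite actually uses.

```cpp
#include <cstdint>

// Stand-in for the ReverseBytes helper mentioned in the thread (assumption):
// reverses the byte order of a 64-bit key, so the low byte becomes the high byte.
inline uint64_t ReverseBytes64(uint64_t x) {
  uint64_t r = 0;
  for (int i = 0; i < 8; ++i) {
    r = (r << 8) | (x & 0xFF);
    x >>= 8;
  }
  return r;
}

// Hypothetical even partition of [0, max_key) over num_servers servers.
// Precondition: key < max_key and num_servers > 0.
inline int ServerOf(uint64_t key, uint64_t max_key, int num_servers) {
  const uint64_t n = static_cast<uint64_t>(num_servers);
  // Width of each server's range, rounded up so the ranges cover the key space.
  const uint64_t width = max_key / n + (max_key % n != 0 ? 1 : 0);
  return static_cast<int>(key / width);
}
```

With the full 64-bit range and four servers, the raw small ids 64, 128 and 192 all fall on server 0, while their byte-reversed forms fall on servers 1, 2 and 3. Keeping the raw ids and shrinking the range instead (for example `max_key = 1000`) spreads them again, which is the effect of setting max_key by hand.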

formath avatar Aug 25 '16 06:08 formath

@mli I noticed that you mentioned a single machine multiple threads implementation of FM: "1. having a single machine multiple threads implementation, which should process data <100GB easily on a single machine. and also will be easy to have python/R bindings" I could not find any manual for the single machine multiple threads version. Does it work? If so, how do I set the relevant parameters and run it? Thanks

toughJack avatar Aug 25 '16 09:08 toughJack

  1. just run multiple workers on the same machine
  2. try the lbfgs implementation in dmlc/difacto


mli avatar Aug 25 '16 17:08 mli