
train_lm.py

danpovey opened this issue on Jul 04 '16 · 6 comments

I want to create a train_lm.py that is a top-level interface to getting int-data and counts, and training an LM. It will support some common usage patterns and options, but not necessarily 100% of the things you might want to do. Pruning should probably still be a separate script and not part of what train_lm.py does, but train_lm.py could suggest command lines for pruning after building the LM, to make things easier.

danpovey · Jul 04 '16 18:07

If I may suggest, a cleanup option would be nice, because right now it leaves behind a bunch of files that take up a lot of space.

vince62s · Jul 04 '16 19:07

Some lower-level scripts already support a --clean option, but they may not be cleaning everything they should. Can you work out what the main contributors to this space are, and come up with some suggestions for what to clean, i.e. how to change the scripts? Right now my attention is taken up with issues related to transcript cleanup and segmentations in TEDLIUM. Dan


danpovey · Jul 04 '16 19:07

The biggest contributor is definitely the work folder in optimize_vocabsize_order; if I am not mistaken, it can be cleaned after the second call of optimize_metaparameters.py. Sorry, I don't write in Python, so I have to leave this one to Ke or Zhouyang. The second contributor is the work folder in lm_vocabsize_order; it could be cleaned after make_lm_dir.py. Less critical but still contributing is the split dir in counts_vocabsize_order.

Hope this helps.

Just for order 5, on my 1.5 GB corpus, it would save 80 GB ...

vince62s · Jul 04 '16 19:07

Thanks. Ke and Zhouyang, can you please try to solve this problem? I was actually looking for something for you both to do that involves programming. You can work on it together; it doesn't matter to me who checks it in.

I think the right way to do this is to add a --cleanup option to get_objf_and_derivs.py and get_objf_and_derivs_split.py, which will cause them to remove all 'large' files [i.e. those containing stats]. It should default to true; we'll only want it to be set to false for debug purposes.

However, please note that get_objf_and_derivs_split.py has a --need-model option, which means that 'float.all' is needed in its output; in that case float.all should not be deleted. The script make_lm_dir.py may also call get_objf_and_derivs.py, and after you add the cleanup option you'll need to add a --need-model option to that script as well, so that it knows not to remove 'float.all' when it's done.

Also, you should make sure all the scripts that call get_objf_and_derivs_*.py have a --cleanup option and that they pass it in to get_objf_and_derivs_*.py.

Obviously you should make sure that all this works.
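As a rough sketch of what such a --cleanup option might look like (this is not pocolm's actual code; the string-valued flags, the work-directory argument, and the '*.all' file pattern are assumptions for illustration only):

```python
# Illustrative sketch only, not pocolm's actual code.
import argparse
import glob
import os

parser = argparse.ArgumentParser()
parser.add_argument("--cleanup", type=str, default="true", choices=["true", "false"],
                    help="Remove large intermediate (stats) files when done; "
                         "set to false only for debugging.")
parser.add_argument("--need-model", type=str, default="false", choices=["true", "false"],
                    help="If true, float.all is part of the output and must be kept.")
parser.add_argument("work_dir", help="Work directory containing the stats files.")
args = parser.parse_args()

# ... the main work of the script would happen here ...

if args.cleanup == "true":
    # The '*.all' pattern is just an example of 'large' files containing stats.
    for path in glob.glob(os.path.join(args.work_dir, "*.all")):
        if args.need_model == "true" and os.path.basename(path) == "float.all":
            continue  # float.all is needed as output, so keep it
        os.remove(path)
```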

Please also look online for a tutorial on the UNIX program 'find' and how you can use it to find files above a certain size. Also please understand how the program 'du' works and how you can use it to measure the size of a directory with all its files. These will both be needed to verify that all the big files are being removed.
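For the verification step, 'du -sh' reports the total size of a directory tree and 'find ... -size +100M' lists files above a size threshold (the 100M cutoff is just an example); they can be run directly in a shell, or scripted, for instance along these lines (the directory path below is hypothetical):

```python
# Sketch of checking disk usage from Python by calling the standard UNIX tools.
import subprocess

work_dir = "data/lm_work"  # hypothetical directory to check

# Total size of the directory and everything under it (same as: du -sh data/lm_work)
print(subprocess.check_output(["du", "-sh", work_dir]).decode().strip())

# Any remaining files larger than 100 MB (same as: find data/lm_work -type f -size +100M)
big = subprocess.check_output(["find", work_dir, "-type", "f", "-size", "+100M"]).decode()
print(big if big else "no files over 100M remain")
```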


danpovey · Jul 04 '16 20:07

If you need another programming subject, there is one thing that could be very useful. Remember, at the beginning of the project we were talking about a target size for the LM. Since the size is roughly linear in the number of n-grams, it should be quite easy to set up a rule that relates the target size of the LM to the pruning factor. For an end user, it is far more practical to specify a size.

vince62s · Jul 04 '16 20:07

This also involves an element of line search. I did have in mind to do this eventually, but it's a little harder [the relationship will depend on the data and would need to be discovered during the line search]. Dan
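As a rough illustration of the kind of line search involved, one could bisect over the pruning threshold until the pruned LM comes close enough to the target number of n-grams. The helper callables below (prune_to_threshold, num_ngrams) and the default search bounds are hypothetical stand-ins, not pocolm code:

```python
# Illustrative sketch of a line (bisection) search over the pruning threshold.
# prune_to_threshold() and num_ngrams() are hypothetical stand-ins for a call
# to the pruning script and a count of the n-grams in the resulting LM.
def find_pruning_threshold(target_ngrams, prune_to_threshold, num_ngrams,
                           lo=0.0, hi=1.0, rel_tol=0.02, max_iters=20):
    """Return a threshold whose pruned LM size is within rel_tol of the target."""
    for _ in range(max_iters):
        mid = 0.5 * (lo + hi)
        lm = prune_to_threshold(mid)      # prune with the candidate threshold
        size = num_ngrams(lm)             # n-grams remaining after pruning
        if abs(size - target_ngrams) <= rel_tol * target_ngrams:
            return mid
        if size > target_ngrams:
            lo = mid                      # not pruned enough: raise the threshold
        else:
            hi = mid                      # pruned too much: lower the threshold
    return 0.5 * (lo + hi)
```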


danpovey · Jul 04 '16 20:07