sentencepiece icon indicating copy to clipboard operation
sentencepiece copied to clipboard

Use full path for inputs.

Open everdark opened this issue 4 years ago • 7 comments

Currently if we specify model_dir the input x will need to be relative to that path, which could be counter intuitive. The scenario is that we put the training files in a data directory while we want the resulting model in another dedicated model directory. By expanding x to its full path at the very beginning of the function call we shall be able to avoid any ambiguity resolving the path of both data and model directories.

everdark avatar Jul 25 '20 03:07 everdark

Why don't you put the full path or use normalizePath in your training script?

jwijffels avatar Jul 27 '20 12:07 jwijffels

Why don't you put the full path or use normalizePath in your training script?

Mostly because I didn't expect to use full path in BOTH x and model_dir at all. Of course that makes thing very clear. Then maybe we shall be clear in document that the path should be full path, otherwise when model_dir is supplied, the path to x will be relative to it if not given in full path. To me it is just more user-friendly to support relative path in a intuitive way. :)

When I first use this function I immediately got the error of not finding x, I couldn't understand since file.exists(x) is definitely a TRUE. So I dived into the source code of the function sentencepiece and realized that x is indeed relative to model_dir. I believe this can be common for other users, too.

everdark avatar Jul 27 '20 12:07 everdark

I think there was another reason why I didn't put the full path there but I forgot why. Something about the sentencepiece C++ code or something about how to save objects but I forgot... Don't know what to do know.

jwijffels avatar Jul 27 '20 12:07 jwijffels

All I know is that spm_train doesn't have a separate argument to specify model output directory. By default it writes the model and vocab file to the current directory. However we are able to use --model_prefix to specify the model directory:

spm_train --input=data/sent.txt --model_prefix=model/spm

At least this works for sentencepiece 0.1.84. Not sure if this is related to your forgotten reason though...

everdark avatar Jul 27 '20 13:07 everdark

yes that was the reason. If you specify the full path to the file in model_prefix, as specified in the sentencepiece docs, I no longer correctly load the resulting model at the end of the sentencepiece call in https://github.com/bnosac/sentencepiece/blob/master/R/sentencepiece.R#L88. That's why I didn't specify the full path. Or it least I got lost in this path setup which was both imposed by sentencepiece and cran

jwijffels avatar Jul 27 '20 13:07 jwijffels

Let's give it another try. Copy-paste note from the new commit:


Decouple the args expert use from other arguments. The current design may confuse user about using model_dir and model_prefix with directory being involved. So we make it clear that when args is used, user will take full control of the underlying spm_train command line. Since the original program doesn't have verbose option so we keep only this argument to be still relevant to expert use.

Under the new design, the following two calls are equivalent:

sentencepiece("data/sentences.txt", model_prefix="spm", model_dir="model")
sentencepiece(args=c("--input=data/sentences.txt --model_prefix=model/spm"))

Both supports relative path based on the same working directory.

Also I change the regex of model_prefix matching pattern from "model_prefix=.+ -" to "model_prefix=.+( -|$)" so --model_prefix can also be the last argument.

everdark avatar Jul 27 '20 14:07 everdark

Okay, but I think what I was trying to point at that model_prefix can be the full path to a file where to store the model, where the complexity came from. But thinking of this, this is probably not related to the use of x. The reason why capture.output was used was at request of cran to suppress training evolution.

jwijffels avatar Jul 27 '20 19:07 jwijffels