sentencepiece
sentencepiece copied to clipboard
Use full path for inputs.
Currently if we specify model_dir
the input x
will need to be relative to that path, which could be counter intuitive.
The scenario is that we put the training files in a data directory while we want the resulting model in another dedicated model directory.
By expanding x
to its full path at the very beginning of the function call we shall be able to avoid any ambiguity resolving the path of both data and model directories.
Why don't you put the full path or use normalizePath in your training script?
Why don't you put the full path or use normalizePath in your training script?
Mostly because I didn't expect to use full path in BOTH x
and model_dir
at all.
Of course that makes thing very clear. Then maybe we shall be clear in document that the path should be full path, otherwise when model_dir
is supplied, the path to x
will be relative to it if not given in full path.
To me it is just more user-friendly to support relative path in a intuitive way. :)
When I first use this function I immediately got the error of not finding x
, I couldn't understand since file.exists(x)
is definitely a TRUE
. So I dived into the source code of the function sentencepiece
and realized that x
is indeed relative to model_dir
.
I believe this can be common for other users, too.
I think there was another reason why I didn't put the full path there but I forgot why. Something about the sentencepiece C++ code or something about how to save objects but I forgot... Don't know what to do know.
All I know is that spm_train
doesn't have a separate argument to specify model output directory.
By default it writes the model and vocab file to the current directory.
However we are able to use --model_prefix
to specify the model directory:
spm_train --input=data/sent.txt --model_prefix=model/spm
At least this works for sentencepiece 0.1.84
.
Not sure if this is related to your forgotten reason though...
yes that was the reason. If you specify the full path to the file in model_prefix, as specified in the sentencepiece docs, I no longer correctly load the resulting model at the end of the sentencepiece call in https://github.com/bnosac/sentencepiece/blob/master/R/sentencepiece.R#L88. That's why I didn't specify the full path. Or it least I got lost in this path setup which was both imposed by sentencepiece and cran
Let's give it another try. Copy-paste note from the new commit:
Decouple the args
expert use from other arguments.
The current design may confuse user about using model_dir
and model_prefix
with directory being involved.
So we make it clear that when args
is used, user will take full control of the underlying spm_train
command line.
Since the original program doesn't have verbose
option so we keep only this argument to be still relevant to expert use.
Under the new design, the following two calls are equivalent:
sentencepiece("data/sentences.txt", model_prefix="spm", model_dir="model")
sentencepiece(args=c("--input=data/sentences.txt --model_prefix=model/spm"))
Both supports relative path based on the same working directory.
Also I change the regex of model_prefix
matching pattern from "model_prefix=.+ -"
to "model_prefix=.+( -|$)"
so --model_prefix
can also be the last argument.
Okay, but I think what I was trying to point at that model_prefix can be the full path to a file where to store the model, where the complexity came from. But thinking of this, this is probably not related to the use of x
. The reason why capture.output was used was at request of cran to suppress training evolution.