masakhane-mt
Custom Data Notebook: Spaces in file paths can cause issues with bash commands
For example, /content/drive/My Drive/masakhane/$src-$tgt-$tag
can cause issues, but also the following situation caused an error for me:
source_file = f"/content/drive/My Drive/Research/Hani Machine Translation/hni_story_corpus/v2/hani_story_corpus_train.{source_language}"
target_file = f"/content/drive/My Drive/Research/Hani Machine Translation/hni_story_corpus/v2/hani_story_corpus_train.{target_language}"
# They should both have the same length.
! wc -l $source_file
! wc -l $target_file
Mitigations we could do:
"MyDrive" instead of "My Drive" helps
Actually, it seems you can just switch from My Drive paths to MyDrive paths, which helps a lot so long as there aren't spaces elsewhere in the path, e.g. in my case, where Hani Machine Translation was in the path to train.eng and train.hni.
Add quotes around bash variables
For example, use
! wc -l "$source_file"
instead of ! wc -l $source_file
and
! head "$source_file"* instead of ! head $source_file*
but this doesn't completely solve it, and can get complicated when we've got some of the more complex cases later in the notebook, like
!cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"
or within the yaml file:
#load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
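A rough sketch of one way to do the quoting on the Python side instead of by hand, using shlex.quote from the standard library. The path below is a made-up example, not one from the notebook, and I'm assuming the quoted string would then be interpolated into a ! command (e.g. ! wc -l $quoted), which I haven't verified for every cell.

```python
import shlex

# Hypothetical path with spaces, for illustration only.
source_file = "/content/drive/My Drive/some folder/train.en"

# shlex.quote wraps the string so a POSIX shell treats it as a single word.
quoted = shlex.quote(source_file)

# Build the command text that would be handed to the shell.
command = f"wc -l {quoted}"
print(command)
# prints: wc -l '/content/drive/My Drive/some folder/train.en'

# shlex.split parses the command the way a POSIX shell would split it,
# so we can confirm the path survives as one argument.
print(shlex.split(command))
```

This only covers the simple one-path case. Commands that mix quoting with globs (like the cp -r example above) would still need care, since a quoted * is no longer expanded by the shell.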
Warn the user about whitespace.
Add a section that checks all the paths for whitespace and warns the user that it would probably be easier to just remove it.
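Something like the following could serve as that check. It is only a sketch: the function name and the example paths are hypothetical, and the real notebook would need to pass in whatever path variables it actually defines.

```python
def paths_with_whitespace(paths):
    """Return the paths that contain at least one whitespace character."""
    return [p for p in paths if any(ch.isspace() for ch in str(p))]


# Hypothetical example paths, for illustration only.
candidate_paths = [
    "/content/drive/My Drive/masakhane/example",
    "/content/drive/MyDrive/masakhane/example",
]

for bad in paths_with_whitespace(candidate_paths):
    print(f"WARNING: {bad!r} contains whitespace and may break bash commands. "
          "Consider renaming the folder.")
```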
Do all our file manipulations with Python
We could rewrite a lot of these to use pathlib
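As a sketch of what that might look like, here are Python stand-ins for the wc -l and cp -r calls. The helper names are hypothetical, and these are approximations rather than exact replacements: count_lines also counts a final line that has no trailing newline (wc -l would not), and dirs_exist_ok needs Python 3.8 or newer.

```python
from pathlib import Path
import shutil


def count_lines(path):
    """Count the lines in a file, roughly like `wc -l`."""
    with Path(path).open("rb") as f:
        return sum(1 for _ in f)


def copy_model_dir(src_dir, dst_dir):
    """Copy a directory tree into dst_dir, roughly like `cp -r src/* dst/`."""
    shutil.copytree(Path(src_dir), Path(dst_dir), dirs_exist_ok=True)
```

Because the paths are passed as Python objects and never go through the shell, spaces in them need no quoting at all.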
See also https://github.com/pjreddie/darknet/issues/1672 and https://stackoverflow.com/questions/56640534/cannot-open-train-txt-with-white-space-my-drivehe
Originally posted this on https://github.com/masakhane-io/masakhane-community/issues/25, whoops.
In my case I simply took the spaces out, and that prevented any issues. As in, I used /content/drive/MyDrive/ instead of /content/drive/My Drive/, and also manually renamed my Hani Machine Translation folder to HaniMachineTranslation.
I'm currently testing whether I can get the whole notebook to run with spaces left in the path. I'm adding quotes around variables.
Ah, I think maybe I forgot that you can right-click the Drive name in Google Colab and rename it.
I think I changed my drive name to MyDrive and forgot I had done so.
I will rename it again and see if it breaks.