Berkeley-Crossword-Solver
Training data issues
Hello, I ran into some file format issues while training the model. I have a batch of my own clue and answer data that I want to train on, but I don't know what format the training scripts expect.
- What format should the dataset passed to the following command be in?
bash train_scripts/biencoder/tfidf.sh path/to/dataset
- What are the specific formats of answers.jsonl and docs.jsonl?
python3 train_scripts/biencoder/get_tfidf_negatives.py \
--model path/to/dataset/tfidf/ \
--fills path/to/dataset/answers.jsonl \
--clues path/to/dataset/docs.jsonl \
--out path/to/dataset/ \
--no-len-filter
- What data do train.json and validation.json contain? Are they the files posted on Hugging Face? The data on Hugging Face is CSV, while the command below requires JSON.
CUDA_VISIBLE_DEVICES=0 bash train_scripts/biencoder/train_bert.sh \
path/to/dataset/train.json \
path/to/validation/validation.json \
checkpoints/biencoder/
In summary, could you provide an example of the training file required at each step of the training process, so that we can convert our own data to the right format?
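To make the request concrete, here is a minimal sketch of the kind of conversion I would run once the expected format is known. The field names `clue` and `answer` and the helper `csv_to_jsonl` are placeholders of my own, not the repository's actual schema or API, so they would need to be replaced with whatever the training scripts really read.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text, clue_col="clue", answer_col="answer"):
    """Convert CSV text with clue/answer columns into JSON Lines text.

    The output keys "clue" and "answer" are hypothetical; swap them for
    the field names the training scripts actually expect.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        record = {
            "clue": row[clue_col].strip(),
            "answer": row[answer_col].strip(),
        }
        # One JSON object per line, which is what the .jsonl extension implies.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "clue,answer\nCapital of France,paris\n"
    print(csv_to_jsonl(sample))
```

If the real files need extra fields (for example IDs or negative examples), the `record` dictionary is the only place that would have to change.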
Thank you very much indeed.