Berkeley-Crossword-Solver

Training data issues

Open Melo-1017 opened this issue 10 months ago • 0 comments

Hello, I ran into some file format issues while training the model. I have a batch of my own clue and answer data that I want to train on, but I don't know what format the training scripts expect.

  • What format should the dataset passed to the following command be in?
bash train_scripts/biencoder/tfidf.sh path/to/dataset
  • What are the specific formats of answers.jsonl and docs.jsonl?
python3 train_scripts/biencoder/get_tfidf_negatives.py \
    --model path/to/dataset/tfidf/ \
    --fills path/to/dataset/answers.jsonl \
    --clues path/to/dataset/docs.jsonl \
    --out path/to/dataset/ \
    --no-len-filter
  • What data was used to build train.json and validation.json? Is it the dataset posted on Hugging Face? The files on Hugging Face are CSV, whereas the command below requires JSON.
CUDA_VISIBLE_DEVICES=0 bash train_scripts/biencoder/train_bert.sh \
    path/to/dataset/train.json \
    path/to/validation/validation.json \
    checkpoints/biencoder/

In summary, could you provide an example of the training file required at each step of the training process, so that we can convert our own data into the expected format?
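To make the request concrete, here is a minimal sketch of the kind of conversion we have in mind, from a clue/answer CSV to JSON Lines. The column names `clue` and `answer`, the output keys, and the helper `csv_to_jsonl` are all assumptions made for illustration; the actual schema the training scripts expect is exactly what we are asking about.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text, clue_col="clue", answer_col="answer"):
    """Convert CSV text into JSON Lines text, one clue/answer record per line.

    The column names and the output keys ("clue", "answer") are hypothetical;
    they are NOT confirmed to match what the repository's scripts expect.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        record = {
            "clue": row[clue_col].strip(),
            "answer": row[answer_col].strip(),
        }
        # ensure_ascii=False keeps any non-ASCII characters in clues readable
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "clue,answer\nCapital of France,PARIS\nFeline pet,CAT\n"
    print(csv_to_jsonl(sample))
```

If the expected keys or file layout differ, only the `record` dictionary should need to change.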

Thank you very much indeed.

Melo-1017, Mar 29 '24 13:03