
Pre-train on Wikipedia dump: Questions about data

Open · schelv opened this issue 4 years ago · 7 comments

Hello,

Nice paper! 😃 I want to train the bi-encoder as described in section 5.2.2 of your paper and have some questions about the data that you used.

Can you clarify how the subset of the linked mentions is selected?

"we pre-train our models on Wikipedia data. We use the May 2019 English Wikipedia dump which includes 5.9M entities, and use the hyperlinks in articles as examples (the anchor text is the mention). We use a subset of all Wikipedia linked mentions as our training data (a total of 9M examples)."

What is the format of the input data for training the model? train_biencoder.py tries to load training data from a train.jsonl. Can you give a few example rows for such a file?
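
For reference, a single row presumably looks something like the sketch below. The key names are an assumption based on the zeshel preprocessing output in this repository, not something confirmed by the authors, so verify them against blink/biencoder/data_process.py before relying on this.

```python
import json

# Hypothetical bi-encoder training row (values made up).
# Key names are assumed from the zeshel preprocessing output in this repo;
# verify against blink/biencoder/data_process.py.
example = {
    "context_left": "In 1969 Armstrong became the first person to walk on the",
    "mention": "Moon",                      # hyperlink anchor text
    "context_right": ", during the Apollo 11 mission.",
    "label": "The Moon is Earth's only natural satellite ...",  # entity description text
    "label_title": "Moon",                  # Wikipedia page title
    "label_id": 123456,                     # id of the entity in the entity catalogue
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```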

Is get_processed_data.sh used to process the data? The name would suggest so. However, the README.md in that folder says [deprecated], so I am not sure. (Maybe you could remove the deprecated code from the repository and use a release tag for the old code instead.)

Could you upload the processed training data?

schelv · Jul 06 '20

I'm also facing the same issues. It seems the following parameters are required to train the biencoder:

"context_key"
"data_path"
"debug"
"eval_batch_size"
"eval_interval"
"evaluate"
"gradient_accumulation_steps"
"learning_rate"
"max_cand_length"
"max_context_length"
"max_grad_norm"
"num_train_epochs"
"output_path"
"path_to_model"
"print_interval"
"seed"
"shuffle"
"silent"
"train_batch_size"
"type_optimization"
"warmup_proportion"

I would like to know what values are suggested to reproduce the results published in the paper.

ruanchaves · Jul 13 '20

@belindal @ledw a lot of people are interested in training BLINK with our data; it would be nice if the authors could provide some instructions for training the models, including all the required steps.

Thanks!

bushjavier · Oct 07 '20

I'm facing the same issue. Has anyone figured out how to train on Wikipedia? I ran "get_processed_data.sh" but it did not produce the train.jsonl. Any pointers will help. Thanks.

gpsbhargav · Dec 13 '20

The training data we used can be downloaded from http://dl.fbaipublicfiles.com/KILT/blink-train-kilt.jsonl and http://dl.fbaipublicfiles.com/KILT/blink-dev-kilt.jsonl. The format of the data is described at https://github.com/facebookresearch/KILT :)
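
A rough sketch of pulling the bi-encoder fields out of one of those KILT-format lines, assuming the KILT entity-linking convention of [START_ENT] / [END_ENT] markers around the mention in "input" and the gold Wikipedia title in output[0]["answer"] (both assumptions to double-check against the KILT README):

```python
import json

START, END = "[START_ENT]", "[END_ENT]"  # assumed KILT mention markers

def kilt_to_biencoder(line):
    # Convert one KILT entity-linking record into bi-encoder style fields
    # (key names follow the same assumptions as the earlier example).
    record = json.loads(line)
    left, rest = record["input"].split(START, 1)
    mention, right = rest.split(END, 1)
    return {
        "context_left": left.strip(),
        "mention": mention.strip(),
        "context_right": right.strip(),
        "label_title": record["output"][0]["answer"],  # assumed location of the gold title
    }

with open("blink-train-kilt.jsonl", encoding="utf-8") as f:
    print(kilt_to_biencoder(next(f)))
```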

fabiopetroni · Dec 14 '20

Hi @fabiopetroni, in the paper you mentioned: "We train our cross-encoder model based on the top 100 retrieved results from our bi-encoder model on Wikipedia data. For the training of the cross-encoder model, we further down-sample our training data to obtain a training set of 1M examples."

Can you please provide this training sample for cross-encoder?
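
Not an official answer, but the down-sampling step in that quote can be approximated with a plain uniform sample over the bi-encoder training file; the 1M figure is from the paper, while the uniform-sampling strategy and file names below are assumptions. Note that retrieving the top-100 candidates with the bi-encoder is a separate step not shown here.

```python
import random

random.seed(0)
TARGET = 1_000_000  # cross-encoder training set size reported in the paper

# Reservoir sampling: keep a uniform random sample of TARGET lines
# without loading the whole training file into memory.
sample = []
with open("train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i < TARGET:
            sample.append(line)
        else:
            j = random.randint(0, i)
            if j < TARGET:
                sample[j] = line

with open("crossencoder_train.jsonl", "w", encoding="utf-8") as out:
    out.writelines(sample)
```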

shzamanirad · Aug 17 '21

Hi @ruanchaves, I think you can use the default values for some of the parameters: https://github.com/facebookresearch/BLINK/blob/master/blink/common/params.py

and for the other parameters you can set the values to what is mentioned in the paper. For example, for the bi-encoder (large) model the best hyperparameter configuration is: learning rate = 1e-5, batch size = 128, max context tokens = 32, epochs = 4. For the cross-encoder (large) model: learning rate = 2e-5, batch size = 1, max context tokens = 32, epochs = 1.
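
Putting the two together, the bi-encoder values from the paper mapped onto the parameter names listed earlier in this thread would look roughly like the sketch below; everything marked as a placeholder or assumption is not from the paper and should be checked against the defaults in blink/common/params.py.

```python
# Bi-encoder (large): values marked "paper" come from the hyperparameters
# quoted above; everything else is a placeholder/assumption to check
# against blink/common/params.py.
biencoder_params = {
    "data_path": "data/biencoder",      # directory with train.jsonl / valid.jsonl (assumed layout)
    "output_path": "models/biencoder",
    "learning_rate": 1e-5,              # paper
    "train_batch_size": 128,            # paper
    "num_train_epochs": 4,              # paper
    "max_context_length": 32,           # paper ("max context tokens = 32")
    "max_cand_length": 128,             # assumption: repo default
    "eval_batch_size": 64,              # placeholder
    "gradient_accumulation_steps": 1,   # placeholder
    "warmup_proportion": 0.1,           # placeholder
    "type_optimization": "all_encoder_layers",  # assumption: see params.py for valid options
    "shuffle": True,
    "seed": 52313,                      # placeholder
}

# The cross-encoder analogue would swap in learning_rate=2e-5,
# train_batch_size=1, num_train_epochs=1 per the paper.
```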

shzamanirad · Aug 17 '21

With regards to training a new model on custom data: yes, it is indeed possible. I would recommend first training a zero-shot entity linking (zeshel) model, just to get the hang of the training. The scripts to download and pre-process the zeshel data are in the repository. You can then replicate the same steps: bring your data into the same format as zeshel, modify any hyperparameters (such as the context length or the choice of BERT base model), and train your own model.
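
If you go that route, a quick sanity check over your converted file can catch formatting issues before training. The expected keys below are the same assumed set as in the earlier example (zeshel data additionally carries a "world" field), so adjust them to whatever the preprocessing scripts actually emit:

```python
import json

# Keys the bi-encoder reader is assumed to expect; adjust to match
# the actual zeshel preprocessing output in the repository.
EXPECTED = {"context_left", "mention", "context_right",
            "label", "label_title", "label_id"}

def check_jsonl(path):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            row = json.loads(line)
            missing = EXPECTED - row.keys()
            if missing:
                print(f"line {lineno}: missing keys {sorted(missing)}")

check_jsonl("train.jsonl")
```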

abhinavkulkarni · May 05 '22