Iz Beltagy

38 comments by Iz Beltagy

We didn't release the full text of the SciBERT training corpus, but you can try the GORC corpus https://github.com/allenai/s2-gorc, which we recently released. It is larger and cleaner than the SciBERT training...

You will need to use the ner_finetune.json config, which was recently merged into master: https://github.com/allenai/scibert/blob/master/allennlp_config/ner_finetune.json
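For reference, here is a minimal sketch of launching that config through AllenNLP's Python API (AllenNLP 0.x era, which the scibert repo targets); the repo's own scripts use the `allennlp` CLI instead. The environment-variable names and paths below are assumptions, so check the `std.extVar(...)` calls in `ner_finetune.json` for the names it actually expects.

```python
# Hedged sketch: training the NER fine-tuning config via AllenNLP's Python API.
# Environment-variable names and paths are assumptions; see ner_finetune.json.
import os
from allennlp.commands.train import train_model_from_file
from allennlp.common.util import import_submodules

# The config reads data and weight locations from the environment (assumed names).
os.environ.setdefault("BERT_VOCAB", "/path/to/scivocab_uncased.vocab")
os.environ.setdefault("BERT_WEIGHTS", "/path/to/scibert_scivocab_uncased_weights.tar.gz")
os.environ.setdefault("TRAIN_PATH", "/path/to/train.txt")
os.environ.setdefault("DEV_PATH", "/path/to/dev.txt")
os.environ.setdefault("TEST_PATH", "/path/to/test.txt")

# Register the repo's custom dataset readers and models (equivalent to --include-package scibert).
import_submodules("scibert")

train_model_from_file(
    parameter_filename="allennlp_config/ner_finetune.json",
    serialization_dir="output/ner_finetune",
)
```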

You would follow the same recipe, except replacing `bert-base-cased` with one of the SciBERT models, for example `allenai/scibert_scivocab_uncased`. Side note: you might find better trainers in the HF examples https://github.com/huggingface/transformers/tree/master/examples/text-classification....
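In case it helps, here is a minimal transformers sketch of that swap (my own illustration, not the HF example script); `num_labels` and the input sentence are placeholders for your task.

```python
# Hedged sketch: loading SciBERT in place of bert-base-cased for sequence
# classification with the Hugging Face transformers library.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # num_labels is task-specific

inputs = tokenizer("The transcription factor binds to the promoter region.",
                   return_tensors="pt", truncation=True)
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```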

Just follow the instructions in the readme and you should be able to reproduce the frozen-embedding results. The finetuning experiments require the code in this PR as well...

Can you try adding the argument `--use-dataset-reader` to your command line?

Looks like you need to add a dummy predictor for it to work, something like:

```python
from allennlp.predictors import Predictor

@Predictor.register('dummy_predictor')
class DummyPredictor(Predictor):
    pass
```

then in the command line add `--predictor dummy_predictor`.

1- The NER model predicts an IOB label per token in the sentence, which can be used at decoding time to find spans of entities (see the decoding sketch after this list).
2- We use span-based F1...
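As an illustration of point 1, here is a hedged sketch of that decoding step, turning a per-token IOB tag sequence into entity spans (the units that span-based F1 is computed over). It is an illustrative helper, not the repo's actual evaluation code.

```python
# Hedged sketch: convert IOB tags to (start, end_exclusive, label) entity spans.
from typing import List, Tuple

def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Greedy IOB decoding: B- starts a span, matching I- extends it, anything else closes it."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            if start is not None:
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and tag[2:] == label:
            continue  # span continues
        else:
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans

print(iob_to_spans(["O", "B-Task", "I-Task", "O", "B-Method"]))
# [(1, 3, 'Task'), (4, 5, 'Method')]
```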

The code is a bit difficult to read without formatting, but the obvious issues are that you need to use `AutoModelForTokenClassification`, and it is odd to do `encode(tokenize(tokenizer.decode(tokenizer.encode(string))))`. I think...
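For comparison, here is a minimal sketch of the straightforward path: tokenize the raw string once and feed it to a token-classification head, instead of the encode/decode/re-encode round trip. The model name and `num_labels` are placeholders; in practice you would load a checkpoint already fine-tuned for NER.

```python
# Hedged sketch: token classification with a single tokenization pass.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "allenai/scibert_scivocab_uncased"  # placeholder; use a fine-tuned NER checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=5)  # num_labels: assumption

inputs = tokenizer("EGFR mutations predict response to gefitinib.",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()     # one predicted label id per wordpiece
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print(list(zip(tokens, pred_ids)))
```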

Maybe something like this will make it faster to clone: https://stackoverflow.com/questions/600079/how-do-i-clone-a-subdirectory-only-of-a-git-repository/52269934#52269934

Interesting. I have seen the same pattern while training transformers for another project. I don't know why this is happening, but it doesn't seem to be a bug.