Eric Malmi comments

Results: 13 comments by Eric Malmi

The fine tuned T5 model on clang8

Hi, currently, we're not planning to release the model checkpoints, which would require an approval process for us. Regarding the hyperparameter settings: I'd need to double check with my co-author...

The fine tuned T5 model on clang8

Are you using the [default](https://github.com/google-research-datasets/clang8/blob/main/prepare_clang8_dataset.py#L36) flag value of `--tokenize_text=True`? This should ensure that the tokenization is consistent between sources and targets (although I haven't checked if the spaCy tokenizer will...

The fine tuned T5 model on clang8

I see, it looks like the detokenizer we used hasn't removed spaces before `~` characters, causing a small tokenization inconsistency between our sources and targets. Your performance could go up...
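To make the kind of inconsistency described above concrete, here is a minimal sketch of removing the space that a tokenizer leaves before a `~` character. It is an illustration only, not the detokenizer referred to in the comment; the function name `strip_space_before_tilde` is hypothetical.

```python
import re


def strip_space_before_tilde(text: str) -> str:
    """Remove spaces that directly precede a `~` character.

    Hypothetical helper for illustration; it only handles this one
    case and is not a general-purpose detokenizer.
    """
    return re.sub(r" +~", "~", text)


tokenized = "See you soon ~"
print(strip_space_before_tilde(tokenized))
```

If sources are cleaned this way but targets are not (or the other way round), the model sees a spurious edit around every `~`, which is the sort of mismatch the comment describes.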

The fine tuned T5 model on clang8

We've looked more closely into the issue and found out that a likely cause for the differences in CoNLL-2014 scores is the selection of the target file: we used **alt/official-2014.combined-withalt.m2**...

The fine tuned T5 model on clang8

Hi, in case there's not an option to directly specify `tokens_per_batch`, you can alternatively set the batch size to `1048576 / 128 = 8192` sequences.
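The conversion above can be sketched as a small helper. This assumes that 128 is the maximum sequence length in tokens, which the comment implies but does not state outright; the function and parameter names are hypothetical and not part of any training framework.

```python
def sequences_per_batch(tokens_per_batch: int, max_sequence_length: int) -> int:
    """Convert a token budget per batch into a number of sequences.

    Assumes every sequence is padded or packed to `max_sequence_length`
    tokens, so the budget divides evenly.
    """
    if max_sequence_length <= 0:
        raise ValueError("max_sequence_length must be positive")
    if tokens_per_batch % max_sequence_length != 0:
        raise ValueError("tokens_per_batch is not a multiple of the sequence length")
    return tokens_per_batch // max_sequence_length


# Values from the comment: 1048576 tokens per batch, 128 tokens per sequence.
batch_size = sequences_per_batch(1_048_576, 128)
print(batch_size)  # 8192
```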

The fine tuned T5 model on clang8

Those of you who've had difficulties reproducing the CoNLL-14 results, please note the updated comment at https://github.com/google-research-datasets/clang8/issues/3#issuecomment-991151706. In short, you should post-process your model outputs with [`retokenize.py`](https://github.com/google-research-datasets/clang8/blob/main/retokenize.py) to fix some tokenization discrepancies...

Generating Examples

Unfortunately it's not straightforward since the code uses some internal libraries. Nevertheless, Appendix A.1 should provide the details for replicating the code.

The size of the datasets

We use the [raw Lang-8 dataset](https://sites.google.com/site/naistlang8corpora/) with 237,843 English entries (each consisting of multiple sentences) while the dataset with 1,037,561 English sent-pairs that you're referring to probably corresponds to the...

Recall score for the kept tokens in SARI

Thanks for the quick reply! I understand the idea of inflating the counts, but it's not clear to me why, on line 58
```
keeptmpscore2 += keepgramcountergood_rep[keepgram] / keepgramcounterall_rep[keepgram]
```
...

Errors when bash run.sh

I would first check that the target files have been downloaded successfully by making sure that they contain the correct number of lines, see: https://github.com/google-research-datasets/clang8#data-format
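The line-count check suggested above can be sketched as follows. The expected counts are not given here; they come from the data-format section linked in the comment, so `expected_lines` is left as a parameter and both function names are hypothetical.

```python
def count_lines(path: str) -> int:
    """Count the lines in a file without loading it all into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def has_expected_line_count(path: str, expected_lines: int) -> bool:
    """Return True if the file at `path` has exactly `expected_lines` lines.

    `expected_lines` should be taken from the dataset's data-format
    documentation; no value is hard-coded here.
    """
    actual = count_lines(path)
    if actual != expected_lines:
        print(f"{path}: expected {expected_lines} lines, found {actual}")
        return False
    return True
```

A file with fewer lines than documented usually points to an interrupted or failed download, in which case re-downloading before running `run.sh` again is the first thing to try.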