Eric Malmi
Hi, we're not currently planning to release the model checkpoints, since that would require an approval process on our side. Regarding the hyperparameter settings: I'd need to double check with my co-author...
Are you using the [default](https://github.com/google-research-datasets/clang8/blob/main/prepare_clang8_dataset.py#L36) flag value of `--tokenize_text=True`? This should ensure that the tokenization is consistent between sources and targets (although I haven't checked if the spaCy tokenizer will...
I see, it looks like the detokenizer we used hasn't removed spaces before `~` characters, causing a small tokenization inconsistency between our sources and targets. Your performance could go up...
We've looked more closely into the issue and found out that a likely cause for the differences in CoNLL-2014 scores is the selection of the target file: we used **alt/official-2014.combined-withalt.m2**...
Hi, in case there's no option to directly specify `tokens_per_batch`, you can instead set the batch size to `1048576 / 128 = 8192` sequences.
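The arithmetic above can be sketched as follows. This is only an illustration: it assumes that 1048576 is the intended number of tokens per batch and that 128 is the maximum sequence length in tokens (the comment does not label either number), and the function name `sequences_per_batch` is hypothetical.

```python
def sequences_per_batch(tokens_per_batch: int, max_seq_length: int) -> int:
    """Convert a per-batch token budget into a number of fixed-length sequences."""
    if max_seq_length <= 0:
        raise ValueError("max_seq_length must be positive")
    if tokens_per_batch % max_seq_length != 0:
        # Refuse to round silently; the caller should pick a divisible budget.
        raise ValueError("tokens_per_batch is not a multiple of max_seq_length")
    return tokens_per_batch // max_seq_length


# Values taken from the comment above; their interpretation is an assumption.
batch_size = sequences_per_batch(1048576, 128)
print(batch_size)  # 8192
```

If sequences are padded to the maximum length, this batch size should yield the same token count per batch as setting `tokens_per_batch` directly.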
Those of you who've had difficulties reproducing the CoNLL-14 results, please note the updated comment at https://github.com/google-research-datasets/clang8/issues/3#issuecomment-991151706. In short, you should post-process your model outputs with [`retokenize.py`](https://github.com/google-research-datasets/clang8/blob/main/retokenize.py) to fix some tokenization discrepancies...
Unfortunately it's not straightforward since the code uses some internal libraries. Nevertheless, Appendix A.1 should provide the details for replicating the code.
We use the [raw Lang-8 dataset](https://sites.google.com/site/naistlang8corpora/) with 237,843 English entries (each consisting of multiple sentences), while the dataset with 1,037,561 English sentence pairs that you're referring to probably corresponds to the...
Thanks for the quick reply! I understand the idea of inflating the counts, but it's not clear to me why, on line 58 ``` keeptmpscore2 += keepgramcountergood_rep[keepgram] / keepgramcounterall_rep[keepgram] ```...
I would first check that the target files have been downloaded successfully by making sure that they contain the correct number of lines, see: https://github.com/google-research-datasets/clang8#data-format
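A minimal sketch of such a line-count check is shown below. The helper names, the file path, and the expected count in the usage comment are hypothetical placeholders; the actual expected numbers should be taken from the data-format section of the README linked above.

```python
def count_lines(path: str) -> int:
    """Count lines in a file without loading it fully into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def has_expected_line_count(path: str, expected: int) -> bool:
    """Return True if the file at `path` contains exactly `expected` lines."""
    actual = count_lines(path)
    if actual != expected:
        print(f"{path}: expected {expected} lines, found {actual}")
    return actual == expected


# Hypothetical usage; replace the path and the count with the real values:
# has_expected_line_count("path/to/targets.tsv", 12345)
```

A mismatch would suggest that the download was truncated or that the file was modified, in which case re-downloading is the first thing to try.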