Shawn Presser
Happy to announce that bookcorpus was just merged into huggingface's Datasets library as `bookcorpusopen`, thanks to @vblagoje: https://github.com/huggingface/datasets/pull/856 So, huggingface is officially supporting this dataset now. The Eye also seems...
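For anyone who wants to poke at it, loading it through the `datasets` library should be a one-liner. A minimal sketch (the `title`/`text` field names reflect the merged dataset script; treat them as an assumption, not something quoted from the PR):

```python
# Sketch: load the newly merged dataset via HF `datasets`.
from datasets import load_dataset

books = load_dataset("bookcorpusopen", split="train")
print(books[0]["title"])       # each record carries a book's title...
print(len(books[0]["text"]))   # ...and its full text as one string
```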
> This is something that I have been slowly piecing together. I have been gathering audiobooks and their text versions that are in the public domain (Project Gutenberg & LibriVox...
I ran into the exact same error, and I happened to figure out a workaround: ``` ln -s /opt/homebrew/share/jupyter ~/Library/Python/3.9/share/jupyter ``` I figured this out by running `pip3 uninstall nbconvert`...
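If it helps anyone debug a similar setup: one way to check which directories nbconvert actually searches for templates is to print Jupyter's data path (a sketch, assuming `jupyter_core` is installed, which it is wherever Jupyter is). The symlink works because it puts the Homebrew share directory onto this search path:

```python
# Sketch: list the data directories Jupyter searches, to confirm the
# symlinked ~/Library/Python/3.9/share/jupyter path is on the list.
from jupyter_core.paths import jupyter_path

for p in jupyter_path():
    print(p)
```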
@ZonglinY @richarddwang Sorry for the download problems. It should be fixed now. My server was running out of space due to 128GB of google cloud logs. Ideally the zip file...
@SeanVody and everyone else: I am delighted to announce that, in cooperation with the-eye.eu, bookcorpus now has a reliable, stable download link that I expect will work for years to...
@jorditg It's mostly English, but if anyone discovers a trove of foreign .epub files, please DM me. I am quite interested in doing various foreign language versions. By the way,...
My main question is: were the OpenAI models trained with `<|endoftext|>` (a single token separating each document), or with the multi-token sequence the BPE encoder produces when it encodes that string literally?
Update: It turns out the answer is that OpenAI trained their models by separating texts using the single token `<|endoftext|>`, whereas most fine-tuning code is based on nshepperd's repo (https://github.com/nshepperd/gpt-2) which...
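To make the distinction concrete, here is a minimal sketch using the `tiktoken` package (an assumption on my part; the original discussion predates it, but it ships the same GPT-2 vocabulary):

```python
# Sketch: GPT-2 reserves a single id (50256) for the special <|endoftext|>
# token, but encoding the same characters as plain text yields several tokens.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# As the special token: one id.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]

# As a literal string: the BPE splits it into ordinary tokens, none of
# which is 50256.
print(enc.encode("<|endoftext|>", disallowed_special=()))
```

Fine-tuning code that builds its training stream by encoding the literal string never emits token 50256, which would explain the mismatch with how OpenAI trained.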
Okay, I'll look into this when I have some time. Your comments were thorough and helpful; thank you. > In particular, terminating the service because a websocket client got disconnected...
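For illustration only, a hedged sketch of the pattern that comment argues for (written against the `websockets` package, not the project's actual code): a client disconnect should end that one connection, not the whole service.

```python
# Sketch: treat a websocket client disconnect as a normal per-connection
# event; the server keeps serving other clients.
import asyncio
import websockets

async def handler(ws):
    try:
        async for message in ws:
            await ws.send(message)  # echo; real per-connection logic goes here
    except websockets.ConnectionClosed:
        # One client dropping is expected; log and move on rather than exit.
        print("client disconnected; service continues")

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run until externally cancelled

if __name__ == "__main__":
    asyncio.run(main())
```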