Shawn Presser
Happy to announce that bookcorpus was just merged into huggingface's Datasets library as `bookcorpusopen`, thanks to @vblagoje: https://github.com/huggingface/datasets/pull/856 So, huggingface is officially supporting this dataset now. The Eye also seems...
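For anyone who wants to poke at it, loading it through the `datasets` library should be a one-liner. A minimal sketch (the `title`/`text` field names reflect the merged dataset script; treat them as an assumption, not something quoted from the PR):

```python
# Sketch: load the newly merged dataset via HF `datasets`.
from datasets import load_dataset

books = load_dataset("bookcorpusopen", split="train")
print(books[0]["title"])       # each record carries a book's title...
print(len(books[0]["text"]))   # ...and its full text as one string
```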
> This is something that I have been slowly piecing together. I have been gathering audiobooks and their text versions that are in the public domain (Project Gutenberg & LibriVox...
I ran into the exact same error, and I happened to figure out a workaround: ``` ln -s /opt/homebrew/share/jupyter ~/Library/Python/3.9/share/jupyter ``` I figured this out by running `pip3 uninstall nbconvert`...
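If it helps anyone debug a similar setup: one way to check which directories nbconvert actually searches for templates is to print Jupyter's data path (a sketch, assuming `jupyter_core` is installed, which it is wherever Jupyter is). The symlink works because it puts the Homebrew share directory onto this search path:

```python
# Sketch: list the data directories Jupyter searches, to confirm the
# symlinked ~/Library/Python/3.9/share/jupyter path is on the list.
from jupyter_core.paths import jupyter_path

for p in jupyter_path():
    print(p)
```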
@ZonglinY @richarddwang Sorry for the download problems. It should be fixed now. My server was running out of space due to 128GB of google cloud logs. Ideally the zip file...
@SeanVody and everyone else: I am delighted to announce that, in cooperation with the-eye.eu, bookcorpus now has a reliable, stable download link that I expect will work for years to...
@jorditg It's mostly English, but if anyone discovers a trove of foreign .epub files, please DM me. I am quite interested in doing various foreign language versions. By the way,...
My main question is: were the OpenAI models trained with `<|endoftext|>` (a single token separating each document), or with the multi-token sequence the BPE encoder produces when it encodes that string literally?
Update: It turns out the answer is that OpenAI trained their models by separating texts using the single token `<|endoftext|>`, whereas most fine-tuning code is based on nshepperd's repo (https://github.com/nshepperd/gpt-2) which...
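To make the distinction concrete, here is a minimal sketch using the `tiktoken` package (an assumption on my part; the original discussion predates it, but it ships the same GPT-2 vocabulary):

```python
# Sketch: GPT-2 reserves a single id (50256) for the special <|endoftext|>
# token, but encoding the same characters as plain text yields several tokens.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# As the special token: one id.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]

# As a literal string: the BPE splits it into ordinary tokens, none of
# which is 50256.
print(enc.encode("<|endoftext|>", disallowed_special=()))
```

Fine-tuning code that builds its training stream by encoding the literal string never emits token 50256, which would explain the mismatch with how OpenAI trained.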
Okay, I'll look into this when I have some time. Your comments were thorough and helpful; thank you. > In particular, terminating the service because a websocket client got disconnected...
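For illustration only, a hedged sketch of the pattern that comment argues for (written against the `websockets` package, not the project's actual code): a client disconnect should end that one connection, not the whole service.

```python
# Sketch: treat a websocket client disconnect as a normal per-connection
# event; the server keeps serving other clients.
import asyncio
import websockets

async def handler(ws):
    try:
        async for message in ws:
            await ws.send(message)  # echo; real per-connection logic goes here
    except websockets.ConnectionClosed:
        # One client dropping is expected; log and move on rather than exit.
        print("client disconnected; service continues")

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run until externally cancelled

if __name__ == "__main__":
    asyncio.run(main())
```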