bookcorpus
Crawl BookCorpus
You can download it here: https://twitter.com/theshawwn/status/1301852133319294976?s=21. It contains 18k plain text files. The results are very high quality. I spent about a week fixing the epub2txt script, which you can...
I tried to download the BookCorpus data. So far I have downloaded only around 5000 books. Has anyone managed to get all the books? I ran into a lot of `HTTP Error: 403 Forbidden` responses. How...
Specifically this line: https://github.com/soskek/bookcorpus/blob/05a3f227d9748c2ee7ccaf93819d0e0236b6f424/epub2txt.py#L149 When I tried to convert a book on TensorFlow to text using this script, I noticed chapter 1 was being repeated multiple times. The reason...
The download links provided for books3.tar.gz no longer work. Is there an updated host?
Hello, on `2022-12-17` I ran the script `download_list.py` with the page number changed to `31430`, which covered the last search page. Here is the updated [url_list.jsonl.zip](https://github.com/soskek/bookcorpus/files/10253541/url_list_20221217.jsonl.zip). There are `4544` entries...