
Crawl BookCorpus

Results 6 bookcorpus issues

You can download it here: <https://twitter.com/theshawwn/status/1301852133319294976?s=21>. It contains 18k plain text files, and the results are very high quality. I spent about a week fixing the epub2txt script, which you can...

I tried to download the BookCorpus data. So far I have only downloaded around 5000 books. Has anyone managed to get all the books? I ran into a lot of `HTTP Error: 403 Forbidden` errors. How...
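One common way to cope with intermittent `403 Forbidden` responses is to retry with exponential backoff. The sketch below is only an illustration: it assumes the errors come from rate limiting, which the issue does not confirm, and `fetch_with_backoff` with its parameters is a hypothetical helper, not part of the bookcorpus scripts.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, fetch=None, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on HTTP 403/429/503.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    if fetch is None:
        # Default fetcher: plain urllib GET returning the response body as bytes.
        def fetch(u):
            with urllib.request.urlopen(u) as resp:
                return resp.read()

    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Re-raise on other status codes, or once the retries are used up.
            if err.code not in (403, 429, 503) or attempt == retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

If the server blocks the client outright rather than throttling it, retrying will not help and the final `HTTPError` is raised to the caller.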

Specifically this line: https://github.com/soskek/bookcorpus/blob/05a3f227d9748c2ee7ccaf93819d0e0236b6f424/epub2txt.py#L149 ![image](https://user-images.githubusercontent.com/59632/91907459-c0908880-ec5e-11ea-8b6e-b47709ab85f8.png) When I tried to convert a book on TensorFlow to text using this script, I noticed chapter 1 was being repeated multiple times. The reason...

The download links provided for books3.tar.gz no longer work. Is there an updated host?

Hello, on `2022-12-17` I ran the script `download_list.py` with the number of pages changed to `31430`, which covered the last search page. Here is the updated [url_list.jsonl.zip](https://github.com/soskek/bookcorpus/files/10253541/url_list_20221217.jsonl.zip) There are `4544` entries...
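To check the entry count of such a list after unzipping it, a minimal sketch is to count the non-empty lines that parse as JSON. The function name `count_jsonl_entries` and the local file path are assumptions for illustration, and nothing is assumed about the fields inside each entry.

```python
import json


def count_jsonl_entries(path):
    """Count non-empty lines in a JSON Lines file, validating each as JSON."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            json.loads(line)  # raises ValueError if the line is not valid JSON
            count += 1
    return count
```

For example, `count_jsonl_entries("url_list.jsonl")` would return the number of entries, where the file name is whatever the archive extracts to.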