Access to full move history of puzzle games or raw data dump
Hi,
I’d like to retrieve the full move history of all Lichess puzzle games. The API rate limits make it difficult to collect this data at scale.
Is there a raw dump or database export that includes puzzle games with their move histories, similar to the standard game database? If not, is there another recommended way to access this efficiently?
Thanks!
Hi
Well the games are available, technically speaking...
ATM there's no recommended way to get them. What do you need them for?
Thanks for the clarification. I understand that the games are available through the full archives. The issue on my side is that identifying only the puzzle-source games from the global database would require a large amount of additional filtering and processing, which I am trying to avoid.
My goal is to study how language models learn to play chess, and one of my ablations depends on having the full move history for the puzzle positions. Having a more direct way to access those games would be extremely valuable for this research.
Hi @taha-yassine. There is this older (~3M vs the current ~5.5M) puzzle dump which was compiled by @mcognetta and contains game information: https://github.com/mcognetta/lichess-combined-puzzle-game-db
We could look into adding the game information to the Hugging Face version of the data (https://huggingface.co/datasets/Lichess/chess-puzzles)
You can export games by ID, in batches of 300, with this endpoint: https://lichess.org/api#tag/Games/operation/gamesExportIds
The only limit is that you can only make one request at a time.
I reckon you should be able to download a lot of games with that.
> Hi @taha-yassine. There is this older (~3M vs the current ~5.5M) puzzle dump which was compiled by @mcognetta and contains game information: https://github.com/mcognetta/lichess-combined-puzzle-game-db

Thanks for the suggestion. Unfortunately, the MEGA download link doesn't seem to be working anymore.

> We could look into adding the game information to the Hugging Face version of the data (https://huggingface.co/datasets/Lichess/chess-puzzles)

I think this would indeed be very useful!

> You can export games by ID, in batches of 300, with this endpoint: https://lichess.org/api#tag/Games/operation/gamesExportIds
> The only limit is that you can only make one request at a time.
> I reckon you should be able to download a lot of games with that.
I wasn’t aware of that endpoint, thanks! I tested it and each request takes ~10 seconds, so fetching all ~5M games would take around 50 hours. That seems manageable as a workaround.
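For reference, here's roughly the kind of batching loop I have in mind (a minimal sketch, not tested at scale; it assumes the game IDs have already been pulled out of the GameUrl column of the puzzle CSV):

```python
import time
import requests

EXPORT_URL = "https://lichess.org/api/games/export/_ids"
BATCH_SIZE = 300  # documented maximum number of game IDs per request


def download_games(game_ids, out_path="puzzle_games.pgn"):
    """Fetch games in batches of 300, one request at a time, appending the PGNs to a file."""
    with open(out_path, "a", encoding="utf-8") as out:
        for i in range(0, len(game_ids), BATCH_SIZE):
            batch = game_ids[i : i + BATCH_SIZE]
            resp = requests.post(
                EXPORT_URL,
                params={"moves": "true"},  # include the full move history
                data=",".join(batch),      # comma-separated game IDs in the request body
                headers={
                    "Content-Type": "text/plain",
                    "Accept": "application/x-chess-pgn",
                },
            )
            resp.raise_for_status()
            out.write(resp.text + "\n")
            time.sleep(1)  # be conservative: only one request in flight at a time
```

At ~18,000 batches of 300 and roughly 10 seconds per request, that lines up with the ~50-hour estimate above.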
Hi @taha-yassine
I recovered my old laptop and was able to find the dataset that was behind the (now defunct) MEGA link. It is about 4.4 GB on disk when compressed and is a bit out of date (Sept 2022). However, it could still be helpful for you, since the game information should all still be the same. To get an up-to-date dataset, I think you will just need to:
- download the entire puzzle database
- join the puzzle database with the game-puzzle database I created (a rough sketch of this step is at the end of this comment)
  - update all of the puzzle information in my dataset with the new values (this is just a straight overwrite)
- extract the missing games
- download all of those games
- associate them with the missing puzzles in the dataset.
This would save you a lot of time (my dataset contains 2.9M of the 5.4M puzzles in the current dataset, and downloading the new dataset and updating those values can be done very quickly).
If you would find it useful, I am happy to work out a way to send it to you. You can see the dataset format in this README: https://github.com/mcognetta/lichess-combined-puzzle-game-db?tab=readme-ov-file#example.
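To make the join/overwrite step concrete, it could look roughly like this (a minimal sketch; the file and column names other than PuzzleId are assumptions for illustration, so check the README above for the actual format):

```python
import pandas as pd

# File and column names ("PGN", "combined_puzzle_game_db.csv") are assumptions
# for illustration; adjust them to match the actual files.
new_puzzles = pd.read_csv("lichess_db_puzzle.csv")         # current puzzle dump (~5.4M rows)
old_combined = pd.read_csv("combined_puzzle_game_db.csv")  # older combined dump (~2.9M rows)

# Steps 2 / 2.1: keep the up-to-date puzzle fields and pull in the game moves
# from the old combined dump where a matching puzzle exists (a straight
# overwrite of the puzzle information).
merged = new_puzzles.merge(
    old_combined[["PuzzleId", "PGN"]], on="PuzzleId", how="left"
)

# Steps 3-5: puzzles without an attached game still need their games downloaded
# (e.g. via the batch export endpoint discussed above).
missing = merged[merged["PGN"].isna()]
print(f"{len(missing)} puzzles still need their games fetched")
```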
Hi @mcognetta, it'd be awesome if you could upload it somewhere for me to download.
I've sent you a message on Twitter (I don't want to share the link here, though I am not sure if it would actually cause me any issues). I will leave it up for another ~week since it is using a lot of my storage.
If anyone else wants access, please feel free to ping me here.
Thanks everyone for the help! @cakiki I’ll leave the issue open in case you want to track a potential future update to the HF dataset as you suggested, but feel free to close it if you prefer.