Performance checks and limits
I need to run performance benchmarks to see which import method is fastest, so I can better advise users for each scenario. Would importing 25 million torrents be faster via magnetico2database, or via magnetico2bitmagnet?
What are the limits of those scripts?
I'm currently running magnetico2database on my sqlite dump of 29 million torrents. Based on my work on https://framagit.org/Glandos/magnetico_merge, I've tried some things on https://github.com/Glandos/magnetico2bitmagnet/tree/pg_perf:
- use a single cursor, with all inserts in a single transaction
- remove all conflict sources by pre-dropping constraints (they must be re-created afterwards)
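The two tweaks above can be sketched roughly as follows. This is only a sketch, not the actual pg_perf code: the table and column names (`torrents`, `info_hash`, `name`, `size`) and the constraint name are hypothetical, and it assumes psycopg2 is available on the PostgreSQL side.

```python
from itertools import islice


def chunked(rows, size=10_000):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def bulk_import(conn, rows):
    """Insert all rows through a single cursor, committing once at the end.

    `conn` is a psycopg2 connection; the table, columns, and constraint
    name below are hypothetical stand-ins for the real schema.
    """
    # Imported here so the pure helper above stays usable without psycopg2.
    from psycopg2.extras import execute_values

    with conn.cursor() as cur:
        # Pre-drop the conflict source; it must be re-created afterwards.
        cur.execute(
            "ALTER TABLE torrents DROP CONSTRAINT IF EXISTS torrents_info_hash_key"
        )
        for batch in chunked(rows):
            execute_values(
                cur,
                "INSERT INTO torrents (info_hash, name, size) VALUES %s",
                batch,
            )
        cur.execute(
            "ALTER TABLE torrents ADD CONSTRAINT torrents_info_hash_key UNIQUE (info_hash)"
        )
    conn.commit()  # one transaction for the whole import
```

Batching the `execute_values` calls keeps the Python-side row buffer small even though everything still commits in one transaction.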
And honestly, I haven't seen any difference so far. My system is very old (Intel Atom D2550, 4GB RAM, Samsung QVO 4TB); the Python script is taking 1/3 of one CPU, and the PostgreSQL worker is taking 2/3 of another one.
I'm trying to see how far I can go, but one good feature to implement could be a "resume", to avoid restarting from scratch every time. Currently, tqdm's estimate is around 200 to 300 hours for the import 😆
You did all 29 million in a single transaction? You didn't run out of memory? 🤨
Off the top of my head, a resume function would be quite an issue: for a .sqlite database we can just say 'start at this index', but we'd also have to store the progress made so far somewhere.
For a list of .torrent files in a directory it would be harder and can't be done with an index: what if a file is removed or added?
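For the sqlite case, one way to sketch a resume would be a sidecar progress file holding the last committed `rowid`. This is a minimal sketch, not anything in the existing scripts: the progress-file format and the `torrents(info_hash, name)` schema are assumptions.

```python
import json
import os


def load_progress(path):
    """Return the last committed rowid, or 0 if no progress file exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["last_rowid"]
    except FileNotFoundError:
        return 0


def save_progress(path, last_rowid):
    """Record the last committed rowid atomically (write a temp file, then rename)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_rowid": last_rowid}, f)
    os.replace(tmp, path)


def resumable_rows(sqlite_conn, progress_path):
    """Yield (rowid, info_hash, name) from a magnetico dump, skipping rows
    already imported in a previous run. Table/column names are hypothetical."""
    start = load_progress(progress_path)
    cur = sqlite_conn.execute(
        "SELECT rowid, info_hash, name FROM torrents WHERE rowid > ? ORDER BY rowid",
        (start,),
    )
    yield from cur
```

The importer would call `save_progress` after each committed batch on the PostgreSQL side, so a crash loses at most one batch. For the .torrent-directory case, recording a set of already-processed file names instead of an index would survive files being added or removed mid-run.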
I think it took me about a week to import ~20 million torrents, but that's what I recall from around a year ago.
In the end, I ran the migration on my desktop machine, with a fresh PostgreSQL installation (AMD Ryzen 4750G, 16GB RAM, SSD). After a heavy start at roughly 500 it/s, the final result from tqdm was: 29015903/29015903 [49:12:23<00:00, 163.80it/s]
It's hard to know whether the initial import can be made any faster, though.