What is causing the bottleneck in large files?
Hi guys,
Thanks for developing this awesome project. The Markdown conversion works great!
While working with this project, I noticed that some of my bigger PDFs take a lot of time. I am not talking about 2x or 3x slower; it's at least 100x compared to the others. For example, a 300-page PDF can take 3-7 days, while a smaller file finishes in just 2-3 minutes.
Since GPU servers are brutally expensive, I used `pdftk test/test.pdf burst output test/test_page_%02d.pdf` to split my PDFs into individual pages. Then I used `NUM_DEVICES=1 NUM_WORKERS=3 marker_chunk_convert test test_mds` to convert all the pages individually and merged the results with a script (a rough sketch of this workflow is below, after my questions).
1. But I was wondering: does splitting the PDF like this affect the output quality in any way?
That PDF was around 300 pages and took 7 days to process; now it finishes in only 3-4 minutes.
2. Is there a way to optimize the core engine to do the same? Maybe this functionality already exists, and I missed it.
In addition, is it known what takes so long on big files? I tried to use your paid API, and after a while it kept returning 500 errors for some files; others I couldn't submit at all because the PDFs were bigger than 2500 pages. I didn't keep the request_ids to follow up on those requests, and it seems there is no way to get them back.
3. Or is there?
4. Would you please share some insights?
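For reference, here is a rough sketch of the workflow (the first two commands are the ones I ran; the merge step and the exact output layout of marker_chunk_convert are simplified assumptions):

```bash
# Rough sketch of the split -> convert -> merge workflow described above.
# Assumption: marker_chunk_convert leaves one Markdown file per page PDF
# somewhere under test_mds/ (the real layout may differ); the merge step
# simply concatenates those files in page order.

# 1. Split the PDF into single-page PDFs
pdftk test/test.pdf burst output test/test_page_%02d.pdf

# 2. Convert everything in test/ to Markdown under test_mds/
NUM_DEVICES=1 NUM_WORKERS=3 marker_chunk_convert test test_mds

# 3. Merge the per-page Markdown back into a single document
find test_mds -name 'test_page_*.md' | sort -V | xargs cat > test_merged.md
```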
Best, Ali.
Hi Ali - this is not behavior we've seen - are you converting on CPU? Do you mind sharing an example file?
Hi,
Here are some samples that take too long:
- https://www.nist.gov/publications/report-technical-investigation-station-nightclub-fire-appendices-nist-ncstar-2-volume-2
- https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nbsspecialpublication340.pdf
- https://www.etsi.org/deliver/etsi_ts/138100_138199/13810103/17.09.00_60/ts_13810103v170900p.pdf
- https://www.etsi.org/deliver/etsi_TS/136500_136599/13652303/17.05.00_60/ts_13652303v170500p.pdf
I tried on different machines:
- CPU: AMD, between 250 and 370 cores
- GPUs: 18x RTX 4090, 8x H200, or 8x H100
- RAM: 1 TB to 2 TB
- Storage: dedicated NVMe SSDs the whole time, sometimes in RAID 0, which is usually faster than a single drive.
Best, Ali.