chillzone First year scraper is outdated

Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.

Aug 01 '24 16:08 shikharish

Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.

Can you add the details of what has changed?

Aug 01 '24 17:08 harshkhandeparkar

@shikharish ?

Aug 06 '24 14:08 harshkhandeparkar

The format of the PDF from which the first-year timetable is scraped has completely changed, so the scraping logic needs to be changed.

Aug 06 '24 14:08 shikharish

The format of the PDF from which the first-year timetable is scraped has completely changed, so the scraping logic needs to be changed.

Can you send the new PDF?

Aug 06 '24 15:08 harshkhandeparkar

aut24.pdf

Aug 07 '24 06:08 shikharish

So, chillzone doesn't have proper data at the moment?

Sep 29 '24 02:09 proffapt

No.

Sep 29 '24 04:09 shikharish

Oh, so, when will we need to make the required changes?

Sep 29 '24 05:09 proffapt

Now, if possible, but chillzone uses very outdated PDF parsing libraries that require Python 3.7. @shikharish is there no alternative?

Sep 29 '24 09:09 harshkhandeparkar

The problem is not that it uses outdated parsing libraries. This year the whole format of the PDF was changed, so we need to write the scraper logic from scratch.

Oct 04 '24 11:10 shikharish

The problem is not that it uses outdated parsing libraries. This year the whole format of the PDF was changed, so we need to write the scraper logic from scratch.

The format change is fine, we can do it. We should focus on getting rid of the outdated libraries first. This is unmaintainable. Are there any alternatives?

Oct 05 '24 05:10 harshkhandeparkar

Not one I could find. camelot-py parses PDF to xlsx directly, which makes scraping easier, while other scrapers can convert to plain text/HTML.
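
For reference, a minimal sketch of the PDF-to-xlsx step described above. It assumes camelot-py and an Excel writer such as openpyxl are installed; the helper name `pdf_tables_to_xlsx` is hypothetical and is not taken from the chillzone code, and the default `pages="all"` is a guess at what the timetable PDF needs.

```python
from pathlib import Path


def pdf_tables_to_xlsx(pdf_path, xlsx_path, pages="all"):
    """Write every table camelot detects in pdf_path to one xlsx workbook.

    Hypothetical helper for illustration; returns the number of tables found.
    """
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # Imported lazily: camelot-py is a third-party dependency.
    import camelot

    # Uses camelot's default table-detection settings (an assumption;
    # the real timetable PDF may need different options).
    tables = camelot.read_pdf(str(pdf_path), pages=pages)

    # f="excel" writes all detected tables into a single workbook,
    # one sheet per table.
    tables.export(str(xlsx_path), f="excel")
    return len(tables)
```

With the attached PDF, this would be called as `pdf_tables_to_xlsx("aut24.pdf", "aut24.xlsx")`. Whether the detected tables line up with the new timetable layout would still need to be checked by hand.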

Oct 05 '24 20:10 shikharish

Not one I could find. camelot-py parses PDF to xlsx directly, which makes scraping easier, while other scrapers can convert to plain text/HTML.

Can we use something like LibreOffice to convert the PDF to a spreadsheet and then parse that using a recent library?

Oct 06 '24 09:10 harshkhandeparkar

LibreOffice can't do that afaik. We can try using the API of some online tool like ilovepdf, smallpdf...

Oct 06 '24 09:10 shikharish

What about onlyoffice?

Oct 06 '24 17:10 harshkhandeparkar

Don't think so.

Oct 07 '24 13:10 shikharish

Hmm, in that case we should write a Dockerfile to run the scraper in.

Oct 07 '24 19:10 harshkhandeparkar