First year scraper is outdated

Open shikharish opened this issue 1 year ago • 17 comments

Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.

shikharish avatar Aug 01 '24 16:08 shikharish

Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.

Can you add the details of what has changed?

harshkhandeparkar avatar Aug 01 '24 17:08 harshkhandeparkar

@shikharish ?

harshkhandeparkar avatar Aug 06 '24 14:08 harshkhandeparkar

The format of the PDF from which the first-year timetable is scraped has changed completely, so the scraping logic needs to be changed.

shikharish avatar Aug 06 '24 14:08 shikharish

The format of the PDF from which the first-year timetable is scraped has changed completely, so the scraping logic needs to be changed.

Can you send the new PDF?

harshkhandeparkar avatar Aug 06 '24 15:08 harshkhandeparkar

aut24.pdf

shikharish avatar Aug 07 '24 06:08 shikharish

So, chillzone doesn't have proper data at the moment?

proffapt avatar Sep 29 '24 02:09 proffapt

No.

shikharish avatar Sep 29 '24 04:09 shikharish

Oh, so, when will we need to make the required changes?

proffapt avatar Sep 29 '24 05:09 proffapt

Now, if possible, but chillzone uses very outdated PDF parsing libraries that require Python 3.7. @shikharish, is there no alternative?

harshkhandeparkar avatar Sep 29 '24 09:09 harshkhandeparkar

The problem is not that it uses outdated parsing libraries. This year the whole format of the PDF was changed, so we need to write the scraper logic from scratch.

shikharish avatar Oct 04 '24 11:10 shikharish

The problem is not that it uses outdated parsing libraries. This year the whole format of the PDF was changed, so we need to write the scraper logic from scratch.

The format change is fine; we can handle it. We should focus on getting rid of the outdated libraries first, because the current setup is unmaintainable. Are there any alternatives?

harshkhandeparkar avatar Oct 05 '24 05:10 harshkhandeparkar

Not one I could find. camelot-py parses PDF to xlsx directly, which makes scraping easier, while other scrapers can convert to plain text/HTML.

shikharish avatar Oct 05 '24 20:10 shikharish
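
For context on the camelot-py route mentioned above, here is a minimal sketch of pulling tables out of a PDF and tidying the cells. It is not the chillzone scraper's actual code: `clean_rows` and `extract_tables` are hypothetical helper names, and the path passed in is whatever PDF you want to read. Only `camelot.read_pdf` and each table's `.df` DataFrame are camelot-py features; the cleaning step is an assumption about what the extracted cells tend to need (collapsed line breaks, stray spaces).

```python
from typing import List


def clean_rows(rows: List[List[str]]) -> List[List[str]]:
    """Collapse whitespace inside each cell and drop rows that are entirely empty."""
    cleaned = []
    for row in rows:
        # str.split() with no argument also splits on newlines inside a cell.
        cells = [" ".join(str(cell).split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_tables(pdf_path: str, pages: str = "all") -> List[List[List[str]]]:
    """Return every table camelot finds in the PDF as a list of cleaned rows."""
    # Imported lazily so clean_rows can be used and tested without camelot-py.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [clean_rows(table.df.values.tolist()) for table in tables]
```

With made-up cell values, `clean_rows([["Mon\n8-9", " MA101 "], ["", "  "]])` returns `[["Mon 8-9", "MA101"]]`. Mapping the cleaned rows to timetable slots still depends on the new PDF layout, which this sketch does not attempt.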

Not one I could find. camelot-py parses PDF to xlsx directly, which makes scraping easier, while other scrapers can convert to plain text/HTML.

Can we use something like LibreOffice to convert the PDF to a spreadsheet and then parse that using a recent library?

harshkhandeparkar avatar Oct 06 '24 09:10 harshkhandeparkar

LibreOffice can't do that afaik. We can try using the API of some online tool like ilovepdf, smallpdf...

shikharish avatar Oct 06 '24 09:10 shikharish

What about onlyoffice?

harshkhandeparkar avatar Oct 06 '24 17:10 harshkhandeparkar

Don't think so.

shikharish avatar Oct 07 '24 13:10 shikharish

Hmm, in that case we should write a Dockerfile to run the scraper in.

harshkhandeparkar avatar Oct 07 '24 19:10 harshkhandeparkar
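
Until a container pins the interpreter, a small guard at the top of the scraper can fail fast on the wrong Python. This is only a sketch: the 3.7 requirement comes from the comment earlier in this thread, and `REQUIRED_PYTHON`, `is_supported`, and `ensure_supported_python` are hypothetical names, not part of chillzone.

```python
import sys
from typing import Tuple

# Assumption taken from the thread: the PDF parsing dependencies need Python 3.7.
REQUIRED_PYTHON = (3, 7)


def is_supported(version: Tuple[int, int],
                 required: Tuple[int, int] = REQUIRED_PYTHON) -> bool:
    """Return True when the (major, minor) version matches the required one."""
    return tuple(version[:2]) == required


def ensure_supported_python() -> None:
    """Exit with a readable message when run on an unsupported interpreter."""
    if not is_supported(sys.version_info[:2]):
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in REQUIRED_PYTHON)
        raise SystemExit(f"This scraper expects Python {wanted}, found {found}.")
```

Calling `ensure_supported_python()` before importing the parsing libraries turns an obscure install or import error into a clear message; inside a container that already pins Python 3.7 it is a no-op.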