uoft-scrapers

Important dates scraper

Open arkon opened this issue 9 years ago • 9 comments

We should scrape the important dates info off of places like the Faculty of Arts & Science or UTM websites.

EDIT: This is a better list

arkon avatar Apr 17 '16 02:04 arkon

I'd love to give the UTM scraper a try. Is there anything I should know or read about before I start?

anderson202 avatar May 11 '16 20:05 anderson202

@anderson202 yes please! Give it a go and if you have any questions, we can answer them.

I have a very basic wiki here with information: https://github.com/cobalt-uoft/uoft-scrapers/wiki but it really isn't a lot. Have a look around at other scrapers to see what's up.

For this one, UTMDates as the scraper name sounds appropriate.

We can also discuss the schema format we want to go with. Any ideas?

qasim avatar May 11 '16 22:05 qasim

@qasim I'm definitely a newbie to this so I'm not too sure what the format should look like.

Basic info we need would be the date and the detailed information regarding the day. Maybe we can list which academic session the date falls in as well.

A quick question: how should the scraper function? Should it scrape everything it can for upcoming dates, or only a specific session or a specific date?

anderson202 avatar May 11 '16 22:05 anderson202

+1 on including the session, I'm thinking something like:

{
  "date":String,
  "session":String,
  "events":[String]
}

It looks like the UTM mobile site has links to two years' worth of data. I think the scraper can take a year parameter and then scrape <year>5 and <year>9 for the two sessions available.

For example (year = 2016):

  • Summer:
    • http://m.utm.utoronto.ca/importantDates.php?mode=full&session=20165
  • Fall/Winter:
    • http://m.utm.utoronto.ca/importantDates.php?mode=full&session=20169

Edit:

Looks like they actually have data since the 2010-11 school year - http://m.utm.utoronto.ca/importantDates.php?mode=full&session=20105
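
A minimal sketch of the year parameter idea above, assuming the `<year>5` / `<year>9` session codes keep working the way the example URLs suggest. The function name `session_urls` and the dictionary keys are hypothetical, not part of the existing scrapers:

```python
# Base URL taken from the example links above.
BASE_URL = "http://m.utm.utoronto.ca/importantDates.php"


def session_urls(year):
    """Return the Summer and Fall/Winter URLs for a given starting year.

    Assumes (from the links above) that a session code is the year
    followed by 5 for Summer or 9 for Fall/Winter.
    """
    return {
        "summer": "{}?mode=full&session={}5".format(BASE_URL, year),
        "fall_winter": "{}?mode=full&session={}9".format(BASE_URL, year),
    }


if __name__ == "__main__":
    for name, url in session_urls(2016).items():
        print(name, url)
```

Looping over a range of years (say 2010 onward) would then cover the older data as well, if that is wanted.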

kashav avatar May 11 '16 22:05 kashav

Wow I didn't even think of using the mobile site. It's so much cleaner.

I'll start working on it and see if I can contribute to this. Thanks.

Edit: @kshvmdn if I follow your format, wouldn't that return a bunch of files, one for each day? Would it be better to alter it in some way and return a file for each session instead?

For example, would this work? { "session":String, "dates": [{"date":String, "events":String}, ...] }

anderson202 avatar May 11 '16 22:05 anderson202

I'll take the UTSGDates scraper!

kashav avatar May 11 '16 23:05 kashav

@anderson202 That's actually what we want! Take a look at the athletics and shuttle scrapers, they work the same way.

I got started on the UTSG scraper and I found it might be better to use the following format instead:

"date":String,
"session":String,
"events":[{
  "end_date":String, // some go on for more than a single day (e.g. winter break)
  "campus":String,
  "description":String
}]

This will allow us to merge events across campuses for each date, like we do with the athletics scraper (take a look at this). The API ends up being a lot cleaner this way.
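
A rough sketch of what that merge could look like, assuming each campus scraper produces a list of documents in the format above. The `merge_by_date` helper and the sample data are made up for illustration and are not taken from the athletics scraper:

```python
from collections import OrderedDict


def merge_by_date(*campus_docs):
    """Merge per-campus date documents into one document per date.

    Each argument is a list of dicts shaped like the format above.
    Events for the same date are concatenated; the session of the
    first document seen for a date is kept (an assumption).
    """
    merged = OrderedDict()
    for docs in campus_docs:
        for doc in docs:
            entry = merged.setdefault(
                doc["date"],
                {"date": doc["date"], "session": doc["session"], "events": []},
            )
            entry["events"].extend(doc["events"])
    return merged


# Hypothetical sample data, only to show the shape of the result.
utm = [{
    "date": "2016-09-06",
    "session": "2016 Fall/Winter",
    "events": [{"end_date": "2016-09-06", "campus": "UTM",
                "description": "Classes begin"}],
}]
utsg = [{
    "date": "2016-09-06",
    "session": "2016 Fall/Winter",
    "events": [{"end_date": "2016-09-06", "campus": "UTSG",
                "description": "Classes begin"}],
}]

merged = merge_by_date(utm, utsg)
```

Because every event carries its own `campus` field, nothing is lost when the two lists are combined under one date.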

kashav avatar May 12 '16 15:05 kashav

I think I have the UTM scraper done, but I'm not sure how the JSON files should be named. The ones I currently have are simply named after the date (or period) of the event as shown on the mobile site. Should I change that to a specific format before making a pull request?

anderson202 avatar May 12 '16 21:05 anderson202

We use the ISO 8601 format for dates. It isn't too hard to convert regular dates to this format; we do it in a lot of our scrapers using Python's datetime module.

The files can take this date as the name.
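
Something along these lines should do it. This is only a sketch: the input format `"%B %d, %Y"` (e.g. "May 12, 2016") is a guess at what the mobile site shows and would need adjusting, and `to_iso` and `filename_for` are hypothetical helper names:

```python
from datetime import datetime


def to_iso(date_text, fmt="%B %d, %Y"):
    """Convert a human-readable date string to ISO 8601 (YYYY-MM-DD).

    The default format is an assumption about the scraped text.
    """
    return datetime.strptime(date_text, fmt).date().isoformat()


def filename_for(date_text, fmt="%B %d, %Y"):
    """Name the JSON file after the ISO 8601 date."""
    return to_iso(date_text, fmt) + ".json"


if __name__ == "__main__":
    print(filename_for("May 12, 2016"))
```

For events that span a period, the start date could name the file and the end date could go into `end_date`.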

kashav avatar May 12 '16 22:05 kashav