Gallatin Course Search Organization

Open A1Liu opened this issue 6 years ago • 1 comment

Discussion about filtering data from Gallatin course search 'API'.

Lectures?

The course search API doesn't really have lectures or recitations, and it doesn't seem like Gallatin does that stuff. I guess we can just classify everything at Gallatin as a lecture.

Sections?

The API doesn't provide section registration numbers, but some exploratory data analysis seems to imply that almost every course listing has only a single section:

import pandas as pd
from get_new_data import get_data # available in the gallatin adapter folder
df = pd.read_json(get_data('fall.json'))
df = df.drop('totalMatches',axis=0)
df = df.reset_index(drop=True)
print(df.shape) # (852, 18)

After loading in data, I checked out the values for the section attribute:

section_values = df['section'].value_counts()
print(section_values)
# Sec Count
# 1     841
# 2       8
# 31      1
# 30      1
# 4       1

Notice that almost all of them are section 1; the rest seem to be almost arbitrarily numbered. I tried to figure out what was going on:

mask = df['section'] == 2
weird_course_sections = df[mask]
print(weird_course_sections[['course','year']])
#            course  ...   year
# 89   WRTNG-UG1560  ...   2018
# 331  WRTNG-UG1560  ...   2017
# 555  WRTNG-UG1560  ...   2016
# 714  WRTNG-UG1560  ...   2015
# 970  WRTNG-UG1560  ...   2014
# 561   FIRST-UG102  ...   2016
# 949   CORE-GG2402  ...   2014
# 55     LEAVE-XX11  ...   2018

The problem persists when filtering by year:

year_mask = df['year']==2018
weird_course_sections_2018 = df[year_mask]['section'].value_counts()
print(weird_course_sections_2018)
# Sec Count
# 1     204
# 2       2
# 31      1
# 30      1
# 4       1

Finally, I was able to narrow the search down to something useful:

# pd.np was removed in pandas 2.0; combine the boolean Series with & instead
weird_names_2018_mask = (df['year'] == 2018) & (df['section'] >= 2)
weird_course_names_2018 = df[weird_names_2018_mask]['course'].value_counts()
print(weird_course_names_2018.index)
# Index(['LEAVE-XX11', 'WRTNG-UG1560'], dtype='object')

So 2 courses are responsible for the problem: LEAVE-XX11 and WRTNG-UG1560.

names = weird_course_names_2018.index
# pd.np was removed in pandas 2.0; combine the boolean Series with | instead
name_mask = (df['course'] == names[0]) | (df['course'] == names[1])
weird_courses = df[ name_mask ]
weird_year_mask = weird_courses['year'] == 2018
weird_courses_2018 = weird_courses[ weird_year_mask ]
print(weird_courses_2018)
#            course credit  ...                                  type  year
# 89   WRTNG-UG1560      4  ...   Advanced Writing Courses (WRTNG-UG)  2018
# 16     LEAVE-XX11      0  ...     Leaves and Sabbaticals (LEAVE-XX)  2018
# 55     LEAVE-XX11      0  ...     Leaves and Sabbaticals (LEAVE-XX)  2018
# 212    LEAVE-XX11      0  ...     Leaves and Sabbaticals (LEAVE-XX)  2018
# 224    LEAVE-XX11      0  ...     Leaves and Sabbaticals (LEAVE-XX)  2018
print(weird_courses_2018[['course','section']])
#            course  section
# 89   WRTNG-UG1560        2
# 16     LEAVE-XX11       31
# 55     LEAVE-XX11        2
# 212    LEAVE-XX11       30
# 224    LEAVE-XX11        4

I don't really know what to do from here - going to have to look further into how Gallatin structures its courses. Right now I want to simply filter out all the LEAVE-XX11 courses, which is probably fine. However, I don't know how to address the WRTNG-UG1560 course - I found that there are instances where multiple sections are offered in the same year:

print(weird_courses)
#            course                                  type  year
# 89   WRTNG-UG1560   Advanced Writing Courses (WRTNG-UG)  2018
# 331  WRTNG-UG1560   Advanced Writing Courses (WRTNG-UG)  2017
# 418  WRTNG-UG1560   Advanced Writing Courses (WRTNG-UG)  2017
# 555  WRTNG-UG1560   Advanced Writing Courses (WRTNG-UG)  2016
# 649  WRTNG-UG1560   Advanced Writing Courses (WRTNG-UG)  2016
# 714  WRTNG-UG1560   Advanced Writing Courses (WRTNG-UG)  2015
# 757  WRTNG-UG1560   Advanced Writing Courses (WRTNG-UG)  2015
# 970  WRTNG-UG1560   Advanced Writing Courses (WRTNG-UG)  2014
# 16     LEAVE-XX11     Leaves and Sabbaticals (LEAVE-XX)  2018
# 55     LEAVE-XX11     Leaves and Sabbaticals (LEAVE-XX)  2018
# 212    LEAVE-XX11     Leaves and Sabbaticals (LEAVE-XX)  2018
# 224    LEAVE-XX11     Leaves and Sabbaticals (LEAVE-XX)  2018

The problem is that there are no course registration numbers in the API! So with more than one section, the user doesn't know which section they're scheduling with. In the grand scheme of things, this is probably fine - we can probably just get Gallatin courses from the other adapter if it comes down to it. It's kinda annoying though.
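For reference, here is a rough sketch of the two steps discussed above: dropping the LEAVE-XX11 rows and listing which course/year pairs have more than one section. The DataFrame is made-up sample data that only mirrors the `course`, `section`, and `year` columns from the output above; it is not the real API response, and this is not necessarily how the adapter should do it.

```python
import pandas as pd

# Made-up sample rows (an assumption), shaped like the columns printed above.
df = pd.DataFrame({
    'course':  ['WRTNG-UG1560', 'WRTNG-UG1560', 'LEAVE-XX11', 'LEAVE-XX11', 'FIRST-UG102'],
    'section': [1, 2, 31, 2, 1],
    'year':    [2017, 2017, 2018, 2018, 2018],
})

# Drop the leave/sabbatical placeholder rows.
courses = df[df['course'] != 'LEAVE-XX11']

# Count the distinct section numbers each course has in each year.
sections_per_year = courses.groupby(['course', 'year'])['section'].nunique()

# Keep only the course/year pairs that offer more than one section.
multi_section = sections_per_year[sections_per_year > 1]
print(multi_section)
```

Anything left in `multi_section` is a course where a user would have to pick a section without a registration number to go on.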

Basically, Gallatin is weird

A1Liu, Nov 05 '18 04:11

@A1Liu Hmmm interesting!! Do we know anyone who attends Gallatin that could tell us how they register for courses? Maybe they just use the course and section number....

And multiple sections in the same year definitely isn't a problem, as long as there are no duplicate section numbers in the same term. I don't see any reason why we can't include WRTNG-UG1560. If we can get course registration numbers, awesome. If not, I think it is still useful as long as we have the section numbers. Right now registration numbers are required in the schema, but we can potentially work around that if we have to.
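A quick way to check the "no duplicate section numbers in the same term" condition might look like the sketch below. It assumes each data file covers a single term, so `year` stands in for the term, and it uses made-up rows (the duplicated CORE-GG2402 entry is invented purely to show a collision), so treat it as an illustration rather than a result.

```python
import pandas as pd

# Made-up sample rows (an assumption); the repeated CORE-GG2402 row is invented.
df = pd.DataFrame({
    'course':  ['WRTNG-UG1560', 'WRTNG-UG1560', 'CORE-GG2402', 'CORE-GG2402'],
    'section': [1, 2, 1, 1],
    'year':    [2017, 2017, 2014, 2014],
})

# Candidate stand-in for a registration number.
key = ['course', 'year', 'section']

# keep=False flags every row in a duplicated group, not just the later repeats.
collisions = df[df.duplicated(subset=key, keep=False)]
print(collisions)
```

If `collisions` comes back empty on the real data, the (course, year, section) combination is unique and could potentially serve as the identifier when registration numbers are missing.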

Bad-Science, Nov 06 '18 03:11