Centralize OSF data links
It seems like poor design to have the links to datasets peppered throughout the tutorials. One solution would be a CSV with columns "dataset" and "url".
Then if we change the URL for a dataset, it only needs to be updated in one spot. It also means that if the China team wants to store our data in China, we can just change one file.
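For concreteness, the file could look something like this (the dataset names and URLs below are just placeholders):

```csv
dataset,url
steinmetz,https://osf.io/xxxxx/download
stringer,https://osf.io/yyyyy/download
```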
Would you be OK with this PR? If so, I can work on the changes.
Seems like an OK idea, but we have a rule that the tutorials can't use pandas, so downloading the centralized CSV file and parsing it is going to be a pain.
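Roughly what the no-pandas version would look like in each notebook (the CSV URL below is just a placeholder for wherever the file would live):

```python
import csv
import io
import urllib.request

# Placeholder URL for the centralized dataset -> url CSV
DATA_LINKS_URL = "https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/data_links.csv"

def load_data_links(url=DATA_LINKS_URL):
    """Download the CSV and return a dict mapping dataset name -> download URL."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return {row["dataset"]: row["url"] for row in csv.DictReader(io.StringIO(text))}
```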
I'm +0 on it. Good in principle, but work on the tutorials is fairly distributed right now, and it has the potential to cause conflicts. For our workflow, I don't think it saves much effort, though I appreciate how it might make your life easier.
why was pandas banned?
Is a utils.datafetch module allowed that would have a function that took a dataset name and returned an ndarray (or whatever type is specified in the CSV)?
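Something along these lines is what I'm imagining (the module name, lookup table, and URL are hypothetical; in practice the table would be parsed from the centralized CSV):

```python
# Hypothetical utils/datafetch.py
import io
import urllib.request

import numpy as np

# Illustrative only; would be built from the centralized CSV
_DATA_URLS = {
    "steinmetz": "https://osf.io/xxxxx/download",  # placeholder
}

def fetch(dataset):
    """Download the named dataset and return its contents as numpy arrays."""
    with urllib.request.urlopen(_DATA_URLS[dataset]) as response:
        buffer = io.BytesIO(response.read())
    return np.load(buffer, allow_pickle=True)
```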
I agree. I think this will save a lot of time for localization.
There are only about 10-20 places in the tutorials that involve datasets, and changing them doesn't require much time or complex logic.
In addition, some datasets are downloaded and used directly, while others are loaded from a local copy first when one is available. I don't think that inconsistency is great.
why was pandas banned?
Not my decision, but the scope of libraries introduced to the students was intentionally limited to numpy/scipy/matplotlib/scikit-learn to avoid overwhelming them. This is most easily enforced by not having pandas in the environment file that the tests run against.
Is a utils.datafetch module allowed that would have a function that took a dataset name and returned an ndarray?
We've discussed a few times implementing a centralized NMA helper function library, but those conversations never converged. With the Colab workflow, you'd need to pip install it in each notebook.
So either we need to (A) copy-paste a little helper function everywhere we want to use this pattern, or (B) stand up a centralized utility library while the course is live. I'm -1 on (B), and concerned about the potential for subtle bugs in (A).
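Where by (A) I mean something like the following repeated near the top of every notebook that loads data (the filename and URL are placeholders; each notebook would hard-code its own):

```python
import os
import urllib.request

import numpy as np

fname = "steinmetz_part0.npz"          # placeholder filename
url = "https://osf.io/xxxxx/download"  # placeholder URL

# Download the file only if it isn't already present locally
if not os.path.isfile(fname):
    urllib.request.urlretrieve(url, fname)

data = np.load(fname, allow_pickle=True)
```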
Got it. I would have voted for a central NMA helper module on PyPI.
ok, we will stick with our find/replace scripts.
Got it. I would have voted for a central NMA helper module on PyPI.
Agreed, I was a proponent (although we'd be iterating on that just as fast as on the tutorials themselves, so pushing releases to PyPI would have been a limiting factor, and versioning could have been a nightmare. You can point pip at GitHub, but we'd have needed a whole separate CI pipeline, etc.). Still strongly in favor of bringing that online for version 2.0, once we have a better sense of common patterns across the tutorials. Robust/abstract data loading is an obvious one.
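For reference, pointing pip at GitHub would just mean each Colab notebook starting with something like this (the repo name is hypothetical):

```python
!pip install git+https://github.com/NeuromatchAcademy/nma-utils.git
```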
There is a solution that doesn't (I think) have any of the problems you mentioned, @mwaskom. Instead of putting the code in a module that needs to run in each notebook, we just have a public service like https://dataurl.neuromatchacademy.org?dataset=steinmetz&location=us and https://dataurl.neuromatchacademy.org?dataset=steinmetz&location=china that returns the URL of where to fetch the data. The service can even point back to a CSV file in the course-content.
It is a bad solution in that the service is a new dependency, but it is so simple that it could just be an nginx configuration.
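To make the behavior concrete, here is the whole service sketched in Python (the lookup table and URLs are placeholders; in practice an nginx redirect map would do the same job):

```python
# Illustration only: the redirect logic the service would implement
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Hypothetical lookup table; could be generated from a CSV in course-content
DATA_URLS = {
    ("steinmetz", "us"): "https://osf.io/xxxxx/download",        # placeholder
    ("steinmetz", "china"): "https://example.cn/steinmetz.npz",  # placeholder
}

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        key = (query.get("dataset", [""])[0], query.get("location", ["us"])[0])
        target = DATA_URLS.get(key)
        if target is None:
            self.send_error(404, "unknown dataset/location")
        else:
            self.send_response(302)               # redirect the caller to the data URL
            self.send_header("Location", target)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectHandler).serve_forever()
```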