Open Science Utility Belt

Open bkatiemills opened this issue 9 years ago • 53 comments

Open Science 101

This is a session series introducing practical skills needed to get started in open science. A Mozilla Science Study Group can use this series to introduce open science over an academic semester.

Help Us Develop This Curriculum

We're trying to answer these questions:

  • What skills are needed to practice open science?
  • What's missing / What's unnecessary? (aiming for < 12 sessions over a semester)
  • What's out there now? References! If you've seen related material, send it over.

Let us know your thoughts in the comments!

Sessions

  1. Introduction: what & why
    • What skills will be part of this series?
      • working openly through the entire process (not just warehousing things on the web afterwards) in order to leverage collaboration
      • emphasizing legibility of research outputs for the sake of reuse & reproducibility
    • Why do these things matter?
      • lit review on citation benefits, efficacy benefits, retraction scandals & efficiency.
    • Sources: Working Open guide, TBD
  2. Open Data I: Standards & Legibility
    • What is an ontology?
    • How to effectively use data standards and make data legible?
    • Sources: TBD
  3. Open Data II: Clean Data
    • What is 'clean' vs 'dirty' data, and why do they matter?
      • how to keep data organized and easy to reuse at a later date (including in-house reuse); consider metadata, storage and formats.
    • Best practices for making a reusable dataset when no standard exists.
    • Sources: TBD
  4. Collaboration I: Version Control
    • Basic git, with an emphasis on getting to GitHub as a platform for sharing & collaboration.
    • Source: TBD
  5. Collaboration II: Roadmapping
    • How to lay out a project for effective collaboration.
    • Source: Working Open guide.
  6. Collaboration III: Code Review
    • How to set expectations for good contributions that lead to easy-to-review code
    • How to make the code review process fast and efficient
    • Source: Working Open guide, Code Review Teaching Kit
  7. Code Wrangling I: Sustainable Coding
    • Effective use of documentation.
    • Producing end-to-end analysis automation scripts (R, Python, Shell, or make); understanding of how a well-made automation script serves as 'living documentation'.
    • Sources: TBD
  8. Code Wrangling II: Testing
    • Writing test suites to ensure code quality & build trust to support reuse.
    • Sources: this lesson in Python, TBD in R.
  9. Code Wrangling III: Code Packaging
    • Making & distributing packages to support reuse & collaboration.
      • discussion of useful formalisms for organizing data & code in packages / repos
    • Sources: this lesson in Python, and this lesson in R
  10. Publishing & Communication I: Citation & Discoverability
    • Software & data citation
      • DOIs
      • comments on how this addresses discoverability of code & data
    • Authoring for the Web
      • markdown / knitr
      • metadata
    • Sources: Working Open guide, TBD
  11. Publishing & Communication II: The Research Cycle
    • Strategies for opening the entire research process:
      • Grant process
      • Online lab notebooks
      • blogging, twitter & social media
      • protocol publishing
      • study pre-registration
  12. Publishing & Communication III: Licensing
    • open access publishing
      • comments on impact on science in the Global South / decoupling access from privilege
    • Why are licenses necessary?
    • What can they do? What can't they do?
    • Which ones are the most important and how do they work?
    • How to choose a license, and the intersection of licensing and copyright
    • The importance of agreeing on a license explicitly and early on a collaboration
    • sources: TBD
  13. Change Making
    • how to champion change in real life?
    • what barriers are commonly encountered, and how to avoid them?
    • sources: https://speakerdeck.com/dsalo/changing-workflows
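
To make session 8 a little more concrete, here is a minimal sketch of what a first test suite might look like in Python. The `mean` function and the test names are hypothetical, made up for illustration and not taken from any of the lessons linked above. The tests are plain functions using `assert`, which a test runner such as pytest can collect when they live in a `test_*.py` file.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


def test_mean_of_typical_values():
    # A known input with a known answer.
    assert mean([1.0, 2.0, 3.0]) == 2.0


def test_mean_of_single_value():
    # Edge case: one value is its own mean.
    assert mean([5]) == 5


def test_mean_rejects_empty_input():
    # Edge case: an empty dataset should fail loudly, not return nonsense.
    try:
        mean([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

Each test pins down one expectation about the code, so a collaborator who changes `mean` later finds out immediately if they broke something that others rely on.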

bkatiemills avatar Jul 30 '15 23:07 bkatiemills

Thanks for this, Bill! Would be worth adding links to the Working Open guide here, and seeing if you could line up with some of the language and key categories there to strengthen / augment that work. That also may help with some of the verbiage issues (like "Programming" as a header - not sure that's the best term here, crisp up the language and minimize jargon).

Great start, and more comments to come!

kaythaney avatar Aug 04 '15 14:08 kaythaney

yep, these lessons are going to pull heavily from the WOG, once we agree the curriculum. Changed 'programming' -> 'code wrangling'.

bkatiemills avatar Aug 04 '15 15:08 bkatiemills

Suggested language for the beginning:


Open Science 101

This session series introduces practical skills needed to get started in open science. A Mozilla Science Study Group can use this series to introduce open science over an academic semester.

Help Us Develop This Curriculum

We're trying to answer these questions:

  • What skills are needed to practice open science?
  • What's missing / What's unnecessary? (aiming for < 12 sessions over a semester)
  • What's out there now? References! If you've seen related material, send it over.

Let us know your thoughts in the comments!

Sessions

...


Going through actual sessions now :) Really excited for this work!

abbycabs avatar Aug 05 '15 19:08 abbycabs

From the twitterverse:

@ttimbers suggests data storage & archiving - how to find data associated with a study, how to organize your own data for future reuse; also metadata, storage and formats. @minisciencegirl suggests organizing data, with useful naming schemes, structure etc.

bkatiemills avatar Aug 05 '15 19:08 bkatiemills

+1 for @minisciencegirl 's suggestion about naming schemes

Is there any reason to have the Open Data sections after Collaboration sections? It might flow better (in my mind) from data wrangling into setting the wrangled data free (i.e. Open Data), then to dive into collaborations/workflows/code review/version control. This change could also make the transition into publishing easier, as collaborations may lead to publications ... :pray:

taddallas avatar Aug 05 '15 21:08 taddallas

I'm thinking along the same lines as @taddallas re:flow. Teaching packages before git seems off to me.

abbycabs avatar Aug 05 '15 21:08 abbycabs

The ordering of 'Code Wrangling', 'Collaboration', 'Open Data' and 'Publishing & Communication' is actually just the order I thought of them in :)

So, how about the order:

  • 'Open Data'
  • 'Collaboration'
  • 'Code Wrangling'
  • 'Publishing & Communication'

bkatiemills avatar Aug 05 '15 21:08 bkatiemills

Looks good to me!

One tiny thing: I noticed that much of the material uses Python. Perhaps it would be worthwhile to also show some R examples, as some pretty solid tools for Open Science are built around R (e.g. reproducible analyses and manuscript writing with R Markdown, testing with testthat, etc.). This point is null if the course is designed to be Python-specific, or if you think there'd be too much overlap with the R utility belt you already have.

taddallas avatar Aug 05 '15 21:08 taddallas

Nope, R-flavoured implementations of these techniques are definitely something we want! Which will get used depends on the audience, but we definitely want both options for the Code Wrangling section. The current examples are Python for no other reason than I speak Python. That said, I think there was a packages in R lesson from UBC recently I can dig up and link here - if you have a good lesson for testing in R, send it on by!

bkatiemills avatar Aug 05 '15 21:08 bkatiemills

The only thing that is conspicuously missing to my mind is licenses - they are fundamental to open science and are relevant to all the sections above. I would think these are the most important aspects of licenses to cover:

  • Why are licenses necessary?
  • What can they do? What can't they do?
  • Which ones are the most important and how do they work?
  • How to choose a license
  • The importance of agreeing on a license explicitly and early on a collaboration

blahah avatar Aug 05 '15 22:08 blahah

@Blahah - totally agree, added your points to an additional section under 'publishing & communication' - thanks! One thing that would be super helpful in that section, is ideas for hands-on activities, and engaging ways to introduce things like licenses as well as code and data citation; definitely A-list important stuff, but runs the risk of turning into a really dry lecture about DOIs and copyright.

bkatiemills avatar Aug 05 '15 22:08 bkatiemills

I think a nice way to introduce licenses and citation is by doing a set of small hands-on data mining tasks. Introducing some frustrating scenarios that are solved by proper licensing and good data citation should be memorable. We just need a paper with great data but no license, and a paper that does something good with someone else's data but doesn't cite it properly.

blahah avatar Aug 05 '15 22:08 blahah

This may be expanding the scope a bit, but some topics that would have been helpful for me early on, before I really did much coding or had a solid project together, would have been:

  • Keeping and organizing an open, digital lab notebook
  • Searching, collecting, reading and annotating content for re-usability and collaboration

noamross avatar Aug 05 '15 23:08 noamross

@noamross could that first point fit with the social media unit?

I'd love to hear your ideas on your second point - to be honest, content aggregation is a pretty weak part of my own game, I've never found a method I really liked.

bkatiemills avatar Aug 05 '15 23:08 bkatiemills

Yes, lab notebook could go in social media, but there's a fair amount of the topic that isn't explicitly social: metadata/tagging of notes, formats and organization for searching, plain-text for posterity, etc.

On collecting content, I'm similar. I have a semi-working system of Mendeley + a collection of tagged plain-text notes, but I'm not sure how well it works in terms of collaboration. @cboettig and I once wrote a review together where we built an annotated bibliography using markdown + bibtex, but it felt more like a one-use hack than a system. Ideas from others would be welcome.

noamross avatar Aug 06 '15 00:08 noamross

Great suggestions so far.

I agree on the "importance of agreeing on a license explicitly and early", and thus think this should come at the beginning of the course and not at the end. As @Blahah mentioned, this should work well after some moments of reuse-rights-related frustration, which unfortunately remain all too easy to create.

One aspect that I am missing is an overview of where things are or are not open along the research cycle - we are making progress with making research outputs more widely available, but the research process is still mostly closed (save a few open notebooks), and funding is basically a dark corner (very few proposals are open, and basically no funding decisions).

Daniel-Mietchen avatar Aug 06 '15 00:08 Daniel-Mietchen

Working with collaborators who don't necessarily Get It about the whole "open" thing. This is one of the top questions I get whenever I talk open with people.

DOIs, and how they are not magic but are important. Data citation. Data journals and other data-publication venues. Data-use tracking and metrics, and how to use them to make a tenure case or a grant proposal stronger.

Where to get help shoring up your weak spots -- nobody can do everything!

Basic digital hygiene: backups, basic security, basic digital preservation (why "I'll put it on my website!" is a lousy idea long-term).

Navigating openness vs. privacy in human-subjects and other sensitive research.

How to use Excel, if you must, without making everyone else hate you. What to use when Excel stops being useful (stats packages, relational databases).

dsalo avatar Aug 06 '15 01:08 dsalo

Would love to see design of experiments, multiple testing corrections, and quality engineering (reducing variability) of experiments in the curriculum. (Happy to contribute on these subjects.)

tgardner4 avatar Aug 06 '15 05:08 tgardner4

To publishing: Digital object identifiers - their importance in citing and version control. (NOT RESTRICTED TO CrossRef's DOI)

@noamross I'm using the knitcitations from @cboettig on a daily basis. So if this is the result of your cooperation it certainly wasn't a one-time hack :)

I would underline the importance of learning markdown and using knitr when collaborating on scientific projects. The most important skills for me were:

  1. Statistics (Coursera courses)
  2. R programming
  3. Markdown
  4. Learning the pipeline: Markdown to Word and PDF, using the knitr package in RStudio with knitcitations and BibTeX
  5. Using Mendeley as bibliography database with quick search option (deadly useful)
  6. Putting my results on the OSF.io project page and sharing them with collaborators
  7. LaTeX <- but this is something extra

wolass avatar Aug 06 '15 09:08 wolass

Here is the program that we are developing for the MOOCSciNum "research practices in the digital age", with a strong focus on open research practices. Here is the enrollment page. It's in French, but we hope that participants will help us translate it into English.

Cheers

Célya

Plan du cours

Séance 0 : Recherche à l'ère du numérique : quelles transformations ? (séance d'introduction) [Interview] Numérique et Recherche [Screencast] Présentation du MOOC

Séance 1 : S'appuyer sur des ressources scientifiques existantes [Interview] Bibliothèque et numérique : quels défis et quels rôles à jouer ? [Screencast] Savoir gérer sa bibliographie seul ou en groupe avec Zotero

Séance 2 : Collecter/produire des données scientifiques [Interview] Numérique et collecte de données en santé [Screencast] Daydream : un exemple de collecte de données en ligne

Séance 3 : Traiter/analyser des données scientifiques [Interview] Données et numérique : quelles "réelles" transformations ? [Screencast 1] Recherche en neurogénétique : exemple d'utilisation de Python et de Github [Screencast 2] Analyse de données en épidémiologie avec R

Séance 4 : Archiver/partager des données scientifiques : données de santé, données sensibles [Interview 1] Des données partagées aux données ouvertes en recherche [Interview 2] Données de santé, données sensibles : quels droits ? Quelles protections ? [Screencast] Partage de données médicales anonymisées

Séance 5 : Partager ses résultats scientifiques : écrire et publier [Interview 1] Publier sa recherche à l'ère du numérique : Open Access [Interview 2] Droit d’auteur et licences Creative Commons : quelques précisions utiles avant de publier [Screencast 1] Déposer un article dans HAL

Séance 6 : Faire partie d'une communauté scientifique [Interview] Evaluer, être évalué : retour sur la "machinerie" de l'évaluation et ses évolutions [Screencast 1] Faire connaître ses activités de recherche : comparaison de Zenodo et Figshare [Screencast 2] Communiquer sur ses recherches : présence "en ligne".

Séance 7 : Nouvelles formes d’interaction en recherche et enjeux éthiques [Interview 1] Ouvrir le processus de recherche : des sciences citoyennes à la recherche participative [Interview 2] Ethique de la recherche à l'ère du numérique [Screencast 1] Blogs scientifiques

Celyagd avatar Aug 06 '15 10:08 Celyagd

Google translate + my eyes (not perfect, please make improvements).

My comments: This (below) is a broader overview of research & access. I think something more hands-on would work better in study group sessions. But I do think looking at the broader picture is a good move - thanks for adding DOI for code, licensing info.


Course Outline

Session 0: Research in the digital age: what transformations? (Introductory session) [Interview] Digital technology and research [Screencast] Overview of the MOOC

Session 1: Building on existing scientific resources [Interview] Libraries and digital technology: what challenges, and what roles to play? [Screencast] Managing your bibliography alone or in a group with Zotero

Session 2: Collecting / producing scientific data [Interview] Digital technology and health data collection [Screencast] Daydream: an example of online data collection

Session 3: Processing / analyzing scientific data [Interview] Data and digital technology: what "real" transformations? [Screencast 1] Neurogenetics research: an example of using Python and GitHub [Screencast 2] Epidemiological data analysis with R

Session 4: Archiving / sharing scientific data: health data, sensitive data [Interview 1] From shared data to open data in research [Interview 2] Health data, sensitive data: what rights? What protections? [Screencast] Sharing anonymized medical data

Session 5: Sharing your scientific results: writing and publishing [Interview 1] Publishing your research in the digital age: Open Access [Interview 2] Copyright and Creative Commons licenses: some useful clarifications before publishing [Screencast 1] Depositing an article in HAL

Session 6: Being part of a scientific community [Interview] Evaluating and being evaluated: a look back at the "machinery" of evaluation and how it has evolved [Screencast 1] Publicizing your research activities: a comparison of Zenodo and Figshare [Screencast 2] Communicating about your research: "online" presence

Session 7: New forms of interaction in research and ethical issues [Interview 1] Opening up the research process: from citizen science to participatory research [Interview 2] Research ethics in the digital age [Screencast 1] Scientific blogs

abbycabs avatar Aug 06 '15 14:08 abbycabs

I think that it would be a pity to only present the technical dimensions of open science in this course. Why not explain the values and ideals behind open science and even the tensions between the diverse conceptions of open science? The social and epistemological dimensions of open science? Many researchers are working on that too! For instance, the course could explain that a generalised open science will allow students and scientists from the Global South to participate better in the Global North scientific "conversations". Or, conversely, that it would allow researchers from the North to discover the science made in the Global South, therefore enlarging their social, epistemological and cultural horizons. It should also explain that open science could mean opening science to non-scientists (and not only industry), therefore bringing science and society closer and making science more relevant to local challenges. I hope that you do not intend to create a course which will only present the neo-liberal discourse of innovation typical of the knowledge economy paradigm, but that you will show the subversive strength of open science when it is associated with a clear awareness of the social, economic and political issues of our time. http://projetsoha.org

FlorencePIron avatar Aug 06 '15 15:08 FlorencePIron

Great stuff, all! Some responses:

@noamross & @Daniel-Mietchen : I've created a new section in the publishing & communication unit meant to focus on opening up the full research life-cycle, to Daniel's point; Noam, I think lab notebooks fit very nicely in there.

@dsalo : I love your idea about getting others on board with open practices; can you expand a bit, or point to some references? Great idea, but tbh I always just kind of did it and hoped to not get fired later (not a real solution :). As for DOIs (+ @wolass ), totally agree; they fit implicitly in code and data citation in my mind, but I've called them out specifically there now.

@tgardner4 : super valuable content, but can you expand a bit on how we can do this in a cross-disciplinary way? Many Study Groups have ecologists and physicists at the same table; experimental design procedures will diverge quite quickly!

@wolass we've done markdown + knitr lessons before, they were really popular! Rather than diving down a specific toolchain (might get too discipline-specific if we do that), what if we think about authoring for the web? The idea being to create content that is not simply on the web, but can be linked to, described by metadata, machine read, and consumed / distributed in 'webby' ways - I think that will touch a lot of what you mentioned, and fits into the unit on discoverability.

@Celyagd this is great stuff! Our plans hit on a lot of the same things, but what I'd be especially interested in is getting a better picture of the activities / projects that seem to be implied in your outline; Study Group is a very hands-on kind of thing, so coming up with illustrative projects is really important to this discussion.

@FlorencePIron great points all; we frame this work around directly applicable and practical skills, because that's what puts butts in seats in our experience. However, that framing does not at all preclude having the conversations you want to have; for example, the inability of university libraries in the Global South to subscribe to a full range of for-profit journals, and the abrupt loss of journal access by Greek academics during their recent budget crisis are things I would expect to see comments on in the Open Access Publishing section. Help inserting that broader cultural context as the curriculum comes into focus would be very welcome.

bkatiemills avatar Aug 06 '15 17:08 bkatiemills

@BillMills Experimental design is surprisingly general when you understand the core principles. A grandfather of the field (Fisher) developed his method in agricultural experiments (hence the term "split-plot" designs for some specific structures). Yet these same principles are routinely applied in engineering, biology and physics. Below is a potential outline for content:

Quality in Experimental Design - Draft Curriculum for Open Tutorial (C) Riffyn 2015

Objectives

  • Basic knowledge of statistics relevant to quality
  • Identifying and reducing errors
  • Designed experiments
  • Qualification of assays and processes

Additional objectives

  • Structuring data for statistical analysis
  • Process modeling / crossing scales
  • Goal setting / problem definition / setting requirements
  • Troubleshooting / root-cause analysis
  • Multiple testing
  • Control (maintaining target quality)

Content (summary)

PART 1 Why should I care? There’s gold at your feet and you don’t even know it. Instead you’re chasing phantoms.

Assessing Error: process modeling & variance components What are all the potential sources of error? How do they propagate?

Structuring data How do I organize and manipulate my data for analysis?

Statistical foundations What is the error on my measurements?

Testing Are two measurements different?

Multiple testing corrections Which measurements are different from each other, or from baseline?

Regression Which process variables really matter? Which ones don’t?

DoEs (root cause analysis) How can I learn the most with the least effort?

Outliers / Non-additive noise How do I handle the weird stuff?


PART 2 / ADDITIONAL SUBJECTS

Process capability, control How well does my process/assay perform? When is it falling apart?

Process modeling / goal setting How do I set the target performance?

Correcting sources of error I know what the problem is; how do I deal with it?

Assay qualification When is my assay “good”? (Putting all of the above together.)

References

  • Montgomery & Runger, Applied Statistics and Probability for Engineers, 5th Edition, Wiley & Sons, 2011.
  • Montgomery, Statistical Quality Control, 7th Edition, Wiley & Sons, 2012.
  • JMP Online Help
  • JMP Book: Design of Experiments Guide, version 12
  • R
  • Others to be added

tgardner4 avatar Aug 06 '15 17:08 tgardner4

@tgardner4 The above looks like a nice start on experimental design. It would be valuable in a general science curriculum, but is it specific to open science?

blahah avatar Aug 06 '15 20:08 blahah

I agree with Florence Piron: open science is not only open access, open data and open source. There is a social dimension of open science that brings society and people together with the sciences; this dimension also considers local knowledge and encourages citizen science, science shops and commons. So if you cannot integrate this dimension into your curriculum, it would be better to remove "open science" from your title.

thomasmboa avatar Aug 06 '15 20:08 thomasmboa

@Blahah As I see it, these topics are absolutely fundamental. If you can't produce a trustworthy data point, you can't share it. If you can't share it, you don't have open science.

Good coding practices are awesome, but if that code is processing rubbish data, it can only generate rubbish results. And sadly, none of the topics I outlined above are taught adequately in a general science curriculum. Just pick a random sampling of scientists and ask them: what is power, what is variance analysis, what is false discovery rate, and when/how should you apply them? Almost no one knows. And that means no one can truly trust each other's results.
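
As a rough illustration of the false discovery rate question, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. The function name, the default `fdr=0.05` and the example p-values are assumptions made up for illustration; they are not part of the outline above.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Sketch of the Benjamini-Hochberg procedure: sort the p-values, find the
    largest rank k with p_(k) <= (k / m) * fdr, and reject the k smallest.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest 1-based rank whose p-value sits under its own threshold.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from five independent comparisons.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
print(benjamini_hochberg(example_p_values))
# prints [True, True, False, False, False]
```

With these made-up numbers, four of the five p-values fall below an uncorrected 0.05 cutoff, but only the two smallest survive the correction.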

Don't hesitate to challenge my views if you disagree - these are born of two decades in the lab. But I'm very interested in alternative views!

tgardner4 avatar Aug 06 '15 21:08 tgardner4

@tgardner4 I agree that these topics are fundamental but also think they are somewhat out of scope for a ~12 lesson group-study on open science. There are, however, some important connections between experimental design and open science that could be addressed, such as:

  • How can an open scientific process facilitate checking and quality of methods?
  • How to maximize the transparency and auditability of the experimental design and statistical methods shared.
  • How to think about data quality at different stages of data publication
  • How and where to include quality checking information in your data and metadata

noamross avatar Aug 06 '15 21:08 noamross

@noamross I agree with your points. What I outlined is a study group unto its own. I think you propose a nice solution though - an intro to the topic (and perhaps a pointer to a separate study group dedicated to a full treatment). By including it in the open science curriculum - even as an intro - you would teach participants that these are core issues that can't be overlooked in proper scientific pursuit.

I would also suggest that the original open science outline described above (the very first post in this thread) is heavily tilted toward a view that open science = coding + publishing. Absent from this curriculum is anything about experimentation or the scientific process. My suggestions are a reaction to this gap. When I hear "science" my mind goes to experimentation. When I hear "open science" I think: "collaboration on the design, execution and sharing of experiments & results." Code and publishing are a necessary, but not sufficient, portion of the scientific process.

tgardner4 avatar Aug 06 '15 22:08 tgardner4

Taking comments from @tgardner4 @noamross and more, there might be more clarity if we change the title to:

Open Science & Data: open research practices when working with scientific data

This could be a follow up series after a broader 'Introduction to Open Science'.

abbycabs avatar Aug 07 '15 15:08 abbycabs