TODOs from Barcelona

Open natefoo opened this issue 4 years ago • 30 comments

An issue for collecting things we notice during the 2020 Galaxy Admin Training in Barcelona that need to be fixed

  • [x] Decide how to handle variables (set custom vs role defined vs we're defining this because we'll re-use it later in vars like job_conf location)
  • [x] In Ansible tutorial
    • [x] Slides: include files/ on dir tree on "roles" slide
    • [x] Use "hello, universe" or "hello, galaxy :rocket:" instead of "hello, world" 😜
    • [x] When speaking of templates, explain that the .j2 suffix is used to indicate that this is a template file in Jinja 2 format. After filling the template with the variable values, we copy the file to its remote destination without the .j2 suffix
    • [x] "Other stuff": we should use a list of packages under the package module's name option rather than with_items
    • [x] In the same place as above, we should give another example for looping that uses loop instead of with_items
    • [x] Next to where we recommend geerlingguy Ansible Galaxy roles, we should also recommend galaxyproject and usegalaxy_eu.
    • [x] when: service_conf.changed should be when: service_conf is changed
    • [x] YAMLize fenced code blocks in "Notifying Handlers"
  • [ ] In Galaxy Installation with Ansible tutorial
    • [x] "This role is found in Ansible Galaxy (no relation - it is Ansible’s ) as galaxyproject.galaxy."
    • [x] "The official recommendation is that you should have a variables file such as a group_vars/galaxy.yml for storing all of the Galaxy configuration." - should be group_vars/galaxyservers.yml
    • [x] galaxy_config_style should default to yaml in the role
    • [x] Remove Question from point 3 of "Hands-on: Minimal Galaxy Playbook" (same question is in point 4)
    • [x] Hands-on: (Optional) Launching uWSGI by hand is duplicated
    • [ ] galaxy.yml should not be world readable (but to change this, the config needs to be readable by group galaxy)
    • [x] Add restart handler directly to the usegalaxy_eu.galaxy_systemd role.
    • [ ] Not in the tutorial but we tried making admin_users a list, which doesn't work (I thought we ran the value of this through util.listify() but maybe not?)
    • [ ] We should run a job and have students look in /data
    • [ ] We should consider having students set cleanup_job: never and looking in /srv/galaxy/jobs (or maybe set it to onsuccess and run a job that would surely fail, e.g. due to missing dependencies)
    • [ ] Fix for CentOS 7 post-Py3 (galaxyproject/ansible-galaxy#110)
    • [ ] Does galaxyproject.postgresql properly handle the state of the Ubuntu systemd-instanceized postgresql service? A couple of people running the playbook on their own VMs had issues where PostgreSQL was down due to misconfiguration and the "make sure it's running task" apparently passed ok even though it was down? Yeah pretty sure it doesn't galaxyproject/ansible-postgresql#23
  • [x] In ephemeris slides:
    • [x] Remove "Dependency resolvers" slide (covered in Tool Shed slide deck)
    • [x] Move "Suites - more repos in one" to Tool Shed slide deck
    • [x] Move "Example config entry" slide for integrated_tool_panel.xml after or into "Toolpanel management" slide
  • [x] Merge Tool Shed slide deck with the bloated one in the dev topic (see also below)
  • [ ] User, Groups, Quotas
    • [ ] Maybe merge some/all "Production" slides (library_import_dir is here)
    • [ ] Could definitely use a Data Library tutorial - link to data, etc.
    • [ ] An example showing how groups/roles/permissions and associations work would be good
  • [x] Object store:
    • [x] Remove database table details if unnecessary?
  • [x] Do we really need to quote galaxy_config booleans as strings (i.e. "True", "False")??
  • [ ] In cluster:
    • [x] Dependency resolvers slides are rehash of earlier stuff
    • [x] Slurm should use --ntasks=1 --cpus-per-task=4 rather than --ntasks=4
    • [x] Make slurm part of galaxy playbook? (maybe show tags?)
    • [x] vars to group_vars
    • [ ] Stop re-using IDs between sections (aka don't use the same values for runner IDs, destination IDs, job resource IDs, etc.)
    • [ ] https://github.com/galaxyproject/galaxy/issues/9485
    • [x] Update map_resources.py: https://gist.github.com/natefoo/bbcfc162fad83cbc31bc98d82dbfd1c8
    • [x] ~~Use standalone vars for DTD config and job resource param file paths (as is done with job config file path) and rearrange these copy boxes so they're in the same order as the job config file one (actually - fenced diff block here is probably preferable so you can see that you're adding to existing vars - should do this across tutorials modifying group vars, including interactive tools, CVMFS)~~ see "Decide how to handle..."
    • [ ] I don't think we ever actually explain what dirs/paths need to be cluster accessible(!!!) I believe the full list (in galaxyproject.galaxy vars) is galaxy_shed_tools_dir, galaxy_tool_dependency_dir, galaxy_file_path, galaxy_job_working_directory, galaxy_server_dir, galaxy_venv_dir. We should probably update the Installing tutorial to put these all on some distinct path (e.g. /data, but rename to /clusterFS or something). And maybe there should be a layout in galaxyproject.galaxy that does this.
  • [ ] CVMFS/Ref data
    • [ ] Make proper tutorial of this
  • [ ] BioBlend:
    • ~[ ] Move the Jupyter notebook from https://github.com/nsoranzo/bioblend-tutorial/ to a files directory~ Given that Binder seems to clone the entire GitHub repository, it seems better to keep the notebooks in a separate small repo.
    • [ ] Write a small tutorial with links to run the notebooks on Binder
  • [x] Object store
    • [x] Many slides are duplicates with Maintaining and others, the remainder are fairly junk, only the last 2 are about object store.
    • [x] Use "dot notation" for dictionary access in template vars (a few other tutorials as well)
    • [x] Document object store max_percent_full
  • [ ] Pulsar
    • [ ] We should probably set transport_timeout (a PulsarRESTJobRunner plugin param) so that it is more resilient to connection timeouts. Also document this if it's not in job_conf.xml.sample_advanced
  • [ ] General
    • [ ] Diff the client directory and only rebuild whenever the client directory changes https://github.com/galaxyproject/ansible-galaxy/issues/107
  • [x] Monitoring w/ gxadmin
    • [x] Needs the gxadmin group vars (they are defined in a different tutorial (Grafana)).
  • [ ] Troubleshooting
    • [ ] Make a split_logging config var that automatically sets up filename_template logging as described in advanced logging configuration
    • [ ] Replace job_runner_name with (or add column) handler in gxadmin query job-info
    • [ ] Add pgcleanup support to galaxyproject.galaxy
    • [ ] Add tmpwatch support to galaxyproject.galaxy
    • [ ] tmpwatch on other caches (object store cache for instance)
    • [ ] Add backup managed configs support to galaxyproject.galaxy
  • [ ] TIaaS
    • [ ] Create some short intro slides (@shiltemann will do)
  • [ ] Jenkins
    • [ ] proxy_pass use variable
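
As an illustration of the Ansible tutorial items near the top of this list (a package list instead of with_items, loop, the .j2 suffix, and `is changed`), here is a minimal sketch. The package names, paths, and the `myservice` service are hypothetical placeholders, not taken from the tutorial.

```yaml
- hosts: all
  become: true
  tasks:
    # Pass a list to the package module's name option instead of with_items
    - name: Install several packages in one task
      package:
        name:
          - git      # hypothetical package choice
          - make     # hypothetical package choice
        state: present

    # For modules without list support, loop replaces with_items
    - name: Create a few directories
      file:
        path: "/tmp/{{ item }}"
        state: directory
      loop:
        - one
        - two

    # The .j2 suffix marks a Jinja 2 template; the rendered file drops it
    - name: Deploy a config file from a template
      template:
        src: service.conf.j2         # hypothetical template name
        dest: /etc/service.conf
      register: service_conf

    # "is changed" is the test form of the older ".changed" attribute check
    - name: Restart only when the config file changed
      service:
        name: myservice              # hypothetical service name
        state: restarted
      when: service_conf is changed
```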

Not admin-related:

  • [ ] In the dev topic, create a new slide deck for publishing Galaxy tools on the Tool Shed moving the corresponding slides from tool-integration and toolshed decks

natefoo avatar Mar 02 '20 10:03 natefoo

In the Galaxy Installation with Ansible tutorial, the following part is duplicated:

Galaxy is now configured with an admin user, a database, and a place to store data. Additionally we’ve immediately configured the mules for production Galaxy serving. So we’re ready to set up supervisord which will manage the Galaxy processes!

Hands-on: (Optional) Launching uWSGI by hand

    SSH into your server
    Switch user to Galaxy account (sudo -iu galaxy)
    Change directory into /srv/galaxy/server
    Activate virtualenv (. ../venv/bin/activate)
    uwsgi --yaml ../config/galaxy.yml
    Access at port <ip address>:8080 once the server has started

lldelisle avatar Mar 02 '20 12:03 lldelisle

@lldelisle thanks!

natefoo avatar Mar 02 '20 13:03 natefoo

@lldelisle That was fixed already in #1810

nsoranzo avatar Mar 02 '20 13:03 nsoranzo

validate job xml etc against the definition

hexylena avatar Mar 04 '20 09:03 hexylena

In: https://training.galaxyproject.org/training-material/topics/admin/tutorials/connect-to-compute-cluster/tutorial.html#a-dynamic-destination Use different name for the group id

lldelisle avatar Mar 04 '20 09:03 lldelisle

@lldelisle thanks, we added this as "Stop re-using IDs between sections (aka don't use the same values for runner IDs, destination IDs, job resource IDs, etc.)"

natefoo avatar Mar 04 '20 09:03 natefoo

Writing in my own comment, lest any updates conflict or be overwritten

  • Connect to compute
    • [ ] validate job xml etc against the XML DTDs when possible
  • Pulsar
    • [x] Switch to MQ from http for py3 issues. also more 'real'. Don't need to secure the MQ since that's painful, but this would be enough.
    • [x] Vault for secrets.
  • Other
    • [x] Rename https://github.com/galaxyproject/dagobah-training/ to https://github.com/galaxyproject/gat/
    • [x] Rewrite the job_conf to use template from the start. Maybe everything should just go in templates? In case?
  • gxadmin part 3
    • [x] influxdb-client error?
    • [x] move monitoring to group_vars/monitoring.yml (https://github.com/galaxyproject/training-material/pull/1827)
    • [x] fix this https://github.com/galaxyproject/training-material/pull/1827#issuecomment-594631204
    • [x] missing begin/endraw in monitoring in one part.

hexylena avatar Mar 04 '20 10:03 hexylena

typo in https://galaxyproject.github.io//training-material/topics/admin/tutorials/pulsar/tutorial.html#testing-pulsar journalctcl -fu galaxy instead of journalctl -fu galaxy

lldelisle avatar Mar 04 '20 11:03 lldelisle

typo in https://galaxyproject.github.io//training-material/topics/admin/tutorials/pulsar/tutorial.html#testing-pulsar journalctcl -fu galaxy instead of journalctl -fu galaxy

Thanks, will be fixed by https://github.com/galaxyproject/training-material/pull/1822

nsoranzo avatar Mar 04 '20 11:03 nsoranzo

Connect to compute Citing from the hands-on tutorial:

if the folder does not exist, create files/galaxy/config next to your playbook.yml (mkdir -p files/galaxy/config/)

The playbook name should probably change to galaxy.yml, since other tutorials reference it.

ondrejme avatar Mar 04 '20 12:03 ondrejme

Thanks @ondrejme!

hexylena avatar Mar 04 '20 12:03 hexylena

change the short help of local gxadmins: https://training.galaxyproject.org/training-material/topics/admin/tutorials/gxadmin/tutorial.html local_hello() { ## hello: Says hi -> local_hello() { ## : Says hi

local_query-latest() { ## query-latest [jobs|10]: Queries latest N jobs (default to 10) -> local_query-latest() { ## [jobs|10]: Queries latest N jobs (default to 10)

lldelisle avatar Mar 04 '20 15:03 lldelisle

"Invalid username or password" when grafana starts, maybe due to: grafana_url: "https:///grafana/" in https://training.galaxyproject.org/training-material/topics/admin/tutorials/monitoring/tutorial.html

nsoranzo avatar Mar 04 '20 16:03 nsoranzo

Connect to compute Citing from the hands-on tutorial:

if the folder does not exist, create files/galaxy/config next to your playbook.yml (mkdir -p files/galaxy/config/)

The playbook name should probably change to galaxy.yml, since other tutorials reference it.

@ondrejme Thanks, it will be addressed by https://github.com/galaxyproject/training-material/pull/1829 .

nsoranzo avatar Mar 04 '20 16:03 nsoranzo

In https://training.galaxyproject.org/training-material/topics/admin/tutorials/ansible-galaxy/tutorial.html#postgresql At the beginning of the tutorial (when setting postgres) we had in group_vars/galaxyservers.yml

# Python 3 support
pip_virtualenv_command: /usr/bin/python3 -m virtualenv # usegalaxy_eu.certbot, usegalaxy_eu.tiaas2, galaxyproject.galaxy
certbot_virtualenv_package_name: python3-virtualenv    # usegalaxy_eu.certbot
pip_package: python3-pip                               # geerlingguy.pip

Then when we set galaxy_config and uwsgi, the solution shows something which begins with:

# python3 support
pip_virtualenv_command: virtualenv

I guess this is not expected...

lldelisle avatar Mar 04 '20 22:03 lldelisle

In the same solution, it is written: galaxy_user: {name: galaxy, shell: /bin/bash, home: "{{ galaxy_root }}"}

Whereas in the table above it is written: {name: galaxy, shell: /bin/bash}

lldelisle avatar Mar 04 '20 22:03 lldelisle

home: "{{ galaxy_root }}"}

Wow, @lldelisle you found it. It looks like I added it, a long time ago. I really don't know how that happened. Ok, amazing, thank you. We will make sure those snippets are in sync in the future.

hexylena avatar Mar 05 '20 06:03 hexylena

I found a journalctf -u galaxy -f instead of journalctl -u galaxy -f in https://training.galaxyproject.org/training-material/topics/admin/tutorials/tiaas/tutorial.html#setting-up-tiaas

lldelisle avatar Mar 05 '20 11:03 lldelisle

I found a journalctf -u galaxy -f instead of journalctl -u galaxy -f in https://training.galaxyproject.org/training-material/topics/admin/tutorials/tiaas/tutorial.html#setting-up-tiaas

Fixed already in https://github.com/galaxyproject/training-material/pull/1836 , thanks for reporting anyway!

nsoranzo avatar Mar 05 '20 11:03 nsoranzo

gxit - leading spaces in paste

hexylena avatar Mar 05 '20 13:03 hexylena

gxit - leading spaces in paste

https://github.com/galaxyproject/training-material/pull/1842

nsoranzo avatar Mar 05 '20 13:03 nsoranzo

Hands-on: Enabling Interactive Tools in Galaxy

Step 3: I would suggest changing the order of "id" and "destination" in the tag, as is done in the other tool-destination mappings.

Step 4: interactivetools_enable: "True" should drop the quotation marks and use a lowercase first letter.

ondrejme avatar Mar 05 '20 13:03 ondrejme

In https://training.galaxyproject.org/training-material/topics/admin/tutorials/ansible-galaxy/tutorial.html: if you do not want to use SSL, I guess you also need to change templates/nginx/galaxy.j2, because:

    # Listen on port 443
    listen        *:443 ssl default_server;

Will not work, right?

lldelisle avatar Mar 05 '20 13:03 lldelisle

@lldelisle If you changed this to listen *:80 default_server;, you should also move this template from nginx_ssl_servers to nginx_servers, remove redirect-ssl from nginx_servers, and comment nginx_ssl_role. You would also need to remove /etc/nginx/sites-enabled/redirect-ssl. You could do this with a pre_task like:

- name: Remove redirect-ssl config
  file:
    path: /etc/nginx/sites-enabled/redirect-ssl
    state: absent
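
The variable changes described above might look roughly like this in the group vars. This is only a sketch: the variable names come from the description above, the file name is the one used elsewhere in this thread, and the list contents are assumptions about what an existing setup contains.

```yaml
# group_vars/galaxyservers.yml (sketch; only the nginx-related variables)
nginx_servers:
  - galaxy              # moved here from nginx_ssl_servers; redirect-ssl removed

# SSL-related settings are commented out when serving over plain HTTP
# nginx_ssl_servers:
#   - galaxy
# nginx_ssl_role: (keep your existing value here, just commented out)
```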

natefoo avatar Mar 05 '20 14:03 natefoo

Many thanks... So the only thing which is missing in the training material is: change

    # Listen on port 443
    listen        *:443 ssl default_server;

to

    # Listen on port 80
    listen        *:80 default_server;

If you ran the playbook once with redirect-ssl before deciding not to use SSL, remove the file /etc/nginx/sites-enabled/redirect-ssl.

lldelisle avatar Mar 05 '20 16:03 lldelisle

In https://training.galaxyproject.org/training-material/topics/admin/tutorials/connect-to-compute-cluster/tutorial.html you wrote: "Add a post_task to your playbook to install slurm-drmaa1 (Debian/Ubuntu) or slurm-drmaa (RedHat/CentOS), and additionally include the galaxyproject.repos role". Then maybe you could use:

  post_tasks:
    - name: Install slurm-drmaa1 if Debian
      package:
        name: slurm-drmaa1
      when: ansible_os_family == "Debian"
    - name: Install slurm-drmaa if RedHat
      package:
        name: slurm-drmaa
      when: ansible_os_family == "RedHat"

(If I understood correctly...)

lldelisle avatar Mar 05 '20 17:03 lldelisle

To myself: ansible_python.version.major

lldelisle avatar Mar 06 '20 12:03 lldelisle

combination of statements and opinions from @natefoo @Slugger70 @mvdbeek @nsoranzo @hexylena and @shiltemann, synthesized into one summary/todo list.

Barcelona

This training was fantastic! And incredibly strange, things worked! Like flawlessly nearly. We got through 5 days of content in 3. We had to come up with an extra 2 days.

A notable difference this time was how many students tried to run the playbooks immediately on their own infrastructure, either from the start on their own VMs, or after class on their own infra. Despite asking everyone to run it on the VM, we also had a couple of people brave enough to run from their own laptop, mostly without issue.

All around a great set of participants! But it led us to focus on areas where we need to improve the materials.

Seeing the Effects

From @natefoo:

an idea I had: two column design on the tutorials where one column is the things you do in ansible and the other column is the effects it has on the system

this latest training went well but at times it felt very black-boxish, "just run these things and voila!"

For something like the ansible tutorial we could show a

$ cat /tmp/test.txt
some contents
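
For illustration, the Ansible column next to that output could hold a task like the one below. The task name and play layout are assumptions; only the file path and its contents come from the example above.

```yaml
- hosts: all
  tasks:
    # Writes the file whose contents the "effects" column would then display
    - name: Write a small test file   # hypothetical task name
      copy:
        content: "some contents\n"
        dest: /tmp/test.txt
```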

In something like the galaxy tutorial we'd show all the changes to the system that each step makes. I'd say something like the latest commit on the release_XX.YY branch has been cloned to /srv/galaxy/server

In order to reduce how much it needs to be updated, we will just use this in the first two trainings where we need to show this effect (ansible, ansible-galaxy).

The students can then see the changes Ansible is making and gain the understanding they need to troubleshoot, since things never always "just work", especially when running on varied or outsourced hardware, with tools of widely varying quality, etc.

"Real exercises"

We noted that a few students had issues with how ansible really works: variables being set in different places, which changes have which effects. So we're considering adding "real" exercises, or hiding the answers a bit more for some of the ones we already have.

It's a tough balance to strike. Most of the questions & answers in ansible-galaxy are awful: they ask "how does your final config look" and everyone just copies that. Maybe we should rewrite them as "Here is the config." and ask better questions??? "what does this do?" "what effect will that have?"

We should show the students Ansible Best Practices at some point? Before the training? Or after the 1st day? https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html

And we should consider developing "Ansible - advanced" or an ansible "exam" (CTF?) for the students, saying "ok, now that you know ansible, accomplish these tasks"

I also think that sometimes "just re-run the playbook" isn't enough. Figuring out why something has changed can sometimes be more important for the big picture than how to do it. (If that makes sense.)

Continuum

I think there's a continuum, at one end is "galaxy of a few years ago where people needed to be programmers/tool devs/admins together, and we needed to teach everything in detail so they could debug" and the other end is "galaxy (of the current/ future) where things mostly just work, and they can just deploy it and not care too much since the documentation / tutorials cover all of the main points, and they don't resort to low level debugging"

If we're really moving to the "just works" end, maybe we remove that detail from the curriculum because it doesn't benefit students vs a higher level picture.

I think if they're gonna go back and not use ansible it's good to show "here's what this production deployment looks like" so they can adapt it for their own purposes

We sympathise with "ought to get an in-depth understanding", but:

  • some things aren't learned without real life experiences
  • it's difficult to synthesise useful lessons from our myriad experiences. see the troubleshooting slides, we can't summarise in some bullet points, admin is a huge topic which requires understanding across a large number of fields (linux, networking, kernel, python, c code, etc.) and that isn't something taught in a day or even a week
  • the people there want to solve their problem, and "their problem" seems to mostly be "get galaxy running on weird hpc no.2345"

It's two sides of a coin... people coming to a week long training probably ought to come away with a pretty low level understanding - but we've also found that it's really difficult to teach that low level understanding, especially to folks who mostly aren't sysadmins.

Which leads us to the next question:

What is "A Galaxy Admin"

What should students come away from GAT knowing how to do?

  • I hope when their Galaxy breaks they know what to do, or when they need to set up a fancier Galaxy server (like ours) they know where to start.
  • If they've got Pulsar, that's a lot of it
  • they should be able to resolve systems crashing (nginx, galaxy; "check the logs") and know where to look and which things could be at fault.
  • If a tool is crashing they should know how to handle this and where to look (dependencies, inputs, check stderr, etc.)
  • Should be able to set up a cluster (+find docs for other clusters)
  • adding storage
  • setting up monitoring
  • Setting up data and sharing it

everything else is less important?

Splitting

We should include more on the splitting of roles amongst machines, and write them in a way they can be used as-is. E.g. transitioning from ident auth to network auth is complex (see next aside). A number of participants tried deploying the playbook on their own systems toward the end of the week and some struggled with getting the proper DB configuration.

So a DB on a separate server would be one example, along with how to set up the ansible to do things like that. We should also talk in detail about production setups for a large user base: the benefits of automation for larger setups, some examples of tool maintenance, etc.

There are now, I think, two different places in the tutorials where we say "if we were really doing best practices we'd create a new group and put vars in a different group vars file". Maybe we should just do that.

I'd see the following splitting for the whole week:

  1. db
  2. galaxy (+proxy +slurm submit +tiaas)
  3. compute-central manager
  4. compute exec
  5. pulsar
  6. monitoring (influx/grafana)

In ansible-galaxy, there would be only one split (db + galaxy), which sounds manageable. And it is a good place to introduce this concept of "here is where you can divide your infrastructure".
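
A rough sketch of what that single db + galaxy split could look like, as a YAML inventory followed by a two-play playbook. The group name `dbservers`, the file names, and the example host names are hypothetical; the role names are the ones mentioned elsewhere in this issue, plus galaxyproject.postgresql_objects, which I am assuming is the role used to create the database user and database.

```yaml
# hosts.yml (hypothetical inventory; replace the host names with your own)
all:
  children:
    dbservers:
      hosts:
        db.example.org:
    galaxyservers:
      hosts:
        galaxy.example.org:
---
# galaxy.yml: one play per group, so each group gets its own group_vars file
- hosts: dbservers
  become: true
  roles:
    - galaxyproject.postgresql
    - role: galaxyproject.postgresql_objects   # assumed role for users/databases
      become: true
      become_user: postgres

- hosts: galaxyservers
  become: true
  roles:
    - galaxyproject.galaxy
```

With this layout, database settings would live in group_vars/dbservers.yml and Galaxy settings in group_vars/galaxyservers.yml, which is the "new group, different group vars file" practice mentioned above.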

DB Auth

Let's bind to 127.0.0.1, use md5, and make everyone use passwords. I think that would be a positive change over ident magic. (I mean, I love ident, but it's difficult to switch / not obvious for students.)
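
A minimal sketch of what that could mean in group vars, assuming the `postgresql_conf` and `postgresql_pg_hba_conf` variables of galaxyproject.postgresql and the `postgresql_objects_users` / `postgresql_objects_databases` variables of galaxyproject.postgresql_objects behave as I remember them (please check the role READMEs). The `vault_galaxy_db_password` variable is a hypothetical vaulted secret.

```yaml
# Database side (e.g. a dedicated group vars file for the DB host)
postgresql_conf:
  - listen_addresses: "'127.0.0.1'"    # listen on loopback only
postgresql_pg_hba_conf:
  - host all all 127.0.0.1/32 md5      # password (md5) auth over TCP
postgresql_objects_users:
  - name: galaxy
    password: "{{ vault_galaxy_db_password }}"   # hypothetical vaulted variable
postgresql_objects_databases:
  - name: galaxy
    owner: galaxy

# Galaxy side: connect over TCP with the password instead of relying on ident
galaxy_config:
  galaxy:
    database_connection: "postgresql://galaxy:{{ vault_galaxy_db_password }}@127.0.0.1/galaxy"
```

Moving the database to a separate server later would then mostly be a matter of changing the listen address, the pg_hba.conf network, and the host in the connection string.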

Conclusion

  • Setting splitting of input/output
  • Split playbooks better
  • Fix DB auth
  • Ansible advanced exercise
  • Libraries exercise

hexylena avatar Mar 12 '20 12:03 hexylena

WIP implementation of the side-by-side discussed during admin debriefing

[image: WIP side-by-side layout]

hexylena avatar May 28 '20 09:05 hexylena

@annefou this might be interesting for you, too! Do you have any feedback on this? Authors have the choice of

  • side-by-side (which automatically becomes a single column when screen becomes too narrow)
  • or always vertical (second set of in/out)

hexylena avatar May 28 '20 09:05 hexylena