training-material
TODOs from Barcelona
An issue for collecting things we notice during the 2020 Galaxy Admin Training in Barcelona that need to be fixed
- [x] Decide how to handle variables (set custom vs role defined vs we're defining this because we'll re-use it later in vars like job_conf location)
- [x] In Ansible tutorial
- [x] Slides: include `files/` on dir tree on "roles" slide
- [x] Use "hello, universe" or "hello, galaxy :rocket:" instead of "hello, world" 😜
- [x] When speaking of templates, explain that the `.j2` suffix is used to indicate that this is a template file in Jinja 2 format. After filling the template with the variable values, we copy the file to its remote destination without the `.j2` suffix
- [x] "Other stuff": we should use a list of packages under the `package` module's `name` option rather than `with_items`
- [x] In the same place as above, we should give another example for looping that uses `loop` instead of `with_items`
- [x] Next to where we recommend `geerlingguy` Ansible Galaxy roles, we should also recommend `galaxyproject` and `usegalaxy_eu`.
- [x] `when: service_conf.changed` should be `when: service_conf is changed`
- [x] YAMLize fenced code blocks in "Notifying Handlers"
- [ ] In Galaxy Installation with Ansible tutorial
- [x] "This role is found in Ansible Galaxy (no relation - it is Ansible’s ) as galaxyproject.galaxy."
- [x] "The official recommendation is that you should have a variables file such as a `group_vars/galaxy.yml` for storing all of the Galaxy configuration." - should be `group_vars/galaxyservers.yml`.
- [x] `galaxy_config_style` should default to `yaml` in the role
- [x] Remove Question from point 3 of "Hands-on: Minimal Galaxy Playbook" (same question is in point 4)
- [x] Hands-on: (Optional) Launching uWSGI by hand is duplicated
- [ ] `galaxy.yml` should not be world readable (but to change this, the config needs to be readable by group `galaxy`)
- [x] Add restart handler directly to the usegalaxy_eu.galaxy_systemd role.
- [ ] Not in the tutorial but we tried making `admin_users` a list, which doesn't work (I thought we ran the value of this through `util.listify()` but maybe not?)
- [ ] We should run a job and have students look in `/data`
- [ ] We should consider having students set `cleanup_job: never` and looking in `/srv/galaxy/jobs` (or maybe set it to `onsuccess` and run a job that would surely fail, e.g. due to missing dependencies)
- [ ] Fix for CentOS 7 post-Py3 (galaxyproject/ansible-galaxy#110)
- [ ] Does galaxyproject.postgresql properly handle the state of the Ubuntu systemd-instanceized postgresql service? A couple of people running the playbook on their own VMs had issues where PostgreSQL was down due to misconfiguration and the "make sure it's running task" apparently passed ok even though it was down? Yeah pretty sure it doesn't galaxyproject/ansible-postgresql#23
- [x] In ephemeris slides:
- [x] Remove "Dependency resolvers" slide (covered in Tool Shed slide deck)
- [x] Move "Suites - more repos in one" to Tool Shed slide deck
- [x] Move "Example config entry" slide for `integrated_tool_panel.xml` after or into "Toolpanel management" slide
- [x] Merge Tool Shed slide deck with the bloated one in the dev topic (see also below)
- [ ] User, Groups, Quotas
- [ ] Maybe merge some/all "Production" slides (`library_import_dir` is here)
- [ ] Could definitely use a Data Library tutorial - link to data, etc.
- [ ] An example showing how groups/roles/permissions and associations work would be good
- [x] Object store:
- [x] Remove database table details if unnecessary?
- [x] Do we really need to quote `galaxy_config` booleans as strings (i.e. `"True"`, `"False"`)??
- [ ] In cluster:
- [x] Dependency resolvers slides are rehash of earlier stuff
- [x] Slurm should use `--ntasks=1 --cpus-per-task=4` rather than `--ntasks=4`
- [x] Make slurm part of galaxy playbook? (maybe show tags?)
- [x] vars to group_vars
- [ ] Stop re-using IDs between sections (aka don't use the same values for runner IDs, destination IDs, job resource IDs, etc.)
- [ ] https://github.com/galaxyproject/galaxy/issues/9485
- [x] Update `map_resources.py`: https://gist.github.com/natefoo/bbcfc162fad83cbc31bc98d82dbfd1c8
- [x] ~~Use standalone vars for DTD config and job resource param file paths (as is done with job config file path) and rearrange these copy boxes so they're in the same order as the job config file one (actually - fenced diff block here is probably preferable so you can see that you're adding to existing vars - should do this across tutorials modifying group vars, including interactive tools, CVMFS)~~ see "Decide how to handle..."
- [ ] I don't think we ever actually explain what dirs/paths need to be cluster accessible(!!!) I believe the full list (in galaxyproject.galaxy vars) is `galaxy_shed_tools_dir, galaxy_tool_dependency_dir, galaxy_file_path, galaxy_job_working_directory, galaxy_server_dir, galaxy_venv_dir`. We should probably update the Installing tutorial to put these all on some distinct path (e.g. `/data`, but rename to `/clusterFS` or something). And maybe there should be a layout in galaxyproject.galaxy that does this.
- [ ] CVMFS/Ref data
- [ ] Make proper tutorial of this
- [ ] BioBlend:
- ~[ ] Move the Jupyter notebook from https://github.com/nsoranzo/bioblend-tutorial/ to a `files` directory~ Given that Binder seems to clone the entire GitHub repository, it seems better to keep the notebooks in a separate small repo.
- [ ] Write a small tutorial with links to run the notebooks on Binder
- [x] Object store
- [x] Many slides are duplicates with Maintaining and others, the remainder are fairly junk, only the last 2 are about object store.
- [x] Use "dot notation" for dictionary access in template vars (a few other tutorials as well)
- [x] Document object store `max_percent_full`
- [ ] Pulsar
- [ ] We should probably set transport_timeout (a PulsarRESTJobRunner plugin param) so that it is more resilient to connection timeouts. Also document this if it's not in `job_conf.xml.sample_advanced`
- [ ] General
- [ ] Diff the client directory and only rebuild whenever the client directory changes https://github.com/galaxyproject/ansible-galaxy/issues/107
- [x] Monitoring w/ gxadmin
- [x] Needs the gxadmin group vars (they are defined in a different tutorial (Grafana)).
- [ ] Troubleshooting
- [ ] Make a `split_logging` config var that automatically sets up `filename_template` logging as described in advanced logging configuration
- [ ] Replace `job_runner_name` with (or add column) `handler` in `gxadmin query job-info`
- [ ] Add pgcleanup support to galaxyproject.galaxy
- [ ] Add tmpwatch support to galaxyproject.galaxy
- [ ] tmpwatch on other caches (object store cache for instance)
- [ ] Add backup managed configs support to galaxyproject.galaxy
- [ ] TIaaS
- [ ] Create some short intro slides (@shiltemann will do)
- [ ] Jenkins
- [ ] proxy_pass use variable
Not admin-related:
- [ ] In the dev topic, create a new slide deck for publishing Galaxy tools on the Tool Shed moving the corresponding slides from tool-integration and toolshed decks
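For the Ansible tutorial items in the checklist above (a package list under `name`, `loop` instead of `with_items`, and `is changed` instead of `.changed`), the revised examples could look roughly like the sketch below. The package names, paths, and the `myservice` service are illustrative assumptions, not taken from the tutorial.

```yaml
# Sketch only: package names, paths and "myservice" are hypothetical.
- hosts: all
  become: true
  tasks:
    - name: Install several packages with a list under `name`
      package:
        name:
          - git
          - make

    - name: Loop with `loop` rather than `with_items`
      file:
        path: "/tmp/{{ item }}"
        state: directory
      loop:
        - one
        - two

    - name: Copy a service config
      copy:
        src: files/myservice.conf
        dest: /etc/myservice.conf
      register: service_conf

    - name: Restart only when the config changed
      service:
        name: myservice
        state: restarted
      when: service_conf is changed
```

Passing the list directly to `name` lets the package manager resolve everything in one transaction, which is why it is preferable to looping over single packages.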
In the Galaxy installation with Ansible tutorial, the following part is duplicated:
Galaxy is now configured with an admin user, a database, and a place to store data. Additionally we’ve immediately configured the mules for production Galaxy serving. So we’re ready to set up supervisord which will manage the Galaxy processes!
Hands-on: (Optional) Launching uWSGI by hand
SSH into your server
Switch user to Galaxy account (sudo -iu galaxy)
Change directory into /srv/galaxy/server
Activate virtualenv (. ../venv/bin/activate)
uwsgi --yaml ../config/galaxy.yml
Access at port <ip address>:8080 once the server has started
@lldelisle thanks!
@lldelisle That was fixed already in #1810
validate job xml etc against the definition
In: https://training.galaxyproject.org/training-material/topics/admin/tutorials/connect-to-compute-cluster/tutorial.html#a-dynamic-destination Use different name for the group id
@lldelisle thanks, we added this as "Stop re-using IDs between sections (aka don't use the same values for runner IDs, destination IDs, job resource IDs, etc.)"
Writing in my own comment, lest any updates conflict or be overwritten
- Connect to compute
- [ ] validate job xml etc against the XML DTDs when possible
- Pulsar
- [x] Switch to MQ from http for py3 issues. also more 'real'. Don't need to secure the MQ since that's painful, but this would be enough.
- [x] Vault for secrets.
- Other
- [x] Rename https://github.com/galaxyproject/dagobah-training/ to https://github.com/galaxyproject/gat/
- [x] Rewrite the job_conf to use template from the start. Maybe everything should just go in templates? In case?
- gxadmin part 3
- [x] influxdb-client error?
- [x] move monitoring to group_vars/monitoring.yml (https://github.com/galaxyproject/training-material/pull/1827)
- [x] fix this https://github.com/galaxyproject/training-material/pull/1827#issuecomment-594631204
- [x] missing begin/endraw in monitoring in one part.
Typo in https://galaxyproject.github.io//training-material/topics/admin/tutorials/pulsar/tutorial.html#testing-pulsar
`journalctcl -fu galaxy` instead of `journalctl -fu galaxy`
Thanks, will be fixed by https://github.com/galaxyproject/training-material/pull/1822
Connect to compute Citing from the hands-on tutorial:
if the folder does not exist, create files/galaxy/config next to your playbook.yml (mkdir -p files/galaxy/config/)
The playbook name should probably change to galaxy.yml, since other tutorials reference it.
Thanks @ondrejme!
change the short help of local gxadmins:
https://training.galaxyproject.org/training-material/topics/admin/tutorials/gxadmin/tutorial.html
local_hello() { ## hello: Says hi
-> local_hello() { ## : Says hi
local_query-latest() { ## query-latest [jobs|10]: Queries latest N jobs (default to 10)
-> local_query-latest() { ## [jobs|10]: Queries latest N jobs (default to 10)
"Invalid username or password" when grafana starts, maybe due to: grafana_url: "https:///grafana/"
in https://training.galaxyproject.org/training-material/topics/admin/tutorials/monitoring/tutorial.html
Connect to compute Citing from the hands-on tutorial:
if the folder does not exist, create files/galaxy/config next to your playbook.yml (mkdir -p files/galaxy/config/)
The playbook name should probably change to galaxy.yml, since other tutorials reference it.
@ondrejme Thanks, it will be addressed by https://github.com/galaxyproject/training-material/pull/1829 .
In https://training.galaxyproject.org/training-material/topics/admin/tutorials/ansible-galaxy/tutorial.html#postgresql
At the beginning of the tutorial (when setting postgres) we had in group_vars/galaxyservers.yml
# Python 3 support
pip_virtualenv_command: /usr/bin/python3 -m virtualenv # usegalaxy_eu.certbot, usegalaxy_eu.tiaas2, galaxyproject.galaxy
certbot_virtualenv_package_name: python3-virtualenv # usegalaxy_eu.certbot
pip_package: python3-pip # geerlingguy.pip
Then when we set galaxy_config and uwsgi the solution shows something which begins by:
# python3 support
pip_virtualenv_command: virtualenv
I guess this is not expected...
In the same solution, it is written:
galaxy_user: {name: galaxy, shell: /bin/bash, home: "{{ galaxy_root }}"}
Whereas in the table above it is written:
{name: galaxy, shell: /bin/bash}
home: "{{ galaxy_root }}"}
Wow, @lldelisle you found it. It looks like I added it, a long time ago. I really don't know how that happened. Ok, amazing, thank you. We will make sure those snippets are in sync in the future.
I found a `journalctf -u galaxy -f` instead of `journalctl -u galaxy -f` in https://training.galaxyproject.org/training-material/topics/admin/tutorials/tiaas/tutorial.html#setting-up-tiaas
Fixed already in https://github.com/galaxyproject/training-material/pull/1836 , thanks for reporting anyway!
gxit - leading spaces in paste
https://github.com/galaxyproject/training-material/pull/1842
Hands-on: Enabling Interactive Tools in Galaxy
Step3:
I would suggest changing the order of "id" and "destination" in
Step4:
interactivetools_enable: "True"
remove quotation marks and make the capital letter small
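A minimal sketch of the Step 4 suggestion, assuming the option lives under `galaxy_config.galaxy` in the group vars as in the other tutorials:

```yaml
# group vars sketch; the nesting under galaxy_config.galaxy is an assumption
galaxy_config:
  galaxy:
    interactivetools_enable: true   # plain YAML boolean instead of the string "True"
```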
in https://training.galaxyproject.org/training-material/topics/admin/tutorials/ansible-galaxy/tutorial.html
If you do not want to use SSL, I guess you also need to change templates/nginx/galaxy.j2
because:
# Listen on port 443
listen *:443 ssl default_server;
Will not work, right?
@lldelisle If you changed this to `listen *:80 default_server;`, you should also move this template from `nginx_ssl_servers` to `nginx_servers`, remove `redirect-ssl` from `nginx_servers`, and comment `nginx_ssl_role`. You would also need to remove `/etc/nginx/sites-enabled/redirect-ssl`. You could do this with a `pre_task` like:
- name: Remove redirect-ssl config
  file:
    path: /etc/nginx/sites-enabled/redirect-ssl
    state: absent
Many thanks... So the only thing missing from the training material is: change
# Listen on port 443
listen *:443 ssl default_server;
to
# Listen on port 80
listen *:80 default_server;
If you ran the playbook once with redirect-ssl before deciding not to use SSL, remove the file `/etc/nginx/sites-enabled/redirect-ssl`.
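Putting the variable changes described in this exchange together, the group vars for a non-SSL setup might look like the sketch below. It assumes the nginx template is listed as `galaxy` (matching `templates/nginx/galaxy.j2`); the commented-out `nginx_ssl_role` value is also an assumption, so comment out whatever your file actually sets.

```yaml
# group vars sketch for serving without SSL (names other than the
# variable keys are assumptions)
nginx_servers:
  - galaxy              # moved here from nginx_ssl_servers; redirect-ssl removed
# nginx_ssl_servers:
#   - galaxy
# nginx_ssl_role: usegalaxy_eu.certbot   # assumed value
```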
In https://training.galaxyproject.org/training-material/topics/admin/tutorials/connect-to-compute-cluster/tutorial.html you wrote:
Add a post_task to your playbook to install slurm-drmaa1 (Debian/Ubuntu) or slurm-drmaa (RedHat/CentOS), and additionally include the galaxyproject.repos role
Then maybe you could use:
post_tasks:
  - name: Install slurm-drmaa1 if Debian
    package:
      name: slurm-drmaa1
    when: ansible_os_family == "Debian"
  - name: Install slurm-drmaa if RedHat
    package:
      name: slurm-drmaa
    when: ansible_os_family == "RedHat"
(If I understood well...)
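An alternative sketch that does the same in a single task, picking the package name with an inline Jinja2 conditional. It assumes facts are gathered and that only the Debian and RedHat families need to be covered:

```yaml
post_tasks:
  - name: Install the Slurm DRMAA library for the detected OS family
    package:
      # slurm-drmaa1 on Debian/Ubuntu, slurm-drmaa otherwise (RedHat/CentOS assumed)
      name: "{{ 'slurm-drmaa1' if ansible_os_family == 'Debian' else 'slurm-drmaa' }}"
```

Two separate tasks with `when:` are easier to read for newcomers; the single task keeps the playbook output shorter.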
To myself: ansible_python.version.major
combination of statements and opinions from @natefoo @Slugger70 @mvdbeek @nsoranzo @hexylena and @shiltemann, synthesized into one summary/todo list.
Barcelona
This training was fantastic! And incredibly strange, things worked! Like flawlessly nearly. We got through 5 days of content in 3. We had to come up with an extra 2 days.
A notable difference this time was how many students tried to run the playbooks immediately on their own infrastructure, either from the start on their own VMs, or after class on their own infra. Despite asking everyone to run it on the VM, we also had a couple of people brave enough to run from their own laptop, mostly without issue.
All around a great set of participants! But it led us to focus on the areas where we need to improve the materials.
Seeing the Effects
From @natefoo:
an idea I had: two column design on the tutorials where one column is the things you do in ansible and the other column is the effects it has on the system
this latest training went well but at times it felt very black-boxish, "just run these things and voila!"
For something like the ansible tutorial we could show a
$ cat /tmp/test.txt
some contents
In something like the galaxy tutorial we'd show all the changes to the system that each step makes. I'd say something like the latest commit on the release_XX.YY branch has been cloned to /srv/galaxy/server
In order to reduce how much it needs to be updated, we will just use this in the first two trainings where we need to show this effect (ansible, ansible-galaxy).
The students can then see the differences Ansible is making and gain the understanding they need to troubleshoot, since things never always "just work", especially when running on varied or outsourced hardware, with the large variety in quality of tools etc.
"Real exercises"
We noted that a few students had issues with how ansible really works, variables being set in different places, which changes have which effects. So we're considering adding "real" exercises or hide a bit more the answers for some of the ones we already have.
It's a tough balance to strike. For most of the questions & answers in ansible-galaxy, they're awful, they ask "how does your final config look" and everyone just copies that. Maybe we should rewrite them as "Here is the config." and ask better questions??? "what does this do?" "what effect will that have?"
We should show the students Ansible Best Practices at some point? Before the training? Or after the 1st day? https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html
And we should consider developing "Ansible - advanced" or an ansible "exam" (CTF?) for the students, saying "ok, now that you know ansible, accomplish these tasks"
I also think that sometimes "just re-run the playbook" isn't enough.. Figuring out why something has changed can sometimes be more important for the big picture than how to do it. (If that makes sense.)
Continuum
I think there's a continuum, at one end is "galaxy of a few years ago where people needed to be programmers/tool devs/admins together, and we needed to teach everything in detail so they could debug" and the other end is "galaxy (of the current/ future) where things mostly just work, and they can just deploy it and not care too much since the documentation / tutorials cover all of the main points, and they don't resort to low level debugging"
If we're really moving to the "just works" end, maybe we remove that detail from the curriculum because it doesn't benefit students vs a higher level picture.
I think if they're gonna go back and not use ansible it's good to show "here's what this production deployment looks like" so they can adapt it for their own purposes
We sympathise with "ought to get an in-depth understanding", but:
- some things aren't learned without real life experiences
- it's difficult to synthesise useful lessons from our myriad experiences. see the troubleshooting slides, we can't summarise in some bullet points, admin is a huge topic which requires understanding across a large number of fields (linux, networking, kernel, python, c code, etc.) and that isn't something taught in a day or even a week
- the people there want to solve their problem, and "their problem" seems to mostly be "get galaxy running on weird hpc no.2345"
It's two sides of a coin... people coming to a week long training probably ought to come away with a pretty low level understanding - but we've also found that it's really difficult to teach that low level understanding, especially to folks who mostly aren't sysadmins.
Which leads us to the next question:
What is "A Galaxy Admin"
What should students come away from GAT knowing how to do?
- I hope when their Galaxy breaks they know what to do, or when they need to set up a fancier Galaxy server (like ours) they know where to start.
- If they've got Pulsar, that's a lot of it
- they should be able to resolve systems crashing (nginx, galaxy; "check the logs") and know where to look and which things could be at fault.
- If a tool is crashing they should know how to handle this and where to look (dependencies, inputs, check stderr, etc.)
- Should be able to set up a cluster (+find docs for other clusters)
- adding storage
- setting up monitoring
- Setting up data and sharing it
everything else is less important?
Splitting
We should include more on the splitting of roles amongst machines, and write them in a way they can be used as-is. E.g. transitioning from `ident` auth to network auth is complex (see next aside). A number of participants tried deploying the playbook on their own systems toward the end of the week and some struggled with getting the proper DB configuration.
So db on separate server as an example and how to setup the ansible to do things like that. And talk about production setups for a large user base in detail. The benefits of automation for larger setups and some examples of tool maintenance etc.
There are now, I think, two different places in the tutorials where we say "if we were really doing best practices we'd create a new group and put vars in a different group vars file"; maybe we should just do that.
I'd see the following splitting for the whole week:
- db
- galaxy (+proxy +slurm submit +tiaas)
- compute-central manager
- compute exec
- pulsar
- monitoring (influx/grafana)
In ansible-galaxy, only one split, db + galaxy; that sounds manageable. And it is a good place to introduce this concept of "here is where you can divide your infrastructure"
DB Auth
let's bind to 127, and use md5, and make everyone use passwords. I think that would be a positive change over ident magic. (I mean, I love ident, but, it's difficult to switch / not obvious for students)
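A rough sketch of what this could look like in the group vars, assuming the `postgresql_conf` and `postgresql_pg_hba_conf` variables of galaxyproject.postgresql and the `postgresql_objects_users` variable used in the tutorial. The `vault_galaxy_db_password` name is hypothetical and would need to be defined in a vault file.

```yaml
# Sketch only; vault_galaxy_db_password is a hypothetical vaulted variable.
postgresql_conf:
  - listen_addresses: "'127.0.0.1'"      # bind to localhost only
postgresql_pg_hba_conf:
  - host all all 127.0.0.1/32 md5        # password auth over TCP instead of ident

postgresql_objects_users:
  - name: galaxy
    password: "{{ vault_galaxy_db_password }}"

galaxy_config:
  galaxy:
    database_connection: "postgresql://galaxy:{{ vault_galaxy_db_password }}@127.0.0.1:5432/galaxy"
```

With the connection string already using a host and password, moving the database to a separate server later would mostly mean changing the address in these two places.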
Conclusion
- Setting splitting of input/output
- Split playbooks better
- Fix DB auth
- Ansible advanced exercise
- Libraries exercise
WIP implementation of the side-by-side discussed during admin debriefing
@annefou this might be interesting for you, too! Do you have any feedback on this? Authors have the choice of
- side-by-side (which automatically becomes a single column when screen becomes too narrow)
- or always vertical (second set of in/out)