Add config ansible-workshops-test - migration from ansible/workshops
SUMMARY
Migrate the RHEL workshop from ansible/workshops.
This config provisions, but is not "finished", since we need to validate it works.
/cc @IPvSean, we can use this as the base for further planning
ISSUE TYPE
- New config Pull Request
COMPONENT NAME
ansible-workshops-test
ADDITIONAL INFORMATION
This is `ansible-workshops-test` as it includes all workshops. I've only tested RHEL, which needs to be split out. Most code will need to be shared though, since all workshops have the same structure.
The workshop provisions in 10 minutes, which is an improvement. The launch command is in the readme.
I've had to take considerable liberties with agnosticd's framework, since the existing workshops do some things differently, mainly in how the inventory is built up. Most things are identical to what worked before, since I tried to break as little as possible. Machine provisioning is, of course, completely rewritten into CloudFormation. EC2 is the only supported provider because of the extremely tight coupling.
The code is a mess, I'm well aware. Cleanups will happen once I know no major functionality changes are needed, to avoid premature optimization.
I initially wanted to do a staged approach to migrating the workshops, but the logic that needed to be ported was too complex to be split up into nice, logical parts. In the end, I figured the workflow downsides outweighed the benefits of smaller, but still >2 kLOC, PRs.
Major non-standard components, which are also the main things to pay attention to here, are:
- `default_vars.yml`, where I dumped, in order, all variable files from ansible/workshops. Duplicates aplenty; this is 800 LOC with gratuitous whitespace.
- `ami_find/*`, which are the existing AMI filtering playbooks.
- `autoincluded_vars_workshop_instances.yml`, a Jinja2 template of variables that is templated out and included in `pre_infra.yml`, featuring multiple levels of templating and lazy evaluation (see the sketch after this list).
  - This is where instances for each workshop are configured.
  - 1.3 kLOC.
  - The schema is documented inside the file, but never validated.
  - This file is very ugly, but it works. Any improvements I could make would need to be propagated through the whole ansible/workshops structure and sometimes require knowledge of why workshops are set up the way they are.
- `ec2_cloud_template.j2` is not the "standard" agnosticd template; it is completely rewritten.
  - VPCs, subnets, security groups and the like are mostly hardcoded.
  - Instances are created in a loop controlled by what is in `autoincluded_vars_workshop_instances.yml`.
  - There is no bastion host, and the SSH keypair is handled differently.
  - Likely more that I just don't remember off the top of my head.
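For orientation, the "templated out and included" mechanism mentioned above is roughly the following pattern. This is a sketch only; the destination path and task names are illustrative, not the actual `pre_infra.yml` code:

```yaml
# Sketch of the include mechanism, not the real tasks.
- name: Render the per-workshop instance definitions (multi-level Jinja2)
  template:
    src: autoincluded_vars_workshop_instances.yml
    dest: "{{ output_dir }}/workshop_instances_rendered.yml"   # illustrative path

- name: Load the rendered definitions as variables for later plays
  include_vars:
    file: "{{ output_dir }}/workshop_instances_rendered.yml"
```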
TODOs:
- Split out the RHEL workshop (then others).
- Verify whether the RHEL workshop actually works (not just that it provisions).
- Manually port any relevant changes made in ansible/workshops since the commit I based this migration on.
@tonykay would appreciate a review if possible, thanks!
Taking a look right now @abenokraitis - thanks for all the work here
Adding some comments inline. Because it is a config, and effectively sandboxed, we can merge with no impact to others. One potential issue I see is that there are a number of hard-coded names and filters which we would expect to be variables, so we can update them without a PR each time. I will add some comments to the code.
I've resolved most of the things we've talked about. Changes are in separate commits to make reviewing easier. Let me know if I need to squash anything.
AMIs are now completely parametrized and resolved in a loop, so variable overrides will work. Variables are also split into the "defaults", which normally wouldn't be changed, and the "sample vars", which are things you'd likely want to or need to customize. I think that's the way they were intended to be used.
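To illustrate what "parametrized and resolved in a loop" means in practice, a minimal sketch; the variable names and filters below are made up for the example, not the actual config:

```yaml
# Minimal sketch only; the real config's variable names differ.
- name: Resolve workshop AMIs in a loop so every filter can be overridden
  hosts: localhost
  gather_facts: false
  vars:
    workshop_ami_name_filters:        # any entry overridable via -e / sample vars
      rhel: "RHEL-8.4*_HVM-*-x86_64-*"
      windows: "Windows_Server-2019-English-Full-Base-*"
  tasks:
    - name: Look up the AMIs matching each (overridable) name filter
      ec2_ami_info:
        region: "{{ aws_region | default('us-east-1') }}"
        filters:
          name: "{{ item.value }}"
      loop: "{{ workshop_ami_name_filters | dict2items }}"
      register: resolved_amis
```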
The config was also renamed from `ansible-workshops-test` to `ansible-workshops`, with the other existing `ansible-workshops*` configs removed, since they'd be duplicates.
I couldn't manage to get things working with bastion hosts, i.e. with the built-in inventory builder. I saw no way of making host connections direct, without proxying through the bastion host. This means the `ansible-playbook` command still needs to include `--skip-tags create_inventory,create_ssh_config,wait_ssh,set_hostname` to skip those steps. I don't know whether this is even possible in agnosticd, or whether the workshops can be adapted to function with a bastion host.
The workshops should be able to function with a bastion host, and the `--skip-tags` listed remove that capability. This is core functionality of AgnosticD and, in effect, a large part of how it is cloud agnostic. (We can, and do, run agnosticd with `--skip-tags`, but the above tag mix removes core functionality which would then have to be reimplemented per cloud provider.)
Each cloud_provider has its own `create_inventory`, and `create_ssh_config` relies on that. This allows multi-cloud deployment, and `create_ssh_config` is what enables configuration via the bastion. It is simply configuring Ansible to know how to contact each host via a jumpbox. You could effectively use your first control node for this bastion/jumpbox task. All it needs, and the bastions effectively get this "for free", is the correct ssh keys and IP connectivity to all hosts.
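The underlying mechanism is nothing more than SSH proxying; as a rough, generic example (not agnosticd's generated `create_ssh_config` output, and the hostname and user are placeholders), the inventory only needs something like:

```yaml
# Generic illustration of bastion/jumpbox access, not agnosticd's implementation.
all:
  vars:
    # Route every connection through the jump host; the bastion gets the
    # SSH keys and IP connectivity "for free", as described above.
    ansible_ssh_common_args: '-o ProxyJump=ec2-user@bastion.example.opentlc.com'
```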
Call at 9AM ET between XLAB & GPTE:
@tonykay will deploy the workshop and talk to GPTE leadership to verify a list of what else is missing to merge the PR.
cc @abenokraitis
I've renamed the config to `ansible-workshops-agnosticd` as discussed and removed the commit deleting the other `ansible-workshops-*` configs.
I forgot to mention on the call that what you need to deploy the config, apart from the command from the readme and the AWS credentials in `sample_vars.yml`, is the RHAAP download `offline_token`, which you can generate at https://access.redhat.com/management/api. Alternatively, comment out the task at `pre_infra.yml#60`.
@sstanovnik can you paste in the exact command you are using to deploy, and confirm you are deploying with 2.9.x? Thanks. If you have time, adding a `requirements.txt` to the config would be useful, or confirming which one you are using. They are in `tools/virtualenvs`.
READMEs typically should contain the command to deploy so other users can basically just copy and paste to test. Thanks.
The README is important for ops. It needs to have a description so they know this is the port of the workshops, as there will be 2 entries with `ansible-workshop` in their name. Plus detailed instructions on how to run: for example, either a long `ansible-playbook` command with all the `-e` options they need to set, or simply a sample of what they would override via `-e @my_vars.yml`. We typically split it into 2 files and have the secret one last. Happy to contribute once we have it added. Basically, ops need to know how to deploy/destroy more or less via copy and paste.
Re the README, it would be good to call out the `offline_token` and how to get it (simply a link to the RHN docs will do). We would prefer it to have a more meaningful name, as we use lots of tokens to pull from repos and repositories. An `rhn_` prefix will help ops if they see log errors etc.
The exact command was in the readme, but I missed changing the config path when I renamed it to `*-agnosticd`. I've now extended the readme with detailed instructions.
`offline_token` can't be renamed since it's a variable that ansible/workshops uses directly. Unless you want to duplicate it under a different name in the vars file, but that would break overriding.
I've added a `requirements.txt`. I wasn't using 2.9, but I've now verified this works under these two versions:
- ansible 2.9.25
- ansible [core 2.11.2]
Thanks @sstanovnik, testing with 2.11. Noted your point re the var name.
I notice it looks up all AMIs for all workshops? Shouldn't this be conditional so that only the current workshop's AMIs are looked up? For example, if a network AMI task breaks then the workshop will break for everyone, plus it is additional work. Can we wrap them in `when:` conditions?
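Something along these lines is what the `when:` wrapping could look like; `workshop_type` and the per-workshop task files are assumptions about how the lookups might be grouped, not the current layout:

```yaml
# Hypothetical guard: only resolve the AMIs the selected workshop needs.
- name: Find networking workshop AMIs
  include_tasks: ami_find/network.yml
  when: workshop_type == 'networking'

- name: Find RHEL workshop AMIs
  include_tasks: ami_find/rhel.yml
  when: workshop_type == 'rhel'
```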
Need to document all vars to supply as overrides, e.g. `workshop_dns_zone`.
@sstanovnik when tested on an environment running other workshops, an issue arises here: https://github.com/xlab-steampunk/agnosticd/blob/a38dae6689a97b35a691ec0d1d7c343a380741e4/ansible/configs/ansible-workshops-agnosticd/post_infra.yml#L23
Basically the filter picks up every instance in every running workshop. I suggest the filter is changed from a simple

```yaml
- name: Grab all workshop hosts
  ec2_instance_info:
    filters:
      instance-state-name: running
      "tag:ansible-workshops": "true"
  register: workshop_hosts
```

to one using `guid` as well. Otherwise we will potentially pick up hundreds of instances before proceeding. It is already picking up the existing instances from the ansible-workshop config.
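For example, scoping the lookup by a guid tag as well (the `Workshop_guid` tag is the one the config ends up using):

```yaml
- name: Grab only this deployment's workshop hosts
  ec2_instance_info:
    filters:
      instance-state-name: running
      "tag:ansible-workshops": "true"
      "tag:Workshop_guid": "{{ guid }}"   # scope the lookup to this guid
  register: workshop_hosts
```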
@sstanovnik re the above filter picking up all the instances: this is how we normally do it, and as long as you are creating the project tag including the value of `guid`, all should be good, as you'll only pick up that guid's instances: https://github.com/redhat-cop/agnosticd/blob/d2dd93ad960d5ae1e486eac3f9a7828b649bd9ef/ansible/roles-infra/infra-ec2-create-inventory/tasks/main.yml#L7
Hi @sstanovnik, we need the filter issue fixed before we can carry out further tests. I'll check your repo Wednesday morning US time. This is blocking us from deploying safely.
I've limited instance lookups to the `guid`.
AMIs are all fetched because that was easiest. I could either:
- mark each image with the workshops it's used in, which needs to be modified properly when images are added or workshops start using different ones, or
- parse the necessary images from the workshop instances, but this is a bit more work initially.

By documenting variables, do you mean documenting their meaning alongside their sample values, or documenting all of them, perhaps including the meanings, in the readme?
I have a DNS issue as I switch from us-east-1 to us-east-2, and we do use multiple regions concurrently, e.g. Europeans may deploy to eu-* whilst APAC may use ap-*. There seems to be a hard-coded internal Route53 zone associated with the first region we choose, so we can't deploy elsewhere:

```
"last_updated_time": "2021-08-26T18:28:38.805000+00:00",
"logical_resource_id": "S3Bucket",
"physical_resource_id": "",
"resource_type": "AWS::S3::Bucket",
"status": "CREATE_FAILED",
"status_reason": "amtw.example.opentlc.com.private already exists in stack arn:aws:cloudformation:us-east-1:719622469867:stack/ansible-workshops-agnosticd-tok-00/2b97ee70-068f-11ec-b9b8-124ddee56cf7"
```
This is related to the previous comment directly above
We are having issues with a hard-coded string `amtw` which needs to be `guid`. I suggest you use `guid` throughout, and then in `default_vars.yml` you can map `ec2_name_prefix: "{{ guid }}"` as a safety mechanism and to simplify maintaining the original code.
Also, some of these are deal breakers for the labs to function correctly: `value: "attendance-host.amtw.example.opentlc.com"`. We also see these for gitlab and automation hub.
Whilst it was used in the original Ansible Workshops deployer, the creation of the working directory, keys etc. in the repo itself needs to be moved out of the repo. AgnosticD supports the idea of an `output_dir` and all artifacts should be stored there. Repos are cloned on the fly, deployed, then destroyed, and artifacts would be lost. Plus for local development it can cause issues when doing a `git push` etc.
We need to change the location from `ansible/configs/ansible-workshops-agnosticd/amtw`. Plus the final sub-directory should either be a var, e.g. `/{{ guid }}`, or all artifacts should carry the value of `guid` in their name.
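As a sketch of the intent (the module choice and file name here are illustrative, not the config's actual tasks), any per-deploy artifact would be written like this:

```yaml
# Illustrative only: artifacts live under output_dir and carry the guid,
# so nothing is written into the repo checkout.
- name: Generate the workshop SSH keypair outside the repository
  openssh_keypair:
    path: "{{ output_dir }}/{{ guid }}-private"
    type: rsa
    size: 4096
```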
It is essential that `ec2_name_prefix` is set to `guid`; this is breaking deploys and artifacts such as ssh keys. We would remove it from any input files and set it in `default_vars_ec2.yml` like this: `ec2_name_prefix: "{{ guid }}"`. The longer-term goal would ideally be to eventually replace it with `guid` entirely.
I've changed `ec2_name_prefix` to be `guid` and moved the artefacts into `output_dir: "[...]/{{ guid }}"`. Not sure about the problem with switching regions - there is a separate CF stack bound to a region. Now that GUID == ec2_name_prefix, name clashes shouldn't happen any more, I hope.
Hey, one last thing @sstanovnik - I got everything to boot, and the website login, awesome!
The last problem (hopefully!) is that I don't think the real IPs are being populated. The attendance website is empty, and the instructor files are empty ->
e.g.
```
➜ agnosticd git:(workshopmigration) ls
CODEOWNERS LICENSE README.adoc ansible ansible.cfg docs login.html test.yaml tests tools training
➜ agnosticd git:(workshopmigration) ls /tmp/workdir/seansean/
ansible-workshop-vars.yml student1-etchosts.txt user-data.yaml
ansible-workshops-agnosticd.seansean.ec2_cloud_template student1-instances.txt user-info.yaml
instructor_inventory.txt student2-etchosts.txt
seansean-private.pem student2-instances.txt
➜ agnosticd git:(workshopmigration) cd /tmp/workdir/seansean
➜ seansean ls
ansible-workshop-vars.yml student1-etchosts.txt user-data.yaml
ansible-workshops-agnosticd.seansean.ec2_cloud_template student1-instances.txt user-info.yaml
instructor_inventory.txt student2-etchosts.txt
seansean-private.pem student2-instances.txt
➜ seansean cat student1-etchosts.txt
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
➜ seansean cat student1-instances.txt
[all:vars]
ansible_user=student1
ansible_ssh_pass=agnosticd-migration-admin-password
ansible_port=22
[web]
[control]
```
but I think we used the wrong variable for the templates (under `templates/instructor_inventory`) in the post_infra: `/agnosticd/ansible/configs/ansible-workshops-agnosticd/post_infra.yml`
this looks correct ->

```yaml
- name: Grab all workshop hosts
  ec2_instance_info:
    filters:
      instance-state-name: running
      "tag:ansible-workshops": "true"
      "tag:Workshop_guid": "{{ guid }}"
  register: workshop_hosts
```
However, `workshop_hosts` is what is holding all the data, while the templates are still using the ansible provisioner's variables:

```jinja
{% for host in f5_node1_node_facts.instances %}
{% if 'student' ~ number == host.tags.Student %}
{{host.tags.Student}}-node1 ansible_host={{ host.public_ip_address }} ansible_user={{ host.tags.username }}
{% endif %}
```

the `f5_node1_node_facts`, for example, versus the `workshop_hosts`.
Making sure this is not just my setup - did you verify the student inventory was set up correctly (e.g. /home/student1/lab_inventory/hosts) and that the login page actually loaded the labs?
This is actually a red herring. The `*_node_facts` variables are dynamically created in `post_infra.yml#36`. I did this to avoid porting the inventory/hosts templates, because creating the necessary variables was much easier.
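A rough sketch of that idea (not the actual task; the tag used to group instances here is an assumption):

```yaml
# Hypothetical sketch: build the per-node fact variables the workshop
# templates expect directly from the EC2 lookup, instead of porting the
# original per-provider fact gathering.
- name: Create <node>_node_facts variables from workshop_hosts
  set_fact:
    "{{ item }}_node_facts":
      instances: >-
        {{ workshop_hosts.instances
           | selectattr('tags.Workshop_node', 'defined')
           | selectattr('tags.Workshop_node', 'equalto', item)
           | list }}
  loop:
    - node1
    - node2
```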
What went wrong is that the `Student` tag was set to `N` instead of `studentN`, which I must have missed when transcribing the tags. There is a filter in the templates that silently does nothing when anything other than the expected format is given. I've fixed this now and can confirm that the inventory files are populated with IPs.
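In other words, the instance tag now carries the full name the templates' filter expects (a sketch of the tag shape only, not the literal template code; `student_number` is a placeholder variable name):

```yaml
# Illustrative tag shape in the instance definition.
Tags:
  - Key: Student
    Value: "student{{ student_number }}"   # previously just "{{ student_number }}"
```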
let me rebase and try! thanks Saso
OK IP addresses are populating! Great.
Two issues that are hopefully minor->
- Can you remove the IBM community grid stuff? That is from an older project and I think GPTE has pushed back on this. E.g. just remove the role calls in `/agnosticd/ansible/configs/ansible-workshops-agnosticd/pre_software.yml` - just delete these entirely:
```yaml
- name: IBM community grid managed nodes
  hosts: "managed_nodes"
  become: true
  gather_facts: true
  environment: *aws_environment
  tasks:
    - name: install boinc-client and register
      include_role:
        name: ansible.workshops.community_grid
      when:
        - ibm_community_grid is defined
        - ibm_community_grid

- name: IBM community grid control node
  hosts: "control_nodes"
  become: true
  gather_facts: true
  environment: *aws_environment
  tasks:
    - name: install boinc-client and register
      include_role:
        name: ansible.workshops.community_grid
        tasks_from: auto_shutoff
      when:
        - ibm_community_grid is defined
        - ibm_community_grid
```
- Something is up with the roles being called in `pre_software.yml`, e.g.
```yaml
- name: add dns entires for all student control nodes
  hosts: '*ansible-1'
  become: true
  gather_facts: false
  environment: *aws_environment
  tasks:
    - include_role:
        name: ansible.workshops.aws_dns
```
I am getting warnings ->

```
[WARNING]: Could not match supplied host pattern, ignoring: *ansible-1

PLAY [configure ansible control node] *************************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: *ansible-2
[WARNING]: Could not match supplied host pattern, ignoring: *ansible-3
```
and skips that don't make sense to me ->

```
PLAY [add dns entires for all student control nodes] **********************************************************
skipping: no hosts matched
```

so the DNS is not working right now, and it's skipping a bunch of roles... it seems like the ansible-1, 2, 3 groups aren't being created
Actually, I don't think clusters are supported - I think that is OK - so the easiest fix must be switching `*ansible-1` etc. to `control_nodes`, and I think this will just work.
This seems to be my fault again, for missing a doubly nested loop that created the control nodes in ansible/workshops. I've fixed the provisioning to be able to create multiple replicas of nodes (and thus support clustering), adding the `-N` suffix which was missing from the control node names and caused the above failures. A non-clustered deployment now executes the tasks that weren't executed before.
Clustering definitely does not work though, even in ansible/workshops. See the following section in ansible/workshops:provisioner/provision_lab.yml#115:
```yaml
- include_role:
    name: ansible.workshops.control_node
    tasks_from: package_dependencies
  when: create_cluster|bool
```
where `package_dependencies` should be `15_package_dependencies`. I've fixed this in this repo, since it's one of the changes I need to sync manually anyway.
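i.e. the corrected call is:

```yaml
- include_role:
    name: ansible.workshops.control_node
    tasks_from: 15_package_dependencies
  when: create_cluster|bool
```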
OK @sstanovnik, I got this to pass with some minor edits upstream in `ansible.workshops` to support the output_dir for the control_node role and the populate_controller role. I did a new release, tested it, and created two small PRs to your fork/branch. Can you accept those? Then I can re-test, and that should finish out this project!