
Add config ansible-workshops-test - migration from ansible/workshops

Open sstanovnik opened this issue 3 years ago • 31 comments

SUMMARY

Migrate the RHEL workshop from ansible/workshops.

This config provisions, but is not "finished", since we need to validate it works.

/cc @IPvSean, we can use this as the base for further planning

ISSUE TYPE
  • New config Pull Request
COMPONENT NAME

ansible-workshops-test

ADDITIONAL INFORMATION

This is ansible-workshops-test as it includes all workshops. I've only tested RHEL, which needs to be split out. Most code will need to be shared though, since all workshops have the same structure.

The workshop provisions in 10 minutes, which is an improvement. The launch command is in the readme.

I've had to use agnosticd's framework very liberally, since there are some things the existing workshops do differently, mainly concerning how the inventory is built up. Most things are identical to what worked before, since I tried to break as little as possible. Machine provisioning is obviously completely rewritten in CloudFormation. EC2 is the only supported provider because of extremely tight coupling.

The code is a mess, I know this very well. To avoid premature optimization, cleanups will happen once I know no major functionality changes are necessary.

I initially wanted to take a staged approach to migrating the workshops, but the logic that needed to be ported was too complex to split into nice, logical parts. In the end, I figured the workflow downsides outweighed the benefits of smaller, but still >2 kLOC, PRs.

Major non-standard components, also the main things one needs to pay attention to here, are:

  • default_vars.yml, where I dumped, in order, all variable files from ansible/workshops. Duplicates aplenty; this is ~800 LOC with gratuitous whitespace.
  • ami_find/* which are the existing AMI filtering playbooks.
  • autoincluded_vars_workshop_instances.yml, a Jinja2 template of variables that is rendered and included in pre_infra.yml, featuring multiple levels of templating and lazy evaluation.
    • This is where instances for each workshop are configured.
    • 1.3 kLOC
    • The schema is documented inside the file, but never validated.
    • This file is very ugly, but it works. Any improvements I could make would need to be propagated through the whole ansible/workshops structure and sometimes require knowledge of why workshops are set up the way they are.
  • ec2_cloud_template.j2 is not the "standard" agnosticd template, it is completely rewritten.
    • VPCs, subnets, secgroups and the like are mostly hardcoded.
    • Instances have a loop which is controlled by what is in autoincluded_vars_workshop_instances.yml.
  • There is no bastion host and the SSH keypair is handled differently.
  • Likely more I just don't remember off the top of my head.
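
The templated-vars pattern described above can be sketched minimally. This is a hypothetical illustration of the two-level templating and lazy evaluation only; every name in it is made up and is not the file's actual schema:

```yaml
# Hypothetical sketch of a Jinja2-templated vars file (all names illustrative).
# Level 1: this file is itself rendered by Jinja2 before being included.
workshop_instances:
{% for n in range(1, student_count + 1) %}
  - name: "student{{ n }}-ansible-1"
    # Level 2: the inner expression is emitted as a literal string and is only
    # resolved later ("lazily") when Ansible templates the variable on use.
    image: "{{ '{{ control_node_ami }}' }}"
    tags:
      Student: "student{{ n }}"
{% endfor %}
```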

TODOs:

  • Split out the RHEL workshop (then others).
  • Verify whether the RHEL workshop actually works (not just that it provisions).
  • Need to manually convert any relevant changes in ansible/workshops since the commit I based this migration on.

sstanovnik avatar Jul 23 '21 14:07 sstanovnik

@tonykay would appreciate a review if possible, thanks!

abenokraitis avatar Jul 23 '21 14:07 abenokraitis

Taking a look right now @abenokraitis - thanks for all the work here

tonykay avatar Jul 23 '21 14:07 tonykay

Adding some comments inline. Because it is a config, and effectively sandboxed, we can merge with no impact on others. One potential issue I see is that there are a number of hard-coded names and filters which we would expect to be variables, so we can update them without a PR each time. I will add some comments to the code.

tonykay avatar Jul 23 '21 20:07 tonykay

I've resolved most of the things we've talked about. Changes are in separate commits to make reviewing easier. Let me know if I need to squash anything.

AMIs are now completely parametrized and resolved in a loop, so variable overrides will work. Variables are also split into the "defaults", which normally wouldn't be changed, and the "sample vars", which are things you'd likely want to or need to customize. I think that's the way they were intended to be used.

The config was also renamed from ansible-workshops-test into ansible-workshops, with the other existing ansible-workshops* configs removed, since they'd be duplicates.

I couldn't manage to get things working with bastion hosts, i.e. with the built-in inventory builder. I saw no way of making host connections direct, without proxying through the bastion host. This means the ansible-playbook command still needs to include `--skip-tags create_inventory,create_ssh_config,wait_ssh,set_hostname` to skip those steps. I don't know whether this is even possible in agnosticd, or whether the workshops can be adapted to function with a bastion host.

sstanovnik avatar Aug 16 '21 12:08 sstanovnik

I couldn't manage to get things working with bastion hosts, i.e. with the built-in inventory builder. I saw no way of making host connections direct, without proxying through the bastion host. This means the ansible-playbook command still needs to include `--skip-tags create_inventory,create_ssh_config,wait_ssh,set_hostname` to skip those steps. I don't know whether this is even possible in agnosticd, or whether the workshops can be adapted to function with a bastion host.

The workshops should be able to function with a bastion host and the --skip-tags listed remove that capability. This is core functionality of AgnosticD and in effect is a large part of how it is cloud agnostic.

(We can, and do, run agnosticd with --skip-tags but the above tag mix removes core functionality which would then have to be reimplemented per cloud provider)

Each cloud_provider has its own create_inventory, and create_ssh_config relies on that. This allows multi-cloud deployment, and create_ssh_config is what enables configuration via the bastion. It simply configures Ansible to know how to contact each host via a jumpbox.

You could effectively use your first control node for this bastion/jumpbox task. All it needs, and the bastions effectively get this "for free", is the correct ssh keys and IP connectivity to all hosts.
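
In effect, the generated ssh_config amounts to telling SSH (and therefore Ansible) to hop through the bastion transparently. A hypothetical sketch of the idea, with made-up host names and key path, not AgnosticD's actual output:

```
Host bastion
    HostName bastion.GUID.example.opentlc.com
    User ec2-user
    IdentityFile ~/.ssh/GUID-infra-key.pem

# All other hosts are reached by proxying through the bastion.
Host *.GUID.internal
    User ec2-user
    ProxyJump bastion
    IdentityFile ~/.ssh/GUID-infra-key.pem
```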

tonykay avatar Aug 18 '21 19:08 tonykay

call at 9AM ET between XLAB & GPTE

@tonykay will deploy workshop and talk to GPTE leadership to verify a list of what else is missing to merge PR

cc @abenokraitis

IPvSean avatar Aug 23 '21 13:08 IPvSean

I've renamed the config to ansible-workshops-agnosticd as discussed and removed the commit deleting the other ansible-workshops-* configs.

I forgot to mention on the call that what you need to deploy the config, apart from the command from the readme and the AWS credentials in sample_vars.yml, is the RHAAP download offline_token, which you can generate at https://access.redhat.com/management/api. Alternatively, comment out the task at pre_infra.yml#60.

sstanovnik avatar Aug 23 '21 14:08 sstanovnik

@sstanovnik can you paste in the exact command you are using to deploy, and confirm you are deploying with 2.9.x? Thanks. If you have time, adding a requirements.txt to the config would be useful, or confirming which one you are using; they are in tools/virtualenvs. The READMEs typically should contain the command to deploy so other users can basically just copy and paste to test. Thanks.

tonykay avatar Aug 23 '21 14:08 tonykay

The README is important for ops. It needs a description so they know this is the port of the workshops, as there will be 2 entries with ansible-workshop in their name, plus detailed instructions on how to run it. For example, either a long ansible-playbook command with all the `-e` options they need to set, or simply a sample of what they would override via `-e @my_vars.yml`. We typically split it into 2 files and have the secret one last. Happy to contribute once we have it added. Basically, ops need to know how to deploy/destroy more or less via copy and paste.

tonykay avatar Aug 23 '21 15:08 tonykay

Re the README, it would be good to call out the offline_token and how to get it (simply a link to the RHN docs will do). We would prefer it to have a more meaningful name, as we use lots of tokens to pull from repos and repositories. A `rhn_` prefix will help ops if they see log errors etc.

tonykay avatar Aug 23 '21 16:08 tonykay

The exact command was in the readme, but I missed changing the config path when I renamed it to *-agnosticd. I've now extended the readme with detailed instructions.

offline_token can't be renamed since it's a variable that ansible/workshops uses directly. Unless you want to duplicate it under a different name in the vars file, but that would break overriding.

I've added a requirements.txt. I wasn't using 2.9, but I've now verified this works under these two versions:

  • ansible 2.9.25
  • ansible [core 2.11.2]

sstanovnik avatar Aug 23 '21 16:08 sstanovnik

Thanks @sstanovnik, testing with 2.11. Noted your point re the var name.

tonykay avatar Aug 23 '21 19:08 tonykay

I notice it looks up all AMIs for all workshops? Shouldn't this be conditional, so that only the current workshop's AMIs are looked up? For example, if a network AMI task breaks then the workshop will break for everyone, plus it is additional work. Can we wrap them in `when`?
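
A hedged sketch of the kind of guard meant here; the `workshop_type` variable and task-file names are illustrative assumptions, not the config's actual ones:

```yaml
# Only look up the AMIs the selected workshop actually needs.
# "workshop_type" and the file names are illustrative assumptions.
- name: Find RHEL workshop AMIs
  include_tasks: ami_find/rhel.yml
  when: workshop_type == "rhel"

- name: Find network workshop AMIs
  include_tasks: ami_find/network.yml
  when: workshop_type == "network"
```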

tonykay avatar Aug 23 '21 20:08 tonykay

Need to document all vars to supply as over-rides, e.g. `workshop_dns_zone`.

tonykay avatar Aug 23 '21 20:08 tonykay

@sstanovnik when tested on an environment running other workshops an issue arises here https://github.com/xlab-steampunk/agnosticd/blob/a38dae6689a97b35a691ec0d1d7c343a380741e4/ansible/configs/ansible-workshops-agnosticd/post_infra.yml#L23

Basically the filter picks up every instance in every running workshop. I suggest the filter is changed from a simple

    - name: Grab all workshop hosts
      ec2_instance_info:
        filters:
          instance-state-name: running
          "tag:ansible-workshops": "true"
      register: workshop_hosts

to using guid as well. Otherwise we will potentially pick up hundreds of instances before proceeding. It is already picking up the existing instances from the ansible-workshop config.

tonykay avatar Aug 23 '21 21:08 tonykay

@sstanovnik re the above filter picking up all the instances: this is how we normally do it, and as long as you are creating the project tag including the value of guid, all should be good, as you'll only pick up the guid instances. https://github.com/redhat-cop/agnosticd/blob/d2dd93ad960d5ae1e486eac3f9a7828b649bd9ef/ansible/roles-infra/infra-ec2-create-inventory/tasks/main.yml#L7
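
For example, scoping the lookup with a guid tag filter could look roughly like this (a sketch; the exact tag key depends on what the provisioner actually sets):

```yaml
- name: Grab only this deployment's workshop hosts
  ec2_instance_info:
    filters:
      instance-state-name: running
      "tag:ansible-workshops": "true"
      # Scope to this deployment so other running workshops are ignored.
      "tag:Workshop_guid": "{{ guid }}"
  register: workshop_hosts
```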

tonykay avatar Aug 23 '21 22:08 tonykay

Hi @sstanovnik we need the filter issue fixed before we can carry out further tests. I'll check your repo morning Wednesday US time. This is blocking us from deploying safely.

tonykay avatar Aug 24 '21 22:08 tonykay

I've limited instance lookups to the guid.

AMIs are all fetched because that was easiest. I could either

  • mark each image with the workshops it's used in, which needs to be modified properly when images are added or workshops start using different ones, or
  • parse the necessary images from the workshop instances, but this is a bit more work initially.

By documenting variables, do you mean documenting their meaning alongside their sample values, or documenting all of them, perhaps including the meanings, in the readme?

sstanovnik avatar Aug 25 '21 11:08 sstanovnik

I have a DNS issue as I switch from us-east-1 to us-east-2, and we do use multiple regions concurrently, e.g. Europeans may deploy to eu-* whilst APAC may use ap-*. There seems to be a hard-coded internal Route53 zone associated with the first region we choose, so we can't deploy elsewhere:

    "last_updated_time": "2021-08-26T18:28:38.805000+00:00",
    "logical_resource_id": "S3Bucket",
    "physical_resource_id": "",
    "resource_type": "AWS::S3::Bucket",
    "status": "CREATE_FAILED",
    "status_reason": "amtw.example.opentlc.com.private already exists in stack arn:aws:cloudformation:us-east-1:719622469867:stack/ansible-workshops-agnosticd-tok-00/2b97ee70-068f-11ec-b9b8-124ddee56cf7"

tonykay avatar Aug 26 '21 18:08 tonykay

This is related to the previous comment directly above

We are having issues with a hard-coded string amtw which needs to be guid. I suggest you use guid throughout, and then in default_vars.yml you can map `ec2_name_prefix: "{{ guid }}"` as a safety mechanism and to simplify maintaining the original code.

Also some of these are deal breakers for the labs to function correctly:

  value: "attendance-host.amtw.example.opentlc.com"

We also see these for gitlab and automation hub

tonykay avatar Aug 26 '21 18:08 tonykay

Whilst it was used in the original Ansible Workshops deployer, the creation of the working directory, keys, etc. in the repo itself needs to be moved out of the repo. AgnosticD supports the idea of an output_dir, and all artifacts should be stored there. Repos are cloned on the fly, deployed, then destroyed, and artifacts would be lost. Plus, for local development it can cause issues when doing a git push etc.

We need to change the location from ansible/configs/ansible-workshops-agnosticd/amtw. Plus, the final sub-directory should either be a var, e.g. /{{ guid }}, or all artifacts should carry the value of guid in their name.

tonykay avatar Aug 26 '21 18:08 tonykay

It is essential that ec2_name_prefix is set to guid; this is breaking deploys and artifacts such as SSH keys. We would remove it from any input files and set it in default_vars_ec2.yml like this: `ec2_name_prefix: "{{ guid }}"`

The longer-term goal would ideally be to replace ec2_name_prefix with guid throughout.

tonykay avatar Aug 27 '21 13:08 tonykay

I've changed ec2_name_prefix to be guid and moved the artefacts into output_dir: "[...]/{{ guid }}". Not sure about the problem with switching regions - there is a separate CF stack bound to a region. Now with GUID == ec2_name_prefix name clashes shouldn't happen any more, I hope.

sstanovnik avatar Aug 27 '21 14:08 sstanovnik

Hey, one last thing @sstanovnik: I got everything to boot, and the website login works, awesome!

The last problem (hopefully 🤞) is that I don't think the real IPs are being populated. The attendance website is empty, and the instructor files are empty ->

e.g.

    ➜  agnosticd git:(workshopmigration) ✗ ls
    CODEOWNERS  LICENSE     README.adoc ansible     ansible.cfg docs        login.html  test.yaml   tests       tools       training
    ➜  agnosticd git:(workshopmigration) ✗ ls /tmp/workdir/seansean/
    ansible-workshop-vars.yml                               student1-etchosts.txt                                   user-data.yaml
    ansible-workshops-agnosticd.seansean.ec2_cloud_template student1-instances.txt                                  user-info.yaml
    instructor_inventory.txt                                student2-etchosts.txt
    seansean-private.pem                                    student2-instances.txt
    ➜  agnosticd git:(workshopmigration) ✗ cd /tmp/workdir/seansean
    ➜  seansean ls
    ansible-workshop-vars.yml                               student1-etchosts.txt                                   user-data.yaml
    ansible-workshops-agnosticd.seansean.ec2_cloud_template student1-instances.txt                                  user-info.yaml
    instructor_inventory.txt                                student2-etchosts.txt
    seansean-private.pem                                    student2-instances.txt
    ➜  seansean cat student1-etchosts.txt
    127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

    ➜  seansean cat student1-instances.txt
    [all:vars]
    ansible_user=student1
    ansible_ssh_pass=agnosticd-migration-admin-password
    ansible_port=22

    [web]

    [control]

but I think we used the wrong variable for the templates (under `templates/instructor_inventory`)

in the post_infra playbook, /agnosticd/ansible/configs/ansible-workshops-agnosticd/post_infra.yml

this looks correct->

    - name: Grab all workshop hosts
      ec2_instance_info:
        filters:
          instance-state-name: running
          "tag:ansible-workshops": "true"
          "tag:Workshop_guid": "{{ guid }}"
      register: workshop_hosts

However, workshop_hosts is what holds all the data, while the templates are still using the Ansible provisioner's variables:

    {% for host in f5_node1_node_facts.instances %}
    {% if 'student' ~ number == host.tags.Student %}
    {{host.tags.Student}}-node1 ansible_host={{ host.public_ip_address }} ansible_user={{ host.tags.username }}
    {% endif %}
    {% endfor %}

they use f5_node1_node_facts, for example, versus workshop_hosts

Making sure this is not just for my setup, but did you verify the student inventory was setup correctly (e.g. /home/student1/lab_inventory/hosts) and the login page actually loaded the labs?

IPvSean avatar Aug 27 '21 14:08 IPvSean

This is actually a red herring: the *_node_facts variables are dynamically created in post_infra.yml#36. I did this to avoid porting the inventory/hosts templates, because creating the necessary variables was much easier.
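
The dynamic-creation trick can be sketched like this, as a hypothetical minimal version with illustrative variable and tag names, not the actual post_infra code:

```yaml
# Hypothetical sketch: synthesize the "<type>_node_facts" variables the
# original templates expect, instead of porting the templates themselves.
# "workshop_node_types" and the "NodeType" tag are illustrative assumptions.
- name: Expose per-node-type facts under the names the templates expect
  set_fact:
    "{{ item }}_node_facts":
      instances: "{{ workshop_hosts.instances
                     | selectattr('tags.NodeType', 'defined')
                     | selectattr('tags.NodeType', 'equalto', item)
                     | list }}"
  loop: "{{ workshop_node_types }}"  # e.g. ['f5_node1']
```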

What went wrong is that the Student tag was set to N instead of studentN, which I must have missed when transcribing the tags. There is a filter in the templates that silently does nothing when anything other than the expected format is given. I've fixed this now and can confirm that the inventory files are populated with IPs.

sstanovnik avatar Aug 30 '21 13:08 sstanovnik

let me rebase and try! thanks Saso


IPvSean avatar Aug 30 '21 13:08 IPvSean

OK IP addresses are populating! Great.

Two issues that are hopefully minor->

  1. can you remove the IBM community grid stuff? That is from an older project, and I think GPTE has pushed back on this. E.g. just remove the role calls here -> /agnosticd/ansible/configs/ansible-workshops-agnosticd/pre_software.yml

just delete these entirely

- name: IBM community grid managed nodes
  hosts: "managed_nodes"
  become: true
  gather_facts: true
  environment: *aws_environment

  tasks:
    - name: install boinc-client and register
      include_role:
        name: ansible.workshops.community_grid
      when:
        - ibm_community_grid is defined
        - ibm_community_grid

- name: IBM community grid control node
  hosts: "control_nodes"
  become: true
  gather_facts: true
  environment: *aws_environment

  tasks:
    - name: install boinc-client and register
      include_role:
        name: ansible.workshops.community_grid
        tasks_from: auto_shutoff
      when:
        - ibm_community_grid is defined
        - ibm_community_grid
        
  2. something is up with the roles being called in... pre_software

e.g.

- name: add dns entires for all student control nodes
  hosts: '*ansible-1'
  become: true
  gather_facts: false
  environment: *aws_environment
  tasks:
    - include_role:
        name: ansible.workshops.aws_dns

I am getting warnings->

[WARNING]: Could not match supplied host pattern, ignoring: *ansible-1

PLAY [configure ansible control node] *************************************************************************************************************************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: *ansible-2
[WARNING]: Could not match supplied host pattern, ignoring: *ansible-3

and skips that don't make sense to me->

PLAY [add dns entires for all student control nodes] **********************************************************************************************************************************************************
skipping: no hosts matched

so the DNS is not working right now, and it's skipping a bunch of roles... it seems like the groups ansible-1,2,3 are not being created

IPvSean avatar Aug 30 '21 15:08 IPvSean

actually I don't think clusters are supported; I think that is OK, so the easiest fix must be switching ansible-1, etc. to control_nodes, and I think this will just work

IPvSean avatar Aug 30 '21 16:08 IPvSean

This seems to be my fault again, for missing a double-nested loop that created the control nodes in ansible/workshops. I've fixed the provisioning to be able to create multiple replicas of nodes (and thus support clustering), adding the -N suffix whose absence from the control node names caused the above failures. A non-clustered deployment now executes the tasks that weren't executed before.

Clustering definitely does not work though, even in ansible/workshops. See the following section in ansible/workshops:provisioner/provision_lab.yml#115

    - include_role:
        name: ansible.workshops.control_node
        tasks_from: package_dependencies
      when: create_cluster|bool

where package_dependencies should be 15_package_dependencies. I've fixed this in this repo since it's one of the changes I need to sync manually anyway.

sstanovnik avatar Sep 02 '21 07:09 sstanovnik

ok @sstanovnik, I got this to pass with some minor edits upstream in ansible.workshops to support the output_dir for the control_node role and the populate_controller role. I did a new release, tested it, and created two small PRs to your fork/branch. Can you accept those? Then I can re-test, and that should finish out this project!

IPvSean avatar Sep 13 '21 22:09 IPvSean