Timeout when redeeming invites simultaneously
I found that if two or more clients (servers) try to redeem invites at the same time, all but one of them time out. The 'simultaneous' case is much easier to reproduce than it sounds, because Ansible playbooks normally run each task in parallel on all servers in a group.
Here is a simple playbook to configure servers (not the innernet server, other servers):
- hosts: innernet
  - name: Create invite for the server
    delegate_to: '{{ innernet_auth_server }}'
    become: true
    innernet add-peer
      --name '{{ inventory_hostname }}'
      --ip '{{ innernet_ip }}'
      --cidr '{{ innernet_servers_cidr_name }}'
      --invite-expires {{ innernet_server_invite_expiration_time }}
      --save-config '{{ innernet_invite_path }}'
      --admin false
      '{{ innernet_network_name }}'
    register: res
    changed_when: res.rc==0
  - name: Fetch invite
    become: true
    delegate_to: '{{ innernet_auth_server }}'
    shell: |
      cat '{{ innernet_invite_path }}'
      rm '{{ innernet_invite_path }}'
    register: invite
    changed_when: invite.rc==0
  - name: Save invite
    become: true
    content: '{{ invite.stdout }}'
    dest: '{{ innernet_invite_path }}'
    owner: root
    group: root
    mode: '0600'
  - name: Accept invite
    become: true
    '{{ innernet_invite_path }}'
    register: res
    changed_when: res.rc==0
  - name: Activate systemd unit
    become: true
    name: innernet@{{ innernet_network_name }}
The 'Accept invite' task succeeds for the first host and fails for all the others. Adding throttle: 1 or retries helps.
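For anyone hitting the same thing, here is a rough sketch of what the workaround could look like on the 'Accept invite' task. The task's actual command is not shown above, so `innernet_accept_command` is a hypothetical variable standing in for it, and the retry count and delay are illustrative values rather than tested ones.

```yaml
- name: Accept invite
  become: true
  # Assumption: innernet_accept_command is a hypothetical variable that holds
  # whatever command the task runs to redeem the invite.
  ansible.builtin.command: "{{ innernet_accept_command }}"
  register: res
  changed_when: res.rc == 0
  # Run this task on one host at a time instead of on all hosts in parallel.
  throttle: 1
  # Retry a few times if redeeming still fails (illustrative values).
  retries: 3
  delay: 10
  until: res.rc == 0
```

`throttle: 1` serializes only this task, so the rest of the play still runs in parallel. The retries are a fallback for the case where redeeming fails even when serialized.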
Hi @amarao, I am working on an Ansible implementation as well: Any hints on my problem?
@janikvonrotz, there is too little information to say anything definitive, but here is what I found:
- There is an issue with the outgoing interface for invitations (I think it needs fixing in the ureq library).
- Simultaneous invites do not work (this issue). If you use Ansible, use throttle: 1 on the task.
- I found that calling innernet fetch on the innernet-server interface breaks a lot (don't do it).
You may also want to try pausing the redeeming process for debugging (press Ctrl-Z and look at the temporary wg interface created by innernet; that is how I found the issue with the wrong interface).
@amarao I assume the issue with the wrong interface is #141. I will try to reproduce the issue. Thanks for your initiative and the well-written reports.
Interestingly, I added a quick section in the docker tests that spins up two peers redeeming invitations at about the same time, and it didn't error out. Will have to dig more into this. Thanks for reporting, and I'm sorry I've been away for so long!
I got permission from the company I work for to open-source our innernet playbooks/roles, which I'll do shortly. For testing I use libvirt VMs, and when Ansible redeems the invites it does so in parallel, so the failure is pretty reproducible. I'm afraid GH Actions does not allow nested virtualization, so running it would require a normal Linux machine.
@amarao that would be great, thanks! I started working on dropping the docker tests in favor of just using netns on linux directly, using as the base, but if the ansible playbooks are more readable that could be another possible way to do "integration" testing.
Hey @amarao have you published your ansible playbooks on gh/galaxy?
I am working on:
The roles are not as good as they should be.
I'm working on open-sourcing them; it's a bit harder than I expected (mostly stripping out internal stuff and making the result self-contained). My plan is to publish playbooks, not roles, as I don't believe roles can work here (there is too much delegation and cross-host orchestration for a role). So far I have got molecule working. I think it will take a couple of days (or rather evenings).
Finally, I got everything streamlined and permitted for publishing. My playbook is here:
It's full of 'retries', 'throttle: 1', etc.; nevertheless, there is a 30-50% chance that one of the nodes fails to redeem its invite. Each of those hacks points to an issue with automation.
Does the server use async, and if so, is it using the multi-threaded runner or just the default single-threaded one? Another place to look is SQLite (I think innernet uses it to store assigned IPs), which is single-threaded. But it shouldn't block long enough to back up.
@DanielJoyce yep, it uses async with a multithreaded runner (tokio).