Random failed tests on routeros devices
Validity Version
3.0.3
NetBox Version
4.1.1
Python Version
3.12.3
Steps to Reproduce
As amount of tests for routeros and number of routeros devices started to increase (8 tests/33 devices as of now) I noticed that I get random failed tests. For example if I run 8 tests/1 device I get result with all tests passed, but if I run 8 tests/33 devices I end up with around 80% of tests passed but others failed.
This is how particular test for particular device may look when scheduled 8/33 tests have been done:
It seems like failed tests are because of netmiko or validity collecting results from netmiko because command result file may contain the following lines:
POLLING ERROR
AssertionError:
or trimmed output like:
enabled: yes
primary-ntp: 10.10.10.10
secondary-ntp: 10.10.10.11
server-dns-names:
mode: unicast
poll-interval: 15m
active-server: 10.10.10.10
last-update-from: 10.10.10.10
las
or some other strange lines which looks like lines from mikrotik config but not related to command sent to device. I attach this example when I find one.
I added to routeros poller config line: session_log: /tmp/netmiko_session.log and this log file contains all command outputs from device however device result file does not contain data.
Note. I don't notice same behavior on cisco.ios devices.
Traceback
No tracebacks
Let's break it down:
- There are 2 separate processes inside Validity: tests execution and device polling (although they can be executed together for the sake of convenience, in general they have nothing to do with each other). So, the former (tests execution) seems to be completely unrelated to the issue.
- There is some problem with netmiko (or the network itself, network_issue.jpg) as I can see. AssertionError likely means the output received from a device is empty. SSH/telnet are not so reliable in terms of network automation as you may know.
- Presence of the answer (that seems correct) in the logs and absence of it in the polling result might mean netmiko could not find the trailing prompt (
[MyMikrotik] >). I'm not sure here, just a guess.
Hence:
- I can't make it fixed unless it becomes somehow reproducible (e.g. you'd describe the exact algorithm for reproducing it with clean netbox/validity setup and some generic mikrotik).
- Try to play with timeouts (via poller params)
- I'm not sure if netmiko disables colors. So you can try to do it by yourself (use
myuser+ctas your username) - You can try to reproduce it without Validity: just using netmiko directly from python shell
It seems like this is solely mikrotik-netmiko issue. I found some similar issues: https://github.com/ktbyers/netmiko/issues/2880 https://github.com/ktbyers/netmiko/issues/2512 Have to figure out what to do with this. Issue closed.
I did a little investigation and got the following.
Plain netmiko script with a list of commands pushed to send_command indeed somehow does not have time to handle the input/output of the ROS and I got empty responses on random commands.
I found that netmiko has another delay-based mechanism to send commands with send_command_timing which is used primarily for show commands.
(https://ktbyers.github.io/netmiko/docs/netmiko/index.html#netmiko.BaseConnection.send_command_timing).
So I replaced https://github.com/amyasnikov/validity/blob/e40aec1a1f24c2f3c858d478e79c50279142af53/validity/pollers/cli.py#L17
with
return driver.send_command_timing(command.parameters["cli_command"], last_read=0.1, read_timeout=30, cmd_verify=True)
and first time I got all my tests succeed.
Maybe you can extend validity poller to give possibility to specify which method/arguments to run or just change send_command with send_command_timing?
Oops, Mikrotik tests succeed but now Cisco tests randomly failed :(
Reverted code to original send_command(), but in the poller settings I added 2 options:
{
"disable_lf_normalization": "True",
"global_cmd_verify": "True"
}
Short tests on routeros and ios now return valid results. Let's see how things go further.
Maybe you can extend validity poller to give possibility to specify which method/arguments to run or just change send_command with send_command_timing?
I can consider adding custom user-defined pollers via plugin settings
I've implemented custom user-defined pollers mechanism. So, in the next version you'll be able to define your own poller (e.g. based on the current netmiko poller) with the parameters suitable for your case.
@amyasnikov @hart323 Is there anything Netmiko needs to do here? It is a bit unclear to me whether Netmiko + Mikrotik is still broken in the referenced Netmiko issues AND if it is broken, what the issue is and how to fix it?
Hey @ktbyers, AFAIK the last version of Netmiko still has problems with getting full config (/export command) from Routeros 7 devices. The result of send_command appears to be truncated.
The code from this PR https://github.com/ktbyers/netmiko/pull/3467 solves the issue in my case
@amyasnikov Okay I reviewed/merged that referenced PR.