validity icon indicating copy to clipboard operation
validity copied to clipboard

Random failed tests on routeros devices

Open hart323 opened this issue 1 year ago • 6 comments

Validity Version

3.0.3

NetBox Version

4.1.1

Python Version

3.12.3

Steps to Reproduce

As amount of tests for routeros and number of routeros devices started to increase (8 tests/33 devices as of now) I noticed that I get random failed tests. For example if I run 8 tests/1 device I get result with all tests passed, but if I run 8 tests/33 devices I end up with around 80% of tests passed but others failed. This is how particular test for particular device may look when scheduled 8/33 tests have been done: изображение

It seems like failed tests are because of netmiko or validity collecting results from netmiko because command result file may contain the following lines:

POLLING ERROR
AssertionError: 

or trimmed output like:

             enabled: yes
         primary-ntp: 10.10.10.10
       secondary-ntp: 10.10.10.11
    server-dns-names: 
                mode: unicast
       poll-interval: 15m
       active-server: 10.10.10.10
    last-update-from: 10.10.10.10
  las

or some other strange lines which looks like lines from mikrotik config but not related to command sent to device. I attach this example when I find one.

I added to routeros poller config line: session_log: /tmp/netmiko_session.log and this log file contains all command outputs from device however device result file does not contain data.

Note. I don't notice same behavior on cisco.ios devices.

Traceback

No tracebacks

hart323 avatar Sep 26 '24 15:09 hart323

Let's break it down:

  1. There are 2 separate processes inside Validity: tests execution and device polling (although they can be executed together for the sake of convenience, in general they have nothing to do with each other). So, the former (tests execution) seems to be completely unrelated to the issue.
  2. There is some problem with netmiko (or the network itself, network_issue.jpg) as I can see. AssertionError likely means the output received from a device is empty. SSH/telnet are not so reliable in terms of network automation as you may know.
  3. Presence of the answer (that seems correct) in the logs and absence of it in the polling result might mean netmiko could not find the trailing prompt ([MyMikrotik] > ). I'm not sure here, just a guess.

Hence:

  1. I can't make it fixed unless it becomes somehow reproducible (e.g. you'd describe the exact algorithm for reproducing it with clean netbox/validity setup and some generic mikrotik).
  2. Try to play with timeouts (via poller params)
  3. I'm not sure if netmiko disables colors. So you can try to do it by yourself (use myuser+ct as your username)
  4. You can try to reproduce it without Validity: just using netmiko directly from python shell

amyasnikov avatar Sep 26 '24 19:09 amyasnikov

It seems like this is solely mikrotik-netmiko issue. I found some similar issues: https://github.com/ktbyers/netmiko/issues/2880 https://github.com/ktbyers/netmiko/issues/2512 Have to figure out what to do with this. Issue closed.

hart323 avatar Sep 27 '24 09:09 hart323

I did a little investigation and got the following. Plain netmiko script with a list of commands pushed to send_command indeed somehow does not have time to handle the input/output of the ROS and I got empty responses on random commands. I found that netmiko has another delay-based mechanism to send commands with send_command_timing which is used primarily for show commands. (https://ktbyers.github.io/netmiko/docs/netmiko/index.html#netmiko.BaseConnection.send_command_timing).

So I replaced https://github.com/amyasnikov/validity/blob/e40aec1a1f24c2f3c858d478e79c50279142af53/validity/pollers/cli.py#L17

with return driver.send_command_timing(command.parameters["cli_command"], last_read=0.1, read_timeout=30, cmd_verify=True) and first time I got all my tests succeed.

Maybe you can extend validity poller to give possibility to specify which method/arguments to run or just change send_command with send_command_timing?

hart323 avatar Oct 03 '24 09:10 hart323

Oops, Mikrotik tests succeed but now Cisco tests randomly failed :(

hart323 avatar Oct 03 '24 09:10 hart323

Reverted code to original send_command(), but in the poller settings I added 2 options:

{
  "disable_lf_normalization": "True",
  "global_cmd_verify": "True"
}

Short tests on routeros and ios now return valid results. Let's see how things go further.

hart323 avatar Oct 03 '24 11:10 hart323

Maybe you can extend validity poller to give possibility to specify which method/arguments to run or just change send_command with send_command_timing?

I can consider adding custom user-defined pollers via plugin settings

amyasnikov avatar Oct 03 '24 21:10 amyasnikov

I've implemented custom user-defined pollers mechanism. So, in the next version you'll be able to define your own poller (e.g. based on the current netmiko poller) with the parameters suitable for your case.

amyasnikov avatar Nov 19 '24 22:11 amyasnikov

@amyasnikov @hart323 Is there anything Netmiko needs to do here? It is a bit unclear to me whether Netmiko + Mikrotik is still broken in the referenced Netmiko issues AND if it is broken, what the issue is and how to fix it?

ktbyers avatar Apr 03 '25 21:04 ktbyers

Hey @ktbyers, AFAIK the last version of Netmiko still has problems with getting full config (/export command) from Routeros 7 devices. The result of send_command appears to be truncated.

The code from this PR https://github.com/ktbyers/netmiko/pull/3467 solves the issue in my case

amyasnikov avatar Apr 04 '25 11:04 amyasnikov

@amyasnikov Okay I reviewed/merged that referenced PR.

ktbyers avatar Apr 04 '25 17:04 ktbyers