ecchronos
Possibility for repairs to never be triggered
Since ecChronos assumes tables are repaired when there is no repair history, repairs may never be triggered if ecChronos restarts or crashes once every repair interval, before a repair has actually been triggered.
Possible test case:
- Configure the ecChronos schedule interval to use minutes instead of days (the default configuration uses days)
- Start ecChronos with automatic repair enabled
- During the schedule interval, force a restart (a consolidated sketch of this follows the list). This can be done in two ways. Either kill the Java process and run it again:
kill -15 <pid>
nohup java -jar ecchronos-binary-5.0.1-SNAPSHOT.jar > /dev/null 2>&1 &
Or, if you are running in a container, just execute:
docker restart <container_name/id>
- Check the repair schedules (e.g. with ecctool schedules)
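A consolidated sketch of the restart loop, assuming ecChronos is started with ./bin/ecctool start -f as elsewhere in this thread and the interval has been shortened to roughly 10 minutes; the jar name is taken from the startup log below and the sleep values are illustrative only:

#!/bin/bash
# Restart ecChronos just before the (shortened) repair interval elapses,
# so a repair never gets a chance to run.
while true; do
    ./bin/ecctool start -f &
    sleep 540                                              # a bit less than the 10-minute interval
    kill -15 "$(pgrep -f application-5.0.0-SNAPSHOT.jar)"  # stop the ecChronos Java process
    sleep 10
done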
Some considerations:
I believe this case of ecChronos restarting during the interval is valid, but I don't see how all ecChronos instances could always be restarting; if one is, the others will be running repairs on their nodes and repair_history will receive data.
Where can this jar be found? It does not exist in my environment:
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ pwd
/home/epkdaek/cassandra/ecchronos-binary-5.0.1-SNAPSHOT
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ find . -name ec*.jar
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ ls lib/e*
lib/ecchronos-binary-5.0.1-SNAPSHOT.pom  lib/error_prone_annotations-2.18.0.jar
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$
Should the restart suggested to trigger the scenario be the same as starting ecc via
./bin/ecctool start -f
then stopping it with Ctrl-C and starting it again with the same command?
And I assume the change from days to minutes goes in conf/ecc.yml, but which parts? This section?
repair:
  ##
  ## A class for providing repair configuration for tables.
  ## The default FileBasedRepairConfiguration uses a schedule.yml file to define per-table configurations.
  ##
  provider: com.ericsson.bss.cassandra.ecchronos.application.FileBasedRepairConfiguration
  ##
  ## How often repairs should be triggered for tables.
  ##
  interval:
    time: 7
    unit: days
?
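(For what it's worth, assuming those are the relevant keys and that minutes is accepted as a unit here, as the test case above suggests, shortening the interval for this test would presumably look like the following; the 10-minute value matches what is used later in the thread:)
repair:
  provider: com.ericsson.bss.cassandra.ecchronos.application.FileBasedRepairConfiguration
  interval:
    time: 10
    unit: minutes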
Yes, I found out that it really does not exist; you can try it in the way you suggested.
After discussion with the author, this is how to reproduce the issue/bug.
Run "ecctool schedules" and note "Completed at" and "Next repair":
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool schedules
Snapshot as of 2024-02-26 10:48:55
| Id                                   | Keyspace  | Table                   | Status    | Repaired(%) | Completed at        | Next repair         | Repair type |
| 1c2acb70-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock                    | COMPLETED | 100.00      | 2024-02-26 09:59:30 | 2024-03-04 09:58:25 | VNODE       |
| 1ca2e1a0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | on_demand_repair_status | COMPLETED | 100.00      | 2024-02-26 10:01:22 | 2024-03-04 10:00:15 | VNODE       |
| 1c6674e0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | reject_configuration    | COMPLETED | 100.00      | 2024-02-26 10:03:15 | 2024-03-04 10:02:08 | VNODE       |
| 1c4a8870-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock_priority           | COMPLETED | 100.00      | 2024-02-26 10:05:09 | 2024-03-04 10:04:01 | VNODE       |
| 1cbc0ef0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | repair_history          | COMPLETED | 100.00      | 2024-02-26 10:07:03 | 2024-03-04 10:01:38 | VNODE       |
Summary: 5 completed, 0 on time, 0 blocked, 0 late, 0 overdue
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$
-- Truncate repair_history, restart ecChronos, and run "ecctool schedules" again:
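(Assuming the ecc-provided history table in the ecchronos keyspace is the one in use here - depending on configuration the history may instead live in system_distributed.repair_history - the truncate can be done from cqlsh, e.g.:)
cqlsh -e "TRUNCATE ecchronos.repair_history;"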
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool schedules
Snapshot as of 2024-02-26 10:50:42
| Id                                   | Keyspace  | Table                   | Status    | Repaired(%) | Completed at        | Next repair         | Repair type |
| 1c2acb70-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock                    | COMPLETED | 100.00      | 2024-02-20 10:50:13 | 2024-02-27 10:50:13 | VNODE       |
| 1c4a8870-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock_priority           | COMPLETED | 100.00      | 2024-02-20 10:50:13 | 2024-02-27 10:50:13 | VNODE       |
| 1ca2e1a0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | on_demand_repair_status | COMPLETED | 100.00      | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE       |
| 1c6674e0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | reject_configuration    | COMPLETED | 100.00      | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE       |
| 1cbc0ef0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | repair_history          | COMPLETED | 100.00      | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE       |
Summary: 5 completed, 0 on time, 0 blocked, 0 late, 0 overdue
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$
After every restart these times are recalculated without the repair job being executed.
The repair jobs seem to execute just fine. I changed the repair schedule to 10 min in ecc.yml before starting.
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool start -f
ecc started with pid 1066218
[Spring Boot ASCII-art startup banner]
:: Spring Boot :: (v2.7.17)
11:46:28.754 [main] INFO c.e.b.c.e.a.spring.SpringBooter - Starting SpringBooter using Java 11.0.21 on elx721027t9 with PID 1066218 (/home/epkdaek/cassandra/ecchronos-binary-5.0.0-SNAPSHOT/lib/application-5.0.0-SNAPSHOT.jar started by epkdaek in /home/epkdaek/cassandra/ecchronos-binary-5.0.0-SNAPSHOT)
11:46:28.757 [main] INFO c.e.b.c.e.a.spring.SpringBooter - No active profile set, falling back to 1 default profile: "default"
11:46:30.103 [main] INFO o.s.b.w.e.tomcat.TomcatWebServer - Tomcat initialized with port(s): 8080 (http)
11:46:30.112 [main] INFO o.a.coyote.http11.Http11NioProtocol - Initializing ProtocolHandler ["http-nio-127.0.0.1-8080"]
11:46:30.114 [main] INFO o.a.catalina.core.StandardService - Starting service [Tomcat]
11:46:30.114 [main] INFO o.a.catalina.core.StandardEngine - Starting Servlet engine: [Apache Tomcat/9.0.82]
11:46:30.241 [main] INFO o.a.c.c.C.[Tomcat].[localhost].[/] - Initializing Spring embedded WebApplicationContext
11:46:30.242 [main] INFO o.s.b.w.s.c.ServletWebServerApplicationContext - Root WebApplicationContext: initialization completed in 1429 ms
11:46:30.487 [main] INFO c.e.b.c.e.a.DefaultNativeConnectionProvider - Connecting through CQL using localhost:9042, authentication: false, tls: false
11:46:33.472 [s1-admin-0] INFO c.e.b.c.e.c.DataCenterAwarePolicy - Using provided data-center name 'datacenter1' for DataCenterAwareLoadBalancingPolicy
11:46:33.487 [main] INFO c.e.b.c.e.a.DefaultJmxConnectionProvider - Connecting through JMX using localhost:7100, authentication: false, tls: false
11:46:33.728 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.lock is new, next repair 2024-02-27 11:56:33
11:46:33.784 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.lock_priority is new, next repair 2024-02-27 11:56:33
11:46:33.818 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.on_demand_repair_status is new, next repair 2024-02-27 11:56:33
11:46:33.870 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.reject_configuration is new, next repair 2024-02-27 11:56:33
11:46:33.895 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.repair_history is new, next repair 2024-02-27 11:56:33
11:46:34.283 [main] INFO o.s.b.a.e.web.EndpointLinksResolver - Exposing 1 endpoint(s) beneath base path '/actuator'
11:46:34.310 [main] INFO o.a.coyote.http11.Http11NioProtocol - Starting ProtocolHandler ["http-nio-127.0.0.1-8080"]
11:46:34.325 [main] INFO o.s.b.w.e.tomcat.TomcatWebServer - Tomcat started on port(s): 8080 (http) with context path ''
11:46:34.344 [main] INFO c.e.b.c.e.a.spring.SpringBooter - Started SpringBooter in 6.0 seconds (JVM running for 6.44)
11:46:42.231 [http-nio-127.0.0.1-8080-exec-1] INFO o.a.c.c.C.[Tomcat].[localhost].[/] - Initializing Spring DispatcherServlet 'dispatcherServlet'
11:46:42.232 [http-nio-127.0.0.1-8080-exec-1] INFO o.s.web.servlet.DispatcherServlet - Initializing Servlet 'dispatcherServlet'
11:46:42.233 [http-nio-127.0.0.1-8080-exec-1] INFO o.s.web.servlet.DispatcherServlet - Completed initialization in 1 ms
11:57:03.807 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.lock
11:58:39.425 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.lock_priority
12:00:13.321 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.on_demand_repair_status
12:01:45.794 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.reject_configuration
12:03:18.858 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.repair_history
Since ecChronos always assumes a repair is successful if the history is empty, I don't see why this would be considered a bug if ecChronos is crashed/restarted before the interval is reached, once every interval. It sounds to me like the actual bug to investigate is why ecChronos crashes/restarts all the time. ;-)
Whether you consider this a bug or an enhancement doesn't matter. This is a scenario that can occur in the real world, and you won't even get any alarms, since ecChronos still thinks everything is repaired because the repair history is empty.
What is the expected behaviour? Right now the next repair is expected to execute at "now" + "configured schedule time", as shown by this log entry:
12:21:13.488 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table
@masokol Then there are two options, I guess, to fix it:
- Repair immediately if the history is empty
- Put a fake repair date in the history if it is empty. If an empty history is assumed to reflect a proper repair, why not populate it with history reflecting the same? As a matter of fact, as soon as a table is detected as new, the history should be populated with the current date.
#2 is almost how it is done today, just that it recalculates if it is restarted again. Who can decide between #1 and #2, or whether there are more options?
Yes, and that's @masokol's point: the next repair date will keep moving forward if the history is empty and ecChronos is restarted. My two cents: if an empty repair history is assumed to mean "repaired", ecChronos should probably timestamp it as such in the history, unless there are other problems with that.
From what I've understood, the assumption that everything is repaired when the repair history is empty was made to avoid option 1. So maybe option 2, or some completely new solution. One could argue that an explicit delay, like a repair_delay for each schedule, might be a better option. Anyway, whichever solution is chosen, I think ecChronos should know whether it is starting for the first time or not.
The working assumption that was decided on is that when a new table is found, ecChronos should update the history as if a repair had been done now, without actually doing the repair, so that there is history information the next time it starts.
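As a standalone illustration only (this is not ecChronos code, and all names in it are made up), the difference between today's behaviour and the decided one boils down to persisting the assumed repair time the first time a table is seen, instead of recomputing "now + interval" on every restart:

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class NewTableAssumptionSketch {
    // Stand-in for the persisted repair history (in ecChronos this would be a
    // table surviving restarts, not an in-memory map).
    private final Map<String, Instant> history = new HashMap<>();

    // Decided behaviour: on first discovery, record "repaired now" once, then
    // always derive the next repair from what is stored in the history.
    Instant nextRepair(String table, Duration interval) {
        Instant lastRepaired = history.computeIfAbsent(table, t -> Instant.now());
        return lastRepaired.plus(interval);
    }

    // Current behaviour (simplified): nothing is written, so every restart
    // recomputes "now + interval" and the next repair keeps sliding forward.
    Instant nextRepairWithoutPersisting(Duration interval) {
        return Instant.now().plus(interval);
    }

    public static void main(String[] args) throws InterruptedException {
        NewTableAssumptionSketch sketch = new NewTableAssumptionSketch();
        Instant first = sketch.nextRepair("ks.tbl", Duration.ofDays(7));
        Thread.sleep(1000); // pretend a restart happened some time later
        Instant second = sketch.nextRepair("ks.tbl", Duration.ofDays(7));
        System.out.println(first.equals(second)); // true: the schedule no longer slides
    }
}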