ecchronos
Possibility for repairs to never be triggered
Since ecChronos assumes tables are repaired when there is no repair history, repairs may never be triggered if ecChronos restarts or crashes once every repair interval, before a repair has actually been triggered.
Possible test case:
- Configure the ecChronos schedule interval to use minutes instead of days (the default configuration uses days)
- Start ecChronos with automatic repair enabled
- During the schedule interval, force a restart (a consolidated sketch of this follows the list). This can be done in two ways. Either kill the Java process and run it again:
kill -15 <pid>
nohup java -jar ecchronos-binary-5.0.1-SNAPSHOT.jar > /dev/null 2>&1 &
Or, if you are running in a container, just execute:
docker restart <container_name/id>
- Check the repair schedules (e.g. with ecctool schedules)
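A consolidated sketch of the restart loop, assuming ecChronos is started with ./bin/ecctool start -f as elsewhere in this thread and the interval has been shortened to roughly 10 minutes; the jar name is taken from the startup log below and the sleep values are illustrative only:

#!/bin/bash
# Restart ecChronos just before the (shortened) repair interval elapses,
# so a repair never gets a chance to run.
while true; do
    ./bin/ecctool start -f &
    sleep 540                                              # a bit less than the 10-minute interval
    kill -15 "$(pgrep -f application-5.0.0-SNAPSHOT.jar)"  # stop the ecChronos Java process
    sleep 10
done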
Some considerations:
I believe this case of ecChronos restarting during the interval is valid, but I don't see how all ecChronos instances could always be restarting; if one is, the others will be running repairs on their nodes and repair_history will receive data.
Where can this jar be found? It does not exist in my environment:
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ pwd
/home/epkdaek/cassandra/ecchronos-binary-5.0.1-SNAPSHOT
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ find . -name ec*.jar
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$ ls lib/e*
lib/ecchronos-binary-5.0.1-SNAPSHOT.pom  lib/error_prone_annotations-2.18.0.jar
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.1-SNAPSHOT$
Should the restart suggested to trigger the scenario be the same as starting ecc via
./bin/ecctool start -f
then stopping it with Ctrl-C and starting it again with the same command?
And I assume the change from days to minutes goes in conf/ecc.yml, but which parts? This section?
repair:
  ##
  ## A class for providing repair configuration for tables.
  ## The default FileBasedRepairConfiguration uses a schedule.yml file to define per-table configurations.
  ##
  provider: com.ericsson.bss.cassandra.ecchronos.application.FileBasedRepairConfiguration
  ##
  ## How often repairs should be triggered for tables.
  ##
  interval:
    time: 7
    unit: days
?
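(For what it's worth, assuming those are the relevant keys and that minutes is accepted as a unit here, as the test case above suggests, shortening the interval for this test would presumably look like the following; the 10-minute value matches what is used later in the thread:)
repair:
  provider: com.ericsson.bss.cassandra.ecchronos.application.FileBasedRepairConfiguration
  interval:
    time: 10
    unit: minutes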
Yes, I found out that it really does not exist; you can try it in the way you suggested.
After discussion with the author, this is how to reproduce the issue/bug.
Run "ecctool schedules" and note "Completed at" and "Next repair":
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool schedules
Snapshot as of 2024-02-26 10:48:55
| Id                                   | Keyspace  | Table                   | Status    | Repaired(%) | Completed at        | Next repair         | Repair type |
| 1c2acb70-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock                    | COMPLETED | 100.00      | 2024-02-26 09:59:30 | 2024-03-04 09:58:25 | VNODE       |
| 1ca2e1a0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | on_demand_repair_status | COMPLETED | 100.00      | 2024-02-26 10:01:22 | 2024-03-04 10:00:15 | VNODE       |
| 1c6674e0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | reject_configuration    | COMPLETED | 100.00      | 2024-02-26 10:03:15 | 2024-03-04 10:02:08 | VNODE       |
| 1c4a8870-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock_priority           | COMPLETED | 100.00      | 2024-02-26 10:05:09 | 2024-03-04 10:04:01 | VNODE       |
| 1cbc0ef0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | repair_history          | COMPLETED | 100.00      | 2024-02-26 10:07:03 | 2024-03-04 10:01:38 | VNODE       |
Summary: 5 completed, 0 on time, 0 blocked, 0 late, 0 overdue
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$
-- Truncate repair_history, restart ecChronos, and run "ecctool schedules" again:
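(Assuming the ecc-provided history table in the ecchronos keyspace is the one in use here - depending on configuration the history may instead live in system_distributed.repair_history - the truncate can be done from cqlsh, e.g.:)
cqlsh -e "TRUNCATE ecchronos.repair_history;"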
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool schedules
Snapshot as of 2024-02-26 10:50:42
| Id                                   | Keyspace  | Table                   | Status    | Repaired(%) | Completed at        | Next repair         | Repair type |
| 1c2acb70-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock                    | COMPLETED | 100.00      | 2024-02-20 10:50:13 | 2024-02-27 10:50:13 | VNODE       |
| 1c4a8870-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | lock_priority           | COMPLETED | 100.00      | 2024-02-20 10:50:13 | 2024-02-27 10:50:13 | VNODE       |
| 1ca2e1a0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | on_demand_repair_status | COMPLETED | 100.00      | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE       |
| 1c6674e0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | reject_configuration    | COMPLETED | 100.00      | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE       |
| 1cbc0ef0-ba0a-11ee-aa7e-c71d2ee1d829 | ecchronos | repair_history          | COMPLETED | 100.00      | 2024-02-20 10:50:14 | 2024-02-27 10:50:14 | VNODE       |
Summary: 5 completed, 0 on time, 0 blocked, 0 late, 0 overdue
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$
After every restart these times are recalculated without the repair job being executed.
The repair jobs seem to execute just fine. I changed the repair schedule to 10 min in ecc.yml before starting.
epkdaek@elx721027t9:~/cassandra/ecchronos-binary-5.0.0-SNAPSHOT$ ./bin/ecctool start -f
ecc started with pid 1066218
[Spring Boot ASCII-art startup banner]
:: Spring Boot :: (v2.7.17)
11:46:28.754 [main] INFO c.e.b.c.e.a.spring.SpringBooter - Starting SpringBooter using Java 11.0.21 on elx721027t9 with PID 1066218 (/home/epkdaek/cassandra/ecchronos-binary-5.0.0-SNAPSHOT/lib/application-5.0.0-SNAPSHOT.jar started by epkdaek in /home/epkdaek/cassandra/ecchronos-binary-5.0.0-SNAPSHOT)
11:46:28.757 [main] INFO c.e.b.c.e.a.spring.SpringBooter - No active profile set, falling back to 1 default profile: "default"
11:46:30.103 [main] INFO o.s.b.w.e.tomcat.TomcatWebServer - Tomcat initialized with port(s): 8080 (http)
11:46:30.112 [main] INFO o.a.coyote.http11.Http11NioProtocol - Initializing ProtocolHandler ["http-nio-127.0.0.1-8080"]
11:46:30.114 [main] INFO o.a.catalina.core.StandardService - Starting service [Tomcat]
11:46:30.114 [main] INFO o.a.catalina.core.StandardEngine - Starting Servlet engine: [Apache Tomcat/9.0.82]
11:46:30.241 [main] INFO o.a.c.c.C.[Tomcat].[localhost].[/] - Initializing Spring embedded WebApplicationContext
11:46:30.242 [main] INFO o.s.b.w.s.c.ServletWebServerApplicationContext - Root WebApplicationContext: initialization completed in 1429 ms
11:46:30.487 [main] INFO c.e.b.c.e.a.DefaultNativeConnectionProvider - Connecting through CQL using localhost:9042, authentication: false, tls: false
11:46:33.472 [s1-admin-0] INFO c.e.b.c.e.c.DataCenterAwarePolicy - Using provided data-center name 'datacenter1' for DataCenterAwareLoadBalancingPolicy
11:46:33.487 [main] INFO c.e.b.c.e.a.DefaultJmxConnectionProvider - Connecting through JMX using localhost:7100, authentication: false, tls: false
11:46:33.728 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.lock is new, next repair 2024-02-27 11:56:33
11:46:33.784 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.lock_priority is new, next repair 2024-02-27 11:56:33
11:46:33.818 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.on_demand_repair_status is new, next repair 2024-02-27 11:56:33
11:46:33.870 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.reject_configuration is new, next repair 2024-02-27 11:56:33
11:46:33.895 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table ecchronos.repair_history is new, next repair 2024-02-27 11:56:33
11:46:34.283 [main] INFO o.s.b.a.e.web.EndpointLinksResolver - Exposing 1 endpoint(s) beneath base path '/actuator'
11:46:34.310 [main] INFO o.a.coyote.http11.Http11NioProtocol - Starting ProtocolHandler ["http-nio-127.0.0.1-8080"]
11:46:34.325 [main] INFO o.s.b.w.e.tomcat.TomcatWebServer - Tomcat started on port(s): 8080 (http) with context path ''
11:46:34.344 [main] INFO c.e.b.c.e.a.spring.SpringBooter - Started SpringBooter in 6.0 seconds (JVM running for 6.44)
11:46:42.231 [http-nio-127.0.0.1-8080-exec-1] INFO o.a.c.c.C.[Tomcat].[localhost].[/] - Initializing Spring DispatcherServlet 'dispatcherServlet'
11:46:42.232 [http-nio-127.0.0.1-8080-exec-1] INFO o.s.web.servlet.DispatcherServlet - Initializing Servlet 'dispatcherServlet'
11:46:42.233 [http-nio-127.0.0.1-8080-exec-1] INFO o.s.web.servlet.DispatcherServlet - Completed initialization in 1 ms
11:57:03.807 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.lock
11:58:39.425 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.lock_priority
12:00:13.321 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.on_demand_repair_status
12:01:45.794 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.reject_configuration
12:03:18.858 [TaskExecutor-0] INFO c.e.b.c.e.c.s.ScheduleManagerImpl - Running task: VNODE repair group of ecchronos.repair_history
Since ecChronos always assumes a repair is successful if the history is empty, I don't see why this would be considered a bug if ecChronos is crashed/restarted before the interval is reached, once every interval. It sounds to me like the actual bug to investigate is why ecChronos crashes/restarts all the time. ;-)
Whether you consider this a bug or an enhancement doesn't matter. This is a scenario that can occur in the real world, and you won't even get any alarms, since ecChronos still thinks everything is repaired because the repair history is empty.
What is the expected behaviour? Right now the next repair is expected to execute at "now" + "configured schedule time", as shown by this log entry:
12:21:13.488 [RepairScheduler-0] INFO c.e.b.c.e.c.r.state.RepairStateImpl - Assuming the table
@masokol Then there are two options, I guess, to fix it:
- Repair immediately if the history is empty
- Put a fake repair date in the history if it is empty. If an empty history is assumed to reflect a proper repair, why not populate it with history reflecting the same? As a matter of fact, as soon as a table is detected as new, the history should be populated with the current date.
#2 is almost how it is done today, just that it recalculates if it is restarted again. Who can decide between #1 and #2, or whether there are more options?
Yes, and that's @masokol's point: the next repair date will keep moving forward if the history is empty and ecChronos is restarted. My two cents: if an empty repair history is assumed to mean "repaired", ecChronos should probably timestamp it as such in the history, unless there are other problems with that.
From what I've understood, the assumption that everything is repaired when the repair history is empty was made to avoid option 1. So maybe option 2, or some completely new solution. One could argue that an explicit delay, like a repair_delay for each schedule, might be a better option. Anyway, whichever solution is chosen, I think ecChronos should know whether it is starting for the first time or not.
The working assumption that was decided on is that when a new table is found, ecChronos should update the history as if a repair had been done now, without actually doing the repair, so that there is history information the next time it starts.
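As a standalone illustration only (this is not ecChronos code, and all names in it are made up), the difference between today's behaviour and the decided one boils down to persisting the assumed repair time the first time a table is seen, instead of recomputing "now + interval" on every restart:

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class NewTableAssumptionSketch {
    // Stand-in for the persisted repair history (in ecChronos this would be a
    // table surviving restarts, not an in-memory map).
    private final Map<String, Instant> history = new HashMap<>();

    // Decided behaviour: on first discovery, record "repaired now" once, then
    // always derive the next repair from what is stored in the history.
    Instant nextRepair(String table, Duration interval) {
        Instant lastRepaired = history.computeIfAbsent(table, t -> Instant.now());
        return lastRepaired.plus(interval);
    }

    // Current behaviour (simplified): nothing is written, so every restart
    // recomputes "now + interval" and the next repair keeps sliding forward.
    Instant nextRepairWithoutPersisting(Duration interval) {
        return Instant.now().plus(interval);
    }

    public static void main(String[] args) throws InterruptedException {
        NewTableAssumptionSketch sketch = new NewTableAssumptionSketch();
        Instant first = sketch.nextRepair("ks.tbl", Duration.ofDays(7));
        Thread.sleep(1000); // pretend a restart happened some time later
        Instant second = sketch.nextRepair("ks.tbl", Duration.ofDays(7));
        System.out.println(first.equals(second)); // true: the schedule no longer slides
    }
}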