
More explicit failure indication in cbt run.

Open bdastur opened this issue 8 years ago • 6 comments

When executing the cbt.py test suite, it is very hard to figure out which steps failed or passed. My experience with this tool is limited since I just started using it, but I see that the pdsh commands fail without any error, so it is hard to decipher why.

Also, the use_existing flag in the cluster: section of the YAML file should be highlighted when running against an existing cluster. Once I get through a successful execution I will create a pull request for any doc changes, if that makes sense, and file other issues as I see them.
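For reference, a minimal sketch of the relevant piece of the CBT YAML; the hostnames, user, and list values here are placeholders, and the exact set of keys may differ between CBT versions. The point is that `use_existing: True` under `cluster:` tells CBT not to tear down and rebuild the cluster:

```yaml
cluster:
  user: 'ceph'
  head: 'head-node'
  clients: ['client-node']
  osds: ['osd-node']
  mons: ['mon-node']
  use_existing: True
```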

Another issue I see is that the username and group name are assumed to be the same, which is not always the case. It might be useful to add a separate group field as well.

Lastly: I think I have gotten past some of my initial hurdles and am able to execute an fio benchmark, but I am not sure what comes next.

The last step I see is:

21:30:37 - DEBUG - cbt - pdsh -R ssh -w [email protected],[email protected],[email protected] sudo chown -R behzad_dastur.behzad_dastur /tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite/*
21:30:37 - DEBUG - cbt - rpdcp -f 1 -R ssh -w [email protected],[email protected],[email protected] -r /tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite/* /tmp/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite

I can see logs created at:

[root@cbtvm001-d658 cbt]# ls /tmp/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/read/
collectl.b-stageosd001-r19f29-prod.acme.symcpe.net
collectl.v-stagemon-001-prod.abc.acme.net
collectl.v-stagemon-002-prod.abc.acme.net
historic_ops.out.b-stageosd001-r19f29-prod.abc.acme.net
output.0.v-stagemon-001-prod.abc.acme.net

Are there ways to now visualize this data?

bdastur avatar Mar 31 '16 21:03 bdastur

The last thing CBT does is copy the logs and output files from the nodes/clients over to the head node. This is all raw data plus FIO summary output, so you need to write a parser if you want to visualize the data as cluster performance.
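As a starting point for such a parser, here is a minimal sketch that pulls bandwidth and IOPS out of a classic fio summary line. The summary format shown (`write: io=..., bw=...KB/s, iops=..., runt=...`) is an assumption based on older fio releases; newer fio versions format these lines differently, so the regex would need adjusting:

```python
import re

# Matches iops and bandwidth in an (assumed) classic fio summary line,
# e.g. "write: io=1024.0MB, bw=104857KB/s, iops=25600, runt= 10001msec"
SUMMARY_RE = re.compile(r"(read|write)\s*:\s*io=\S+,\s*bw=(\d+)KB/s,\s*iops=(\d+)")

def parse_fio_summary(text):
    """Return a list of (direction, bw_kb_s, iops) tuples found in fio output."""
    results = []
    for match in SUMMARY_RE.finditer(text):
        direction, bw, iops = match.groups()
        results.append((direction, int(bw), int(iops)))
    return results

sample = "write: io=1024.0MB, bw=104857KB/s, iops=25600, runt= 10001msec"
print(parse_fio_summary(sample))  # [('write', 104857, 25600)]
```

You would run this over each `output.0.*` file CBT copied back to the head node, then feed the tuples into whatever plotting tool you prefer.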

ommoreno avatar Mar 31 '16 23:03 ommoreno

Thanks for confirming/clarifying @ommoreno .

bdastur avatar Apr 01 '16 14:04 bdastur

See fiologparser.py in the axboe/fio tree under tools/; it is in the process of being improved by Mark and Karl Cronburg. Error checking is being tightened up; see PRs #107 and #110.

bengland2 avatar Jul 21 '16 19:07 bengland2

I'm running cbt on an existing cluster, but I'm not getting any output in the "output.0" file. All I'm getting is some output in "historic_ops.out". I tried running both the librbdfio and rados benchmarks.

sand33p-23 avatar Oct 19 '16 09:10 sand33p-23

Try running the fio or rados bench command standalone and see if you get an error. Then walk backwards in the command list until you find the first command that failed.
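As a sketch of such a standalone run, here is a minimal fio job file for the librbd engine. The pool, image, and client names are placeholders you would replace with the values from your CBT YAML, and the image must already exist:

```ini
; standalone-rbd.fio -- run with: fio standalone-rbd.fio
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=cbt-test-img
rw=randwrite
bs=4k
runtime=30
direct=1

[standalone-rbd-test]
iodepth=64
```

If this fails, the error fio prints (e.g. a missing image or auth failure) is usually the same problem that CBT is silently swallowing.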

I added code to CBT to check for failures while constructing the cluster and to throw an exception if one occurs, but I did not enable failure checking everywhere. There are cases where some users may find it useful to ignore a single failure, such as a test that constructs a 1000-OSD cluster and encounters a single bad disk. You can turn it on anywhere you like by adding ", continue_if_error=False" as the last parameter in the common.pdsh calls in the CBT code.
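The pattern looks roughly like the following. This is a simplified stand-in for CBT's actual `common.pdsh` helper (the real signature and return value may differ); it just shows the effect of passing `continue_if_error=False`:

```python
import subprocess

def pdsh(command, continue_if_error=True):
    """Simplified stand-in for CBT's common.pdsh wrapper: run a command
    and, when continue_if_error is False, raise on a non-zero exit."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    if proc.returncode != 0 and not continue_if_error:
        raise RuntimeError(
            f"command failed (rc={proc.returncode}): {command}\n{proc.stderr}")
    return proc

# Default behaviour: the failure is silently tolerated.
pdsh("exit 1")

# With error checking enabled, the same failure raises immediately.
try:
    pdsh("exit 1", continue_if_error=False)
except RuntimeError as err:
    print("caught:", err)
```

Flipping that flag in the common.pdsh calls along the failing code path is the quickest way to surface the first command that actually broke.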

It sounds like your cluster built, since you are seeing historic_ops.out results. What happens when you run the rados bench command that CBT runs by itself? Also look in benchmark/radosbench.py and enable error checking there, so that CBT will tell you what's going wrong.

bengland2 avatar Oct 19 '16 12:10 bengland2

Thanks for the steps. I got cbt running after lots of troubleshooting. I really need to document the steps so I won't hit these issues when running it on another cluster.

sand33p-23 avatar Oct 19 '16 17:10 sand33p-23