Combining logs from multiple sensors into a single dataset
I've seen https://github.com/activecm/rita/issues/565, but I'm curious whether rita import appends to or replaces a dataset when it is fed from multiple sources. For example, if I had RITA running on two separate sensors and each sensor submitted its results to a central MongoDB instance via rita import, would the result be additive (with analysis re-performed, or results aggregated, to represent the whole dataset), or would the first set be overwritten by the second? It sounds like the logs need to be gathered from the separate sources and imported at the same time to be combined (as in the example rita import system1-logs system2-logs combined-dataset). I'm ultimately looking to do this on a rolling basis to see the last 24 hours for all sensors combined, and wanted to make sure an approach like this is possible.
For example:
- Host with MongoDB
- Sensor1: rita import sensor1logs sensor-data (--> external MongoDB)
- Sensor2: rita import sensor2logs sensor-data (--> external MongoDB)
I'll start by saying that this is possible and we do something similar in our commercial product: https://www.activecountermeasures.com/
The method we use is to transfer logs from the sensors to a host that has both mongo and rita. The logs get renamed and rita imports logs from all sensors from a given hour at once.
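To make the "rename and import together" idea concrete, here is a minimal sketch of the staging step. The helper name stage_sensor_logs and the sensor-name prefix are my own assumptions for illustration, not the naming scheme the commercial product uses, and it is worth checking that rita still picks up the renamed files in your setup.

```shell
#!/bin/sh
# Hypothetical helper (not part of rita): copy one sensor's logs into a shared
# staging directory, prefixing each file with the sensor name so that files
# from different sensors cannot overwrite each other.
# Usage: stage_sensor_logs SENSOR_NAME SOURCE_DIR STAGING_DIR
stage_sensor_logs() {
    sensor="$1"
    src="$2"
    dest="$3"
    mkdir -p "$dest" || return 1
    for f in "$src"/*.log "$src"/*.log.gz; do
        [ -e "$f" ] || continue   # skip glob patterns that matched nothing
        cp "$f" "$dest/${sensor}-$(basename "$f")" || return 1
    done
}
```

After staging every sensor's logs for the hour, a single rita import --rolling run against the staging directory would bring them all into the same chunk.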
You should also be able to do it with rita running on the sensors rather than on the mongo server, as you suggest. Here's some background on how rolling datasets and overwriting work in rita. Skip to the end for my suggestion.
If you try to import data into the same dataset as you suggest, I think rita will give you a message like this: "New data cannot be imported into a non-rolling database. Run with --rolling to convert this database into a rolling database." Here are some illustrative examples of what you can do:
This will overwrite sensor-data so that it only contains sensor2logs. It's a shortcut for rita delete followed by rita import.
rita import sensor1logs sensor-data
rita import --delete sensor2logs sensor-data
This will append sensor2logs to the sensor-data dataset so that it contains both. A rolling dataset defaults to having 24 chunks in it. After these commands chunk 0 will contain sensor1logs and chunk 1 will contain sensor2logs.
rita import sensor1logs sensor-data
rita import --rolling sensor2logs sensor-data
Chunks are intended to hold an hour of data each and each dataset is intended to hold a day's worth of data (which is why we default to 24). But as you can see in the previous examples both are more flexible than that. Chunks can hold arbitrary amounts of data as can datasets. You can read more about rolling datasets here and here.
If you specify a chunk that is already populated, it will be overwritten instead of appended to. This example will overwrite sensor1logs with sensor2logs. It is similar to the first example (rita import --delete) except that this dataset has been converted to a rolling dataset. The output of the show-* commands here would be the same as in the first example.
rita import --rolling sensor1logs sensor-data
rita import --chunk 0 sensor2logs sensor-data
With all that said, you can append to a dataset by using a different chunk. But you cannot append to a chunk. In our commercial product we import all the sensors at the same time into the same chunk. In your scenario you can't do this because your rita commands are run from different systems.
Here's what I would suggest: Make a single dataset with 48 chunks instead of 24 (or expand this to your number of sensors * 24). Then schedule the command on each rita system to run every hour. Each sensor will import its own hour of logs into a new chunk. Even though multiple chunks will cover the same time period that should still work fine as the entire dataset will be for a contiguous 24 hours.
rita import --rolling --numchunks 48 sensor1logs sensor-data
rita import --rolling --numchunks 48 sensor2logs sensor-data
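The chunk count above is just sensors * 24, so it can be derived rather than hard-coded. This is a small sketch under that assumption; build_import_cmd is a hypothetical helper that only prints the command so you can inspect it before running anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of rita): print, but do not run, the import
# command for a rolling dataset sized to hold 24 hourly chunks per sensor.
# Usage: build_import_cmd SENSOR_COUNT LOG_DIR DATASET
build_import_cmd() {
    sensors="$1"
    logs="$2"
    dataset="$3"
    numchunks=$((sensors * 24))   # one day of hourly chunks for each sensor
    echo "rita import --rolling --numchunks $numchunks $logs $dataset"
}
```

For two sensors, build_import_cmd 2 sensor1logs sensor-data prints the same command as the first line of the example above.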
You might need to be a little careful though. I'm not completely sure how rita will handle concurrent imports on the same dataset. You may need to stagger your import commands to be sure they don't end up trying to write to the same chunk and clobber each other. Or do some testing first to see whether rita handles that scenario correctly (we appreciate bug reports as well :smile: ).
Thanks for the detailed answer, @ethack ! I'll do some testing and report back to confirm the procedure, or let you know if I run into anything.
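One way to reduce the chance of two sensors landing on the same chunk, besides staggering the schedule, might be to give each sensor its own fixed chunk for each hour using --chunk. The sketch below is untested against rita and assumes --chunk can be combined with --rolling and --numchunks; chunk_for and print_import_cmd are hypothetical helper names, and sensors are numbered from 0.

```shell
#!/bin/sh
# Hypothetical helpers (not part of rita).
# chunk_for SENSOR_INDEX SENSOR_COUNT HOUR
# Maps (hour 0-23, sensor index) to a unique chunk in 0..(SENSOR_COUNT*24 - 1).
chunk_for() {
    index="$1"
    count="$2"
    hour="${3#0}"         # strip a leading zero so "08" is not parsed as octal
    hour="${hour:-0}"     # "0" becomes empty after stripping, so restore it
    echo $(( hour * count + index ))
}

# print_import_cmd SENSOR_INDEX SENSOR_COUNT HOUR LOG_DIR DATASET
# Prints the command instead of running it so it can be reviewed first.
print_import_cmd() {
    chunk=$(chunk_for "$1" "$2" "$3")
    echo "rita import --rolling --numchunks $(( $2 * 24 )) --chunk $chunk $4 $5"
}
```

Because a populated chunk is overwritten when it is specified again, each sensor would replace its own data for that hour 24 hours later, which keeps the dataset at roughly the last day for all sensors. The hour could come from date +%H on each sensor.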