mtools
mplotqueries should store references to log lines, not in-ram copies of the whole log line
mplotqueries stores the original log line along with the parsed info, so that it can output it when points are clicked. However, it would be a lot better to instead store a filename + byte offset (where possible, ie. when reading from a rewindable and/or seekable file), to avoid eating up impossible amounts of memory on very very large logfiles.
Alternatively, when the logfile is an actual file (ie. not a pipe), it could be mmapped, which would potentially allow for faster reading (especially when plotting and replotting the same file over and over), and fast/easy access back to the original log lines without having to use lots of ram.
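A minimal sketch of the filename + byte offset idea, assuming a regular seekable file. The helper names `index_line_offsets` and `read_line_at` are made up for illustration and are not part of mtools:

```python
def index_line_offsets(path):
    """Return the byte offset of the start of every line in `path`."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)   # where this line begins
            pos += len(line)      # advance by the raw byte length, newline included
    return offsets


def read_line_at(path, offset):
    """Seek to a stored offset and return that single line as text."""
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.readline()
    return raw.rstrip(b"\r\n").decode("utf-8", errors="replace")
```

With this, each plotted point would only need to keep one integer (the offset) instead of the full line string.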
You're right. Probably storing the line number is more than enough. When a point is clicked then just open the file on that line number. Not sure if linecache could work here
Thanks, I like the linecache idea, that sounds like the way to go.
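For reference, a small sketch of what the linecache lookup could look like on a click event. The function name `original_line` is hypothetical. One caveat to check before relying on it: as far as I know, linecache reads and caches all lines of the file on first access, so on its own it may not reduce memory use for very large logfiles.

```python
import linecache


def original_line(path, line_no):
    """Return line `line_no` (1-based) of `path` without the trailing newline.

    linecache.getline() returns an empty string if the line cannot be read.
    """
    return linecache.getline(path, line_no).rstrip("\n")
```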
Some notes:
use namedtuples to only store the fields needed, which are:
- line_no (use linecache to get back line for click event)
- datetime (x-axis value)
- duration (possibly, for --optime-start flag)
- self.field (y-axis value)
- group (see below)
Issue with grouping. The grouping is currently a function that takes the logevent and calculates the group dynamically. Instead, pre-calculate the group value in add_line() (it doesn't change during the lifetime of a single plot_instance), and add an additional field group to the tuple.
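A rough sketch of such a tuple, with the group computed once when the point is created. `PlotPoint`, `make_point` and the `value` field name (standing in for self.field) are placeholders chosen here, and `logevent` is assumed to be any object exposing the needed attributes:

```python
from collections import namedtuple

# Only the fields needed for plotting and for finding the line again.
PlotPoint = namedtuple(
    "PlotPoint", ["line_no", "datetime", "duration", "value", "group"]
)


def make_point(line_no, logevent, field, group_by):
    """Build a compact record; the group is evaluated here, not at plot time."""
    return PlotPoint(
        line_no=line_no,
        datetime=logevent.datetime,
        duration=getattr(logevent, "duration", None),
        value=getattr(logevent, field),
        group=group_by(logevent),
    )
```

Something like this could be called from add_line(), so the logevent itself (and its line string) can be discarded afterwards.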
What about stdin? Need to additionally store line_str.
stdin is just a special case of a file that can't be seeked. It's also possible to have such a file passed on the command line (eg. using bash's "<()" construct, or using mkfifo).
I would suggest the following approach. Change the rest of the code to not store line_str, but rather the line number within the file, which is used as an indirect reference back into the file. Define an abstract "Logfile" class. This has 3 actual implementations, which are tried in turn:
- MappedLogfile: tries to mmap the file into memory. As the lines are scanned (from the mapped memory) the first time, it keeps an array of the offset of the start of each line. This allows fast subsequent random access by line number, with basically no penalty. Mmapping logfiles also has the huge advantage that re-plotting the same file over and over (often the case when exploring a file) will generally be a lot faster.
- SeekableLogfile: If MappedLogfile fails (eg. the mmap fails, for whatever reason), then this class can be tried. It will open the file normally, and check that fseek()ing is possible. Then it will read through the file as is normally done, also keeping an array of line numbers and offsets. Then, for later random access to certain lines, it will fseek() to the correct offset and read until newline.
- CachedLogfile: If SeekableLogfile fails to initialise (eg. stdin), then fall back to this class. It's the same, except lines that are read in are fully cached in an array, and then just looked up directly later. I would suggest having a maximum memory size for the array, which is customisable on the command line and defaults to 1GB.
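To make the fallback order concrete, here is a rough sketch of the three classes. Everything beyond the class names is an assumption: the `get_line()` method, the `open_logfile()` helper, the use of MemoryError for the size limit, and passing in an already opened binary stream so that a pipe is only opened once.

```python
import mmap


class Logfile:
    """Abstract base: look up the original line by its 1-based number."""

    def get_line(self, line_no):
        raise NotImplementedError

    @staticmethod
    def _text(raw):
        return raw.rstrip(b"\r\n").decode("utf-8", errors="replace")


class MappedLogfile(Logfile):
    def __init__(self, stream):
        # Raises OSError or ValueError if the stream cannot be mapped
        # (pipes, empty files, in-memory streams without a file descriptor).
        self._map = mmap.mmap(stream.fileno(), 0, access=mmap.ACCESS_READ)
        self._offsets = [0]
        pos = self._map.find(b"\n")
        while pos != -1:
            self._offsets.append(pos + 1)
            pos = self._map.find(b"\n", pos + 1)
        if self._offsets[-1] >= len(self._map):
            self._offsets.pop()  # the file ended with a newline

    def get_line(self, line_no):
        start = self._offsets[line_no - 1]
        end = self._map.find(b"\n", start)
        if end == -1:
            end = len(self._map)
        return self._text(self._map[start:end])


class SeekableLogfile(Logfile):
    def __init__(self, stream):
        if not stream.seekable():
            raise OSError("stream is not seekable")
        self._stream = stream
        self._offsets = []
        pos = 0
        for line in stream:
            self._offsets.append(pos)
            pos += len(line)

    def get_line(self, line_no):
        self._stream.seek(self._offsets[line_no - 1])
        return self._text(self._stream.readline())


class CachedLogfile(Logfile):
    def __init__(self, stream, max_bytes=1 << 30):  # 1GB default, as suggested
        self._lines = []
        used = 0
        for line in stream:
            used += len(line)
            if used > max_bytes:
                raise MemoryError("input exceeds the configured cache size")
            self._lines.append(line)

    def get_line(self, line_no):
        return self._text(self._lines[line_no - 1])


def open_logfile(stream, max_cache_bytes=1 << 30):
    """Try mmap first, then seeking, then fall back to caching all lines."""
    for cls in (MappedLogfile, SeekableLogfile):
        try:
            return cls(stream)
        except (OSError, ValueError):
            continue
    return CachedLogfile(stream, max_cache_bytes)
```

In this sketch the stream has to be opened in binary mode (for stdin that would be `sys.stdin.buffer`), since the offsets are byte offsets.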
The other approach to dealing with non-seekable files is to cache them into a temporary disk file somewhere, somehow. I dislike this idea, because it makes it mtools's problem to find a writable location with sufficient disk space for the temporary file(s), and to clean them up later (which isn't always possible, eg. kill -9). I much prefer the policy that if you have output from a pipe that you want to plot, and it's "large" (as defined by the maximum CachedLogfile cache size above), then it's your job to pipe it into a file and then feed that file to mplotqueries. This pushes the decision of finding a writable location with enough space, and cleaning up the file afterwards, onto the user, but I don't mind that because the user is far better informed than mtools in these regards.