mlxprs
Add ML log browsing and monitoring tooling
Monitoring ML logs is a common pain point for developers, consultants and users. Adding functionality to assist with this to MLXPRS would improve the value of the tool and lessen development burden.
Important features:
- Support for Error, Access, and Request logs
- Ability to retrieve a date range of logs, including spanning multiple files; includes ability to define relative ranges (e.g., last hour)
- Ability to retrieve logs from a selection of hosts or all hosts at once; includes option to interlace log entries by time, regardless of host
- Ability to filter all types of logs by regex
- Near real-time log tailing (via polling)
- Ability to specify the port to use for retrieving logs (to avoid issues where the log retrieval itself adds to the logs you're trying to monitor)
- Ability to clear or force a rotation of log files
- Some basic level of built-in parsing and highlighting/formatting (severity, source IPs, timestamps, etc.)
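The "interlace log entries by time" idea could be sketched roughly like this merge, assuming each entry carries a sortable ISO-8601 timestamp once parsed. The function name, entry shape, and host names here are hypothetical, not the actual MarkLogic log format:

```javascript
// Sketch: interleave parsed log entries from multiple hosts by timestamp.
// Entry shape ({ timestamp, line }) is an assumption for illustration;
// a real implementation would parse the MarkLogic log format first.
function interlace(logsByHost) {
  return Object.entries(logsByHost)
    .flatMap(([host, entries]) => entries.map(e => ({ host, ...e })))
    // ISO-8601 UTC timestamps sort correctly as plain strings.
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
}

const merged = interlace({
  'host-1': [
    { timestamp: '2024-01-01T10:00:05Z', line: 'Info: request finished' },
    { timestamp: '2024-01-01T10:00:01Z', line: 'Info: request started' },
  ],
  'host-2': [
    { timestamp: '2024-01-01T10:00:03Z', line: 'Debug: query evaluated' },
  ],
});
```

A k-way merge over already-sorted per-host streams would scale better for tailing, but a flatten-and-sort is enough for a fetched date range.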
Maybe features:
- Additional filesystem-level log types (arbitrarily defined directory to monitor / patterns for file selection)
- Database-level log types (DHF jobs, meters, etc.)
- Regex- or severity-based notifications
There will have to be significant modifications to the server side to support a lot of that.
I understand the single-host developer scenarios, and those could quite easily be handled with MarkLogic in a container. VS Code has good Docker integration, so that could be a good option.
For log collection in a clustered environment, can you expand on what the developer use cases are for doing that from an IDE? There are log aggregation tools that are great at collecting and filtering logs for analysis. Maybe describe a few specific developer workflows that you would envision?
I did some testing and playing around this week, and I think most of it could be achieved (non-optimally) without server modifications. For example, I was able to do something like this to handle polling a tail:
```javascript
// MarkLogic server-side JavaScript: read only the bytes appended since
// the last poll, then decode them as text.
let lastLength = 164275;
const data = {};
data.fileName = "8002_AccessLog.txt";
data.path = "./Data/Logs";
data.length = xdmp.filesystemFileLength(data.path + '/' + data.fileName);
data.stream = xdmp.externalBinary(data.path + '/' + data.fileName, lastLength, data.length - lastLength + 1);
data.text = xdmp.binaryDecode(data.stream, "UTF-8");
// Write back data.length to lastLength and display results, if any, then loop
```
I agree that core server improvements would make a lot of this easier/more efficient.
For use cases: as a consultant I rarely have access to the infrastructure the customer uses for log aggregation; I am almost always working with the manage/LATEST/logs endpoint. One case I ran into just today was debugging an error between a client query tool (Metaphactory) and MarkLogic. The client tool would issue a SPARQL query to the load balancer, it would go to a random endpoint for evaluation, and I had some xdmp.log calls in there to try to narrow down the issue. Finding those specific calls on a cluster with other traffic was particularly time consuming. Using Access or Request logs to debug endpoint access and permission assignment with LDAP is another especially difficult task. Even when writing queries in QConsole or mlxprs, I rarely use xdmp.log, instead preferring to return a debug object with my results.
I would like to have a tailed log open in one VS Code window and my module open in another, then be able to quickly make a change, push the code, run a query from a third-party tool, and see the results of my change.
For getting local files from multiple hosts, this is more than a bit hacky, so I could see a good argument for not doing it, but I've used the approach in the past with a lot of success:
https://gist.github.com/ableasdale/ed88160d9b384e43fb8320d4f38c127e
Another option would be to open connections directly to each host individually and require that the network be set up for that to work.
I have written scripts that accomplish a pseudo-tail in the past. The start, end, and regex options only work against the error log, though.
Based on the description of your use cases, I wouldn't think that would go into a developer IDE. That said, just being able to "tail" the Error* logs as a developer is valuable, so I would recommend scoping this ticket to that and having discussions around the larger cluster-wide log aggregation and analysis requirements.
I was going to do the start/end/regex filtering locally when handling Access/Request logs. Doing a relative start (e.g., "give me the last hour") should be pretty easy by reading chunks of the file backwards from the end until you find the start time you're looking for.
But I agree that doing just the error logs is significantly easier, thanks to xdmp:logfile-scan, and probably captures the majority of the value to start with. I should actually look at how the manage/LATEST/logs endpoint works; you can get logs from each host there, so there must be a non-hacky mechanism to do it.
There's a new branch, `log-tools`, for this. Still a work in progress, but far enough along for testing.