accumulo We really need to use formatters

For maintenance purposes of our systems, we rely heavily on formatters and their use in the accumulo shell. It appears that this was deprecated in 2.x and removed in 3.x this year (#3265). Using JShell is simply too cumbersome when compared to using the accumulo shell. I am opening up this ticket to start a conversation in this regard.

Aug 28 '23 20:08 ivakegg

First, I just want to mention I'm not sure this is the best venue for a discussion. This is probably better suited to the developer mailing list. But I'll respond here for now.

It's worth noting that this was something that we didn't quite finish cleaning up in 2.1 and 3.0 (my fault... I had this on my personal TODO list, but it got lost), so some stuff is still there and hasn't been removed. The scan interpreter was removed, but the formatter still partially remains as a per-table property. I would argue for completing that removal, rather than reversing course, though.

For background, there is quite a history with this. The earliest I could find relates to a proposal in 2013 to try to replace the shell with a suite of more narrowly tailored utilities (ACCUMULO-1045), in the spirit of Unix's "Make each program do one thing well" philosophy, so that a bloated shell was not needed, and we could reduce a lot of complexity by empowering users to utilize tools in their OS's shell rather than reimplement those features ourselves in Accumulo's shell with a lot of maintenance overhead. Formatting the output using Linux tools (sed, awk, perl, cut, tr, head, tail, paste, etc.) was just one of the possible options. Modern Java can execute a .java source file with a main method without first compiling and packaging it, and it ships with JShell, which comes loaded with a bunch of features for interactively scripting Java. We also now have a simple-to-use AccumuloClient object that can be initialized automatically using a client configuration file. Together, these make it trivial to interact with Accumulo without using the provided shell. So, we are closer today than we ever have been to being able to get rid of the shell in favor of separate tools, though not quite in the way I proposed in ACCUMULO-1045. Now, most of the shell's interactivity features could probably be replaced by JShell, and the few remaining use cases can be easily scripted by users on-demand for their specific cases. So, in a way, I think we're even better situated to get rid of the Accumulo shell than what I proposed on that old JIRA.

We revisited the discussion when we started looking at the complexity and inconvenience of the ScanInterpreter in 2019 with #1138. It was there that we first actually proposed to remove the formatter and interpreter options, without a drastic change in the Accumulo shell. It finally got done over a year ago when #2806 was merged to address #2787, which itself was created because of the user confusion the current code caused (#2210).

While this change happened recently, it's been in the works for years, with on-and-off discussions about it. It basically boils down to this:

The feature adds a lot of complexity, for a small convenience. There are now many much simpler ways for users to achieve the same thing. Even if JShell isn't used, it's trivial to write a small AccumuloClient program that uses the same config file, scans a table, and outputs lines in any format the user wishes.
This feature tried to bridge the gap for users between the limited maintenance functionality that the shell is intended for, and the way they actually think about their data. Unfortunately, it did so by creating confusion, because Accumulo now has two views of the data: how it's actually stored, and the way users think about it. This creates as many problems as it solves, not the least of which was that it did this inconsistently between similar shell commands.
It bloats the shell and makes users more dependent on Accumulo's shell to satisfy every little niche use case they might have, rather than being a general purpose tool for maintenance. We want to balance the convenience of having a general purpose maintenance tool against keeping the shell's scope limited and easy to maintain. We don't want to go too far and have the shell become the default means by which users scan their data, a bloated and feature-laden tool that is intended to handle nearly every use case out of an abundance of convenience (that's not sustainable).

Ultimately, I don't think we need this feature in Accumulo anymore to support the use case it was originally intended for. I think I could probably help you write a much more narrowly focused set of tools to handle your use case. Before we try to reintroduce this feature, I'm happy to discuss your specific use cases, and to help your team consider some alternatives that I think might be better overall.

Aug 29 '23 02:08 ctubbsii

I could probably start helping by providing some examples of alternatives, if you could highlight some specific pain points with the alternatives you've tried so far.

Aug 30 '23 22:08 ctubbsii

"The feature adds a lot of complexity, for a small convenience." I strongly disagree with this. For all of those who have to maintain our systems, it becomes crucial to be able to read the key-value pairs in a human readable form. A typical example is where the value is encoded as a protobuf, which is completely unreadable without a formatter.
"Limited maintenance functionality." I strongly disagree with this. The shell is the main utility used by all kinds of scripts throughout our systems, including monitoring, to be able to scan various tables and verify certain activities are being performed properly. For example, listing the ongoing scans, verifying splits are being created on our sharded tables, generating alerts when too many files are backed up on a tablet, not to mention the ability to track down various problems in our system related to queries or bad data.
The shell is a general purpose tool used for maintenance. I am not sure why you don't think this is the case.

I am ok if we split out the shell into a separate project, but we require this capability (including formatters) and will not be able to migrate to 3.x unless this functionality still exists. Note that we do not need formatters configured on the tables, but we do need to be able to specify them when executing a scan in the shell.

Sep 06 '23 15:09 ivakegg

(this response was originally discussed with @ivakegg outside of this ticket; I'm sharing here, to summarize that conversation and to share my views with the wider community)

to be able to read the key-value pairs in a human readable form.

Definitely agree that we want to preserve that convenience. The "small" convenience I was referring to was the ability to do it inside the Accumulo shell, specifically, vs. being able to do it at all. Being able to do it is essential, but being able to do it inside the Accumulo shell is not, if other ways are provided. I think we can provide those other ways, which can satisfy this use case, but also empower many others.

In the meantime, I think we can restore the formatter flag specifically in the shell, since the alternate ways are not yet ubiquitous. But, I think we should keep the interpreters out, as well as the table configuration that applies a formatter to every scan automatically, because those are where a lot of user confusion and complexity come in.

"Limited maintenance functionality." I strongly disagree with this. [snip]

The shell is a general purpose tool used for maintenance. I am not sure why you don't think this is the case.

I agree it's a general purpose tool. By "limited", all I mean is that it's not as powerful or efficient, or intended to be, as using the full Java API, which I don't think is in dispute. I think the examples you gave are great examples of the kinds of "limited maintenance functions" it can support that I was referring to. I am not suggesting removing support for any of these uses. But I do think there is room to improve how we support these uses, and I think that's where we were going with the previous changes.

I am ok if we split out the shell into a separate project, but we require this capability (including formatters) and will not be able to migrate to 3.x unless this functionality still exists. Note that we do not need formatters configured on the tables, but we do need to be able to specify them when executing a scan in the shell.

The shell is already kind of a separate project (it's a separate module right now). Some of its functions that require access to server-side config files (like the fate admin tools) have already been moved to a separate distinct utility. Some ZK maintenance tasks also now exist as separate tools, but are a bit more low-level than what the current shell supports, which may be a little redundant, but I think it's fine for now. I think what is really missing here, and what would go a long way to support your use case, is a separate "scan" utility, specifically for scanning with similar command line options as in the current shell, but whose output you can pipe in your OS shell environment to tools like grep, awk, sed, cut, paste, etc. That would make it easy for users to write their own data views, not limited to the Formatter interface we previously supported. We could also provide support for Formatters, or a more limited BiFunction<Key,Value,String> version of it, in that tool, but could de-emphasize it to encourage users to maintain their application-specific views/transformations outside of Java, and in their OS shell environment. I'm not actually sure whether it would be more useful for users to maintain a set of Formatters or a set of pipe-able scan ... | myFormatter.sh capabilities. But, having a scan tool outside of the current shell would empower users to do the latter quite easily, with no code complexity in Accumulo at all... and that would be far more powerful than what you currently get with the accumulo-shell.
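
To make the BiFunction<Key,Value,String> idea a bit more concrete, here is a rough sketch of what such a formatter could look like. It is only an illustration under some assumptions: the names ScanFormatSketch, SimpleKey, HEX_FORMATTER, and formatAll are all hypothetical, SimpleKey and byte[] are plain-Java stand-ins for Accumulo's Key and Value so the example runs without a cluster, and no such scan tool exists today. A real tool would presumably apply the function to the entries returned by a scanner instead of an in-memory map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch; not part of Accumulo.
class ScanFormatSketch {

  // Simplified stand-in for an Accumulo Key (row, family, qualifier only).
  static final class SimpleKey {
    final String row;
    final String family;
    final String qualifier;

    SimpleKey(String row, String family, String qualifier) {
      this.row = row;
      this.family = family;
      this.qualifier = qualifier;
    }
  }

  // A formatter in the BiFunction shape: one tab-separated line per entry,
  // with the raw value bytes rendered as lowercase hex.
  static final BiFunction<SimpleKey, byte[], String> HEX_FORMATTER = (k, v) -> {
    StringBuilder hex = new StringBuilder();
    for (byte b : v) {
      hex.append(String.format("%02x", b));
    }
    return k.row + "\t" + k.family + ":" + k.qualifier + "\t" + hex;
  };

  // Applies a formatter to every entry, preserving iteration order.
  static List<String> formatAll(Map<SimpleKey, byte[]> entries,
      BiFunction<SimpleKey, byte[], String> formatter) {
    List<String> lines = new ArrayList<>();
    entries.forEach((k, v) -> lines.add(formatter.apply(k, v)));
    return lines;
  }

  public static void main(String[] args) {
    // In-memory entries standing in for the results of a table scan.
    Map<SimpleKey, byte[]> entries = new LinkedHashMap<>();
    entries.put(new SimpleKey("row1", "cf", "cq"), new byte[] {0x0a, 0x01});
    // One line per entry on stdout, ready to pipe to grep, awk, cut, etc.
    formatAll(entries, HEX_FORMATTER).forEach(System.out::println);
  }
}
```

Because each entry becomes one tab-separated line on stdout, either approach stays open: a user could swap in a different BiFunction (for example, one that decodes a protobuf value), or keep the default output and do the transformation downstream in a script.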

Sep 08 '23 11:09 ctubbsii