sslr icon indicating copy to clipboard operation
sslr copied to clipboard

CodeBuffer: implement .subSequence(), plus tests

Open fge opened this issue 11 years ago • 18 comments

There is no reason why CodeBuffer isn't able to support it.

Just use a CharBuffer to .wrap() the buffer, and use this CharBuffer's .subSequence() with the indices adjusted according to the buffer's current position.

fge avatar Mar 22 '15 12:03 fge

Thanks for PR, it will be reviewed during next sprint on SSLR.

Godin avatar Mar 25 '15 22:03 Godin

Implementation of method "subSequence" can be copy-on-read or wrapping of underlying array, and both have their advantages and disadvantages depending on scenarios of usage. Since currently there is no potential consumers of this feature nor in SSLR itself, nor in main consumers of this library (SonarQube plugins), scenarios of usage are unknown and method left unimplemented as a kind of defensive approach against disadvantages of both implementations.

Now wrap(...).subSequence(...) will create CharBuffer, which will retain reference on underlying byte array from CodeBuffer, opening door for possible memory leaks. This PR is not backed by any usage scenario, so that unlikely to be accepted.

Godin avatar May 15 '15 23:05 Godin

Well, another possibility would have been to wrap the whole byte[] into a CharBuffer to begin with. A .subsequence() would then have been marginally cheaper.

The problem here is basically that CodeBuffer breaks the CharSequence contract; this patch solves it, at a cost which is on the whole pretty negligible since no SonarQube code uses that... BUT if client code wants to use that (I want; and since CodeBuffer doesn't support that I have to create a wrapper just for that), I believe the principle of least surprise here should have priority over _very_hypothetical memory leaks.

fge avatar May 15 '15 23:05 fge

Not hypothetical at all and defensive principle is exactly to avoid bad surprises. And again - "I want" is not a concrete usage scenario.

Godin avatar May 15 '15 23:05 Godin

Sorry but those are purely hypothetical. And like I said, another implementation (I can provide it) would just have been to wrap the whole char array. This way a call to .subSequence() would just reuse that instance-wide charbuffer.

And I maintain: the principle of least surprise takes priority over anything else. CodeBuffer implements CharSequence, therefore it should obey the CharSequence contract, and CharSequence never says anywhere, in any way, shape or form, that.subSequence() may throw an UnsupportedOperationException.

I'll redo the patch with a global CharBuffer.

fge avatar May 15 '15 23:05 fge

Done.

Note however that this could be vastly improved. In particular, it looks like pop() could be rewritten as a simple get() on the buffer which in turn means the offset calculations could be simplified.

If you want me to work on this aspect, I can do that too.

Not only that but it can also lead the way to real unicode support, as I do in grappa.

fge avatar May 15 '15 23:05 fge

And I repeat - not hypothetical, but a real issue faced in past. Current implementation is the principle of least surprise for us. Update has exactly the same flaw as previous change. Instead of being aggressively pushy, might be better to start listening to what others say. So once again - how you use subsequence (usage scenario)? Without this another solution to problem of contract - remove inheritance of CharSequence, ticket created - http://jira.sonarsource.com/browse/SSLR-372

Godin avatar May 15 '15 23:05 Godin

In grappa, I use what is called an InputBuffer.

What parboiled did in the past, for no good reason, was to specialize implementations for String and char[] where CharSequence could do it all. I unified this into a single implementation relying only on a CharSequence.

I was quite happy to see that CodeBuffer implemented it... Until I saw that in fact no it didn't. Therefore I had to reimplement the interface:

/**
 * An {@link InputBuffer} over a {@link CodeReader}
 *
 * <p>Unfortunately, this is required. Sonar's {@link CodeBuffer} does not
 * support {@link CharSequence#subSequence(int, int)} for no good reason that I
 * can see... So this is basically a {@link CharSequenceInputBuffer} with
 * subsequence extraction rewritten.</p>
 *
 */
@Immutable
public final class CodeReaderInputBuffer
    implements InputBuffer
{
    private static final ExecutorService EXECUTOR_SERVICE;

    static {
        final ThreadFactory factory = new ThreadFactoryBuilder()
            .setDaemon(true).setNameFormat("linecounter-thread-%d").build();
        EXECUTOR_SERVICE = Executors.newCachedThreadPool(factory);
    }

    private final CodeReader reader;
    private final Future<LineCounter> lineCounter;

    public CodeReaderInputBuffer(@Nonnull final CodeReader reader)
    {
        this.reader = Objects.requireNonNull(reader);
        lineCounter = EXECUTOR_SERVICE.submit(() -> new LineCounter(reader));
    }

    @Override
    public char charAt(final int index)
    {
        return index >= 0 && index < reader.length()
            ? reader.charAt(index) : Chars.EOI;
    }

    /**
     * Returns the Unicode code point starting at a given index
     * <p>If the index is greater than, or equal to, the buffer's length, this
     * method returns -1.</p>
     *
     * @param index the index
     * @return the code point at this index, or -1 if the end of input has been
     * reached
     *
     * @throws IllegalArgumentException index is negative
     */
    @Override
    public int codePointAt(final int index)
    {
        final int length = reader.length();
        if (index >= length)
            return -1;
        if (index < 0)
            throw new IllegalArgumentException("index is negative");

        final char c = reader.charAt(index);
        if (!Character.isHighSurrogate(c))
            return c;
        if (index == length - 1)
            return c;
        final char c2 = reader.charAt(index + 1);
        return Character.isLowSurrogate(c2) ? Character.toCodePoint(c, c2) : c;
    }

    @Override
    public String extract(final int start, final int end)
    {
        final int realStart = Math.max(start, 0);
        final int realEnd = Math.min(end, reader.length());
        final CharBuffer buf = CharBuffer.allocate(realEnd - realStart);
        for (int i = realStart; i < realEnd; i++)
            buf.put(charAt(i));
        return new String(buf.array());
    }

    @Override
    public String extract(final IndexRange range)
    {
        return extract(range.start, range.end);
    }

    @Override
    public Position getPosition(final int index)
    {
        /*
         * A CodeReader column index starts at 0, not 1; we therefore need to
         * substract one from the extracted position...
         */
        final Position position
            = Futures.getUnchecked(lineCounter).toPosition(index);
        return new Position(position.getLine(), position.getColumn() - 1);
    }

    @Override
    public String extractLine(final int lineNumber)
    {
        Preconditions.checkArgument(lineNumber > 0, "line number is negative");
        final LineCounter counter = Futures.getUnchecked(lineCounter);
        final Range<Integer> range = counter.getLineRange(lineNumber);
        final int start = range.lowerEndpoint();
        int end = range.upperEndpoint();
        if (charAt(end - 1) == '\n')
            end--;
        if (charAt(end - 1) == '\r')
            end--;
        return extract(start, end);
    }

    /**
     * Get the index range matching a given line number
     *
     * @param lineNumber the line number
     * @return the index range
     */
    @Override
    public IndexRange getLineRange(final int lineNumber)
    {
        final Range<Integer> range
            = Futures.getUnchecked(lineCounter).getLineRange(lineNumber);
        return new IndexRange(range.lowerEndpoint(), range.upperEndpoint());
    }

    @Override
    public int getLineCount()
    {
        return Futures.getUnchecked(lineCounter).getNrLines();
    }

    @Override
    public int length()
    {
        return reader.length();
    }
}

One more reason that I fail to see the non implementation of .subsequence() is that it is actually extremely benefical for regexes which I see is used a lot in SSLR: Pattern.matcher() accepts a CharSequence as an argument.

fge avatar May 16 '15 00:05 fge

As to this:

And I repeat - not hypothetical, but a real issue faced in past

would you please elaborate? The interface is quite simple to implement. In fact, CharBuffer does it all.

fge avatar May 16 '15 00:05 fge

Yes, I can elaborate: consumer of "subSequence" instead creation of copy keeps reference on result as is, and result survives multiple parses of different files, accumulating content of those files in memory.

Godin avatar May 16 '15 00:05 Godin

Then why don't you just return a String? String implements CharSequence... Which means you don't need to pass a hidden reference to the caller...

I can modify the patch to do that, too...

fge avatar May 16 '15 00:05 fge

Ate my own dog's food, new patch.

(oh, and tests still pass)

fge avatar May 16 '15 00:05 fge

Further note about the patch: it fully obeys the contract in that if an index is passed which is out of bounds, or start is greater than end etc, it will return an IndexOutOfBoundsException. No violation at all here ;)

fge avatar May 16 '15 00:05 fge

Creation of String - is a copy-on-read, which might be undesirable in scenarios, where expected that method "subSequence" guarantees constant time. That's why I asked to provide usage scenarios, which are supposed to be not just a dump of some code, but an explanations (data, data flows, workloads, performance requirements, ...), which would help to choose solution / implementation approach, instead of trying quickly randomly find "correct" hot patch, whose behavior we'll be forced to support as soon as consumers will start relying on it.

Pattern itself doesn't use/require implementation of method subSequence.

Godin avatar May 16 '15 00:05 Godin

Pattern itself doesn't use/require implementation of method subSequence.

It does. I speak from experience here.

A Matcher takes a CharSequence as an argument and this is not for kicks. What a Matcher does is call .charAt() to fecth characters from the CharSequence, and when it finds a match and you want the match, it extracts the .subsequence() from the CharSequence and calls .toString() on it.

I discovered this fact while working on largetext, and tried to print what was actually matched... I didn't implement .toString() over my CharSequences at first and I obtained a .toString() representation which looked like MyShortClassName@xxxxx....

fge avatar May 16 '15 00:05 fge

Ok, excuse me, usage of Pattern within SSLR doesn't require this.

Godin avatar May 16 '15 01:05 Godin

OK, fine, then the decision is yours. I believe my patch is pretty simple to be included as is, since the results of it are basically stateless.

fge avatar May 16 '15 01:05 fge

Some more comments...

Creation of String - is a copy-on-read, which might be undesirable in scenarios, where expected that method "subSequence" guarantees constant time.

Let's be realistic here for a moment. Can a copy of some arbitrary amount of data be in constant time, that is O(1)?

The answer is obvious: NO. It can never be. The only condition where it can be is that if a reference to the original data is kept, and THAT is the scenario which is undesirable.

While my first patch was indeed "constant time" in that regard, and undesirable, the latest patch is NOT constant time (its runtime depends on the length of the requested subsequence).

But both patches cure the fact that CodeBuffer, as it is now, breaks the CharSequence contract; with my patch, it obeys it to its fullest extent, bounds checking and all, for no additional cost whatsoever, except for the guys like me who do happen to rely on CharSequence.

That's why I asked to provide usage scenarios, which are supposed to be not just a dump of some code, but an explanations (data, data flows, workloads, performance requirements, ...), which would help to choose solution / implementation approach, instead of trying quickly randomly find "correct" hot patch, whose behavior we'll be forced to support as soon as consumers will start relying on it.

Well there you have one scenario. That dump of code as you call it is precisely the critical part in all of my parsin process. And there is no data flow/performance requirements to speak of: just the fact that if an interface is claimed to be implemented, IT SHOULD BE. And my fix for that is simple, obvious and comes at no additional cost.

So, what's the problem?

fge avatar May 17 '15 00:05 fge