dmd icon indicating copy to clipboard operation
dmd copied to clipboard

ConfigurableLexer extension under version DMDLIB

Open sorin-gabriel opened this issue 3 years ago • 10 comments

As other pull requests I made in the past, the goal of this one is to make the Lexer more flexible and usable outside of the compiler. This is a rather large pull request judging by the number of lines changed, but most of them are related to the Lexer scan function, mostly copied as-is from the current Lexer. The main feature is the extension of the Lexer class with a ConfigurableLexer class.

Suggestions are appreciated, particularly on these topics:

  • I am looking for a better solution to the workaround of changing the visibility of the structures and functions at the bottom of the lexer.d file (previously defined as private) to public - package (dmd) doesn't seem to work
  • doc and comm fields are a bit redundant, but I think things are better organized like this (this also means doDocComment and commentToken are set accordingly when calling super(), for compatibility reasons)

sorin-gabriel avatar Mar 31 '22 22:03 sorin-gabriel

Thanks for your pull request and interest in making D better, @sorin-gabriel! We are looking forward to reviewing it, and you should be hearing from a maintainer soon. Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#13935"

dlang-bot avatar Mar 31 '22 22:03 dlang-bot

As other pull requests I made in the past, the goal of this one is to make the Lexer more flexible and usable outside of the compiler.

By overriding primary interfaces such as scan and nextToken, you're no better off than just writing a separate lexer to do whatever specific job you want though.

The two largest and most popular downstream users of dmd as a library don't require this in order to function. This looks like it will negatively affect them for what gain?

ibuclaw avatar Apr 01 '22 07:04 ibuclaw

@ibuclaw What we want is to be able to configure behavior with regards to whitespaces/newlines. The 2 options that we have are: (1) add version(DMDLIB) statements inside scan or (2) this PR. I would vote for the first option, which is what we did initially, however, @WalterBright said that he doesn't want all these versions inside the lexer (which is funny considering all the if (sc.cfile) inside the compiler). Since we are talking about 10-20 LOC max, I would vote for option number 1. I don't expect that other changes would be necessary.

RazvanN7 avatar Apr 01 '22 07:04 RazvanN7

What we want is to be able to configure behavior with regards to whitespaces/newlines.

I don't get this.

enum CommentOptions {
    None,
    All,
    AllCondensed,
}

enum WhitespaceOptions {
    None,
    OnlySpaces,
    OnlyNewLines,
    OnlyTabs,
    All,
    AllCondensed,
}

What are you trying to do with DMDLib that you need all these options for?

he doesn't want all these versions inside the lexer (which is funny considering all the if (sc.cfile) inside the compiler). Since we are talking about 10-20 LOC max, I would vote for option number 1. I don't expect that other changes would be necessary.

So is this PR just to show why version(DMDLib) should stay? Because adding a 1000 line module for different whitespace handling in the lexer is ridiculous.

dkorpel avatar Apr 01 '22 08:04 dkorpel

@dkorpel We are trying to provide an interface that can be used in D-scanner to replace libdparse. This PR should provide a lexer that is configurable and offers a range interface, so that the initial lexer is kept as clean as possible. The current implementation is a discussion starter, so that we can get to an agreement on what is the best way to move forward.

RazvanN7 avatar Apr 01 '22 09:04 RazvanN7

The current implementation is a discussion starter, so that we can get to an agreement on what is the best way to move forward.

  • What differences with libdparse's lexer makes DMD's lexer incompatible with D-scanner?
  • Why does D-scanner need the lexer in the first place, instead of just the parser?
  • What exactly needs to be configurable about the lexer?

To add a range interface, you don't need class ConfigurableLexer : Lexer, I think something like this would work:

struct LexerRange
{
    private Lexer lexer; // the DMD lexer
    Token front;
    bool empty = false;
    void popFront() 
    {
        front = lexer.nextToken();
        if (front == Token.EOF) 
            empty = true;
    }
    this(Lexer lexer)
    {
        this.lexer = lexer;
        popFront();
    }    
}

Whitespace tokens can be scanned manually in LexerRange.popFront(). You already have the source text and a Loc.fileOffset of two adjacent tokens, so manually inserting a whitespace token should be doable.

dkorpel avatar Apr 01 '22 11:04 dkorpel

@dkorpel

What differences with libdparse's lexer makes DMD's lexer incompatible with D-scanner? Why does D-scanner need the lexer in the first place, instead of just the parser? What exactly needs to be configurable about the lexer?

The dmd lexer currently squashes all consecutive whitespaces and newlines into a single one. Moreover, a whitespace is not even represented as a token in the lexer, so it's impossible to check for things like double white spaces between an if and a (. It also implements token count which depends on whether you count a whitespace as a token or not. There may be other features that require a configurable lexer, but these are the ones that we've encountered up until now.

To add a range interface, you don't need class ConfigurableLexer : Lexer, I think something like this would work:

The range interface already exists in the lexer under version(dmdlib), however, @WalterBright suggested that we put it in a different file so that we avoid adding versions inside the lexer.

Whitespace tokens can be scanned manually in LexerRange.popFront(). You already have the source text and a Loc.fileOffset of two adjacent tokens, so manually inserting a whitespace token should be doable.

No, that is not possible, you need to change the scan method for that.

RazvanN7 avatar Apr 01 '22 15:04 RazvanN7

Whitespace tokens can be scanned manually in LexerRange.popFront(). You already have the source text and a Loc.fileOffset of two adjacent tokens, so manually inserting a whitespace token should be doable.

No, that is not possible, you need to change the scan method for that.

I don't see why you say it's not possible either, the current location is public, so you can just check that before calling scan.

ibuclaw avatar Apr 01 '22 16:04 ibuclaw

My opinion is that we should definitely move forward with a parametrized lexer. This is really useful to build a formatter and other tools on top of libdmd.

I'm more inclined toward the idea of templating the lexer so that we can parameterise it at compile time. version (DMDLIB) is not the solution at all. In some situations, introducing version (DMDLIB) is to support some slow feature that the compiler is not using. I think we shouldn't assume that the user wants that when using libdmd. For example, a formatter and a linter, would benefit from different parameters, skipping whitespace tokens all at once, would be beneficial just for some tools.

I think it is possible to reuse the same lexer. We need to first refactor some lexer routines, as most of the logic is inside scan function. And by reusing the lexer, I'm not referring to adding a bunch of version (DMDLIB) code but rather creating a collection of options to do that parametrization at compile-time.

ljmf00 avatar Apr 02 '22 17:04 ljmf00

ping @sorin-gabriel . any progress on this?

RazvanN7 avatar Apr 25 '22 13:04 RazvanN7