ANTLR4 Memory Usage for Tokenizing a 200MB File
I am experiencing an issue where tokenizing a large 200MB file using ANTLR4 results in over 1GB of memory usage. Here’s the code I am using to process the file:
try (InputStream inputStream = new FileInputStream(file)) {
    CharStream charStream = CharStreams.fromStream(inputStream);
    Lexer lexer = new MySqlLexer(charStream);
    UnbufferedTokenStream<Token> tokenStream = new UnbufferedTokenStream<>(lexer);
    while (true) {
        Token token = tokenStream.LT(1);
        int tokenType = token.getType();
        if (tokenType == Token.EOF) {
            break;
        }
        // ....other code
    }
} catch (Exception e) {
    // Handle exception
}
I noticed that as soon as I get the tokenStream, the memory usage spikes to over 1GB. I am using ANTLR4 version 4.13.1 on macOS 15.2. The grammar file I am using is MySqlLexer.g4.
@openai0229 How did you manage to fix it?
I haven't found a solution yet, so I can only divide my SQL file into chunks and feed each chunk to the Lexer separately.
Why would you parse 200MB of SQL as a single file? This seems like an XY question to me.
Each token is going to create an object in the token stream, and lexing is done upfront. Plus 200MB to hold your source.
Are you trying to parse a complete dump of every query you have in one file?
The reason I’m doing this is because I have a 200MB SQL file, and I need to split the SQL queries within it into individual executable SQL statements. I’m planning to use ANTLR for lexical analysis (tokenization), and then manually process the SQL statements based on the tokens. While the file is large, I need to extract and handle each SQL query individually to ensure that each one is executed correctly.
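A rough sketch of that token-based splitting, assuming the MySqlLexer names its ';' token SEMI (adjust to the actual token name in your grammar) and that this lexer works with an unbuffered input:

// All ANTLR classes are from org.antlr.v4.runtime, as in the snippet above.
try (InputStream inputStream = new FileInputStream(file)) {
    CharStream charStream = new UnbufferedCharStream(inputStream);  // avoids buffering the whole file
    MySqlLexer lexer = new MySqlLexer(charStream);
    lexer.setTokenFactory(new CommonTokenFactory(true));            // copy token text: the unbuffered stream forgets old characters
    int stmtStart = 0;                                               // character offset where the current statement begins
    for (Token token = lexer.nextToken(); token.getType() != Token.EOF; token = lexer.nextToken()) {
        if (token.getType() == MySqlLexer.SEMI) {                    // assumed name of the ';' token
            int stmtStop = token.getStopIndex();
            // ...pass the [stmtStart, stmtStop] range (or text accumulated from the tokens) to whatever executes the statement
            stmtStart = stmtStop + 1;
        }
    }
} catch (Exception e) {
    // Handle exception
}

Only one token is alive at a time here, so memory usage should stay roughly flat regardless of file size (apart from individual large tokens such as long string literals).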
What you could try is to parse a statement, then remove it, and repeat until the file is empty.
What you could try is to parse a statement, then remove it, and repeat until the file is empty.
Thank you for your suggestion! However, I’m not quite sure I fully understand your idea. Could you please provide me with a code example to illustrate how I can implement this approach? It would really help me understand better. Thanks again!
The idea is to use the MySql parser and to invoke rule sqlStatement.
Using an unbuffered stream (like you already do) this might prevent the memory consumption you're experiencing.
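A rough sketch of that idea, assuming the parser rule is sqlStatement and the ';' separator token is named SEMI (both taken from the grammars-v4 MySQL grammar and not verified here):

// All ANTLR classes are from org.antlr.v4.runtime.
try (InputStream inputStream = new FileInputStream(file)) {
    CharStream charStream = new UnbufferedCharStream(inputStream);
    MySqlLexer lexer = new MySqlLexer(charStream);
    lexer.setTokenFactory(new CommonTokenFactory(true));        // required with an unbuffered char stream
    UnbufferedTokenStream<Token> tokenStream = new UnbufferedTokenStream<>(lexer);

    MySqlParser parser = new MySqlParser(tokenStream);
    parser.setBuildParseTree(false);                             // don't keep a parse tree for the whole file

    while (tokenStream.LA(1) != Token.EOF) {
        MySqlParser.SqlStatementContext ctx = parser.sqlStatement();   // parse a single statement
        // ctx.getStart() / ctx.getStop() mark the statement's boundary tokens
        if (tokenStream.LA(1) == MySqlLexer.SEMI) {
            tokenStream.consume();                               // skip the separator if the rule leaves it behind
        }
    }
} catch (Exception e) {
    // Handle exception
}

With setBuildParseTree(false) the parser still creates a context per rule invocation but doesn't attach it to a growing tree, so only the current statement is held in memory.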
The idea is to use the MySql parser and to invoke rule sqlStatement. Using an unbuffered stream (like you already do) this might prevent the memory consumption you're experiencing.
Thanks for the clarification! I’ll give that a try. Thanks again for your help!
The reason I’m doing this is because I have a 200MB SQL file, and I need to split the SQL queries within it into individual executable SQL statements. I’m planning to use ANTLR for lexical analysis (tokenization), and then manually process the SQL statements based on the tokens. While the file is large, I need to extract and handle each SQL query individually to ensure that each one is executed correctly.
The reason I suggest that this is an XY question is that I don't think you should start with a 200MB file of all the SQL statements in one blob in the first place. Where are you getting this file from?
Thank you for your suggestion! Before I received your message, I had already started splitting the 200MB file into smaller chunks for processing. This approach has helped me significantly reduce memory consumption, and it’s been working much better for handling large files. You’re right, maybe trying to process such a large file directly with ANTLR was not the best choice.
The reason I suggest that this is an XY question is that I don't think you should start with a 200MB file of all the SQL statements in one blob in the first place. Where are you getting this file from?
I've encountered such big SQL files in my previous work (not 200 MB, but 30 MB, which is still very big). Generated code can also be very big (for instance, code generated by ANTLR itself). That's why I think the question is quite reasonable.
Each token is going to create an object in the token stream, and lexing is done upfront. Plus 200MB to hold your source.
But a token probably doesn't hold the entire string, only start and end offsets. That's why the total memory used by all tokens could be even less than 200 MB, especially if the input contains big string literals. Also, tokens can be handled on the fly.
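As far as I can tell that's how the Java runtime behaves: with the default token factory a CommonToken stores start/stop offsets, and getText() re-reads the characters from the CharStream on demand. A quick way to see it (assuming the same MySqlLexer):

// Default factory: no text is copied into the token; getText() goes back to the CharStream.
CharStream charStream = CharStreams.fromString("SELECT 1;");
MySqlLexer lexer = new MySqlLexer(charStream);
Token token = lexer.nextToken();   // the SELECT keyword
System.out.println(token.getType() + " [" + token.getStartIndex() + ".." + token.getStopIndex() + "] " + token.getText());

Note that this lazy lookup only works while the char stream is still buffered; with an UnbufferedCharStream you have to switch to CommonTokenFactory(true) so tokens copy their text.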
The idea is to use the MySql parser and to invoke rule sqlStatement. Using an unbuffered stream (like you already do) this might prevent the memory consumption you're experiencing.
As far as I understand, @openai0229 doesn't even use a parser. The topic is about tokenization, and parsing only increases memory usage.
I haven't found a solution yet, so I can only divide my SQL file into chunks and feed each chunk to the Lexer separately.
I don't think it's an ideal solution; I'd try to set up on-the-fly token processing. But I actually don't remember whether it's possible to configure UnbufferedTokenStream or another stream to release already processed tokens. Also, is a 1GB spike really that big? I suppose the GC can reclaim most of that memory after all tokens are processed.
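For what it's worth, UnbufferedTokenStream in the Java runtime does appear to drop tokens once they are consumed, as long as nothing holds a mark() on the stream. So an on-the-fly setup close to the original snippet might look like this (a sketch, not measured):

try (InputStream inputStream = new FileInputStream(file)) {
    CharStream charStream = new UnbufferedCharStream(inputStream);   // instead of CharStreams.fromStream, which buffers the whole file
    MySqlLexer lexer = new MySqlLexer(charStream);
    lexer.setTokenFactory(new CommonTokenFactory(true));             // tokens must copy their text here
    UnbufferedTokenStream<Token> tokenStream = new UnbufferedTokenStream<>(lexer);
    while (tokenStream.LA(1) != Token.EOF) {
        Token token = tokenStream.LT(1);
        // ...process the token on the fly...
        tokenStream.consume();                                       // lets the stream discard the token it just handed out
    }
} catch (Exception e) {
    // Handle exception
}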
I don't think it's an ideal solution; I'd try to set up on-the-fly token processing. But I actually don't remember whether it's possible to configure UnbufferedTokenStream or another stream to release already processed tokens. Also, is a 1GB spike really that big? I suppose the GC can reclaim most of that memory after all tokens are processed.
Thank you for your reply! I also agree that chunking the file is not an ideal solution. Unfortunately, it’s the approach I’ve had to resort to for now. I’m still looking for better ways to optimize memory usage, especially to minimize ANTLR’s memory consumption. During my testing, the memory usage spikes as soon as I obtain the Lexer object. The highest peak I’ve seen is around 2GB.
As a temporary solution, I’ve currently divided the file into chunks, with each chunk being around 1MB. Below is the memory usage after applying this method. While it’s not the best solution, it seems to address my needs for the time being.
If you have any experience or suggestions, I would greatly appreciate it! Thank you again!