Performance with large data sets

Open JasonBarnabe opened this issue 8 months ago • 8 comments

On my site greasyfork.org (source), I offer a feature that lets users see the most similar JavaScript submissions to their own. I am investigating the use of JPlag for better results with this feature.

As a trial, I ran JPlag locally with a representative set of JS files. This set contains 130,000 files, totalling 3.4GB. The command I ran is java -jar ./jplag-6.0.0-jar-with-dependencies.jar -l javascript --csv-export ./greasyfork/tmp/gfcodesearch/.

This command ends up freezing my computer due to memory usage after the CLI reports parsing a few hundred files. I also tried to use the new/old directory feature to compare one file to all the rest, hoping that this would reduce the required resources, but this had similar results - java -jar ./jplag-6.0.0-jar-with-dependencies.jar -l javascript --csv-export -old ./greasyfork/tmp/gfcodesearch/ checkjpflag/.

Is there anything I can do to get this to run in a reasonable amount of time and with reasonable system resources, or is this quantity of data just way outside JPlag's capabilities?

JasonBarnabe avatar Apr 13 '25 16:04 JasonBarnabe

130,000 files lead to 8,449,935,000 pairwise comparisons if every file is a submission, which is, naturally, quite resource-intensive. Since in most academic settings the number of submissions stays below 10,000, JPlag is not well tested under such large loads.
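
(For reference, that is just the number of unordered pairs: n × (n − 1) / 2 = 130,000 × 129,999 / 2 = 8,449,935,000.)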

Do you know during which phase JPlag freezes? What is the average size of a single submission in that set, or the LOC per submission?

I would definitely skip clustering with --cluster-skip and also use --old so that only the pairs that actually need to be checked are compared. Besides that, I would recommend giving your JVM enough memory, e.g. via the JVM parameter -Xmx (though I have no idea how much memory would be required here).
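
For example, a call along these lines (the 16 GB heap is only a placeholder and would need tuning; the paths are taken from your earlier command):

java -Xmx16g -jar ./jplag-6.0.0-jar-with-dependencies.jar -l javascript --csv-export --cluster-skip -old ./greasyfork/tmp/gfcodesearch/ checkjpflag/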

tsaglam avatar Apr 14 '25 06:04 tsaglam

The average file size is 26KB, and the average number of lines is 34. There is a large variance, though, with some files being a few MB and others less than 1KB.

JPlag freezes my system during "Parsing Submissions" after only a couple hundred files have been processed.

Even if I get JPlag to run on the first 1000 files only, I get the same result. Perhaps it's not the number of files that's the issue (yet), but the size/complexity of some of them? I've attached the first 1000 files. js.zip

This is the command I'm using. newfiles/ contains 1.js from the zip I uploaded.

java -jar ./jplag-6.0.0-jar-with-dependencies.jar -l javascript --csv-export --cluster-skip --log-level=TRACE -old ./contentsofuploadedfolder/ newfiles/

With --log-level=TRACE I can see that while most files are parsed within a few seconds, some take more than a minute: 383.js is 16KB and appears to take 2 minutes to parse, 513.js is 195KB and takes 3 minutes, 95.js is 122KB and it takes at least 10 minutes (I aborted the process) ... I don't know if this is expected performance or not.

JasonBarnabe avatar Apr 14 '25 15:04 JasonBarnabe

From your sample, the average LOC per file is around 500, but the maximum per file is 25,000, which is relatively large. Large submissions above 10,000 LOC can slow down the comparison step of the algorithm, but for you, the problem seems to occur before that step. So the issue might be the parser. For JavaScript we are currently using an ANTLR grammar, which may be at fault. I will try to investigate and update you.

EDIT: old/new will only help for the comparison step, so if parsing is the issue, it won't help, as all files still need to be parsed.

EDIT2:

With --log-level=TRACE I can see that while most files are parsed within a few seconds, some take more than a minute: 383.js is 16KB and appears to take 2 minutes to parse, 513.js is 195KB and takes 3 minutes, 95.js is 122KB and it takes at least 10 minutes (I aborted the process) ... I don't know if this is expected performance or not.

If your JVM memory is already near full, Java will need to do a lot of swapping in and out of memory. Then any action will take very long, even for small files. However, many of these files seem to contain insanely long lines: 710.js has nearly 2 million characters in a single line, storing some JSON data.
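
If you want to spot such files up front, GNU wc's -L option reports the longest line length per file; something like the following (the path is just illustrative) lists the worst offenders:

wc -L ./gfcodesearch/*.js | sort -n | tail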

EDIT3: On my device, your data does not produce memory issues (memory stays at 2.7GB), but it freezes during parsing.

tsaglam avatar Apr 14 '25 15:04 tsaglam

With this subset of 1000, I do not get a system hang either, just 6.5GB memory usage and it gets stuck on 95.js.

JasonBarnabe avatar Apr 14 '25 22:04 JasonBarnabe

Based on my current understanding, we have two different issues at play here:

  • Memory issues: You can give the JVM more memory, but obviously, this is limited by the memory your machine has. JPlag is not designed/optimized for scenarios where you cannot hold all data in memory at all times.
  • Freezing: This is probably related to the very long lines in the files; it might be a parser library issue. Here, I need to investigate a little further.

I will report back when I know more. You definitely have a fascinating use case here.

tsaglam avatar Apr 15 '25 08:04 tsaglam

I realize that the 130,000-file "compare everything to everything" is not something you'd expect to complete in a reasonable amount of time. My use case is that I want to check one JS file for a minimum level of similarity against all other files. This would occur whenever a new JS file is submitted to my site - something that happens on the order of 100 times per day.

I was hoping there would be some features/shortcuts that could speed up my use case:

  • As I understand it, the existing "old" vs "new" file distinction would mean comparing 1 to 130,000 rather than 130,000 to 130,000.
  • On-disk caching of parsing results - all 130,000 could be parsed once, with results stored on disk, then I could run multiple comparisons afterwards without needing to re-parse.
  • Since I am looking for a minimum amount of similarity, skip a comparison if it becomes apparent that that level of similarity will not be met.

Of course, I have no knowledge of JPlag's internals, so I don't know whether these features would be feasible or desirable.

I am totally OK with you saying that what I'm trying to do is not JPlag's use case or the amount of data is not within its capabilities. I am just playing around with JPlag at the moment because I don't even know how usable its results would be to me yet. But at the very least, I think having the readme give a general idea of the speed and resource requirements would be a good idea, e.g. "comparing 100 10KB JavaScript files should take under 5 minutes and 2GB of memory".

JasonBarnabe avatar Apr 15 '25 13:04 JasonBarnabe

I just spent a short time digging: the freezing issue lies in the ANTLR parser code generated from the grammars we use (from https://github.com/antlr/grammars-v4/) to parse code. For example, when debugging 95.js, the problem is the evaluation of line 1302, which is one of these very long lines in the file. In this case, it is a concatenation of strings that is evaluated by the ANTLR grammar. Thus, it is sadly not something I can fix.

What you could try here is reformatting your input data to break up such lines, as there is a chance that the ANTLR parser copes better once these long lines are split into multiple shorter ones.
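
For example, an off-the-shelf formatter such as Prettier could be run over the dataset beforehand (untested with JPlag; note that it will not split single huge string literals, and the glob is just illustrative):

npx prettier --write "./gfcodesearch/**/*.js"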

The issues regarding dataset size are something you can work around, as you mentioned above. Regarding the size limits in the readme, I think this is a great idea; I will do some tests and add something in that regard. I would guess that a few million LOC in total are fine, though, as long as individual submissions stay below 100k LOC.

tsaglam avatar Apr 25 '25 17:04 tsaglam

I share similar insights and use cases with JasonBarnabe! You can see the issues I submitted for details: https://github.com/jplag/JPlag/issues/2539 and https://github.com/jplag/JPlag/issues/2541

ElgebarOri avatar Aug 02 '25 15:08 ElgebarOri