JPlag
Core algorithm refactoring
This PR aims to refactor the core algorithm to make it more readable and maintainable. The following refactorings were performed:
- move state `Token.marked` into the algorithm
- remove the optimisation of reducing RAM usage by only generating reversed hash tables when necessary
- remove `TokenHashMap`, which only differed from a normal map by applying a modulo to the keys, which had no benefit for the algorithm
- encapsulate `Token.hash`, `TokenList.tokenHashes`, and `TokenList.hashLength` into a separate class `SubsequenceHashLookupTable` with improved naming and documentation (a rough sketch of such a class follows below this list)
- remove `TokenList` and replace it with `List<Token>`
- move state `Token.isBaseCode` into the algorithm
  - refactored base code tagging to occur outside of the core algorithm
  - only tag base code for the compared submission, not for the base code submission
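To give a rough picture of the encapsulation mentioned above, the following is a minimal sketch of what such a lookup table could look like. Only the class name is taken from the PR; the method and field names, and the use of plain integer token types instead of `List<Token>`, are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of a subsequence hash lookup table: it maps the hash of
 * each token subsequence of a fixed length to the start indices of all
 * subsequences with that hash. The real class works on List<Token>; plain
 * integer token types are used here to keep the sketch self-contained.
 */
class SubsequenceHashLookupTableSketch {
    private final int windowLength;
    private final Map<Integer, List<Integer>> startIndexesPerHash = new HashMap<>();

    SubsequenceHashLookupTableSketch(int windowLength, List<Integer> tokenTypes) {
        this.windowLength = windowLength;
        for (int start = 0; start + windowLength <= tokenTypes.size(); start++) {
            int hash = hashOfWindow(tokenTypes, start);
            startIndexesPerHash.computeIfAbsent(hash, key -> new ArrayList<>()).add(start);
        }
    }

    /** Returns the start indices of all subsequences sharing the given hash. */
    List<Integer> startIndexesOfHash(int hash) {
        return startIndexesPerHash.getOrDefault(hash, List.of());
    }

    private int hashOfWindow(List<Integer> tokenTypes, int start) {
        // Plain polynomial hash over the window; the actual implementation uses
        // a rolling hash so all windows can be hashed in linear time.
        int hash = 0;
        for (int i = start; i < start + windowLength; i++) {
            hash = 31 * hash + tokenTypes.get(i);
        }
        return hash;
    }
}
```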
Discussion
TokenList removal
With the removal of `TokenList` come two minor changes.
First, the string optimisation of pointing to the same String object when multiple tokens share the same file name was removed. I don't think this optimisation is still necessary on modern-day computers.
Second, the token list adjusted tokens to have monotonically non-decreasing line numbers. This is only relevant for the visualisation, as the matching does not use line number information. With some minor testing, the adjustment behaviour seemed to trigger only for FILE_END tokens, which was fixed by assigning a NO_VALUE line to any FILE_END token. Since FILE_END tokens cannot appear in any match anyway, this should not make any difference for the visualisation of results.
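As an illustration of that change, the sketch below creates FILE_END tokens with a NO_VALUE line number up front, so no monotonicity adjustment is needed afterwards. All names here (the record, the constants, the helper) are placeholders, not the actual JPlag types.

```java
import java.util.List;

/** Minimal sketch, assuming placeholder types: FILE_END tokens are created with
 *  a NO_VALUE line number so the old monotonic line-number adjustment becomes
 *  unnecessary. They never appear in matches, so visualisation is unaffected. */
class FileEndTokenSketch {
    static final int NO_VALUE = -1;
    static final int FILE_END = 0; // placeholder token type constant

    record SketchToken(int type, String file, int line) {
    }

    static SketchToken fileEndToken(String file) {
        return new SketchToken(FILE_END, file, NO_VALUE);
    }

    public static void main(String[] args) {
        List<SketchToken> tokens = List.of(
                new SketchToken(1, "A.java", 3),
                new SketchToken(2, "A.java", 7),
                fileEndToken("A.java"));
        tokens.forEach(System.out::println);
    }
}
```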
Choice of actual Set / Map class
The refactorings introduce various variables of type `Set` or `Map`. To tune performance, we could test different implementations of these interfaces. An initial discussion was already started here.
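If we do benchmark alternative implementations, one option is to centralise construction in a small factory so the concrete classes can be swapped in one place. The helper below is purely illustrative and not part of the PR.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper to centralise the choice of concrete Set/Map classes,
 *  so alternative implementations can be benchmarked without touching the
 *  algorithm itself. Not part of the PR. */
final class CollectionFactory {
    private CollectionFactory() {
    }

    static <T> Set<T> newSet(int expectedSize) {
        // Pre-sizing the backing table avoids rehashing for known sizes.
        return new HashSet<>(Math.max(16, (int) (expectedSize / 0.75f) + 1));
    }

    static <K, V> Map<K, V> newMap(int expectedSize) {
        return new HashMap<>(Math.max(16, (int) (expectedSize / 0.75f) + 1));
    }
}
```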
Testing
As the test coverage of the core is not too high, we should validate the refactorings on additional data sets (and maybe try to integrate some of those in the form of plain token sequences into the core). The refactorings should not change functionality, thus the similarity values should remain identical for any submission. Furthermore, as the algorithm is highly optimised for performance, we should compare the performance of the refactored implementation to the previous one on a large data set. Unfortunately, I don't have a comparison set at hand, so it would be beneficial if someone from @jplag/maintainer could help out here.
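One possible shape for such a validation is a regression-style check that pins previously recorded similarity values and compares them against freshly computed ones. How the values are obtained from JPlag is deliberately left out here, since the exact API call depends on the version in use.

```java
import java.util.Map;

/** Sketch of a regression check: similarity values for a fixed data set must be
 *  identical before and after the refactoring. Obtaining the values is left
 *  abstract; plug in the actual JPlag invocation for the data set. */
final class SimilarityRegressionCheck {
    static void assertUnchanged(Map<String, Double> expected, Map<String, Double> actual) {
        if (!expected.keySet().equals(actual.keySet())) {
            throw new AssertionError("Comparison pairs differ: " + actual.keySet());
        }
        expected.forEach((pair, similarity) -> {
            double recomputed = actual.get(pair);
            // The refactoring must not change results, so values must match exactly.
            if (Double.compare(similarity, recomputed) != 0) {
                throw new AssertionError(pair + ": expected " + similarity + " but was " + recomputed);
            }
        });
    }
}
```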
Outlook
Tokens
I refactored all state-related information of tokens into the algorithm. In a future PR, we should refactor tokens to be records instead of classes. However, for this to be possible, we need to avoid subclassing tokens for every language. This will probably no longer be needed once token constants are refactored to enums.
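To sketch where this could lead, a state-free token as a record over an enum-based type might look roughly like this. Both names and the enum constants are illustrative, since the enum refactoring has not happened yet.

```java
/** Hypothetical sketch of a state-free token once token constants are enums
 *  and language-specific subclasses are no longer needed. Names are illustrative. */
record ImmutableToken(TokenType type, String file, int line, int column, int length) {
}

/** Placeholder for a future token type enum; each language frontend would
 *  contribute its own constants. */
enum TokenType {
    FILE_END,
    CLASS_DECLARATION,
    METHOD_DECLARATION
}
```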
Naming
We use `firstSubmission` and `secondSubmission` throughout the application. From my perspective, these names suggest some kind of ordering which does not exist. Thus, I renamed the submissions in the core algorithm to `leftSubmission` and `rightSubmission`. Another option would have been `patternSubmission` and `textSubmission`, which would be closer to the original paper. I chose the former naming because the clear separation between pattern (baseline file) and text (file to detect matches in) is not given in our application context. If the left / right naming is accepted, we should adjust the naming throughout the application in a follow-up PR.
Parallel Comparison Strategy
Since tokens are now state-free, it should be possible to entirely remove the locking in the parallel comparison strategy.
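Conceptually, once a comparison only reads immutable token data, every submission pair can be processed independently, for example via a parallel stream with no synchronisation at all. The types below are placeholders for illustration, not the actual strategy classes.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Sketch of a lock-free parallel comparison once tokens carry no mutable state.
 *  SubmissionPair and ComparisonResult are placeholders for illustration. */
final class LockFreeComparison {
    record SubmissionPair(String left, String right) {
    }

    record ComparisonResult(SubmissionPair pair, double similarity) {
    }

    static List<ComparisonResult> compareAll(List<SubmissionPair> pairs) {
        return pairs.parallelStream()
                // Each comparison only reads immutable token data, so no locking is needed.
                .map(pair -> new ComparisonResult(pair, compare(pair)))
                .collect(Collectors.toList());
    }

    private static double compare(SubmissionPair pair) {
        return 0.0; // placeholder for the actual greedy string tiling comparison
    }
}
```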
https://github.com/jplag/JPlag/blob/0780fc49c5d37c70f172f9d8289ddc158b6de934/jplag.frontend-utils/src/main/java/de/jplag/TokenList.java#L34-L45
I would like to remove the `TokenList` data structure entirely, for which we have to get rid of this custom add functionality.
The first part (ll37-39) with the file name can probably be removed.
For the second part (ll40-42) I am not sure whether we require a token sequence to have monotonically non-decreasing lines. I would guess if anywhere, then in the report viewer. @nestabentum do you by any chance know more about that?
@JanWittler all I know is that the `TokenList` is currently used to determine the matches of a comparison: https://github.com/jplag/JPlag/blob/0780fc49c5d37c70f172f9d8289ddc158b6de934/jplag/src/main/java/de/jplag/reporting/reportobject/ReportObjectFactory.java#L156-L171
The information on matches is then further used to highlight them in the comparison view.
Alright, thanks for the pointer. I will have a look at that.
The aim of this PR is to make the core algorithm understandable by looking at the code, without reading the paper. From my perspective this is now the case, thus the PR is ready for review. If you spot any parts of the algorithm that you still don't understand or would like to see more documentation for, please comment.
> For the second part (ll40-42) I am not sure whether we require a token sequence to have monotonically non-decreasing lines. I would guess if anywhere, then in the report viewer.
@JanWittler I think these lines ensure that EOF tokens have a proper line number. As EOF tokens are specified without a line index, this might be the place to set the line index to that of the second-to-last token of the file. EOF tokens cannot easily be given a line number in the frontend, as the line indexing is done by the parser/scanner/compiler. Also, this might be an optimization for tokens that do not use lines and columns but only a single index; however, this is less likely.
@jplag/maintainer I think there is no other PR blocking this one anymore, so it is ready to merge from my side. The end2end tests all run successfully, so I would assume that the changes do not change any results (as intended). However, if you would like to validate this on a larger submission set, feel free to do so.