Line number problems in C `# notranslate` exclusions

Open aitap opened this issue 9 months ago • 0 comments

The following message seems to be missing from data.table.pot:

https://github.com/Rdatatable/data.table/blob/b7f2106efe038d93577f427f34c06d9c00b4c486/src/fread.c#L2775

The code seems to consider this message to be subject to a # notranslate exclusion:

https://github.com/MichaelChirico/potools/blob/0dc529285c4f54a86d0755317d9304d735c3858f/R/get_src_messages.R#L255

debug: src_messages = drop_excluded(src_messages, exclusions[is_outside_char_array(exclusion_pos,
    arrays)])
Browse[1]> src_messages[grepl('sep=', msgid)] # <-- row 3 here
                                                                      msgid msgid_plural   fname
                                                                     <char>       <list>  <char>
1:   sep='\\\\n' passed in meaning read lines as single character column\\n       [NULL] DTPRINT
2:                                             sep=',' so dec set to '.'\\n       [NULL] DTPRINT
3:                                                    %8.3fs (%3.0f%%) sep=       [NULL] DTPRINT
                                                                                     call array_start is_marked_for_translation line_number
                                                                                   <char>       <int>                    <lgcl>       <int>
1: DTPRINT(_("  sep='\\\\n' passed in meaning read lines as single character column\\n"))       71163                      TRUE        1674
2:                                           DTPRINT(_("  sep=',' so dec set to '.'\\n"))       83411                      TRUE        1892
3:           DTPRINT(_("%8.3fs (%3.0f%%) sep="), tLayout-tMap, 100.0*(tLayout-tMap)/tTot)      129888                      TRUE        2775
Browse[1]> n
<...>
Browse[1]> src_messages[grepl('sep=', msgid)] # <-- one row less now!
                                                                      msgid msgid_plural   fname
                                                                     <char>       <list>  <char>
1:   sep='\\\\n' passed in meaning read lines as single character column\\n       [NULL] DTPRINT
2:                                             sep=',' so dec set to '.'\\n       [NULL] DTPRINT
                                                                                     call array_start is_marked_for_translation line_number
                                                                                   <char>       <int>                    <lgcl>       <int>
1: DTPRINT(_("  sep='\\\\n' passed in meaning read lines as single character column\\n"))       71163                      TRUE        1674
2:                                           DTPRINT(_("  sep=',' so dec set to '.'\\n"))       83411                      TRUE        1892
Browse[1]> exclusions[is_outside_char_array(exclusion_pos, arrays)]
          file line1 capture_lengths
        <char> <int>           <int>
1: src/fread.c   438               0
2: src/fread.c  1366               0
3: src/fread.c  1733               0
4: src/fread.c  1783               0
5: src/fread.c  2111               0
6: src/fread.c  2119               0
7: src/fread.c  2305               0
8: src/fread.c  2775               0 # <-- why is line 2775 excluded?
9: src/fread.c  2794               0
Browse[1]> readChar(file, file.size(file)) |> substr(exclusion_pos[8]-32, exclusion_pos[8]+16)
[1] "\n      DTPRINT(\"  =====\\n\"); // # notranslate\n   " # <-- exclusion no.8 corresponds to a different line!

Since the exclusions are matched against the original, non-preprocessed file contents:

https://github.com/MichaelChirico/potools/blob/0dc529285c4f54a86d0755317d9304d735c3858f/R/get_src_messages.R#L75

...and the newlines are matched in the preprocessed file contents, where they have different offsets due to the comments being removed:

https://github.com/MichaelChirico/potools/blob/0dc529285c4f54a86d0755317d9304d735c3858f/R/get_src_messages.R#L77-L82

...the line numbers produced from exclusion_pos and newlines_loc end up being incorrect:

https://github.com/MichaelChirico/potools/blob/0dc529285c4f54a86d0755317d9304d735c3858f/R/get_src_messages.R#L250-L254
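The mismatch can be reproduced on a toy input. This is only a sketch: `strip_comments()` and `newline_offsets()` are hypothetical stand-ins for the preprocessing and newline lookup, not the actual potools code, and the regex handles only `//` comments.

```r
# Toy C source; the marker sits on line 2.
src <- paste(
  "// setup",
  'foo("a"); // # notranslate',
  'bar(_("b"));',
  'baz(_("c"));',
  "",
  sep = "\n"
)

# Hypothetical preprocessing: delete // comments, which shifts all later offsets.
strip_comments <- function(x) gsub("//[^\n]*", "", x, perl = TRUE)
preprocessed <- strip_comments(src)

# Offsets of line starts, in the same style as newlines_loc2 in this issue.
newline_offsets <- function(x) {
  c(0L, as.integer(gregexpr("\n", x, fixed = TRUE)[[1L]]))
}

# Exclusion offset taken from the ORIGINAL text.
exclusion_pos <- as.integer(regexpr("# notranslate", src, fixed = TRUE))

# Mixed coordinate systems: original offset, preprocessed newlines.
line_mixed <- findInterval(exclusion_pos, newline_offsets(preprocessed))
# Consistent coordinate system: both taken from the original text.
line_consistent <- findInterval(exclusion_pos, newline_offsets(src))

print(c(mixed = line_mixed, consistent = line_consistent))
# mixed = 3 (the bar() line, wrongly excluded), consistent = 2
```

Because the removed comment text no longer counts towards the newline offsets, the original offset lands on a later line, which is the same direction of error as 2775 versus 2113 above.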

Computing the newline offsets from the original file as well, so that they share a coordinate system with exclusion_pos, would have given the correct line number:

Browse[1]> newlines_loc2 = c(0L, as.integer(gregexpr("\n", readChar(file, file.size(file)), fixed = TRUE)[[1L]]))
Browse[1]> data.table(
      file = file,
      line1 = findInterval(as.integer(exclusion_pos), newlines_loc2),
      capture_lengths = attr(exclusion_pos, "capture.length")[ , 1L]
    )[8]
          file line1 capture_lengths
        <char> <int>           <int>
1: src/fread.c  2113               0
Browse[1]> readLines(file)[2113]
[1] "      DTPRINT(\"  =====\\n\"); // # notranslate"
Browse[1]>

...but there must be a better solution, one that is compatible with preprocessing.
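One possible direction, offered only as a sketch and not as a description of how potools preprocesses files: blank comments out with spaces of the same length instead of deleting them, so that character offsets and newline positions stay identical between the original and the preprocessed text. The helper `blank_comments()` is hypothetical, and its toy regex ignores block comments and `//` inside string literals.

```r
# Toy C source; the marker sits on line 2.
src <- paste(
  "// setup",
  'foo("a"); // # notranslate',
  'bar(_("b"));',
  "",
  sep = "\n"
)

# Hypothetical offset-preserving preprocessing: overwrite each // comment
# with the same number of spaces rather than removing it.
blank_comments <- function(x) {
  m <- gregexpr("//[^\n]*", x, perl = TRUE)
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(hits) strrep(" ", nchar(hits))
  )
  x
}

newline_offsets <- function(x) {
  c(0L, as.integer(gregexpr("\n", x, fixed = TRUE)[[1L]]))
}

blanked <- blank_comments(src)

# Offsets found in the original remain valid in the blanked text.
exclusion_pos <- as.integer(regexpr("# notranslate", src, fixed = TRUE))
line_from_blanked <- findInterval(exclusion_pos, newline_offsets(blanked))

print(line_from_blanked)
# 2
```

With this approach the exclusion markers could still be located in the original text, since the comments that carry them are gone from the blanked copy, while every offset stays comparable across both.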

aitap, Feb 09 '25 20:02