oce icon indicating copy to clipboard operation
oce copied to clipboard

need CRAN submission re text encoding

Open dankelley opened this issue 1 year ago • 9 comments

I've learned that R-devel is tightening the handling of encoding. With very gracious help from the developer in charge of that, I've learned that we must supply encoding in file() but not in readLines(), and also that we ought to remove useBytes in various spots.

My plan is to install the latest R-devel on my mac (once it is in a non-broken state), so I can do quick local tests. I think any R-devel will uncover the errors. I'm also under the impression that devtools::check_win_devel() should report errors, and I've put in a test there (results due in 1/2 hour).

Checklist:

  • [x] ensure that all encoding= args are removed from readLines() calls. This involves O(50) changes.
  • [x] ensure that encoding= is present in open() and file() calls. This involves checking, mainly, since I think I did that before. CHECKED WITH git grep 'open(' |grep -v 'encoding='|grep -v "rb"
  • [x] ensure that useBytes=TRUE is removed from grep() calls. (179 instances)
  • [x] " from gsub() calls (1 instance). CHECKED WITH git grep -n "gsub.*useBytes"

dankelley avatar Jul 13 '22 19:07 dankelley

Hm. devtools::check_win_devel() does not report errors, but maybe it needs a day or two to get an update.

dankelley avatar Jul 13 '22 20:07 dankelley

Hi Dan,

I've just committed a change to R-devel which makes sub/gsub/strsplit return a result marked as "bytes" when non-ASCII and using bytes (so when one of the inputs is "bytes" or useBytes=TRUE). Previously, these functions would leave the result unflagged, so potentially an invalid string, and not helping the purpose of "bytes" as given i ?Encoding ("Strings marked as "bytes"`are intended to be non-ASCII strings which should be manipulated as bytes, and never converted to a character encoding").

I've started a branch called encoding2. Also, below is what I was told re the problem:

oce fails its checks with this change:

 ── Failure (test_ctd.R:550:5): original names pair with final names
────────────
 dno$sigmaTheta (`actual`) not equal to "sigma-é00" (`expected`).

And it is easy to reproduce with:

library(oce)
f <- system.file("extdata", "d201211_0011.cnv", package="oce")
d <- read.oce(f)
dno <- d[["dataNamesOriginal"]]
Encoding(dno$sigmaTheta)

The result is now "bytes". The representation is the same as before, but the tests (correctly) compare it to a string flagged UTF-8, so the strings differ in encoding.

It is easy to find out where this happens using "trace":

trace(gsub, exit=quote(if ("bytes" %in% Encoding(returnValue())) browser()))
where 5: gsub("^# name [0-9][0-9]* = (.*):.*$", "\\1", h, ignore.case =
TRUE,
   useBytes = TRUE)
where 6: cnvName2oceName(lines[nameLines[iline]], columns, debug = debug -
   1)

Now, while debugging this, I also found out that read.oce, with this example, opens the connection via

file <- file(file, "r", encoding = encoding)

specifying "latin1" encoding. That's perfectly fine, but then

lines <- readLines(file, encoding = encoding)

which is wrong. The "encoding" argument of readLines is used to mark the text, not to do any conversions. What happens is that the file connection will convert the input from latin1 to the current encoding (usually UTF-8 as on my system), but then readLines flags that instead as "latin1", so the string becomes invalid, see

Browse[2]> charToRaw(lines[Encoding(lines)=="latin1"])
[1] 23 20 6e 61 6d 65 20 32 32 20 3d 20 73 69 67 6d 61 2d c3 a9 30 30
3a 20 44
[26] 65 6e 73 69 74 79 20 5b 73 69 67 6d 61 2d 74 68 65 74 61 2c 20 4b
67 2f 6d
[51] 5e 33 5d

but there is 2d c3 a9 30 30 so c3 a9, which is UTF-8

So far R doesn't always check string validity, but it will most likely be changed soon to do so, to make these errors easier to spot.

If the input files are simply valid latin1 files, which I think is the case (from our previous discussions), then it would make sense to simply convert then to the native encoding (not pass encoding=argument to readLines) and never use "useBytes=TRUE".

There is one theoretical problem that you might run on a system where some of the latin1 characters won't be representable in the native encoding, but the easiest option would be to ignore that (because such system would be very unlikely). Still, if you wanted to be somewhat robust against that, you could e.g. pass an unopened connection with specified encoding to readLines (then it will be converted and flagged as UTF-8). These things are described in ?file (Encoding section) and ?readLines. Please let me know if you had any questions,

dankelley avatar Jul 13 '22 20:07 dankelley

~GOOD NEWS as of this morning, devtools::check_win_devel() now reports an error. That means I won't have to rely on rhub being restored to full power (which has no stated timetable ... more days, more weeks, months??)~

dankelley avatar Jul 14 '22 11:07 dankelley

Um, re the "GOOD NEWS" from yesterday, R-devel has been reverted a bit, so it only produce diagnostics if env-var

_R_REGEX_MARK_NEW_RESULT_AS_BYTES_=TRUE 

is set. That's fine for local builds, but I can't see how to transmit env-vars to remote builds like devtools::check_win_devel() or rhub (the latter of which is nowadays too limited in its test platforms to be of any help, anyway).

But, I've succeeded in building today's R-devel from source, and that means I can test locally.

dankelley avatar Jul 15 '22 17:07 dankelley

Now get successful build-check-install with the "encoding2" branch, commit 1b7647e99c405e75fe925ec964b102f69a7017f4, with R-devel as of earlier today, using

_R_REGEX_MARK_NEW_RESULT_AS_BYTES_=TRUE R CMD build --compact-vignettes="qpdf" oce
_R_REGEX_MARK_NEW_RESULT_AS_BYTES_=TRUE R CMD check --as-cran oce_1.7-9.tar.gz
_R_REGEX_MARK_NEW_RESULT_AS_BYTES_=TRUE R CMD INSTALL oce_1.7-9.tar.gz

dankelley avatar Jul 15 '22 18:07 dankelley

I plan to merge "encoding2" into "develop" early tomorrow morning. I just don't like having too many branches in-play at once, because merge conflicts are a waste of time. Also, I want the user-base to have access, and to provide a testing army.

dankelley avatar Jul 15 '22 18:07 dankelley

I have merged "encoding2" into "develop". It will make sense for all users to be on the lookout for problems with this. This should only affect text-based files, so not e.g. netcdf files, rsk files, etc. The largest possibility of problems is with cnv files.

It would take a long time to explain the changes and why they were required, so I won't try. Besides, the actual impetus for this issue seems to have evaporated, since the planned changes to R-devel have been rolled back (too many packages were affected, apparently).

With the R-devel I have on my machine (built from source yesterday) I get extra debugging if I set an env-var as follows (in my ~/.zshrc):

export _R_REGEX_MARK_NEW_RESULT_AS_BYTES_="TRUE"

I think this will continue to work as R-devel evolves but I am not sure. My impression is that the breakage that was resulting in the past few days with CRAN R-devel systems will start to go away, because the changes to R-devel were rolled back. Still, I assume these changes will come back later, and so I am glad I altered oce, for later.

dankelley avatar Jul 16 '22 11:07 dankelley

Today I got a reminder to resubmit by Aug 30. The error (below, item 1) now is different from before, but I think still related to encoding.

I tried using the rhub version of that machine (item 2 below) which is advertised to have the same locale, but the output suggests that it is actually UTF8.

So, what to do?

My answer is simple: I am removing this as a CRAN test, but retaining it as a local test. That way, if a user has a problem, we can ask them to do some diagnostics about what's going on.

I don't see the benefit in exposing oce to CRAN rejection for errors that I cannot reproduce locally, that no user has reported to us, and that relate to niche functions like read.ctd.ssda(). (If this were read.ctd.sbe(), that would be an entirely different story!)

Item 1: the CRAN report

== Failed tests ================================================================
     -- Error (test_ctd_ssda.R:8:5): read.ctd.ssda() works (discussions/1929) -------
     Error in `1L:dataStart`: argument of length 0
     Backtrace:
     x
     1. \-oce::read.oce(file) at test_ctd_ssda.R:8:4
     2. \-oce::read.ctd.ssda(...)
     
     [ FAIL 1 | WARN 18 | SKIP 9 | PASS 2688 ]
     Error: Test failures
     Execution halted 
Flavor: [r-devel-linux-x86_64-debian-clang](https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-clang/oce-00check.html)

Item 2: the not-really-equivalent rhub machine

The below looks great, but when I did a test, I saw that the local was not as stated, but rather UTF-8. Perhaps they haven't set up that machine correctly yet, after rhub not working anywhere near properly for 2 months.

> rhub::platforms()
debian-clang-devel:
  Debian Linux, R-devel, clang, ISO-8859-15 locale

The test I did:

rhub::check(platform="debian-clang-devel",show_status=FALSE)

and the results of that test:

  • [x] https://artifacts.r-hub.io/oce_1.7-9.tar.gz-62907e7a227f4fe6b642e108c340b782/oce.Rcheck/

dankelley avatar Aug 16 '22 17:08 dankelley

rhub::check_for_cran() will report at the following (a check means it was OK, perhaps with comments at the end).

  • [x] https://builder.r-hub.io/status/oce_1.7-9.tar.gz-49a6cb1d3abd4195a4b79bc0afca6896
  • [x] https://builder.r-hub.io/status/oce_1.7-9.tar.gz-623400129a714582b45cec60ad5948cf
  • [x] https://builder.r-hub.io/status/oce_1.7-9.tar.gz-4d79f6b2112b4f6b9d20ef1556a06ff1
  • [x] https://builder.r-hub.io/status/oce_1.7-9.tar.gz-23dabd1b450348b4b1c02f556093a2d7

dankelley avatar Aug 16 '22 18:08 dankelley

My current plan is to submit to CRAN at 1300h (Halifax time) today, unless some DFO colleagues tell me that it is broken on ODF files (see #1996).

Why the rush? Because we have a deadline of Aug 20, and it can often take a day of fiddling to get things submitted, and if there are problems, that day can become a week. I don't want oce to get booted from CRAN.

dankelley avatar Aug 17 '22 10:08 dankelley

I just submitted the "develop" branch (see below). I will not merge into the "main" branch unless/until it gets accepted.

Version: 1.7-9 Date: 2022-08-17 16:05:34.905 UTC SHA: d6011974df0c29a643782a8fdbcab3b658498d92

dankelley avatar Aug 17 '22 16:08 dankelley

It was auto-rejected because the build-test time exceeded the limit of 10 minutes. Accordingly, I did a reply-to-all, with the following as an explanation.

I request that an exception to the 10-minute rule be granted, owing to the size of the package. For reference, below is a posting to R-pkg-devels dated 4:49AM 2022-05-13.

Actually, oce is one of the very few packages where you cannot do much to reduce this:

  • checking R code for possible problems ... [177s] OK
  • checking examples ... [128s] OK
  • checking tests ... [70s] OK
  • checking re-building of vignette outputs ... [52s] OK

This shows the runtime checks only consume 4 minutes, but the package is rather complex and needs a lot of time in the other parts of the checks.

Therefore reverse dependency checks have been triggered on CRAN.

Best, Uwe Ligges

dankelley avatar Aug 17 '22 17:08 dankelley

I resubmitted it. There was a problem with a unit in a manpage. Well, I think. I got a bit confused in the message I got when I asked for an allowance re the 10-min rule. Anyway, instead of wasting CRAN members' time by asking for more clarification, I changed what seemed to be a problem, bumped the version number, and resubmitted.

dankelley avatar Aug 18 '22 16:08 dankelley

It's made it past the first hurdle (background music as I write this: https://www.youtube.com/watch?v=KzvnEvbL-qs ... The Boss never fails to inspire!)

Dear maintainer,

package oce_1.7-10.tar.gz has been auto-processed and is pending an automated reverse dependency check. This service will typically respond to you within the next day. For technical reasons you may receive a second copy of this message when a team member triggers a new check.

Log dir: https://win-builder.r-project.org/incoming_pretest/oce_1.7-10_20220818_184512/ The files will be removed after roughly 7 days. Installation time in seconds: 180 Check time in seconds: 586 R Under development (unstable) (2022-08-17 r82724 ucrt)

Pretests results: Windows: https://win-builder.r-project.org/incoming_pretest/oce_1.7-10_20220818_184512/Windows/00check.log Status: OK Debian: https://win-builder.r-project.org/incoming_pretest/oce_1.7-10_20220818_184512/Debian/00check.log Status: OK

dankelley avatar Aug 18 '22 17:08 dankelley

Second hurdle passed --

Dear maintainer,

thanks, package oce_1.7-10.tar.gz is on its way to CRAN.

Best regards, CRAN teams' auto-check service Package check result: OK

No changes to worse in reverse depends.

dankelley avatar Aug 18 '22 17:08 dankelley