rinruby

Pulled string is shortened if diacritics are contained

Open strepon opened this issue 9 years ago • 2 comments

If a string in R contains characters with diacritics, the string pulled into Ruby is shortened by the number of such characters.

For instance, this code

require "rinruby"
R.eval "text <- 'zkouška'"
print R.pull "text"

prints zkoušk only. (Tested on Ruby 2.1.5, R 3.2.2, Ubuntu 15.10.)

Anyway - thanks for rinruby, it's really useful! (I use it in a Jekyll plugin to generate a website from Rmd files.)

strepon avatar Jan 31 '16 16:01 strepon

Interesting. Something related to string encoding on the Ruby side, I think. I will see what is going on.

clbustos avatar Feb 01 '16 05:02 clbustos

Hi there.

I ran into similar issues.

The problem is related to the way that the socket communication channel reads/writes strings.

A short (incomplete) fix is to change https://github.com/clbustos/rinruby/blob/master/lib/rinruby.rb#L586

to be:

-writeBin(as.integer(nchar(var)),#{RinRuby_Socket},endian="big")
+writeBin(as.integer(nchar(var,type='bytes')),#{RinRuby_Socket},endian="big")

When RinRuby sets up the communication channel, it first writes the length of the string payload before sending out the string itself. This line is the one that writes the length of the payload.

On the Ruby side, when reading from the TCP socket, it first reads the 4-byte integer which tells it the length that it needs to read. It then does an IO read of that many bytes from the TCP socket.
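To make the read side concrete, here is a minimal sketch of a length-prefixed read in Ruby. It is not RinRuby's actual code: `read_framed` is a hypothetical helper name, and a `StringIO` stands in for the TCP socket so the snippet runs on its own.

```ruby
require "stringio"

# Hypothetical helper (not part of RinRuby): reads one length-prefixed
# payload from any IO-like object.
def read_framed(io)
  header = io.read(4)                # the 4-byte length prefix
  length = header.unpack("N").first  # "N" = 32-bit big-endian integer
  io.read(length)                    # then exactly `length` bytes of payload
end

# StringIO plays the role of the socket: prefix first, payload after it.
payload = "hello"
io = StringIO.new([payload.bytesize].pack("N") + payload)
framed = read_framed(io)
puts framed
```

The point is that the prefix is interpreted as a byte count, so whatever number the writer sends has to be the payload's size in bytes.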

When strings in R contain only ASCII characters, nchar(mystring) is the same as nchar(mystring,type='bytes'). In other words, the number of bytes equals the number of characters. So Ruby reads it just fine.

In your case, nchar('zkouška') is 7 and nchar('zkouška',type='bytes') is 8, because š takes two bytes in UTF-8. R tells Ruby to read 7 bytes, and so it cuts off the last 'a'.
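The same character/byte mismatch can be reproduced in plain Ruby, without R or a socket. This is only an illustration: `String#length` plays the role of nchar(x), `String#bytesize` the role of nchar(x,type='bytes'), and `byteslice` imitates a reader that was told to stop after 7 bytes. The š is written as the escape `\u0161` so that the literal is unambiguous.

```ruby
text = "zkou\u0161ka"          # "zkouška" with a precomposed š (U+0161)

char_count = text.length       # number of characters
byte_count = text.bytesize     # number of bytes in UTF-8

# A reader told to take only `char_count` bytes stops one byte early.
truncated = text.byteslice(0, char_count)

puts "characters: #{char_count}, bytes: #{byte_count}"
puts "read with character count: #{truncated}"
puts "read with byte count:      #{text.byteslice(0, byte_count)}"
```

Every two-byte character in the string costs one byte at the end, which matches the original report that the string is shortened by the number of such characters.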

Since RinRuby and the TCP socket read are reading by number of bytes, we need the R side to pass along the number of bytes.

When I have some more spare time, I will try to do a proper fork and pull request... but hopefully for now this helps the authors to know what's going on.

I say that the fix is incomplete because I'm having some encoding issues on the Ruby side -- it's coming in as ??? characters.
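One thing that might be worth checking here (a guess, not a confirmed diagnosis): in Ruby, `IO#read` with a length argument returns a string tagged as binary (ASCII-8BIT), so the bytes can be correct while the encoding label is not. The sketch below assumes the R side really sends UTF-8 bytes; `raw` is a stand-in for what a socket read would return.

```ruby
# Stand-in for bytes returned by a socket read: right bytes, binary label.
raw = "zkou\u0161ka".b

puts raw.encoding              # the binary encoding, not UTF-8
puts raw.length                # counts bytes, since every byte is one "character"

# Relabel the same bytes as UTF-8 (no bytes are changed), then verify.
decoded = raw.dup.force_encoding(Encoding::UTF_8)

puts decoded.encoding
puts decoded.valid_encoding?   # false would suggest R sent some other encoding
puts decoded
```

If `valid_encoding?` came back false in a real session, that would point at R sending the string in a non-UTF-8 native encoding, in which case relabeling alone would not be enough.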

Thanks for this very helpful gem!

thefooj avatar May 05 '16 13:05 thefooj