rinruby
Pulled string is shortened if diacritics are contained
If a string in R contains characters with diacritics, the string pulled into Ruby is shortened by the number of such characters.
For instance, this code
require "rinruby"
R.eval "text <- 'zkouška'"
print R.pull "text"
prints zkoušk only. (Tested on Ruby 2.1.5, R 3.2.2, Ubuntu 15.10.)
Anyway, thanks for rinruby, it's really useful! (I use it in a Jekyll plugin to generate a website from Rmd files.)
Interesting. It is probably related to string encoding on the Ruby side, I think. I will look into what is going on.
Hi there.
I ran into similar issues.
The problem is related to the way the socket communication channel reads and writes strings.
A short (incomplete) fix is to change https://github.com/clbustos/rinruby/blob/master/lib/rinruby.rb#L586 to:
-writeBin(as.integer(nchar(var)),#{RinRuby_Socket},endian="big")
+writeBin(as.integer(nchar(var,type='bytes')),#{RinRuby_Socket},endian="big")
When RinRuby sends a string over the communication channel, it first writes the length of the string payload and then the string itself. The line above is the one that writes the length of the payload.
On the Ruby side, when reading from the TCP socket, it first reads the 4-byte integer which tells it the length that it needs to read. It then does an IO read of that many bytes from the TCP socket.
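The length-prefixed read described above can be sketched in Ruby as follows. This is only an illustration under stated assumptions, not RinRuby's actual code: `read_length_prefixed` is a hypothetical helper, and a `StringIO` stands in for the TCP socket.

```ruby
require "stringio"

# Hypothetical helper (not part of RinRuby): read one length-prefixed
# payload from an IO-like object.
def read_length_prefixed(io)
  header = io.read(4)                # 4-byte length header
  length = header.unpack("N").first  # "N" = 32-bit big-endian integer
  io.read(length)                    # read exactly that many bytes
end

payload = "zkou\u0161ka".b           # "zkouška" as raw UTF-8 bytes (8 bytes)

# Header carries the byte count (what the reader needs)
correct = StringIO.new([payload.bytesize].pack("N") + payload)
# Header carries the character count, 7 (what the unpatched R line sends)
wrong = StringIO.new([7].pack("N") + payload)

full      = read_length_prefixed(correct).force_encoding("UTF-8")
truncated = read_length_prefixed(wrong).force_encoding("UTF-8")

puts full
puts truncated
```

With the character count in the header, the reader stops one byte early, which matches the shortened string reported in this issue.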
When strings in R contain only ASCII characters, nchar(mystring) is the same as nchar(mystring,type='bytes'). In other words, the number of bytes equals the number of characters, so Ruby reads the string just fine.
In your case, nchar('zkouška') is 7 and nchar('zkouška',type='bytes') is 8. R tells Ruby to read 7 bytes, and so it cuts off the last 'a'.
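The same character/byte mismatch can be checked from the Ruby side. A small sketch, assuming the string is UTF-8 encoded and that 'š' is the single code point U+0161 (the variable names are only for illustration):

```ruby
text = "zkou\u0161ka"          # "zkouška", UTF-8

char_count = text.length      # characters, like nchar(text) in R
byte_count = text.bytesize    # bytes, like nchar(text, type='bytes') in R

# Characters that occupy more than one byte in UTF-8
multibyte = text.chars.select { |c| c.bytesize > 1 }

# Keeping only char_count bytes drops the tail of the string
truncated = text.byteslice(0, char_count)

puts "#{char_count} characters, #{byte_count} bytes"
puts truncated
```

Each multi-byte character widens the gap between the two counts by its extra bytes, which is why the pulled string loses one trailing character per diacritic in this example.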
Since RinRuby and the TCP socket read are reading by number of bytes, we need the R side to pass along the number of bytes.
When I have some more spare time, I will try to do a proper fork and pull request... but hopefully for now this helps the authors to know what's going on.
I say that the fix is incomplete because I'm having some encoding issues on the Ruby side -- it's coming in as ??? characters.
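One possible cause of the remaining encoding problem, offered as an assumption and not verified against RinRuby's source: in Ruby, `IO#read` with a length argument returns a string tagged as binary (ASCII-8BIT), so the bytes may need to be relabelled as UTF-8 after reading. The sketch below assumes R really sends UTF-8 bytes; if R's locale is not UTF-8, the string may have to be converted on the R side first (for example with `enc2utf8`).

```ruby
# Bytes as they would come back from a length-limited socket read:
# correct content, but tagged as binary rather than UTF-8.
raw = "zkou\u0161ka".b

raw_encoding = raw.encoding                 # ASCII-8BIT (binary)

# Relabel the same bytes as UTF-8; no bytes are changed.
text = raw.dup.force_encoding("UTF-8")

# Guard against payloads that are not valid UTF-8 after all.
valid = text.valid_encoding?

puts raw_encoding
puts text if valid
```

`force_encoding` only changes the label on the string, so it is safe only when the sender's encoding is known; `valid_encoding?` is a cheap check before using the result.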
Thanks for this very helpful gem!