
Decode text in any charset (utf-16 and others)

Open soujiro32167 opened this issue 5 years ago • 13 comments

Similar to fs2.text.utf8Decode, I would like to auto-detect and decode text in other charsets.
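
For context, a minimal sketch of the existing UTF-8-only pipe that this issue asks to generalise (assuming fs2 2.x and cats-effect IO):

import cats.effect.IO
import fs2.Stream
import java.nio.charset.StandardCharsets

// fs2.text.utf8Decode already handles multi-byte UTF-8 characters split across chunks,
// but there is no equivalent for other charsets such as UTF-16.
val bytes: Stream[IO, Byte] =
  Stream.emits("héllo wörld".getBytes(StandardCharsets.UTF_8)).covary[IO]
val decoded: Stream[IO, String] = bytes.through(fs2.text.utf8Decode)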

soujiro32167 avatar Jun 28 '20 16:06 soujiro32167

Looks like that's been around in http4s for a while! Thanks @rossabaker https://github.com/http4s/http4s/commit/32eaa8f82d541fe6de5877aeadf66f8ec9b81c41. This does not auto-detect, but it does handle multi-byte characters that straddle chunk boundaries. I ended up using https://github.com/albfernandez/juniversalchardet for auto-detection.
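
To illustrate the chunk-boundary problem that decoder solves, here is a small sketch (fs2 2.x assumed) of what goes wrong when a UTF-16 code unit straddles two chunks and each chunk is decoded independently:

import fs2.{Chunk, Stream}
import java.nio.charset.StandardCharsets

// "hi" in UTF-16BE is the four bytes 00 68 00 69. If a chunk boundary falls between
// the two bytes of a code unit, per-chunk decoding corrupts the text, which is why a
// stateful decoder has to carry leftover bytes into the next chunk.
val utf16 = "hi".getBytes(StandardCharsets.UTF_16BE)
val split = Stream.chunk(Chunk.bytes(utf16.take(3))) ++ Stream.chunk(Chunk.bytes(utf16.drop(3)))

// Naive per-chunk decoding yields "h\uFFFD" and "\uFFFD" instead of "hi"
val naive: List[String] =
  split.chunks.map(c => new String(c.toArray, StandardCharsets.UTF_16BE)).toList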

soujiro32167 avatar Jul 12 '20 16:07 soujiro32167

I'm not entirely sure if automatic detection is feasible. I would however like to work on this. Which charsets should be prioritised? According to Google, the most used character sets are ASCII, ISO 8859-1 and UTF-8, the last of which is already supported.

vasilmkd avatar Jul 12 '20 20:07 vasilmkd

I've been meaning to contribute http4s's decoder for a while. The main thing holding me back was being a bit more rigorous on the testing, and the big thing holding me back from being rigorous on the testing is that it's a pain to create valid generators for arbitrary charsets. Ours also has an outstanding bug, where it strips BOM markers that appear after the start of the stream.

I think @vasilmkd is on the right track focusing on a few. The three you mentioned are the big ones. I've also seen Windows-1252 a lot in the wild, though it's in decline. My experience is biased toward English, so I also found stats of unknown credibility. It would be good to choose something not Western European.

There are six charsets guaranteed to be on every JVM (the StandardCharsets), and testing any others sacrifices portability. In practice, test environments are likely to come with all the significant ones, though you could skip the test where a charset is not found.
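
For reference, the six guaranteed charsets are the java.nio.charset.StandardCharsets constants:

import java.nio.charset.StandardCharsets

// Guaranteed on every Java platform; anything beyond these depends on the runtime's installed charsets.
val guaranteed = Set(
  StandardCharsets.US_ASCII,
  StandardCharsets.ISO_8859_1,
  StandardCharsets.UTF_8,
  StandardCharsets.UTF_16BE,
  StandardCharsets.UTF_16LE,
  StandardCharsets.UTF_16)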

You might be more efficient than the Java charset encoders by handwriting the single-byte codecs, but the http4s one should give a good start on the general problem.
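
As a sketch of that hand-written single-byte idea (hypothetical helpers, not existing fs2 or http4s code): a single-byte charset maps each byte value to one char, so a 256-entry table built once can replace a CharsetDecoder in the hot path.

import java.nio.charset.Charset

// Build the byte -> char table by decoding each possible byte value once.
// Assumes the charset decodes every byte value to exactly one char
// (unmappable bytes come back as the replacement character).
def lookupTable(charset: Charset): Array[Char] =
  new String(Array.tabulate[Byte](256)(_.toByte), charset).toCharArray

// Decode a chunk of bytes with plain array indexing instead of a CharsetDecoder.
def decodeSingleByte(bytes: Array[Byte], table: Array[Char]): String = {
  val out = new Array[Char](bytes.length)
  var i = 0
  while (i < bytes.length) {
    out(i) = table(bytes(i) & 0xff)
    i += 1
  }
  new String(out)
}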

I would hesitate to do autodetection in fs2, because of the liability of another dependency. A microlibrary that autodetects using juniversalchardet could be nice, and could delegate to this decoder once detected.

rossabaker avatar Jul 13 '20 04:07 rossabaker

@rossabaker Thanks for linking the StandardCharsets. I think it would be beneficial to support them. I also agree about the portability note.

vasilmkd avatar Jul 13 '20 09:07 vasilmkd

FWIW, this is what we've been using. It doesn't support multibyte charsets (other than UTF-8), but it does support the single-byte ones:

import java.nio.charset.{Charset, StandardCharsets}
import fs2.{Chunk, Pipe}

val multibyteCharsets = Set(
  StandardCharsets.UTF_8,
  StandardCharsets.UTF_16,
  StandardCharsets.UTF_16BE,
  StandardCharsets.UTF_16LE)

def charsetPipe[F[_]](charset: Charset): Option[Pipe[F, Byte, String]] = {
  import StandardCharsets._
  charset match {
    // US-ASCII is a strict subset of UTF-8, so the existing pipe covers it
    case UTF_8 | US_ASCII                   => Some(fs2.text.utf8Decode[F])
    // multibyte charsets need stateful decoding across chunk boundaries; not handled here
    case c if multibyteCharsets.contains(c) => None
    // single-byte charsets can safely be decoded one chunk at a time
    case singleByteCharset =>
      Some(
        _.mapChunks(bs => Chunk.singleton(new String(bs.toBytes.toArray, singleByteCharset))))
  }
}
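
A hypothetical use of the pipe above, assuming some bytes: Stream[IO, Byte] obtained elsewhere:

import cats.effect.IO
import fs2.{Pipe, Stream}
import java.nio.charset.StandardCharsets

// ISO-8859-1 falls through to the single-byte branch of charsetPipe
val latin1: Pipe[IO, Byte, String] =
  charsetPipe[IO](StandardCharsets.ISO_8859_1)
    .getOrElse(sys.error("charset needs stateful multibyte decoding"))
val text: Stream[IO, String] = bytes.through(latin1)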

Daenyth avatar Jul 13 '20 14:07 Daenyth

The charsetPipe could generate false positives for the single-byte case. You could call maxCharsPerByte() on a charset's decoder to be sure.

Creating ScalaCheck generators for all the charsets is an important part of testing. Unfortunately, there isn't a standard method on charsets that produces the relevant alphabet. The standard six are easy, but testing others will probably require some spec reading.

rossabaker avatar Jul 13 '20 14:07 rossabaker

Actually, you can come close to deriving the alphabet for an arbitrary charset with canEncode from a CharsetEncoder. It just returns false negatives for surrogates, which is why there is a CharSequence version. And surrogates are where a lot of the corner cases are.
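
A sketch of a generator along those lines (ScalaCheck assumed; surrogates are simply excluded here rather than exercised through the CharSequence overload):

import java.nio.charset.Charset
import org.scalacheck.Gen

// Generate strings containing only characters the given charset can encode,
// so encode-then-decode round-trip properties are meaningful for that charset.
def stringIn(charset: Charset): Gen[String] = {
  val encoder = charset.newEncoder()
  // Precompute the non-surrogate chars this charset can encode (assumed non-empty),
  // then sample from that alphabet instead of filtering generated chars.
  val alphabet: IndexedSeq[Char] =
    (Char.MinValue to Char.MaxValue).filter(c => !Character.isSurrogate(c) && encoder.canEncode(c))
  Gen.listOf(Gen.oneOf(alphabet)).map(_.mkString)
}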

rossabaker avatar Jul 13 '20 14:07 rossabaker

The charsetPipe could generate false positives for single byte

Yup. For our use case it was OK, but it's not great for a general tool.

maxCharsPerByte

neat, I didn't know that existed!

Would that then be something like

case c if c.newDecoder().maxCharsPerByte == 1 =>

(And cache the Decoder)

Daenyth avatar Jul 13 '20 15:07 Daenyth

I'm not sure how expensive decoders are to create. You could iterate all available charsets on your JVM, create a decoder, and build a static set from there. I think JVMs can install charsets at runtime, but you're deep into the weeds by that point.
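
A sketch of that precomputed set (not an fs2 API; Scala 2.13's CollectionConverters assumed). Note that UTF-8's decoder also reports a maxCharsPerByte of 1.0, so UTF-8 and the other known multibyte charsets should still be matched before this check, as charsetPipe above already does:

import java.nio.charset.Charset
import scala.jdk.CollectionConverters._
import scala.util.Try

// Built once at startup: every installed charset whose decoder reports at most
// one char per byte, so the single-byte fast path becomes a set lookup.
val singleByteLike: Set[Charset] =
  Charset.availableCharsets().values().asScala.toSet
    .filter(cs => Try(cs.newDecoder().maxCharsPerByte() == 1.0f).getOrElse(false))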

rossabaker avatar Jul 13 '20 16:07 rossabaker

That's what I meant

Daenyth avatar Jul 13 '20 19:07 Daenyth

Here is my current code:

  import java.nio.charset.Charset
  import scala.util.Try
  import cats.effect.{Blocker, IO, Sync}
  import cats.syntax.all._
  import fs2.{Pipe, RaiseThrowable}
  import org.mozilla.universalchardet.UniversalDetector

  // CharsetError is an application-defined error type (assumed here to extend Throwable)
  def stringToCharset(s: String): Either[CharsetError, Charset] =
    Try(Charset.forName(s)).toEither.leftMap(ex => CharsetError(ex.getMessage))

  // Feed up to sampleSize bytes to juniversalchardet and emit the detected Charset,
  // failing if nothing was detected or the detected name is unknown to the JVM
  // (the default sampleSize is a constant defined elsewhere in the application)
  def detectCharsetStream[F[_]](sampleSize: Long = sampleSize)(implicit F: Sync[F]): Pipe[F, Byte, Charset] = { s =>
    val detector = new UniversalDetector()
    s.take(sampleSize)
      .chunkAll
      .evalMap { sample =>
        detector.handleData(sample.toArray)
        detector.dataEnd()
        F.fromOption(Option(detector.getDetectedCharset), CharsetError("no charset detected"))
          .flatMap(stringToCharset _ andThen F.fromEither)
      }
  }

  // Delegate the actual decoding to http4s' internal charset-aware decoder
  def decode[F[_]: RaiseThrowable](charset: Charset): Pipe[F, Byte, String] =
    org.http4s.internal.decode(org.http4s.Charset.fromNioCharset(charset))

// Detect the charset from a sample of the file, then read it again and decode
val s = fs2.io.file.readAll[IO](path, Blocker.liftExecutionContext(ec), 4096)
val result = s.through(detector.detectCharsetStream()).flatMap(cs => s.through(decode(cs)))

soujiro32167 avatar Jul 14 '20 17:07 soujiro32167

A universal decoder is really missing in fs2. I would appreciate @rossabaker if you could contribute your solution from http4s. Even if it isn't as thoughtfully tested as you would like, it is probably still better than many of us throwing together our own solutions, even less rigorously tested. We can improve it later.

susuro avatar Aug 25 '20 18:08 susuro

For reference, in http4s:

Some links that could be used to build a generator / sample text corpus: https://stackoverflow.com/questions/9190330/is-there-a-set-of-lorem-ipsums-files-for-testing-character-encoding-issues

Daenyth avatar Sep 09 '20 01:09 Daenyth