EUC-KR charset is not parsable

Open vanniktech opened this issue 1 year ago • 7 comments

I'm using version 0.1.9 with the ktor module to parse the response from this website: http://www.bodnara.co.kr/rss/rss_bodnara.xml

I get my source reader via response.bodyAsChannel().toByteArray().openSourceReader() and then call Ksoup.parse with an XML parser and the charset set to EUC-KR. However, this fails on Android:

io.ktor.utils.io.charsets.MalformedInputException: Input length = 1
  at io.ktor.utils.io.charsets.CharsetJVMKt.throwExceptionWrapped(CharsetJVM.kt:370)
  at io.ktor.utils.io.charsets.CharsetJVMKt.decode(CharsetJVM.kt:241)
  at io.ktor.utils.io.charsets.EncodingKt.decode(Encoding.kt:103)
  at io.ktor.utils.io.charsets.EncodingKt.decode$default(Encoding.kt:101)
  at com.fleeksoft.ksoup.io.CharsetImpl.decode(CharsetImpl.kt:47)
  at com.fleeksoft.ksoup.io.KByteBuffer.readText(KByteBuffer.kt:63)
  at com.fleeksoft.ksoup.ported.io.StreamDecoder.implRead(StreamDecoder.kt:147)
  at com.fleeksoft.ksoup.ported.io.StreamDecoder.lockedRead(StreamDecoder.kt:87)
  at com.fleeksoft.ksoup.ported.io.StreamDecoder.read(StreamDecoder.kt:50)
  at com.fleeksoft.ksoup.ported.io.InputSourceReader.read(InputSourceReader.kt:46)
  at com.fleeksoft.ksoup.parser.CharacterReader.doBufferUp(CharacterReader.kt:76)
  at com.fleeksoft.ksoup.parser.CharacterReader.bufferUp(CharacterReader.kt:58)
  at com.fleeksoft.ksoup.parser.CharacterReader.current(CharacterReader.kt:222)
  at com.fleeksoft.ksoup.parser.TokeniserState$Data.read(TokeniserState.kt:12)
  at com.fleeksoft.ksoup.parser.Tokeniser.read(Tokeniser.kt:38)
  at com.fleeksoft.ksoup.parser.TreeBuilder.stepParser(TreeBuilder.kt:129)
  at com.fleeksoft.ksoup.parser.TreeBuilder.runParser(TreeBuilder.kt:112)
  at com.fleeksoft.ksoup.parser.TreeBuilder.parse(TreeBuilder.kt:77)
  at com.fleeksoft.ksoup.parser.Parser.parseInput(Parser.kt:61)
  at com.fleeksoft.ksoup.helper.DataUtil.parseInputSource(DataUtil.kt:179)
  at com.fleeksoft.ksoup.helper.DataUtil.parseInputSource(DataUtil.kt:77)
  at com.fleeksoft.ksoup.helper.DataUtil.load(DataUtil.kt:44)
  at com.fleeksoft.ksoup.Ksoup.parse(Ksoup.kt:70)
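
For context on the message itself: on the JVM, a strict `CharsetDecoder` reports `MalformedInputException` when it meets bytes it cannot turn into a character. The sketch below uses only `java.nio` and shows one way to trigger that error, by cutting a two-byte EUC-KR character in half. It is an illustration, not a diagnosis: nothing in this thread confirms that this is what happens inside Ktor 2 on Android, and `decodeOrNull` is a hypothetical helper, not a Ksoup or Ktor API.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset

// Hypothetical helper: strict decode that returns null instead of throwing.
// newDecoder() reports malformed input by default, so bad bytes raise
// a CharacterCodingException (MalformedInputException is a subclass).
fun decodeOrNull(bytes: ByteArray, charset: Charset): String? =
    try {
        charset.newDecoder().decode(ByteBuffer.wrap(bytes)).toString()
    } catch (e: CharacterCodingException) {
        println("decode failed: ${e.message}")
        null
    }

fun main() {
    val eucKr = Charset.forName("EUC-KR")
    val full = "보".toByteArray(eucKr)      // one Hangul syllable, two bytes in EUC-KR
    val truncated = full.copyOfRange(0, 1)  // only the lead byte survives

    println(decodeOrNull(full, eucKr))      // decodes normally
    println(decodeOrNull(truncated, eucKr)) // strict decoder rejects the lone lead byte
}
```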

I saw this commit: https://github.com/fleeksoft/ksoup/commit/0b76b21c5bb54d9a265cf331cea4365b32609b76. But I'm not on Windows, so it should work?

vanniktech avatar Sep 25 '24 12:09 vanniktech

@vanniktech Can you please mention which variant you are using?

itboy87 avatar Sep 25 '24 13:09 itboy87

com.fleeksoft.ksoup:ksoup-ktor2:0.1.9

vanniktech avatar Sep 25 '24 13:09 vanniktech

@vanniktech I tested with the following code and it worked fine:

val doc = Ksoup.parseGetRequest("https://www.bodnara.co.kr/rss/rss_bodnara.xml")
assertEquals("보드나라::전체기사", doc.selectFirst("title")?.text())

and this also worked fine:

val httpResponse = NetworkHelperKtor.instance.get("https://www.bodnara.co.kr/rss/rss_bodnara.xml")
val doc = Ksoup.parse(sourceReader = httpResponse.asSourceReader(), baseUri = "", parser = Parser.xmlParser())
assertEquals("보드나라::전체기사", doc.selectFirst("title")?.text())

Can you please share the code you use to read the bytes from the web?

itboy87 avatar Sep 25 '24 14:09 itboy87

This also works for me on the JVM (Desktop / Mac):

suspend fun main() {
  val url = "http://www.bodnara.co.kr/rss/rss_bodnara.xml"
  val request = HttpRequestBuilder().apply {
    url(url)
  }

  val response = HttpClient().get(request)
  val document = Ksoup.parse(
    sourceReader = response.bodyAsChannel().toByteArray().openSourceReader(),
    baseUri = url,
    charsetName = response.charset(),
    parser = Parser.xmlParser(),
  )

  println(document)
}

fun HttpResponse.charset() = headers[HttpHeaders.ContentType]?.asContentTypeOrNull()?.parameter("charset")
  ?: "UTF-8"

// https://youtrack.jetbrains.com/issue/KTOR-6241/Lenient-Content-Type-Parsing
internal fun String.asContentTypeOrNull() =
  runCatching { ContentType.parse(replace(", charset=", "; charset=")) }.getOrNull()

The same code crashes on Android though with the exception from my original issue. Did you try it on an Android emulator?

vanniktech avatar Sep 26 '24 07:09 vanniktech

@vanniktech Yes, there is an issue with the EUC-KR charset in Android with Ktor 2, but it’s working fine with Ktor 3. I’m looking into whether I can fix it on my end.

itboy87 avatar Sep 26 '24 09:09 itboy87

@vanniktech I’m trying to fix the issue; in the meantime, you can try this, which works fine:

val httpResponse = NetworkHelperKtor.instance.get("https://www.bodnara.co.kr/rss/rss_bodnara.xml")
val doc = Ksoup.parse(
    html = httpResponse.bodyAsText(),
    baseUri = "",
    parser = Parser.xmlParser()
)

Reading text from ChannelBody and parsing it works fine.
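
If you already hold the raw bytes, the same idea (decode the whole body to a String first, then parse the String) can be sketched on JVM/Android with the platform charset support. This is a minimal sketch under assumptions: `decodeBody` is a hypothetical helper, not part of Ksoup or Ktor, and it assumes the runtime ships an EUC-KR charset.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: decode a complete response body in one go using the
// charset named in the response, falling back to UTF-8 when the name is
// missing or not supported by the runtime.
fun decodeBody(bytes: ByteArray, charsetName: String?): String {
    val charset = charsetName
        ?.let { name -> runCatching { Charset.forName(name) }.getOrNull() }
        ?: Charsets.UTF_8
    return String(bytes, charset)
}

fun main() {
    // Stand-in for the HTTP body: the feed title encoded as EUC-KR bytes.
    val original = "<title>보드나라::전체기사</title>"
    val bytes = original.toByteArray(Charset.forName("EUC-KR"))

    val text = decodeBody(bytes, "EUC-KR")
    println(text)
}
```

The resulting String can then be passed to Ksoup.parse(html = ..., baseUri = ..., parser = Parser.xmlParser()) as in the snippet above, which takes the streaming decoder out of the picture.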

itboy87 avatar Sep 26 '24 09:09 itboy87

That's neat. I've changed it. Maybe Ktor 3 has a better charset implementation due to the switch to kotlinx-io?

vanniktech avatar Sep 26 '24 11:09 vanniktech

@vanniktech This is fixed in the upcoming version. Thanks.

itboy87 avatar Oct 11 '24 04:10 itboy87