ksoup
EUC-KR charset is not parsable
I'm using version 0.1.9 with the ktor module to parse the response from this website: http://www.bodnara.co.kr/rss/rss_bodnara.xml
I obtain my source reader via response.bodyAsChannel().toByteArray().openSourceReader() and pass it to Ksoup.parse together with an XML parser and the charset name EUC-KR. However, this fails on Android with:
io.ktor.utils.io.charsets.MalformedInputException: Input length = 1
at io.ktor.utils.io.charsets.CharsetJVMKt.throwExceptionWrapped(CharsetJVM.kt:370)
at io.ktor.utils.io.charsets.CharsetJVMKt.decode(CharsetJVM.kt:241)
at io.ktor.utils.io.charsets.EncodingKt.decode(Encoding.kt:103)
at io.ktor.utils.io.charsets.EncodingKt.decode$default(Encoding.kt:101)
at com.fleeksoft.ksoup.io.CharsetImpl.decode(CharsetImpl.kt:47)
at com.fleeksoft.ksoup.io.KByteBuffer.readText(KByteBuffer.kt:63)
at com.fleeksoft.ksoup.ported.io.StreamDecoder.implRead(StreamDecoder.kt:147)
at com.fleeksoft.ksoup.ported.io.StreamDecoder.lockedRead(StreamDecoder.kt:87)
at com.fleeksoft.ksoup.ported.io.StreamDecoder.read(StreamDecoder.kt:50)
at com.fleeksoft.ksoup.ported.io.InputSourceReader.read(InputSourceReader.kt:46)
at com.fleeksoft.ksoup.parser.CharacterReader.doBufferUp(CharacterReader.kt:76)
at com.fleeksoft.ksoup.parser.CharacterReader.bufferUp(CharacterReader.kt:58)
at com.fleeksoft.ksoup.parser.CharacterReader.current(CharacterReader.kt:222)
at com.fleeksoft.ksoup.parser.TokeniserState$Data.read(TokeniserState.kt:12)
at com.fleeksoft.ksoup.parser.Tokeniser.read(Tokeniser.kt:38)
at com.fleeksoft.ksoup.parser.TreeBuilder.stepParser(TreeBuilder.kt:129)
at com.fleeksoft.ksoup.parser.TreeBuilder.runParser(TreeBuilder.kt:112)
at com.fleeksoft.ksoup.parser.TreeBuilder.parse(TreeBuilder.kt:77)
at com.fleeksoft.ksoup.parser.Parser.parseInput(Parser.kt:61)
at com.fleeksoft.ksoup.helper.DataUtil.parseInputSource(DataUtil.kt:179)
at com.fleeksoft.ksoup.helper.DataUtil.parseInputSource(DataUtil.kt:77)
at com.fleeksoft.ksoup.helper.DataUtil.load(DataUtil.kt:44)
at com.fleeksoft.ksoup.Ksoup.parse(Ksoup.kt:70)
I saw this commit: https://github.com/fleeksoft/ksoup/commit/0b76b21c5bb54d9a265cf331cea4365b32609b76, but I'm not on Windows, so it should work, right?
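For readers unfamiliar with this exception, here is a minimal JVM-only sketch of what "MalformedInputException: Input length = 1" generally signals: a strict decoder hit a byte that is not valid in the charset it was asked to decode. It uses plain java.nio rather than Ktor or ksoup, and decodeStrict is a hypothetical helper. It does not claim to reproduce the root cause of this issue.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset

// Hypothetical helper (not part of ksoup or Ktor): decode strictly, so that
// invalid input raises an exception instead of being replaced with U+FFFD.
fun decodeStrict(bytes: ByteArray, charsetName: String): String =
    Charset.forName(charsetName)
        .newDecoder() // a fresh decoder reports malformed input by default
        .decode(ByteBuffer.wrap(bytes))
        .toString()

fun main() {
    val title = "보드나라"
    val eucKrBytes = title.toByteArray(Charset.forName("EUC-KR"))

    // Decoding with the matching charset round-trips the text.
    println(decodeStrict(eucKrBytes, "EUC-KR") == title)

    // Decoding the same bytes as UTF-8 fails, because EUC-KR byte
    // sequences are not valid UTF-8.
    try {
        decodeStrict(eucKrBytes, "UTF-8")
    } catch (e: CharacterCodingException) {
        println("${e.javaClass.simpleName}: ${e.message}")
    }
}
```

A streaming decoder can presumably run into the same kind of error if it is handed a chunk that ends in the middle of a multi-byte character, but whether that is what happens here is not established in this thread.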
@vanniktech Can you please mention which variant you are using?
com.fleeksoft.ksoup:ksoup-ktor2:0.1.9
@vanniktech I tested with the following code and it worked fine:
val doc = Ksoup.parseGetRequest("https://www.bodnara.co.kr/rss/rss_bodnara.xml")
assertEquals("보드나라::전체기사", doc.selectFirst("title")?.text())
and this also worked fine:
val httpResponse = NetworkHelperKtor.instance.get("https://www.bodnara.co.kr/rss/rss_bodnara.xml")
val doc = Ksoup.parse(sourceReader = httpResponse.asSourceReader(), baseUri = "", parser = Parser.xmlParser())
assertEquals("보드나라::전체기사", doc.selectFirst("title")?.text())
Can you please share the code you use to read the bytes from the web?
This also works for me on the JVM (Desktop / Mac):
suspend fun main() {
    val url = "http://www.bodnara.co.kr/rss/rss_bodnara.xml"
    val request = HttpRequestBuilder().apply {
        url(url)
    }
    val response = HttpClient().get(request)
    val document = Ksoup.parse(
        sourceReader = response.bodyAsChannel().toByteArray().openSourceReader(),
        baseUri = url,
        charsetName = response.charset(),
        parser = Parser.xmlParser(),
    )
    println(document)
}

fun HttpResponse.charset() =
    headers[HttpHeaders.ContentType]?.asContentTypeOrNull()?.parameter("charset")
        ?: "UTF-8"

// https://youtrack.jetbrains.com/issue/KTOR-6241/Lenient-Content-Type-Parsing
internal fun String.asContentTypeOrNull() =
    runCatching { ContentType.parse(replace(", charset=", "; charset=")) }.getOrNull()
The same code crashes on Android though with the exception from my original issue. Did you try it on an Android emulator?
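As an aside, the charset lookup above can also be sketched without any Ktor types. The function below is a hypothetical, dependency-free illustration of the same idea: take the charset parameter from a Content-Type header value, tolerate a comma where a semicolon is expected, and fall back to UTF-8. It is not the Ktor API and only handles simple header values.

```kotlin
// Hypothetical helper (not part of Ktor or ksoup): extract the charset
// parameter from a Content-Type header value, falling back to UTF-8.
fun charsetFromContentType(header: String?, fallback: String = "UTF-8"): String {
    if (header == null) return fallback
    return header
        .split(';', ',')          // accept both "; charset=" and ", charset="
        .drop(1)                  // skip the media type itself, e.g. "text/xml"
        .map { it.trim() }
        .firstOrNull { it.startsWith("charset=", ignoreCase = true) }
        ?.substringAfter('=')
        ?.trim()
        ?.trim('"')               // the value may be quoted
        ?.takeIf { it.isNotEmpty() }
        ?: fallback
}

fun main() {
    println(charsetFromContentType("text/xml; charset=EUC-KR"))
    println(charsetFromContentType("text/xml, charset=EUC-KR"))
    println(charsetFromContentType("text/xml"))
    println(charsetFromContentType(null))
}
```

The header values in main are made-up examples, not the actual headers returned by the feed.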
@vanniktech Yes, there is an issue with the EUC-KR charset on Android with Ktor 2, but it works fine with Ktor 3. I’m looking into whether I can fix it on my end.
@vanniktech I’m trying to fix the issue; in the meantime, you can try this workaround, which works fine:
val httpResponse = NetworkHelperKtor.instance.get("https://www.bodnara.co.kr/rss/rss_bodnara.xml")
val doc = Ksoup.parse(
    html = httpResponse.bodyAsText(),
    baseUri = "",
    parser = Parser.xmlParser()
)
Reading the body as text and then parsing the resulting string works fine.
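The idea behind this workaround can be sketched on the plain JVM: decode the complete body into a String in one step, then hand that String to the parser so the parser never sees raw EUC-KR bytes. In the sketch below, decodeWholeBody is a hypothetical stand-in for bodyAsText(), which is assumed here to decode the full response using the response's charset. The XML is a made-up sample, not the real feed.

```kotlin
import java.nio.charset.Charset

// Hypothetical stand-in for "read the whole body, then decode it at once".
fun decodeWholeBody(body: ByteArray, charsetName: String): String =
    String(body, Charset.forName(charsetName))

fun main() {
    // Made-up sample document, encoded as EUC-KR to mimic the response bytes.
    val xml = "<rss><channel><title>보드나라::전체기사</title></channel></rss>"
    val body = xml.toByteArray(Charset.forName("EUC-KR"))

    // All bytes are available up front, so no multi-byte character can be
    // split across reads; the result is an ordinary Kotlin String.
    val text = decodeWholeBody(body, "EUC-KR")
    println(text == xml)
}
```

The decoded string can then be passed to Ksoup.parse(html = ..., parser = Parser.xmlParser()) as in the snippet above. The trade-off is that the whole body is held in memory as text before parsing starts.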
That's neat. I've changed it. Maybe Ktor 3 has a better/improved charset implementation due to the switch to kotlinx-io?
@vanniktech This is fixed in the upcoming version. Thanks.