[BUG] HttpCollectImpl XML parsing assumes UTF-8
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
https://github.com/apache/hertzbeat/blob/1fe70a60cc82d1fa83b9d2af3b46b047480796dc/hertzbeat-collector/hertzbeat-collector-basic/src/main/java/org/apache/hertzbeat/collector/collect/http/HttpCollectImpl.java#L251
If you have a String, you don't need to convert to byte array (which is almost a waste of memory).
DocumentBuilder has a parse(InputSource) method. https://docs.oracle.com/javase/8/docs/api/javax/xml/parsers/DocumentBuilder.html#parse-org.xml.sax.InputSource-
InputSources can be constructed to wrap StringWriters that wrap the String.
Expected Behavior
Don't convert Strings to byte arrays unnecessarily wasting memory and causing parse issues. Imagine if the XML has an XML declaration that has an encoding that is not UTF-8. If you already have the String, the parser will ignore the value. If you convert to a byte array, the parser will use the XML encoding value but you have explicitly converted to UTF-8 in your code so these encodings may not match.
Steps To Reproduce
No response
Environment
HertzBeat version(s): latest
Debug logs
No response
Anything else?
No response
The underlying issue is more that you convert to a String in the first place.
https://github.com/apache/hertzbeat/blob/1fe70a60cc82d1fa83b9d2af3b46b047480796dc/hertzbeat-collector/hertzbeat-collector-basic/src/main/java/org/apache/hertzbeat/collector/collect/http/HttpCollectImpl.java#L135-L139
Using an InputStream or a byte array should be more more efficient than a String. It would definitely not be worse.
Sorry for missing that, got it thanks, we will consider parsing directly from the bytes stream later, just like described in todo
@tomsun28
Instead of converting the InputStream into a String, we directly parse the InputStream as an XML stream and then use DocumentBuilder.parse() method directly with the InputStream, which avoids the intermediate String conversion.
Is this approach right?
Can i work on this issue?
@tomsun28 Instead of converting the
InputStreaminto a String, we directly parse theInputStreamas an XML stream and then useDocumentBuilder.parse()method directly with theInputStream, which avoids the intermediate String conversion. Is this approach right? Can i work on this issue?
of course, welcome.
@tomsun28 Well I think to convert
String resp = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
to
InputStream inputStream = response.getEntity().getContent();
would require some significant changes to the code.
The parseResponseBySiteMap method cannot readily be changed to read an InputStream due to its failover code (to handle non-XML).
https://github.com/apache/hertzbeat/blob/4009525f81cd0641450050b60be4c456c76cad48/hertzbeat-collector/hertzbeat-collector-basic/src/main/java/org/apache/hertzbeat/collector/collect/http/HttpCollectImpl.java#L273-L284
InputStreams can only be read once while Strings can be used as input more than once.