hertzbeat icon indicating copy to clipboard operation
hertzbeat copied to clipboard

[BUG] HttpCollectImpl XML parsing assumes UTF-8

Open pjfanning opened this issue 1 year ago • 6 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

https://github.com/apache/hertzbeat/blob/1fe70a60cc82d1fa83b9d2af3b46b047480796dc/hertzbeat-collector/hertzbeat-collector-basic/src/main/java/org/apache/hertzbeat/collector/collect/http/HttpCollectImpl.java#L251

If you have a String, you don't need to convert to byte array (which is almost a waste of memory).

DocumentBuilder has a parse(InputSource) method. https://docs.oracle.com/javase/8/docs/api/javax/xml/parsers/DocumentBuilder.html#parse-org.xml.sax.InputSource-

InputSources can be constructed to wrap StringWriters that wrap the String.

Expected Behavior

Don't convert Strings to byte arrays unnecessarily wasting memory and causing parse issues. Imagine if the XML has an XML declaration that has an encoding that is not UTF-8. If you already have the String, the parser will ignore the value. If you convert to a byte array, the parser will use the XML encoding value but you have explicitly converted to UTF-8 in your code so these encodings may not match.

Steps To Reproduce

No response

Environment

HertzBeat version(s): latest

Debug logs

No response

Anything else?

No response

pjfanning avatar Dec 02 '24 12:12 pjfanning

The underlying issue is more that you convert to a String in the first place.

https://github.com/apache/hertzbeat/blob/1fe70a60cc82d1fa83b9d2af3b46b047480796dc/hertzbeat-collector/hertzbeat-collector-basic/src/main/java/org/apache/hertzbeat/collector/collect/http/HttpCollectImpl.java#L135-L139

Using an InputStream or a byte array should be more more efficient than a String. It would definitely not be worse.

pjfanning avatar Dec 02 '24 12:12 pjfanning

Sorry for missing that, got it thanks, we will consider parsing directly from the bytes stream later, just like described in todo

tomsun28 avatar Jan 26 '25 08:01 tomsun28

@tomsun28 Instead of converting the InputStream into a String, we directly parse the InputStream as an XML stream and then use DocumentBuilder.parse() method directly with the InputStream, which avoids the intermediate String conversion. Is this approach right? Can i work on this issue?

Suvrat1629 avatar Jan 26 '25 12:01 Suvrat1629

@tomsun28 Instead of converting the InputStream into a String, we directly parse the InputStream as an XML stream and then use DocumentBuilder.parse() method directly with the InputStream, which avoids the intermediate String conversion. Is this approach right? Can i work on this issue?

of course, welcome.

tomsun28 avatar Jan 26 '25 12:01 tomsun28

@tomsun28 Well I think to convert

String resp = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);

to

InputStream inputStream = response.getEntity().getContent();

would require some significant changes to the code.

Suvrat1629 avatar Jan 30 '25 13:01 Suvrat1629

The parseResponseBySiteMap method cannot readily be changed to read an InputStream due to its failover code (to handle non-XML).

https://github.com/apache/hertzbeat/blob/4009525f81cd0641450050b60be4c456c76cad48/hertzbeat-collector/hertzbeat-collector-basic/src/main/java/org/apache/hertzbeat/collector/collect/http/HttpCollectImpl.java#L273-L284

InputStreams can only be read once while Strings can be used as input more than once.

pjfanning avatar Jan 30 '25 14:01 pjfanning