rwppa icon indicating copy to clipboard operation
rwppa copied to clipboard

problems about WebTextHandler

Open CoolSpring8 opened this issue 5 years ago • 1 comments

Currently we have this in internal/proxy/proxy.go:

// WebTextHandler fixes links in page and "src" issues in javascript files.
// This solution, however, may prevent content streaming. Fix it?
func WebTextHandler() goproxy.RespHandler {
	return goproxy_html.HandleString(
		func(s string, ctx *goproxy.ProxyCtx) string {
			c := RVPNURLMatcher.ReplaceAllString(s, "$1://")
			rawURLWithPort := ctx.UserData.(reqData).rawURLWithPort
			rawURLWithoutPort := ctx.UserData.(reqData).rawURLWithoutPort
			c = strings.ReplaceAll(c, rawURLWithPort[:strings.LastIndex(rawURLWithPort, "/")+1], "") // possible out of bounds?
			c = strings.ReplaceAll(c, rawURLWithoutPort[:strings.LastIndex(rawURLWithoutPort, "/")+1], "")
			return c
		})
}

As the comments indicates, there are several problems.

  1. Issue of "src" in javascript files is not completely resolved. Corruption of scripts still exists on some sites.
  2. Out of bounds in trying to get the string before (including) last "/".
  3. Streaming of response. Apparently for now the result is first "ReadAll"-ed, processed and sent to the browser.

Solutions and plans:

  • [ ] 1: This requires a close inspecting and some experiments on some sites.
  • [ ] 2: A better way to do the extracting. Note since this is also related to "src" issue as mentioned above, if we can find other ways to solve that, there is no need to do this substitution anymore.
  • [ ] 3: This is more of a performance concern. Let's have a detailed explanation in the next reply.

And one more small thing: CSS files can probably be excluded from being applied with this handler.

CoolSpring8 avatar Jul 31 '20 00:07 CoolSpring8

For 3: Well, I'm not sure if browsers really start rendering one element before its downloading finishes.

The current behavior is that "Waiting (TTFB)" takes a long time, and "Content Download" phase finishes like lightning (as rwppa runs locally).

As introducing streaming text replacement will add overhead and requires more work, it would be better to know if this have a perceivable impact on user experience.

An extra bonus would be being more stable and have a lower memory usage.

So, if having decided to optimize this:

  • Streaming text replacement: https://github.com/icholy/replace This one supports regular expressions besides normal bytes replace and string replace.
  • Character set conversion is a big problem. One solution is to convert content to utf-8, do processing and converts back. Another is to convert the new string (in strings.ReplaceAll) to the given encoding. (However, it looks like that regexp cannot be encoded this way.) Refer to following resources:
    • https://github.com/elazarl/goproxy/blob/master/ext/html/html.go
      • It uses https://github.com/rogpeppe/go-charset/ which is said to be superseded and should be replaced with golang.org/x/text/transform .
    • github.com/emersion/go-message/charset
      • This seems to be an actual practice of using golang.org/x/text/transform .

CoolSpring8 avatar Jul 31 '20 01:07 CoolSpring8