online
online copied to clipboard
Speed-up variable substitution
This is an Easy Hack. Potential mentors: @Ashod
Detailed description and rationale
When we serve our initial HTML files - we have a reasonably large file loleaflet.html - and before we serve it we do a lot of repeated substitutions in it:
Code pointers
checkout:
wsd/FileServer.cpp
We do things like:
Poco::replaceInPlace(preprocess, std::string("%ACCESS_TOKEN%"), escapedAccessToken);
Poco::replaceInPlace(preprocess, std::string("%ACCESS_TOKEN_TTL%"), std::to_string(tokenTtl));
Poco::replaceInPlace(preprocess, std::string("%ACCESS_HEADER%"), escapedAccessHeader);
Which continually re-parse and re-allocate a std::string which is far from optimal.
We should instead build an unordered_set of all the variables we want to replace; and then search through the file for a first '%' and then if it has a reasonably short, ALL_CAPS type variable after it - look that variable up and substitute it and if not just continue appending it to a string stream or somesuch to save lots of extra allocations.
Links to related documentation
We prolly want something like:
std::string s; s.reserve(existing file-size);
and just s.append(chunk);
as we go along.
Would be nice to have a util method that does this cleanly for us that takes an unordered_set I guess.
Thanks !
I've been mulling over this for a week now (while looking at implementing a feature), not knowing this easy hack existed! (Note to self: check your missed notifications more frequently).
Anyway, my idea is very similar to the description (with a small tweak). Here it goes:
- While caching files, store templates with
%
variables in a special container that has the contents string pre-tokenized (ideally, this container is StringVector or uses it directly). This is done only once at startup. - Construct an
unordered_map
with <variable-name, value> pairs beforepreprocessFile
and pass it as input. This is done for each request. - In
preprocessFile
simply run over the pre-tokenized file contents and for every non-variable token just append it to the output (which should have reserved size > original file-size, since many substitutions are larger than the variable-name length), and for each variable token substitute with the value from theunordered_map
, if it exists, otherwise do what is done right now (for backwards compatibility, unless we know it's safe to do something different and more efficient).
Great =) of course this is not urgent, but would be a great easy hack for a newbie.
hey, I'm pretty new to open source, you want to put your input inside a set and later check if the next input is already exists means outputted or not?
Sure - the realization is that instead of re-parsing and scanning this string many many times searching for %FOO%, %BAA%, etc. we search for %
@mmeeks okay, I think with some experiments I would understand it much better, how will I be running this locally, can you guide me on this? and where is the location of the file?
Hi @ArshErgon! Thanks for your interest in this task.
This is a file-server over http. When we get a request, we have to serve the file, but only after we replace any variables in the file (marked between two % characters) with their true values. This variable replacement can be improved to be faster by splitting the file up-front (done once right after reading the file contents), so we don't have to do that every time we get a request. We then have all the variables in a map (not a set), so we can quickly lookup the variable name in the map and output its value instead. Notice that once we read the files from disk, there is no disk I/O any more; It's all in memory buffers.
The code is fully in C++. You'd also need to be able to compile the project and run the tests. Please see the readme in the main project to get you started there. The details of the task, include the file where the changes are needed, are in the description above and my comment breaks down the algorithm further.
Let me know if you have any questions!
Hello! Is this issue still open or not fixed?
Hi, which tests should I exactly use for this issue? Thank you!
Hi, which tests should I exactly use for this issue? Thank you!
@NoahbJohnson15,
Ideally, all the tests should be run and pass. But for this change, manually loading documents in the browser is also needed. That is because this changes the files being served to the browser.
Thanks!
Hi @NoahbJohnson15 , are you still working on this?
@hugopeixoto weren't you interested in this one?
Hello, I was wondering if this issue is still being worked on. If not I'd like to give it a shot. As of now, I managed to setup a working environment using Gitpod and skimmed over the file that needs to be updated. As far as I've understood, the task is to build a simple parser that is supposed to traverse the file once and while doing so has to replace the variables in the format %VARIABLE_NAME% with their respective values that are stored in a lookup table (possibly an unordered map). So a simple utility function that takes references to the mentioned file and the lookup table should suffice, right? If so where should this utility function should be placed?
Also in case if the issue is assigned to me, any help about testing the implementation using Gitpod would be appreciated as this is the first time I am using Gitpod.
Sounds like you have a good grasp of it perlescen7 - if you want to work on it just mention that you are in the ticket and go for it =) Hopefully it's not too hard to parse the string to the 1st % and then to the next % and so on, do the lookups, substitutions and so on =) Would be nice to reduce allocation thrash when doing so with some reasonably sized string buffer or whatever that we write into. Thanks !
@Ashod, it appears that there is an existing pull request (PR) that you haven't reviewed for a while. However, I'm interested in working on this issue if the current solution proposed in the PR is not deemed appropriate. Could you please review the PR and let me know if I should proceed with working on it? I find the problem quite intriguing and would be eager to take a shot at solving it. Thank you for your guidance.
@codewithvk Thanks for your interest and for reaching out. This part of the code is implicated in a more complex feature to load documents asynchronously, which is quite a significant change. As such, I will be manually picking up bits and pieces from @pearlescen7's PR (preserving credit, of course) and merging with said feature.
I expect more work will remain when the dust settles and your contribution will be valuable.
@Ashod I think I just merged most of this in a patch from you right (?) probably we can close this one for now. https://github.com/CollaboraOnline/online/commit/78f5b706959f800d8d55fd536b80888a5b14c498 and related - and has some nice unit tests: good =)
Thanks!