
What exactly is being downloaded?

Open · nozpheratu opened this issue · 13 comments

New LaTeX user here; I have a simple .tex file:

\documentclass{minimal}
\begin{document}
Look, some text!
\end{document}

I started compiling this document over half an hour ago, and Tectonic still seems to be downloading a TON of dependencies. Is there any way I can tell it to only use the bare minimum?

nozpheratu commented Aug 25 '19 05:08

One answer is that in LaTeX, "the bare minimum" is still a surprisingly large number of files. The system is designed so that it should only fetch what it can get away with, but it's also true that the engine needs to pull in every dependency when creating the "format file", and that is a larger set of files than what's needed to compile a document once the format file exists.

But anyway, the initial download should only take a few minutes, not half an hour. Did you do this run on a slow network?

One possible thing for us to do: document how much data needs to be downloaded to compile a bare-minimum document, so people can get a sense of how long their first run might take.

pkgw commented Aug 25 '19 19:08

@pkgw Does format file creation always require the same files? One could provide, say, a zip file with all of these files that tectonic could download (or ship with) at once. My feeling is that the initial download takes this long not because of the amount of data, but because of the number of HTTP requests.

Another simple option is to just provide a "progress bar" if the total number of files is known.
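
For illustration, a sketch of the progress-bar idea using the indicatif crate (an assumption; I don't know that tectonic depends on it). If the bundle index gives the total file count up front, the download loop can report how far along it is:

use indicatif::ProgressBar;

// Minimal sketch: show progress over a known list of files to fetch.
fn download_with_progress(files: &[String]) {
    let pb = ProgressBar::new(files.len() as u64);
    for _file in files {
        // ... fetch the file here ...
        pb.inc(1);
    }
    pb.finish();
}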

rekka commented Aug 26 '19 13:08

@rekka Yeah, it's a very constant set of files, and if we had a "multi-tool" type of CLI (like git) I'd definitely like the idea of providing an option to bootstrap the cache:

tectonic bootstrap-cache ~/path/to/files.zip
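
A minimal sketch of what such a subcommand might do, assuming the archive simply mirrors the cache's on-disk layout (an assumption, not tectonic's actual format; the function name and zip-crate usage are illustrative):

use std::fs::File;
use std::path::Path;

// Unpack a pre-built archive of the format-file dependencies straight into
// the per-user cache, replacing one HTTP round trip per file with a single
// download. Uses the zip crate's extract() to unpack everything in one pass.
fn bootstrap_cache(archive: &Path, cache_dir: &Path) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(archive)?;
    let mut zip = zip::ZipArchive::new(file)?;
    zip.extract(cache_dir)?;
    println!("unpacked {} files into {}", zip.len(), cache_dir.display());
    Ok(())
}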

I think of HTTP requests as not having a very big overhead, but we're opening a new TCP connection for each request (I think!), and a lot of our files are small, so maybe that's not a good approximation here.

pkgw commented Aug 26 '19 13:08

@pkgw Yep, something like that would be nice.

In my case the overhead of the HTTP connection is somewhere around 1.5-2s per file, so it is quite substantial (I am in Japan, which might add a few hundred ms of latency).

rekka commented Aug 26 '19 13:08

I was also curious about this, because after getting a report from a user of a ~20 minute download, I zapped my tectonic cache and redownloaded everything, and it was slow here as well. Before that, I had never noticed the cache being populated. I thought that maybe something had changed (did adding hyphenation files bring in new requirements?), or that perhaps the server was just slow that day.

It wouldn't surprise me, though, if I was simply away from the computer and never noticed the initial population of the cache.

ratmice commented Aug 26 '19 14:08

I also have issues with very long download times when bootstrapping. From a cursory glance, I saw that tectonic is using reqwest, but it appears to be creating a reqwest::Client for every connection. reqwest::Client holds a connection pool internally and should be reused as much as possible; reusing connections might speed up downloading many files quite a bit.
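
To illustrate, a sketch of the reuse pattern (not the actual tectonic code, and written against reqwest's modern blocking API; the version in tree at the time had a different module layout): build one Client up front and route every request through it, so its pool can keep the connection to the bundle server alive:

use reqwest::blocking::Client;

// One Client, and therefore one connection pool, shared across every
// download instead of being rebuilt per request.
fn fetch_all(urls: &[String]) -> reqwest::Result<Vec<Vec<u8>>> {
    let client = Client::new();
    urls.iter()
        .map(|url| {
            // With keep-alive, this reuses a pooled connection rather than
            // paying a fresh TCP/TLS handshake for every small file.
            let body = client.get(url.as_str()).send()?.bytes()?;
            Ok(body.to_vec())
        })
        .collect()
}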

rasky commented Oct 06 '19 22:10

@rasky I don't see that; there are 3 Client::new() calls, corresponding to these 3 phases:

note: downloading index
note: downloading SHA256SUM
note: downloading tectonic-format-latex.tex
.... etc

During the last phase, after the initial "downloading tectonic-format-latex.tex", all requests share a single reqwest::Client, but it's this phase that seems to take the majority of the time.

ratmice commented Oct 15 '19 19:10

I would like to put in another endorsement for reducing the number of HTTP requests as much as possible. The overhead is huge in many cases! I have an ethernet connection that can download at 100MB/s, but because every file is fetched separately, it takes 0.75s/file -- and these are tiny files, so the effective download rate is ~50KB/s, or in other words over a thousand times slower due to making many connections. If requests absolutely must be made separately, we could still get a great performance improvement by doing them in parallel.
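
A sketch of the parallel idea under the same assumptions as above (illustrative, not the actual code): if each request costs ~0.75s of mostly idle latency, issuing N requests at once cuts the wall-clock time by roughly a factor of N. A thread per file is crude -- a bounded pool or async would be better -- but it shows the effect:

use std::sync::Arc;
use std::thread;
use reqwest::blocking::Client;

// Fetch every URL concurrently while still sharing one connection pool.
fn fetch_parallel(urls: Vec<String>) -> Vec<reqwest::Result<Vec<u8>>> {
    let client = Arc::new(Client::new());
    let handles: Vec<_> = urls
        .into_iter()
        .map(|url| {
            let client = Arc::clone(&client);
            thread::spawn(move || -> reqwest::Result<Vec<u8>> {
                Ok(client.get(url.as_str()).send()?.bytes()?.to_vec())
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}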

raxod502 commented Nov 16 '19 22:11

Perhaps Peter can clarify, but

As far as I can tell, the problem is that downloading is interleaved with parsing: the engine stops parsing, downloads the file, and resumes parsing once the data arrives. The data it receives can then trigger further downloads, and so on.

I think a proper fix might need more data on the server, e.g. parse through all the files and create a graph of each file's dependencies, so that the client can download a table of contents and then fetch batches of files and their dependencies in parallel.

But I don't think the client by itself has enough information to keep parsing far enough ahead to encounter the next file and start downloading it in parallel, without some help.
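
For concreteness, a hypothetical version of that table of contents could map each file to the files it's known to pull in; the client would then take the transitive closure before downloading anything. This is purely a sketch of the idea (no such manifest exists today, and as noted below the real dependencies aren't this static):

use std::collections::{HashMap, HashSet, VecDeque};

// Breadth-first transitive closure over a hypothetical dependency manifest:
// given the files a document is known to need (roots), collect everything
// they transitively pull in, so the whole batch can be fetched in parallel.
fn closure(manifest: &HashMap<String, Vec<String>>, roots: &[String]) -> HashSet<String> {
    let mut seen: HashSet<String> = roots.iter().cloned().collect();
    let mut queue: VecDeque<String> = roots.iter().cloned().collect();
    while let Some(file) = queue.pop_front() {
        for dep in manifest.get(&file).into_iter().flatten() {
            if seen.insert(dep.clone()) {
                queue.push_back(dep.clone());
            }
        }
    }
    seen
}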

ratmice commented Nov 16 '19 22:11

@ratmice That's definitely a factor, but should be surmountable. Tectonic could keep a connection to the server open in the background and try to reuse it as much as possible. Servers might aggressively drop idle connections, but the worst case is that you just reconnect as we often do now.

You could map out dependencies of various files to a certain extent, but those dependencies might vary depending on the engine state and characteristics of the input document, so they're not well-defined.

Just to be clear, I have never really sat down and understood how HTTP connection reuse is or isn't actually working out in Tectonic in practice, so it is quite possible that there are some easy wins to be had! I would love to push on this topic but haven't been able to sit down and have a good coding session for a long time.

pkgw commented Nov 18 '19 14:11

@pkgw I've looked at it a little bit, and afaict connection reuse should be working for the vast majority of connections.

The connection pool should be set up in cached_itarbundle.rs in HttpRangeReader::new(). There is a constant number of initial connections, but the majority of connections under get_everything reuse the HttpRangeReader, and thus should benefit from reqwest's internal connection pooling.

With the version of reqwest we are using, the timeout seems to default to 30s, which seems reasonable. I haven't been able to sit down and look at reqwest's internals, however, to see if the pooling mechanism is actually working.

From the tectonic side I didn't see anything egregious, but I will try to look into reqwest's internal pooling when I can; perhaps it is closing the connection when the number of outstanding requests reaches 0, which would happen after every request.
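
For reference, the shape being described is roughly this (a simplified sketch of the pattern, not the actual cached_itarbundle.rs code): one long-lived Client inside the reader, so every byte-range request can draw on the same pool:

use reqwest::blocking::Client;
use reqwest::header::RANGE;

// The Client (and its connection pool) lives as long as the reader, so
// consecutive Range requests against the bundle can reuse one connection.
struct RangeReader {
    client: Client,
    url: String,
}

impl RangeReader {
    fn new(url: String) -> Self {
        RangeReader { client: Client::new(), url }
    }

    fn read_range(&self, start: u64, len: u64) -> reqwest::Result<Vec<u8>> {
        let resp = self
            .client
            .get(self.url.as_str())
            .header(RANGE, format!("bytes={}-{}", start, start + len - 1))
            .send()?;
        Ok(resp.bytes()?.to_vec())
    }
}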

ratmice commented Nov 18 '19 17:11

This was unexpected to me: I have a TeX Live installation, but suddenly today when I executed tectonic it started to download a bunch of sty files. Before auto-downloading anything, I'd really recommend prompting the user to ask whether they want these files downloaded. Currently I'm trying to find out where these files have actually been downloaded, as I don't need them.

oblitum commented Jul 14 '21 17:07

Now tectonic has simply started failing to download whatever it tries to download, due to a DNS error. Please, is there any way to force it to stop trying to do magic and to see that I have TeX Live already installed? Otherwise the tool has just become a brick; it can't do anything useful...

oblitum commented Nov 23 '21 13:11