navigation-timing workerStart and redirects

workerStart and redirects

Open nicjansma opened this issue 3 years ago • 28 comments

This addresses a few issues around workerStart especially in the case of redirects:

- workerStart needed to be added to the diagram https://github.com/w3c/navigation-timing/issues/128 (see screenshot below)
  - I've added a new Worker Startup phase that happens prior to Redirect
  - I've added a workerStart timestamp before the start of the Worker Startup phase
  - For clarity, I've added a Cross Origin Workers & Redirects section prior to the new Same Origin section
- workerStart definition was cleaned up. Notably, it's:
  - 0 for no SW (same as before)
  - if redirects, the startup of the first request in the final same-origin redirect chain (new to address https://github.com/w3c/navigation-timing/issues/99 and https://github.com/w3c/navigation-timing/issues/100)
  - if it's already available (navigating between two docs on same origin), the time before fetch (same as before)
  - otherwise, SW startup time (same as before)
- In the processing model, added a new worker-start-step that is split out from the step that was setting unloadEventEnd
  - workerStart was zero-ed out in the case of same-origin redirects -- it should be the startup of the worker from the first request in the final same-origin redirect chain https://github.com/w3c/navigation-timing/issues/128#issuecomment-674441549
    - Same-Origin redirects no longer zero out workerStart, and will still jump back to fetch-start-step so it doesn't overwrite the worker startup time for the same origin
  - workerStart was the value of the cross-origin fetch during cross-origin redirects -- it should be 0 (or updated by later same-origin SWs) https://github.com/w3c/navigation-timing/issues/128#issuecomment-674441549
    - The processing model keeps track of last origin workerStart was set for, and will reset workerStart if the origin ever changes.
- The Same-origin check was missing a case where a same-origin no-redirect navigation was returning "fail", so added a new step to check for no redirects and return "pass".
- Adds workerStart to the list of things that NavTiming2 has over NavTiming1
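
As a rough illustration of the redirect handling described above (not spec text), the rule can be simulated over a list of hops. The `ChainHop` shape and the `resolveWorkerStart` name are made up for this sketch, and it assumes that a hop with no service worker simply contributes 0, so the first non-zero value in the final same-origin run is the one kept.

```typescript
// Illustrative only: one entry per request in the navigation's redirect chain.
interface ChainHop {
  origin: string;          // origin of this request's URL
  workerStartTime: number; // time the SW was started or handed the fetch for this hop, 0 if none
}

// Returns the workerStart that would be reported for the final document
// under the rule sketched in this PR description.
function resolveWorkerStart(hops: ChainHop[]): number {
  let workerStart = 0;
  let lastOrigin: string | null = null;

  for (const hop of hops) {
    // Reset whenever the origin changes, so a value recorded for a
    // different origin is never carried forward.
    if (lastOrigin !== null && hop.origin !== lastOrigin) {
      workerStart = 0;
    }
    lastOrigin = hop.origin;

    // Within one same-origin run, keep the first recorded value;
    // later same-origin redirects do not overwrite it.
    if (workerStart === 0 && hop.workerStartTime > 0) {
      workerStart = hop.workerStartTime;
    }
  }
  return workerStart;
}
```

With two same-origin hops that record 10 and 50, this returns 10. With a hop on one origin followed by two hops on another that record 30 and 60, it returns 30.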

Current diagram:

I will also be reviewing the current WPTs to lock-in this behavior, assuming we all agree to the above changes.

Preview | Diff

Sep 15 '20 17:09 nicjansma

I also need to review how workerStart is defined in https://www.w3.org/TR/resource-timing-2/ to make sure it's compatible (or update that too).

Sep 15 '20 18:09 nicjansma

Today, there can be delays between fetchStart and workerStart in implementations. Does workerStart need to be clearly defined as possibly after fetchStart in the diagram and spec or is the intent to change that?

Sep 22 '20 18:09 toddreifsteck

Oh hi Todd :) I think the diagram is supposed to imply that, but it should probably be made clearer by moving workerStart up to be next to fetchStart

Sep 22 '20 21:09 npm1

@nicjansma did you accidentally remove the PR preview link from the first comment? I was hoping to look at the new diagram since I just realized I commented on the old one.

Sep 22 '20 21:09 npm1

My point is that I think we DO want a space in between fetchStart and workerStart but the diagram doesn't show that. When I see a delay between fetchStart and workerStart, I've historically recommended that the site should consider not blocking load on the serviceworker if the design can allow for it so having the gap is valuable.

I haven't considered what to name that bucket of time but the goal is to allow all time to be measured and broken down so a web page author can either communicate with the browser vendor OR improve their own site to improve E2E timing.

Sep 22 '20 21:09 toddreifsteck

@npm1 Yeah, I had to remove the Preview link as it was stopping me from making edits to the comment for some reason. I've added it back in.

(I didn't realize what those links did initially, that's very useful!)

@toddreifsteck I think workerStart is always before fetchStart, according to the current model and UA behavior, correct? We define fetchStart-workerStart as the "Service Worker Startup Time" in mPulse for example.

Agreed that we can help make that more clear. Right now I've put workerStart on the bottom, pointing at roughly the same time as fetchStart. I could instead move it to the "top row", but before fetchStart.

Alternatively, I could add a new "phase" called "Service Worker Startup Time" (or just "Worker" or whatever), to make it even more clear.
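
For reference, the fetchStart-workerStart measurement mentioned above could be computed roughly like this. It is a minimal sketch: `WorkerTimingFields` and `serviceWorkerStartupTime` are names invented here, and it assumes workerStart is reported at or before fetchStart, returning null when that does not hold or when no worker was involved.

```typescript
// Only the two fields needed for the calculation; a real
// PerformanceNavigationTiming entry has many more.
interface WorkerTimingFields {
  workerStart: number;
  fetchStart: number;
}

// Returns the gap between workerStart and fetchStart in milliseconds,
// or null when the value cannot be interpreted as a startup time.
function serviceWorkerStartupTime(entry: WorkerTimingFields): number | null {
  // workerStart is 0 when no service worker handled the navigation.
  if (entry.workerStart === 0) {
    return null;
  }
  // If workerStart is after fetchStart, the ordering assumed here does not apply.
  if (entry.workerStart > entry.fetchStart) {
    return null;
  }
  return entry.fetchStart - entry.workerStart;
}
```

In a page, the argument could be the first entry from `performance.getEntriesByType("navigation")`.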

Sep 22 '20 23:09 nicjansma

Discussions from the W3C call on 9/24:

- I'll update the SVG graphic to include a new "Worker" phase right before "AppCache" to make it clear
- We realized if there are any same-origin-redirects, workerStart becomes useless for calculating "Worker Startup Time" because the time of redirects could not be excluded.

For example:

1. starting on a.com, navigate to a link on b.com/1 (which has a worker)
2. b.com starts up a worker (=workerStart per this new proposed model)
3. b.com worker fetches b.com/1
4. b.com/1 responds with a redirect to b.com/2
5. b.com worker fetches b.com/2 (=fetchStart)
6. navigation ends on b.com/2

In this case, redirectStart/End/Count are 0 because of same-origin policy (started on a different origin).

workerStart is at the beginning of the same-origin redirect chain (step 2), but we don't necessarily know it's a redirect (because SOP). However, fetchStart-workerStart is not just the worker startup time -- it includes the redirect time as well (and we can't know that it is a redirect).

Instead, if we kept the current model where workerStart is just the startup time of the final document fetch (between step 4/5 above), then workerStart isn't really the "startup" time since it's already running -- it's just the time the worker takes right before it fetches b.com/2 in step 5.

I think it's important to try to measure "Worker Startup Time" for sites that have Service Workers deployed, and this time can often be seen to be a measurable amount (>50ms in some cases). If so, should we add a new timestamp at the end of SW startup? Maybe workerStartComplete(?) because the worker isn't ending, just the startup is completing?

Then workerStart and workerStartComplete can be the worker startup time (workerStartComplete-workerStart). For redirects, it would be the last time it started up (step 2 above).
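
To put numbers on the example above, here is a small sketch. The timestamps in the usage note are invented for illustration, and `workerStartComplete` is only the hypothetical attribute floated in this comment, not something any browser exposes.

```typescript
// Timestamps for the b.com example, in milliseconds.
interface RedirectExampleTimeline {
  workerStart: number;         // step 2: worker begins starting up
  workerStartComplete: number; // hypothetical: worker finished starting up
  fetchStart: number;          // step 5: final fetch of b.com/2
}

interface StartupEstimates {
  viaFetchStart: number;    // fetchStart - workerStart (includes the redirect)
  viaStartComplete: number; // workerStartComplete - workerStart (startup only)
  unattributed: number;     // time between startup finishing and the final fetch
}

function startupEstimates(t: RedirectExampleTimeline): StartupEstimates {
  return {
    viaFetchStart: t.fetchStart - t.workerStart,
    viaStartComplete: t.workerStartComplete - t.workerStart,
    unattributed: t.fetchStart - t.workerStartComplete,
  };
}
```

With workerStart at 100, workerStartComplete at 160 and fetchStart at 400, the fetchStart-based estimate is 300ms while the actual startup is 60ms; the remaining 240ms is the b.com/1 request and redirect, which cannot be told apart without the extra timestamp.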

Thoughts?

Sep 29 '20 14:09 nicjansma

Yea, this makes sense, I think this is what they're calling workerReady in https://github.com/w3c/resource-timing/issues/119. Are you thinking of adding a new parameter in this PR though? Or just updating the image and the normative text so that it's more aligned with what it's supposed to be, and then later have a separate PR for the new attribute?

Sep 29 '20 14:09 npm1

Ahh great, thanks for pointing me to that one.

I'll add workerStart to the diagram here, then follow up with a separate proposal (and again updated image) for workerReady.

Sep 29 '20 16:09 nicjansma

Let us know when you'd like another review on this

Oct 06 '20 20:10 npm1

@npm1 OK I think everything's all set now. I've updated the original description with the current changes too.

After this is merged, I can take a stab at workerReady e.g. https://github.com/w3c/resource-timing/issues/119

Oct 07 '20 16:10 nicjansma

FYI @makotoshimazu and @mfalken.

Oct 08 '20 15:10 wanderview

@nicjansma I think workerStart can't be before fetchStart on the very first navigation to a web page, can it?

I'm also unsure if it is before or after fetchStart when navigation preload is used. https://w3c.github.io/ServiceWorker/#service-worker-registration-navigationpreload

For a page with a service worker installed, it does seem that serviceWorkerStart could be the first event but I always believed the intent of fetchStart was to mark when the UA determined the fetch algorithm should be processed on a URL and to clearly show when unload is complete and the next phase is starting.

If discussions have shown that isn't how the metrics are implemented or are used, please accurately specify them and don't block on my gut belief. 👍 :)

Oct 09 '20 19:10 toddreifsteck

Apologies for missing this. I see https://github.com/w3c/resource-timing/issues/119 has been linked to which explains some of the issues around here. As that issue notes, fetchStart-workerStart to measure startup time is currently broken as specified: fetchStart is always before workerStart.

I'm wondering about the decision to use the first request of the last same-origin redirect chain for workerStart. What if that request was not in the scope of a service worker, whereas a later request was? In that case workerStart would be 0?

I wonder if it's more consistent to just always use the final request for workerStart.

Oct 13 '20 13:10 mfalken

@toddreifsteck:

@nicjansma I think workerStart can't be before fetchStart on the very first navigation to a web page, can it?

For the scenario where a visitor has never been to a site before, and thus the browser does not have an active Service Worker registration? In that case, workerStart would be 0 via this text:

If the current document has no active service worker registration [SERVICE-WORKERS], this attribute MUST return zero.

Or are you talking about another scenario?

@toddreifsteck:

I'm also unsure if it is before or after fetchStart when navigation preload is used. https://w3c.github.io/ServiceWorker/#service-worker-registration-navigationpreload

Great question, nothing in this spec deals with Navigation Preload. Do you want to file a separate issue to track that?

@toddreifsteck:

For a page with a service worker installed, it does seem that serviceWorkerStart could be the first event but I always believed the intent of fetchStart was to mark when the UA determined the fetch algorithm should be processed on a URL and to clearly show when unload is complete and the next phase is starting. If discussions have shown that isn't how the metrics are implemented or are used, please accurately specify them and don't block on my gut belief

Yeah I think before we acknowledged the existence of Service Workers in this spec (and in RT), fetchStart was the best "starting place" for the current document's fetch timings. i.e. it's after all of the previous page's unloading and any redirects. With workerStart, we realized that there may be some "bootup" time in the SW before the actual fetch is dispatched, so it was placed before fetchStart in those cases. In practice all current browsers (Chrome, FF) that support this show workerStart before fetchStart

Nov 19 '20 17:11 nicjansma

@mfalken:

Apologies for missing this. I see w3c/resource-timing#119 has been linked to which explains some of the issues around here. As that issue notes, fetchStart-workerStart to measure startup time is currently broken as specified: fetchStart is always before workerStart.

Thanks for sharing that. If I understand it correctly:

- RT's current spec has workerStart after fetchStart
- NT's current spec has workerStart before fetchStart
- This PR keeps the NT spec with workerStart before fetchStart
- Chrome in ~2018 may have had workerStart after fetchStart, but that was a bug and as of right now, Chrome and Firefox both have workerStart before fetchStart
- In that thread and this thread, we think workerStart should be before fetchStart

@mfalken:

I'm wondering about the decision to use the first request of the last same-origin redirect chain for workerStart.

The intent was to be able to capture the "worker bootup" or "worker startup" time for a domain, i.e. before a domain handles its first request. We were hoping the first request in the chain got us that. If we were to (re)set workerStart to be the last request in the chain, then there should be little/no reported worker startup time because it had already started up for the first request.

@mfalken:

What if that request was not in the scope of a service worker, whereas a later request was? In that case workerStart would be 0?

That's a good question, we don't discuss scope at all. My assumption is the SW is not "boot"ed if the scope doesn't match in that first request, right? So in that scenario the SW would "boot" for the second+ request, and that bootup time would only be reflected in the redirectEnd-redirectStart duration but not as a separate timestamp.

Do you think we should clarify the processing model to be something like startup of the first request **that is in scope of a service worker** in the final same-origin redirect chain or something? Starts getting more complicated...

@mfalken:

I wonder if it's more consistent to just always use the final request for workerStart.

But I think in that case the cost of the SW bootup time will always be "0"ish any time there's a redirect.

And regardless, if we really want to be able to measure SW bootup time if there are redirects, we need a workerReady timestamp as that thread proposes, or the redirects will be part of fetchStart-workerStart.

Above all else I think we all want to try to make the NT and RT definition and processing model align everywhere that's possible. So if we want these (and/or more) changes in NT we should also have agreement they belong in RT as well.

Nov 19 '20 18:11 nicjansma

I think the new definition of workerStart will allow sites to measure the overhead of Service Worker startup on the main page as the gap between fetchStart-workerStart and they can measure the total time as responseEnd-workerStart if a SW is involved so this seems to solve that problem at a high level.

I don't know how many sites use Navigation Preload but the spec should handle it cleanly. Please open an issue if you believe it is worth tracking. I'm not active in spec work in my new role.

Nov 20 '20 03:11 toddreifsteck

Sorry for the delay, I took some days off.

@mfalken:

Apologies for missing this. I see w3c/resource-timing#119 has been linked to which explains some of the issues around here. As that issue notes, fetchStart-workerStart to measure startup time is currently broken as specified: fetchStart is always before workerStart.

Thanks for sharing that. If I understand it correctly:

RT's current spec has workerStart after fetchStart

NT's current spec has workerStart before fetchStart

I may be missing something, but the two specs seem to have workerStart after fetchStart. NT says workerStart is when the service worker was started up, or when the fetch event was dispatched. And it says fetchStart is the entry point to the Fetch spec ("immediately before a user agent starts the fetching process" is the clause that applies for service worker interception, I think). Worker startup and event dispatch happens in the course of the Fetch spec, so workerStart would be after fetchStart.

This PR keeps the NT spec with workerStart before fetchStart

Chrome in ~2018 may have had workerStart after fetchStart, but that was a bug and as of right now, Chrome and Firefox both have workerStart before fetchStart

In that thread and this thread, we think workerStart should be before fetchStart

@mfalken:

I'm wondering about the decision to use the first request of the last same-origin redirect chain for workerStart.

The intent was to be able to capture the "worker bootup" or "worker startup" time for a domain, i.e. before a domain handles its first request. We were hoping the first request in the chain got us that. If we were to (re)set workerStart to be the last request in the chain, then there should be little/no reported worker startup time because it had already started up for the first request.

@mfalken:

What if that request was not in the scope of a service worker, whereas a later request was? In that case workerStart would be 0?

That's a good question, we don't discuss scope at all. My assumption is the SW is not "boot"ed if the scope doesn't match in that first request, right? So in that scenario the SW would "boot" for the second+ request, and that bootup time would only be reflected in the redirectEnd-redirectStart duration but not as a separate timestamp.

Do you think we should clarify the processing model to be something like startup of the first request **that is in scope of a service worker** in the final same-origin redirect chain or something? Starts getting more complicated...

This makes sense. I think the "in scope of a service worker" would be a worthwhile clarification... or rather it should be something like "the first request that is in the same scope of the FINAL in-scope request that is same-origin to the final request in the redirect chain". Suppose there are two scopes: a.test/scope1 and a.test/scope2, and the redirect chain is a.test/scope1/page1 -> a.test/scope2/page2 -> a.test/scope2/page3. This would boot up a SW at scope1 and then another one at scope2. I think we want to capture the scope2 SW startup time. But this can be follow-up.
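
A rough sketch of that selection rule, using the a.test example. The function names are invented, scopes are assumed to be absolute URLs (so sharing a scope implies sharing an origin), and the sketch simplifies by treating the final request itself as the final in-scope request and by only looking at the contiguous run of requests at the end of the chain.

```typescript
// Longest-prefix match of a URL against registered scope URLs,
// in the spirit of service worker registration matching.
function matchScope(url: string, scopes: string[]): string | null {
  let best: string | null = null;
  for (const scope of scopes) {
    if (url.indexOf(scope) === 0 && (best === null || scope.length > best.length)) {
      best = scope;
    }
  }
  return best;
}

// Returns the URL of the request whose worker startup would be reported:
// the first request in the trailing run that shares the final request's scope.
function requestForWorkerStart(chainUrls: string[], scopes: string[]): string | null {
  if (chainUrls.length === 0) {
    return null;
  }
  const finalScope = matchScope(chainUrls[chainUrls.length - 1], scopes);
  if (finalScope === null) {
    return null; // final request is not in scope of any worker
  }
  // Walk backwards while the previous hop is still in the same scope.
  let index = chainUrls.length - 1;
  while (index > 0 && matchScope(chainUrls[index - 1], scopes) === finalScope) {
    index--;
  }
  return chainUrls[index];
}
```

For the chain scope1/page1 -> scope2/page2 -> scope2/page3 this picks scope2/page2, which is where the scope2 worker would have been started.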

@mfalken:

I wonder if it's more consistent to just always use the final request for workerStart.

But I think in that case the cost of the SW bootup time will always be "0"ish any time there's a redirect.

And regardless, if we really want to be able to measure SW bootup time if there are redirects, we need a workerReady timestamp as that thread proposes, or the redirects will be part of fetchStart-workerStart.

Above all else I think we all want to try to make the NT and RT definition and processing model align everywhere that's possible. So if we want these (and/or more) changes in NT we should also have agreement they belong in RT as well.

Agreed that workerReady seems to be what we're missing, and generally aligning the processing models with Fetch + Service Worker is what we want. This was discussed at the Service Worker WG briefly at https://docs.google.com/document/d/1ybS1q2HCPh3bNNOkjGpAPFug19A2BsIxYEi-i6lrB1w/edit#heading=h.k78cttk5esfw with the rough outcome that integrating the Timing Specs with Fetch is something that will need more work.

Dec 02 '20 02:12 mfalken

@mfalken:

I may be missing something, but the two specs seem to have workerStart after fetchStart. NT says workerStart is when the service worker was started up, or when the fetch event was dispatched. And it says fetchStart is the entry point to the Fetch spec ("immediately before a user agent starts the fetching process" is the clause that applies for service worker interception, I think). Worker startup and event dispatch happens in the course of the Fetch spec, so workerStart would be after fetchStart.

Ah, and I'm not as familiar with the Fetch spec steps, so I had originally read this differently (that when workerStart is just "fetch event dispatched", that's the same as fetchStart, and fetchStart is more like step "D" here).

I think part of the discrepancy is the description of workerStart in the NT spec differs from the processing model. Here's the description:

The workerStart attribute MUST return the time immediately before the user agent ran the worker (if the current document has an active service worker registration [SERVICE-WORKERS]) required to service the request, or if the worker was already available, the time immediately before the user agent fired an event named fetch at the active worker. Otherwise, if there is no active worker this attribute MUST return zero.

And I agree per your reasoning if [workerStartup=[ran the worker] or [fired fetch event]], both of which happen in Fetch spec, and fetchStart is the entry point of Fetch spec, then workerStart would be after fetchStart.

However the processing model goes in a different "timestamp order":

Immediately after the unload event is completed, record the current time as unloadEventEnd. If the navigation URL has an active worker registration, immediately before the user agent runs the worker record the time as workerStart, or if the worker is available, record the time before the event named fetch is fired at the active worker. Otherwise, if the navigation URL has no matching service worker registration, set workerStart value to zero.

[fetch-start-step] If the new resource is to be fetched using a "GET" request method, immediately before a user agent checks with the relevant application caches, record the current time as fetchStart. Otherwise, immediately before a user agent starts the fetching process, record the current time as fetchStart.

From this processing model, in order, it seems workerStart would always be before fetchStart.

Stepping back, I think part of the confusion is workerStart was gradually added to the NT/RT specs, then over time both specs were adapted more to be consistent with the Fetch spec. And maybe we're not referencing the exact correct parts of the Fetch spec?

So maybe what I'm arguing here is that fetchStart shouldn't be the entry point of Fetch spec, but rather step "D"?

In practice today, Chrome seems to consistently set workerStart to be before fetchStart. (Firefox/Safari don't seem to implement workerStart yet).

@mfalken:

This makes sense. I think the "in scope of a service worker" would be a worthwhile clarification... or rather it should be something like "the first request that is in the same scope of the FINAL in-scope request that is same-origin to the final request in the redirect chain". Suppose there are two scopes: a.test/scope1 and a.test/scope2, and the redirect chain is a.test/scope1/page1 -> a.test/scope2/page2 -> a.test/scope2/page3. This would boot up a SW at scope1 and then another one at scope2. I think we want to capture the scope2 SW startup time. But this can be follow-up.

👍

@mfalken:

Agreed that workerReady seems to be what we're missing, and generally aligning the processing models with Fetch + Service Worker is what we want. This was discussed at the Service Worker WG briefly at https://docs.google.com/document/d/1ybS1q2HCPh3bNNOkjGpAPFug19A2BsIxYEi-i6lrB1w/edit#heading=h.k78cttk5esfw with the rough outcome that integrating the Timing Specs with Fetch is something that will need more work.

Awesome, let's work towards that!

Mar 04 '21 02:03 nicjansma

Tried to summarize where we're at for the WebPerf WG https://docs.google.com/presentation/d/1r3FwT1UTo7lpjZvYe-YV7cNAee8co-qCxIU5SdERalQ/edit

Mar 04 '21 21:03 nicjansma

Following the work I was doing on RT/Fetch integration, I want to make a concrete proposal for discussion here about how to handle redirects (beyond the diagram).

First of all, I think this should be in ResourceTiming and not in NavigationTiming, as workerStart is relevant for NT only because NT is an augmentation of RT (RT is more connected with fetching, NT more with document life-cycle).

The problem with workerStart and redirects is not unique to workerStart - the same problem exists for the other HTTP-related metrics in RT: domainLookupStart, domainLookupEnd, connectStart, secureConnectionStart, connectEnd, requestStart, responseStart, nextHopProtocol.

The problem is that in the case of redirects, any of these metrics could have several values, and due to a mixture of caching/workers/http, the "last" one might be ambiguous - for example, the last workerStart might be before the last connectStart if one of the workers was a redirect and the last request was an HTTP connection.

I propose doing the following:

- The following metrics: domainLookupStart, domainLookupEnd, connectStart, secureConnectionStart, connectEnd, requestStart, responseStart, nextHopProtocol, workerStart, of the ResourceTiming/NavigationTiming entry would be the ones relevant for fetching the final resource, ignoring redirects. They would be matching the fetchStart metrics.
- redirectStart, redirectEnd, fetchStart and responseEnd will stay as is.
- Following that, consider including a "redirects" array in the RT entry, which is an array of RT entries with the redirect URL as the name of the entry and its own set of connection/worker metrics. This array would be empty if TAO fails.
- For worker-served responses, workerStart should be the time before the request was handed to the worker, and responseStart should be the time when the worker returned a non-null response to fetch. domainLookupStart, domainLookupEnd, connectStart, secureConnectionStart, connectEnd, requestStart, responseStart, nextHopProtocol would be zero/empty.
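
To make the "redirects" array idea more concrete, here is one possible shape. It is purely a sketch: `HopMetrics`, `EntryWithRedirects` and `buildEntry` are invented names, only a few of the metrics are included for brevity, and nothing like this exists in RT today.

```typescript
// Per-request metrics; `name` is the URL of that request.
interface HopMetrics {
  name: string;
  workerStart: number;
  connectStart: number;
  requestStart: number;
  responseStart: number;
}

// Top-level fields describe the final request only;
// earlier hops live in `redirects`.
interface EntryWithRedirects extends HopMetrics {
  redirects: HopMetrics[];
}

// `hops` is ordered from first request to final request.
function buildEntry(hops: HopMetrics[], taoPasses: boolean): EntryWithRedirects | null {
  if (hops.length === 0) {
    return null;
  }
  const finalHop = hops[hops.length - 1];
  const earlierHops = hops.slice(0, hops.length - 1);
  return {
    name: finalHop.name,
    workerStart: finalHop.workerStart,
    connectStart: finalHop.connectStart,
    requestStart: finalHop.requestStart,
    responseStart: finalHop.responseStart,
    // Redirect details are only exposed when the TAO check passes.
    redirects: taoPasses ? earlierHops : [],
  };
}
```

The design choice here is that the top-level metrics never mix values from different hops, which avoids the "which one is last" ambiguity described above.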

Mar 10 '21 07:03 noamr

We had a further discussion on this as well on March 18th 2021 in the WebPerfWG call, with ServiceWorker folks:

https://w3c.github.io/web-performance/meetings/2021/2021-03-18/index.html

I will address that and @noamr's feedback in this PR soon, and probably will need to just wait until https://github.com/w3c/navigation-timing/pull/141 goes in for simplicity.

Apr 01 '21 13:04 nicjansma

@noamr:

Thanks for putting your suggestions together! Overall I agree on the simplification.

For worker-served responses, workerStart should be the time before the request was handed to the worker, and responseStart should be the time when the worker returned a non-null response to fetch. domainLookupStart, domainLookupEnd, connectStart, secureConnectionStart, connectEnd, requestStart, responseStart, nextHopProtocol would be zero/empty.

I think this would cause some "reduced insight" into resources vs. today, as you would lose details of DNS/TCP/req/res phases for all resources if a SW is active, right?

If the worker was just operating as a "pass through" for a resource, it seems like we should still get those breakdown in timings (assuming origin check passes).

Apr 01 '21 13:04 nicjansma

@noamr:

Thanks for putting your suggestions together! Overall I agree on the simplification.

For worker-served responses, workerStart should be the time before the request was handed to the worker, and responseStart should be the time when the worker returned a non-null response to fetch. domainLookupStart, domainLookupEnd, connectStart, secureConnectionStart, connectEnd, requestStart, responseStart, nextHopProtocol would be zero/empty.

I think this would cause some "reduced insight" into resources vs. today, as you would lose details of DNS/TCP/req/res phases for all resources if a SW is active, right?

If the worker was just operating as a "pass through" for a resource, it seems like we should still get those breakdown in timings (assuming origin check passes).

Yes, though we should make it clearer in FETCH. I created a new issue for that: https://github.com/whatwg/fetch/issues/1208

Apr 01 '21 14:04 noamr

Where are we on this PR? What's the next step?

Feb 07 '22 09:02 yoavweiss

Where are we on this PR? What's the next step?

Based on the conversations we had at WG, I think this PR covers the issue.

Redirect timing et al. are all part of a fetch rather than a response. So if a response is shared across fetches (e.g. a passthrough in a service worker, an in-flight sharing of responses, retrieving from cache) - the timing is separate - including the connection timing. The only thing that "sticks" with the response is the encoded/decoded body size.

Mar 20 '22 14:03 noamr

I think this can be closed now. @nicjansma ?

Jun 18 '22 09:06 noamr

@nicjansma - friendly ping :)

Jul 22 '22 14:07 yoavweiss
