
Plan of attack for the memory issues

Open · alloy opened this issue 7 years ago · 2 comments

  • [ ] @izakp is going to set up a new env in our own architecture and route some traffic there
    • Because we need more cache ($)
    • We want more choice in memcache clients; memcachier requires the binary protocol, for which there is only one client, written by the memcachier people. They do not appear to have a good idea right now of what is causing the socket issues.
  • [ ] @mzikherman is going to make sure we do not make any unnecessary MP requests or cache data for those requests when a crawler comes by.
  • [ ] @alloy & @mzikherman Replace data loader singletons with per-request instances (see the sketch after this list)
    • [x] Covered schema/artist/*
    • [ ] The rest
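
A minimal sketch of the per-request pattern, assuming Express, `express-graphql`, and the `dataloader` package; `createLoaders`, `fetchArtists`, and the schema module are illustrative names, not Metaphysics' actual code:

```js
// Sketch of the per-request loader pattern (illustrative names).
// A singleton DataLoader caches forever and across users; constructing
// the loaders per request scopes both batching and caching to one request.
const DataLoader = require("dataloader");
const express = require("express");
const graphqlHTTP = require("express-graphql");
const schema = require("./schema"); // hypothetical schema module

// Hypothetical batch fetcher: ids -> Promise of artists, one upstream call.
const fetchArtists = ids =>
  Promise.all(ids.map(id => ({ id, name: `artist ${id}` })));

const createLoaders = () => ({
  artistLoader: new DataLoader(fetchArtists),
});

const app = express();
app.use(
  "/",
  graphqlHTTP(() => ({
    schema,
    // Fresh loaders on every request, exposed to resolvers via context.
    context: createLoaders(),
  }))
);
```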

alloy · Mar 07 '17 17:03

cc @alloy. I took a look at the Reflection/Google crawling behavior to remind myself. Here's a basic rundown of the current setup (sorry for being long-winded):

Through sitemaps, Reflection crawls the site and snapshots the rendered HTML, which is basically just the 'above the fold' server-side rendered content, with support for some client-side content loading as well, though I've found occasional inconsistencies in capturing that.

Andy's crawled page: http://artsy-reflection.s3-website-us-east-1.amazonaws.com/__reflection/47f98474592f66f9/artist/andy-warhol

These are just HTML snapshots with no JavaScript to execute.

Force then uses a meta tag on pages to tell Googlebot to request the escaped fragment version.

An example of a meta tag: https://github.com/artsy/force/blob/9eee12afd13130e297ea388e6baabd4e6e6b5c52/desktop/apps/artist/templates/meta.jade#L20-L21

Force then uses middleware to detect the presence of `?_escaped_fragment_`; if the Reflection-crawled version of the page exists, it serves that. Otherwise, it allows the bot to crawl the real page.

Middleware: https://github.com/artsy/force/blob/8b2ddeaa398d23177714df033f6d82f314afdf81/desktop/lib/middleware/proxy_to_reflection.coffee#L11-L15
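
For readers who don't want to chase the link, here's roughly what that middleware does, sketched as plain Express JavaScript; the Reflection base URL and error handling here are illustrative, not the actual CoffeeScript:

```js
// Illustrative approximation of Force's proxy_to_reflection middleware.
// If the request carries ?_escaped_fragment_ and Reflection has a snapshot
// for the path, serve it; otherwise fall through to the live page.
const request = require("request");

// Example bucket taken from the Andy Warhol snapshot above; the real
// base URL is configured in Force.
const REFLECTION_BASE =
  "http://artsy-reflection.s3-website-us-east-1.amazonaws.com/__reflection/47f98474592f66f9";

const proxyToReflection = (req, res, next) => {
  if (!("_escaped_fragment_" in req.query)) return next();

  request(`${REFLECTION_BASE}${req.path}`, (err, response, body) => {
    // Snapshot missing or errored: let the bot crawl the live page.
    if (err || response.statusCode !== 200) return next();
    res.send(body);
  });
};

module.exports = proxyToReflection;
```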

That behavior in the wild: https://www.artsy.net/artist/andy-warhol?_escaped_fragment_=

So... the issue of basically always filling up the cache with our content, even for pages that get almost no traffic, is always going to be present. We use Reflection so we can control how often to crawl, serve those snapshots, and protect ourselves from other aggressive crawling. But since, no matter what, something is going to be crawling our site on a regular basis, it seems we'll eventually fill the cache over time. The LRU key eviction strategy should mean that the less-trafficked pages bear the brunt of that impact, though, so that's fine.

> or cache data for those requests when a crawler comes by.

Maybe we just need a bigger cache? I'm not sure that changing the behavior to avoid caching when a crawler comes through (which will most likely be on the lower-traffic pages, since crawling the more popular ones would already be hitting the cache) will really change much. A big enough cache plus LRU key eviction should sort of naturally work itself out, I think.
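
That said, if we did want to try skipping cache writes for crawler traffic, a minimal sketch might look like this; the `isCrawler` heuristic and the `cache.set` interface are assumptions on my part, not Metaphysics' actual cache API:

```js
// Sketch: avoid populating the cache for bot traffic, so crawls of
// low-traffic pages don't evict keys that real users are hitting.
// The user-agent list and cache interface are illustrative.
const CRAWLER_UA = /googlebot|bingbot|slurp|duckduckbot/i;

const isCrawler = req =>
  CRAWLER_UA.test(req.headers["user-agent"] || "") ||
  "_escaped_fragment_" in req.query;

const maybeCache = (req, cache, key, value) => {
  if (isCrawler(req)) return; // still read from cache, just don't write
  cache.set(key, value);
};
```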

> make sure we do not make any unnecessary MP requests

In terms of Googlebot crawling pages that lack a Reflection pre-rendered page, I do see a few Googlebot entries in the Metaphysics logs. Some of them are for pretty recent content, and a couple that I checked had actually been crawled by Reflection just this morning. So perhaps we should bump up the Reflection crawling cadence? Additionally, I see some requests from Googlebot without `?_escaped_fragment_=`, and I'm not sure what those are. Should we be using the Reflection-proxying middleware based on user agent (as well as that query param)? (cc @joeyAghion)
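
If we did go the user-agent route, the guard in the proxy middleware could be extended along these lines (the regex and helper name are hypothetical):

```js
// Hypothetical extension of proxy_to_reflection's guard: serve the
// snapshot for recognized bot user agents as well as the query param.
const BOT_UA = /googlebot|bingbot|yandexbot|duckduckbot/i;

const shouldServeSnapshot = req =>
  "_escaped_fragment_" in req.query ||
  BOT_UA.test(req.headers["user-agent"] || "");
```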

Anyways... that's basically a summary of the current state of the moving parts. Besides getting a bigger cache, which we will do when we move off Heroku, I'm not sure there's any behavior to really change with respect to caching crawled pages. There might be something to explore in increasing Reflection's crawling cadence, as well as more aggressively redirecting bots to the pre-rendered page.

mzikherman · Mar 08 '17 18:03

@mzikherman @alloy it would be a good step if we could adopt some tooling around memory profiling here as well. See: https://github.com/artsy/force/issues/1118
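
For what it's worth, one lightweight option (a suggestion, not something the linked issue prescribes) is the `heapdump` npm module, which can snapshot the V8 heap on a signal or on demand:

```js
// Sketch: heap snapshots with the `heapdump` npm module. Requiring it
// also registers a SIGUSR2 handler, so `kill -USR2 <pid>` dumps a
// snapshot with no further code changes.
const heapdump = require("heapdump");

// Example: snapshot every 15 minutes; diff consecutive snapshots in
// Chrome DevTools to see which objects accumulate between crawls.
setInterval(() => {
  heapdump.writeSnapshot((err, filename) => {
    if (err) console.error("heapdump failed:", err);
    else console.log("heap snapshot written to", filename);
  });
}, 15 * 60 * 1000);
```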

izakp · Mar 28 '17 19:03