Caching 1: initial setup and background

How to set up and tweak caching of plone websites.

Terminology

  • Proxy cache: basically squid or varnish. Before requests reach plone, they first go through the proxy cache, which gets a chance at serving up the content instead of having plone do all the work. The proxy cache always sits in front of plone, either before or behind the webserver.
  • Browser cache: what your browser caches on your local hard disk.
  • Plone: the plone website running on zope.
  • Webserver: the front-end apache or nginx. It sits either between plone and the proxy cache or between the proxy cache and the browser.
  • CacheFu: the caching helper add-on for plone which is responsible for all caching headers added to responses. Install it or be doomed.

Basic setup

The basic setup consists of three pieces:

  • The cachefu product (Products.CacheSetup) is installed in the plone site. The actual caching configuration happens here.

  • A proxy cache (usually varnish) in front of plone. The varnish config is generated by plone’s varnish recipe and doesn’t need additional tweaking unless you’re doing wildly different things.

    A word about the varnish cache size. Varnish’s FAQ says that you cannot regulate the memory used by varnish: the OS does that. Just set the cache size to the total amount of data that you guess will end up in there. Some extra space doesn’t hurt (except regarding disk space). Read varnish’s architect notes if you don’t trust this statement.

  • Apache or nginx in front of or behind varnish as the front-end webserver, used for rewriting requests.

Initial cachefu configuration actions

After installing cachefu, go to the cachefu configuration page. The initial defaults are mostly fine, but some settings on this first configuration page need to be changed.

  • Switch compression to “never” as compression has given grief with company-internal microsoft proxy servers in at least two cases that I’ve been involved with.
  • This means that you can remove “accept-encoding” from the “vary” header setting, as that isn’t needed once compression is switched off as in the point above.
  • The purge setting is normally “Purge with VHM urls”, assuming apache does the rewriting from www.example.com to http://localhost:1234/Virtualhost... in the normal VirtualHostMonster way.
  • Set the proxy cache server’s address to your varnish’s address (just stick to 127.0.0.1 most of the time) and the correct port number.
  • Set your site’s address to the correct value (something like http://zestsoftware.nl:80). Don’t forget to make a choice whether you want to prepend www or not and redirect the non-chosen variant to the right one in apache or nginx. The purge mechanism will use this site address as the basis for the purge urls it sends to varnish when needed. (Purging is explained later on).
  • Yeah, switch on cachefu with the checkbox at the top :-) You’ll learn to love this checkbox in development, though (switch off).
  • Add your custom content types (and those of installed add-ons) to the relevant “content” or “container” rules. Content for individual page-like items and container for folder-like items.

Background: browser’s request handling

To understand the way caching works we have to look at the browser as that is where the requests originate. A browser can do various things:

  • Request something for the first time. Just a GET front-page request. Similarly when it has no copy in the browser cache: it just requests the item.
  • When it has a copy in the browser cache, it can request the page in a friendlier way by additionally passing along an “If-Modified-Since” header whose value is the “Last-Modified” date of its cached copy. The reply can be an actually fresh page (“200 OK”) or a short “304 Not modified” message.
  • When the server returned a special code (the “Etag” header) the last time the item was requested, the browser can return that code with its next request in an “If-None-Match” header. If the special code still matches on the server, the server again returns a “304 Not modified”; otherwise it returns a fresh page.
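The server side of these three cases can be sketched in a few lines of Python. This is only an illustration of the decision, not how plone or cachefu implement it: the function name `conditional_status` and its arguments are made up for this example, and it ignores details such as multiple Etags in one “If-None-Match” header.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(request_headers, current_etag, last_modified):
    """Return 304 if the browser's cached copy is still valid, else 200.

    request_headers: dict of header name -> value sent by the browser.
    current_etag: the Etag the server would send for the item right now.
    last_modified: timezone-aware datetime of the item's last change.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When an Etag is sent, it decides; the date is not consulted.
        return 304 if if_none_match == current_etag else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return 304

    # First request, or the cached copy is out of date: send a fresh page.
    return 200


modified = datetime(2009, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_status({}, '"abc"', modified))
print(conditional_status({"If-None-Match": '"abc"'}, '"abc"', modified))
print(conditional_status(
    {"If-Modified-Since": format_datetime(modified, usegmt=True)},
    '"abc"', modified))
```

The first call stands for a plain GET, the second for an Etag that still matches, the third for a cached copy that is as new as the item itself.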

As seen above, a server can send various headers along to the browser, instructing it to store a page for a while, for instance.

  • An “Etag” header asks the browser to remember the Etag’s value as a special code and to return that as an “If-None-Match” header the next time the same URL is requested. Etags are explained later on.
  • “Last-Modified” is the date the object was last modified. The browser can ask whether the object was actually modified since the last request in the hope of getting a “304 Not modified”. Note that some browsers themselves try to figure out whether pages and especially images can be cached locally for a while based on the last modification date. Setting an explicit expiration date can help in those cases. Otherwise an item that hasn’t been modified in months can be cached implicitly for weeks.
  • “Expires” gives a specific expiration date. The browser is allowed to cache the object and doesn’t have to re-request it before the expiration date is over. Great way to restrict the number of requests, but you have no possibility of refreshing the content in the browser.
  • A “Pragma: no-cache” tells the browser not to store it in its browser cache, resulting in a fresh copy all the time. Warning: don’t use this in combination with https and IE (there’s a bug in IE).

The “Cache-Control” header can have several values that give hints on how to cache the item:

  • “max-age” gives a maximum age in seconds, calculated from the moment the browser downloads the item. It is somewhat similar to “Expires” in end result. Though both are theoretically clear in intent, browsers differ in the way they handle “Expires” and “max-age”. So if you see unexpected behaviour: google for it.
  • “must-revalidate” tells the browser not to start thinking all by itself, but to always ask whether some piece of content is still fresh (“If-Modified-Since”) once it is expired. The reason: to improve performance, some browsers make an educated guess on how long they can keep serving an item without asking the server again. They guess based on the last modification date and the mime type. To prevent that guess from mucking up your caching strategy, you can use this header.
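To make the interplay between “max-age”, “Expires” and the browser’s own guess concrete, here is a rough Python sketch of how a cache could work out how long it may reuse an item without asking. The function name `freshness_lifetime` is invented for this example. The 10% guess is an assumption too: it is the figure the HTTP caching specification mentions as a typical heuristic, and real browsers vary. The sketch doesn’t model “must-revalidate”.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def freshness_lifetime(headers, now):
    """Return a timedelta: how long the item may be reused unasked."""
    # 1. An explicit max-age wins over everything else.
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age":
            return timedelta(seconds=int(value))

    # 2. Otherwise use the explicit expiration date, if any.
    if "Expires" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        return max(expires - now, timedelta(0))

    # 3. No explicit information: guess from the last modification date.
    #    Assumption: one tenth of the time since the last change.
    if "Last-Modified" in headers:
        age = now - parsedate_to_datetime(headers["Last-Modified"])
        return age / 10

    return timedelta(0)


now = datetime(2009, 6, 1, tzinfo=timezone.utc)
old = format_datetime(now - timedelta(days=100), usegmt=True)
print(freshness_lifetime({"Cache-Control": "max-age=3600"}, now))
print(freshness_lifetime({"Last-Modified": old}, now))
```

The second call shows the implicit caching described above: an item that was last changed 100 days ago and carries no explicit expiration is reused for 10 days under this heuristic.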

Background: proxy cache request handling

Varnish, as installed by plone’s varnish recipe, handles requests in the following way.

  • POST (as opposed to the standard GET) requests are always passed through, as POST is defined as having side effects. Your browser prompts you for resending data if you reload a POST page, so the proxy cache is careful too.
  • Etag requests (“If-None-Match”) from the browser are also passed right through to plone as plone is the only one that can calculate whether the Etag is still valid.
  • Authenticated browser requests are also forwarded (with one exception: if the item has been explicitly marked as public, see later).
  • If the above rules don’t match, the URL is looked up in the proxy cache and returned if found. IMPORTANT NOTE: the cached copy isn’t modified apart from potentially adding an extra varnish header. So the headers will still say that plone served up the content. This is expected behaviour. Such a plone (or rather zope) header doesn’t mean it didn’t come out of the proxy cache!
  • Browser requests that aren’t in the proxy cache are forwarded to plone.
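The rules above boil down to a short decision chain. The following Python sketch mirrors it; it is not the actual varnish configuration (which is written in varnish’s own config language), and the function name, the `authenticated` flag and the layout of the `cache` dict are assumptions made for this illustration.

```python
def handle_request(method, url, headers, authenticated, cache):
    """Decide what the proxy cache does with one browser request.

    cache maps a URL to a dict such as {"public": True, "body": "..."}.
    Returns ("pass", None), ("hit", cached_item) or ("miss", None).
    "pass" and "miss" both end up at plone.
    """
    # POST has side effects: never answer it from the cache.
    if method == "POST":
        return ("pass", None)

    # Only plone can tell whether an Etag is still valid.
    if "If-None-Match" in headers:
        return ("pass", None)

    cached = cache.get(url)

    # Authenticated requests go to plone, unless the cached item
    # has been explicitly marked as public.
    if authenticated and not (cached is not None and cached["public"]):
        return ("pass", None)

    if cached is not None:
        return ("hit", cached)
    return ("miss", None)


cache = {
    "/logo.png": {"public": True, "body": "image bytes"},
    "/front-page": {"public": False, "body": "<html>...</html>"},
}
print(handle_request("GET", "/front-page", {}, False, cache)[0])
print(handle_request("GET", "/front-page", {}, True, cache)[0])
print(handle_request("GET", "/logo.png", {}, True, cache)[0])
```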

The proxy cache (varnish) can be influenced by plone by having cachefu send special headers.

  • Vary header. In a proxy cache, everything is URL based, so the proxy cache normally stores only one copy per URL. That is what the “vary header” configuration on cachefu’s main configuration screen is for: it tells the proxy cache to store different versions of a URL differentiated by the indicated header.

    If a reply coming out of plone has a vary header of “Accept-Language”, the proxy cache will store a different version of a URL for browsers requesting that URL with “Accept-Language: nl, en” from those with “Accept-Language: nl, en, de”. (This also means that a single-language site will be way more efficiently cached, btw).
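One way to picture the effect of the vary header is as an extension of the key under which the proxy cache stores a page. The Python sketch below is only a mental model, not varnish’s actual implementation; `cache_key` is a name made up for this example.

```python
def cache_key(url, request_headers, vary):
    """Build the key under which a response is stored in the cache.

    vary is the value of the response's Vary header, for example
    "Accept-Language" or "Accept-Language, Accept-Encoding".
    """
    # Header names are case-insensitive, so compare them in lower case.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    # The URL plus the value of every header listed in Vary.
    return (url,) + tuple((name, lowered.get(name, "")) for name in names)


dutch = cache_key("/news", {"Accept-Language": "nl, en"}, "Accept-Language")
german = cache_key("/news", {"Accept-Language": "nl, en, de"}, "Accept-Language")
print(dutch == german)
```

With an empty vary header both browsers would get the same key, so only one copy of the page would be stored.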

The “Cache-Control” header also has several values that are aimed specifically at the proxy cache:

  • A “public” header notifies the proxy cache that it is allowed to cache the item. Especially useful for items that are requested by authenticated users but that are viewable by all (like images).
  • A “private” header notifies the proxy cache that it should not ever cache the item.
  • “s-maxage” is a max-age intended for the proxy cache. This way, you can instruct the proxy cache to keep hold of an item for a different amount of time than the browser (which listens only to “max-age”).
  • “proxy-revalidate” orders the proxy cache to always re-request an item after the expiration date has passed (and to not get smart and think it can serve it a bit longer).
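As a small illustration of how “private”, “max-age” and “s-maxage” combine from the proxy cache’s point of view, here is a Python sketch. The function name `proxy_ttl` is an assumption for this example, and a real proxy cache takes more into account than these three values.

```python
def proxy_ttl(cache_control):
    """Return how many seconds the proxy cache may keep an item.

    Returns 0 when the item must not be cached by the proxy and
    None when the header gives no duration at all.
    """
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value

    # "private" means: for the browser only, never for a shared cache.
    if "private" in directives:
        return 0

    # The proxy prefers s-maxage and falls back to max-age.
    for name in ("s-maxage", "max-age"):
        if name in directives:
            return int(directives[name])
    return None


print(proxy_ttl("public, max-age=3600, s-maxage=86400"))
print(proxy_ttl("private, max-age=3600"))
```

In the first call the browser would keep the item for an hour while the proxy cache keeps it for a day.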

There are a couple of other parameters that can be used for tweaking microsoft proxy servers, but that should only be done to solve some specific problem after some heavy googling and testing.

Plone’s cachefu request handling

With cachefu installed, there are three stages in handling a request. Cachefu ties into the start and the end of plone’s normal content handling:

  • Cachefu intercepts the incoming request and determines if it can handle an “If-None-Match” request or an “If-Modified-Since” by returning a “304 Not modified” immediately. Or it can return a page from a cachefu-specific memory cache (explained later).
  • Plone serves the content.
  • Cachefu intercepts the outgoing response and adds headers for the proxy cache and for the browser. It can also store the page in the cachefu-specific memory cache.

The central core of cachefu is the adding of headers to outgoing responses. For configuring this, there are “header sets” and “rules” in cachefu.

  • A single header set configuration screen contains checkboxes for all possible headers that you might want to set. Expires, last modified, public, private, Etag, etc. The expiration date and the max age can be configured (in seconds). Headers that are checked are added to the outgoing response when this header set is used.

    Warning: Cachefu’s defaults are good. Before you tweak header sets, make sure you google a lot. Setting both an expiry date and a max age at the same time might be theoretically unnecessary, but it is needed to get consistent behaviour out of various browsers, for instance. So everything in the header sets has a reason. In rare cases you can make the decision to change something: make sure it is a really informed decision! Random tweaks mean customer phone calls.

    One modification can be made without risk: expiration date and max age. It is often handy to copy such an existing header set and set the copy’s duration to 15 minutes instead of an hour or so.

  • Rules are searched in order and the first one that matches wins. The most common rules define the content types for which they’re applicable. There are also template rules for matching specific templates/views like the sitemap.

    Every rule lists the header set that it uses. Actually, you can specify a separate header set for anonymous and for logged in users. Or, for maximum flexibility, you can put in a bit of python code that returns the id of the header set.

    If you’ve created extra views for content types, you’ll have to add their names to the list of view names. Otherwise only the default view will be cached.
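The relation between rules and header sets can be sketched in Python as follows. This is a simplified model, not cachefu’s real code or data structures: the class names, the field names and the header set ids (“proxy-15-minutes”, “etag-only”) are all invented for this illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


@dataclass
class HeaderSet:
    max_age: int  # duration in seconds

    def headers(self, now):
        # Both Expires and max-age are set, for consistent browser behaviour.
        expires = now + timedelta(seconds=self.max_age)
        return {
            "Expires": format_datetime(expires, usegmt=True),
            "Cache-Control": "max-age=%d" % self.max_age,
        }


@dataclass
class Rule:
    content_types: tuple
    views: tuple
    anonymous_header_set: str  # id of the header set for anonymous users
    member_header_set: str     # id of the header set for logged in users


def pick_header_set(rules, content_type, view, anonymous):
    """Return the header set id of the first matching rule, or None."""
    for rule in rules:
        if content_type in rule.content_types and view in rule.views:
            if anonymous:
                return rule.anonymous_header_set
            return rule.member_header_set
    return None


rules = [
    Rule(("Document",), ("document_view",), "proxy-15-minutes", "etag-only"),
    Rule(("Folder",), ("folder_listing",), "etag-only", "etag-only"),
]
header_sets = {"proxy-15-minutes": HeaderSet(max_age=15 * 60)}

chosen = pick_header_set(rules, "Document", "document_view", anonymous=True)
now = datetime(2009, 3, 1, 12, 0, tzinfo=timezone.utc)
print(chosen, header_sets[chosen].headers(now))
```

A view that is not listed in any rule yields no header set at all, which matches the remark above that extra views have to be added to the list of view names before they are cached.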

The last piece of cachefu is in-memory caching. You can have cachefu store rendered pages (identified by url and Etag) which it can then look up when a request enters plone. This way you get a bit of the Etag performance for multiple users at the same time. This is all explained later.
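As a rough mental model (not cachefu’s actual implementation), such a memory cache is a lookup table keyed on the combination of URL and Etag. The class name `PageCache` below is made up for this sketch.

```python
class PageCache:
    """Keep rendered pages in memory, identified by URL and Etag."""

    def __init__(self):
        self._pages = {}

    def store(self, url, etag, body):
        self._pages[(url, etag)] = body

    def lookup(self, url, etag):
        # Returns None when there is no page for this URL/Etag pair.
        return self._pages.get((url, etag))


pages = PageCache()
pages.store("/front-page", "etag-1", "<html>rendered page</html>")
print(pages.lookup("/front-page", "etag-1"))
print(pages.lookup("/front-page", "etag-2"))
```

Every user whose request results in the same Etag gets the stored page. As soon as the Etag changes, the lookup misses and plone renders the page again.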

Items not covered in this document

There is also caching-related functionality outside of cachefu’s control:

  • Standard zope RAM caching.
  • Resource registries (portal_css and friends) that combine css/javascript/kss into several large files with an ID suited to perpetual caching in the browser and the proxy cache. Cachefu handles that caching with a special rule, but the files themselves are handled by resource registries.
  • In-code memoizing.