Caching 1: initial setup and background
How to set up and tweak caching of plone websites.
Terminology
- Proxy cache: basically squid or varnish. Before requests reach plone,
they first go through the proxy cache, which gets a chance at serving up the
content instead of having plone do all the work. The proxy cache sits
in front of plone, either before or after the webserver.
- Browser cache: what your browser caches on your local hard disk.
- Plone: the plone website running on zope.
- Webserver: the front-end apache or nginx. This can be either between
Plone and the Proxy Cache or the Proxy Cache and the Browser.
- CacheFu: the caching helper add-on for plone which is responsible for
all caching headers added to responses. Install it or be doomed.
Basic setup
The basic setup consists of three pieces:
- The cachefu product (Products.CacheSetup) is installed in the plone
site. The actual caching configuration happens here.
- A proxy cache (usually varnish) in front of plone. The varnish config is
generated by plone’s varnish recipe and doesn’t need additional tweaking
unless you’re doing wildly different things.
A word about the varnish cache size. Varnish’s FAQ says that you cannot regulate
the memory used by varnish: the OS does that. Just set the cache size to
the total amount of data that you guess will end up in there. Some extra
space doesn’t hurt (apart from the disk space it takes). Read varnish’s
architect notes if you don’t trust this statement.
- Apache or nginx in front of or behind varnish as the front-end webserver, used
for rewriting requests.
Initial cachefu configuration actions
After installing cachefu, go to the cachefu configuration page. The initial
defaults are mostly fine, but some settings on that first page need to be
changed.
- Switch compression to “never” as compression has given grief with
company-internal microsoft proxy servers in at least two cases that I’ve
been involved with.
- This means that you can remove “accept-encoding” from the “vary” header
setting, as that isn’t needed once compression is switched off.
- The purge setting is normally “Purge with VHM urls”, assuming apache does
the rewriting from www.example.com to
http://localhost:1234/Virtualhost... in the normal VirtualHostMonster
way.
- Set the proxy cache server’s address to your varnish’s address (just stick
to 127.0.0.1 most of the time) and the correct port number.
- Set your site’s address to the correct value (something like
http://zestsoftware.nl:80). Don’t forget to make a choice whether you
want to prepend www or not and redirect the non-chosen variant to the right
one in apache or nginx. The purge mechanism will use this site address
as the basis for the purge urls it sends to varnish when needed. (Purging is
explained later on).
- Yeah, switch on cachefu with the checkbox at the top :-) You’ll
learn to love this checkbox in development, though, where you switch it off.
- Add your custom content types (and those of installed add-ons) to
the relevant “content” or “container” rules. Content for individual
page-like items and container for folder-like items.
Background: browser’s request handling
To understand the way caching works we have to look at the browser as that is
where the requests originate. A browser can do various things:
- Request something for the first time. Just a GET front-page
request. Similarly when it has no copy in the browser cache: it just
requests the item.
- When it has a copy in the browser cache, it can request the page in a
bit more friendly way by additionally passing along an “If-Modified-Since”
header. Its value is the date of the cached copy, normally the
“Last-Modified” date the server sent along with it. The reply can be
an actually fresh page (“200 OK”) or a short “304 Not Modified” message.
- When the server returned a special code (the “Etag” header) the last time
the item was requested, the browser can return that code with its next
request as an “If-None-Match” header. If the special code still matches on
the server, the server again returns a “304 Not Modified”; otherwise it
returns a fresh page.
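The three cases above can be sketched from the server's side. This is a minimal illustration, not how plone or cachefu implement it: the function name `respond` and its arguments are made up for this example, and the Etag comparison is a plain string match.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers, etag, last_modified):
    """Return (status, headers) for a GET request.

    `etag` and `last_modified` describe the server's current copy;
    `last_modified` must be a timezone-aware UTC datetime.
    """
    headers = {
        "ETag": etag,
        "Last-Modified": format_datetime(last_modified, usegmt=True),
    }
    # When the browser sends an Etag, that check decides the outcome.
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        return (304 if if_none_match == etag else 200), headers
    # Otherwise fall back to the date comparison, if the browser sent one.
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        # HTTP dates only have one-second resolution.
        if last_modified.replace(microsecond=0) <= since:
            return 304, headers
    # First request, or the copy in the browser cache is out of date.
    return 200, headers
```

A real server would send the page body along with a 200 and no body with a 304; that part is left out here.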
As seen above, a server can send various headers along to the browser,
instructing it to store a page for a while, for instance.
- An “Etag” header asks the browser to remember the Etag’s value as a special
code and to return that as an “If-None-Match” header the next time the same
URL is requested. Etags are explained later on.
- “Last-Modified” is the date the object was last modified. The browser can
ask whether the object was actually modified since the last request in the
hope of getting a “304 Not Modified”. Note that some browsers themselves try
to figure out whether pages and especially images can be cached locally for
a while based on the last modification date. Setting an explicit expiration
date can help in those cases. Otherwise an item that hasn’t been modified in
months can be cached implicitly for weeks.
- “Expires” gives a specific expiration date. The browser is allowed to cache
the object and doesn’t have to re-request it until the expiration date has
passed. Great way to restrict the number of requests, but until then you
have no way of refreshing the content in the browser.
- A “Pragma: no-cache” tells the browser not to store it in its browser cache,
resulting in a fresh copy all the time. Warning: don’t use this in
combination with https and IE (there’s a bug in IE).
The “Cache-Control” header can have several values that give hints on how to
cache the item:
- “max-age” gives a maximum age, calculated from the moment the browser
downloads it. It is somewhat similar to “Expires” in end result. Though
both are theoretically clear in intent, browsers differ in the way they
handle “expires” and “max-age”. So if you see unexpected behaviour: google
for it.
- “must-revalidate” tells the browser not to start thinking all by itself, but
to always ask whether some piece of content is still fresh
(“If-Modified-Since”) once it is expired. The reason: to improve
performance, some browsers make an educated guess on how long they can keep
serving an item without asking the server again. They guess based on the
last modification date and the mime type. To prevent that guess from mucking
up your caching strategy, you can use this header.
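To make the interplay between “max-age”, “Expires” and the browser's own guess concrete, here is a rough sketch of how long a browser might consider an item fresh. The function name and arguments are invented for this example. The 10% rule for the guess is an assumption (it is a common heuristic, but real browsers vary), and the mime type is ignored.

```python
from datetime import datetime, timedelta, timezone


def freshness_lifetime(date, max_age=None, expires=None, last_modified=None):
    """Return how long a response received at `date` may be reused."""
    if max_age is not None:
        # An explicit max-age takes precedence over Expires.
        return timedelta(seconds=max_age)
    if expires is not None:
        # An Expires date in the past means "already stale".
        return max(expires - date, timedelta(0))
    if last_modified is not None:
        # No explicit lifetime: guess, assuming 10% of the item's age.
        return (date - last_modified) / 10
    return timedelta(0)
```

With this guess, an item last modified 300 days ago would be kept for 30 days without asking the server, which is the "cached implicitly for weeks" effect described above.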
Background: proxy cache request handling
Varnish, as installed by plone’s varnish recipe, handles requests in the
following way.
- POST (as opposed to the standard GET) requests are always passed through, as
POST is defined as having side effects. Your browser prompts you for
resending data if you reload a POST page, so the proxy cache is careful too.
- Etag requests (“If-None-Match”) from the browser are also passed
right through to plone as plone is the only one that can calculate whether
the Etag is still valid.
- Authenticated browser requests are also forwarded (with one exception: if
the item has been explicitly marked as public, see later).
- If the above rules don’t match, the URL is looked up in the proxy cache and
returned if found. IMPORTANT NOTE: the cached copy isn’t modified apart
from potentially adding an extra varnish header. So the headers will still
say that plone served up the content. This is expected behaviour. Such a
plone (or rather zope) header doesn’t mean it didn’t come out of the proxy
cache!
- Browser requests that aren’t in the proxy cache are forwarded to plone.
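The decision list above can be written down as a small function. This is only a sketch of the logic as described here, not the generated varnish configuration: the function name, the `authenticated` flag and the shape of the `cache` dictionary are all assumptions made for the example.

```python
def proxy_decision(method, headers, url, cache, authenticated=False):
    """Return "pass", "hit" or "miss" for an incoming request.

    `cache` maps a URL to a dict with a boolean "public" entry
    (a made-up stand-in for the proxy cache's storage).
    """
    # POST has side effects, so it always goes to plone.
    if method == "POST":
        return "pass"
    # Only plone can tell whether an Etag is still valid.
    if "If-None-Match" in headers:
        return "pass"
    entry = cache.get(url)
    # Authenticated requests are forwarded, unless the cached item
    # has explicitly been marked as public.
    if authenticated and not (entry is not None and entry["public"]):
        return "pass"
    if entry is not None:
        return "hit"
    # Not in the cache: forward to plone (and possibly store the reply).
    return "miss"
```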
The proxy cache (varnish) can be influenced by plone by having cachefu send
special headers.
Vary header. In a proxy cache, everything is URL based, so the proxy cache
normally stores only one copy per URL. That is what the “vary header”
configuration on cachefu’s main configuration screen is for: it tells the
proxy cache to store different versions of a URL differentiated by the
indicated header.
If a reply coming out of plone has a vary header of “Accept-Language”, the
proxy cache will store one version of a URL for browsers requesting
that URL with “Accept-Language: nl, en” and another for those with
“Accept-Language: nl, en, de”. (This also means that a single-language site
will be way more efficiently cached, btw).
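One way to picture the effect of the vary header is as an extension of the key under which the proxy cache stores a page. The sketch below is an illustration only; `cache_key` is a made-up helper and real proxy caches organise their storage differently.

```python
def cache_key(url, request_headers, vary):
    """Build a lookup key from the URL plus the request headers named in `vary`."""
    names = sorted(name.strip().lower() for name in vary.split(",") if name.strip())
    # Header names are case-insensitive, so normalise them before the lookup.
    normalized = {key.lower(): value for key, value in request_headers.items()}
    return (url,) + tuple((name, normalized.get(name, "")) for name in names)
```

With `vary` set to "Accept-Language", requests carrying "nl, en" and "nl, en, de" get different keys and therefore separate cached copies; with an empty `vary` they share one.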
The “Cache-Control” header also has several values that are aimed at the
proxy cache:
- “public” notifies the proxy cache that it is allowed to cache the
item. Especially useful for items that are requested by authenticated users
but that are viewable by all (like images).
- “private” notifies the proxy cache that it should not ever cache
the item.
- “s-maxage” is a max-age intended for the proxy cache. This way, you can
instruct the proxy cache to keep hold of an item for a different amount of
time than the browser (which listens only to “max-age”).
- “proxy-revalidate” orders the proxy cache to always re-request an item after
the expiration date has passed (and to not get smart and think it can serve
it a bit longer).
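The difference between what the browser and what the proxy cache read from one and the same “Cache-Control” header can be sketched like this. Both helper names are made up, and the parsing is deliberately simple (it doesn't handle quoted values that contain commas).

```python
def parse_cache_control(value):
    """Split a Cache-Control value into a dict of directives."""
    directives = {}
    for part in value.split(","):
        name, _, argument = part.strip().partition("=")
        if name:
            directives[name.strip().lower()] = argument.strip().strip('"') or True
    return directives


def cache_seconds(value, shared):
    """Return how many seconds the item may be kept.

    `shared=True` means the proxy cache is asking, `shared=False` the browser.
    """
    directives = parse_cache_control(value)
    if shared and "private" in directives:
        return 0  # the proxy cache must not store private items
    if shared and "s-maxage" in directives:
        return int(directives["s-maxage"])  # proxy-specific lifetime
    if "max-age" in directives:
        return int(directives["max-age"])
    return 0
```

For the value "public, max-age=60, s-maxage=3600" the browser keeps the item for a minute while the proxy cache holds on to it for an hour.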
There are a couple of other parameters that can be used for tweaking microsoft
proxy servers, but that should only be done to solve some specific problem
after some heavy googling and testing.
Plone’s cachefu request handling
With cachefu installed, there are three stages in handling a request. Cachefu
ties into the start and the end of plone’s normal content handling:
- Cachefu intercepts the incoming request and determines if it can handle an
“If-None-Match” request or an “If-Modified-Since” by returning a “304 Not
modified” immediately. Or it can return a page from a cachefu-specific
memory cache (explained later).
- Plone serves the content.
- Cachefu intercepts the outgoing response and adds headers for the proxy
cache and for the browser. It can also store the page in the
cachefu-specific memory cache.
The core of cachefu is the adding of headers to outgoing
responses. For configuring this, there are “header sets” and “rules” in
cachefu.
A single header set configuration screen contains checkboxes for all
possible headers that you might want to set. Expires, last modified, public,
private, Etag, etc. The expiration date and the max age can be configured
(in seconds). Headers that are checked are added to the outgoing response
when this header set is used.
Warning: Cachefu’s defaults are good. Before you tweak header
sets, make sure you google a lot. Setting both an expiry date and a
max age at the same time might be theoretically unnecessary, but it
is needed to get consistent behaviour out of various browsers, for
instance. So everything in the header sets has a reason. In rare
cases you can make the decision to change something: make sure it is
a really informed decision! Random tweaks mean customer phone calls.
One modification can be made without risk: expiration date and max age. It
is often handy to copy an existing header set and set the copy’s
duration to 15 minutes instead of an hour or so.
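As an illustration of what a header set boils down to, the sketch below turns a few settings into response headers, setting both an expiration date and a max age from a single duration. It is not cachefu's code: `build_headers` and its parameters are invented for this example and only loosely mirror the checkboxes on the configuration screen.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def build_headers(now, duration_seconds, public=True, etag=None):
    """Return the caching headers for a response generated at `now` (UTC)."""
    visibility = "public" if public else "private"
    headers = {
        # Both max-age and Expires are set, derived from the same duration.
        "Cache-Control": f"{visibility}, max-age={duration_seconds}",
        "Expires": format_datetime(
            now + timedelta(seconds=duration_seconds), usegmt=True
        ),
    }
    if etag is not None:
        headers["ETag"] = etag
    return headers
```

Copying a header set and changing its duration to 15 minutes then corresponds to calling `build_headers(now, 15 * 60)` instead of `build_headers(now, 60 * 60)`.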
Rules are searched in order and the first one that matches wins. The most
common rules define the content types for which they’re applicable. There are
also template rules for matching specific templates/views like the sitemap.
Every rule lists the header set that it uses. Actually, you can specify a
separate header set for anonymous and for logged in users. Or, for maximum
flexibility, you can put in a bit of python code that returns the id of the
header set.
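The "first match wins" lookup, with a separate header set for anonymous and logged-in users, could look roughly like this. The `Rule` class, the function name and the header set ids are hypothetical and only serve the example; cachefu's own rule objects look different.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    content_types: tuple          # content types this rule applies to
    anonymous_header_set: str     # header set id for anonymous visitors
    member_header_set: str        # header set id for logged-in users


def pick_header_set(rules, content_type, is_anonymous):
    """Return the header set id of the first matching rule, or None."""
    for rule in rules:
        if content_type in rule.content_types:
            if is_anonymous:
                return rule.anonymous_header_set
            return rule.member_header_set
    return None


# Hypothetical configuration: the order of the rules matters.
RULES = [
    Rule(("Image", "File"), "cache-in-proxy-1-hour", "cache-in-proxy-1-hour"),
    Rule(("Document", "News Item"), "cache-in-proxy-15-minutes", "cache-with-etag"),
    Rule(("Document",), "no-cache", "no-cache"),  # never reached for "Document"
]
```

Here `pick_header_set(RULES, "Document", is_anonymous=False)` yields the id from the second rule; the third rule never gets a chance because an earlier one already matched.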
If you’ve created extra views for content types, you’ll have to add their
names to the list of view names. Otherwise only the default view will be
cached.
The last piece of cachefu is in-memory caching. You can have cachefu store
rendered pages (identified by url and Etag) which it can then look up when a
request enters plone. This way you get a bit of the Etag performance for
multiple users at the same time. This is all explained later.
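Conceptually, such a memory cache is a lookup table keyed by URL and Etag. The sketch below is a simplification under that assumption; the class name, its methods and the example Etag strings are made up and say nothing about how cachefu actually builds its Etags or stores its pages.

```python
class PageCache:
    """Keep rendered pages in memory, identified by URL and Etag."""

    def __init__(self):
        self._pages = {}

    def store(self, url, etag, html):
        self._pages[(url, etag)] = html

    def lookup(self, url, etag):
        # Returns None when no page was stored for this URL/Etag pair.
        return self._pages.get((url, etag))
```

Two requests for the same URL that result in the same Etag get the same stored page, even when they come from different users; a request with a different Etag misses and is rendered by plone as usual.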
Items not covered in this document
There is also caching-related functionality outside of cachefu’s control:
- Standard zope RAM caching.
- Resource registries (portal_css and friends) that combine
css/javascript/kss into several large files with an ID suited to perpetual
caching in the browser and the proxy cache. Cachefu handles that caching
with a special rule, but the files themselves are handled by the resource
registries.
- In-code memoizing.