mod_gzip - serving compressed content by the Apache webserver - Page 9
Author: Michael Schröpl
Caching mod_gzip compressed data using proxy servers
Using a configurable compression function like mod_gzip is ultimately always a kind of content negotiation, i.e. serving different content for the same requested URL depending on specific information in the HTTP headers.
On the other hand, HTTP allows responses to HTTP requests to be stored temporarily in caches, especially when proxy servers are used. If
- an HTTP client sends a request,
- the corresponding response is served in compressed form and stored by some proxy and
- subsequently another HTTP client submits a request for the URL in question,
then the proxy server - not in possession of further information - has a problem:
- Is it entitled to serve the cached content to this second HTTP client as well, or
- must it forward the request to the HTTP server?
Only the HTTP server can ultimately determine (based upon its configuration containing the corresponding filter rules) whether the second HTTP client may receive compressed response data as well.
Incidentally, this is not an effect of compression alone but a general problem of caching HTTP data whose content is not identified unambiguously by a URL, whether in a proxy server's cache or in any similar storing server along the transport route. This includes negotiation procedures of any type as well as the submission of additional information in the HTTP headers, such as authentication data or cookies.
Of course one can try to avoid the problem by explicitly forbidding all proxy servers on the way between client and server to cache the response's data (using the HTTP headers Expires: and Pragma: in HTTP/1.0 and Cache-Control: in HTTP/1.1).
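As a rough illustration of what such a "do not cache" response looks like, the following sketch builds the three headers named above. The function name no_cache_headers is hypothetical, and the chosen header values are one common combination, not the only valid one.

```python
from email.utils import formatdate


def no_cache_headers():
    """Return response headers asking HTTP/1.0 and HTTP/1.1 caches not to reuse a response."""
    return {
        # HTTP/1.0: an Expires date in the past marks the response as already stale.
        "Expires": formatdate(0, usegmt=True),
        "Pragma": "no-cache",
        # HTTP/1.1: forbid storing and forbid reuse without revalidation.
        "Cache-Control": "no-store, no-cache",
    }


for name, value in no_cache_headers().items():
    print(f"{name}: {value}")
```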
But the goal of compression is to speed up the data transfer (by reducing the data volume), and caching serves the same goal (by reducing accesses to the HTTP server). One performance optimization should not make another one unusable, especially as these two do not replace each other but can effectively complement each other in this case.
The HTTP specification contains the definition of the Vary HTTP header where the HTTP server can inform the proxy server about
- whether the response was the unique result of a URL request or
- whether other attributes of a request for the same URL could lead to different results.
Its value may contain a list of names of other HTTP headers whose content has been relevant for serving this very response for a request. Thus the HTTP server can even inform the proxy server about which HTTP headers have influenced the decision about the served content.
When a proxy server forwards a request to an HTTP server and wants to store the response in its cache later, it should still hold the HTTP headers of the original request when the HTTP server's response arrives.
Now if the HTTP server marks a response as conditional content by the corresponding Vary header, then
- the proxy server must store in its cache not only this response's content but also the relevant header information from the request (the headers whose names are listed in the value of the response's Vary header), and
- it must not serve this cached content in response to further requests unless the corresponding HTTP headers of such a subsequent request 'match' those of the original request, i.e. are semantically identical to the original request's values for each of these headers.
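The two rules above can be sketched as a toy cache. This is a minimal illustration, not how any particular proxy is implemented; the class name VaryCache, its methods, and the case and whitespace normalization used as a stand-in for 'semantically identical' are assumptions made for the example.

```python
class VaryCache:
    """Toy proxy cache: one URL may hold several variants selected by Vary."""

    def __init__(self):
        # url -> list of (selecting request header values, response body)
        self._variants = {}

    @staticmethod
    def _lower_keys(headers):
        return {name.lower(): value for name, value in headers.items()}

    @staticmethod
    def _normalize(value):
        # Crude stand-in for "semantically identical": ignore case and extra spaces.
        return " ".join((value or "").lower().split())

    def store(self, url, request_headers, response_headers, body):
        vary = self._lower_keys(response_headers).get("vary", "")
        names = [n.strip().lower() for n in vary.split(",") if n.strip()]
        if "*" in names:
            # The negotiation cannot be described by header names: do not store.
            return False
        request = self._lower_keys(request_headers)
        selecting = {n: self._normalize(request.get(n)) for n in names}
        self._variants.setdefault(url, []).append((selecting, body))
        return True

    def lookup(self, url, request_headers):
        request = self._lower_keys(request_headers)
        for selecting, body in self._variants.get(url, []):
            if all(self._normalize(request.get(n)) == v for n, v in selecting.items()):
                return body
        return None  # no matching variant: forward the request to the HTTP server


cache = VaryCache()
cache.store(
    "/index.html",
    {"Accept-Encoding": "gzip"},
    {"Vary": "Accept-Encoding", "Content-Encoding": "gzip"},
    b"compressed body",
)
print(cache.lookup("/index.html", {"Accept-Encoding": "gzip"}))
print(cache.lookup("/index.html", {}))
```

A client that sends no Accept-Encoding header gets no cache hit here, so its request would be forwarded to the HTTP server.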
The previous explanations have shown how a proxy server can handle the conditional delivery of HTTP responses (being the result of content negotiation) correctly while making maximum use of its cache, provided that
- the HTTP server provides the proxy server with sufficient information about the negotiation parameters and
- the proxy server is in possession of the corresponding information in case of a subsequent request to the same URL by another HTTP client.
But the latter restricts the degrees of freedom of the negotiation process. If the proxy server must decide whether it may serve its cached content exclusively on the basis of information within an HTTP request, then the negotiation rules of the HTTP server must not refer to anything other than HTTP header contents!
But unfortunately this precondition is not fulfilled by mod_gzip, as of the six classes of filter rules provided
- two 'legal' ones (reqheader and uri) exclusively refer to HTTP header contents but
- four other 'illegal' ones (rspheader, handler, file and mime) refer to information that will be available only during evaluation of the request by the HTTP server.
So if a mod_gzip enhanced server uses one of these 'illegal' filter rules, the proxy server can no longer correctly decide whether its cached content is applicable to further requests.
Nor would it help the proxy server much if mod_gzip told it that it is evidently overtaxed (by supplying a complete list of the filter rule classes significant for this request within some Vary: header, if that were even legal). All the proxy server could do is treat the occurrence of one of these four 'illegal' filter rule classes as a criterion for not caching the response's content.
This alone would not be that bad - as long as the HTTP server limited itself to using nothing but 'legal' rules, it would be able to cooperate optimally with a proxy server.
But unfortunately doing so is impossible with mod_gzip 184.108.40.206a.
The embedding of mod_gzip 220.127.116.11a into the Apache 1.3 architecture is done in a relatively complex way:
- In processing phase 1 mod_gzip checks whether it should be interested at all in handling this request's results and prepare for it - based upon the rules of four classes (reqheader, uri, file and handler, i.e. two 'legal' and two 'illegal' rule classes).
- In processing phase 2 mod_gzip checks whether it should actually compress the (now available) response content - based upon the rules of two classes (rspheader and mime, both 'illegal' rule classes).
For a request to be permitted for compression, at least one include rule from each of the two phases must be fulfilled (and none of the exclude rules).
But as both include rule classes from phase 2 are 'illegal', each list of relevant filter rule classes for a successful compression in the current mod_gzip implementation must cover at least one 'illegal' rule class.
Thus it is impossible to provide a proxy server with information it can use for deciding about the applicability of some cache content - the submitted information will always exceed what the proxy server is able to evaluate.
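The argument can be checked by brute force under a simplified model. The sketch below assumes that compression requires at least one matching include rule in each phase and ignores exclude rules; the function names are hypothetical and do not correspond to anything in mod_gzip's source code.

```python
from itertools import combinations

PHASE1 = ("reqheader", "uri", "file", "handler")
PHASE2 = ("rspheader", "mime")
LEGAL = {"reqheader", "uri"}  # classes a proxy could evaluate from the request alone


def subsets(classes):
    """Yield every subset of the given rule classes as a set."""
    for size in range(len(classes) + 1):
        for combo in combinations(classes, size):
            yield set(combo)


def compresses(matched_phase1, matched_phase2):
    # Simplified model: one matching include rule is needed in each phase.
    return bool(matched_phase1) and bool(matched_phase2)


def proxy_could_decide(matched_phase1, matched_phase2):
    # The proxy can only reproduce decisions based on 'legal' classes.
    return (matched_phase1 | matched_phase2) <= LEGAL


outcomes = [
    proxy_could_decide(m1, m2)
    for m1 in subsets(PHASE1)
    for m2 in subsets(PHASE2)
    if compresses(m1, m2)
]
print(len(outcomes), any(outcomes))
```

In this model, none of the combinations that lead to compression can be decided by the proxy, because every one of them includes rspheader or mime.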
Starting with version 18.104.22.168a, mod_gzip sends Vary: headers - for each and every request where the module has been involved at least once (regardless of whether compressed data have been served or not).
At the current state of knowledge, each request handled by mod_gzip (regardless of whether or not the response has actually been served in compressed form) is potentially a negotiation:
- at least about the Accept-Encoding HTTP header, and
- possibly about other HTTP headers as well (namely all those that occur within filter rules of the reqheader class)
As of now, mod_gzip is not yet able to generate the best possible, i.e. the minimum, set of Vary: headers required - for this it would be necessary to rewrite the rule evaluation procedure of mod_gzip completely.
As a first step, the module since version 22.214.171.124a sends a Vary: header that contains
- the value Accept-Encoding as well as
- the names of every header being used within any reqheader rules,
because each one of these rules might make the difference for the result of the negotiation, and in each of these cases the result would depend on the values of the received HTTP headers. In certain cases this may be far too much (and then massively hinder the efficient caching of content), but at least it is something to begin with.
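The construction of this header value can be sketched as follows. This is an illustration of the described behaviour, not mod_gzip's actual code; the function name build_vary_value and the idea of passing the header names from the reqheader rules as a list are assumptions.

```python
def build_vary_value(reqheader_rule_headers):
    """Combine Accept-Encoding with every header name used in reqheader rules."""
    names = ["Accept-Encoding"]
    seen = {"accept-encoding"}
    for name in reqheader_rule_headers:
        key = name.strip().lower()  # header names are case-insensitive
        if key and key not in seen:
            seen.add(key)
            names.append(name.strip())
    return ", ".join(names)


print("Vary: " + build_vary_value(["User-Agent", "user-agent"]))
```

Every additional name in the list multiplies the number of variants a proxy may have to keep in parallel, which is why a minimal list is desirable.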
- uri or type - as the evaluation of such a rule cannot have depended on the received HTTP headers, and therefore in these cases no negotiation (about dimensions that might hold different values for different HTTP requests) has actually taken place at all.
If you want no Vary: headers to be sent for files that you are sure will never be served in compressed form because of other configuration rules, you have to turn off mod_gzip for these files.
An example for not sending Vary: headers for GIF images that might be cached by some proxy like Squid 2.4 might look like this:
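A minimal sketch of such a configuration, assuming that the mod_gzip_on directive is accepted inside a FilesMatch container by the mod_gzip version in use (this should be verified against that version's documentation):

```apache
# Assumption: switch mod_gzip off entirely for GIF files,
# so that the module is never involved and sends no Vary: header.
<FilesMatch "\.gif$">
    mod_gzip_on No
</FilesMatch>
```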
For versions to come the following tasks remain open:
- Recognizing in all possible cases that the reaction to the current request can never cause compressed data to be served because some mod_gzip_item_exclude rule independent from the request's attributes is firing.
- Recognizing that some negotiation has taken place that cannot be described by a list of HTTP header names - in this case Vary: * ought to be sent (and the documentation for mod_gzip should explicitly point out that these directives be used only if absolutely required as using them will have a negative effect on the work of caching proxies).
- Doublechecking whether constellations are possible where only some subset of header names from all reqheader rules are required in a Vary: header - the fewer names there, the fewer variants have to be stored in the proxy cache in parallel.
In very special cases, i.e. when certain configuration directives are used, mod_gzip negotiates about dimensions that cannot even be expressed in terms of HTTP header names. This applies to the directives
- mod_gzip_min_http (minimum HTTP version required) as well as
- mod_gzip_handle_methods (HTTP methods to be handled)
In both cases mod_gzip cannot explain to a proxy what has been done by telling the names of HTTP headers. The appropriate reaction according to the HTTP/1.1 specification is sending the Vary: * HTTP header.
mod_gzip 126.96.36.199a sends a Vary: * header if the mod_gzip_min_http directive has been used.
As for the mod_gzip_handle_methods directive, it is currently not yet entirely clear whether two HTTP requests for the same URI that use different HTTP methods actually ask for the same HTTP entity. The answer will decide whether a Vary: * header has to be sent when this directive is used as well; this is an issue to be solved in forthcoming releases.
But as a proxy server cannot understand the type of negotiation performed when it receives Vary: *, it is not entitled to store responses bearing this mark in a cache.
Thus using one of these directives completely disables proxy caching of each and every response sent by this HTTP server, whether in compressed or in uncompressed form. Therefore we advise you not to use these directives any more.
Storing variants for different negotiation parameters in parallel in a proxy cache may be reasonable if only a few possible values actually occur - such as in the case of Content-Encoding. If there is a large number of possible values, storing the variants in parallel is no longer feasible.
Exactly this applies to the UserAgent name as identification of the HTTP client. Each sub-version of a browser sends a complex UserAgent string that contains not only the name and version of the browser but further information (national language, operating system name and version etc.). There are hundreds of known UserAgent strings - and beyond this a number of mechanisms to manipulate the UserAgent string. Some browsers (like Opera) even allow the user to select the content of the UserAgent string explicitly in order to pose as a different browser (because many technically incompetent web page creators build their site based on the name of a browser and unnecessarily exclude some browsers from it, or just because users do not want to reveal details about the computer equipment they use, for the sake of their privacy).
However reasonable it may be in some cases to serve compressed web pages depending on the identity of an HTTP client (for example with respect to the numerous bugs of Netscape 4), the downside of using the UserAgent string as the basis of an HTTP negotiation remains: on the one hand the content of this HTTP header varies too much to draw reliable conclusions from it, and on the other hand it takes too many different values for any caching proxy ever to keep the results of requests for the same URL in parallel for all these negotiation variants.
From version 188.8.131.52a on, mod_gzip sends a Vary: header naming the HTTP header User-Agent: as a parameter of the negotiation if a corresponding directive has been used in the configuration. But the probability that a subsequent request contains an exactly identical User-Agent: value (so that this client may receive the already stored content) is very low.
Actually, the HTTP server would treat even large sets of UserAgents (that are assumed to be functionally equivalent due to its configuration) identically during negotiation - but the Vary: header doesn't allow the HTTP server to tell the caching proxy which parts of the UserAgent strings were evaluated by the HTTP server as significant content during negotiation. The proxy server can only get to know that the UserAgent has played some role - and being aware of this, the proxy must treat individual UserAgents as being different even if the HTTP server would not act like this.
So using filter rules that evaluate the UserAgent HTTP header will in practice totally disable caching for the responses created this way. Users of mod_gzip should be fully aware of this effect - and should therefore, if at all possible, use other filter methods (with a smaller number of possible values) to achieve the same differentiation between these HTTP clients.
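The mismatch between what the server distinguishes and what the proxy must distinguish can be made concrete with a small sketch. The classification rule in server_class and the sample UserAgent strings are hypothetical and only serve as an illustration.

```python
def server_class(user_agent):
    """Hypothetical server rule: only genuine Netscape 4 gets uncompressed data."""
    is_netscape4 = user_agent.startswith("Mozilla/4.") and "compatible" not in user_agent
    return "uncompressed" if is_netscape4 else "compressed"


# Illustrative UserAgent strings, not taken from the source.
user_agents = [
    "Mozilla/4.78 [en] (Windows NT 5.0; U)",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Opera/6.01 (Windows 2000; U) [en]",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.0) Gecko/20020529",
]

# With "Vary: User-Agent" the proxy needs one cache entry per distinct string ...
proxy_variants = len(set(user_agents))
# ... although the server only ever produces this many different responses.
server_outcomes = len({server_class(ua) for ua in user_agents})

print(proxy_variants, server_outcomes)
```

Five clients lead to five cache variants although the server only distinguishes two cases; with hundreds of real UserAgent strings the cache hit rate approaches zero.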