Feature: Store URL Rewriting

Details

My main focus with this feature is to support caching CDN-supplied content where the same resource is served from multiple locations. Initially I'm targeting Google content (Google Earth, Google Maps, Google Video, YouTube), but the same technique can be used to cache similar content from CDNs such as Akamai (think "Microsoft Updates").

The current changes to Squid-2.HEAD implement the functionality through a number of structural changes:

  • The "Rewrite" code in client_side.c is broken out into client_side_rewrite.c;
  • This was used as a template for "store URL" rewriting in client_side_storeurl_rewrite.c;
  • An external helper (using exactly the same data format as a redirect helper!) receives URLs and can rewrite them to a canonical form. These rewritten URLs are stored as "store_url" URLs, separate from the normal URL;
  • The existing/normal URLs are used for ACL and forwarding;
  • The "store_url" URLs are used for the store key lookup and storage;
  • A new meta type has been added, STORE_META_STOREURL, which means the on-disk object format has slightly changed. This is no big deal: Squid may warn about an unknown metadata type if you roll back to another Squid version after trying this feature, but it won't affect the operation of your cache.
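The split between the request URL and the store URL can be pictured with a small model. This is an illustrative Python sketch, not Squid's code: `TinyCache`, its `get` method and the `canonical` function are hypothetical names, and the real store key is a hash rather than the URL string itself.

```python
import re

def canonical(url):
    """Hypothetical stand-in for the external helper: fold mt, mt0..mt3 into one name."""
    return re.sub(r"^http://mt\d*\.google\.com/",
                  "http://map-srv.google.com.SQUIDINTERNAL/", url)

class TinyCache:
    """Toy cache: looks objects up by store URL, fetches by the original URL."""

    def __init__(self, rewrite):
        self.rewrite = rewrite
        self.store = {}      # store URL -> cached body
        self.fetched = []    # original URLs that were forwarded upstream

    def get(self, url, fetch):
        key = self.rewrite(url)          # store key comes from the rewritten URL
        if key in self.store:
            return self.store[key]       # hit, even if the hostname differs
        self.fetched.append(url)         # forwarding still uses the client's URL
        body = fetch(url)
        self.store[key] = body
        return body
```

With this model, a request for a tile on mt3.google.com is answered from the object stored earlier under mt0.google.com, while only the first request is ever forwarded.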

Squid Configuration

First, you need to determine which URLs to send to the store url rewriter.

acl store_rewrite_list dstdomain mt.google.com mt0.google.com mt1.google.com mt2.google.com
acl store_rewrite_list dstdomain mt3.google.com
acl store_rewrite_list dstdomain kh.google.com kh0.google.com kh1.google.com kh2.google.com
acl store_rewrite_list dstdomain kh3.google.com
acl store_rewrite_list dstdomain kh.google.com.au kh0.google.com.au kh1.google.com.au
acl store_rewrite_list dstdomain kh2.google.com.au kh3.google.com.au

# This needs to be narrowed down quite a bit!
acl store_rewrite_list dstdomain .youtube.com

storeurl_access allow store_rewrite_list
storeurl_access deny all

Then you need to configure a rewriter helper.

storeurl_rewrite_program /Users/adrian/work/squid/run/local/store_url_rewrite

Then, to cache the content in Google Maps/etc, you need to change the defaults so that content with a "?" in the URL isn't automatically made uncacheable. Search your configuration and remove these two lines:

#We recommend you to use the following two lines.
acl QUERY urlpath_regex cgi-bin \?
cache deny QUERY 

Make sure you check your configuration file for cache and no_cache directives; you need to disable them and use refresh_pattern rules where applicable to tell Squid what not to cache!

Then, add these refresh patterns at the bottom of your refresh_pattern section.

refresh_pattern -i (/cgi-bin/|\?)   0       0%      0
refresh_pattern .                   0       20%     4320

These rules make sure that you don't try caching cgi-bin and ? URLs unless expiry information is explicitly given. Make sure you don't add the rules after a "refresh_pattern ." line; refresh_pattern entries are evaluated in order and the first match is used! The last entry must be the "." entry!
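The first-match behaviour is easy to get wrong, so here is a rough Python sketch of it. It only models rule selection: `RULES` and `first_match` are hypothetical names, and Python's `re` module stands in for the regex library Squid actually uses.

```python
import re

# (pattern, case_insensitive, min_minutes, percent, max_minutes), in config order
RULES = [
    (r"(/cgi-bin/|\?)", True, 0, 0, 0),
    (r".", False, 0, 20, 4320),
]

def first_match(url, rules=RULES):
    """Return (min, percent, max) from the first rule whose pattern matches."""
    for pattern, icase, minimum, percent, maximum in rules:
        flags = re.IGNORECASE if icase else 0
        if re.search(pattern, url, flags):
            return (minimum, percent, maximum)
    return None
```

If the two rules are listed in the opposite order, the "." rule matches every URL first and the cgi-bin/? rule is never reached, which is why the "." entry has to come last.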

Storage URL re-writing Helper

Here's what I've been using:

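#!/usr/local/sbin/perl
# Store URL rewrite helper: maps mirror hostnames onto one canonical store URL.
# Adjust the interpreter path above to match your system.
use strict;
use warnings;

$| = 1;    # unbuffered output: Squid waits for each reply line

while (<>) {
        chomp;
        # print STDERR $_ . "\n";
        # Each input line is the URL followed by extra fields; the trailing
        # space in each pattern marks the end of the URL.
        if (m/kh(.*?)\.google\.com(.*?)\/(.*?) /) {
                # Google Earth (keyhole) hosts; $2 keeps any country suffix
                print "http://keyhole-srv.google.com" . $2 . ".SQUIDINTERNAL/" . $3 . "\n";
                # print STDERR "KEYHOLE\n";
        } elsif (m/mt(.*?)\.google\.com(.*?)\/(.*?) /) {
                # Google Maps tile hosts
                print "http://map-srv.google.com" . $2 . ".SQUIDINTERNAL/" . $3 . "\n";
                # print STDERR "MAPSRV\n";
        } elsif (m/^http:\/\/([A-Za-z]*?)-(.*?)\.(.*)\.youtube\.com\/get_video\?video_id=(.*?) /) {
                # http://lax-v290.lax.youtube.com/get_video?video_id=jqx1ZmzX0k0
                # Non-greedy capture: a greedy (.*) would run on to the last space in the line.
                print "http://video-srv.youtube.com.SQUIDINTERNAL/get_video?video_id=" . $4 . "\n";
        } else {
                # No match: echo the line back unchanged
                print $_ . "\n";
        }
}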
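
To sanity-check the patterns before wiring a helper into Squid, the same mapping can be written as a pure function and exercised with sample URLs. This is a Python sketch rather than a replacement for the Perl helper: `store_url` is a hypothetical name, and it takes only the URL, not the whole input line, so the patterns end at the end of the string instead of at a space.

```python
import re

def store_url(url):
    """Map a mirror URL to its canonical store URL, or return it unchanged."""
    m = re.search(r"kh(.*?)\.google\.com(.*?)/(.*)", url)
    if m:
        # group(2) keeps any country suffix such as ".au"
        return "http://keyhole-srv.google.com%s.SQUIDINTERNAL/%s" % (m.group(2), m.group(3))
    m = re.search(r"mt(.*?)\.google\.com(.*?)/(.*)", url)
    if m:
        return "http://map-srv.google.com%s.SQUIDINTERNAL/%s" % (m.group(2), m.group(3))
    m = re.match(r"http://([A-Za-z]*?)-(.*?)\.(.*)\.youtube\.com/get_video\?video_id=(.*)", url)
    if m:
        return "http://video-srv.youtube.com.SQUIDINTERNAL/get_video?video_id=" + m.group(4)
    return url
```

Two URLs that differ only in the mirror hostname should come back identical; if they don't, the objects will be stored under separate keys and the rewrite gains nothing.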

SQUIRM is a simple, very fast rewriter that is also worth checking out; it uses the regex library for pattern matching.

An even faster and slightly more featured rewriter is jesred.

How do I make my own?

The helper program must read URLs (one per line) on standard input, and write rewritten URLs or blank lines on standard output. Squid writes additional information after the URL which a redirector can use to make a decision.

Input line received from Squid:

[channel-ID] URL [key-extras]
channel-ID
  • This is the concurrency channel number. When concurrency is turned off (set to 1) this field and the following space will be completely missing.

URL
  • The URL received from the client. In Squid with ICAP support, this is the URL after ICAP REQMOD has taken place.
key-extras
  • Starting with Squid-3.5, additional parameters are passed to the helper; these may be configured with url_rewrite_extras. For backward compatibility, the default key-extras for URL helpers match the format fields sent by Squid-3.4 and older in this field position:

  •  ip/fqdn ident method [urlgroup] kv-pair
ip
  • This is the IP address of the client. Followed by a slash (/) as shown above.

fqdn
  • The FQDN (from reverse DNS) of the client, if any is known. Squid does not normally perform the lookup unless it is needed by logging or ACLs, and does not wait for any results unless ACLs are configured to wait. If none is available, "-" will be sent to the helper instead.

ident
  • The IDENT protocol username (if known) of the client machine. Squid will not wait for the IDENT username to become known unless there are ACLs which depend on it, so at the time re-writers are run the IDENT username may not yet be known. If none is available, "-" will be sent to the helper instead.

method
  • The HTTP request method. URL alterations and particularly redirection are only possible on certain methods, and some such as POST and CONNECT require special care.
urlgroup
  • Squid-2 will send this field with the URL-grouping tag which can be configured on http_port. Squid-3.x will not send this field.

kv-pair
  • One or more key=value pairs. Only "myip" and "myport" pairs documented below were ever defined and are sent unconditionally by Squid-3.4 and older:

    myip=...      Squid receiving address
    myport=...    Squid receiving port
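
The input line layout above can be split into named fields along these lines. This is a Python sketch that assumes the default key-extras layout described here; `parse_request` and the dictionary keys are hypothetical names, and a custom url_rewrite_extras setting would need different handling.

```python
def parse_request(line, concurrent=False):
    """Split one helper input line into its fields (default layout assumed)."""
    fields = line.split()
    req = {"channel": None, "url": None, "ip": None, "fqdn": None,
           "ident": None, "method": None, "urlgroup": None, "kv": {}}
    if concurrent and fields:
        req["channel"] = fields.pop(0)       # present only when concurrency is on
    if fields:
        req["url"] = fields.pop(0)
    if fields:
        ip, _, fqdn = fields.pop(0).partition("/")
        req["ip"] = ip
        req["fqdn"] = None if fqdn in ("", "-") else fqdn   # "-" means unknown
    if fields:
        ident = fields.pop(0)
        req["ident"] = None if ident == "-" else ident
    if fields:
        req["method"] = fields.pop(0)
    for token in fields:
        key, sep, value = token.partition("=")
        if sep:
            req["kv"][key] = value           # e.g. myip=..., myport=...
        else:
            req["urlgroup"] = token          # Squid-2 only
    return req
```

A helper that only needs the URL can ignore everything after it, but parsing the remaining fields tolerantly avoids breakage when a field is missing.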

Result line sent back to Squid:

[channel-ID] [result] [kv-pair] [URL]
channel-ID
  • When a concurrency channel-ID is received it must be sent back to Squid unchanged as the first entry on the line.

result
  • One of the result codes:

    OK     Success. A new URL is presented.
    ERR    Success. No change for this URL.
    BH     Failure. The helper encountered a problem.

  • Note: the result field is only accepted by Squid-3.4 and newer.

kv-pair
  • One or more key=value pairs. The key names reserved on this interface for URL re-writing:

    clt_conn_tag=...    Tag the client TCP connection (Squid-3.5).
    message=...         Reserved.
    rewrite-url=...     Re-write the transaction to the given URL.
    tag=...             Reserved.
    ttl=...             Reserved.
    *_=...              Key names ending in an underscore (_) are reserved for local administrators' use.

  • Note: the kv-pair field is only accepted by Squid-3.4 and newer.

  • Note: the kv-pair returned by this helper can be logged by the %note logformat code.

URL
  • The URL to be used instead of the one sent by the client. If no action is required, leave the URL field blank. The URL sent must be an absolute URL, i.e. starting with http:// or ftp:// etc.
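
The reply layout above can be assembled as follows. This is a Python sketch under stated assumptions: `format_reply`, `replies` and `main` are hypothetical names, the URL is assumed to contain no whitespace, and `result_codes` should only be enabled for a Squid version that accepts the result and kv-pair fields.

```python
import sys

def format_reply(new_url=None, channel=None, result_codes=False):
    """Build one reply line: [channel-ID] [result] [kv-pair] [URL]."""
    parts = []
    if channel is not None:
        parts.append(str(channel))           # echo the channel-ID back unchanged
    if result_codes:
        if new_url:
            parts += ["OK", "rewrite-url=" + new_url]
        else:
            parts.append("ERR")              # success, no change
    elif new_url:
        parts.append(new_url)                # older form: bare URL, or blank line
    return " ".join(parts)

def replies(lines, rewrite, concurrent=False, result_codes=False):
    """Yield one reply per input line; rewrite(url) returns a new URL or None."""
    for line in lines:
        fields = line.split()
        channel = fields.pop(0) if concurrent and fields else None
        url = fields[0] if fields else ""
        yield format_reply(rewrite(url), channel, result_codes)

def main(rewrite):
    # Flush after every line so Squid is not left waiting on a buffered reply.
    for reply in replies(sys.stdin, rewrite):
        sys.stdout.write(reply + "\n")
        sys.stdout.flush()
```

A script would call `main` with its own rewrite function as the entry point. Keeping `replies` separate from the I/O loop makes the protocol handling testable without a running Squid.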

Testing

Finally, restart Squid-2.HEAD and browse Google Maps, then check your access.log and store.log to make sure URLs are being cached! In store.log, the Google Maps/Earth images should be stored in the cache (SWAPOUT) and not just RELEASEd immediately.
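
The store.log check can be automated with a few lines. This is a Python sketch, and it assumes that the action tag (SWAPOUT, RELEASE, ...) is the second whitespace-separated field and the URL is the last field on each line; verify that against your own store.log before relying on it. `count_actions` is a hypothetical name.

```python
from collections import Counter

def count_actions(lines, url_substring):
    """Count store.log actions for entries whose URL contains url_substring."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue                         # skip blank or truncated lines
        action, url = fields[1], fields[-1]  # assumed field positions
        if url_substring in url:
            counts[action] += 1
    return counts
```

Passing an open store.log file object and a substring such as "google.com" gives a quick tally; a healthy result shows SWAPOUT entries for the map tiles rather than only RELEASE.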



Features/StoreUrlRewrite (last edited 2013-07-27 04:58:20 by AmosJeffries)