Feature: Store ID

Details

"Store ID" is another name for the Squid cache key. By default, store IDs are computed by Squid so that different URLs are mapped to different store IDs. This feature allows the proxy admin to specify a custom store ID calculation algorithm via a helper program. It is usually used to assign the same store ID to transactions with different request URLs. Such mapping may reduce misses (i.e., increase hit ratio) when dealing with CDN URLs and similar cases where different URLs are known to point to essentially the same content.

Store ID violates HTTP and causes havoc if URLs pointing to different content are incorrectly mapped to the same Store ID. A Squid admin lacks control over URL-to-content mapping used by external CDNs and content providers. Even if the initial reverse engineering of their URL space is successful, maintaining the Store ID helper correctness is usually difficult because of sudden external mapping changes.

This feature is a port of the Squid-2.7 Store-URL feature. However, it works in a slightly different way: Squid-3.4 and later apply all store and cache related work, including refresh_pattern, against the Store ID and not the request URL. This gives the admin more flexibility in how the helper can be used.

This feature will later allow us to implement Metalink support in Squid.

Known Issues

  • Using StoreID on two URLs assumes that the resources presented by each are exact duplicates, down to the metadata used by HTTP conditional and revalidation requests.

    • {X} Care must be taken when using a StoreID helper that the URLs are indeed precise duplicates, or the end result may be a reduced HIT ratio and bad proxy performance rather than improved caching.

    • For example, HTTP ETag header values are only guaranteed to be unique per URL. If a CDN uses a different ETag on each server, then conditional requests involving ETag will MISS or REFRESH more often despite the content object/file being identical, possibly consuming more bandwidth than if StoreID was not used at all.
  • StoreID causes HTTP redirect loops if Squid is not configured to avoid caching redirection responses (HTTP allows caching of some redirection responses). If both the request URL and the corresponding redirect response Location URL are mapped to the same Store ID, the redirected request will hit on the cached redirection response, creating a loop. See Squid bug 3937.

  • ICP and HTCP support is missing.
    • URL queries received from cache_peer siblings are not passed through the StoreID helper, so the resulting store/cache lookup will MISS on URLs normally altered by StoreID.
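
The redirect loop described in the list above can be reproduced with a toy model. The following is a simulation only, not Squid code: the cache is a plain Hash keyed by Store ID, and the store_id mapping, the ORIGIN table and all URLs are hypothetical.

```ruby
# Toy model of a cache keyed by Store ID (all names and URLs are hypothetical).

# Hypothetical mapping: every mirror host collapses to one internal ID.
def store_id(url)
  url.sub(%r{\Ahttp://[a-z0-9]+\.cdn\.example\.com/}, "http://cdn.example.squid.internal/")
end

# Hypothetical origin behaviour: the generic host redirects to a mirror.
ORIGIN = {
  "http://www.cdn.example.com/file.bin" => { status: 302, location: "http://m1.cdn.example.com/file.bin" },
  "http://m1.cdn.example.com/file.bin"  => { status: 200, body: "payload" }
}

# Fetches a URL through the toy cache, following redirects up to max_hops.
# Returns :ok when a 200 is reached and :loop when the hop limit is exceeded.
def fetch(url, cache, cache_redirects:, max_hops: 5)
  max_hops.times do
    key = store_id(url)
    response = cache[key] || ORIGIN.fetch(url)
    cacheable = response[:status] == 200 || cache_redirects
    cache[key] ||= response if cacheable
    return :ok if response[:status] == 200
    url = response[:location]
  end
  :loop
end

puts fetch("http://www.cdn.example.com/file.bin", {}, cache_redirects: true)   # prints "loop"
puts fetch("http://www.cdn.example.com/file.bin", {}, cache_redirects: false)  # prints "ok"
```

With redirect caching enabled, the request for the mirror URL hits the cached 302 stored under the shared Store ID and is sent back to the same mirror URL until the hop limit is reached.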

Available Helpers

  • Eliezer Croitoru has designed several Ruby helpers, including the example helper here.

  • storeid_file_rewrite by Alan Mizrahi is a simple helper which is packaged with Squid-3.4. It can be used to load a database of patterns without needing to edit the code of the helper internals.

  • {i} Any helper previously designed for the Squid-2.7 StoreURL feature is expected to work with Squid-3.4. However, upgrading the response syntax sent back to Squid is advised for better performance and forward-compatibility with future Squid versions.

  • {i} Older URL-rewriter programs such as SQUIRM and Jesred will also work using the above backward-compatibility support. However, newer URL-rewrite helpers designed for the Squid-3.4 response syntax WILL NOT work on the Store-ID interface unless they have specific Store-ID interface support.

A CDN Pattern Database

Now that the feature itself is designed and what remains is to support basic and advanced usage, we can move on towards a database of CDN patterns which can be shared by various helper designs.

The DB of patterns provides de-duplication for content such as the SourceForge CDN network or Linux distribution repository mirrors. Contributions are welcome.
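
As a rough illustration of how a helper might consume such a database, the sketch below keeps the patterns in a table of regex and Store ID template pairs, separate from the lookup logic. The PATTERNS entries and the lookup_store_id name are assumptions made for this example; they are not the contents or the format of any maintained pattern database.

```ruby
# Illustrative pattern table: each entry pairs a URL regex with a Store ID
# template. The entries below are assumptions, not a maintained database.
PATTERNS = [
  [%r{\Ahttp://[a-zA-Z0-9\-_.]+\.dl\.sourceforge\.net/(.*)}, 'http://dl.sourceforge.net.squid.internal/\1'],
  [%r{\Ahttp://i[0-9]+\.ytimg\.com/(.*)},                    'http://ytimg.squid.internal/\1']
]

# Returns the Store ID for the first matching pattern, or nil if none match.
def lookup_store_id(url)
  PATTERNS.each do |regex, template|
    return url.sub(regex, template) if url =~ regex
  end
  nil
end

puts lookup_store_id("http://freefr.dl.sourceforge.net/project/vlc/2.0.5/win32/vlc-2.0.5-win32.exe")
puts lookup_store_id("http://i2.ytimg.com/vi/95b1zk3qhSM/hqdefault.jpg")
p lookup_store_id("http://www.google.com/")
```

Keeping the table apart from the code means new CDNs can be added without touching the helper internals.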

Do we want to cache YouTube videos?

Rather than asking "is it possible to be done?", the real question is "do we really want to cache YouTube videos?"

The answer to that, from my point of view, is: in most places YouTube videos will be served from a CDN network that is quite close by.

If you are in a place where YouTube CDN networks or Akamai are already present, then you can try to consider caching YouTube videos.

It is possible to cache YouTube videos and content, but since YouTube videos are not small files, caching them can sometimes cause bad performance when Squid is badly tuned for this sole purpose.

A cache proxy server admin should consider a couple of aspects, including the true overhead of doing it.

URL-only based StoreID compared to deep inspection based StoreID

There are a couple of ways to determine an object's StoreID. Currently the Squid StoreID helper interface only allows the StoreID to be determined from the request URL, which is very limiting since not all URLs contain static identification data.

One great example is token based access control for downloads. The user never gets a URL that can be related to some unique ID of the file or object in the request, but instead gets a URL with a random or encrypted token which results in the download of the file.

Such a URL is unique on each download request and therefore cannot be predicted from URLs alone. In order to tie it to some StoreID, there is a need for some deep HTTP content inspection.

These days many sites use a POST request to fetch the unique download URL or token. If we inspect the full request and response using ICAP or eCAP, we can easily learn which ID the token based URLs embedded in the POST response should be tied to.

In theory, an ID computed by an eCAP service can already be passed to Squid via eCAP annotations (a.k.a. meta headers) and then passed to the storeID helper via store_id_extras. Currently, ICAP services do not support the option to send a StoreID as a part of the request and response processing.

An adaptation service can use a memory-only DB such as memcached or redis to store the StoreID for specific request URLs. The StoreID helper later fetches these IDs and returns them to Squid.

The above ICAP + StoreID helper idea has worked in production on more than one site for quite some time, but it has some overhead and I would rate this kind of setup as expert only.
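
The lookup flow described above can be sketched as follows. This is a minimal illustration, not a production design: a Hash stands in for memcached or redis, and the TokenStore class, the helper_reply function and the URLs are hypothetical names introduced for this example.

```ruby
# Sketch of the shared-lookup idea. A Hash stands in for memcached/redis;
# the class and method names are hypothetical.
class TokenStore
  def initialize
    @ids = {}
  end

  # Called by the adaptation service once it has seen the POST response
  # that hands out a token based download URL.
  def remember(token_url, store_id)
    @ids[token_url] = store_id
  end

  # Called by the StoreID helper for each request URL.
  def store_id_for(url)
    @ids[url]
  end
end

# Builds the helper reply line for one URL (no concurrency channel-ID).
def helper_reply(store, url)
  id = store.store_id_for(url)
  id ? "OK store-id=#{id}" : "ERR"
end

store = TokenStore.new
store.remember("http://dl.example.com/get?token=abc123",
               "http://dl.example.squid.internal/files/report.pdf")
puts helper_reply(store, "http://dl.example.com/get?token=abc123")
puts helper_reply(store, "http://dl.example.com/get?token=zzz")
```

In a real deployment the adaptation service and the helper are separate processes, so the shared store has to be an external service rather than an in-process Hash.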

Squid Configuration

A small example for StoreID refresh pattern

refresh_pattern ^http://(youtube|ytimg|vimeo|[a-zA-Z0-9\-]+)\.squid\.internal/.*  10080 80%  79900 override-lastmod override-expire ignore-reload ignore-must-revalidate ignore-private

acl rewritedoms dstdomain .dailymotion.com .video-http.media-imdb.com .c.youtube.com av.vimeo.com .dl.sourceforge.net .ytimg.com .vid.ec.dmcdn.net .videoslasher.com

store_id_program /usr/local/squid/bin/new_format.rb
store_id_children 40 startup=10 idle=5 concurrency=0
store_id_access allow rewritedoms !banned_methods
store_id_access deny all

Note that the banned_methods ACL referenced above is not defined in this example and must be declared elsewhere in squid.conf.

An example for input and output of the helper:

root# /usr/local/squid/bin/new_format.rb

ERR
http://i2.ytimg.com/vi/95b1zk3qhSM/hqdefault.jpg
OK store-id=http://ytimg.squid.internal/vi/95b1zk3qhSM/hqdefault.jpg
  • {i} From Squid-3.5, this helper can support any value for the concurrency setting.

Developers info

Helper Example

  • /!\ This helper is an example. It is provided without any warranty or guarantees and is not recommended for production use.

There is a newer StoreID helper which contains more URL patterns and can be used to learn how such patterns are written.

#!/usr/bin/ruby
# encoding: utf-8

require 'syslog'

class Cache
  # Returns the path after the SourceForge mirror host name,
  # or nil when the URL is not a SourceForge download URL.
  def sfdlid(url)
    m = url.match(/^http:\/\/.*\.dl\.sourceforge\.net\/(.*)/)
    m ? m[1] : nil
  end
end

# Maps a request URL to a Store ID. Returns nil when no mapping applies.
def rewriter(request)
  case request
  when /^http:\/\/[a-zA-Z0-9\-\_\.]+\.dl\.sourceforge\.net\/.*/
    vid = $cache.sfdlid(request)
    vid ? "http://dl.sourceforge.net.squid.internal/" + vid : nil
  when /^quit.*/
    exit 0
  else
    nil
  end
end

def log(msg)
  Syslog.log(Syslog::LOG_ERR, "%s", msg)
end

# Handles one request line that starts with a concurrency channel-ID.
def conc(request)
  channel, url = request.split
  log("Original request [#{request.strip}].") if $debug
  result = url ? rewriter(url) : nil
  reply = result ? "OK store-id=" + result : "ERR"
  # Echo the channel-ID back as the first field of the reply.
  reply = channel + " " + reply if channel
  log("modified response [#{reply}].") if $debug
  puts reply
end

# Handles one request line without a channel-ID.
def noconc(request)
  url = request.split.first
  log("Original request [#{request.strip}].") if $debug
  result = url ? rewriter(url) : nil
  reply = (result && result.size > 10) ? "OK store-id=" + result : "ERR"
  log("modified response [#{reply}].") if $debug
  puts reply
end

# Accepts only ASCII request lines with a valid encoding.
def validr?(request)
  return true if request.ascii_only? && request.valid_encoding?
  STDERR.puts("invalid request line: #{request}")
  false
end

def main
  Syslog.open('new_helper.rb', Syslog::LOG_PID)
  log("Started")

  # The first line decides the mode: a leading number followed by a space
  # is taken to be a concurrency channel-ID.
  request = gets
  return unless request
  handler = request.match(/^[0-9]+ /) ? method(:conc) : method(:noconc)
  handler.call(request)

  while (request = gets)
    handler.call(request) if validr?(request)
  end
end

$debug = true
$cache = Cache.new
STDOUT.sync = true
main

Helper Input\Output Example

#./new_helper.rb
http://freefr.dl.sourceforge.net/project/vlc/2.0.5/win32/vlc-2.0.5-win32.exe
OK store-id=http://dl.sourceforge.net.squid.internal/project/vlc/2.0.5/win32/vlc-2.0.5-win32.exe
http://www.google.com/
ERR
quit
#tail /var/log/messages
Feb 17 17:32:07 www1 new_helper.rb[21352]: Started
Feb 17 17:32:08 www1 new_helper.rb[21352]: Original request [http://freefr.dl.sourceforge.net/project/vlc/2.0.5/win32/vlc-2.0.5-win32.exe].
Feb 17 17:32:08 www1 new_helper.rb[21352]: modified response [OK store-id=http://dl.sourceforge.net.squid.internal/project/vlc/2.0.5/win32/vlc-2.0.5-win32.exe].
Feb 17 17:32:39 www1 new_helper.rb[21352]: Original request [http://www.google.com/].
Feb 17 17:32:39 www1 new_helper.rb[21352]: modified response [ERR].
Feb 17 17:32:51 www1 new_helper.rb[21352]: Original request [quit].

How do I make my own helper?

The helper program must read URLs (one per line) on standard input, and write OK with a unique identifier (ID) or ERR/BH lines on standard output. Squid writes additional information after the URL which a helper can use to make a decision.

Input line received from Squid:

[channel-ID] URL [key-extras]
channel-ID
  • This is the concurrency channel number. When concurrency is turned off (set to 0) this field and the following space will be completely missing.

URL
  • The URL received from the client. In Squid with ICAP support, this is the URL after ICAP REQMOD has taken place.
key-extras
  • Starting with Squid-3.5, additional parameters are passed to the helper and may be configured with store_id_extras. For backward compatibility, the default key-extras for URL helpers matches the format fields sent by Squid-3.4 and older in this field position:

  •  ip/fqdn ident method [urlgroup] kv-pair
ip
  • This is the IP address of the client. Followed by a slash (/) as shown above.

fqdn
  • The FQDN rDNS of the client, if any is known. Squid does not normally perform the lookup unless needed by logging or ACLs. Squid does not wait for any results unless ACLs are configured to wait. If none is available, "-" will be sent to the helper instead.

ident
  • The IDENT protocol username (if known) of the client machine. Squid will not wait for the IDENT username to become known unless there are ACLs which depend on it, so at the time re-writers are run the IDENT username may not yet be known. If none is available, "-" will be sent to the helper instead.

method
  • The HTTP request method. URL alterations and particularly redirection are only possible on certain methods, and some such as POST and CONNECT require special care.
urlgroup
  • Squid-2 will send this field with the URL-grouping tag which can be configured on http_port. Squid-3.x will not send this field.

kv-pair
  • One or more key=value pairs. Only "myip" and "myport" pairs documented below were ever defined and are sent unconditionally by Squid-3.4 and older:

    myip=...

    Squid receiving address

    myport=...

    Squid receiving port
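
The input line layout described above can be split into its parts with a few lines of Ruby. This is a sketch under stated assumptions: the Request struct, the parse_request and kv_pairs names and the sample line are made up for illustration, and the helper is assumed to know from its own configuration whether a channel-ID is present.

```ruby
# Parsed form of one request line; field names are chosen for this sketch.
Request = Struct.new(:channel, :url, :extras)

# Splits "[channel-ID] URL [key-extras]" into its parts.
# concurrent: true when the helper is run with concurrency greater than 0.
def parse_request(line, concurrent:)
  fields = line.split
  channel = concurrent ? fields.shift : nil
  url = fields.shift
  Request.new(channel, url, fields)
end

# Collects key=value pairs (such as myip= and myport=) from the extras.
def kv_pairs(extras)
  extras.select { |f| f.include?("=") }.map { |f| f.split("=", 2) }.to_h
end

req = parse_request("0 http://example.com/a 192.0.2.1/- - GET myip=192.0.2.10 myport=3128",
                    concurrent: true)
puts req.channel
puts req.url
p kv_pairs(req.extras)
```

The URL is removed from the field list before the key=value scan, so an equals sign inside the URL query string is not mistaken for a kv-pair.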

Result line sent back to Squid:

[channel-ID] result kv-pair
channel-ID
  • When a concurrency channel-ID is received it must be sent back to Squid unchanged as the first entry on the line.

result
  • One of the result codes:

    OK

    Success. A new storage ID is presented for this URL.

    ERR

    Success. No change for this URL.

    BH

    Failure. The helper encountered a problem.

kv-pair
  • One or more key=value pairs. The key names reserved on this interface for URL re-writing:

    clt_conn_tag=...

    Tag the client TCP connection (Squid-3.5)

    message=...

    reserved

    store-id=...

    set the cache storage ID for this URL.

    tag=...

    reserved

    ttl=...

    reserved

    *_=...

    Key names ending in (_) are reserved for local administrators use.

  • {i} The kv-pair returned by this helper can be logged by the %note logformat code.

{i} This interface will also accept responses in the syntax delivered by Store URL-rewrite feature helpers written for Squid-2.7. However, that syntax is deprecated and such helpers should be upgraded as soon as possible to use this Store-ID syntax.
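
The result line layout described above can be produced with a small formatting function. This is a minimal sketch: the format_reply name and its keyword arguments are assumptions for this example, and values containing whitespace would need quoting or encoding, which is not handled here.

```ruby
# Formats one reply line: "[channel-ID] result kv-pair".
# result is :ok, :err or :bh; pairs is a Hash of key=value annotations.
def format_reply(result, channel: nil, pairs: {})
  code = { ok: "OK", err: "ERR", bh: "BH" }.fetch(result)
  kv = pairs.map { |key, value| "#{key}=#{value}" }
  [channel, code, *kv].compact.join(" ")
end

puts format_reply(:ok, pairs: { "store-id" => "http://ytimg.squid.internal/vi/95b1zk3qhSM/hqdefault.jpg" })
puts format_reply(:err, channel: 7)
puts format_reply(:bh, channel: 7)
```

Passing the channel-ID through unchanged as the first field keeps the same function usable with and without concurrency.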



Features/StoreID (last edited 2016-01-27 16:27:00 by AlexRousskov)