ClientStreams provides an API to retrieve and manipulate data from squid, from inside squid. Squid's ClientSide processing uses ClientStreams to fulfill standard client HTTP requests.
What follows is a very slightly edited transcript (with permission) of an IRC chat about ClientStreams, it needs to be cleaned up and made more organised...
| 14:48 | nicholas | Hi. I'm working on bug 1160 (analyze HTML to prefetch embedded objects). I can't figure out why, but even though it |
| 14:49 | lifeless | I'd use the same mechanism ESI does. |
| 14:49 | nicholas | Ok, that's client streams. |
| 14:49 | lifeless | the fwdState api is on the wrong side of the store |
| 14:49 | nicholas | doh! |
| 14:49 | lifeless | so it doesn't have any of the required logic - cachability, vary handling, updates of existing opbjects... |
| 14:50 | lifeless | things like store digests just haven't been updated to use client streams yet. |
| 14:50 | nicholas | What, concisely, is a store digest? |
| 14:51 | lifeless | a bitmap that lossilly represents the contents of an entire squid cache, biased to hits. |
| 14:51 | lifeless | uses a thing called a bloom filter |
| 14:52 | lifeless | it lets squid predict that another cache will have a given object, for predictive routing (as opposed to ICP which is |
| 14:52 | nicholas | That's strange, but ok. I suppose it's necessary for performance when you have a large number of cached objects. |
| 14:52 | lifeless | well its an optional feature. |
| 14:53 | nicholas | Ok, I tried to track a standard request through the code and it runs through http.cc. Http.cc uses the store, but it |
| 14:53 | lifeless | right. |
| 14:54 | lifeless | http.cc just retrieves http objects, like ftp.cc retrieves ftp objects. |
| 14:54 | nicholas | I'm on the wrong side of the fence. Gotcha. |
| 14:54 | nicholas | Honestly, I spent about 5 days trying to understand the client stream API. |
| 14:55 | nicholas | Concisely, what is a client stream? I suggested that they're a chain of observers to the results of a request. Is that |
| 14:56 | lifeless | http://www.squid-cache.org/Doc/FAQ/FAQ-16.html |
| 14:57 | lifeless | client streams.. |
| 14:57 | lifeless | they are a 'chain of responsibility' pattern. |
| 14:58 | lifeless | sortof. |
| 14:58 | lifeless | the clientStream code was started in C in the squid 2.6 timeframe, it needs an overhaul badly, now we can actually |
| 14:59 | nicholas | Right, I noticed that the code is in flux. Might I add that I don't like CBDATA either ... not that I'm offering to do |
| 15:00 | nicholas | For a ClientStreamData, I'm supposed to create my own Data class which is derived from, er, Refcountable? Then let the |
| 15:01 | nicholas | See, I don't really understand what my callbacks are really supposed to do, since I only want "default" behaviour. As |
| 15:02 | lifeless | well you don't want that. |
| 15:02 | lifeless | because you don't want to parse requests. |
| 15:02 | lifeless | ClientSocketContext is likely to be the closest thing to what you want though. |
| 15:03 | lifeless | so your readfunc needs to eat all the data it recieves. |
| 15:04 | lifeless | you can throw it away. |
| 15:04 | lifeless | your detach function can just call clientStreamDetach(node, http); |
| 15:04 | nicholas | so do I add my function into ClientSocketContext's read function? |
| 15:04 | lifeless | see clientSocketDetach |
| 15:04 | nicholas | or do I add another node in the clientStream? |
| 15:04 | lifeless | no, you should have all your stuff in its own .cc file. |
| 15:04 | lifeless | you'll construct a new clientStream to service your requests. |
| 15:05 | nicholas | Oh it is, but somebody has to enter my .cc file at some point, right? |
| 15:05 | lifeless | right, you should have that already written though - whatever is doing the parsing should already be a clientStream |
| 15:06 | nicholas | Nope. I just hacked it into http.cc. |
| 15:06 | lifeless | if its not, then don't worry for now, get it working is the first step. |
| 15:06 | nicholas | Not that I can't move it pretty easily. |
| 15:06 | nicholas | Everything works, except that it doesn't cache what it fetches. And now I know why. |
| 15:06 | lifeless | your Status calls should always return prev()->status() |
| 15:07 | lifeless | the callback call is the one that is given the data, it too should throw it away. |
| 15:08 | nicholas | and someone else will cache it? |
| 15:08 | lifeless | yes |
| 15:08 | nicholas | ok, I assume you're talking about just the fetching part? |
| 15:08 | lifeless | I'm talking about the clientStream node you need to implement. |
| 15:09 | nicholas | so when I know a URL that I want to prefetch, I create my clientStream with this one node that you just described. |
| 15:10 | lifeless | ESIInclude.cc shows this well |
| 15:10 | nicholas | I've spent a lot of time reading it, but since I didn't understand clientStreams, I never managed to quite figure it |
| 15:11 | lifeless | ok, start with ESIInclude::Start |
| 15:11 | lifeless | this calls clientBeginRequest |
| 15:12 | nicholas | esiBufferRecipient seems to do a lot of work, including checking whether the HTTP stream succeeded or failed, and |
| 15:12 | lifeless | it passes in the clientStream callbacks - esiBufferRecipient, esiBufferDetach, the streamdata (stream.getRaw()), the |
| 15:12 | nicholas | oh right, this code. Yes, I cut'n'pasted this in, but I never got it working for me. |
| 15:12 | lifeless | esiBuffer recipient copies the object back into the ESI master document. |
| 15:12 | lifeless | so it has to do a bunch more work than you'll need to. |
| 15:13 | nicholas | stream.getRaw() is a pointer to the node, yes? I could the code around that confusing. |
| 15:14 | lifeless | stream is a ESIStreamContext which is a clientStream node that pulls data from a clientstream, instances of which are |
| 15:14 | lifeless | (different instances, but hte logic is shared by composition) |
| 15:14 | lifeless | that is pased into ESIInclude::Start because ESI includes have a primary include and an 'alternate' include. |
| 15:16 | lifeless | so all you need to start the chain is: |
| 15:16 | nicholas | I see. I won't need to worry about any of that. |
| 15:16 | lifeless | HttpHeader tempheaders(hoRequest); |
| 15:17 | lifeless | if (clientBeginRequest(METHOD_GET, url, aBufferRecipient, aBufferDetach, aStreamInstance, &tempheaders, |
| 15:17 | lifeless | { |
| 15:17 | lifeless | /* handle failure */ |
| 15:17 | lifeless | } |
| 15:17 | lifeless | httpHeaderClean (&tempheaders); |
| 15:18 | lifeless | that will cause callbacks to aBufferRecipient, aBufferDetach to occur |
| 15:19 | lifeless | then in the buffer recipient you throw them away, just check for status codes etc. |
| 15:19 | lifeless | and I've given you the skeleton for detach above. |
| 15:20 | lifeless | aStreamInstance is just a cbdata class that has your context. |
| 15:20 | lifeless | i.e. |
| 15:21 | lifeless | class myStream { |
| 15:21 | lifeless | public |
| 15:21 | lifeless | : |
| 15:21 | lifeless | static void BufferData (clientStreamNode *, ClientHttpRequest *, HttpReply *, StoreIOBuffer); |
| 15:21 | lifeless | static void Detach (clientStreamNode *, ClientHttpRequest *); |
| 15:22 | lifeless | private: |
| 15:22 | lifeless | CBDATA_CLASS2(myStream); |
| 15:22 | lifeless | void buferData (clientStreamNode *, ClientHttpRequest *, HttpReply *, StoreIOBuffer); |
| 15:22 | lifeless | void detach (clientStreamNode *, ClientHttpRequest *); |
| 15:22 | lifeless | } |
| 15:22 | lifeless | ; |
| 15:23 | lifeless | then in your .cc file... |
| 15:23 | lifeless | CBDATA_CLASS_INIT(myStream); |
| 15:23 | nicholas | the cbdata init line, i presume? |
| 15:23 | lifeless | those CBDATA macros setup new and delete to do the right thing. |
| 15:23 | lifeless | then your static functions are just |
| 15:23 | nicholas | i don't need to write my own void *operator new? |
| 15:24 | lifeless | no, you don't. |
| 15:24 | lifeless | void |
| 15:24 | nicholas | phew. :) |
| 15:24 | lifeless | myStream::BufferData (clientStreamNode *node, ClientHttpRequest *, HttpReply *, StoreIOBuffer) |
| 15:24 | lifeless | { |
| 15:24 | lifeless | if (!cbdataReferenceValid(node->data)) |
| 15:25 | lifeless | /* something weird has happened - your data has been freed, but a callback has still been issued. deal here */ |
| 15:25 | lifeless | static_cast<myStream *>(node->data)->bufferData(node, ...); |
| 15:25 | lifeless | } |
| 15:25 | lifeless | and likewise for the Detach static method |
| 15:26 | lifeless | is this making sense ? |
| 15:27 | nicholas | yes, but just let me reread a litt.e |
| 15:27 | lifeless | ok, theres one more important thing :) |
| 15:27 | nicholas | "static_cast<myStream *>(node->data)->bufferData(node, ...)" calls myStream::BufferData doesn't it? So why am I calling |
| 15:28 | lifeless | lowercase bufferData :) |
| 15:28 | nicholas | oh man, i thought that was just a typo. now i have to reread all of it! |
| 15:28 | lifeless | the static functions (denoted with the initial Capital) are thunks into the actual instance methods. |
| 15:29 | nicholas | which makes sense. yes. |
| 15:29 | lifeless | http://www.squid-cache.org/~robertc/squid-3-style.txt |
| 15:29 | nicholas | but what does bufferData actually do? let's see if i do understand this ... |
| 15:29 | nicholas | ... it'll receive the contents of the page that I requested from clientBeginRequest, so I just discard them. check? |
| 15:29 | lifeless | bufferData needs to do two things. it needs to check the status of node->next() |
| 15:30 | lifeless | and on everything other than error or end-of-stream, it needs to issue a new read. |
| 15:30 | nicholas | hm, ok. |
| 15:31 | lifeless | if something like a 404 occurs, you'll get that as the HttpReply in the first call to bufferData. |
| 15:31 | nicholas | and it will already be (negatively) entered into the cache for me |
| 15:31 | nicholas | so i just ... don't do anything. |
| 15:31 | lifeless | exactly. |
| 15:31 | lifeless | just swallow the data until node->next()->status() returns an error. |
| 15:32 | nicholas | if it was a successful read, but the connection is still open, i read more. |
| 15:32 | nicholas | ok. |
| 15:32 | nicholas | now let me ask you about the other half: analyzing pages that come in. |
| 15:32 | lifeless | if its not an error, to swallow more data you call ->readfunc() |
| 15:32 | lifeless | you'll need a buffer area in your class instance. |
| 15:32 | lifeless | (although to be tricky you could use a static buffer in your class, as you don't care about the data) |
| 15:33 | nicholas | (ah, nice trick! didn't think of that.) |
| 15:33 | nicholas | I told you earlier that I just hacked my analyzer into http.cc. While this works for me, is there a better place to put |
| 15:34 | lifeless | wbut I wouldn't worry about that - just have a HTTP_REQBUF_SZ char array in your private data. |
| 15:34 | nicholas | I was using SM_PAGE_SIZE. |
| 15:35 | lifeless | ok, where to put the analyzer ? we've got some rework we want to do in the request flow that would make this a lot |
| 15:35 | lifeless | I think that the right place for now, is exactly where esi goes, and after esi in the chain. |
| 15:35 | lifeless | the problem with where you are is that ftp pages won't be analysed. and if its an esi upstream then the urls could be |
| 15:35 | nicholas | http requests that come in from clients have a client stream chain? |
| 15:36 | lifeless | yup |
| 15:36 | nicholas | hunh. i didn't even notice. |
| 15:36 | lifeless | client_side_reply.cc line 1927 |
| 15:36 | nicholas | who installs ESIs ... |
| 15:36 | -!- | Irssi: Pasting 11 lines to #squiddev. Press Ctrl-K if you wish to do this or Ctrl-C to cancel. |
| 15:36 | lifeless | #if ESI |
| 15:36 | lifeless | if (http->flags.accel && rep->sline.status != HTTP_FORBIDDEN && |
| 15:36 | lifeless | !alwaysAllowResponse(rep->sline.status) && |
| 15:36 | lifeless | esiEnableProcessing(rep)) { |
| 15:36 | lifeless | debug(88, 2) ("Enabling ESI processing for %s\n", http->uri); |
| 15:36 | lifeless | clientStreamInsertHead(&http->client_stream, esiStreamRead, |
| 15:36 | lifeless | esiProcessStream, esiStreamDetach, esiStreamStatus, NULL); |
| 15:36 | lifeless | } |
| 15:36 | lifeless | #endif |
| 15:36 | nicholas | yep, i've got the code up here. |
| 15:37 | nicholas | clientStreamInsertHead. awesome. |
| 15:37 | lifeless | this says - if its an accelerated request that isn't an deny-error page, and its a response that is amenable to |
| 15:37 | nicholas | Nod. For me, I just need to know whether the mime-type is HTML or not. |
| 15:38 | lifeless | you'll want to add your head before esi, so that you come after esi in the processing. |
| 15:38 | nicholas | So the headers need to be complete and processed before I know whether to add myself. |
| 15:38 | lifeless | so right before that #if ESI line. |
| 15:39 | nicholas | Oh, I see it has the body at this point already? |
| 15:39 | nicholas | Or does it just have a partial body? |
| 15:39 | lifeless | it may have some body, but it definately has the reply metadata |
| 15:39 | nicholas | Because my code is rigged to work with partial data. |
| 15:39 | nicholas | ok, good. |
| 15:39 | nicholas | Then that's *exactly* right. |
| 15:39 | lifeless | so you can just look in rep-> to get the headers already parsed. |
| 15:39 | nicholas | yep. |
| 15:40 | lifeless | and you'll get called with whatever data is available in your buffer function. |
| 15:40 | nicholas | Perfect. |
| 15:40 | lifeless | your buffer function should analyse, then call node->next()->callback(node->next(), ...) |
| 15:41 | lifeless | when a read is issued, there is one complication : |
| 15:41 | nicholas | So that ESI or whomever can do it. |
| 15:41 | nicholas | s/it/their thing/ |
| 15:41 | lifeless | if the client wants a range request, the read issued to you may be for partial data. |
| 15:41 | nicholas | Will there be a flag on those? So I can avoid them? |
| 15:42 | lifeless | so you have a choice. like ESI you can force ranges off for what you request, and filter out what you supply according |
| 15:42 | lifeless | alternatively, and for you I think better, just don't add yourself to the chain at all if its a range request. |
| 15:42 | nicholas | Well, what I request will never be ranged. But, what I analyze isn't necessarily what I requested. |
| 15:43 | nicholas | It will normally be the request from the user agent. That's the point. |
| 15:43 | lifeless | in your if block in client_side_reply just check http->request->range |
