While listening to a podcast about how RSS feed readership is measured, I had an idea for improving how readership of feeds, and of web pages too, is measured. One of the things that makes it difficult to accurately measure traffic to any internet resource is that there may be proxy servers between the resource and the reader. Web-based feed reading services like Bloglines, for example, may fetch your feed once an hour, but then turn around and display it to 100 subscribers. If you only consider your server log, your readership estimate would be off by 9,900%! ...or 99%, depending on which direction you're counting (the 99 readers you missed are 9,900% of the 1 you logged, but 99% of the 100 who actually read). Either way, the error is huge.
How do we fix the problem? One approach is to put "web bugs" in the feed. Web bugs are little images, possibly even transparent images, that won't get cached by services like Bloglines. So while the feed may only get accessed once, the image may get accessed 100 times, giving you a better metric.
One problem with web bugs is that while Bloglines won't cache the image, web proxy servers might. So your numbers are still likely to be off.
Another problem is that people don't like web bugs. They may be used innocently in many cases, but in other cases, they're used for privacy-invading purposes. That results in people being suspicious of all web bugs.
So, what's my big idea? It's not one that will make the metrics problem go away, but it could help: create a new HTTP request header that proxies (whether web proxies, feed proxies, or whatever) can send whenever they refresh their caches that tells the origin server how many requests have been received for the resource. The first time, it might look like this:
X-Proxy-Count: 1
The first person requested it, so I'm asking you for it. When the proxy's cache expires and they ask for it again, it might look like this:
X-Proxy-Count: 100
Wow! 100 people have requested this resource since the last time the proxy fetched it!
The exact meaning of the header might be a little different for web proxies and feed proxies. For web proxies, it would be the actual number of requests received (where any request that itself included an X-Proxy-Count header would count as the number of requests claimed in that header). For feed proxies, ideally it would be the number of subscribers to the feed who had checked their subscriptions since the last refresh. But it might be the total number of subscribers to the feed, whether they'd checked the feed recently or not. That might be a good detail to nail down.
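To make the web proxy case concrete, here is a rough sketch of the bookkeeping a proxy might do. Everything in it is hypothetical: the class name `ProxyHitCounter`, its method names, and the choice to treat a missing or malformed count as a single request are my assumptions, not part of any real proxy software. It also assumes headers arrive as a plain dict keyed by the exact header name.

```python
from collections import defaultdict


class ProxyHitCounter:
    """Hypothetical per-URL tally of requests seen since the last cache refresh."""

    def __init__(self):
        self._counts = defaultdict(int)

    def record_request(self, url, headers):
        # A request from a downstream proxy carries its own X-Proxy-Count,
        # so it counts as that many requests. A plain client counts as 1.
        claimed = headers.get("X-Proxy-Count")
        try:
            n = int(claimed) if claimed is not None else 1
        except ValueError:
            n = 1  # assumption: treat an unparseable count as one request
        self._counts[url] += max(n, 1)

    def refresh_headers(self, url):
        # Called when the cache entry has expired and the proxy is about to
        # re-fetch from the origin: report the tally, then reset it.
        count = self._counts.pop(url, 0)
        return {"X-Proxy-Count": str(count)}


if __name__ == "__main__":
    counter = ProxyHitCounter()
    for _ in range(60):
        counter.record_request("/feed", {})              # 60 direct readers
    counter.record_request("/feed", {"X-Proxy-Count": "40"})  # one downstream proxy
    print(counter.refresh_headers("/feed"))
```

Resetting the tally on every refresh is what keeps the origin server from double counting when it adds up the numbers it receives over time.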
Since not all proxy servers would support this header, the metrics wouldn't be perfect, but it would help. And since only aggregate numbers would be sent, not the IP addresses of each subscriber, privacy advocates would be less likely to be bothered.
February 19th, 2007 at 10:54 am
Tim Bray wrote about this issue a few days ago. Expanding on what I wrote above, here's part of what I posted in his comments:
Proxy-Fetch-Count: 1000
Proxy-Active-Subscribers: 100
Proxy-Total-Subscribers: 300
The first would indicate how many times the resource was served from a caching proxy's cache since the last time the cache was refreshed. The second and third would be specific to proxies that handle subscriptions (i.e., online feed readers). The second would indicate the number of subscribers who had accessed the feed from the proxy's cache since the last time the cache was refreshed. The third would be the total number of clients currently subscribed to the resource, even if they hadn't accessed it recently.
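On the origin server side, reading these three headers could look something like the sketch below. The function name `parse_proxy_stats` and the result keys are made up for illustration, and it assumes the request headers are available as a plain dict keyed by the exact header names above. Missing, non-numeric, or negative values come back as `None` rather than being guessed at.

```python
# Proposed header name -> key used in the result (the keys are illustrative).
PROXY_HEADERS = {
    "Proxy-Fetch-Count": "fetch_count",
    "Proxy-Active-Subscribers": "active_subscribers",
    "Proxy-Total-Subscribers": "total_subscribers",
}


def parse_proxy_stats(headers):
    """Return the proxy-reported counts as ints, or None where absent or invalid."""
    stats = {}
    for header, key in PROXY_HEADERS.items():
        raw = headers.get(header)
        try:
            value = int(raw)
        except (TypeError, ValueError):
            value = None  # header missing or not a number
        if value is not None and value < 0:
            value = None  # a negative count makes no sense; ignore it
        stats[key] = value
    return stats


if __name__ == "__main__":
    request_headers = {
        "Proxy-Fetch-Count": "1000",
        "Proxy-Active-Subscribers": "100",
        "Proxy-Total-Subscribers": "300",
    }
    print(parse_proxy_stats(request_headers))
```

A plain web proxy would send only the first header, so the two subscriber fields would simply come back as `None` for it.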