Bayesian filtering for prioritizing digest reading?
by ZetaGecko
A few days ago I read that someone is integrating Bayesian filtering into a newsfeed aggregator (sorry, I can't find the reference). Last month, I wrote about the difficulty of finding quality newsfeeds and proposed some ideas for creating a better newsfeed directory (and then built a newsfeed directory based on those ideas). Bayesian filtering attacks the need to sift the wheat from the chaff, or to use a metaphor more commonly associated with Bayesian filtering, the ham from the spam, in a different way. Rather than finding the best newsfeeds and subscribing only to them, you subscribe to all kinds of feeds, and let the aggregator learn to show you the stories you're most likely to be interested in.
My initial reaction was that it sounded like a great idea, but upon further reflection, I suspect it might not work as well as hoped. Here are the issues:
1) If your aggregator learns what you're definitely interested in and what you definitely don't want to see, the things you read about all the time should bubble to the top, and the things you don't want to see should sink to the bottom (or be blocked completely). No problem so far. But what about the stories that aren't similar to anything you've read before? They'll end up in the middle. If you subscribe to too many newsfeeds, you'll never get down to the middle, and since the filter only learns from what you actually read, its opinion of what you want to see will get more and more skewed as time goes by. You could avoid this problem by having the filter just give each item a thumbs up or thumbs down, without ranking the items that got the thumbs up. It would still be useful, but not as useful as I originally imagined.
2) If you subscribe to a number of similar feeds, the filter will show you the items in each that you're likely to be interested in, which is good. But what if everyone is talking about the same thing? It won't remove the redundant items. So you still can't subscribe to too many feeds without getting overloaded. By skipping the redundant items, you may even inadvertently teach your filter that you're not interested in the topics you care about most. (To help avoid this, the software should let you give each item an explicit thumbs up, thumbs down, or neutral rating.)
3) Finally, if people DO succeed in using Bayesian filtering to enable them to subscribe to more newsfeeds without getting overloaded, the bandwidth issues I am always ranting about will grow.
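To make the "stuck in the middle" worry from point 1 concrete, here is a rough sketch of how such a filter might score stories. It is not taken from any real aggregator: the InterestFilter class, its rate and score methods, and the sample headlines are all made up for illustration, and it assumes a very simplified naive Bayes scheme that only looks at the words present in a story and ignores words it has never seen.

```python
import math
from collections import Counter


class InterestFilter:
    """Toy naive Bayes scorer (hypothetical, for illustration only)."""

    def __init__(self):
        # Per-rating counts of how many rated items contained each word.
        self.word_counts = {"up": Counter(), "down": Counter()}
        self.item_counts = {"up": 0, "down": 0}

    def rate(self, text, rating):
        # rating is "up", "down", or "neutral"; neutral items teach nothing.
        if rating == "neutral":
            return
        self.item_counts[rating] += 1
        self.word_counts[rating].update(set(text.lower().split()))

    def score(self, text):
        # Start from even odds; each previously seen word shifts the log-odds.
        log_odds = 0.0
        for word in set(text.lower().split()):
            ups = self.word_counts["up"][word]
            downs = self.word_counts["down"][word]
            if ups == 0 and downs == 0:
                continue  # assumption: never-seen words carry no evidence
            # Add-one smoothing keeps both probabilities away from zero.
            p_up = (ups + 1) / (self.item_counts["up"] + 2)
            p_down = (downs + 1) / (self.item_counts["down"] + 2)
            log_odds += math.log(p_up / p_down)
        return 1 / (1 + math.exp(-log_odds))


f = InterestFilter()
f.rate("python release notes", "up")
f.rate("python packaging tips", "up")
f.rate("celebrity gossip roundup", "down")
f.rate("celebrity diet secrets", "down")

stories = ["celebrity news", "volcano eruption photos", "python tutorial"]
for story in sorted(stories, key=f.score, reverse=True):
    print(f"{f.score(story):.2f}  {story}")
```

With these four ratings, the familiar topic lands near the top, the disliked one near the bottom, and the volcano story, which shares no words with anything rated so far, gets exactly 0.5. That is the middle of the list I was worried about: if you never scroll that far, the filter never gets a rating that would move it.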