DIY Bayesian filtering of RSS feeds on linux with rss2mail and a standard mail client (programming.reddit.com)
22 points posted 1 day ago by joelthelion


joelthelion 6 points 1 day ago *

Pondering the idea of Bayesian filtering (among whose applications a per-user Bayesian reddit isn't the least!), I started looking for an easy way to implement it. After considering hacking on existing RSS aggregators, I think I finally found a way to get Bayesian filtering of RSS feeds without writing any code: rss2email coupled with a spam-filtering mail client. Anybody can have it running in a few minutes, especially on linux. Here's how you do it:

  1. Install rss2email. On Fedora this is simply "sudo yum install rss2email"; it should be packaged for most distributions.
  2. Verify that the sendmail service is running.
  3. Configure rss2email (r2e) to deliver to your local mailbox: r2e new login@localhost
  4. Install a second mail client. You don't want to use your main one, because training it on feed items would screw up its spam filter. I used claws-mail, which is packaged for most distros and has a bogofilter plugin.
  5. Configure the new client to read your local mail.
  6. Run: r2e run && claws-mail. Voila! Your RSS feeds arrive in your mailbox. Throw away items you don't like by marking them as spam. The mail client will soon learn your tastes and show you only content you like. (well, mostly! :-))

EDIT: don't forget to mark good messages as "ham". Bogofilter needs training on both ham and spam.
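What bogofilter is doing with those ham/spam labels can be sketched in a few lines of Python. This is only a toy naive-Bayes scorer to illustrate the idea, not bogofilter's actual algorithm; the tokenizer, the add-one smoothing, and the training strings are all made up for the example:

```python
from collections import Counter
import math

def tokens(text):
    # crude tokenizer: lowercase, whitespace-separated words
    return text.lower().split()

class ToyBayes:
    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.totals = {"ham": 0, "spam": 0}

    def train(self, label, text):
        for t in tokens(text):
            self.counts[label][t] += 1
            self.totals[label] += 1

    def spamminess(self, text):
        # sum of per-token log-odds, with add-one smoothing so
        # unseen tokens don't blow up the score
        score = 0.0
        for t in tokens(text):
            p_spam = (self.counts["spam"][t] + 1.0) / (self.totals["spam"] + 2.0)
            p_ham = (self.counts["ham"][t] + 1.0) / (self.totals["ham"] + 2.0)
            score += math.log(p_spam / p_ham)
        return score  # positive leans spam, negative leans ham

b = ToyBayes()
b.train("ham", "monads in haskell are fun")
b.train("spam", "top ten celebrity diets")
print(b.spamminess("haskell monads"))  # negative: looks like ham
```

Marking a message as spam or ham in the client just feeds more text into the equivalent of train(); the real filter adds better tokenization and score combination on top of the same principle.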

ffrinch 5 points 18 hours ago

I don't know how well it'd work for reddit, though.

  • There aren't very many tokens in the link titles (and a lot of them are poorly-named besides).
  • The RSS feed doesn't include user names, so the filter would have no way of knowing that, say, "dons" is another word for "Haskell".
  • The RSS feed doesn't include votes. A nice feature would be automatic whitelisting based on a vote threshold.
  • The system should have special logic for tokenizing URLs, so good/bad domains can be recognized.

It just seems to me that you'd get something a lot more effective with a little custom programming (and this is programming.reddit, after all). How long would it really take to whip something up with Reverend or Orange?
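On the URL point: the special tokenization could be just a few lines on top of whatever filter you use. A sketch — the dom:/dompart: token scheme and the stop-list are invented for illustration, not taken from bogofilter or either library:

```python
from urllib.parse import urlparse

def url_tokens(url):
    # emit the full host plus its interesting parts as separate tokens,
    # so a filter can learn good/bad domains independently of page words
    host = urlparse(url).netloc.lower()
    boring = ("www", "com", "org", "net")
    toks = ["dom:" + host]
    toks += ["dompart:" + p for p in host.split(".") if p not in boring]
    return toks

print(url_tokens("http://www.slate.com/articles/foo.html"))
# ['dom:www.slate.com', 'dompart:slate']
```

Feeding these tokens into training alongside the title words would let the filter pick up on domains you consistently like or dislike.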

joelthelion 3 points 14 hours ago

You raise some very good points. I agree that what I did is very crude, but I wanted a way to actually test my ideas without spending too much time on them. I did try to do something with Reverend and Straw, but understanding the RSS reader's existing code and hacking its GUI are two nontrivial tasks, especially if you don't know GTK.

Anyway, I've been testing my experiment for a few hours now, and it turns out that it already gives interesting results. So I guess it would be very interesting to do something better, but I really don't have the time to do it.

joelthelion 2 points 13 hours ago *

I just emailed the reddit staff about the content of the RSS feeds, asking them to add the submitter's nick, the domain name (e.g. slate.com), and a few words from the page, just like Google does in its search results. We'll see what they do about it :)

indigoviolet 4 points 19 hours ago *

Opera does something like this out of the box.

www.opera.com

Opera's RSS reader is integrated with its mail client, M2, which in turn has trainable filters.

akkartik 3 points 19 hours ago

Tell me more (or point me to a link)

indigoviolet 3 points 18 hours ago

I edited the comment above to add some information.

joelthelion 2 points 10 hours ago

I've created this little python script to help explore bogofilter's database:

           
#!/usr/bin/python
# Explore bogofilter's wordlist: top ham/spam tokens and strongest indicators.

import os

def print_list(items):
    for i in items:
        print i

if __name__ == "__main__":
    show = 11
    significative_threshold = 15  # minimum hits for a token to count

    # dump bogofilter's per-word statistics for every dictionary word
    os.system("cat /usr/share/dict/words | bogoutil -p ~/.bogofilter/wordlist.db > /home/schaerer/tmp/bogostats")
    f = open("/home/schaerer/tmp/bogostats")
    data = [a.split() for a in f.readlines()][1:]  # skip the header line

    print("ham count")
    print("---------")
    data.sort(key=lambda x: int(x[2]), reverse=True)  # sort on ham count
    print_list(data[:show])

    print("spam count")
    print("----------")
    data.sort(key=lambda x: int(x[1]), reverse=True)  # sort on spam count
    print_list(data[:show])

    # keep only tokens with a significant number of hits
    signif_data = [i for i in data if int(i[1]) + int(i[2]) >= significative_threshold]

    print("ham indicators")
    print("--------------")
    signif_data.sort(key=lambda x: float(x[3]))  # lowest spam score first
    print_list(signif_data[:show])

    print("spam indicators")
    print("---------------")
    signif_data.sort(key=lambda x: float(x[3]), reverse=True)
    print_list(signif_data[:show])