The protocols powering the real-time web

May 25, 2009

In the past few weeks there has been a lot of discussion around the rise of the real-time web, including posts from TechCrunch, GigaOm, ReadWriteWeb and Scoble. A lot of the talk has been around Twitter, Facebook, FriendFeed, OneRiot and of course Google. You don’t have to be a genius to figure out that real-time is the future of the web. I believe there is a huge need for the tech community to develop new protocols that will power this fundamental shift in how web apps work.

The problem is our existing protocols are request driven instead of event driven. The web we know and love wasn’t built with real-time in mind.
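
To make the difference concrete, here is a toy simulation in Python. Everything in it is made up for illustration (the `Feed` class, the `simulate` function, the tick counts); it isn't modeled on any real service. It just counts how many requests a poller makes versus how many notifications a push model sends for the same set of updates.

```python
class Feed:
    """A toy feed that lets subscribers register a callback (event driven)."""

    def __init__(self):
        self.version = 0
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self):
        # Event driven: tell every subscriber the moment something changes.
        self.version += 1
        for callback in self.subscribers:
            callback(self.version)


def simulate(ticks, update_ticks):
    """Return (polls made, polls that found something new, pushes received)."""
    feed = Feed()
    pushes = []
    feed.subscribe(pushes.append)

    polls = 0
    useful_polls = 0
    last_seen = 0
    for tick in range(ticks):
        if tick in update_ticks:
            feed.publish()
        # Request driven: ask on every tick, whether or not anything changed.
        polls += 1
        if feed.version != last_seen:
            useful_polls += 1
            last_seen = feed.version
    return polls, useful_polls, len(pushes)


polls, useful_polls, pushes = simulate(ticks=100, update_ticks={10, 55})
print(f"polls={polls} useful_polls={useful_polls} pushes={pushes}")
```

With two updates over 100 ticks, the poller makes 100 requests to learn what two callbacks would have told it.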

Tim O’Reilly sent a tweet from OSCON08 that really captures the essence of the polling problem:

On monday friendfeed polled flickr nearly 3 million times for 45000 users, only 6K of whom were logged in. Architectural mismatch. #oscon08

At EventVue we have a dedicated server that does little more than poll for new blog posts from attendees. We have a few tricks to reduce the pain, but we’re still polling thousands of blogs every hour even though 99% of them haven’t added any fresh content since the last time we checked. With blog posts, people are used to having a small delay before they show up in Google Reader or other services. We’re not so forgiving when it takes 30 minutes for a tweet to show up in a client application, even though getting real-time data from Twitter using polling is virtually impossible.
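
To put the numbers from that tweet in perspective, here is a quick back-of-the-envelope calculation in Python. It treats "nearly 3 million" as exactly 3,000,000 and assumes the polls were spread evenly across users and across the day. The tweet says neither, so take the results as rough.

```python
# Figures from the tweet; "nearly 3 million" is rounded to 3,000,000.
polls_per_day = 3_000_000
users = 45_000
logged_in_users = 6_000

# Assumption: polls are spread evenly over users and over 24 hours.
polls_per_user = polls_per_day / users
minutes_between_polls = 24 * 60 / polls_per_user
logged_in_share = logged_in_users / users
polls_for_absent_users = polls_per_day * (1 - logged_in_share)

print(f"polls per user per day: {polls_per_user:.1f}")
print(f"minutes between polls:  {minutes_between_polls:.1f}")
print(f"share of users online:  {logged_in_share:.1%}")
print(f"polls for absent users: {polls_for_absent_users:,.0f}")
```

Under those assumptions, each user was polled roughly once every 22 minutes, and the large majority of those requests were made on behalf of people who weren't even logged in.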

So what is the solution?

Some people have said that XMPP holds the answer, but how many developers do you know who have set up an XMPP server before? Right. Me neither. XMPP may be a viable transport method but I think we’d be better off using something that is simpler and more familiar to developers.

Another prominent response to the polling problem is the Simple Update Protocol (SUP), proposed by Paul Buchheit of FriendFeed. SUP is certainly an improvement over our current protocols, but what frustrates me is that it only reduces polling instead of eliminating it altogether. It may make sense for FriendFeed, but it’s not something I would add to my blog.

My favorite approach is PubSubHubbub, proposed by Brad Fitzpatrick and Brett Slatkin of Google. PubSubHubbub might have a horrible name, but the protocol is exactly what we need to fix our polling problems. It’s lightweight, simple to understand and built on top of basic HTTP.

PubSubHubbub is a simple extension to Atom that uses webhook callbacks to deliver practically instant notifications between servers when a feed is updated. The protocol is decentralized and free. Anyone can run a hub. Anyone can be a publisher or a subscriber. I like that it eliminates polling altogether and is incredibly simple to implement. I took a stab at writing a PHP client library and was able to take it from protocol spec to code in under two hours.
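
Here is roughly what that looks like on the wire, sketched in Python rather than PHP. The `hub.*` parameter names follow a reading of the draft spec and could change as the protocol evolves; the URLs and function names are placeholders invented for this example. Nothing is sent over the network: the functions only build the form bodies a publisher and a subscriber would POST to a hub, plus the check a subscriber's callback performs when the hub verifies a subscription.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder URLs, assumed for illustration only.
TOPIC_URL = "http://example.com/feed.atom"
CALLBACK_URL = "http://subscriber.example.com/callback"


def publish_ping(topic_url):
    """Form body a publisher POSTs to the hub to say 'this feed changed'."""
    return urlencode({"hub.mode": "publish", "hub.url": topic_url})


def subscribe_request(topic_url, callback_url):
    """Form body a subscriber POSTs to the hub to ask for pushed updates."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
        "hub.verify": "sync",
    })


def answer_verification(query_string, expected_topic):
    """Subscriber side of the hub's verification GET.

    Echo hub.challenge back if we really asked for this topic;
    otherwise refuse. Returns (status code, response body).
    """
    params = parse_qs(query_string)
    topic = params.get("hub.topic", [""])[0]
    challenge = params.get("hub.challenge", [""])[0]
    if topic == expected_topic and challenge:
        return 200, challenge
    return 404, ""


print(publish_ping(TOPIC_URL))
print(subscribe_request(TOPIC_URL, CALLBACK_URL))
```

Once the subscription is verified, the hub POSTs new entries to the callback URL whenever the publisher pings it, so neither side ever has to poll.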

If you’re interested, you can check out my PubSubHubbub PHP library and download and install the PubSubHubbub WordPress plugin I wrote as well.

It’s worth mentioning the role that Gnip plays in all of this. Gnip has been leading the charge against the evils of polling. I’ve been a big fan of their service and have written before how they helped EventVue. But at the end of the day, the winning technology shouldn’t be in the hands of one company — it should be open and distributed. Open protocols don’t eliminate the need for Gnip. Trusted hubs like Gnip will play an important role in handling the flow of data between publishers and subscribers. Companies will pay good money to off-load that work, and Gnip is already at the center of that opportunity. I’d love to see Gnip embrace the open protocols that are being developed and lead the drive for adoption of PubSubHubbub in particular.

I’m excited about PubSubHubbub for a few reasons. First, it opens the door for a whole new range of real-time applications that simply aren’t possible today. It’s also a chance for me to contribute to solving a really big problem and an opportunity for me to get in on the ground floor of something I believe is going to be huge. I wasn’t able to contribute to the design of HTTP or sit in on the conversations that led to the development of the RSS protocol. But one day I’m going to be able to brag that Online Aspect was the very first blog on the web to support PubSubHubbub. And for a geek like me, that’s pretty cool.

Josh Fraser
Entrepreneur, world traveler and rock climber.
Software engineer and co-founder of Din, Torbit and EventVue.

Comments

dthompson said at 11:52 pm on May 25th, 2009:

Totally agree.

I would like to point out that yes, there is a higher barrier to entry for XMPP, but there are also more rewards. Besides being a great protocol for pushing data around, it has a well-developed spec for pubsub (XEP-60), and some open source servers.

I do agree that there needs to be some work done to make starting out with xmpp easier, and that the solution is probably a mixture of http apis (like pubsubhubbub) and existing xmpp servers.

Good rundown of the realtime web solutions out there.
Josh Fraser said at 4:31 am on May 26th, 2009:

Brad talks about XEP-60 specifically in the FAQ. http://moderator.appspot.com/#15/e=43e1a&t=42…

Basically the goal w/ PubSubHubbub was to create something that could be used by anyone — including the guy with the $5/month shared hosting account that doesn't have access to XMPP.
David Banes said at 7:16 pm on May 25th, 2009:

Having been an early supporter of the Jabber (now XMPP) protocol, leaving the space for a while, and then coming back I can see that it looks a bit complex at first. There certainly are a lot of add-on specs now.

However, the core spec is easy to understand if you know a bit of XML, and there are plenty of libraries out there to make it really easy to implement client side apps.

I don't get into HTTP vs XMPP debates, they both have their place, however I do think that bastardising protocols to suit new use cases is less than an ideal or elegant solution.
Josh Fraser said at 4:35 am on May 26th, 2009:

I understand your desire to use protocols in the way they were initially intended. That said, it would be a tragedy for us to come up with a totally pure implementation that no one uses. Reducing friction and getting adoption has got to be our first priority.
Pete Warden said at 3:03 pm on May 26th, 2009:

Great stuff Josh! I hadn't run across PubSubHubbub; digging around, I found the slide-show here to be the best introduction:
http://pubsubhubbub.appspot.com/

I think the real power of the HTTP approach is illustrated by the fact that you can publish from your browser using a bookmarklet! http://pubsubhubbub.appspot.com/bookmarklet_confi…

Sure it's ugly and inelegant, but it's ready to go right now on every platform in the universe, reusing well-supported technologies.
Mike Mahoney said at 4:15 pm on May 26th, 2009:

Hey Josh, thanks for pointing PubSubHubbub out. I hadn't heard of it and it looks like it could be a good solution in certain cases.

One problem I think it will eventually run into is that pubsub is a pretty extensive problem and the requirements can start to get complicated quickly. How does PSHB handle subscriber durability, topic access, guaranteed delivery, etc? (Maybe Atom has solutions for some of these things; I'm not as familiar with it as I should be.) As people adopt it they'll start to ask for this functionality, the spec will have to expand, and it will begin to look like one of the more 'complex' pubsub systems.

I'm just playing a bit of devil's advocate here. I do think there is a need for a simple, open, push based pubsub system. But if someone has any requirements beyond the basics I think XMPP and XEP-0060 is the way to go.
Josh Fraser said at 6:14 pm on May 26th, 2009:

A lot of that complexity can be pushed to the hub. People can publish or subscribe without having to worry about any of those details.
Mike Mahoney said at 7:22 pm on May 26th, 2009:

I agree, but for each piece you push to the hub it becomes more complicated to deploy.
Stephen Paul Weber said at 11:17 am on May 26th, 2009:

“simpler and more familiar to developers”

To what developers? Most developers have never set up a webserver either. Just because devs you know have more experience with HTTP than XMPP doesn’t mean that abusing it is the right solution 😛
Robin said at 7:43 pm on May 26th, 2009:

I agree. Most developers use appropriate libraries to gain access to the desired protocol, technologies, etc… without sweating the details of those things. Experts in those fields will write the libraries and most of us will simply download and use them.

Also, setting up a server to use for one is no different than the other (HTTP vs. XMPP). This sounds like a neat solution, but inevitably it is trying to hammer a square peg in a round hole, which will work fine until the peg is too large to fit anymore.

Besides, the first time I set up an OpenFire server it took me about 10 minutes, of which 5 was finding someone with the appropriate LDAP information I needed.
William Sylvester said at 6:34 pm on May 26th, 2009:

What happened to XML-RPC and pings? The weblogs.com, blo.gs, etc… Doesn't that existing infrastructure enable all of this? Simple HTTP POST or XML-RPC, based on Dave Winer's original weblogs.com? I know Six Apart had their cloud, and blo.gs (Yahoo) went to their own real-time streams. But these aggregation points seemed to work pretty well 4 years ago for the Real-Time Web…
Josh Fraser said at 7:06 pm on May 26th, 2009:

Glad you brought that up. There are a few services that should be solving this pain point (pingomatic, etc), at least for blogs (web apps are another story). The problem is most of the people who are holding that ping data aren't sharing it. It's stored in closed silos. They work well for notifying search engines, but are pretty useless to individual developers. The only exception I know of is the Google Blog Search change log (http://blogsearch.google.com/changes.xml?last=120…). But even with this change log available, there's too much data. It's a firehose and there's no easy way to filter it to subscribe to only the blogs you are interested in.
Josh Fraser said at 10:10 pm on May 26th, 2009:

I just realized that weblogs.com publishes its own change log as well:
http://rpc.weblogs.com/shortChanges.xml
Dave Winer said at 3:37 pm on May 26th, 2009:

In general, you should look at prior art and invent as little new stuff as possible. This keeps it simple for the people implementing it and keeps the barriers to adoption the lowest. The tradeoff is that you don't get to say you invented it.

This philosophy is consistent with Postel's Law, but he didn't say that being conservative in what you send applied to everyone. I wish he were alive so we could ask him, but I don't have any doubts that it applies. There should be as few ways of doing something as possible — because that limits the opportunity for incompatibility and makes the network more useful to users, and reduces lockin.

Second point — look at the prior art that exists beforehand. There are two forms that I am aware of. Yes, I designed both of them, but that doesn't mean they're not useful. 🙂

1. Support for the weblogs.com ping protocol is everywhere. Start with that. I have it in all my software, as does all the blogging software out there. The installed base is huge. Show that you care about those people; the developers might not even be around anymore. I told the FF guys they were blowing it by not starting there. Never got a response from them. Bad.

2. Look at the <cloud> element in RSS 2.0. The installed base there is negligible, but it is in a spec that's widely deployed. If you can't use it you should say why.

And btw, the bit about some implementations not sharing their changes.xml is a totally bogus excuse to reinvent the wheel. weblogs.com itself always shared it, that's how the whole thing bootstrapped. Just because some people were closed doesn't invalidate the prior art.
Dave Winer said at 10:39 pm on May 26th, 2009:

One more thing — I don't know why you say your namespace only works with Atom. If that's true you need to fix that. Namespaces shouldn't care what they're deployed in.
Josh Fraser said at 9:47 pm on June 2nd, 2009:

We're listening! PSHB has been updated to support RSS which means my WordPress plugin now works with RSS 0.92, 1.0 and 2.0.

http://code.google.com/p/pubsubhubbub/wiki/RssFee…

We're now talking about the best way to work with weblogs.com and letting you subscribe to a bunch of feeds at once using an OPML feed. I think we can get close to the "killer app" you are looking for.

http://www.scripting.com/stories/2009/05/28/googl…
Brett Slatkin said at 11:51 pm on May 26th, 2009:

We should do a better job of acknowledging the prior art that's out there. I've added this wiki page to make it clear that these specs exist and how PubSubHubbub is different:
http://code.google.com/p/pubsubhubbub/wiki/PriorA…

One of the goals of PubSubHubbub is to provide a consistent interface for all three parts of Internet-scale publish-subscribe. The two specs you mention address the publish part of the problem, but miss subscription and the role of the hub. The hub is the hardest part and crucial to making the protocol easy for publishers and subscribers to implement. Perhaps the fact that these older specs haven't caught on for realtime pub-sub on the web supports this claim.

I agree that bootstrapping using existing protocols is useful. The changes.xml format is probably the biggest example of an existing install-base. To that end we're already running an experimental "bridge" that converts the Google Blog Search changes.xml file (http://www.google.com/help/blogsearch/pinging_API…) into pings to http://pubsubhubbub.appspot.com. If you subscribe to any Blogger.com blog on that hub we will *push* you updates as fast as the changes.xml file updates. We'll probably enhance this over time and make it more reliable.

Otherwise, there's no reason PubSubHubbub could not extend to RSS. We'd like to provide a good mapping between Atom and RSS to further simplify the implementations of publishers and subscribers. We're looking into this more concretely now.
julien said at 5:43 am on June 3rd, 2009:

Arg… my comment was eaten by IntenseDebate? Was it? Please tell me it's just being moderated. (I'll repost it if needed)
Josh Fraser said at 5:52 am on June 3rd, 2009:

I didn't moderate it. Can you please repost? Sorry about that.
julien said at 4:06 pm on June 3rd, 2009:

Hey!

Ok, so I was saying that it was once (again!) a nice conversation about Real-Time… and unfortunately the war between pro and anti-XMPP is not over and everybody is standing on their position.

I tend to agree 100% that most of the time it is just better to keep protocols for what they were intended. Like you, and like a lot of people it seems, I have spent a lot of time trying to 'tweak' existing stuff (namely HTTP stuff) to do things that XMPP does very very well. Unfortunately, for us hackers, I guess it's always sexier to actually try to hack things, instead of taking the time to read (yes, that is boring) the documentation.

I think that the XMPP crowd has a huge amount of work to do to actually evangelize. Servers are now as easy to install as Apache (apt-get packages…) or Nginx. However, my understanding is that there is a lack of easy libraries/frameworks (check out babylon in Ruby: http://github.com/julien51/babylon/tree/master) and moreover, a lack of great use-cases (besides chat…)

At http://superfeedr.com we have really been forced to acknowledge that XMPP *was* our solution and I see many many services around here that would definitely benefit from that.

As a conclusion, you guys should really check XMPP out. It is not that difficult and the community will be more than happy to help!
JChauncey said at 3:01 pm on June 3rd, 2009:

Gnip is a great solution to a problem like this. The only thing is that they control who they poll. I really want to create a central repository of feeds (initially) that people contribute to socially. Allow them to rank and categorize those feeds and then allow my servers to consume the information.

From there people can create their own streams using a combination of keywords and topics. Then subscribe to that stream using a number of delivery mechanisms.

The reason why I want to do it this way is that I have a real problem with the way content is distributed. I can only consume feeds I find, and then I have to receive the articles from that feed that I may not want to read (not relevant to my need).

I am still looking to see if there are ways to make this a bit easier (architecturally) and I think that pubsubhubbub may work. Just need to do a bit more research.
Josh Fraser said at 4:16 pm on June 3rd, 2009:

Sounds like you might be interested in http://superfeedr.com that was linked to above. They don't solve the polling problem, but they deal with it so you don't have to. If you're just looking for a firehose of blog feeds, you might also want to take a look at the google blogsearch changes.xml file. PSHB would be a good distribution method for people to subscribe to the various keyword/topic feeds.
JChauncey said at 7:13 pm on June 3rd, 2009:

Having a centralized hub (unlike PSHB's decentralized model) that consumes the various formats of information and then allows people to build and subscribe to streams in any format they wish will help with adoption of the lesser known formats (XMPP being one of them).

And I'll definitely check out superfeedr. Although parsing RSS isn't too big of a problem… But it would help to offload that to someone.
julien said at 9:17 pm on June 3rd, 2009:

Hey JChauncey,

Feel free to send any feedback/request/comment about superfeedr (julien.genestoux@gmail.com). Our goal is really to make things easy and remove the hassle from the thousands of websites/web services who actually need to poll feeds.
Michael Lewkowitz said at 2:48 am on June 4th, 2009:

Interesting Josh. I've been thinking about this too from a bit of a different perspective – as in what would a coordinated effort look like to create a common infrastructure. This is a core problem for this emerging medium and something that might more easily be addressed like Visa did back in the 70's – having a bunch of independents coming together to solve a common problem. I've put some initial thoughts up over at http://igniter.com/post428.
Choosing your audience said at 1:03 am on August 21st, 2009:

[…] along a good bit of traffic, but I also enjoy writing posts that will never be found via Google. The protocols powering the real-time web doesn’t get much traffic from Google, but the discussion that post generated was amazing. […]
Head to head: PubSubHubbub vs. rssCloud said at 9:54 pm on September 9th, 2009:

[…] The post I wrote was titled RSSCloud Vs. PubSubHubbub: Why The Fat Pings Win and was really a follow up to my previous post on The protocols powering the real-time web. […]