Content aggregation: Am I missing something?

With new web services appearing every day our information is increasingly spread out over a myriad of different platforms. As a result, active web users are forced to log into dozens of different sites to keep up with their online life. There are a lot of people currently working to solve this problem with content aggregation. The idea is that you should be able to log in to one website and see all your relevant information in one place. It’s a great idea, but there are some big challenges that are worth considering.

Let me start by sharing a personal story. When the Google Maps API first came out I built a mash-up that allowed people to move around a map and see the names, addresses and phone numbers of everyone who lived in that area. I used a reverse geocoder to convert GPS coordinates to street addresses and then an address lookup service to get the information about the people who lived on each street. Since I didn’t have access to these databases myself, I extracted the data from several other websites. Everything worked great until someone found my site and posted a link on Digg. About 10 minutes after my website made the front page of Digg, two of the services I was using blocked my IP address. My 10 minutes of fame were over. My site was as dead as a doornail.

I learned an important lesson from this incident. When you pull content from another website you are ALWAYS at their mercy.

A good number of websites now provide APIs that allow third parties to interact with their data. In theory this makes content aggregation more reliable, since you have defined methods for accessing a provider’s information. Unfortunately, even with APIs you are still defenseless against the actions of the API provider. For example, last week Facebook made several code updates that broke the majority of applications on their platform. I’ve experienced the same issues with the APIs for PayPal Pro and Google Maps. In each case, they made an “update” to their code that put my website out of business for a day. When you use an API, you must understand that it can (and will) change at any time, and there will be times when it is broken or unavailable.
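In practice, the only defense is to assume that every third-party call can fail or change shape under you. Here is a minimal sketch of that posture (the endpoint and the "results" field are hypothetical, not any real provider’s API):

```python
import json
import urllib.request
from urllib.error import URLError


def parse_listing(raw_bytes):
    """Validate a provider response before trusting it.

    An unannounced "update" often shows up as a silently changed
    schema rather than an error, so check the shape explicitly.
    Returns the list of results, or None if the payload is unusable.
    """
    try:
        payload = json.loads(raw_bytes.decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        return None
    if not isinstance(payload, dict) or "results" not in payload:
        return None
    return payload["results"]


def fetch_listing(url, timeout=5):
    """Call a third-party endpoint defensively.

    Any failure (timeout, HTTP error, unexpected schema) comes back
    as None, so the caller can degrade gracefully instead of going
    down along with the provider.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            raw = resp.read()
    except (URLError, OSError):
        return None  # provider is down, or they blocked us
    return parse_listing(raw)
```

The point of splitting out `parse_listing` is that schema validation, not just error handling, is what catches a silent API change.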

The bigger concern is aggregating content from websites that don’t have an API. The reality of the internet today is that API-offering websites make up a tiny percentage of the web. To provide a truly valuable aggregator you need to scrape data from all relevant websites, not just those that offer API access. The problem is that you are completely vulnerable when you scrape: any change to the site’s markup has the potential to break everything you are doing, and if they ever decide they don’t like you, blocking you is as simple as adding your IP address to a restriction list. No matter what, you are at their mercy.
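The best a scraper can do is fail loudly when the markup changes instead of silently returning garbage. A rough sketch (the CSS class and phone-number format are made up for illustration):

```python
import re


class LayoutChanged(Exception):
    """Raised when the scraped page no longer matches our assumptions."""


def scrape_phone_numbers(html):
    """Extract phone numbers from a results page we do not control.

    The sentinel check is the important part: when the site redesigns
    its markup, raise an explicit error rather than quietly returning
    an empty list that looks like a valid answer.
    """
    # Sentinel: a container we expect on every valid results page.
    if '<div class="results">' not in html:
        raise LayoutChanged("expected results container is missing")
    # US-style numbers like (303) 555-0100; adjust for your source.
    return re.findall(r"\(\d{3}\) \d{3}-\d{4}", html)
```

This doesn’t make scraping robust, but it turns a silent data-quality failure into an alert you can act on the day the source changes.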

Keep in mind that few websites have an economic incentive to let you scrape their data. This is especially true for websites that make their money from advertising. When you scrape their content you are depriving them of their main source of revenue. If they ever decide they don’t like you, there’s not much you can do about it.

For those of you who are currently working on building content aggregators, I am curious to hear how you plan to address these issues. Is there another side to this that I am missing?

  • Tom Chikoore

    Great post! I would like to add my own $0.02.

    I like the fact that more and more web properties are exposing APIs for other developers to use. I agree with you that you are at the mercy of the API provider, but this is nothing new: the same thing went on for a long time on the desktop. In the 90s it was common to have your app broken by an API change, though things improved toward the end of the decade. I think the web is still going through what the desktop went through in the 90s.

    Speaking as a person who has written a major API that is used by thousands of developers worldwide, these are some of the rules I followed in order not to have irate developers:

    i) Remember that the ‘I’ in API stands for Interface. For the most part, interfaces should be immutable. For me, an interface IS immutable, period (that’s why I put a lot of thought into an interface before publishing it). If you want to change an interface, extend the current one and publish a new interface. I also call this the “_ex” rule, because when you look at most desktop APIs, the ones ending in “_ex” are, for the most part, the interfaces that have been extended. If you keep this rule you will not break your developers’ code, and it gives the user base time and a migration path to the new API. If the old API is to be made obsolete, give the user base enough notice and clearly state the date, release, or version at which it will become obsolete.

    ii) If it so happens that you need to change the behavior of a current interface and you cannot extend it, then publicize, publicize, publicize your intentions over and over again to your user base. Obviously, if you are fixing a bug you do not need to extend the current interface, but you should still notify your user base.
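    To make the “_ex” rule concrete, here is a minimal sketch in Python (the Geocoder names are purely illustrative, not any real API):

```python
from abc import ABC, abstractmethod


class Geocoder(ABC):
    """The published v1 interface. Once released, it never changes."""

    @abstractmethod
    def lookup(self, address):
        """Return (latitude, longitude) for a street address."""


class GeocoderEx(Geocoder):
    """The "_ex" extension: everything in v1, plus a new capability.

    Old callers that only know Geocoder keep working unchanged;
    new callers can opt in to the extended interface.
    """

    @abstractmethod
    def lookup_ex(self, address, country):
        """Return (latitude, longitude), disambiguated by country."""


class StubGeocoder(GeocoderEx):
    """A trivial implementation, just to show both interfaces coexist."""

    def lookup(self, address):
        return (0.0, 0.0)

    def lookup_ex(self, address, country):
        return (1.0, 1.0)
```

    Because `GeocoderEx` extends rather than replaces `Geocoder`, code written against the old interface is untouched, and the provider can deprecate v1 on a published schedule.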

    What I am seeing right now with the Facebook issue is that the notion of APIs is still in its infancy on the web. I personally would NOT shy away from using these APIs, because the way APIs are published for the web will mature soon and these nuances will go away. Just a side note: if web services were being implemented as originally spec’d, with the use of UDDI, the mutating-API problem would be lessened, because it would be harder (due to process and maybe SLAs) to change a published API.

    From my experience, an API should be published with a good SLA or licensing agreement offered in good faith by the provider. Even though most of these agreements state that the API may change at any time without notice, what we have done in the past is respect our user base and take great pains to avoid inconveniencing the developers using our API. I know some web properties offer some form of SLA or usage agreement today.