Posts tagged ‘php’

Google Reader export to bookmarks.htm

I was one of the small group of Google Reader users who actively used the sharing functionality before Google killed it with their latest upgrade. While the number of people I shared with was small, the quality was incredibly high. I don’t blame Google for wanting to consolidate their social graphs (makes sense to me), but I will miss the conversations I had there.

A friend asked me if I knew how to export the shared items JSON file to a standard bookmarks.htm file. I didn’t, but I managed to whip up a quick PHP script to do the trick. Here’s the code for anyone who is interested.


// bump this limit up as it can be quite memory intensive if you have a lot of shared items
ini_set('memory_limit', '64M');

// update to use your own file here
$json_file = "/tmp/shared-items.json";

// output the std header
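echo <<<EOT
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

EOT;

$json = json_decode(file_get_contents($json_file));
echo "<DL>\n";
foreach ($json->items as $item) {
    // "\t" and "\n" must be in double quotes to be emitted as a tab and a newline
    // titles are already UTF-8 (matching the charset above), so they only need HTML escaping
    echo "\t".'<DT><A HREF="'.htmlspecialchars($item->alternate[0]->href).'" ADD_DATE="'.$item->published.'" LAST_VISIT="'.round($item->crawlTimeMsec/1000).'" LAST_MODIFIED="'.$item->updated.'">'.htmlspecialchars($item->title).'</A>'."\n";
}
echo "</DL>";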


Update on Rolling Curl

Back in 2009 I blogged about using curl_multi() in PHP without blocking. The goal was to provide a better way to process multiple HTTP requests in parallel. The code was well received and I ended up turning my original snippet into a full-blown PHP class.

And then I got busy. Meanwhile the list of bugs and feature requests began to pile up.

Thankfully, a few guys have picked up my slack on the project. Alexander Makarow has been diligently maintaining the code for me, fixing bugs and making it better. Fabian Franz forked it on GitHub and added some of the most requested features.

Thanks to their efforts, Rolling Curl is in better shape than ever. This is why I love open source.


Splitting names

You can find the latest version of this code on GitHub. There are libraries for both PHP and JavaScript.

The quest

I’m on an ongoing search to find the best algorithm for splitting a full name into a first name and a last name. I’m sure this sounds like a ridiculously trivial quest. Just explode the string on a space, right?

The challenge

But how do you tell the difference between people with double first names like Jo Ann Smith and people with double last names like Jo Von Trapp? What would you do if I gave you a double first name AND a double last name at the same time?
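To make the problem concrete, here is a minimal sketch of the naive "explode on a space" approach. The naive_split_name() helper is hypothetical and only exists for illustration; it is not part of the library.

```php
// naive approach: the first word is the first name, the last word is the last name
// (naive_split_name is a hypothetical helper used only for this illustration)
function naive_split_name($full_name) {
    $parts = explode(" ", trim($full_name));
    return array(
        'fname' => $parts[0],
        'lname' => (count($parts) > 1) ? $parts[count($parts) - 1] : "",
    );
}

// both names come back as "Jo" + one word, so "Ann" and "Von" are silently dropped
print_r(naive_split_name("Jo Ann Smith"));
print_r(naive_split_name("Jo Von Trapp"));
```

Nothing in the string itself says whether the middle word belongs to the first name or the last name, which is why the algorithm needs a dictionary of compound last name indicators.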

Did you remember that you might need to parse out prefixes (Mr, Mrs, etc) and suffixes (II, Jr, PhD, etc)?

How do you turn Paul T. S. Williams into Paul Williams while intelligently deducing that T. James Adams probably wants to go by James Adams, but T. Adams should probably stay as T. Adams?

And how do you straighten out the capitalization? I MIGHT WRITE IN ALL CAPS or all lowercase. Most names have the first letter capitalized and everything else in lowercase, but of course there are exceptions. J.P. likes to have both initials capitalized and Mr. McDonald always gets fussy when you forget to capitalize the D. Oh, and I hope you’re prepared for other anomalies like people with dashes in their name.

As you have probably realized by now, splitting a full name into its proper parts is a little more complicated than it appears on the surface.

I wrote the first version of my name-parsing algorithm two years ago and I’ve been gradually refining it ever since. It’s not perfect, but it’s improved a lot over time. I’m posting this code along with a demo in hopes that it will spur contributions to improve its accuracy even more. Throw the hardest names you know at it and let me know how it performs. I know I’m missing words for the various dictionaries of prefixes, suffixes, and compound name identifiers. Please let me know what I missed.

The algorithm

We start by splitting the full name into separate words. We then do a dictionary lookup on the first and last words to see if they are a common prefix or suffix. Next, we take the middle portion of the string (everything minus the prefix & suffix) and look at everything except the last word of that string. We then loop through each of those words, concatenating them together to make up the first name. While we’re doing that, we watch for any indication of a compound last name. It turns out that almost every compound last name starts with 1 of 15 prefixes (Von, Van, Vere, etc). If we see one of those prefixes, we break out of the first name loop and move on to concatenating the last name. We handle the capitalization issue by checking for camel-case before uppercasing the first letter of each word and lowercasing everything else. I wrote special cases for periods and dashes. We also have a couple of other special cases, like ignoring words in parentheses altogether.

The code


// split full names into the following parts:
// - prefix / salutation  (Mr., Mrs., etc)
// - given name / first name
// - middle initials
// - surname / last name
// - suffix (II, Phd, Jr, etc)
function split_full_name($full_name) {
    $full_name = trim($full_name);
    // split into words
    $unfiltered_name_parts = explode(" ",$full_name);
    // completely ignore any words in parentheses (and empty words from repeated spaces)
    $name_parts = array();
    foreach ($unfiltered_name_parts as $word) {
        if ($word != "" && $word[0] != "(")
            $name_parts[] = $word;
    }
    $num_words = sizeof($name_parts);

    // start with empty parts so the .= operations below never hit an undefined variable
    $fname = $initials = $lname = "";

    // is the first word a title? (Mr. Mrs, etc)
    $salutation = is_salutation($name_parts[0]);
    $suffix = is_suffix($name_parts[sizeof($name_parts)-1]);

    // set the range for the middle part of the name (trim prefixes & suffixes)
    $start = ($salutation) ? 1 : 0;
    $end = ($suffix) ? $num_words-1 : $num_words;

    // concat the first name
    for ($i=$start; $i < $end-1; $i++) {
        $word = $name_parts[$i];
        // move on to parsing the last name if we find an indicator of a compound last name (Von, Van, etc)
        // we use $i != $start to allow for rare cases where an indicator is actually the first name (like "Von Fabella")
        if (is_compound_lname($word) && $i != $start)
            break;
        // is it a middle initial or part of their first name?
        // if we start off with an initial, we'll call it the first name
        if (is_initial($word)) {
            // is the initial the first word?
            if ($i == $start) {
                // if so, do a look-ahead to see if they go by their middle name
                // for ex: "R. Jason Smith" => "Jason Smith" & "R." is stored as an initial
                // but "R. J. Smith" => "R. Smith" and "J." is stored as an initial
                if (is_initial($name_parts[$i+1]))
                    $fname .= " ".strtoupper($word);
                else
                    $initials .= " ".strtoupper($word);
            // otherwise, just go ahead and save the initial
            } else {
                $initials .= " ".strtoupper($word);
            }
        } else {
            $fname .= " ".fix_case($word);
        }
    }

    // check that we have more than 1 word in our string
    if ($end-$start > 1) {
        // concat the last name, continuing from where the first name loop stopped
        for (; $i < $end; $i++) {
            $lname .= " ".fix_case($name_parts[$i]);
        }
    } else {
        // otherwise, single word strings are assumed to be first names
        $fname = fix_case($name_parts[$i]);
    }

    // return the various parts in an array
    $name = array();
    $name['salutation'] = $salutation;
    $name['fname'] = trim($fname);
    $name['initials'] = trim($initials);
    $name['lname'] = trim($lname);
    $name['suffix'] = $suffix;
    return $name;
}

// detect and format standard salutations
// I'm only considering english honorifics for now & not words like
function is_salutation($word) {
    // ignore periods
    $word = str_replace('.','',strtolower($word));
    // returns normalized values
    if ($word == "mr" || $word == "master" || $word == "mister")
        return "Mr.";
    else if ($word == "mrs")
        return "Mrs.";
    else if ($word == "miss" || $word == "ms")
        return "Ms.";
    else if ($word == "dr")
        return "Dr.";
    else if ($word == "rev")
        return "Rev.";
    else if ($word == "fr")
        return "Fr.";
    else
        return false;
}

//  detect and format common suffixes
function is_suffix($word) {
    // ignore periods
    $word = str_replace('.','',$word);
    // these are some common suffixes - what am I missing?
    $suffix_array = array('I','II','III','IV','V','Senior','Junior','Jr','Sr','PhD','APR','RPh','PE','MD','MA','DMD','CME');
    foreach ($suffix_array as $suffix) {
        if (strtolower($suffix) == strtolower($word))
            return $suffix;
    }
    return false;
}

// detect compound last names like "Von Fange"
function is_compound_lname($word) {
    $word = strtolower($word);
    // these are some common prefixes that identify a compound last names - what am I missing?
    $words = array('vere','von','van','de','del','della','di','da','pietro','vanden','du','st.','st','la','ter');
    // in_array returns a real boolean; array_search would return index 0 (falsy) for 'vere'
    return in_array($word,$words);
}

// single letter, possibly followed by a period
function is_initial($word) {
    return ((strlen($word) == 1) || (strlen($word) == 2 && $word[1] == "."));
}

// detect mixed case words like "McDonald"
// returns false if the string is all one case
function is_camel_case($word) {
    if (preg_match("|[A-Z]+|s", $word) && preg_match("|[a-z]+|s", $word))
        return true;
    return false;
}

// ucfirst words split by dashes or periods
// ucfirst all upper/lower strings, but leave camelcase words alone
function fix_case($word) {
    // uppercase words split by dashes, like "Kimura-Fay"
    $word = safe_ucfirst("-",$word);
    // uppercase words split by periods, like "J.P."
    $word = safe_ucfirst(".",$word);
    return $word;
}

// helper function for fix_case
function safe_ucfirst($separator, $word) {
    // uppercase words split by the separator (ex. dashes or periods)
    $parts = explode($separator,$word);
    $words = array();
    foreach ($parts as $part) {
        $words[] = (is_camel_case($part)) ? $part : ucfirst(strtolower($part));
    }
    return implode($separator,$words);
}


How to use variable variables in PHP

One of the biggest time-savers in PHP is the ability to use variable variables.  While often intimidating for newcomers to PHP, variable variables are extremely powerful once you get the hang of them.

Variable variables are just variables whose names can be programmatically set and accessed.  For example, the code below creates a variable called $hello and outputs the string “world”.  The double dollar sign declares that the value of $a should be used as the name of the newly defined variable.

$a = 'hello';
$$a = 'world';
echo $hello;

When I started with PHP about 10 years ago, everyone was still using global variables (the register_globals setting).  That meant that anything you passed as a GET variable was automatically available as a variable in your script.  It was very convenient, but unfortunately not very secure.  For me, typing $HTTP_GET_VARS['count'] just wasn’t as fun as being able to use $count.  I found myself adding long declaration lists to the top of my files that did nothing but convert my GET/POST variables to local variables.  My code started to look like this:

$salutation = $HTTP_GET_VARS['salutation'];
$fname = $HTTP_GET_VARS['fname'];
$lname = $HTTP_GET_VARS['lname'];
$email = $HTTP_GET_VARS['email'];

Do that for a couple dozen variables and you’ll start telling yourself there has to be a better way.  Nowadays you can use $_GET instead of $HTTP_GET_VARS, but the better solution is to use variable variables. Now my code looks more like this:

// create an array of all the GET/POST variables you want to use
$fields = array('salutation','fname','lname','email','company','job_title','addr1','addr2','city','state',
    );

// convert each REQUEST variable (GET, POST or COOKIE) to a local variable
foreach($fields as $field)
    ${$field} = sanitize($_REQUEST[$field]);

This has several benefits.  I reduced 14 lines of code down to 3.  I now have one place to sanitize all my external input. And if I ever decide to change a variable name, I have one less place in my code to fix.
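The post calls sanitize() without showing it. As a rough sketch only, a version that trims whitespace and strips tags might look like the following. The sanitize() body and the $request stand-in for $_REQUEST are assumptions made so the example runs on its own; the right sanitizing depends on where the values end up (database, HTML, email headers and so on).

```php
// hypothetical helper: the real sanitize() is not shown in the post
function sanitize($value) {
    // strip HTML tags and surrounding whitespace
    return trim(strip_tags((string) $value));
}

// stand-in for $_REQUEST so the sketch can run from the command line
$request = array('fname' => '  <b>Ada</b> ', 'email' => 'ada@example.com');

$fields = array('fname', 'lname', 'email');

foreach ($fields as $field) {
    // fields missing from the request become empty strings instead of raising a notice
    ${$field} = sanitize(isset($request[$field]) ? $request[$field] : '');
}

echo $fname, "\n";
echo $email, "\n";
```

Because only names listed in $fields are turned into variables, this avoids the main problem with register_globals, where any request parameter could overwrite a variable in your script.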

The benefit of this technique increases as you use the $fields array throughout your code.  I now utilize the $fields array when saving my form data to the database.  I use it for loading existing user values from the database.  I use it for passing my form fields back to Smarty:

$form = array();
foreach($fields as $field)
    // key each value by its field name so the template can look it up
    $form[$field] = $_REQUEST[$field];

Variable variables have become one of my favorite features of PHP. They’ve allowed me to tighten up a lot of my code and made it a lot more maintainable.

Have you done anything cool with variable variables?  What other PHP tricks have revolutionized the way you write code?


How to use curl_multi() without blocking

You can find the latest version of this library on GitHub.

A more efficient implementation of curl_multi()

curl_multi is a great way to process multiple HTTP requests in parallel in PHP. curl_multi is particularly handy when working with large data sets (like fetching thousands of RSS feeds at one time). Unfortunately there is very little documentation on the best way to implement curl_multi. As a result, most of the examples around the web are either inefficient or fail entirely when asked to handle more than a few hundred requests.

The problem is that most implementations of curl_multi wait for each set of requests to complete before processing them. If there are too many requests to process at once, they usually get broken into groups that are then processed one at a time. The problem with this is that each group has to wait for the slowest request to download. In a group of 100 requests, all it takes is one slow one to delay the processing of 99 others. The larger the number of requests you are dealing with, the more noticeable this latency becomes.
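For contrast, the batch-at-a-time pattern described above looks roughly like this. It is a sketch of the approach being criticized, not code from any particular library; the batched_curl() name and the $batch_size parameter are made up for the illustration.

```php
// hypothetical example of the blocking, batch-at-a-time approach
function batched_curl($urls, $batch_size = 100) {
    $results = array();
    foreach (array_chunk($urls, $batch_size) as $batch) {
        $mh = curl_multi_init();
        $handles = array();
        foreach ($batch as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }

        // block until EVERY request in this batch has finished;
        // one slow URL holds up the processing of all the others
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running) {
                curl_multi_select($mh, 1.0);
            }
        } while ($running && $status == CURLM_OK);

        // only now are the responses collected
        foreach ($handles as $url => $ch) {
            $results[$url] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
        }
        curl_multi_close($mh);
    }
    return $results;
}
```

Nothing is handed back to the caller until the inner do/while loop exits, so every batch takes as long as its slowest request.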

The solution is to process each request as soon as it completes. This eliminates the wasted CPU cycles from busy waiting. I also created a queue of cURL requests to allow for maximum throughput. Each time a request is completed, I add a new one from the queue. By dynamically adding and removing links, we keep a constant number of links downloading at all times. This gives us a way to throttle the number of simultaneous requests we are sending. The result is a faster and more efficient way of processing large quantities of cURL requests in parallel.

function rolling_curl($urls, $callback, $custom_options = null) {

    // make sure the rolling window isn't greater than the # of urls
    $rolling_window = 5;
    $rolling_window = (sizeof($urls) < $rolling_window) ? sizeof($urls) : $rolling_window;

    $master = curl_multi_init();
    $curl_arr = array();

    // add additional curl options here
    $std_options = array(CURLOPT_RETURNTRANSFER => true,
    );
    $options = ($custom_options) ? ($std_options + $custom_options) : $std_options;

    // start the first batch of requests
    for ($i = 0; $i < $rolling_window; $i++) {
        $ch = curl_init();
        $options[CURLOPT_URL] = $urls[$i];
        curl_setopt_array($ch, $options);
        curl_multi_add_handle($master, $ch);
    }

    do {
        while(($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
        if($execrun != CURLM_OK)
            break;
        // a request was just completed -- find out which one
        while($done = curl_multi_info_read($master)) {
            $info = curl_getinfo($done['handle']);
            if ($info['http_code'] == 200)  {
                $output = curl_multi_getcontent($done['handle']);

                // request successful.  process output using the callback function.
                $callback($output);
            } else {
                // request failed.  add error handling.
            }

            // start a new request if any are left (it's important to do this before removing the old one)
            if ($i < sizeof($urls)) {
                $ch = curl_init();
                $options[CURLOPT_URL] = $urls[$i++];  // increment i
                curl_setopt_array($ch, $options);
                curl_multi_add_handle($master, $ch);
                $running = 1;  // force another pass so the new handle gets executed
            }

            // remove the curl handle that just completed
            curl_multi_remove_handle($master, $done['handle']);
        }
        // wait for activity on any handle instead of spinning
        if ($running)
            curl_multi_select($master);
    } while ($running);

    curl_multi_close($master);
    return true;
}

Note: I set my max number of parallel requests ($rolling_window) to 5. Be sure to update this value according to the bandwidth available on your server / servers you are curling. Be nice and read this first.

Updated 3/6/09: Fixed a missing semi-colon. Thanks to Steve Gricci for catching the typo.

Updated 4/2/09: Made some changes to increase reusability. rolling_curl now expects a $callback parameter for a function that will process each response. It also accepts an array called $custom_options that lets you add custom curl options such as authentication, custom headers, etc.

Updated 4/8/09: Fixed a new bug that was introduced with the last update. Thanks to Damian Clement for alerting me to the problem.


How to detect the RSS feed for a blog

Ever wondered how to automatically figure out the RSS feed for a blog?

Generally speaking, it’s a simple task — just download the HTML for the given blog and use a fancy regular expression to find the associated RSS feed. In PHP, it looks something like this:

$bloghtml = file_get_contents($blogurl);
preg_match('/<link.*type\s*=\s*["\']*application\/rss\+xml["\']*.*href\s*=\s*["\']?([^\'" >]+)[\'" >]/i', $bloghtml, $match);
// $match is empty when the page does not advertise a feed
$rssurl = isset($match[1]) ? $match[1] : null;

The main problem with this approach is that some blogs take a long time to load — and that often translates to your application being slow as well. On top of that, it’s frustrating to have to download and process an entire page of HTML just to extract one URL.
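Speed aside, the regular expression also only matches when the type attribute appears before href. If you do parse the HTML yourself, one alternative is PHP's DOM extension. The sketch below is a different technique from the regex above, and find_rss_link() is a hypothetical helper name.

```php
// hypothetical helper: returns the first RSS feed URL advertised in the HTML, or null
function find_rss_link($html) {
    if ($html == '') {
        return null;
    }
    $doc = new DOMDocument();
    // real-world HTML is rarely valid, so collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // attribute order does not matter here, unlike with the regular expression
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('type')) == 'application/rss+xml'
                && $link->getAttribute('href') != '') {
            return $link->getAttribute('href');
        }
    }
    return null;
}

$sample = '<html><head><link href="http://example.com/feed/" rel="alternate" type="application/rss+xml"></head><body></body></html>';
echo find_rss_link($sample), "\n";
```

This still requires downloading the whole page, so it does nothing for the speed problem.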

Recently Google came out with a better solution in the form of their AJAX Feed API. Using their API, detecting feeds is now easier, faster and more reliable:

$lookup_url = "".urlencode($blogurl);
$result = curl($lookup_url);

I’ve been using this API for about a month now and have really appreciated the improvements. If you need to detect feeds, give it a try. I think you’ll like it.