How to use curl_multi() without blocking

You can find the latest version of this library on GitHub.

A more efficient implementation of curl_multi()
curl_multi is a great way to process multiple HTTP requests in parallel in PHP. curl_multi is particularly handy when working with large data sets (like fetching thousands of RSS feeds at one time). Unfortunately there is very little documentation on the best way to implement curl_multi. As a result, most of the examples around the web are either inefficient or fail entirely when asked to handle more than a few hundred requests.

The problem is that most implementations of curl_multi wait for each set of requests to complete before processing them. If there are too many requests to process at once, they usually get broken into groups that are then processed one at a time. The problem with this is that each group has to wait for the slowest request to download. In a group of 100 requests, all it takes is one slow one to delay the processing of 99 others. The larger the number of requests you are dealing with, the more noticeable this latency becomes.

The solution is to process each request as soon as it completes. This eliminates the wasted CPU cycles from busy waiting. I also created a queue of cURL requests to allow for maximum throughput. Each time a request completes, I add a new one from the queue. By dynamically adding and removing handles, we keep a constant number of requests downloading at all times. This gives us a way to throttle the number of simultaneous requests we are sending. The result is a faster and more efficient way of processing large quantities of cURL requests in parallel.

function rolling_curl($urls, $callback, $custom_options = null) {

    // make sure the rolling window isn't greater than the # of urls
    $rolling_window = 5;
    $rolling_window = (sizeof($urls) < $rolling_window) ? sizeof($urls) : $rolling_window;

    $master = curl_multi_init();

    // add additional curl options here
    $std_options = array(CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5);
    $options = ($custom_options) ? ($std_options + $custom_options) : $std_options;

    // start the first batch of requests
    for ($i = 0; $i < $rolling_window; $i++) {
        $ch = curl_init();
        $options[CURLOPT_URL] = $urls[$i];
        curl_setopt_array($ch, $options);
        curl_multi_add_handle($master, $ch);
    }

    do {
        while (($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
        if ($execrun != CURLM_OK)
            break;
        // a request was just completed -- find out which one
        while ($done = curl_multi_info_read($master)) {
            $info = curl_getinfo($done['handle']);
            if ($info['http_code'] == 200) {
                $output = curl_multi_getcontent($done['handle']);

                // request successful. process output using the callback function.
                $callback($output);
            } else {
                // request failed. add error handling here.
            }

            // start a new request, if any are left in the queue
            // (it's important to do this before removing the old one)
            if ($i < sizeof($urls)) {
                $ch = curl_init();
                $options[CURLOPT_URL] = $urls[$i++];  // increment i
                curl_setopt_array($ch, $options);
                curl_multi_add_handle($master, $ch);
            }

            // remove and close the curl handle that just completed
            curl_multi_remove_handle($master, $done['handle']);
            curl_close($done['handle']);
        }

        // block until there is activity on one of the handles instead of
        // spinning -- this keeps the loop from eating 100% CPU
        if ($running && curl_multi_select($master) == -1) {
            usleep(100);  // guard against builds where select returns -1 immediately
        }
    } while ($running);

    curl_multi_close($master);
    return true;
}
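
A quick usage sketch (the callback name and the URLs are placeholders, not part of the function above):

// process each response as soon as it arrives
function my_callback($output) {
    // keep this fast -- slow processing here delays the whole loop
    echo strlen($output) . " bytes received\n";
}

$urls = array(
    "http://www.example.com/feed1.rss",
    "http://www.example.com/feed2.rss",
    "http://www.example.com/feed3.rss"
);

rolling_curl($urls, 'my_callback');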

Note: I set my max number of parallel requests ($rolling_window) to 5. Be sure to update this value according to the bandwidth available on your server / the servers you are curling. Be nice and read this first.

Updated 3/6/09: Fixed a missing semi-colon. Thanks to Steve Gricci for catching the typo.

Updated 4/2/09: Made some changes to increase reusability. rolling_curl now expects a $callback parameter for a function that will process each response. It also accepts an array called $custom_options that lets you add custom curl options such as authentication, custom headers, etc.

Updated 4/8/09: Fixed a new bug that was introduced with the last update. Thanks to Damian Clement for alerting me to the problem.

  • Michael

    Hey Josh,

    Is it possible to pass a value (e.g. $row['id']) into RollingCurl so that it's available for use within the callback function?

    foreach ($rows as $row) {
        // Add each request to the RollingCurl object.
        $request = new RollingCurlRequest($row['url']);
        $rc->add($request);
    }

    (Basically, it's the MySQL primary key for each row ($row['id']) that I'm trying to pass and make available within the callback function.)

    Thanks.

    • http://www.onlineaspect.com Josh Fraser

      Sure, an easy way to do this is to add a GET variable to the end of the URL you are fetching (e.g. ?mysql_id=42) and then parse that ID out of the cURL info array when the request completes.
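
      A minimal sketch of that approach (hedged: the mysql_id name is just an example, and it assumes the callback is handed the cURL info array alongside the response):

      // when building the list of urls, append the row id
      // (assumes the url has no existing query string)
      $urls[] = $row['url'] . '?mysql_id=' . $row['id'];

      // in the callback, recover the id from the url curl reports
      function my_callback($output, $info) {
          parse_str(parse_url($info['url'], PHP_URL_QUERY), $params);
          $mysql_id = $params['mysql_id'];
          // ... write $output back to the row identified by $mysql_id
      }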

      • Michael

        Ok. I got it implemented using a hash tag to pass the monitor id and then reading this value back from the $request object ($info['url'] doesn't retain the hash tag on the URL for whatever reason). This way, by using a hash tag, I figure there is no possibility that it'll ever change the URL that is checked. Still, it'd be cool if RollingCurl had a way to pass a value without affecting the URL. But this is working for now. Thanks for sharing RC!
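
        For reference, a sketch of that fragment trick (hedged: it assumes the callback receives the original RollingCurlRequest, here $request, with its url property intact):

        // when queuing, tack the monitor id onto the url as a fragment
        $request = new RollingCurlRequest($row['url'] . '#' . $row['id']);
        $rc->add($request);

        // in the callback, read it back from the request, not from $info['url']
        $monitor_id = substr($request->url, strpos($request->url, '#') + 1);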

  • Bogdan

    For the simple script given below, how can I use the RollingCurl library to make a POST request to each of http://www.site_01.com, http://www.site_02.com and http://www.site_03.com, using variables that the "my_request" function parses out of the GET response? Thank you.

    <?php
    require("RollingCurl.php");

    function my_request($response) {

    ……………
    ……………

    (code used to parse some variables to use later in POST request)

    ……………
    ……………

    }

    $urls = array("http://www.site_01.com",
                  "http://www.site_02.com",
                  "http://www.site_03.com");

    $rc = new RollingCurl("my_request");

    $rc->window_size = 3;

    foreach ($urls as $url) {
        $request = new RollingCurlRequest($url, "GET");
        $rc->add($request);
    }

    $rc->execute();
    ?>
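
    One hedged sketch of that chaining (it assumes the library passes ($response, $info, $request) to the callback and that RollingCurlRequest accepts a method and post data, i.e. RollingCurlRequest($url, $method, $post_data); parse_variables() is a hypothetical helper):

    function my_request($response, $info, $request) {
        global $rc;

        // parse whatever the GET response contains into POST fields
        $post_data = parse_variables($response);  // hypothetical helper

        // queue a follow-up POST to the same site with those variables
        $rc->add(new RollingCurlRequest($info['url'], "POST", $post_data));
    }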

  • Bob

    Hi, nice job.
    Just some changes for me:
    // a shorter way to make sure the rolling window isn't greater than the number of $urls
    $rolling_window = min(array(5, count($urls)));

    // and this line, just after the "for" loop (start the first batch of requests)
    // because if you have a window of 5, you leave the for loop with $i == 5 (last $i++)
    // then, when you get the next url in the do-while, you do another $i++ which does not take the 5th url !!!
    $i--;

    That's all for me! Thank you again, this saved me some hours!!!!

  • Zeke

    Not sure if I did it correctly, but my problem with the code is with the callback function:

    for example:
    call_user_func($callback, $urls[$z], $output);

    When I call the callback function, the $output does not match the url, since I want to display each link together with its output. What I am getting is that the $output will come either before or after the next url…

    I tried to fix it with sleep and curl_multi_select (which is supposed to wait for activity on the connection), but I can't fix the problem…

  • Zeke

    If you're trying to return the links like I am, don't do it the stupid way I was doing it >.<!

    Use the url parameter (from curl_getinfo) instead:
    $info['url'].

  • Zeke

    Forgot to mention: use curl_multi_select($master);
    to lower your CPU spike when running…

  • Saagar

    Hey, I am using curl_multi_exec to process thousands of URLs. Currently it breaks down at around 15 to 20k… please help me with that…

  • http://pokemonepisode.org pokemon

    This is a very old post, but I thought I might as well reply, as I think I know which part you don't understand.
    When you say "overwrite $ch every time", take note of "curl_multi_add_handle($master, $ch);".
    Before $ch is overwritten, its handle is added to $master.
    From the opening part of your comment, it seems you noticed a variable being overwritten and wondered how its old contents remain available for cURL to see: the variable's old value is indeed gone, but the handle it pointed to was already registered with $master, so cURL still holds it.

  • Viacheslav

    You should be careful with while loops. Without an appropriate sleep they go insane))

    You should insert
    usleep(10000);
    before
    } while ($running);

    I spent an hour investigating why my simple scripts were using 100% of the processor )

  • http://www.manuelfink.de Manu

    Hi, I like the idea, however I was wondering how I can know which handle/url made the original call?

    You only process the returned data, as it seems to me. However, for me it's important to know which url the data came from, since I have to write it back into the corresponding record in the database. Any idea how to achieve this? Can I access the url the handle called?
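
    One minimal way to get this from the rolling_curl() function above (a sketch, with my_callback as a placeholder name): pass the cURL info array to the callback along with the output.

    // in rolling_curl(), invoke the callback with the info array as well:
    $callback($output, $info);

    // then in your callback, $info['url'] is the effective url of the request
    // (note: with CURLOPT_FOLLOWLOCATION it reflects the final url after redirects)
    function my_callback($output, $info) {
        $url = $info['url'];
        // ... update the database record that corresponds to $url
    }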

  • Jonathan Rodan

    The line
    while(($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);

    is a CPU killer. You can probably use curl_multi_select instead, or accept the lost time of a usleep().
    Kaolin Fire's solution is much better than the line I specified.

  • https://www.malaysia29.com/ ترافل

    I like your idea of a rolling window

  • http://www.parisnakitakejser.com Paris Nakita Kejser

    I have used this method to handle my image downloader that pulls from external partners. I use Xdebug, so if you use it too, remember to adjust xdebug.max_nesting_level (I set a cap of 1 million). I made an inner loop to call the next curl instance, and on our company internet line (not so good) I can process around 1500-2000 images in a minute.

    Thanks a lot for this guide, it's helping a lot! :)

  • http://eliosh.wordpress.com eliosh

    This is my solution for multiple HEAD requests:

    https://gist.github.com/anonymous/1a9eb381f6a5f260bd20