How to use curl_multi() without blocking

You can find the latest version of this library on GitHub.

A more efficient implementation of curl_multi()
curl_multi is a great way to process multiple HTTP requests in parallel in PHP. curl_multi is particularly handy when working with large data sets (like fetching thousands of RSS feeds at one time). Unfortunately there is very little documentation on the best way to implement curl_multi. As a result, most of the examples around the web are either inefficient or fail entirely when asked to handle more than a few hundred requests.

The problem is that most implementations of curl_multi wait for each set of requests to complete before processing them. If there are too many requests to process at once, they usually get broken into groups that are then processed one at a time. The catch is that each group has to wait for the slowest request to finish. In a group of 100 requests, all it takes is one slow one to delay the processing of 99 others. The larger the number of requests you are dealing with, the more noticeable this latency becomes.

The solution is to process each request as soon as it completes. This eliminates the CPU cycles wasted on busy waiting. I also created a queue of cURL requests to allow for maximum throughput. Each time a request completes, I add a new one from the queue. By dynamically adding and removing handles, we keep a constant number of requests downloading at all times. This gives us a way to throttle the number of simultaneous requests we are sending. The result is a faster and more efficient way of processing large quantities of cURL requests in parallel.

function rolling_curl($urls, $callback, $custom_options = null) {

    // make sure the rolling window isn't greater than the # of urls
    $rolling_window = 5;
    $rolling_window = (sizeof($urls) < $rolling_window) ? sizeof($urls) : $rolling_window;

    $master = curl_multi_init();
    $curl_arr = array();

    // add additional curl options here
    $std_options = array(CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS => 5);
    $options = ($custom_options) ? ($std_options + $custom_options) : $std_options;

    // start the first batch of requests
    for ($i = 0; $i < $rolling_window; $i++) {
        $ch = curl_init();
        $options[CURLOPT_URL] = $urls[$i];
        curl_setopt_array($ch,$options);
        curl_multi_add_handle($master, $ch);
    }

    do {
        // track whether we queued any new requests on this pass
        $added = false;
        while(($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
        if($execrun != CURLM_OK)
            break;
        // a request was just completed -- find out which one
        while($done = curl_multi_info_read($master)) {
            $info = curl_getinfo($done['handle']);
            if ($info['http_code'] == 200)  {
                $output = curl_multi_getcontent($done['handle']);

                // request successful.  process output using the callback function.
                $callback($output);
            } else {
                // request failed.  add error handling.
            }

            // start a new request if any URLs are left in the queue
            // (it's important to do this before removing the old one)
            if ($i < sizeof($urls)) {
                $ch = curl_init();
                $options[CURLOPT_URL] = $urls[$i++];  // increment i
                curl_setopt_array($ch,$options);
                curl_multi_add_handle($master, $ch);
                $added = true;
            }

            // remove and close the curl handle that just completed
            curl_multi_remove_handle($master, $done['handle']);
            curl_close($done['handle']);
        }
        // wait for activity on any of the handles instead of busy-waiting
        if ($running) {
            curl_multi_select($master);
        }
    } while ($running || $added);
   
    curl_multi_close($master);
    return true;
}
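
Here's a minimal usage sketch (the URLs and the callback are placeholders; rolling_curl hands each response body to your callback as it arrives):

function print_title($output) {
    // pull the page title out of the returned HTML
    if (preg_match('/<title>(.*?)<\/title>/is', $output, $m)) {
        echo $m[1] . "\n";
    }
}

$urls = array("http://www.example.com/", "http://www.example.org/");
rolling_curl($urls, 'print_title');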

Note: I set my max number of parallel requests ($rolling_window) to 5. Be sure to update this value according to the bandwidth available on your server / servers you are curling. Be nice and read this first.

Updated 3/6/09: Fixed a missing semi-colon. Thanks to Steve Gricci for catching the typo.

Updated 4/2/09: Made some changes to increase reusability. rolling_curl now expects a $callback parameter for a function that will process each response. It also accepts an array called $custom_options that lets you add custom cURL options such as authentication, custom headers, etc.

Updated 4/8/09: Fixed a new bug that was introduced with the last update. Thanks to Damian Clement for alerting me to the problem.

  • tokyorefugee

    Thanks very much for your good solution, Josh.

    But your code fails with a huge number of URLs (around 100,000) and exits without any warning.

    If you want, I can send the code to you.

  • Marc

    But if you overwrite the $ch every time, how can you have 1000 requests in this $ch? I'm referring to the example in http://www.php.net/manual/en/function.curl-multi-… It uses $ch1, $ch2 and so on. I don't understand why your code works…

    I need to check external links, get all status codes, error messages, request time and the Location header if 301. I currently have no idea how to get the Location header… Also, with your example I have no idea which original link was checked if a 301 was given on the first request. With CURLOPT_MAXREDIRS => 5 you follow 5 redirects, but lose all information about the original requested link. This way you can use your example to download 1000 files to disk, but if you need to handle the status codes specific to the results, it is very difficult.

    Will it change anything about blocking if the variables are named $ch1, $ch2? Sorry, but I don't understand how it currently works and I've tried to debug it for some time… I just want to be sure that nothing goes wrong in my code.

  • Marc

    Always keep in mind that some firewalls will block you if you open more than 6 requests to one hostname. This is not allowed per the RFC definition, and you can – no, you WILL – overload the remote server. You will bring the server down if you open 100 simultaneous requests per second! Do not overload other servers… this is like a DDoS attack, and IDS (intrusion detection systems) will block you completely.

  • Thanks guys, this was useful in debugging some of the issues I was facing 🙂

  • This is a good reminder for everyone. Marc, thanks for bringing this up.

  • There is no standard answer. Can you ask the people you're scraping?

    • Damian Clement

      Josh,

      I've tried asking the people I'm scraping but haven't had any replies to the enquiries I've made to the webmaster email addresses, so I guess I'll try upping my limit by 1 at a time and see how I go on.

      I tried using the example you gave using twitter: –

      >Sure. Here's a simple example using twitter:

      >function twitter_callback($output) {
      >$results = json_decode($output);
      >print_r($results);
      >echo "<hr>";
      >}

      >$urls[] = "http://twitter.com/users/show.json?screen_name=joshfraz";
      >$urls[] = "http://twitter.com/users/show.json?screen_name=eventvue";

      >rolling_curl($urls,'twitter_callback');

      I replaced the two URLs with

      $urls[] = "http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7001";
      $urls[] = "http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7002";

      and modified the function to parse $output to suit my requirements, but my browser just takes the returned HTML from the URLs I've visited and displays it on screen. How do I stop this from happening?

      • sounds like you have an extra print_r or echo in your code somewhere.

        • Damian Clement

          Josh,

          The only code I used is as follows: –

          <?php

          include("rolling_curl.php"); //Unmodified rolling_curl copied straight from this page

          function korail_callback($output) {
          //Do nothing
          }

          $urls[] = "http://www.google.co.uk";

          rolling_curl($urls,'korail_callback');

          ?>

          The result on screen is Google's homepage. I was expecting a blank screen; am I wrong to expect this?

          • whoops. looks like i introduced a bug w/ my last update.

            if ($custom_options) {
            $options = $std_options + $custom_options;
            }

            should really have been:

            $options = ($custom_options) ? ($std_options + $custom_options) : $std_options;

            I've updated the post with this fix so if you copy the new code it should work. Sorry about that!

  • Damian Clement

    Josh,

    please excuse the beginners question, but what variable type is $output returned in? I can use "strlen" to determine the length of $output but "substr" doesn't seem to work.

    In my original linear code I used the expression, $output = file_get_contents($urls), and was able to parse $output for various HTML fragments, however it doesn't work on $output returned using your Rolling_curl function. Does it need modifying to suit my needs?

    My code:

    <?php

    include("rolling_curl.php");

    function korail_callback($output) {
    echo strlen ($output); //returns a value no problem
    echo "
    ";
    echo substr($output,0,10); //doesn't do anything
    echo "
    ";
    }

    $urls[]='http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7001';
    $urls[]='http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7002';
    $urls[]='http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7003';
    $urls[]='http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7004';
    $urls[]='http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7005';

    rolling_curl($urls,'korail_callback');

    ?>

  • a nit that embarrassed me a touch (though I blame it on being up all night and then some). At the point where you: "// start a new request (it's important to do this before removing the old one)" … it's best to check to make sure $i < count($urls);

    I also changed the last few lines to:

    $ready = curl_multi_select($mh);
    if ($ready != -1) $execrun = curl_multi_exec($mh,$running);
    } while ($running && ($ready != -1)); // do…

    but I don't know if that was necessary. I was just having "issues".

    • Kaolin, I'm almost done with a new version that will bring a lot of improvements to this code. Look for it some time later this week. I've fixed a few small problems like the one you mentioned and made it object oriented for increased reusability.

      • Olmo

        Hey Josh! Thanks for the code snippet. Any news on the OO class that you were working on?

  • marcelo

    Really good solution.
    Congratulations!

  • Vinicios

    If $callback takes too long, won't all the requests finish before another one starts? Is there a way around it?

  • Michael

    Great solution Josh!

    What if I need an additional variable (say, a URL id from the database) passed through the callback function? I tried passing $info as well, but it loses the original URL when redirected, so I cannot use the callback function to update the URL status in my database.

    • Michael,

      Great question. I don't have an answer for you off the top of my head. I'll be thinking about it and will let you know if I come up with something.

      Josh

    • What I'm doing, which may be implicitly buggy (working through it right now; it seems to fail nondeterministically), is keeping a lookup hash for which $ch is attached to which $id…

      • Michael

        Would be happy to see your code. Ideally I would have the id (or even the $i) sent to the callback function

    • Amit Shah

      Hi Michael,

      I too was looking for information on the same and came across this simple solution, which may help all of us. After one gets the following code executed,
      $info = curl_getinfo($done['handle']);
      we can just simply retrieve the URL by adding this code.
      $url = $info['url'];
      Once you have the URL, you surely can come up with the ID used. And with the code of the above function, you can send the response back to the callback by adding this as an additional parameter.

      Hope this works out.

      • Michael

        I'm not sure this will work since the whole issue started with URLs being redirected and changed when $done. I ended up doing something similar to what Kaolin had suggested (see comment below):

        $handles = array(); // create a handles array

        for ($i = 0; $i < $rolling_window; $i++) {
        $ch = curl_init();
        $handles[(int)$ch] = $urls[$i]; // store each ch handle along with the relevant $i (can also be the original url itself)
        $options[CURLOPT_URL] = $urls[$i];
        curl_setopt_array($ch,$options);
        curl_multi_add_handle($master, $ch);
        }

        // request successful. process output using the callback function.
        $callback($output, $handles[(int)$ch]); // you now have the original url id/key/value to your callback

        Hope this helps anyone,

        Michael

  • On non-standard HTTP requests, the $info['http_code'] == 200 check will fail. Any solution? Thanks.

    Maybe checking $info['header_size'] == 1 ?!

    • I love that my blog commenters are so much smarter than me. 🙂

      We should probably check for any 2xx http_code. Is there any good reason to continue if we get any of the other codes?
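
      Something like this (an untested sketch) would do it:

      $code = $info['http_code'];
      if ($code >= 200 && $code < 300) {
          // success -- process the response
          $output = curl_multi_getcontent($done['handle']);
          $callback($output);
      } else {
          // transport error (http_code 0) or non-2xx response -- handle it here
      }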

      • Michael

        I do:

        if(curl_errno($ch) == 0 && $info['http_code'] == 200)

  • ronny

    I'm sorry to inform you, I have tried your code with 900 URLs on a dedicated server with a 1000Mbit connection and a window size of only 10, and it does not crawl all 900 URLs… between 30 and 140 URLs randomly.
    Any ideas why?

    • Are you hitting 1 site or 900 different sites? If you're hitting 1 site you might want to make sure you're not getting blocked.

      Otherwise, my guess is there is some setting or limitation on your server that is limiting the number of connections. How much memory do you have? I've run into the problem of dropped URLs before, but only with a window size of several hundred. One of the problems with multi_curl is that it tends to fail silently. Please let me know if you figure out what is going on. I'd love to find a solution besides "use a smaller window size".

      • ronny

        None of the 900 URLs are on the same server, and I have 8GB of RAM with 16 cores, so I doubt I have any server limitation.
        If I use a different function that gathers all the data in "blocks" of 500, for example, and then parses everything, it seems to work. The part of your function that parses the data as soon as it comes in might be making things a bit shaky.
        Try getting 900 URLs with a callback to a function that counts the number of URLs that were called back (declare the count var global in the function to enable shared access from all the callbacks) and you'll see very few get called back, even with a window size of 50. Whereas if I use another function that doesn't process results as soon as they arrive, all pages get fetched and I process later, but it's slower. It would be awesome if you found out the reason in your code, as I'm sure it would be faster!

        • Michael

          Why don't you echo something simpler in the callback function (say "done\n") and count the number of times it displays? Do the same with errors (i.e. replace "// request failed. add error handling." with echo "error\n"). I had similar issues, and it was because of faulty success and error callback functions.

          • ronny

            just tried it with 800 urls.
            21 done,
            4 failed.
            775 vanished??

          • Very weird. I've done extensive testing on my own and haven't had any problems — especially to that extent.

  • Bruce

    KICKASS!!! This is way better than I was doing it before. Thanks a bunch.

  • Pingback: links for 2009-06-10 | Mobile Technology Blog()

  • ramsepumsel

    Some questions…
    The $info['http_code'] == 200 check is clear, but why shouldn't the "else" branch also do this:
    – $i++
    – starting a new request

    I think a new URL should be added every time a request finishes, and it should be tested whether enough URLs are left in the array before starting a new one with an empty URL:
    if (count($urls) < $i+1)

    Please correct me.

    • I also thought that "if (count($urls) < $i+1)" should be used, but I tried it and it sent me into an infinite loop. I can't understand why, but apparently it doesn't work when that check is added.

      • You don't want to do that because that would start every request running at the same time — which would create issues if you had a lot of URLs to fetch. I intentionally used a rolling window to limit the number of simultaneous requests. It sounds like you might just want to increase the size of the rolling window. The default is 5, but you can safely bump that up to 100 or so as long as you are hitting distributed resources.

  • Jamie

    "// start a new request (it's important to do this before removing the old one)"

    can you say why? is there a bug if it is not performed in that order?

    • I'm struggling to remember exactly what happens, but yeah, basically it doesn't work if you switch the order. try it. 🙂

  • Prashanth

    Thanks a lot… this works great. But I am surprised… I see that no matter what, $urls[$i++] is added to the queue. How come there is no error when $urls[$i] is the last one in the queue, i.e. there is no $urls[$i+1]?

    Apologies if the question is naive.

    • $i++ is a post-increment. This means that $urls[$i] is added and THEN $i is incremented. We would probably have problems if we used ++$i. Also note that $i is just a counter, it doesn't control when the while loop stops, the variable $done handles that. Make sense?
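
      A quick illustration of the difference:

      $i = 0;
      $urls = array('a.com', 'b.com');
      echo $urls[$i++]; // prints 'a.com', then $i becomes 1
      echo $urls[++$i]; // $i becomes 2 first -- undefined offset, 'b.com' gets skipped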

      • Priit

        If you process the errors in the else {} part, you will see that up to 5 requests will return errors with code value 0 ($info['http_code'] == 0). You must check whether you have already sent all the URLs (in $urls) as requests before making another one (if (sizeof($urls) > $i + 1) { create new request }).

  • Superwayne

    @Prashanth
    This question is not naive at all … I had the same concerns, and they were confirmed by the following notice produced by PHP*:
    Notice: Undefined offset: 1692 in /home/me/bin/foobar.php on line 80

    Such notices are thrown out about a bunch of times, however, you can fix this problem easily by just adding the following if statement:

    if($i < count($urls)) {
    $ch = curl_init();
    $options[CURLOPT_URL] = $urls[$i++];
    curl_setopt_array($ch, $options);
    curl_multi_add_handle($mh, $ch);
    }

    * Notices are only displayed when you set the error reporting to E_ALL | E_NOTICE. This is why the author of this article probably just hasn't stumbled upon this problem yet.

  • johnrembo

    don't forget to set CURLOPT_BINARYTRANSFER=1, otherwise it will corrupt your files being transferred (*.gz, *.png etc…)

  • Felix

    I too have had problems with too big a window / too many URLs. If I have a window of 100+ with 2000 URLs, it'll only call back a random number of successfully fetched URLs… like 100-300. It's very irritating and I can't find any reason why. And it's not to do with memory; the box has 8 cores and 32GB of RAM, and the script process takes very little resources really.

    Would love to find out the cause, since I have to check roughly 2 million URLs every day and it gets slow with a window of only 50. In fact, regular curl_multi with 500 threads is faster right now. Let me know if you have any thoughts. I could even pay you if you find out the cause.

    • ronny

      Felix, I too crawl more than a million pages a day and have the same issues on a huge box…
      Have you found the solution?
      Best regards,
      Ronny

      fyi my post above:
      "i'm sorry to inform you, I have tried your code with 900 urls on a dedicated server with 1000Mbit connection, and window size of 10 only and it does not crawl all the 900 urls… between 30 to 140 urls randomly ??"

  • g00d

    Hi, Josh.

    I think this code is not the best way; see below…
    // start a new request (it's important to do this before removing the old one)
    $ch = curl_init();
    $options[CURLOPT_URL] = $urls[$i++]; // increment i
    curl_setopt_array($ch,$options);
    curl_multi_add_handle($master, $ch);

    I think it is probably better to make a small check:

    if (isset($urls[$i])) {
    $ch = curl_init();
    $options[CURLOPT_URL] = $urls[$i++]; // increment i
    curl_setopt_array($ch,$options);
    curl_multi_add_handle($master, $ch);
    }

    • I've added this to the Google Code project.

  • ram

    Nice work!
    restart a new request not only if $info['http_code'] == 200
    restart it every time a request completed.

    To set the usleep delay, use sys_getloadavg() to take a look at CPU load, like:
    $cpu = sys_getloadavg();
    if ($cpu[0] * 100 > 80) { $usleep += 10000; }
    if ($cpu[0] * 100 < 60) { $usleep -= 10000; }
    if ($cpu[0] * 100 < 50) { $usleep = 10000; }
    if ($usleep < 0) { $usleep = 0; }
    usleep($usleep);

    If $rolling_window gets bigger than the number of available $urls, it opens a new request with an empty URL, which slows down the script at the end.

    I noticed that if I use different $rolling_window sizes for different ping times, the download time gets smaller (useful if you download a lot of stuff from the same server). It would be helpful to write a script that finds the best combination of $rolling_window sizes for different ping times and for your own machine's power.

  • Kia

    I've looked at your code at googlecode.com.

    Is there a reason that you don't start a new request if the previous request failed? It seems to me that you should start a new request every time a previous request is done…

    I also wonder if it is possible to send the original URL to the callback function. That way it would be easier to identify which content belongs to which URL. Since requests can be redirected, the URL in $info can be different from the original URL.

    • Yeah, I'm not sure what I was thinking there. I've changed the code so that it starts a new request regardless of whether the previous one was successful or not.

      I'll have to look into your other question about keeping track of the URL.

      • Kia

        I hope you find a solution. If I don't know which content belongs to a certain URL, it is hard to use the code. But maybe I'm missing something? I haven't found any example code of how to use your library.

        • To get the URL, set a second parameter in the callback function. The second parameter contains the information passed to curl. So, for example, if your callback function was:
          <pre>
          function request_callback($result,$info) {
          echo md5($result)."
          ";
          echo $info["url''];
          }
          </pre>

          $info["url"] will return the URL of the request.

          Thanks a lot for taking the time to put this together, Josh. It has really helped me on a few projects.

  • Pingback: Everything comes down to the IP address()

  • Thanks for the code Josh, I've only just started with it but the php doc pages were basically useless, so I hope I can accomplish what I want with your class..
    What is the correct way to add curl options? I just want the header returned so I'm trying this

    $options = array(CURLOPT_HEADER => true, CURLOPT_NOBODY => true);
    $rc = new RollingCurl("request_callback");
    $rc->window_size = 5;
    foreach ($urls as $url) {
    $request = new Request($url, null, null, $options);
    //$request = new Request($url);
    $rc->add($request);
    }
    $rc->execute();

    but I still get the full page returned. If I set these options manually in the RollingCurl.php object it works as I want, but I'd prefer to be able to do it dynamically.

    Also, I was getting a PHP notice about this line
    if ($i < sizeof($requests) && isset($this->requests[$i++]) && $i < count($this->requests)) {

    so I changed it to
    if ($i < sizeof($this->requests) && isset($this->requests[$i++]) && $i < count($this->requests)) {

    Not sure if it's the right thing, but it made the notice go away.

    • The code for adding separate options for each request isn't right. I'm looking at fixing it now. In the meantime, it works fine if you add the options to $rc instead of each request individually:

      $urls = array(…);
      $options = array(CURLOPT_HEADER => true, CURLOPT_NOBODY => true);
      $rc = new RollingCurl("request_callback");
      $rc->window_size = 5;

      foreach ($urls as $url) {
      $request = new Request($url);
      $rc->add($request);
      }
      $rc->options = $options;
      $rc->execute();

      Good catch on the missing $this. I've pushed your fix to Google Code.

    • Okay, if you grab the latest version that issue should be fixed. Thanks for the heads up.

  • Brad

    Josh,

    There is an off-by-one bug in the rolling window logic when the window_size is less than the total number of URLs (e.g. 6 URLs and a window size of 3).
    In this example, the 4th URL gets skipped. If I change the window size to 4, the 5th URL gets skipped. I've tracked it down to the logic in this if statement:

    if ($i < sizeof($this->requests) && isset($this->requests[$i++])
    && $i < count($this->requests)) {
    $ch = curl_init();
    $options = $this->get_options($this->requests[$i]);
    curl_setopt_array($ch,$options);
    curl_multi_add_handle($master, $ch);
    }

    You are incrementing $i in the if statement, but you are using $i in the $this->requests[$i]. So it makes sense that the first link after the window size gets skipped.

    Thank you for your library. I very much hope to use it soon.

    • Ah, good catch & shame on me for not noticing that first. It should be fixed now on Google Code.

      • Brad

        Thank you for addressing this so quickly!

  • How to access Rolling Curl's Callback Function within a Parent Class?
    I am calling RollingCurl from within another class.
    How can I get RollingCurl to target the callback function within the class that called it? I am a little unfamiliar with callback functions and not quite sure how to implement one within a class. Many thanks in advance.

    Pseudocode:

    $scraper= new DummyScraper
    $urls= $scraper->getUrls()
    $rc = $scraper->rollingCurl($urls)

    Class DummyScraper{
    public $rc;

    function __construct(){
    $this->rc = new RollingCurl("request_callback");
    $this->rc->window_size = 20;
    }

    public function getURLs(){
    //get a bunch of urls to pass to Rolling Curl
    return $urls;
    }

    public function rollingCurl($url){

    foreach ($urls as $url)
    {
    $request = new Request($url);
    $this->rc->add($request);
    }

    $this->rc->execute();
    }

    public function request_callback($response, $info) {
    $titles[] = $m->xpath($response,"/html/body//a"); //Get all the links and store in an array
    }

    }

    Again. Many thanks in advance.

    • Interesting question. The first thing that comes to mind is to use create_function (http://ca.php.net/create_function) to create an anonymous function for the callback. It's got a bit of an ugly syntax, but it works great. Using the example I have on Google Code, it would look something like this:

      $callback = create_function('$response,$info', '
      // parse the page title out of the returned HTML
      if (eregi ("<title>(.*)</title>", $response, $out)) {
      $title = $out[1];
      }
      echo "$title
      ";
      print_r($info);
      echo "<hr>";
      ');

      $rc = new RollingCurl($callback);

      • Ugh, my commenting system slaughtered the code example. Hopefully you can still get the gist of it.
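
        For anyone copying it, here's a best-effort reconstruction of that snippet (assuming the tag stripped from the echo was a <br>):

        $callback = create_function('$response,$info', '
            // parse the page title out of the returned HTML
            if (eregi("<title>(.*)</title>", $response, $out)) {
                $title = $out[1];
            }
            echo "$title<br>";
            print_r($info);
            echo "<hr>";
        ');

        $rc = new RollingCurl($callback);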

  • Ok, after a little testing, the below works:
    Works:
    $callback = $this->sayHello("Is this parent method being called?");
    $this->rc = new RollingCurl($callback);

    Doesn't work ($response and $info are null; $scrape works):
    $callback = $this->attributeHTMLScraper($response,$info,$scrape="/html/body//a");
    $this->rc = new RollingCurl($callback);

    Here is the attributeHTMLScraper method in the parent class:
    public function attributeHTMLScraper($response,$info,$scrape)
    {
    var_dump($response); var_dump($info); //Both Null
    $dom = new DOMDocument();
    $dom->loadHTML($response);

    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate($scrape);

    if(!is_null($hrefs)):
    for ($i = 0; $i < $hrefs->length; $i++):
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $result[] =$url;
    endfor;
    endif;

    //Return a simple variable if 1 value is returned. Else return an array
    if(count($result)==1):
    return $result[0];
    else:
    return $result;
    endif;

    return $result;
    }

    How can I pass these parent methods to the callback variable?

    • My question above may be more of an OOP usage question than a rolling-curl question. That said, any help is greatly appreciated.

    • The below works 🙂
      Does anyone know if there is a way to do it without mucking up the code in the Rolling Curl class? Many thanks in advance.

      I added a reference to the parent class in the Rolling Curl constructor:
      function __construct($callback = null,$cls=null) {
      $this->parentClass = $cls;
      $this->callback = $callback;

      To target the callback function of the parent class, I replaced:
      // Send the return values to the callback function.
      $callback = $this->callback;
      $callback($output, $info);
      With:
      $this->callback= str_replace('$this->','',$this->callback); //remove '$this->' from the string
      eval('$this->parentClass->'.$this->callback.';');

      In my Parent Class, I called rolling curl like this:
      $callback = '$this->attributeHTMLScraper($output,$info,"/html/body//a")';
      $this->rc = new RollingCurl($callback,$this);

      As an alternative, I tried the below but could not get the $vArgs array to show up in the attributeHTMLScraper function.
      $vArgs = array($response,$info,'/html/body//a');
      call_user_func_array($this->parentClass->attributeHTMLScraper,$vArgs );

      If anyone knows a more elegant way to do this, I would be much appreciative. Still deep in that learning phase.
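
      For what it's worth, a less fragile route (a sketch, not from the posted library: it assumes RollingCurl is changed to invoke its callback via call_user_func, which accepts the standard array($object, 'method') callable form) would be:

      // inside RollingCurl, replace the direct variable call:
      //     $callback = $this->callback;
      //     $callback($output, $info);
      // with:
      //     call_user_func($this->callback, $output, $info);

      class DummyScraper {
          public $rc;

          function __construct() {
              // pass the object plus a method name as a standard PHP callable
              $this->rc = new RollingCurl(array($this, 'request_callback'));
              $this->rc->window_size = 20;
          }

          public function request_callback($response, $info) {
              // process $response here
          }
      }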

  • Freaky_gerbil

    Perfect, just what I was looking for, and easy for a noob like myself to implement 😀 Thank you

    • Freaky_gerbil

      I get the same problem with the disappearing URLs as previously mentioned. I have kept the rolling window at 5 and experimented with retrieving XML feeds from Amazon. Once I get up to 50 URLs, I am not getting the expected number of results. I have tried adding error handling, but there are no errors; the URLs just disappear.

      About the only thing I can think of at the moment is flagging each URL in the array and recursively processing until they are either flagged as completed or error. I will keep you posted.

  • Eric

    Rolling Curl simply just rocks! Thanks for all your time & effort on this. I'm amazed at what this can accomplish on so few CPU cycles.

    Who cares whether it's technically forking, threading, or otherwise… it works as advertised.

    One small issue on the blog presentation. While I know it should be obvious that the current code lives on Google, I think it would be advantageous (from a visual quick-scan standpoint) to replace the old code in the black area (on top of the blog post) with the current meat-and-potatoes, end-result functionality from the example on Google:

    <code>
    /*
    authored by Josh Fraser (http://www.joshfraser.com)
    released under Apache License 2.0
    */

    // a little example that fetches a bunch of sites in parallel and echos the page title and response info for each request

    require("RollingCurl.php");

    // top 20 sites according to alexa (11/5/09)
    $urls = array("http://www.google.com&quot;,
    "http://www.facebook.com&quot;,
    "http://www.yahoo.com&quot;,
    "http://www.youtube.com&quot;,
    "http://www.live.com&quot;,
    "http://www.wikipedia.com&quot;,
    "http://www.blogger.com&quot;,
    "http://www.msn.com&quot;,
    "http://www.baidu.com&quot;,
    "http://www.yahoo.co.jp&quot;,
    "http://www.myspace.com&quot;,
    "http://www.qq.com&quot;,
    "http://www.google.co.in&quot;,
    "http://www.twitter.com&quot;,
    "http://www.google.de&quot;,
    "http://www.microsoft.com&quot;,
    "http://www.google.cn&quot;,
    "http://www.sina.com.cn&quot;,
    "http://www.wordpress.com&quot;,
    "http://www.google.co.uk&quot;);

    function request_callback($response, $info) {
    // parse the page title out of the returned HTML
    if (eregi ("<title>(.*)</title>", $response, $out)) {
    $title = $out[1];
    }
    echo "$title
    ";
    print_r($info);
    echo "<hr>";
    }

    $rc = new RollingCurl("request_callback");
    $rc->window_size = 20;
    foreach ($urls as $url) {
    $request = new Request($url);
    $rc->add($request);
    }
    $rc->execute();
    </code>

    And provide the direct Google trunk link for those who are not used to working with a repository:

    http://code.google.com/p/rolling-curl/source/brow

    I know for myself, when I'm scanning a bunch of sites for the right solution, it's nice to see a quick visual reference.

    Cheers.

  • Eric P

    I need to access Rolling Curl results based on specific URL sequencing.

    For smaller requests, I'm using the code below and it's working fine.

    For larger requests, I could write the results to disk with sequential file naming for post sequential compilation, rather than storing all the results in memory.

    I'm thinking the best way to handle it dynamically would be to create a function to write sequentially named files to disk, based on a defined size/memory threshold; otherwise, handle via an in-memory associative array as below.

    Thoughts?

    function request_callback($response,$info) {
    global $rc_result ;
    // get last character of URL to enable indexing of results
    // e.g., $url[] = http://mydomain.com?request=1, $url[] = http://mydomain.com?request=2 , etc.
    $index_id = $info['url'][strlen($info['url'])-1];
    $rc_result[$index_id] = $response;
    }

    $rc = new RollingCurl("request_callback");
    $rc->window_size = 10;
    foreach ($urls as $url) {
    $request = new Request($url);
    $rc->add($request);
    }
    $rc->execute();

    ksort($rc_result,SORT_NUMERIC);
    print_r($rc_result);
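
    One way to realize the size-threshold idea (a sketch; the 1 MB limit and the file naming scheme are illustrative only):

    function request_callback($response, $info) {
        global $rc_result;
        // same last-character indexing as above
        $index_id = substr($info['url'], -1);

        if (strlen($response) > 1048576) { // hypothetical 1 MB threshold
            $file = "rc_result_" . $index_id . ".html";
            file_put_contents($file, $response);
            $rc_result[$index_id] = $file; // store the filename instead of the content
        } else {
            $rc_result[$index_id] = $response;
        }
    }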

  • no english

    I would like to add two attributes (public $response; public $output;) so that, when there is no callback function, the results can still be used for external processing. Thank you.
    Sorry for my English; I used Google Translate.

  • Any ideas on matching responses to the requests that produced them?

    I need each response to be identifiable after it has fulfilled its request.

    • Ken

      trying to implement this, still no luck ;(

      • tojochacko

        Try the second parameter in the callback function. It's an array, and you may be able to match the response to the request using the url key.

  • No Solution yet. Anyone?

  • Priit

    Thank you very much for this piece of code!

    There is just one minor error I noticed. Before creating a new handle, you should check whether there are any URLs left in the array to add. If you don't check that, you will create up to 5 empty requests (that will return error code 0) after you have processed all the URLs in the $urls array. So it should look like this:

    // start a new request (it's important to do this before removing the old one)
    if (sizeof($urls) > $i + 1) { …. start new request …. }

    • Mark

      if (sizeof($urls) >= $i + 1) {

  • I'm trying to access the cookies from the callback function. Is it possible? I check my cookie file and it is still empty when I try to access it from the callback function.

    Thanks a lot
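
    One thing to try (a sketch that uses the $custom_options parameter of rolling_curl; note that curl only writes the cookie jar when a handle is closed, which may be why the file looks empty from inside the callback):

    $cookie_jar = '/tmp/rolling_curl_cookies.txt'; // hypothetical path

    $custom_options = array(
        CURLOPT_COOKIEFILE => $cookie_jar, // send cookies from this file with each request
        CURLOPT_COOKIEJAR  => $cookie_jar, // write cookies back when the handle is closed
    );

    rolling_curl($urls, 'my_callback', $custom_options);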

  • Subodh

    Hi, I am using your code for one of my projects where I have a set of files to download, which I do simultaneously.
    I hit a problem in the callback:

    function requestCallback($response, $info, $urls_array = null) {
    $filelocation = getFileLocation($urls_array, $info['url']);
    if (isset($filelocation)){
    if (file_put_contents($filelocation.".zip", $response)) {
    $zip = new ZipArchive();
    if ($zip->open($filelocation.".zip") === TRUE) {
    echo "n unzipping ".$filelocation.".zip n";
    makeDirectory($filelocation);
    $zip->extractTo($filelocation);
    echo "Completed unzipping ".$filelocation.".zip n";
    } else {
    logErrorMail("Archive ".$filelocation.".zip is invalid or corrupt");
    }
    $zip->close();
    } else {
    logErrorMail("Error: unable to write to zip file");
    }
    } else {
    logErrorMail("Error: Cannot find file location");
    }
    }

  • Subodh

    This is the callback that is called. Is the callback synchronous? I.e., when one of the handles is done, it calls this function. The output is a zip, so I write it to a zip file locally. If successful, I unzip it.
    This writing to file and unzipping takes time, and only after it completes does the download of the next items resume.

    In my case, I am downloading 3 files in a batch. If the 1st file downloads successfully, it calls the callback. After the writing to zip and unzipping finished, I got this error:

    * Connection #2 to host xxxxx left intact
    92 138M 92 127M 0 0 23996 0 1:40:41 1:32:50 0:07:51 129
    99 129M 99 128M 0 0 24250 0 1:33:16 1:32:50 0:00:26 327* Connection #2 seems to be dead!
    * Closing connection #2
    * About to connect() to xxxxx port xx (#2)
    * Trying X.X.X.X…
    92 138M 92 127M 0 0 23665 0 1:42:06 1:34:09 0:07:57 126
    99 129M 99 128M 0 0 23916 0 1:34:34 1:34:09 0:00:25 275* Connected to XXXX (X.X.X.X) port X (#2)
    > GET XXXX/daily.zip HTTP/1.1
    User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
    Host: XXXXX
    Accept: */*

  • Nick Smith

    Excellent code, thanks 🙂 Just to let others know who might be struggling to get it to work: curl_multi_info_read() doesn't work in PHP versions before 5.2.0, and returns NULL immediately.
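
    A guard like this (a sketch) makes that failure loud instead of silent:

    if (version_compare(PHP_VERSION, '5.2.0', '<')) {
        die("rolling_curl requires PHP 5.2.0+ (curl_multi_info_read returns NULL on older versions)\n");
    }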

  • Firstly, apologies if this is a bit of a noob question but can anyone give any tips on how to get proper responses from the following site using RollingCurl? Is it to do with cookies? I am not a programmer by any stretch of the imagination so could do with some pointers.

    The URL is "http://logis.korail.go.kr/getcarinfo.do?car_no=&quot; with a number appended to it, ranging from 8201 through 8286. The first time you enter the URL into your browser you get a login screen back. If you refresh, or enter a URL with a different number appended, you get one of two different responses. The first response has two input boxes, one of which is populated with the number you appended to the URL, the second being empty. The second response is the same as the first with the addition of two tables, with various data fields, underneath the two input boxes.

    When you use the code below, the HTML received back is the login page in every instance. How do I implement RollingCurl so I get one of the other two responses back?

    <?php

    // PROCESS RESPONSE
    function request_callback($response, $info) {
    echo($response);
    echo "<hr>";
    }

    // REQUIRE ROLLING CURL
    require("RollingCurl.php");

    // POPULATE LOCO ARRAY
    $class = 8201;
    $class_size = 5; //RESTRICTED TO 5 FOR TESTING PURPOSES, SHOULD BE 86!
    for ($i=0;$i<$class_size;$i++){
    $loco[]=$class+$i;
    }

    // POPULATE URL ARRAY
    $url = 'http://logis.korail.go.kr/getcarinfo.do?car_no=';
    for ($i=0;$i<sizeof($loco);$i++){
    $urls[]=$url.$loco[$i];
    }

    // FETCH URLS
    $rc = new RollingCurl("request_callback");
    $rc->window_size = $class_size;
    foreach ($urls as $url) {
    $request = new Request($url);
    $rc->add($request);
    }
    $rc->execute();

    ?>

  • floesen

    Indeed. It's a beautiful piece of code!
    This post, together with your post on http://www.askapache.com/php/curl-multi-downloads… just made me understand how to use multi handles in a useful way!

  • It's important to do this before removing the old one

  • I'm not a PHP expert, and I'm looking for a solution to download multiple images at one time.
    This class looks like it will work for me, but I don't know how to save the files to a defined directory:

    eg:
    $imgs = array("http://l.yimg.com/a/i/us/pim/dclient/cg504_5/img/md5/509840ceb0dd52f5f024dba77099b4b0_1.gif&quot;,
    "http://www.onlineaspect.com/wp-content/themes/onlineaspect/images/lego.png&quot;);

    save to
    $maindir = "images";
    $dirs = array ($maindir ."/yahoo", $maindir ."/onlineaspect");

    I don't know how to pass the directory variable to the callback to save each image to its directory.
    Thanks for any example.

    • Ah, great question. The trick here is to make $maindir a global variable inside your callback like this:

      function callback($result) {
      global $maindir;

      }
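
      A fuller sketch of the same idea (it assumes the two-argument callback form discussed earlier in the thread so the URL is available; the host-to-directory map is illustrative):

      function save_callback($result, $info) {
          global $maindir;

          // illustrative mapping from source host to target subdirectory
          $dirs = array(
              'l.yimg.com'           => $maindir . '/yahoo',
              'www.onlineaspect.com' => $maindir . '/onlineaspect',
          );

          $host = parse_url($info['url'], PHP_URL_HOST);
          $file = basename(parse_url($info['url'], PHP_URL_PATH));

          if (isset($dirs[$host]) && $file != '') {
              if (!is_dir($dirs[$host])) {
                  mkdir($dirs[$host], 0777, true);
              }
              file_put_contents($dirs[$host] . '/' . $file, $result);
          }
      }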

  • Parse error: syntax error, unexpected ';' in /home/migcybe1/public_html/php/multi_eksekusi_3.php on line 6
    when I test this script:
    <?php
    function rolling_curl($urls, $callback, $custom_options = null) {

    // make sure the rolling window isn't greater than the # of urls
    $rolling_window = 5;
    $rolling_window = (sizeof($urls) < $rolling_window) ? sizeof($urls) : $rolling_window;

    $master = curl_multi_init();
    $curl_arr = array();

    // add additional curl options here
    $std_options = array(CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS => 5);
    $options = ($custom_options) ? ($std_options + $custom_options) : $std_options;

    // start the first batch of requests
    for ($i = 0; $i < $rolling_window; $i++) {
    $ch = curl_init();
    $options[CURLOPT_URL] = $urls[$i];
    curl_setopt_array($ch,$options);
    curl_multi_add_handle($master, $ch);
    }

    do {
    while(($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
    if($execrun != CURLM_OK)
    break;
    // a request was just completed -- find out which one
    while($done = curl_multi_info_read($master)) {
    $info = curl_getinfo($done['handle']);
    if ($info['http_code'] == 200) {
    $output = curl_multi_getcontent($done['handle']);

    // request successful. process output using the callback function.
    $callback($output);

    // start a new request (it's important to do this before removing the old one)
    $ch = curl_init();
    $options[CURLOPT_URL] = $urls[$i++]; // increment i
    curl_setopt_array($ch,$options);
    curl_multi_add_handle($master, $ch);

    // remove the curl handle that just completed
    curl_multi_remove_handle($master, $done['handle']);
    } else {
    // request failed. add error handling.
    }
    }
    } while ($running);

    curl_multi_close($master);
    return true;
    }
    ?>
    How do I solve this? Thanks.

  • C4rter

    Hi Josh,

    I'm using your nice piece of code for a lot of data (3.5 million requests).
    My rolling window is 10.
    But after a while my working machine is out of memory.

    I'm trying to find the memory leak and I noticed you don't close the single curl handles.
    I think, after
    curl_multi_remove_handle($master, $done['handle']);
    you have to call
    curl_close($done['handle']);
    to totally close the handle, because
    curl_multi_close($master);
    closes the master but not the single handles.

    And can you explain why you have to "// start a new request" if you just started all of the requests in the "for" loop?

    • Ah, good catch. Yes, you probably want to close those.

      I haven't tried reusing the connections, or at least I don't remember experimenting with that. Would be interesting to see if you can make that work. If you do, please post your results here so others can gain from them.

      Thanks!
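
      For anyone experimenting with connection reuse, the rough shape would be (an untested sketch; curl may need option resets between uses):

      // instead of curl_init() for every new URL, recycle the finished handle
      $handle = $done['handle'];
      curl_multi_remove_handle($master, $handle);
      if ($i < sizeof($urls)) {
          curl_setopt($handle, CURLOPT_URL, $urls[$i++]);
          curl_multi_add_handle($master, $handle); // re-add the same handle
      } else {
          curl_close($handle);
      }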

  • Yaffle

    Hello! Great article!
    I wrote a simple class with similar functionality; I use anonymous functions for callbacks: https://github.com/Yaffle/MultiGet