Please visit http://code.google.com/p/rolling-curl/ for the latest version.
A more efficient implementation of curl_multi()
curl_multi is a great way to process multiple HTTP requests in parallel in PHP. curl_multi is particularly handy when working with large data sets (like fetching thousands of RSS feeds at one time). Unfortunately there is very little documentation on the best way to implement curl_multi. As a result, most of the examples around the web are either inefficient or fail entirely when asked to handle more than a few hundred requests.
The problem is that most implementations of curl_multi wait for each set of requests to complete before processing them. If there are too many requests to process at once, they usually get broken into groups that are then processed one at a time. The problem with this is that each group has to wait for the slowest request to download. In a group of 100 requests, all it takes is one slow one to delay the processing of 99 others. The larger the number of requests you are dealing with, the more noticeable this latency becomes.
The solution is to process each request as soon as it completes. This eliminates the wasted CPU cycles from busy waiting. I also created a queue of cURL requests to allow for maximum throughput. Each time a request is completed, I add a new one from the queue. By dynamically adding and removing links, we keep a constant number of links downloading at all times. This gives us a way to throttle the number of simultaneous requests we are sending. The result is a faster and more efficient way of processing large quantities of cURL requests in parallel.
function rolling_curl($urls, $callback, $custom_options = null) {

    // make sure the rolling window isn't greater than the # of urls
    $rolling_window = 5;
    $rolling_window = (sizeof($urls) < $rolling_window) ? sizeof($urls) : $rolling_window;

    $master = curl_multi_init();
    $curl_arr = array();

    // add additional curl options here
    $std_options = array(CURLOPT_RETURNTRANSFER => true,
                         CURLOPT_FOLLOWLOCATION => true,
                         CURLOPT_MAXREDIRS => 5);
    $options = ($custom_options) ? ($std_options + $custom_options) : $std_options;

    // start the first batch of requests
    for ($i = 0; $i < $rolling_window; $i++) {
        $ch = curl_init();
        $options[CURLOPT_URL] = $urls[$i];
        curl_setopt_array($ch,$options);
        curl_multi_add_handle($master, $ch);
    }

    do {
        while(($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
        if($execrun != CURLM_OK)
            break;
        // a request was just completed -- find out which one
        while($done = curl_multi_info_read($master)) {
            $info = curl_getinfo($done['handle']);
            if ($info['http_code'] == 200) {
                $output = curl_multi_getcontent($done['handle']);

                // request successful. process output using the callback function.
                $callback($output);

                // start a new request (it's important to do this before removing the old one)
                $ch = curl_init();
                $options[CURLOPT_URL] = $urls[$i++]; // increment i
                curl_setopt_array($ch,$options);
                curl_multi_add_handle($master, $ch);

                // remove the curl handle that just completed
                curl_multi_remove_handle($master, $done['handle']);
            } else {
                // request failed. add error handling.
            }
        }
    } while ($running);

    curl_multi_close($master);
    return true;
}
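To make the batch-versus-rolling argument concrete, here is a small simulation that needs no network at all. It is only a sketch: the function names `batch_total` and `rolling_total` and the download times are made up for illustration, and it assumes each request's duration is known up front, which is never true in practice.

```php
<?php
// Total time when requests are processed in fixed groups:
// every group has to wait for its slowest member.
function batch_total(array $durations, int $window): float {
    if ($window < 1 || !$durations) {
        return 0.0;
    }
    $total = 0.0;
    foreach (array_chunk($durations, $window) as $group) {
        $total += max($group);
    }
    return $total;
}

// Total time with a rolling window: as soon as one slot frees up,
// the next queued request starts in it.
function rolling_total(array $durations, int $window): float {
    if ($window < 1 || !$durations) {
        return 0.0;
    }
    // $slots[n] = the time at which slot n becomes free again
    $slots = array_fill(0, min($window, count($durations)), 0.0);
    foreach ($durations as $d) {
        sort($slots);     // the earliest-free slot moves to index 0
        $slots[0] += $d;  // the next request runs in that slot
    }
    return max($slots);
}

// One slow request (10s) followed by eight fast ones (1s each), window of 3.
$durations = array(10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0);
echo batch_total($durations, 3), " vs ", rolling_total($durations, 3), "\n";
```

With these made-up numbers the grouped approach takes 12 seconds and the rolling window takes 10, because the eight fast requests flow through the two free slots while the slow one is still downloading.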
Note: I set my max number of parallel requests ($rolling_window) to 5. Be sure to update this value according to the bandwidth available on your server / servers you are curling. Be nice and read this first.
Updated 3/6/09: Fixed a missing semi-colon. Thanks to Steve Gricci for catching the typo.
Updated 4/2/09: Made some changes to increase reusability. rolling_curl now expects a $callback parameter for a function that will process each response. It also accepts an array called $options that lets you add custom curl options such as authentication, custom headers, etc.
Updated 4/8/09: Fixed a new bug that was introduced with the last update. Thanks to Damian Clement for alerting me to the problem.


[...] http://onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blocking/ (in case you need to throttle the # of calls you're making simultaneously) [...]
Thanks very much for your good solution, Josh.
But your code failed to deal with a huge number of URLs (around 100,000) and exited without any warning.
If you want, I can send the code to you.
Always keep in mind that some firewalls will block you if you open more than 6 requests to one hostname. This is not allowed per RFC definition and you can – no you WILL overload the remote server. You will bring the server down if you open 100 simultaneous requests per second! Do not overload other servers… this is like a DDoS attack and IDS (intrusion detection systems) will block you completely.
thnx guys this was useful in debugging some of the issues i was facing
This is a good reminder for everyone. Marc, thanks for bringing this up.
There is no standard answer. Can you ask the people you're scraping?
Josh,
I've tried asking the people I'm scraping but haven't had any replies from any enquiries I've made to the webmaster email addresses so I guess I'll try upping my limit by 1 at a time and see how I go on??
I tried using the example you gave using twitter: –
>Sure. Here's a simple example using twitter:
>function twitter_callback($output) {
>$results = json_decode($output);
>print_r($results);
>echo "<hr>";
>}
>$urls[] = "http://twitter.com/users/show.json?screen_name=joshfraz";
>$urls[] = "http://twitter.com/users/show.json?screen_name=eventvue";
>rolling_curl($urls,'twitter_callback');
I replaced the two URL's with
$urls[] = "http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7001";
$urls[] = "http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7002";
and modified the function to parse $output to suit my requirements but my browser just takes the returned HTML from the URL's I've visited and displays it on screen. How do I stop this from happening?
sounds like you have an extra print_r or echo in your code somewhere.
Josh,
The only code I used is as follows: –
<?php
include("rolling_curl.php"); //Unmodified rolling_curl copied straight from this page
function korail_callback($output) {
//Do nothing
}
$urls[] = "http://www.google.co.uk";
rolling_curl($urls,'korail_callback');
?>
The result on screen is Google's homepage. I was expecting a blank screen, am I wrong to expect this?
whoops. looks like i introduced a bug w/ my last update.
if ($custom_options) {
$options = $std_options + $custom_options;
}
should really have been:
$options = ($custom_options) ? ($std_options + $custom_options) : $std_options;
I've updated the post with this fix so if you copy the new code it should work. Sorry about that!
Josh,
please excuse the beginners question, but what variable type is $output returned in? I can use "strlen" to determine the length of $output but "substr" doesn't seem to work.
In my original linear code I used the expression, $output = file_get_contents($urls), and was able to parse $output for various HTML fragments, however it doesn't work on $output returned using your Rolling_curl function. Does it need modifying to suit my needs?
My code:
<?php
include("rolling_curl.php");
function korail_callback($output) {
echo strlen ($output); //returns a value no problem
echo "
";
echo substr($output,0,10); //doesn't do anything
echo "
";
}
$urls[]='http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7001';
$urls[]='http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7002';
$urls[]='http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7003';
$urls[]='http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7004';
$urls[]='http://logis.korail.go.kr/driveinfo/TrainLocSearchp.jsp?carNo=7005';
rolling_curl($urls,'korail_callback');
?>
It returns a string as stated in the documentation for curl_multi_getcontent():
http://us.php.net/manual/en/function.curl-multi-g…
a nit that embarrassed me a touch (though I blame it on being up all night and then some). At the point where you: "// start a new request (it's important to do this before removing the old one)" … it's best to check to make sure $i < count($urls);
I also changed the last few lines to:
$ready = curl_multi_select($mh);
if ($ready != -1) $execrun = curl_multi_exec($mh,$running);
} while ($running && ($ready != -1)); // do…
but I don't know if that was necessary. I was just having "issues".
Kaolin, I'm almost done with a new version that will bring a lot of improvements to this code. Look for it some time later this week. I've fixed a few small problems like the one you mentioned and made it object oriented for increased reusability.
really good solution
congratulations !
If $callback takes too long, won't all the requests finish before starting another? Is there a way over it?
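In the function above the callback is called directly inside the download loop, so a slow callback does hold up the next call to curl_multi_exec(). One possible way around it, sketched here under the assumption that you can afford to keep the responses in memory, is to make the callback do nothing but store the output and to run the slow processing after rolling_curl() returns. The helper name `make_collector` is made up for this example, and the loop over fake outputs stands in for rolling_curl() calling the callback.

```php
<?php
// Hypothetical helper: returns a cheap callback that only stores each response.
function make_collector(array &$store): callable {
    return function ($output) use (&$store) {
        $store[] = $output; // no parsing here, so the download loop resumes quickly
    };
}

$responses = array();
$callback = make_collector($responses);

// Stand-in for rolling_curl($urls, $callback) firing the callback
// each time a request completes.
foreach (array('<a>', '<b>', '<c>') as $fake_output) {
    $callback($fake_output);
}

// The slow work happens only after all transfers are done.
$lengths = array_map('strlen', $responses);
print_r($lengths);
```

The trade-off is memory: every response is held until the end, so this fits small pages better than large downloads.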
Great solution Josh!
What if I need an additional variable (say URL id from the database) going through the callback function? I tried passing the $info as well but it loses the original URL when redirected, so I cannot use the callback function to update the URL status on my database.
Michael,
Great question. I don't have an answer for you off the top of my head. I'll be thinking about it and will let you know if I come up with something.
Josh
What I'm doing, which may be implicitly buggy (working through it right now–it seems to fail nondeterministically) is keeping a lookup hash for what $ch is attached to what $id ..
Would be happy to see your code. Ideally I would have the id (or even the $i) sent to the callback function
[...] I came across some interesting classes and functions for using curl_multi. One of them is in this post, which contains an efficient implementation of a function for anyone who needs to make a large number of [...]
Hey Josh! Thanks for the code snippet. Any news on the OO class that you were working on?
Hi Michael,
I too was looking out information on the same and came across this simple solution which may help all of us. After one gets the following code executed,
$info = curl_getinfo($done['handle']);
we can just simply retrieve the URL by adding this code.
$url = $info['url'];
Once you have the URL, you surely can come up with the ID used. And with the code of the above function, you can send across the response back to the callback by adding this as additional parameter.
Hope this works out.
I'm not sure this will work since the whole issue started with URLs being redirected and changed when $done. I ended up doing something similar to what kaolie had suggested (see commented below):
$handles = array(); // create a handles array
for ($i = 0; $i < $rolling_window; $i++) {
$ch = curl_init();
$handles[(int)$ch] = $url['key']; // store each ch handle along with the relevant $i (can be also the original url itself)
$options[CURLOPT_URL] = $urls[$i];
curl_setopt_array($ch,$options);
curl_multi_add_handle($master, $ch);
}
// request successful. process output using the callback function.
$callback($output, $handles[(int)$ch]); // you now have the original url id/key/value to your callback
Hope this helps anyone,
Michael
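A note on the lookup-array idea for newer PHP versions: since PHP 8.0, curl_init() returns a CurlHandle object rather than a resource, so the (int) cast no longer gives a usable key. The sketch below uses a made-up helper, `handle_key`, that works for both cases, and it looks up the finished handle rather than the most recently created one. Plain stdClass objects stand in for cURL handles so the example runs without the curl extension.

```php
<?php
// Hypothetical helper: a stable array key for a cURL handle.
// Objects (PHP 8+) get their object id; resources (older PHP) cast to int.
function handle_key($ch) {
    return is_object($ch) ? spl_object_id($ch) : (int) $ch;
}

$ids = array();        // handle key => your own id (e.g. a database row id)

$a = new stdClass();   // stand-ins for two curl_init() results
$b = new stdClass();
$ids[handle_key($a)] = 101;
$ids[handle_key($b)] = 102;

// Later, curl_multi_info_read() reports a finished handle in $done['handle'].
// Look up THAT handle, not whatever $ch happens to hold at the moment.
$finished = $b;
$id = $ids[handle_key($finished)];

// Drop the entry once handled: object ids can be reused after an object is freed.
unset($ids[handle_key($finished)]);

echo $id, "\n";
```

The id can then be passed to the callback as a second argument, as in the snippet above.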
On non standard http requests, the $info['http_code'] == 200 will fail. Any solution? Thanks.
Maybe checking $info['header_size'] == 1 ?!
I love that my blog commenters are so much smarter than me.
We should probably check for any 2xx http_code. Is there any good reason to continue if we get any of the other codes?
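A minimal sketch of the "any 2xx" check, assuming the code comes from `$info['http_code']` as in the function above. The helper name `is_http_success` is made up for this example.

```php
<?php
// Hypothetical helper: treat every 2xx status as success.
function is_http_success($code) {
    return $code >= 200 && $code < 300;
}

foreach (array(200, 204, 301, 404, 0) as $code) {
    echo $code, ': ', is_http_success($code) ? 'ok' : 'failed', "\n";
}
```

In the loop, `if ($info['http_code'] == 200)` would become `if (is_http_success($info['http_code']))`. A code of 0 falls on the failed side too.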
i'm sorry to inform you, I have tried your code with 900 urls on a dedicated server with 1000Mbit connection, and window size of 10 only and it does not crawl all the 900 urls… between 30 to 140 urls randomly ??
any ideas why?
I do:
if(curl_errno($ch) == 0 && $info['http_code'] == 200)
Are you hitting 1 site or 900 different sites? If you're hitting 1 site you might want to make sure you're not getting blocked.
Otherwise, my guess is there is some setting or limitation on your server that is limiting the number of connections. How much memory do you have? I've run into the problem of dropped urls before, but only with a window size of several hundred. One of the problems with multi_curl is that it tends to fail silently. Please let me know if you figure out what is going on. I'd love to find a solution besides "use a smaller window size".
none of the 900 urls is on the same server, and i have 8gb of ram with 16 cores.. so i doubt that i have any server limitation..
if i use a different function that gathers all the data by "blocks" of 500 for example, and then parses everything it seems to work.. the part that parses the data as soon as it comes in your function might be making things a bit shaky,
try getting 900 urls, and callback to a function that counts the amount of urls that were called back (declare the count var global in the function to enable shared access from all the callbacks) and you'll see very few get called back.. even when you make a like a window size of 50… whereas if i use another function that doesn't process results as soon as they arrive, all pages get fetched and i process later but its slower… it would be awesome if you found out what's the reason in your code.. as i'm sure it would be faster!
why don't you echo something simpler on the callback function (say "done\n") and count the number of times it displays? do the same with errors (i.e. replacing "// request failed. add error handling." with "error\n"). I had similar issues and it was because of faulty success and error callback functions.
just tried it with 800 urls.
21 done,
4 failed.
775 vanished??
Very weird. I've done extensive testing on my own and haven't had any problems — especially to that extent.
KICKASS!!! This is way better than I was doing it before. Thanks a bunch.
some questions…
checking $info['http_code'] == 200 is clear, but why shouldn't these also be done in the "else" branch:
- $i++
- starting a new request
i think a new url should be added every time a request finishes,
and before starting a new one it should be tested whether enough urls are left in the array, so that no request is started with an empty url
if (count($urls) < $i+1)
please correct me
i also thought that "if (count($urls) < $i+1)" should be used, but I tried it and it sent me into an infinite loop. I can't understand why, but apparently it doesn't work when if (count($urls) < $i+1) is added.
You don't want to do that because that would start every request
running at the same time — which would create issues if you had a lot
of URLs to fetch. I intentionally used a rolling window to limit the
number of simultaneous requests. It sounds like you might just want
to increase the size of the rolling window. The default is 5, but you
can safely bump that up to 100 or so as long as you are hitting
distributed resources.
"// start a new request (it's important to do this before removing the old one)"
can you say why? is there a bug if it is not performed in that order?
I'm struggling to remember exactly what happens, but yeah, basically it doesn't work if you switch the order. try it.
Thanks a lot…this works great. But I am surprised….I see that no matter what, $urls[$i++] is added to the queue. How come there is no error when $urls[$i] is the last one in queue…ie…there is no $urls[$i++]???
Apologies if the question is naive.
$i++ is a post-increment. This means that $urls[$i] is added and THEN $i is incremented. We would probably have problems if we used ++$i. Also note that $i is just a counter, it doesn't control when the while loop stops, the variable $done handles that. Make sense?
@Prashanth
This question is not naive at all … I had the same concerns, and they were confirmed by the following notice produced by PHP*:
[php]Notice: Undefined offset: 1692 in /home/me/bin/foobar.php on line 80[/php]
Such notices are thrown out about a bunch of times, however, you can fix this problem easily by just adding the following if statement:
[php]
if($i < count($urls)) {
$ch = curl_init();
$options[CURLOPT_URL] = $urls[$i++];
curl_setopt_array($ch, $options);
curl_multi_add_handle($mh, $ch);
}
[/php]
* Notices are only displayed when you set the error reporting to E_ALL | E_NOTICE. This is why the author of this article probably just hasn't stumbled upon this problem yet.
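The guard and the post-increment can be seen in isolation with a small helper. This is only an illustration: `next_url` is a made-up name, and the URLs are placeholders.

```php
<?php
// Hypothetical helper: returns the next URL and advances $i,
// or returns null (leaving $i alone) once the list is exhausted.
function next_url(array $urls, &$i) {
    if ($i < count($urls)) {
        return $urls[$i++]; // post-increment: read $urls[$i] first, then bump $i
    }
    return null;
}

$urls = array('http://a.example/', 'http://b.example/');
$i = 1;                         // pretend the first URL is already in flight

$first  = next_url($urls, $i);  // last remaining URL; $i becomes 2
$second = next_url($urls, $i);  // nothing left; no "Undefined offset" notice

var_dump($first, $second, $i);
```

Note that the test and the read have to use the same value of $i. Checking `isset($urls[$i++])` and then reading `$urls[$i]` would look at one element and fetch the next one, skipping a URL.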
don't forget to set CURLOPT_BINARYTRANSFER=1, otherwise it will corrupt your files being transferred (*.gz, *.png etc…)
I too have had problems with having too big a window/too many urls. If I have a window of 100+ with 2000 urls, it'll only call back a random number of successfully fetched urls.. like 100-300. It's very irritating and I can't find any reason why. And it's not to do with memory, the box has 8 cores and 32gb of ram.. and the script process takes very little resources really.
Would love to find out the cause.. since I have to check roughly 2 million urls every day and it gets slow with a window of only 50.. in fact, regular curl_multi with 500 threads is faster right now. Let me know if you have any thoughts. I could even pay you if you find out the cause.
Hi, Josh.
I think this code is not a good approach, see below.
// start a new request (it's important to do this before removing the old one)
$ch = curl_init();
$options[CURLOPT_URL] = $urls[$i++]; // increment i
curl_setopt_array($ch,$options);
curl_multi_add_handle($master, $ch);
I think a better way is probably to add a small check:
if (isset($urls[$i++])) {
$ch = curl_init();
$options[CURLOPT_URL] = $urls[$i]; // increment i
curl_setopt_array($ch,$options);
curl_multi_add_handle($master, $ch);
}
felix, i too crawl more than a million pages a day and have the same issues on a huge box…
have you found the solution?
best regards,
Ronny
fyi my post above:
"i'm sorry to inform you, I have tried your code with 900 urls on a dedicated server with 1000Mbit connection, and window size of 10 only and it does not crawl all the 900 urls… between 30 to 140 urls randomly ??"
Nice work!
restart a new request not only if $info['http_code'] == 200
restart it every time a request completed.
to set usleep use sys_getloadavg() to take a look at cpu performance.
like:
$cpu = sys_getloadavg();
if ($cpu[0] * 100 > 80) { $usleep += 10000; }
if ($cpu[0] * 100 < 60) { $usleep -= 10000; }
if ($cpu[0] * 100 < 50) { $usleep = 10000; }
if ($usleep < 0) { $usleep = 0; }
usleep($usleep);
if $rolling_window gets bigger than the number of available $urls,
it opens a new request with an empty url, which slows down
the script at the end.
I noticed that if I use different $rolling_window sizes
for different ping times, the download time gets smaller
(useful if you download a lot of stuff from the same server).
it would be helpful to write a script which finds the best combination
of $rolling_window sizes for different ping times and for
your own pc's power.
I've looked at your code at googlecode.com.
Is there a reason that you don't start a new request if the previous request failed? It seems to me that you should start a new request every time a previous request is done…
I also wonder if it is possible to send the original url to the callback function. That way it would be easier to identify which content originates from which url. Since requests can be redirected, the url in $info can be different than the original url.
Yeah, I'm not sure what I was thinking there. I've changed the code so that it starts a new request regardless of whether the previous one was successful or not.
I'll have to look into your other question about keeping track of the URL.
I've added this to the Google Code project.
I added this to the Google Code project.
I hope you find a solution. If I don't know which content belongs to a certain URL it is hard to use the code. But maybe I'm missing something? I haven't found any example code of how to use your library.
[...] years the code I’ve used to fetch a feed has drastically improved. I made optimizations to fetch feeds in parallel. I started keeping track of feeds that failed regularly so I could fix or eliminate them. I [...]
To get the URL, set a second parameter in the callback function. The second parameter contains the information passed to curl. So, for example, if your callback function was:
<pre>
function request_callback($result,$info) {
echo md5($result)."\n";
echo $info["url"];
}
</pre>
$info["url"] will return the URL of the request.
Thanks a lot for taking the time to put this together, Josh. It has really helped me on a few projects.
Thanks for the code Josh, I've only just started with it but the php doc pages were basically useless, so I hope I can accomplish what I want with your class..
What is the correct way to add curl options? I just want the header returned so I'm trying this
$options = array(CURLOPT_HEADER => true, CURLOPT_NOBODY => true);
$rc = new RollingCurl("request_callback");
$rc->window_size = 5;
foreach ($urls as $url) {
$request = new Request($url, null, null, $options);
//$request = new Request($url);
}
$rc->execute();
but I still get the full page returned. If I set these options manually in the RollingCurl.php object it works as I want, but I'd prefer to be able to do it dynamically.
Also, I was getting a PHP notice about this line
if ($i < sizeof($requests) && isset($this->requests[$i++]) && $i < count($this->requests)) {
so I changed it to
if ($i < sizeof($this->requests) && isset($this->requests[$i++]) && $i < count($this->requests)) {
Not sure if it's the right thing, but it made the notice go away.
The code for adding separate options for each request isn't right. I'm looking at fixing it now. In the meantime, it works fine if you add the options to $rc instead of each request individually:
$urls = array(…);
$options = array(CURLOPT_HEADER => true, CURLOPT_NOBODY => true);
$rc = new RollingCurl("request_callback");
$rc->window_size = 5;
foreach ($urls as $url) {
$request = new Request($url);
$rc->add($request);
}
$rc->options = $options;
$rc->execute();
Good catch on the missing $this. I've pushed your fix to Google Code.
Okay, if you grab the latest version that issue should be fixed. Thanks for the heads up.
Josh,
There is an off by one bug in the rolling window logic when the window_size is less than the total number of urls (e.g. 6 urls and window size is 3).
In this example, the 4th url gets skipped. If I change the window size to 4, the 5th url gets skipped. I've tracked it down to the logic in this if statement:
if ($i < sizeof($this->requests) && isset($this->requests[$i++])
&& $i < count($this->requests)) {
$ch = curl_init();
$options = $this->get_options($this->requests[$i]);
curl_setopt_array($ch,$options);
curl_multi_add_handle($master, $ch);
}
You are incrementing $i in the if statement, but you are using $i in the $this->requests[$i]. So it makes sense that the first link after the window size gets skipped.
Thank you for your library. I very much hope to use it soon.
Ah, good catch & shame on me for not noticing that first. It should be fixed now on Google Code.
How to access Rolling Curl's Callback Function within a Parent Class?
I am calling RollingCurl from within another class.
How can I get RollingCurl to target the callback function within the class that called it? I am a little unfamiliar with callback functions and not quite sure how to implement it within a class. Many thanks in advance.
Pseudocode:
$scraper= new DummyScraper
$urls= $scraper->getUrls()
$rc = $scraper->rollingCurl($urls)
Class DummyScraper{
public $rc;
function __construct(){
$this->rc = new RollingCurl("request_callback");
$this->rc->window_size = 20;
}
public function getURLs(){
//get a bunch of urls to pass to Rolling Curl
return $urls;
}
public function rollingCurl($url){
foreach ($urls as $url)
{
$request = new Request($url);
$this->rc->add($request);
}
$this->rc->execute();
}
public function request_callback($response, $info) {
$titles[] = $m->xpath($response,"/html/body//a"); //Get all the links and store in an array
}
}
Again. Many thanks in advance.
Interesting question. The first thing that comes to mind is to use create_function (http://ca.php.net/create_function) to create an anonymous function for the callback. It's got a bit of an ugly syntax, but it works great. Using the example I have on Google Code, it would look something like this:
$callback = create_function('$response,$info', '
// parse the page title out of the returned HTML
if (eregi ("<title>(.*)</title>", $response, $out)) {
$title = $out[1];
}
echo "$title
";
print_r($info);
echo "<hr>";
');
$rc = new RollingCurl($callback);
Ugh, my commenting system slaughtered the code example. Hopefully you can still get the gist of it.
Ok, after a little testing, the below works:
Works:
$callback = $this->sayHello("Is this parent method being called?");
$this->rc = new RollingCurl($callback);
Doesn't Work: ($response, $info is null, $scrape works):
$callback = $this->attributeHTMLScraper($response,$info,$scrape="/html/body//a");
$this->rc = new RollingCurl($callback);
Here is the attributeHTMLScraper method in the parent class:
public function attributeHTMLScraper($response,$info,$scrape)
{
var_dump($response); var_dump($info); //Both Null
$dom = new DOMDocument();
$dom->loadHTML($reponse);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate($scrape);
if(!is_null($hrefs)):
for ($i = 0; $i < $hrefs->length; $i++):
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$result[] =$url;
endfor;
endif;
//Return a simple variable if 1 value is returned. Else return an array
if(count($result)==1):
return $result[0];
else:
return $result;
endif;
return $result;
}
How can I pass these parent methods to the callback variable?
My question above may be more of an OOP usage question than a rolling-curl question. That said, any help is greatly appreciated.
The below works
Does anyone know if there is a way to do it without mucking up the code in the Rolling Curl class? Many thanks in advance.
I added a reference to the parent class in the Rolling Curl constructor:
function __construct($callback = null,$cls=null) {
$this->parentClass = $cls;
$this->callback = $callback;
To target the callback function of the parent class, I replaced:
// Send the return values to the callback function.
$callback = $this->callback;
$callback($output, $info);
With:
$this->callback= str_replace('$this->','',$this->callback); //remove '$this->' from the string
eval('$this->parentClass->'.$this->callback.';');
In my Parent Class, I called rolling curl like this:
$callback = '$this->attributeHTMLScraper($output,$info,"/html/body//a")';
$this->rc = new RollingCurl($callback,$this);
As an alternative, I tried the below but could not get the $vArgs array to show up in the attributeHTMLScraper function.
$vArgs = array($response,$info,'/html/body//a');
call_user_func_array($this->parentClass->attributeHTMLScraper,$vArgs );
If anyone knows a more elegant way to do this, I would be much appreciative. Still deep in that learning phase.
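For what it's worth, PHP has a standard way to refer to a method on an object without eval: an array of the object and the method name, array($this, 'methodName'), is itself a callable. The sketch below shows the idea with a stand-in class, `FakeFetcher`, that is NOT the real RollingCurl. It only mimics the one step where the callback is fired, and whether the real class accepts an array callable unchanged depends on how it invokes the callback.

```php
<?php
// Stand-in for the part of RollingCurl that fires the callback (not the real class).
class FakeFetcher {
    private $callback;

    public function __construct($callback) {
        $this->callback = $callback;
    }

    // Pretend a request just finished with this output and info.
    public function finish($output, $info) {
        call_user_func($this->callback, $output, $info);
    }
}

class DummyScraper {
    public $seen = array();

    public function run() {
        // array($object, 'methodName') is a regular PHP callable.
        $fetcher = new FakeFetcher(array($this, 'request_callback'));
        $fetcher->finish('<html></html>', array('url' => 'http://example.com/'));
    }

    public function request_callback($response, $info) {
        $this->seen[] = $info['url'];
    }
}

$scraper = new DummyScraper();
$scraper->run();
print_r($scraper->seen);
```

Note that `$callback = $this->someMethod(...)` calls the method immediately and stores its return value, which is why $response and $info were null in the attempt above. Extra arguments such as the XPath string can be bound by wrapping the call in a closure (PHP 5.3+) instead of a string passed to eval.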
Thank you for addressing this so quickly!
Perfect just what I was looking for and easy for a NOOB like myself to implement
Thank you
I get the same problem with the disappearing URL's as previously mentioned. I have kept the rolling window at 5 and experimented with retrieving XML feeds from Amazon. Once I get up to 50 URLs I am not getting the expected number of results. I have tried adding error handling but there are none, the URL's just disappear.
About the only thing I can think of at the moment is flagging each URL in the array and recursively processing until they are either flagged as completed or error. I will keep you posted.
Rolling Curl simply just rocks! Thanks for all your time & effort on this. I'm amazed at what this can accomplish on so few CPU cycles.
Who cares whether it's technically forking, threading, or otherwise….it works as advertised.
One small issue on the blog presentation. While I know it should be obvious that the current code lives on google, I think it would be advantageous (from a visual quickscan standpoint) to replace the old code in the black area (on top of the blog post) with the current meat and potatoes, end result, functionality from the example on Google:
<code>
/*
authored by Josh Fraser (http://www.joshfraser.com)
released under Apache License 2.0
*/
// a little example that fetches a bunch of sites in parallel and echos the page title and response info for each request
require("RollingCurl.php");
// top 20 sites according to alexa (11/5/09)
$urls = array("http://www.google.com",
"http://www.facebook.com",
"http://www.yahoo.com",
"http://www.youtube.com",
"http://www.live.com",
"http://www.wikipedia.com",
"http://www.blogger.com",
"http://www.msn.com",
"http://www.baidu.com",
"http://www.yahoo.co.jp",
"http://www.myspace.com",
"http://www.qq.com",
"http://www.google.co.in",
"http://www.twitter.com",
"http://www.google.de",
"http://www.microsoft.com",
"http://www.google.cn",
"http://www.sina.com.cn",
"http://www.wordpress.com",
"http://www.google.co.uk");
function request_callback($response, $info) {
// parse the page title out of the returned HTML
if (eregi ("<title>(.*)</title>", $response, $out)) {
$title = $out[1];
}
echo "$title
";
print_r($info);
echo "<hr>";
}
$rc = new RollingCurl("request_callback");
$rc->window_size = 20;
foreach ($urls as $url) {
$request = new Request($url);
$rc->add($request);
}
$rc->execute();
</code>
And provide the direct Google trunk link for those who are not used to working with a repository:
http://code.google.com/p/rolling-curl/source/brow…
I know for myself, when I'm scanning a bunch of sites for the right solution, it's nice to see a quick visual reference.
Cheers.
I need to access Rolling Curl results based on specific URL sequencing.
For smaller requests, I'm using the code below and it's working fine.
For larger requests, I could write the results to disk with sequential file naming for post sequential compilation, rather than storing all the results in memory.
I'm thinking the best way to handle it dynamically, would be to create a function to write sequentially named files to disk, based on a defined size/memory threshold, otherwise, handle via an in-memory associative array as below.
Thoughts?
function request_callback($response,$info) {
global $rc_result ;
// get last character of URL to enable indexing of results
// e.g., $url[] = http://mydomain.com?request=1, $url[] = http://mydomain.com?request=2 , etc.
$index_id = $info['url'][strlen($info['url'])-1];
$rc_result[$index_id] = $response ;
}
$rc = new RollingCurl("request_callback");
$rc->window_size = 10;
foreach ($urls as $url) {
$request = new Request($url);
$rc->add($request);
}
$rc->execute();
ksort($rc_result,SORT_NUMERIC);
print_r($rc_result);
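One caveat with indexing on the last character of the URL is that it stops being unique at ten requests (request=10 and request=20 both end in 0). A possible alternative, sketched below, is to look up the URL's position in the original $urls list. The helper name `make_indexed_callback` is made up, and the sketch assumes $info['url'] is exactly the URL that was requested, which does not hold if the request was redirected.

```php
<?php
// Hypothetical helper: builds a callback that files each response under the
// position its URL had in the original $urls list.
function make_indexed_callback(array $urls, array &$results) {
    $position = array_flip($urls); // url => index in the request list
    return function ($response, $info) use ($position, &$results) {
        if (isset($position[$info['url']])) {
            $results[$position[$info['url']]] = $response;
        }
    };
}

$urls = array();
for ($n = 1; $n <= 12; $n++) {
    $urls[] = "http://mydomain.com?request=$n";
}

$rc_result = array();
$callback = make_indexed_callback($urls, $rc_result);

// Stand-in for responses arriving out of order.
$callback('twelfth', array('url' => $urls[11]));
$callback('second', array('url' => $urls[1]));

ksort($rc_result, SORT_NUMERIC);
print_r($rc_result);
```

If the same URL appears twice in $urls, array_flip keeps only the last position, so duplicates would need a different key.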
I would like to add two attributes (public $response; public $output;) so that, when there is no callback function, the results can be processed externally. Thank you.
Sorry, my English is very poor, so I used Google Translate.
Any ideas on referencing the requests to the responses?
I need a response to be identifiable after it fulfilled a request.
No Solution yet. Anyone?
Thank you very much for this piece code!
There is just one minor error I noticed. Before creating a new handle you should check if there are any url's in the array left to add. If you don't check that you will create up to 5 empty requests (that will return error code 0) after you have processed all the URL's in the $urls array. So it should look like this:
// start a new request (it's important to do this before removing the old one)
if (sizeof($urls) > $i + 1) { …. start new request …. }
If you handle errors in the else {} branch, you will see that up to 5 requests return errors with code 0 ($info['http_code'] == 0). You must check whether every URL in $urls has already been sent before creating another request (if (sizeof($urls) > $i + 1) { create new request }).
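A minimal sketch of the loop with such a bounds check, assuming the post-increment indexing (`$urls[$i++]`) used in the article's function, in which case the test becomes `$i < count($urls)`. The function name `rolling_curl_checked` and the optional `$on_error` callback are made up for this example. Unlike the article's version, it starts the next request after failures too, and it closes each finished handle.

```php
<?php
// Sketch: rolling window with a bounds check before each new request.
// $callback receives ($output, $info); $on_error, if given, receives ($info).
function rolling_curl_checked(array $urls, $callback, $window = 5, $on_error = null)
{
    $urls = array_values($urls);
    $total = count($urls);
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5,
    );
    $master = curl_multi_init();

    // start the first batch; never more handles than there are URLs
    for ($i = 0; $i < min($window, $total); $i++) {
        $ch = curl_init();
        curl_setopt_array($ch, $options + array(CURLOPT_URL => $urls[$i]));
        curl_multi_add_handle($master, $ch);
    }

    do {
        while (($status = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
        if ($status != CURLM_OK) {
            break;
        }
        while ($done = curl_multi_info_read($master)) {
            $info = curl_getinfo($done['handle']);
            if ($info['http_code'] == 200) {
                call_user_func($callback, curl_multi_getcontent($done['handle']), $info);
            } elseif ($on_error) {
                call_user_func($on_error, $info);
            }
            // only start a new request while unsent URLs remain
            if ($i < $total) {
                $ch = curl_init();
                curl_setopt_array($ch, $options + array(CURLOPT_URL => $urls[$i++]));
                curl_multi_add_handle($master, $ch);
                $running = 1; // a handle was just added, so keep looping
            }
            // release the finished handle
            curl_multi_remove_handle($master, $done['handle']);
            curl_close($done['handle']);
        }
    } while ($running);

    curl_multi_close($master);
    return true;
}
```

Setting `$running = 1` after adding a handle covers the case where the last active transfer finishes in the same pass that queues a new one.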
trying to implement this, still no luck ;(
I'm trying to access the cookies from the callback function. Is that possible? I checked my cookie and it is still empty when I try to access it from the callback function.
Thanks a lot
Hi, I am using your code for one of my projects, where I have a set of files that I download simultaneously.
I hit a problem in the callback
function requestCallback($response, $info, $urls_array = null) {
$filelocation = getFileLocation($urls_array, $info['url']);
if (isset($filelocation)){
if (file_put_contents($filelocation.".zip", $response)) {
$zip = new ZipArchive();
if ($zip->open($filelocation.".zip") === TRUE) {
echo "\n unzipping ".$filelocation.".zip \n";
makeDirectory($filelocation);
$zip->extractTo($filelocation);
// close only an archive that was opened successfully
$zip->close();
echo "Completed unzipping ".$filelocation.".zip \n";
} else {
logErrorMail("Archive ".$filelocation.".zip is invalid or corrupt");
}
} else {
logErrorMail("Error: unable to write to zip file");
}
} else {
logErrorMail("Error: Cannot find file location");
}
}
This is the callback that gets called. Is the callback synchronous? I ask because this function is called as soon as one of the handles is done. The output is a zip archive, so I write it to a local zip file and, if that succeeds, unzip it.
Writing the file and unzipping take time, and the download of the remaining items only resumes after they complete.
In my case I am downloading 3 files in a batch.
When the 1st file has downloaded successfully, the callback is called.
After the writing to zip and the unzipping have finished, I get this error:
* Connection #2 to host xxxxx left intact
92 138M 92 127M 0 0 23996 0 1:40:41 1:32:50 0:07:51 129
99 129M 99 128M 0 0 24250 0 1:33:16 1:32:50 0:00:26 327* Connection #2 seems to be dead!
* Closing connection #2
* About to connect() to xxxxx port xx (#2)
* Trying X.X.X.X…
92 138M 92 127M 0 0 23665 0 1:42:06 1:34:09 0:07:57 126
99 129M 99 128M 0 0 23916 0 1:34:34 1:34:09 0:00:25 275* Connected to XXXX (X.X.X.X) port X (#2)
> GET XXXX/daily.zip HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
Host: XXXXX
Accept: */*
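In the function shown in the article, the callback is invoked from inside the curl_multi loop, so the other transfers are not serviced while it runs; the class version most likely behaves the same way. One possible workaround, sketched here with a hypothetical `DeferredUnzip` class and an assumed md5-based file name, is to let the callback only save the archive and to extract everything after the downloads have finished.

```php
<?php
// Hypothetical helper: the callback saves archives, extraction happens later.
class DeferredUnzip
{
    public $pending = array(); // paths of saved archives awaiting extraction
    private $dir;

    public function __construct($dir)
    {
        $this->dir = rtrim($dir, '/');
    }

    // Callback: only write the body to disk, which is quick next to unzipping.
    public function save($response, $info)
    {
        $path = $this->dir . '/' . md5($info['url']) . '.zip'; // assumed naming
        if (file_put_contents($path, $response) !== false) {
            $this->pending[] = $path;
        }
    }

    // Call once the downloads are done; returns the extracted directories.
    public function extractAll()
    {
        $extracted = array();
        foreach ($this->pending as $path) {
            $zip = new ZipArchive();
            if ($zip->open($path) === true) {
                $target = substr($path, 0, -4); // strip ".zip"
                if ($zip->extractTo($target)) {
                    $extracted[] = $target;
                }
                $zip->close();
            }
        }
        $this->pending = array();
        return $extracted;
    }
}
```

This keeps the time spent inside the callback short, which may be enough to stop the server from dropping idle connections, though that cause is a guess based on the log above.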
Try the second parameter in the callback function. It's an array, and you may be able to match the response to the request using the url key.
Excellent code, thanks
Just to let others who might be struggling to get it to work, curl_multi_info_read() doesn't work in PHP versions before 5.2.0, and returns NULL immediately.
Firstly, apologies if this is a bit of a noob question but can anyone give any tips on how to get proper responses from the following site using RollingCurl? Is it to do with cookies? I am not a programmer by any stretch of the imagination so could do with some pointers.
The URL is "http://logis.korail.go.kr/getcarinfo.do?car_no=" with a number appended to it, ranging from 8201 through 8286. The first time you enter the URL into your browser you get a login screen back. If you refresh, or enter a URL with a different number appended, you get one of two different responses. The first response has two input boxes, one of which is populated with the number you appended to the URL, the second being empty. The second response is the same as the first with the addition of two tables, with various data fields, underneath the two input boxes.
When you use the code below, the HTML received back is the login page in every instance. How do I implement RollingCurl so that I get one of the other two responses back?
<?php
// PROCESS RESPONSE
function request_callback($response, $info) {
echo($response);
echo "<hr>";
}
// REQUIRE ROLLING CURL
require("RollingCurl.php");
// POPULATE LOCO ARRAY
$class = 8201;
$class_size = 5; //RESTRICTED TO 5 FOR TESTING PURPOSES, SHOULD BE 86!
for ($i=0;$i<$class_size;$i++){
$loco[]=$class+$i;
}
// POPULATE URL ARRAY
$url = 'http://logis.korail.go.kr/getcarinfo.do?car_no=';
for ($i=0;$i<sizeof($loco);$i++){
$urls[]=$url.$loco[$i];
}
// FETCH URLS
$rc = new RollingCurl("request_callback");
$rc->window_size = $class_size;
foreach ($urls as $url) {
$request = new Request($url);
$rc->add($request);
}
$rc->execute();
?>
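Getting the login page on every request suggests the site expects a session cookie from an earlier visit, though that is only a guess from the behaviour described. A sketch of one way to test it: make a single ordinary request first so a cookie jar file is filled, then give every parallel request the same cookie options. The function names and the jar path are hypothetical.

```php
<?php
// Options that make a handle read cookies from, and save cookies to, $jar.
function cookie_jar_options($jar)
{
    return array(
        CURLOPT_COOKIEFILE => $jar, // send cookies already stored in the file
        CURLOPT_COOKIEJAR  => $jar, // write cookies back when the handle is closed
    );
}

// One ordinary request up front, so the jar holds a session cookie
// before the parallel batch starts.
function prime_cookie_jar($url, $jar)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, cookie_jar_options($jar) + array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ));
    $body = curl_exec($ch);
    curl_close($ch); // closing (or freeing) the handle flushes cookies to the jar
    return $body;
}
```

If the RollingCurl version in use accepts extra cURL options, globally or per request, the array from `cookie_jar_options()` would go there; check the class source for the exact property or constructor argument.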
Indeed. It's a beautiful piece of code!
This post, together with your post on http://www.askapache.com/php/curl-multi-downloads… just made me understand how to use multi handles in a useful way!
It's important to do this before removing the old one
I'm not a PHP expert and I'm looking for a way to download multiple images at once.
This class looks like it would work for me, but I don't know how to make it save the files to a defined directory.
E.g.:
$imgs = array("http://l.yimg.com/a/i/us/pim/dclient/cg504_5/img/md5/509840ceb0dd52f5f024dba77099b4b0_1.gif",
"http://www.onlineaspect.com/wp-content/themes/onlineaspect/images/lego.png");
save to
$maindir = "images";
$dirs = array($maindir . "/yahoo", $maindir . "/onlineaspect");
I don't know how to pass the directory variable to the callback so that each image is saved to its own directory.
Thanks for any example.
Ah, great question. The trick here is to make $maindir a global variable inside your callback like this:
function callback($result) {
global $maindir;
…
}
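To go one step further than a single global directory, the callback can choose the directory from the host name of the URL it just fetched. The sketch below assumes the callback also receives the cURL info array as a second argument; the host-to-directory map and the function names are invented for the example.

```php
<?php
// Hypothetical mapping from host name to target directory.
$dir_for_host = array(
    'l.yimg.com'           => 'images/yahoo',
    'www.onlineaspect.com' => 'images/onlineaspect',
);

// Path an image from $url should be saved to, or null when the host
// has no directory assigned or the URL has no file name.
function image_path($url, array $dir_for_host)
{
    $host = parse_url($url, PHP_URL_HOST);
    $file = basename((string) parse_url($url, PHP_URL_PATH));
    if (!is_string($host) || !isset($dir_for_host[$host]) || $file === '') {
        return null;
    }
    return $dir_for_host[$host] . '/' . $file;
}

// Callback: save the downloaded image into the directory for its host.
function save_image($response, $info)
{
    global $dir_for_host;
    $path = image_path($info['url'], $dir_for_host);
    if ($path === null) {
        return false;
    }
    if (!is_dir(dirname($path))) {
        mkdir(dirname($path), 0777, true);
    }
    return file_put_contents($path, $response) !== false;
}
```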
Parse error: syntax error, unexpected ';' in /home/migcybe1/public_html/php/multi_eksekusi_3.php on line 6
when i test this script
<?php
function rolling_curl($urls, $callback, $custom_options = null) {
// make sure the rolling window isn't greater than the # of urls
$rolling_window = 5;
$rolling_window = (sizeof($urls) < $rolling_window) ? sizeof($urls) : $rolling_window;
$master = curl_multi_init();
$curl_arr = array();
// add additional curl options here
$std_options = array(CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 5);
$options = ($custom_options) ? ($std_options + $custom_options) : $std_options;
// start the first batch of requests
for ($i = 0; $i < $rolling_window; $i++) {
$ch = curl_init();
$options[CURLOPT_URL] = $urls[$i];
curl_setopt_array($ch,$options);
curl_multi_add_handle($master, $ch);
}
do {
while(($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
if($execrun != CURLM_OK)
break;
// a request was just completed — find out which one
while($done = curl_multi_info_read($master)) {
$info = curl_getinfo($done['handle']);
if ($info['http_code'] == 200) {
$output = curl_multi_getcontent($done['handle']);
// request successful. process output using the callback function.
$callback($output);
// start a new request (it's important to do this before removing the old one)
$ch = curl_init();
$options[CURLOPT_URL] = $urls[$i++]; // increment i
curl_setopt_array($ch,$options);
curl_multi_add_handle($master, $ch);
// remove the curl handle that just completed
curl_multi_remove_handle($master, $done['handle']);
} else {
// request failed. add error handling.
}
}
} while ($running);
curl_multi_close($master);
return true;
}
?>
how to solve it??? thanks
Hi Josh,
I'm using your nice piece of code for a lot of data (3.5 million requests).
My rolling window is 10.
But after a while my working machine is out of memory.
I'm trying to find the memory leak and I noticed you don't close the single curl handles.
I think, after
curl_multi_remove_handle($master, $done['handle']);
you have to call
curl_close($done['handle']);
to totally close the handle, because
curl_multi_close($master);
closes the master but not the single handles.
And can you explain to me why you have to "// start a new request" if you have just started all of the requests in the "for" loop?
Ah, good catch. Yes, you probably want to close those.
I haven't tried reusing the connections, or at least I don't remember experimenting with that. Would be interesting to see if you can make that work. If you do, please post your results here so others can gain from them.
Thanks!
Hello! Great article!
I wrote simple class with similar functionality, i use anonymous functions for callbacks: https://github.com/Yaffle/MultiGet
Hey Josh,
Is it possible to pass a value (e.g. $row['id']) into RollingCurl so that it's available for use within the callback function?
foreach ($rows as $row) {
// Add each request to the RollingCurl object.
$request = new RollingCurlRequest($row['url']);
$rc->add($request);
}
(Basically, it's the MySQL primary key for each row ($row['id']) that I'm trying to pass and make available within the callback function.)
Thanks.
Sure, an easy way to do this is to add a GET variable to the end of the URL you are fetching (ie. ?mysql_id=42) and then parse out that ID when the request completes from the CURL info array.
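As a rough sketch of that suggestion, the two helpers below append an ID to a URL and read it back from the URL in the info array. The parameter name `mysql_id` comes from the reply above; the function names are made up. Note that the extra parameter is sent to the remote server, and a redirect may drop it.

```php
<?php
// Append the tracking parameter to a URL, with "?" or "&" as needed.
function tag_url($url, $id)
{
    $sep = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $sep . 'mysql_id=' . urlencode((string) $id);
}

// Read the parameter back from a URL; null when it is not present.
function id_from_url($url)
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }
    parse_str($query, $params);
    return isset($params['mysql_id']) ? $params['mysql_id'] : null;
}
```

In the loop that queues requests one would pass `tag_url($row['url'], $row['id'])` instead of the bare URL, and the callback would call `id_from_url($info['url'])`.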
Ok. I got it implemented using a hash tag to pass the monitor id and then getting this value from the $request ($info['url'] doesn't retain the hash tag on the URL for whatever reason). This way, by using a hash tag, I figure there is no possibility that it'll ever change the URL that is checked. Still it'd be cool, if RollingCurl had a way to pass a value without affecting the URL. But this is working for now. Thanks for sharing RC!
For the simple script below, how can I use the RollingCurl library to make a POST request to each of http://www.site_01.com, http://www.site_02.com and http://www.site_03.com, using variables parsed by the "my_request" function from the GET response? Thank you.
<?PHP
require("RollingCurl.php");
function my_request($response) {
……………
……………
(code used to parse some variables to use later in POST request)
……………
……………
}
$urls = array("http://www.site_01.com",
"http://www.site_02.com",
"http://www.site_03.com");
$rc = new RollingCurl("my_request");
$rc->window_size = 3;
foreach ($urls as $url) {
$request = new RollingCurlRequest($url, "GET");
$rc->add($request);
}
$rc->execute();
?>
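One way this could be approached, sketched with plain cURL options rather than any particular RollingCurl signature: let the GET callback collect the parsed fields per URL, then run a second batch of POST requests built from them. The `token` field, the regular expression and the function names are placeholders for whatever the real pages contain, and the sketch assumes the callback also receives the info array.

```php
<?php
// cURL options for a form-encoded POST carrying $fields.
function post_options($url, array $fields)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,
    );
}

// Filled by the GET callback: URL => fields for the follow-up POST.
$pending_posts = array();

// GET callback: parse the response and remember what to POST later.
function my_request($response, $info)
{
    global $pending_posts;
    // Placeholder parsing step: pull a hidden "token" input out of the HTML.
    if (preg_match('/name="token" value="([^"]*)"/', $response, $m)) {
        $pending_posts[$info['url']] = array('token' => $m[1]);
    }
}
```

After the GET batch finishes, each entry in `$pending_posts` can be turned into a request with `post_options()` and sent in a second run.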
Hi, nice job.
Just some changes for me :
// to fix the problem when the number of $urls is smaller than the rolling window
$rolling_window = min(array(5,count($urls)));
// this line goes just after the "for" loop (start the first batch of requests),
// because with a window of 5 you leave that loop with $i == 5 (the last $i++);
// then, when you fetch the next URL in the do/while, you do another $i++, which skips the 5th URL!
$i--;
That's all for me! Thank you again, this saved me some hours!
Not sure if I did it correctly, but my problem with the code is with the callback function:
for example:
call_user_func($callback,$urls[$z],$output);
When I call the callback function, the $output does not match the URL, and I want to display each link with its matching output. What I get is that the $output comes either before or after the next URL…
I tried to fix this with sleep and curl_multi_select (which is supposed to wait for activity on the connection), but couldn't fix the problem…
If you are trying to return the links like I am, don't do it the stupid way that I did >.<!
use the url (from getInfo) parameter instead…
$info['url'].
Forgot to mention use: curl_multi_select($master);
to lower your CPU spike when running…
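A small sketch of how that call is commonly wrapped, since `curl_multi_select()` can return -1 immediately on some builds and would then spin just as fast as before. The helper name `wait_for_activity` and the 10 ms pause are arbitrary choices for the example.

```php
<?php
// Block until a transfer has activity or $timeout seconds pass.
// When curl_multi_select() returns -1, sleep briefly instead so the
// surrounding loop does not turn into a tight spin.
function wait_for_activity($master, $timeout = 1.0)
{
    $ready = curl_multi_select($master, $timeout);
    if ($ready === -1) {
        usleep(10000); // 10 ms
    }
    return $ready;
}
```

It would be called once per pass of the outer do/while loop, after `curl_multi_exec()` and the processing of finished handles.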
Hey, I am using curl_multi_exec to process thousands of URLs. Currently it breaks down at around 15 to 20k. Please help me with that…