Background
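I have recently been using a dynamic proxy pool to crawl the large PDF datasheets my company needs. The hard part is handling these downloads quickly and accurately, concurrently and asynchronously, while keeping the server's CPU load and connection count within normal limits. Below I walk through how I met this requirement with GuzzleHttp.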
Working through the logic
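- Concurrent processing (what I loosely call "multi-process"): keep several requests in flight at once to improve throughput. Guzzle's Pool does this inside a single PHP process rather than by spawning worker processes.
- Asynchronous processing: some responses come back quickly and others slowly. One slow request must not block the rest and drag down the overall crawl rate.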
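To make the two points above concrete, here is a minimal simulation of a bounded-concurrency scheduler. It performs no network I/O and is not how Guzzle is implemented; `simulatePool` and the task durations are hypothetical names and numbers chosen only to show that a slow task does not hold up the fast ones when at most N tasks run at once.

```php
<?php
/**
 * Simulate a bounded-concurrency scheduler (illustration only, no network I/O).
 *
 * @param int[] $durations   Hypothetical response time of each task, in seconds
 * @param int   $concurrency Maximum number of tasks in flight at once
 * @return array Completion order ('order') and total simulated time ('elapsed')
 */
function simulatePool(array $durations, int $concurrency): array
{
    $durations = array_values($durations);
    $inFlight = [];   // task index => simulated finish time
    $order = [];      // task indexes in the order they complete
    $clock = 0;       // simulated wall-clock time
    $next = 0;        // index of the next task waiting to start
    $total = count($durations);

    while ($next < $total || $inFlight) {
        // Fill any free slots with the next queued tasks
        while ($next < $total && count($inFlight) < $concurrency) {
            $inFlight[$next] = $clock + $durations[$next];
            $next++;
        }
        // Advance the clock to the earliest finishing task and retire it
        asort($inFlight);
        $done = array_key_first($inFlight);
        $clock = $inFlight[$done];
        unset($inFlight[$done]);
        $order[] = $done;
    }

    return ['order' => $order, 'elapsed' => $clock];
}

// Task 0 is slow (5s); tasks 1 to 3 are fast (1s each); two slots are available.
$result = simulatePool([5, 1, 1, 1], 2);
echo implode(',', $result['order']) . ' in ' . $result['elapsed'] . "s\n";
```

With two slots, the three fast tasks pass through the second slot while the slow one occupies the first, so the whole batch takes as long as the slow task alone (5s) instead of the 8s a strictly sequential run would need.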
Full code
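// At the top of the file that declares the enclosing class:
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use Psr\Http\Message\ResponseInterface;

/**
 * Download remote files concurrently and asynchronously with GuzzleHttp.
 *
 * @param array  $urlMap    Remote URLs to crawl
 * @param string $localPath Local save path (assumed here to be an existing directory)
 */
public function downloadByGuzzlePoolAsync(array $urlMap, $localPath)
{
    // Proxy
    $proxy = 'http://http-dynamic-S04.xzzdaili.com:10030';
    $proxyUser = '1169461750313049664';
    $proxyPassword = 'lG9sMtTp';
    $proxyAuth = base64_encode($proxyUser . ":" . $proxyPassword);

    // Request headers only; Content-Type and Content-Encoding describe a body
    // and do not belong on a GET request, so they are omitted.
    $header = [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
        'Referer' => 'https://exmaple.com',
        'Proxy-Authorization' => "Basic " . $proxyAuth,
        'Accept' => 'application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    ];

    $client = new Client();
    $urls = array_values($urlMap); // numeric keys, so $index in the callbacks maps back to a URL

    $requests = function (array $urls) use ($client, $localPath, $proxy, $header) {
        foreach ($urls as $i => $url) {
            // One destination file per URL; a single shared sink would be overwritten by every response
            $name = basename((string) parse_url($url, PHP_URL_PATH)) ?: 'download.pdf';
            $target = rtrim($localPath, '/') . '/' . $i . '_' . $name;

            // The pool expects a request object or a callable that returns a promise,
            // not an already-created promise. It passes the pool 'options' to the callable.
            yield $i => function (array $poolOptions) use ($client, $url, $target, $proxy, $header) {
                return $client->getAsync($url, $poolOptions + [
                    'headers' => $header,
                    'proxy' => $proxy,
                    'verify' => false, // disables TLS certificate verification
                    'sink' => $target, // the body is written straight to this file
                ]);
            };
        }
    };

    $pool = new Pool($client, $requests($urls), [
        'concurrency' => 5, // maximum number of requests in flight (not processes)
        'options' => [
            'timeout' => 10, // total time allowed per request, in seconds; large files may need more
        ],
        'fulfilled' => function (ResponseInterface $response, $index) use ($urls) {
            // TODO: handle a successful response
            // The body is already on disk via 'sink', so no manual fopen/fwrite loop is needed.
            echo 'GuzzleHttp pool request succeeded, index=' . $index . ' url=' . $urls[$index] . ' response=' . $response->getReasonPhrase() . PHP_EOL;
        },
        'rejected' => function ($reason, $index) {
            // TODO: handle a failed request
            // The reason is not always a RequestException (connection failures, for example).
            $message = $reason instanceof \Throwable ? $reason->getMessage() : (string) $reason;
            echo 'index=' . $index . ' ,error=' . $message . PHP_EOL;
        },
    ]);

    $promise = $pool->promise();

    // Runs once every request has been handled by 'fulfilled' or 'rejected'
    $promise->then(
        function () {
            echo "All requests have been processed" . PHP_EOL;
        },
        function ($e) {
            echo "An exception occurred: " . ($e instanceof \Throwable ? $e->getMessage() : (string) $e) . PHP_EOL;
        }
    );

    // Wait for all requests to complete
    $promise->wait();
}
} // closes the enclosing class, whose declaration is not shown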
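When several URLs are saved into one directory, each download needs its own destination file, otherwise later responses overwrite earlier ones. Below is a minimal sketch of deriving such a path. `buildTargetPath` is a hypothetical helper, and the naming scheme (index prefix plus sanitized basename, with `download.pdf` as a fallback) is an assumption, not something the original code prescribes.

```php
<?php
/**
 * Build a distinct destination path for one URL inside a download directory.
 * Hypothetical helper: the index prefix keeps names unique even when two URLs
 * end in the same file name.
 */
function buildTargetPath(string $dir, int $index, string $url): string
{
    // parse_url() may return null or false for odd input; the cast turns that into ''
    $path = (string) parse_url($url, PHP_URL_PATH);
    $name = basename($path);

    // Keep only characters that are safe in file names
    $name = (string) preg_replace('/[^A-Za-z0-9._-]/', '_', $name);
    if ($name === '') {
        $name = 'download.pdf'; // assumed fallback when the URL has no file name
    }

    return rtrim($dir, '/') . '/' . $index . '_' . $name;
}

echo buildTargetPath('/tmp/specs/', 0, 'https://example.com/docs/spec-a.pdf?v=2') . "\n";
echo buildTargetPath('/tmp/specs', 1, 'https://example.com/') . "\n";
```

The returned string can be passed as the `sink` request option, provided the directory already exists and is writable.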
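That is everything for this post; I hope the summary helps. One last friendly reminder: read for three minutes a day, learn a little every day, and improve a little every day.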