萬盛學電腦網


Parallel Web Page Fetching in Single-Threaded PHP

This PHP tutorial simulates fetching several pages in parallel. The key is handling the concurrency within a single thread.

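Most programs that fetch several pages do so serially, but the total fetch time becomes too long to be practical. My first idea was to fetch in parallel with curl, but the virtual server I was using turned out not to have curl installed, which was frustrating. So I changed approach and used a single thread to achieve the effect of multiple threads. Anyone with some network programming background will know the concept of I/O multiplexing. PHP supports it too, and the support is built in, with no extension required.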
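As a minimal sketch of I/O multiplexing with PHP's built-in stream_select(), the example below watches two streams and reports which one has data waiting. The two local socket pairs are stand-ins for network connections (an assumption so the sketch runs without network access, and it assumes a Unix-like system where STREAM_PF_UNIX is available). The keys 'a' and 'b' are illustrative only.

```php
<?php
// Two local socket pairs stand in for two network connections (assumption:
// Unix-like system; this is an illustration, not the article's code).
$pairA = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
$pairB = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);

// Only connection "b" has data waiting to be read.
fwrite($pairB[1], "hello");

$watched = array('a' => $pairA[0], 'b' => $pairB[0]);

// stream_select() modifies its arrays in place, so pass copies and variables.
$read = $watched;
$write = null;
$except = null;

// Wait at most 1 second until at least one watched stream is readable.
$ready = stream_select($read, $write, $except, 1);

// Map each ready stream back to the key it was registered under.
$readyKeys = array();
foreach ($read as $stream) {
    $readyKeys[] = array_search($stream, $watched, true);
}

echo "ready streams: " . implode(', ', $readyKeys) . "\n";
```

A single call waits on every stream at once, which is what lets one thread serve several connections without blocking on any one of them.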

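Even people with many years of programming experience may not know PHP's stream functions well. PHP wraps compressed file streams, ordinary file streams, and applications running over TCP in the same stream abstraction, so reading a local file is no different from reading a network resource. With that background the idea should be clear, so the code comes next.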

The code is fairly rough. If you intend to use it in practice, some details still need to be handled.
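The stream abstraction described above can be illustrated with a small sketch: one reading routine drains a file, an in-memory stream, and a socket using the same feof()/fread() calls. The helper name read_all() is hypothetical, and the socket is a local socket pair rather than a real network connection (an assumption so the sketch runs offline on a Unix-like system).

```php
<?php
// Hypothetical helper: drain any readable stream with the same two calls.
function read_all($stream)
{
    $data = '';
    while (!feof($stream)) {
        $chunk = fread($stream, 128);
        if ($chunk === false) {
            break;
        }
        $data .= $chunk;
    }
    return $data;
}

// 1. A local file stream.
$path = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($path, "from a file");
$file = fopen($path, 'r');
$fromFile = read_all($file);
fclose($file);
unlink($path);

// 2. An in-memory stream.
$mem = fopen('php://memory', 'w+');
fwrite($mem, "from memory");
rewind($mem);
$fromMemory = read_all($mem);
fclose($mem);

// 3. A socket stream (local pair standing in for a TCP connection).
$pair = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
fwrite($pair[1], "from a socket");
fclose($pair[1]); // closing the writer lets the reader reach end-of-file
$fromSocket = read_all($pair[0]);
fclose($pair[0]);

echo $fromFile, "\n", $fromMemory, "\n", $fromSocket, "\n";
```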

Code

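function http_get_open($url)
{
    $url = parse_url($url);
    if (empty($url['host'])) {
        return false;
    }
    $host = $url['host'];
    if (empty($url['path'])) {
        $url['path'] = "/";
    }
    // Append the query string only when the URL actually has one.
    $get = $url['path'];
    if (isset($url['query'])) {
        $get .= "?" . $url['query'];
    }
    // Open a plain TCP connection to port 80 with a 30-second connect timeout.
    $fp = stream_socket_client("tcp://{$host}:80", $errno, $errstr, 30);
    if (!$fp) {
        echo "$errstr ($errno)\n";
        return false;
    }
    // Send a minimal HTTP/1.0 request; the server closes the connection when done.
    fwrite($fp, "GET {$get} HTTP/1.0\r\nHost: {$host}\r\nAccept: */*\r\n\r\n");
    return $fp;
}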
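The request-building step of http_get_open() can be looked at on its own, without opening a connection. The sketch below is not part of the original listing: build_request() is a hypothetical helper, and honoring an explicit port in the URL is an added assumption (the original always connects to port 80).

```php
<?php
// Hypothetical helper (not in the original listing): turn a URL into the
// socket address and the raw HTTP/1.0 request text, without connecting.
function build_request($url)
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return false;
    }
    $host = $parts['host'];
    // Assumption: use the URL's port when present, otherwise port 80.
    $port = isset($parts['port']) ? $parts['port'] : 80;
    // Default to "/" when the URL has no path component.
    $target = isset($parts['path']) ? $parts['path'] : '/';
    if (isset($parts['query'])) {
        $target .= '?' . $parts['query'];
    }
    // Each header line ends with CRLF; an empty line ends the header block.
    $request = "GET {$target} HTTP/1.0\r\n"
             . "Host: {$host}\r\n"
             . "Accept: */*\r\n"
             . "\r\n";
    return array('address' => "tcp://{$host}:{$port}", 'request' => $request);
}

$plain = build_request('http://example.com');
$full = build_request('http://example.com:8080/search?q=php');

echo $plain['address'], "\n";
echo $full['address'], "\n";
```

Keeping this step separate makes the CRLF line endings and the terminating blank line easy to check before anything is written to a socket.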

function http_multi_get($urls)
{
    $result = array();
    $fps = array();
    // Open every connection and send its request before reading anything.
    foreach ($urls as $key => $url) {
        $fp = http_get_open($url);
        if ($fp === false) {
            $result[$key] = false;
        } else {
            $result[$key] = '';
            $fps[$key] = $fp;
        }
    }
    while (1) {
        $reads = $fps;
        if (empty($reads)) {
            break;
        }
        // stream_select() takes its arrays by reference, so pass variables.
        $w = null;
        $e = null;
        if (($num = stream_select($reads, $w, $e, 30)) === false) {
            echo "error";
            return false;
        } else if ($num > 0) { // can read
            foreach ($reads as $value) {
                $key = array_search($value, $fps);
                if (!feof($value)) {
                    $result[$key] .= fread($value, 128);
                } else {
                    // Response complete: close the stream and stop watching it.
                    fclose($value);
                    unset($fps[$key]);
                }
            }
        } else { // time out
            echo "timeout";
            return false;
        }
    }
    // Split each raw response into array(headers, body).
    foreach ($result as $key => &$value) {
        if ($value) {
            $value = explode("\r\n\r\n", $value, 2);
        }
    }
    unset($value); // break the reference left behind by the foreach
    return $result;
}
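The last loop of http_multi_get() splits each raw response at the first blank line. As a sketch of how that result could be taken one step further, the hypothetical helper split_response() below also pulls out the status code and the individual header lines. The sample response text is made up for illustration.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status, headers, body.
function split_response($raw)
{
    // The first blank line (CRLF CRLF) separates the headers from the body.
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    // The first header line is the status line, e.g. "HTTP/1.0 200 OK".
    $statusLine = array_shift($headerLines);
    $status = null;
    if (preg_match('#^HTTP/\d\.\d\s+(\d{3})#', $statusLine, $m)) {
        $status = (int) $m[1];
    }
    return array(
        'status'  => $status,
        'headers' => $headerLines,
        'body'    => isset($parts[1]) ? $parts[1] : '',
    );
}

// Made-up sample response, for illustration only.
$raw = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>";
$response = split_response($raw);

echo $response['status'], "\n";
echo $response['body'], "\n";
```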

$urls = array();
$urls[] = "http://www.qq.com";
$urls[] = "http://www.sina.com.cn";
$urls[] = "http://www.sohu.com";
$urls[] = "http://www.blue1000.com";

// Parallel fetch
$t1 = microtime(true);
$result = http_multi_get($urls);
$t1 = microtime(true) - $t1;
var_dump("cost: " . $t1);

// Serial fetch
$t1 = microtime(true);
foreach ($urls as $value) {
    file_get_contents($value);
}
$t1 = microtime(true) - $t1;
var_dump("cost: " . $t1);

Results of the final run (parallel first, then serial):

　　string 'cost: 3.2403128147125' (length=21)

　　string 'cost: 6.2333900928497' (length=21)

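That is roughly twice as fast. Sina turned out to be very slow, taking about 2.5s, and it largely dragged the parallel total down, whereas 360 needs only 0.2s.

If all the sites responded at about the same speed and more of them were fetched in parallel, the gap would be even larger.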
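The reasoning above can be made concrete with a rough, idealized model: a serial run costs about the sum of the individual response times, while a parallel run costs about the slowest one. The sketch below uses made-up latencies (not measurements from the run above) and ignores connection setup, so it is only an estimate. The function name estimate() is hypothetical.

```php
<?php
// Idealized model (assumption): serial time is the sum of all latencies,
// parallel time is the slowest single latency.
function estimate($latencies)
{
    $serial = array_sum($latencies);
    $parallel = max($latencies);
    return array(
        'serial'   => $serial,
        'parallel' => $parallel,
        'speedup'  => $serial / $parallel,
    );
}

// Made-up latencies in seconds: one slow site dominates the parallel run.
$uneven = estimate(array(0.5, 2.5, 0.2, 0.3));

// Ten sites that all answer in 0.5s: the speedup grows with the site count.
$even = estimate(array_fill(0, 10, 0.5));

printf("uneven: serial %.1fs, parallel %.1fs\n", $uneven['serial'], $uneven['parallel']);
printf("even:   serial %.1fs, parallel %.1fs\n", $even['serial'], $even['parallel']);
```

Under this model one slow site caps the gain, while uniformly fast sites let the speedup approach the number of sites fetched.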
