Goutte

GitHub: https://github.com/FriendsOfPHP/Goutte

Goutte is a lightweight, easy-to-use PHP web scraping library. It provides an elegant API for crawling links and parsing HTML documents.

Installation

Add fabpot/goutte as a dependency of your project with Composer, which records it in composer.json:

composer require fabpot/goutte

Usage

Create a Goutte client instance (which extends Symfony\Component\BrowserKit\Client):

use Goutte\Client;
$client = new Client();

Make requests to a URL with the request() method:

// Go to the symfony.com website
$crawler = $client->request('GET', 'https://www.symfony.com/blog/');

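The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

To use your own Guzzle settings, create a Guzzle 6 instance and pass it to the Goutte client. For example, to add a 60-second request timeout: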

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

$goutteClient = new Client();
$guzzleClient = new GuzzleClient(array(
    'timeout' => 60,
));
$goutteClient->setClient($guzzleClient);

Click on links:

// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);
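Before clicking, it can help to check where a selected link actually points. The following is a minimal offline sketch, assuming the Composer autoloader and the symfony/dom-crawler package (installed alongside Goutte) are available. The HTML snippet and the https://example.com base URL are hypothetical stand-ins for a fetched page.

```php
<?php
// Assumption: run from a project where `composer require fabpot/goutte` has been executed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for a downloaded blog page.
$html = '<html><body>'
    . '<a href="/blog/category/security-advisories">Security Advisories</a>'
    . '</body></html>';

// The second argument is the page URI, used to resolve relative hrefs.
$crawler = new Crawler($html, 'https://example.com/blog/');

// selectLink() matches by link text; link() wraps the first match in a Link object.
$link = $crawler->selectLink('Security Advisories')->link();

// getUri() returns the absolute URL that click() would request.
$uri = $link->getUri();
echo $uri . "\n";
```

Passing the Link object to $client->click() requests this resolved URL.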

Extract data:

// Get the latest post in this category and display the titles
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text()."\n";
});
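The same pattern can collect results instead of printing them, because each() returns an array of whatever the closure returns. The sketch below assumes the Composer autoloader plus the symfony/dom-crawler and symfony/css-selector packages (both installed alongside Goutte). The HTML is a hypothetical stand-in for a fetched page, so the example runs without network access.

```php
<?php
// Assumption: run from a project where `composer require fabpot/goutte` has been executed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for a downloaded blog listing.
$html = <<<'HTML'
<html><body>
  <h2><a href="/blog/first-post">First post</a></h2>
  <h2><a href="/blog/second-post">Second post</a></h2>
</body></html>
HTML;

$crawler = new Crawler($html);

// Build one array entry per matched node: the link text and its href attribute.
$posts = $crawler->filter('h2 > a')->each(function (Crawler $node) {
    return [
        'title' => $node->text(),
        'url'   => $node->attr('href'),
    ];
});

print_r($posts);
```

With a live page, the Crawler returned by $client->request() can be used in place of the hand-built one.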

Submit forms:

$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});

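An approach to integrating the Goutte scraper into a PHP project (ProcessWire)

Notes on the process of integrating Goutte into a ProcessWire project.

Goutte, a PHP scraping framework

Goutte provides a friendly API for fetching web pages and extracting data. Because it is a PHP library, it plugs directly into PHP projects, which makes it both simple and powerful.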
