Goutte
GitHub: https://github.com/FriendsOfPHP/Goutte
Goutte is a lightweight, easy-to-use PHP web-scraping library. It provides an elegant API for crawling links and parsing HTML documents.
Installation
Add fabpot/goutte as a dependency in your composer.json file by running:

    composer require fabpot/goutte
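Once the package is installed, a script has to load Composer's autoloader before the Goutte classes can be used. A minimal sketch, assuming the script sits in the project root next to Composer's default vendor/ directory:

```php
<?php
// Assumption: this file lives in the project root, beside vendor/.
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;

// Prints bool(true) when the package was installed and autoloading works.
var_dump(class_exists(Client::class));
```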
Usage
Create a Goutte client instance (which extends Symfony\Component\BrowserKit\Client):

    use Goutte\Client;

    $client = new Client();
Make requests with the request() method:

    // Go to the symfony.com website
    $crawler = $client->request('GET', 'https://www.symfony.com/blog/');

The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).
To use your own Guzzle settings, create a Guzzle 6 instance and pass it to the Goutte client. For example, to add a 60-second request timeout:

    use Goutte\Client;
    use GuzzleHttp\Client as GuzzleClient;

    $goutteClient = new Client();
    $guzzleClient = new GuzzleClient([
        'timeout' => 60,
    ]);
    $goutteClient->setClient($guzzleClient);
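To check that custom settings were picked up, the Guzzle instance can be read back from the Goutte client. This is a sketch that assumes Goutte 3.x (the Guzzle 6 based release described here), where Goutte\Client::getClient() and GuzzleHttp\Client::getConfig() are available. The autoloader path is also an assumption.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

$goutteClient = new Client();
$goutteClient->setClient(new GuzzleClient(['timeout' => 60]));

// Read the option back from the Guzzle instance Goutte now uses.
$timeout = $goutteClient->getClient()->getConfig('timeout');
echo "timeout: {$timeout} seconds\n";
```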
Click on links:

    // Click on the "Security Advisories" link
    $link = $crawler->selectLink('Security Advisories')->link();
    $crawler = $client->click($link);
Extract data:

    // Get the latest posts in this category and display their titles
    $crawler->filter('h2 > a')->each(function ($node) {
        print $node->text()."\n";
    });
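Because each() returns the values produced by its callback, the same selector can collect structured data instead of printing it. The sketch below builds a Crawler from an inline HTML string (hypothetical markup standing in for a fetched page), so it runs without network access. Calling text() on an empty node list throws an InvalidArgumentException, so the example checks count() before reading the first title.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for a page returned by $client->request().
$html = <<<'HTML'
<html><body>
  <h2><a href="/blog/first">First post</a></h2>
  <h2><a href="/blog/second">Second post</a></h2>
</body></html>
HTML;

$crawler = new Crawler($html);
$links = $crawler->filter('h2 > a');

// each() returns an array of whatever the callback returns.
$posts = $links->each(function (Crawler $node) {
    return [
        'title' => trim($node->text()),
        'url'   => $node->attr('href'),
    ];
});

// Guard against an empty match before calling text() on the node list.
$firstTitle = $links->count() > 0 ? trim($links->first()->text()) : null;

print_r($posts);
```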
Submit forms:

    $crawler = $client->request('GET', 'https://github.com/');
    $crawler = $client->click($crawler->selectLink('Sign in')->link());
    $form = $crawler->selectButton('Sign in')->form();
    $crawler = $client->submit($form, ['login' => 'fabpot', 'password' => 'xxxxxx']);

    // Print any error messages shown after the login attempt
    $crawler->filter('.flash-error')->each(function ($node) {
        print $node->text()."\n";
    });
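Before submitting, it can help to inspect what the form will send. The sketch below parses a hypothetical login form from an inline HTML string, so it runs offline. The URL https://example.com/login, the form action, and the field names are all assumptions made for illustration.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical login page; a real one would be fetched with $client->request().
$html = <<<'HTML'
<html><body>
  <form action="/session" method="post">
    <input type="text" name="login">
    <input type="password" name="password">
    <input type="submit" value="Sign in">
  </form>
</body></html>
HTML;

// The second argument is the page URI, needed to resolve the form action.
$crawler = new Crawler($html, 'https://example.com/login');

// Select the form by its submit button and fill in the fields.
$form = $crawler->selectButton('Sign in')->form([
    'login'    => 'fabpot',
    'password' => 'xxxxxx',
]);

echo $form->getMethod(), ' ', $form->getUri(), "\n";
print_r($form->getValues());
```

Passing a Form prepared this way to $client->submit($form) sends those values to the resolved action URL.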