The Most Complete Collection of Web Crawler Frameworks


Python

  • Scrapy - A fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from pages. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing (a minimal spider sketch follows this list).
  • pyspider - A powerful web crawler system with a powerful WebUI, written by a Chinese developer. Implemented in Python with a distributed architecture, it supports multiple database backends, and its WebUI provides a script editor, task monitor, project manager, and result viewer.
  • cola - A distributed crawler framework. Users only need to write a few specific functions and do not have to worry about the details of distributed execution: tasks are automatically dispatched to multiple machines, and the whole process is transparent to the user.
  • Demiurge - A crawler micro-framework based on PyQuery.
  • Scrapely - A pure-Python HTML screen-scraping library.
  • feedparser - A parser for extracting RSS content.
  • you-get - A download tool based on Python 3. With You-Get you can easily download videos, images, and music from the web.
  • Grab - A Python crawler framework similar to Scrapy; supports both Python 2 and Python 3.
  • MechanicalSoup - A Python library for automating interaction with websites (see the sketch after this list).
  • portia - A visual crawler built on Scrapy.
  • crawley - A non-blocking Python crawler framework.
  • RoboBrowser - In short, RoboBrowser is a browser without a user interface: a headless browser implemented in pure Python that runs in memory. It can open web pages, click links and buttons, and submit forms. Its feature set is small, but for crawling and simple web testing it is usually enough.
  • MSpider - A simple, easy-to-use crawler framework that uses gevent and JS rendering.
  • brownant - A lightweight web data extraction framework.
  • PSpider - A lightweight Python 3 crawler framework.
  • Gain - A new crawler framework based on asyncio.
  • sukhoi - A minimalist yet powerful web crawler.
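
A minimal sketch of what a Scrapy spider looks like, to illustrate the "extract structured data from pages" description above. The spider name, the quotes.toscrape.com demo site, and the CSS selectors are illustrative assumptions, not something prescribed by the list:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # Illustrative spider: crawls a public demo site and yields
    # structured items (dicts) that Scrapy can export as JSON/CSV.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination; Scrapy schedules the follow-up requests.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, it can be run outside a project with `scrapy runspider quotes_spider.py -o quotes.json`.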
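
For comparison, a sketch of the "automated interaction with websites" style that MechanicalSoup offers; the login URL and the form field names below are hypothetical placeholders rather than a real site:

```python
import mechanicalsoup

# StatefulBrowser keeps cookies and the current page between calls.
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com/login")  # hypothetical login page

# Fill and submit the first form on the page; the field names
# "username" and "password" are assumptions for illustration.
browser.select_form("form")
browser["username"] = "alice"
browser["password"] = "secret"
response = browser.submit_selected()

print(response.status_code)  # HTTP status of the post-submit page
print(browser.get_url())     # URL reached after any redirect
```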

Java

  • Apache Nutch - Nutch is a highly mature, extensible and scalable production-grade web crawler. Built on Apache Hadoop data structures, it provides good batch-processing support. Besides its plugin-based, modular design, Nutch offers extensible interfaces.
    • anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
  • Crawler4j - Simple and lightweight web crawler.
  • JSoup - Scrapes, parses, manipulates and cleans HTML.
  • websphinx - Website-Specific Processors for HTML information extraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • Gecco - An easy-to-use lightweight web crawler.
  • WebCollector - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
  • Webmagic - A scalable crawler framework.
  • Spiderman - A scalable, extensible, multi-threaded web crawler.
    • Spiderman2 - A distributed web crawler framework with JS rendering support.
  • Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
  • SeimiCrawler - An agile, distributed crawler framework.
  • StormCrawler - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm.
  • Spark-Crawler - Evolving Apache Nutch to run on Spark.
  • webBee - A DFS web spider.

C#

  • ccrawler - Built with C# 3.5. It contains a simple extension for web content categorization, which can separate web pages based on their content.
  • SimpleCrawler - Simple spider based on multithreading and regular expressions.
  • DotnetSpider - A cross-platform, lightweight spider developed in C#.
  • Abot - C# web crawler built for speed and flexibility.
  • Hawk - Advanced Crawler and ETL tool written in C#/WPF.
  • SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.

JavaScript

PHP

C++

C

  • httrack - Copy websites to your computer.

Ruby

  • upton - A batteries-included framework for easy web scraping. Just add CSS (or do more).
  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
  • RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.
  • Spidr - Spider a site, multiple domains, certain links, or infinitely.
  • Cobweb - Web crawler with very flexible crawling options, standalone or using sidekiq.
  • mechanize - Automated web interaction & crawling.

R

  • rvest - Simple web scraping for R.

Erlang

  • ebot - A scalable, distributed and highly configurable web crawler.

Perl

  • web-scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

Go

  • pholcus - A distributed, high concurrency and powerful web crawler.
  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
  • go_spider - An awesome Go concurrent crawler (spider) framework.
  • dht - BitTorrent DHT Protocol && DHT Spider.
  • ants-go - An open source, distributed, RESTful crawler engine in Golang.
  • scrape - A simple, higher level interface for Go web scraping.
  • creeper - The Next Generation Crawler Framework (Go).
  • colly - Fast and Elegant Scraping Framework for Gophers.

Scala

  • crawler - Scala DSL for web crawling.
  • scrala - Scala crawler (spider) framework, inspired by Scrapy.
  • ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.

GitHub repository: https://github.com/BruceDone/awesome-crawler
