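How to scrape lazy-loaded images: getting the original image URLs

The jQuery lazyload plugin defers image loading so that pages load faster. As a result, the src of each img in the DOM is not the original image, so these images need special handling when scraping.

You will need a handy PHP DOM manipulation class: simple_html_dom

Sample HTML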

<div class="bike-detail-highlights lead">
					<div class="content-box">
						<div class="headline-box">

							<h3>
								Highlights
							</h3>
													</div>
					</div>

					<div class="teaser-section lead">
						<div class="canyon-carousel" data-slides-to-show="2|2|1" data-dots="0" data-arrows="1">
															<div class="slide teaser-box-wide image-box">
									<figure><span class="lazyload">
			<img data-srcset="https://static.canyon.com/img/cache/6d/f/985c7f395d74d845957046eee187d.jpg 1199w, https://static.canyon.com/img/cache/b6/8/29dac208809289cf2706bfd03bedc.jpg 767w, https://static.canyon.com/img/cache/2d/8/82976eba6dabc8bde40ec7476a4c0.jpg 599w, https://static.canyon.com/img/cache/91/c/2420482aeb1c9f12a82b16e4c7951.jpg 480w, https://static.canyon.com/img/cache/4b/f/a30d49a2891975c1e6646acdd4aa4.jpg 383w, https://static.canyon.com/img/cache/bd/7/02d38c5b77f1279d3f2df74ec208f.jpg 240w" data-sizes="(min-width: 1202px) 599px, (min-width: 768px) 383px, 100vw" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" alt="Reynolds Strike Disc Carbon clincher " class="img-responsive" width="1199" height="799"></span>
										<figcaption><h4>
												Reynolds Strike Disc Carbon clincher 
											</h4>

											<p>
												
											</p>
										</figcaption></figure>
</div>
															<div class="slide teaser-box-wide image-box">
									<figure><span class="lazyload">
			<img data-srcset="https://static.canyon.com/img/cache/e9/3/f613267d16185572c41f2ca24fef6.jpg 1199w, https://static.canyon.com/img/cache/1a/4/5279931bab6c15d131aa70b7a46f1.jpg 767w, https://static.canyon.com/img/cache/ef/f/eb24ea886b76d12de3835290bb62e.jpg 599w, https://static.canyon.com/img/cache/69/8/1857c223f7f3c70677e080e2cf929.jpg 480w, https://static.canyon.com/img/cache/e5/e/79115b98d5927ec88507d775628a1.jpg 383w, https://static.canyon.com/img/cache/c2/e/da604b931c2beb157ea08e9e575c3.jpg 240w" data-sizes="(min-width: 1202px) 599px, (min-width: 768px) 383px, 100vw" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" alt="Shimano Dura-Ace, 11s" class="img-responsive" width="1199" height="799"></span>
										<figcaption><h4>
												Shimano Dura-Ace, 11s
											</h4>

											<p>
												
											</p>
										</figcaption></figure>
</div>
															<div class="slide teaser-box-wide image-box">
									<figure><span class="lazyload">
			<img data-srcset="https://static.canyon.com/img/cache/0e/e/5e812d3079bc83f2d2b99ad4c45b5.jpg 1199w, https://static.canyon.com/img/cache/cf/6/beb4b3be11a5d63bd08a637b2e021.jpg 767w, https://static.canyon.com/img/cache/7e/0/0f07004c0873a629ced7e52f21a9c.jpg 599w, https://static.canyon.com/img/cache/74/1/29ded42ffa30dc67e40ed4354ce76.jpg 480w, https://static.canyon.com/img/cache/35/9/9134bb2d291eafd357b97580b89c1.jpg 383w, https://static.canyon.com/img/cache/a6/2/56ef5a5f90178396aae1aeebdfc59.jpg 240w" data-sizes="(min-width: 1202px) 599px, (min-width: 768px) 383px, 100vw" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" alt="Fizik Arione R5 + Canyon S27 Aero VCLS" class="img-responsive" width="1199" height="799"></span>
										<figcaption><h4>
												Fizik Arione R5 + Canyon S27 Aero VCLS
											</h4>

											<p>
												
											</p>
										</figcaption></figure>
</div>
													</div>
					</div>

				</div>

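The markup above uses responsive lazy loading: the src of each img tag is not the real image address but a tiny placeholder GIF embedded as a data URI. The original images are listed in data-srcset, with a different image for each display width. We naturally want the largest one, for example: https://static.canyon.com/img/cache/0e/e/5e812d3079bc83f2d2b99ad4c45b5.jpg 1199w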
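To make "largest" concrete, here is a minimal sketch that parses a data-srcset value and returns the URL with the highest width descriptor, regardless of the order in which the candidates are listed. The function name pickLargestSrcsetUrl is hypothetical and not part of simple_html_dom; the sketch assumes every candidate has the form "URL width-in-pixels followed by w", as in the sample markup.

```php
<?php
/**
 * Return the URL with the largest width descriptor from a srcset-style string.
 * Hypothetical helper for illustration; not part of simple_html_dom.
 */
function pickLargestSrcsetUrl(string $srcset): ?string
{
    $bestUrl = null;
    $bestWidth = -1;
    // Candidates are separated by a comma followed by whitespace
    foreach (preg_split('/,\s+/', trim($srcset)) as $candidate) {
        // Each candidate is expected to look like "<url> <width>w"; skip anything else
        if (preg_match('/^(\S+)\s+(\d+)w$/', trim($candidate), $m)) {
            $width = (int) $m[2];
            if ($width > $bestWidth) {
                $bestWidth = $width;
                $bestUrl = $m[1];
            }
        }
    }
    return $bestUrl; // null when no candidate carries a width descriptor
}

// Usage: three candidates from the sample markup, deliberately out of order
$srcset = 'https://static.canyon.com/img/cache/35/9/9134bb2d291eafd357b97580b89c1.jpg 383w, '
        . 'https://static.canyon.com/img/cache/0e/e/5e812d3079bc83f2d2b99ad4c45b5.jpg 1199w, '
        . 'https://static.canyon.com/img/cache/a6/2/56ef5a5f90178396aae1aeebdfc59.jpg 240w';
echo pickLargestSrcsetUrl($srcset), "\n";
// prints https://static.canyon.com/img/cache/0e/e/5e812d3079bc83f2d2b99ad4c45b5.jpg
```

Comparing the width descriptors instead of relying on position keeps the result correct even on pages that list the smallest image first.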

Solution:

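require_once 'simple_html_dom.php'; // adjust the path to wherever the library lives

$doc = new simple_html_dom();
$doc->load($html); // $html holds the page markup fetched earlier
foreach ($doc->find("img") as $img) {
	// Lazy-loaded images keep their real URLs in data-srcset
	$srcs = $img->getAttribute("data-srcset");
	if ($srcs) {
		// Grab the first URL in the list; in this markup the widest image is listed first
		if (preg_match('#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#si', $srcs, $result)) {
			// Overwrite the placeholder src with the real image URL
			$img->src = $result[0];
		}
	}
}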

After this pass, every img element in the DOM object carries the correct src, so extracting the images afterwards is straightforward.
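As a rough illustration of the extraction step, the sketch below collects the final image URLs from a page. So that it runs without any third-party file, it uses PHP's bundled DOMDocument instead of simple_html_dom; that swap, the function name collectImageUrls, and the simplified URL pattern are all assumptions made for this example. It assumes a non-empty HTML string.

```php
<?php
/**
 * Collect image URLs from HTML, preferring the first data-srcset candidate over src.
 * Hypothetical helper built on PHP's bundled DOM extension, not on simple_html_dom.
 */
function collectImageUrls(string $html): array
{
    $dom = new DOMDocument();
    // Scraped markup is rarely valid, so collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $srcset = $img->getAttribute('data-srcset');
        if ($srcset !== '' && preg_match('#https?://[^\s,]+#i', $srcset, $m)) {
            // Lazy-loaded image: take the first candidate, like the loop above
            $urls[] = $m[0];
        } elseif ($img->getAttribute('src') !== '') {
            // Ordinary image: src is already the real address
            $urls[] = $img->getAttribute('src');
        }
    }
    return $urls;
}

// Usage with a shortened version of one slide from the sample markup
$sample = '<figure><span class="lazyload"><img data-srcset="'
        . 'https://static.canyon.com/img/cache/0e/e/5e812d3079bc83f2d2b99ad4c45b5.jpg 1199w, '
        . 'https://static.canyon.com/img/cache/cf/6/beb4b3be11a5d63bd08a637b2e021.jpg 767w" '
        . 'src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" '
        . 'alt="Fizik Arione R5 + Canyon S27 Aero VCLS"></span></figure>';
echo implode("\n", collectImageUrls($sample)), "\n";
```

With simple_html_dom the same idea reduces to looping over $doc->find("img") and reading each element's src once the rewrite above has run.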
