How to Scrape Lazy-Loaded Images: Extracting the Original Image URLs
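The jQuery lazyload plugin defers image loading so that pages render faster. As a side effect, the src of an img in the DOM is not the original image, so a scraper has to handle these images separately.
This approach relies on a convenient PHP DOM manipulation class: simple_html_dom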
Sample HTML
<div class="bike-detail-highlights lead"> <div class="content-box"> <div class="headline-box"> <h3> Highlights </h3> </div> </div> <div class="teaser-section lead"> <div class="canyon-carousel" data-slides-to-show="2|2|1" data-dots="0" data-arrows="1"> <div class="slide teaser-box-wide image-box"> <figure><span class="lazyload"> <img data-srcset="https://static.canyon.com/img/cache/6d/f/985c7f395d74d845957046eee187d.jpg 1199w, https://static.canyon.com/img/cache/b6/8/29dac208809289cf2706bfd03bedc.jpg 767w, https://static.canyon.com/img/cache/2d/8/82976eba6dabc8bde40ec7476a4c0.jpg 599w, https://static.canyon.com/img/cache/91/c/2420482aeb1c9f12a82b16e4c7951.jpg 480w, https://static.canyon.com/img/cache/4b/f/a30d49a2891975c1e6646acdd4aa4.jpg 383w, https://static.canyon.com/img/cache/bd/7/02d38c5b77f1279d3f2df74ec208f.jpg 240w" data-sizes="(min-width: 1202px) 599px, (min-width: 768px) 383px, 100vw" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" alt="Reynolds Strike Disc Carbon clincher " class="img-responsive" width="1199" height="799"></span> <figcaption><h4> Reynolds Strike Disc Carbon clincher </h4> <p> </p> </figcaption></figure> </div> <div class="slide teaser-box-wide image-box"> <figure><span class="lazyload"> <img data-srcset="https://static.canyon.com/img/cache/e9/3/f613267d16185572c41f2ca24fef6.jpg 1199w, https://static.canyon.com/img/cache/1a/4/5279931bab6c15d131aa70b7a46f1.jpg 767w, https://static.canyon.com/img/cache/ef/f/eb24ea886b76d12de3835290bb62e.jpg 599w, https://static.canyon.com/img/cache/69/8/1857c223f7f3c70677e080e2cf929.jpg 480w, https://static.canyon.com/img/cache/e5/e/79115b98d5927ec88507d775628a1.jpg 383w, https://static.canyon.com/img/cache/c2/e/da604b931c2beb157ea08e9e575c3.jpg 240w" data-sizes="(min-width: 1202px) 599px, (min-width: 768px) 383px, 100vw" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" alt="Shimano Dura-Ace, 11s" class="img-responsive" width="1199" 
height="799"></span> <figcaption><h4> Shimano Dura-Ace, 11s </h4> <p> </p> </figcaption></figure> </div> <div class="slide teaser-box-wide image-box"> <figure><span class="lazyload"> <img data-srcset="https://static.canyon.com/img/cache/0e/e/5e812d3079bc83f2d2b99ad4c45b5.jpg 1199w, https://static.canyon.com/img/cache/cf/6/beb4b3be11a5d63bd08a637b2e021.jpg 767w, https://static.canyon.com/img/cache/7e/0/0f07004c0873a629ced7e52f21a9c.jpg 599w, https://static.canyon.com/img/cache/74/1/29ded42ffa30dc67e40ed4354ce76.jpg 480w, https://static.canyon.com/img/cache/35/9/9134bb2d291eafd357b97580b89c1.jpg 383w, https://static.canyon.com/img/cache/a6/2/56ef5a5f90178396aae1aeebdfc59.jpg 240w" data-sizes="(min-width: 1202px) 599px, (min-width: 768px) 383px, 100vw" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" alt="Fizik Arione R5 + Canyon S27 Aero VCLS" class="img-responsive" width="1199" height="799"></span> <figcaption><h4> Fizik Arione R5 + Canyon S27 Aero VCLS </h4> <p> </p> </figcaption></figure> </div> </div> </div> </div>
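The markup above uses responsive lazy loading. The src of each img tag is not the real image address (it is only an inline data: URI placeholder). The original images are listed in data-srcset, with a separate image for each display width. We naturally want the largest one, for example: https://static.canyon.com/img/cache/0e/e/5e812d3079bc83f2d2b99ad4c45b5.jpg 1199w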
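Because every candidate in data-srcset carries a width descriptor such as 1199w, the largest image can be chosen by comparing those numbers instead of relying on the order of the list. The following is a minimal sketch in plain PHP. The function name pickLargestFromSrcset is hypothetical (it is not part of simple_html_dom), and the sample string reuses three URLs from the HTML above in shuffled order purely for illustration.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): return the URL that has
// the largest "w" width descriptor in a srcset-style string, or null if empty.
function pickLargestFromSrcset(string $srcset): ?string
{
    $bestUrl = null;
    $bestWidth = -1;
    // Candidates are comma-separated, each of the form "<url> <width>w".
    // Assumption: the URLs themselves contain no commas.
    foreach (explode(',', $srcset) as $candidate) {
        $parts = preg_split('/\s+/', trim($candidate));
        if ($parts[0] === '') {
            continue; // skip empty candidates
        }
        // A candidate without a width descriptor is treated as width 0.
        $width = 0;
        if (isset($parts[1]) && preg_match('/^(\d+)w$/', $parts[1], $m)) {
            $width = (int) $m[1];
        }
        if ($width > $bestWidth) {
            $bestWidth = $width;
            $bestUrl = $parts[0];
        }
    }
    return $bestUrl;
}

// Candidates deliberately listed out of order.
$srcset = 'https://static.canyon.com/img/cache/35/9/9134bb2d291eafd357b97580b89c1.jpg 383w, '
        . 'https://static.canyon.com/img/cache/0e/e/5e812d3079bc83f2d2b99ad4c45b5.jpg 1199w, '
        . 'https://static.canyon.com/img/cache/a6/2/56ef5a5f90178396aae1aeebdfc59.jpg 240w';

echo pickLargestFromSrcset($srcset), PHP_EOL;
```

With this input the script prints the 1199w URL even though it is not the first entry.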
Solution:
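// Assumes simple_html_dom.php has been included and $html holds the page markup.
$doc = new simple_html_dom();
$doc->load($html);
foreach ($doc->find("img") as $img) {
    // Guard with isset() so images without data-srcset do not raise a notice.
    $srcs = isset($img->attr["data-srcset"]) ? $img->attr["data-srcset"] : "";
    // preg_match() returns the first URL only; in this markup that is the
    // 1199w candidate, because each data-srcset lists the widest image first.
    if ($srcs && preg_match('#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#si', $srcs, $result)) {
        // Overwrite the placeholder src on the node itself.
        $img->src = $result[0];
    }
}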
After this pass, every img in the DOM object carries the correct src, so extracting the images afterwards is straightforward.
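To show what the extraction step can look like, here is a small self-contained sketch. It uses PHP's built-in DOMDocument rather than simple_html_dom so that it runs without any extra file. The example.com URLs and the $urls variable are made up for illustration, and the first-candidate rule assumes the widest image is listed first, as it is in the sample HTML.

```php
<?php
// Sketch with PHP's built-in DOM extension (a stand-in for simple_html_dom).
// The markup and the example.com URLs below are illustrative only.
$html = '<div><img data-srcset="https://example.com/a-1199.jpg 1199w, https://example.com/a-240.jpg 240w"'
      . ' src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" alt="demo"></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$urls = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $srcset = $img->getAttribute('data-srcset'); // '' when the attribute is absent
    if ($srcset !== '') {
        // Take the URL of the first candidate (everything before the first whitespace).
        $first = preg_split('/\s+/', trim($srcset))[0];
        $img->setAttribute('src', $first);
    }
    // Collect the (now corrected) src of every image.
    $urls[] = $img->getAttribute('src');
}

print_r($urls);
```

Images without data-srcset keep their original src, so ordinary images on the same page are collected unchanged.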