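Scraping images from a website with Python: we start by opening

http://desk.zol.com.cn/bizhi/6429_79089_2.html

The page contains many images, and clicking "next page" takes us to the following page. So how can we use Python to scrape the image resources on this page?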

The Python requests library

Official site of the requests library: http://docs.python-requests.org/en/master/

Installation

pip(3) install requests

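Usage

Using the requests library requires some familiarity with the HTTP protocol: status codes, request headers, response headers, methods, fields and parameters.

After importing the library, we send a GET request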

>>> import requests
>>> r = requests.get("http://desk.zol.com.cn/bizhi/6429_79089_2.html")
>>> r.status_code  # a status code of 200 means the request succeeded
200

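In the call above, requests.get(url, params=None, **kwargs) takes:

url: the URL of the page to fetch
params: extra parameters that are appended to the URL as its query string
**kwargs: 12 optional arguments that control the request, such as headers, timeout and proxies. For example, some sites use hotlink protection and only serve requests that carry a particular Referer or User-Agent; in that case we pass custom request headers through the headers argument.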
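
To make the difference between the query parameters and the request headers concrete, here is a minimal sketch that builds a request without sending it. The query parameter page and the header values are made up for illustration; nothing here is known to be required by the site.

```python
import requests

URL = "http://desk.zol.com.cn/bizhi/6429_79089_2.html"

# Hypothetical values, chosen only to illustrate where each argument ends up.
query = {"page": 2}
custom_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; demo-crawler)",
    "Referer": "http://desk.zol.com.cn/",
}

# Request(...).prepare() builds the request as it would be sent,
# but performs no network I/O.
prepared = requests.Request("GET", URL, params=query, headers=custom_headers).prepare()

print(prepared.url)                    # params end up in the query string
print(prepared.headers["User-Agent"])  # headers travel separately from the URL
```

For a real request the same arguments would be passed as requests.get(URL, params=query, headers=custom_headers, timeout=30).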

The response headers are available through r.headers

>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Content-Length': '27571', 'Via': 'http/1.1 zats-other1 (zcache-other1 [cRs f ])', 'Age': '3983', 'Expires': 'Mon, 10 Jul 2017 05:59:40 GMT', 'Vary': 'Accept-Encoding', 'Server': 'ngx_openresty', 'Last-Modified': 'Mon, 10 Jul 2017 03:59:40 GMT', 'Connection': 'keep-alive', 'Cache-Control': 'max-age=7200', 'Date': 'Mon, 10 Jul 2017 05:06:03 GMT', 'nnCoection': 'close', 'Content-Type': 'text/html; charset=GBK'}

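Attributes of the Response object

Attribute	Description
r.status_code	Status code returned for the HTTP request; 200 means success, while codes such as 404 or 500 indicate a problem
r.text	The HTTP response body as a string, i.e. the content of the page at the URL
r.encoding	The response encoding guessed from the HTTP headers
r.apparent_encoding	The response encoding detected by analysing the content
r.content	The response body in binary form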

>>> r.encoding
'GBK'

>>> r.content
'<!DOCTYPE HTML>\r\n<html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=gb2312">\r\n<meta name="applicable-device" content="pc">\r\n<title>\xba\xab\xb9\xfa\xd0\xa1\xc7\xe5\xd0\xc2\xc4\xcf\xb9\xe7\xc0\xf6\xbf\xed\xc6\xc1\xb1\xda\xd6\xbd-ZOL\xd7\xc0\xc3\xe6\xb1\xda\xd6\xbd</title>\r\n              <meta name="keywords" content="" />\r\n              <meta name="description" content=""/><meta property="og:type" content="image"/>\n<meta property="og:image" content="http://desk.fd.zol-img.com.cn/t_s120x90c5/g5/M00/0B/05/ChMkJlcgdH2IVmv2AAYP2zcB7GQAAQr3gJjQtUABg_z016.jpg!awen)"/>\n\r\n<link href="http://s.zol-img.com.cn/d/Desk/Desk_bizhi_detail.css?v=1028" rel="stylesheet" type="text/css" />\r\n\r\n<script src="http://p.zol-img.com.cn/desk/detail.js" type="text/javascript"></script>\r\n<script src="http://icon.zol-img.com.cn/public/js/swfobject.js" type="text/javascript"></script>\r\n<script src="http://icon.zol-img.com.cn/getcook.js?1312" type="text/javascript"></script>\r\n<script>\r\n\tdocument.domain = "zol.com.cn";\r\n\tvar userid = get_cookie(\'zol_userid\');\r\n\t\tvar deskPicArr \t\t= 
{"list":[{"picId":"79089","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJlcgdH2IVmv2AAYP2zcB7GQAAQr3gJjQtUABg_z016.jpg!awen)"},{"picId":"79087","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJ1cgdHmIaBQ5AAxkE0uQNWQAAQr3QO8WIIADGQr480.jpg!awen)"},{"picId":"79088","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJ1cgdHuIZqqBAA-e_5SjTQMAAQr3gD3HtsAD58X749.jpg!awen)"},{"picId":"79090","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJlcgdH6IJHoZAAxppAZiw2UAAQr3gMT8GkADGm8468.jpg!awen)"},{"picId":"79091","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJlcgdICIFA7wAAwW8vbAcEYAAQr3wBYmyYADBcK439.jpg!awen)"},{"picId":"79092","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\/ChMkJ1cgdIKIHmjLAAW0KFEpkQUAAQr3wGS5wkABbRA556.jpg!awen)"},{"picId":"79093","oriSize":"2560x1600","resAll":["2560x1600","1920x1080","1680x1050","1600x900","1440x900","1366x768","1280x1024","1280x800","1024x768"],"imgsrc":"http:\\/\\/desk.fd.zol-img.com.cn\\/t_s##SIZE##\\/g5\\/M00\\/0B\\/05\\
/ChMkJ1cgdISId2OQAA3bomjVkl0AAQr3wI0BO4ADdu6814.jpg!awen)"}]};\r\n\t/***********\xc8\xab\xbe\xd6\xb1\xe4\xc1\xbf\xb5\xc4\xc9\xf9\xc3\xf7*************/\r\n\tvar $deskGlobalConfig = {\r\n\t\t\tgroupId\t\t\t: 6429,\t\t\t//\xd7\xe9\xcd\xbcID\r\n\t\t\tp

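If the text comes out garbled, set the encoding to the one detected from the content:

>>> r.apparent_encoding
'GB2312'
>>> r.encoding = r.apparent_encoding

If the headers contain no charset, the encoding is assumed to be ISO-8859-1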
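
The fallback rule and its effect can be sketched offline as follows. The sample bytes stand in for r.content of a GBK-encoded page and are an assumption made for the example; get_encoding_from_headers is the helper in requests.utils that derives the encoding from the headers.

```python
from requests.utils import get_encoding_from_headers

# With an explicit charset, that charset becomes the encoding.
with_charset = get_encoding_from_headers({"content-type": "text/html; charset=GBK"})
# Without one, text/* responses fall back to ISO-8859-1.
without_charset = get_encoding_from_headers({"content-type": "text/html"})
print(with_charset, without_charset)

# Sample bytes standing in for r.content of a GBK-encoded page.
raw = "桌面壁纸".encode("gbk")

garbled = raw.decode("iso-8859-1")  # decoding with the wrong encoding
readable = raw.decode("gbk")        # decoding with the right encoding
print(repr(garbled))
print(readable)
```

This is why assigning r.apparent_encoding to r.encoding before reading r.text usually repairs garbled Chinese pages.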

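Handling request exceptions

Exception	Description
requests.ConnectionError	Network connection error, such as a failed DNS lookup or a refused connection
requests.HTTPError	HTTP error
requests.URLRequired	A URL is missing
requests.TooManyRedirects	The maximum number of redirects was exceeded
requests.Timeout	The request to the URL timed out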

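#!/usr/bin/python
# -*- coding: utf-8 -*-

import requests


def getHTMLIMG(url):
    """Fetch url and return the page text, or an error message on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Base class of every exception raised by requests
        return "request failed"


if __name__ == "__main__":
    url = "http://desk.zol.com.cn/bizhi/6429_79089_2.html"
    print(getHTMLIMG(url))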
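
The behaviour of raise_for_status() and the relationship between the exceptions in the table can be checked without network access. In this sketch the Response object is built by hand and the URL is hypothetical; a real response would come from requests.get().

```python
import requests

# A hand-built Response stands in for a real "404 Not Found" reply,
# so the example runs without network access.
fake = requests.models.Response()
fake.status_code = 404
fake.url = "http://desk.zol.com.cn/missing.html"  # hypothetical URL

try:
    fake.raise_for_status()
    outcome = "ok"
except requests.HTTPError as exc:
    outcome = "HTTP error: {}".format(exc.response.status_code)

print(outcome)

# All exceptions in the table share one base class, so a single
# "except requests.RequestException" clause covers every one of them.
for exc_type in (requests.ConnectionError, requests.HTTPError,
                 requests.URLRequired, requests.TooManyRedirects,
                 requests.Timeout):
    print(exc_type.__name__, issubclass(exc_type, requests.RequestException))
```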

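HTTP methods in requests

Method	Description
requests.request()	Builds and sends a request; the base that all the methods below are built on
requests.get()	Sends a GET request to fetch the content of a resource
requests.head()	Fetches only the headers
requests.post()	Submits a POST request
requests.put()	Submits a PUT request
requests.patch()	Submits a partial modification request
requests.delete()	Submits a DELETE request

When writing a crawler, the method we use most of the time is GET.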
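
As a small illustration of how the methods differ, the sketch below prepares a GET and a POST request without sending them. The URL and the field names are placeholders, not taken from the site above.

```python
import requests

URL = "http://example.com/api"  # placeholder URL, not from the article

# requests.get(url) is shorthand for requests.request("GET", url), and so on
# for the other methods. Preparing the requests shows what each would send.
get_req = requests.Request("GET", URL, params={"q": "wallpaper"}).prepare()
post_req = requests.Request("POST", URL, data={"name": "demo"}).prepare()

print(get_req.method, get_req.url)
print(post_req.method, post_req.body, post_req.headers["Content-Type"])
```

A GET request carries its parameters in the URL, while a POST request with a dict as data carries them form-encoded in the body.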

Beautiful Soup

Installation

pip install beautifulsoup4

Official documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Usage

➜  ~ python
Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 12:39:47)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
u'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

Importing bs4


>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>>

Showing the title

>>> soup.title
<title>This is a python demo page</title>

Printing the a tag

>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Accessing a tag this way only returns the first matching tag
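
To get every a tag rather than only the first, find_all() can be used. This sketch works on a trimmed copy of the demo page so that it runs offline.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the demo page used above.
html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")          # every a tag, not only the first
hrefs = [a["href"] for a in links]  # the href attribute of each one
for href in hrefs:
    print(href)
```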

Getting the tag name

>>> soup.a.name
u'a'
>>> soup.a.parent.name
u'p'
>>> soup.a.parent.parent.name
u'body'

Getting the attributes

>>> tag = soup.a
>>> tag.attrs
{u'href': u'http://www.icourse163.org/course/BIT-268001', u'class': [u'py1'], u'id': u'link1'}
>>> tag.attrs['class']
[u'py1']
>>> tag.attrs['href']
u'http://www.icourse163.org/course/BIT-268001'

Checking the types

>>> type(tag.attrs)
<type 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>

Getting the content of a tag

>>> soup.a.string
u'Basic Python'
>>> soup.p.string
u'The demo python introduces several python courses.'

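As we saw above, the p tag actually wraps a b tag

<p class="title">
 <b>
  The demo python introduces several python courses.
 </b>
</p>

This shows that .string can reach across nested levels: when a tag has exactly one child, .string returns that child's string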
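
The limits of .string are easy to check on a small document. In this sketch the HTML is a shortened variant of the demo page; get_text() is shown as one alternative when a tag has several children.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Two courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

title_p = soup.find("p", class_="title")
course_p = soup.find("p", class_="course")

print(title_p.string)        # single child: the string of the inner b tag
print(course_p.string)       # several children: .string is None
print(course_p.get_text())   # get_text() joins all nested strings instead
```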

Beautiful Soup elements

Traversal

Traversal can go downward, upward, or sideways between siblings. Downward traversal first:

>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
[u'\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, u'\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, u'\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

Attributes

.contents: a list of the direct children
.children and .descendants: iterators, meant to be used with a for loop
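
A short sketch of both iterators in a for-style loop, run on a shortened variant of the demo page. Whitespace strings between tags also count as children, so the loops filter on Tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<body>\n"
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>\n'
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# .children yields only the direct children, including whitespace strings.
child_tags = [node.name for node in soup.body.children if isinstance(node, Tag)]

# .descendants walks the whole subtree, depth first.
descendant_tags = [node.name for node in soup.body.descendants if isinstance(node, Tag)]

print(child_tags)
print(descendant_tags)
```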

Upward traversal

>>> soup.body.parent
<html><head><title>This is a python demo page</title></head>\n<body>\n<p class="title"><b>The demo python introduces several python courses.</b></p>\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>\n</body></html>

This is upward traversal of the tag tree

Attributes

.parent
.parents
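
While .parent returns a single node, .parents iterates over all ancestors. A minimal sketch on a reduced document:

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="course"><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parents yields every ancestor, ending with the BeautifulSoup object itself,
# whose name is "[document]".
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```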

Sibling traversal

Getting the next node (which may be a string rather than a tag)

>>> soup.a.next_sibling
u' and '

>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

Getting the previous node

>>> soup.a.previous_sibling
u'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'

Attributes

.next_sibling
.previous_sibling
.next_siblings
.previous_siblings
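
The plural forms are iterators over all following or preceding siblings. Because siblings can be plain strings, this sketch filters on Tag to keep only the tags; the HTML is a shortened variant of the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p class="course">Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

following = list(first_link.next_siblings)       # strings and tags after link1
following_tags = [node for node in following if isinstance(node, Tag)]
preceding = list(first_link.previous_siblings)   # everything before link1

print(following)
print(following_tags)
print(preceding)
```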

Displaying the content in a friendlier way

>>> soup.prettify()
u'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

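Basic elements of the bs4 library

Tag: a tag
Name: the name of the tag
Attributes: the attributes of the tag
NavigableString: the string between a tag's opening and closing parts
Comment: a comment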
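
These element types can be told apart with type(). The HTML in this sketch is made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = '<b><!--This is a comment--></b><p class="note">This is not a comment</p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(type(tag).__name__, tag.name, tag.attrs)  # Tag, its Name and its Attributes
print(type(soup.p.string).__name__)             # ordinary text: NavigableString
print(type(soup.b.string).__name__)             # comment text: Comment
```

Since Comment is a subclass of NavigableString, checking the exact type is what separates comments from ordinary text.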

The content of this article comes from NetEase Cloud Classroom (网易云课堂)

Author: 阿文

Article link: https://www.awen.me/post/3686208739.html
