Python Web Scraping Essentials: BeautifulSoup4 in Practice

In the world of Python web scraping, if requests is the courier that "brings the page back," then BeautifulSoup is the treasure hunter that "finds things inside the page."

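BeautifulSoup (usually meaning the bs4 package) is a Python library for extracting data from HTML and XML files. It offers a simple, Pythonic way to navigate, search, and modify the parse tree, which makes it a common first choice of parsing library for newcomers to scraping.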

Installation

The official documentation recommends pairing it with the lxml parser, which is faster:

pip install beautifulsoup4 lxml
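
If lxml may not be installed in every environment, one option is to fall back to Python's built-in html.parser. The following is a minimal sketch of that idea; the helper name make_soup is hypothetical and not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str) -> BeautifulSoup:
    """Parse markup with lxml when available, else with html.parser.

    make_soup is a hypothetical helper name used only for this sketch.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        # html.parser ships with Python, so it is always available.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b>!</p>")
print(soup.get_text())  # Hello, soup!
```

The two parsers can repair badly broken markup in slightly different ways, so it is worth testing against the same parser that runs in production.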

Core Usage

Suppose we need to parse a simple piece of HTML:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

1. Creating the Soup object

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.string)
# Output: The Dormouse's story
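
Once the soup exists, tags can be reached as attributes and inspected like ordinary Python objects. The sketch below uses a small made-up snippet and html.parser (chosen here only so it runs without extra dependencies) to show a few of these properties.

```python
from bs4 import BeautifulSoup

# A small made-up document used only for this illustration.
snippet = (
    "<html><head><title>Demo page</title></head>"
    '<body><p class="title"><b>Heading</b></p></body></html>'
)
soup = BeautifulSoup(snippet, "html.parser")

title_tag = soup.title
print(title_tag.name)         # title
print(title_tag.string)       # Demo page
print(title_tag.parent.name)  # head

# class is a multi-valued attribute, so bs4 returns it as a list.
print(soup.p["class"])        # ['title']
```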

2. Searching nodes

BeautifulSoup provides two very handy search methods, find and find_all.

Finding a single element:

# Find the first <a> tag
link = soup.find('a')
print(link['href'])
# Output: http://example.com/elsie
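
A common stumbling block is that find returns None when nothing matches, and indexing a tag with a missing attribute raises KeyError. The sketch below, using a made-up snippet and html.parser, shows one way to guard against both cases.

```python
from bs4 import BeautifulSoup

# Made-up markup that deliberately contains no <a> tag.
soup = BeautifulSoup("<p>No links here, only <b>bold</b> text.</p>", "html.parser")

link = soup.find("a")
if link is None:
    # find() returns None when no element matches.
    print("no <a> tag found")
else:
    print(link["href"])

bold = soup.find("b")
# Tag.get() returns None (or a default) instead of raising KeyError.
print(bold.get("href"))         # None
print(bold.get("href", "n/a"))  # n/a
```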

Finding all elements:

# Find all <a> tags whose class is "sister"
for link in soup.find_all('a', class_='sister'):
    print(link.get_text())
# Output (one name per line): Elsie, Lacie, Tillie
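
In practice the matches are usually collected into a data structure rather than printed. The following sketch repeats the three links from the sample document and shows a few find_all variations; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect each matching link into a plain dict.
records = [
    {"id": a["id"], "name": a.get_text(strip=True), "url": a["href"]}
    for a in soup.find_all("a", class_="sister")
]

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)

# Keyword arguments other than class_ filter on that attribute.
by_id = soup.find_all(id="link2")

print(records[0])
print([a["id"] for a in first_two])
print(by_id[0].get_text())
```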

3. CSS selectors

If you are used to writing CSS, BeautifulSoup also supports CSS selectors through the select and select_one methods:

# Select the element with id="link3"
print(soup.select_one('#link3').text)

# Select all <a> tags under .story
links = soup.select('.story a')
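
The selector calls above can be extended a little further. This sketch, which again repeats the sample links and uses html.parser, shows what select returns, an attribute selector, and a guard for select_one finding nothing.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list, so it is safe to iterate even when empty.
hrefs = [a["href"] for a in soup.select(".story a")]

# Attribute selectors work too: links whose href ends with "lacie".
lacie = soup.select('a[href$="lacie"]')

# select_one() returns None when nothing matches, so guard before using it.
missing = soup.select_one("#link4")
label = missing.get_text() if missing is not None else "(not found)"

print(hrefs)
print(lacie[0]["id"])
print(label)
```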

Summary

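Although other parsing options exist, such as XPath (for example through lxml) and PyQuery, BeautifulSoup is highly tolerant of malformed HTML (tag soup). For most small and medium-sized scraping projects, Requests + BeautifulSoup remains a go-to pairing for development speed.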
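
To make the Requests + BeautifulSoup pairing concrete, here is a minimal sketch that separates fetching from parsing. The function names fetch_html and extract_links are hypothetical, and the timeout value is an arbitrary assumption rather than a recommendation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page with requests (hypothetical helper for this sketch)."""
    # Imported here so the parsing helper below works without requests installed.
    import requests

    response = requests.get(url, timeout=10)  # 10 s is an arbitrary choice
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.text


def extract_links(html: str, base_url: str) -> list[tuple[str, str]]:
    """Return (link text, absolute URL) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in soup.find_all("a", href=True)  # skip anchors without href
    ]


# Parsing can be exercised on its own, without any network access.
sample = '<a href="/elsie">Elsie</a><a href="lacie">Lacie</a><a>no href</a>'
print(extract_links(sample, "http://example.com/stories/"))
```

For a live page the two helpers would be chained, for example extract_links(fetch_html(url), url) with a URL of your choosing. Keeping the network call separate from the parsing step makes the parsing logic easy to test against saved HTML.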

Author: 阿文

Article link: https://www.awen.me/post/930a1f5e.html
