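Beautiful Soup (the beautifulsoup4 package, imported as bs4) is a Python library for extracting data from HTML and XML files. It offers a simple, Pythonic way to navigate, search, and modify the parse tree, which makes it a common first choice of parsing library for beginners learning web scraping.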
Installation
The official documentation recommends pairing it with the lxml parser for speed:
pip install beautifulsoup4 lxml
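Since lxml is an optional dependency, a script may run on a machine where only beautifulsoup4 is installed. The following is a minimal sketch of one way to handle that case: it tries lxml first and falls back to Python's built-in html.parser. The helper name make_soup is hypothetical and not part of bs4; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, otherwise with the stdlib parser.

    make_soup is a hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)  # world
```

The two parsers can build slightly different trees for malformed markup, so it is usually safer to name one parser explicitly in production code and treat a fallback like this as a convenience for quick experiments.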
Core usage
Suppose we need to parse a short piece of HTML:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
1. Create the Soup object
from bs4 import BeautifulSoup
# Parse the document with the lxml parser
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.string)  # Output: The Dormouse's story
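To make it clearer what the soup object gives you, here is a small self-contained sketch. It assumes a shortened stand-in for html_doc (the variable sample is not from the original) and uses the built-in html.parser so that it runs without lxml. It shows that dotted access such as soup.title returns a Tag object with a name, a parent, and a string, and that dotted access only ever returns the first match.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for html_doc so the snippet runs on its own.
sample = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

# "html.parser" is the stdlib parser; swap in "lxml" if it is installed.
soup = BeautifulSoup(sample, "html.parser")

title = soup.title          # first <title> tag, as a Tag object
print(title.name)           # title
print(title.parent.name)    # head
print(title.string)         # The Dormouse's story

# Dotted access returns only the first match; find_all returns every match.
print(soup.a["id"])         # link1
links = [a["href"] for a in soup.find_all("a")]
print(links)
```

Attribute values are read with dictionary-style indexing on the tag, as in a["href"] above.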