Python Web Scraping, Lesson 3: Parsing HTML and Extracting Data with BeautifulSoup

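🌸 Welcome to the Python Office Automation column: solve office tasks with Python and free up your hands

💻 Personal homepage: visitors welcome

😸 GitHub homepage: visitors welcome

❓ Zhihu homepage: visitors welcome

🏳️‍🌈 CSDN blog homepage: 一晌小贪欢's blog, follows appreciated

👍 Column for this series: Python Office Automation, subscriptions appreciated

🕷 There is also a scraping column: Python Web Scraping Basics, subscriptions appreciated

📕 There is also a Python basics column: Python Fundamentals, subscriptions appreciated

The author's skills and experience are limited. If you find a mistake in this article, corrections are welcome 🙏

❤️ Follows are welcome! ❤️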

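Course objectives

Master the basic usage of the BeautifulSoup library
Learn to locate HTML elements with the various selectors
Understand the tree structure of an HTML document
Master techniques for extracting data from complex HTML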

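1. Introduction to BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents. It builds a parse tree from the markup, which you can then search and navigate to extract data.

1.1 Installing BeautifulSoup

pip install beautifulsoup4
pip install lxml  # recommended parser

1.2 Basic usage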

from bs4 import BeautifulSoup
import requests

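# Fetch the page content (the timeout keeps the request from hanging forever)
response = requests.get('https://example.com', timeout=10)
response.raise_for_status()
html_content = response.text

# Create a BeautifulSoup object with Python's built-in parser
soup = BeautifulSoup(html_content, 'html.parser')
# Or use the lxml parser (faster, requires `pip install lxml`)
soup = BeautifulSoup(html_content, 'lxml')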
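
The same steps can be tried without any network access. The following is a minimal sketch that parses a literal string instead of a downloaded page; the markup in `html_doc` is made up for illustration and is not taken from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
html_doc = (
    "<html><head><title>Page title</title></head>"
    "<body><p class='content'>Hello</p></body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")

page_title = soup.title.string   # text inside <title>
first_p = soup.p                 # shortcut for the first <p> tag
print(page_title)
print(first_p.get_text(), first_p["class"])
print(soup.prettify())           # indented view of the parse tree
```

Attribute-style access such as `soup.title` and `soup.p` returns the first matching tag, which is handy for quick checks in an interactive session.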

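2. Understanding HTML document structure

2.1 Basic HTML structure

<!DOCTYPE html>
<html>
<head>
    <title>Page title</title>
    <meta charset="UTF-8">
</head>
<body>
    <div class="container">
        <h1 id="main-title">Main heading</h1>
        <p class="content">Paragraph text</p>
        <ul>
            <li>List item 1</li>
            <li>List item 2</li>
        </ul>
    </div>
</body>
</html>

2.2 The DOM tree

An HTML document can be viewed as a tree: every HTML tag is a node, the tags nested inside it are its children, and the text inside a tag is a leaf node.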
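
To make the tree idea concrete, here is a small sketch that walks a parsed document and prints one indented line per tag. The helper name `outline` and the sample markup are assumptions made for this illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<html>
<head><title>Page title</title></head>
<body>
  <div class="container">
    <h1 id="main-title">Main heading</h1>
    <ul><li>Item 1</li><li>Item 2</li></ul>
  </div>
</body>
</html>
"""

def outline(node, depth=0):
    """Return one indented line per tag, visiting the tree depth-first."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes such as whitespace
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines

soup = BeautifulSoup(html, "html.parser")
tree_lines = outline(soup)
print("\n".join(tree_lines))
```

Each extra level of indentation in the printed outline corresponds to one more level of nesting in the document.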

3. Basic search methods

3.1 Searching by tag name

from bs4 import BeautifulSoup

html = """
<html>
<body>
    <h1>Heading 1</h1>
    <h1>Heading 2</h1>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the first h1 tag
first_h1 = soup.find('h1')
print(first_h1.text)  # Output: Heading 1

# Find all h1 tags
all_h1 = soup.find_all('h1')
for h1 in all_h1:
    print(h1.text)

# Find all p tags
all_p = soup.find_all('p')
print(len(all_p))  # Output: 2
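
Two details are worth seeing in code: `find` returns `None` when nothing matches, and `find_all` accepts a list of tag names and a `limit`. The sketch below uses made-up markup to show both.

```python
from bs4 import BeautifulSoup

html = "<body><h1>Heading 1</h1><h2>Sub</h2><p>Para 1</p><p>Para 2</p></body>"
soup = BeautifulSoup(html, "html.parser")

# find() returns None when there is no match, so guard before using the result
missing = soup.find("table")
table_text = missing.get_text() if missing is not None else ""

# A list of names matches any of them, in document order
heading_names = [tag.name for tag in soup.find_all(["h1", "h2"])]

# limit stops the search after the given number of matches
first_p_only = soup.find_all("p", limit=1)

print(table_text, heading_names, first_p_only)
```

Calling `.text` directly on a missing element raises `AttributeError`, which is one of the most common errors in scraping scripts.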

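3.2 Searching by attribute

from bs4 import BeautifulSoup

html = """
<div class="container">
    <p class="intro">Intro paragraph</p>
    <p class="content">Content paragraph</p>
    <p id="special">Special paragraph</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Search by class (class_ has a trailing underscore because class is a Python keyword)
intro = soup.find('p', class_='intro')
print(intro.text)

# Search by id
special = soup.find('p', id='special')
print(special.text)

# Search with an attrs dictionary (it can hold several attributes at once)
content = soup.find('p', attrs={'class': 'content'})
print(content.text)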
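
Elements often carry several classes or custom `data-*` attributes. The following sketch, built on invented markup, shows how attribute searches behave in those cases.

```python
from bs4 import BeautifulSoup

html = """
<div class="card featured" data-id="42">A</div>
<div class="card" data-id="43">B</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class of a multi-class element
cards = soup.find_all("div", class_="card")
featured = soup.find("div", class_="featured")

# data-* names are not valid keyword arguments, so pass them through attrs
by_data = soup.find("div", attrs={"data-id": "43"})

# Requiring several classes at once is easiest with a CSS selector
both_classes = soup.select("div.card.featured")

print(len(cards), featured["class"], by_data.get_text(), len(both_classes))
```

Note that the `class` attribute comes back as a list of strings, because an element can have more than one class.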

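3.3 Searching by text content

import re
from bs4 import BeautifulSoup

html = '<nav><a href="/">Home</a> <a href="/contact">Contact us</a></nav>'
soup = BeautifulSoup(html, 'html.parser')

# Find a tag whose text is exactly the given string
link = soup.find('a', string='Home')

# Use a regular expression to match the text
pattern = re.compile(r'Contact.*')
contact_link = soup.find('a', string=pattern)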
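
The `string` argument compares against a tag's `.string`, which is `None` when the tag holds more than one child node. The sketch below, using invented links, shows that pitfall and one possible workaround based on `get_text()`.

```python
import re
from bs4 import BeautifulSoup

html = """
<nav>
  <a href="/">Home</a>
  <a href="/contact">Contact us</a>
  <a href="/about"><b>About</b> the team</a>
</nav>
"""
soup = BeautifulSoup(html, "html.parser")

exact = soup.find("a", string="Home")
partial = soup.find("a", string=re.compile(r"^Contact"))

# The third link has two children (<b> and a text node), so its .string
# is None and the string= filter cannot match it
mixed = soup.find("a", string=re.compile("About"))

# Workaround: pass a function that tests the combined text instead
by_text = soup.find(lambda tag: tag.name == "a" and "About" in tag.get_text())

print(exact["href"], partial["href"], mixed, by_text["href"])
```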

4. CSS selectors

4.1 Basic CSS selectors

from bs4 import BeautifulSoup

html = """
<div class="container">
    <h1 id="title">Main heading</h1>
    <div class="content">
        <p class="text">Paragraph 1</p>
        <p class="text highlight">Paragraph 2</p>
        <ul>
            <li>Item 1</li>
            <li class="special">Item 2</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Tag selector
titles = soup.select('h1')

# ID selector (select_one returns the first match, or None)
title = soup.select_one('#title')

# Class selector
texts = soup.select('.text')

# Attribute selector (exact match on the whole attribute value)
special_items = soup.select('[class="special"]')

# Descendant selector
content_paragraphs = soup.select('.content p')

# Child selector (direct children only)
direct_children = soup.select('.container > .content')

# Multiple-class selector (the element must have both classes)
highlighted = soup.select('.text.highlight')
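
`select` always returns a list, while `select_one` returns a single tag or `None`. A short sketch on sample markup shows the difference and how to guard against a missing element.

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
    <h1 id="title">Main heading</h1>
    <div class="content">
        <p class="text">Paragraph 1</p>
        <p class="text highlight">Paragraph 2</p>
    </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one("#title")              # a Tag
missing = soup.select_one("#does-not-exist")   # None, not an exception
missing_list = soup.select("#does-not-exist")  # an empty list

paragraph_texts = [p.get_text() for p in soup.select("div.content > p.text")]
title_text = title.get_text() if title is not None else ""

print(title_text, missing, missing_list, paragraph_texts)
```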

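4.2 Advanced CSS selectors

# These examples reuse the `soup` object from section 4.1

# Pseudo-class selectors
first_li = soup.select('li:first-child')
last_li = soup.select('li:last-child')
nth_li = soup.select('li:nth-child(2)')

# Attribute value contains a substring
partial_class = soup.select('[class*="tex"]')

# Attribute value starts with a prefix
starts_with = soup.select('[class^="con"]')

# Attribute value ends with a suffix
ends_with = soup.select('[class$="ent"]')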
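
The prefix, suffix, and substring operators compare against the whole attribute value rather than each class separately. The following sketch, on invented markup, shows what a few of these selectors return.

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
  <div class="content">
    <p class="text">Paragraph 1</p>
    <p class="text highlight">Paragraph 2</p>
    <ul><li>Item 1</li><li class="special">Item 2</li><li>Item 3</li></ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("li:first-child").get_text()
second = soup.select_one("li:nth-child(2)").get_text()
not_special = [li.get_text() for li in soup.select("li:not(.special)")]

# ^= tests the start of the full attribute string
prefix_matches = [" ".join(tag["class"]) for tag in soup.select('[class^="con"]')]

# "text highlight" does not start with "high", so ^= finds nothing here;
# ~= matches one whitespace-separated word of the attribute value
prefix_miss = soup.select('[class^="high"]')
word_match = soup.select('[class~="highlight"]')

print(first, second, not_special, prefix_matches, prefix_miss, len(word_match))
```

For class matching, the plain `.highlight` form is usually the clearest choice; the attribute operators are more useful for values such as `href` or `src`.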

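5. Navigating the document tree

5.1 Parent and child navigation

from bs4 import BeautifulSoup

html = """
<div class="parent">
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <div class="child">
        <span>Child element</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Get the parent element
span = soup.find('span')
parent_div = span.parent
print(parent_div['class'])  # ['child']

# Walk up through all ancestors (the last one is the soup object itself, named '[document]')
parents = span.parents
for parent in parents:
    if parent.name:
        print(parent.name)

# Get the direct children
parent_div = soup.find('div', class_='parent')
children = parent_div.children  # an iterator, not a list
for child in children:
    if child.name:  # skip text nodes
        print(child.name)

# Get all descendants (an iterator over nested tags and text nodes)
descendants = parent_div.descendants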
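
The difference between children, descendants, and ancestors is easiest to see side by side. This sketch reuses the same shape of markup and keeps only tags, filtering out text nodes with `isinstance`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<div class="parent">
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <div class="child">
        <span>Child element</span>
    </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
parent_div = soup.find("div", class_="parent")

# Direct children only
child_tags = [c.name for c in parent_div.children if isinstance(c, Tag)]

# Everything nested below, at any depth
descendant_tags = [d.name for d in parent_div.descendants if isinstance(d, Tag)]

# Ancestors of the <span>, from nearest to the document root
ancestor_names = [p.name for p in soup.find("span").parents]

print(child_tags, descendant_tags, ancestor_names)
```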

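5.2 Sibling navigation

# These examples reuse the `soup` object from section 5.1

# next_sibling returns the next node, which here is the whitespace
# text node between the two <p> tags, not the second <p> itself
first_p = soup.find('p')
next_sibling = first_p.next_sibling
print(repr(next_sibling))

# Get the next and previous sibling tags, skipping text nodes
second_p = first_p.find_next_sibling('p')
prev_sibling = second_p.find_previous_sibling('p')

# Iterate over all following siblings
next_siblings = first_p.next_siblings
for sibling in next_siblings:
    if sibling.name:  # skip text nodes
        print(sibling.name)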
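
Because indentation in the source HTML becomes text nodes, `next_sibling` frequently returns whitespace rather than a tag. The sketch below contrasts it with `find_next_sibling` on a small made-up fragment.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """<div class="parent">
    <p>First paragraph</p>
    <p>Second paragraph</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("p")

raw_next = first_p.next_sibling             # whitespace text node
tag_next = first_p.find_next_sibling("p")   # skips text nodes
tag_prev = tag_next.find_previous_sibling("p")

print(repr(raw_next))
print(tag_next.get_text())
print(tag_prev is first_p)
```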

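6. Extracting data

6.1 Getting text content

from bs4 import BeautifulSoup

html = """
<div class="article">
    <h1>Article title</h1>
    <p>This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
    <div class="meta">
        <span>Author: Zhang San</span>
        <span>Date: 2024-01-01</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Get the text of a single element
title = soup.find('h1').text
print(title)  # Article title

# Get all text of an element and its descendants
article = soup.find('div', class_='article')
all_text = article.get_text()
print(all_text)

# Join the text pieces with a separator and strip surrounding whitespace
clean_text = article.get_text(separator=' | ', strip=True)
print(clean_text)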
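
When the individual text pieces are needed rather than one joined string, `stripped_strings` is a convenient alternative. This sketch also turns the "label: value" spans into a dictionary; the markup and the `": "` separator are assumptions about the page layout.

```python
from bs4 import BeautifulSoup

html = """
<div class="article">
    <h1>Article title</h1>
    <p>First paragraph.</p>
    <div class="meta">
        <span>Author: Zhang San</span>
        <span>Date: 2024-01-01</span>
    </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
article = soup.find("div", class_="article")

# stripped_strings yields each text node, trimmed, skipping blank ones
parts = list(article.stripped_strings)

# Split "label: value" spans into a dictionary
meta = dict(
    span.get_text(strip=True).split(": ", 1)
    for span in article.select(".meta span")
)

print(parts)
print(meta)
```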

6.2 Getting attribute values

from bs4 import BeautifulSoup

html = """
<div class="container">
    <a href="https://example.com" title="Example site" target="_blank">Link</a>
    <img src="image.jpg" alt="Image description" width="300" height="200">
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Get a single attribute
link = soup.find('a')
href = link.get('href')  # link['href'] also works but raises KeyError if the attribute is missing
title = link.get('title')

# Get all attributes as a dictionary
img = soup.find('img')
attrs = img.attrs
print(attrs)  # {'src': 'image.jpg', 'alt': 'Image description', 'width': '300', 'height': '200'}

# Check whether an attribute exists
if link.has_attr('target'):
    print("The link has a target attribute")
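
Links on real pages are often relative or missing altogether. The sketch below shows one way to handle both cases with `get` defaults and `urllib.parse.urljoin`; `base_url` is a placeholder, not a real endpoint.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<div><a class="btn primary" href="/news/1.html">Read</a><a>No link</a></div>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

base_url = "https://example.com/list/"  # hypothetical page the HTML came from

# Resolve relative hrefs against the page URL, skipping anchors without href
absolute_hrefs = [urljoin(base_url, a["href"]) for a in links if a.has_attr("href")]

# get() accepts a default for missing attributes
fallback = links[1].get("href", "#")

# Multi-valued attributes such as class are returned as lists
classes = links[0]["class"]

print(absolute_hrefs, fallback, classes)
```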

7. Case study: scraping a news list

import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime

class NewsSpider:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
  
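    def get_news_list(self, url):
        """Download the list page and return the parsed news items."""
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')
            return self.parse_news_list(soup)

        except requests.RequestException as e:
            # Network errors, timeouts, and HTTP error statuses end up here
            print(f"Failed to fetch the news list: {e}")
            return []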
  
    def parse_news_list(self, soup):
        """Parse the news list out of the page."""
        news_list = []

        # Assumes the news list sits inside a div with class 'news-list'
        news_container = soup.find('div', class_='news-list')
        if not news_container:
            return news_list

        # Find every news item
        news_items = news_container.find_all('div', class_='news-item')

        for item in news_items:
            news_data = self.extract_news_data(item)
            if news_data:
                news_list.append(news_data)

        return news_list
  
    def extract_news_data(self, item):
        """Extract the fields of a single news item."""
        try:
            # Title and link (an item without a title link is skipped)
            title_link = item.find('a', class_='title')
            if not title_link:
                return None

            title = title_link.get_text(strip=True)
            link = title_link.get('href')

            # Summary (optional)
            summary_elem = item.find('p', class_='summary')
            summary = summary_elem.get_text(strip=True) if summary_elem else ''

            # Publication time (optional)
            time_elem = item.find('span', class_='time')
            publish_time = time_elem.get_text(strip=True) if time_elem else ''

            # Author (optional)
            author_elem = item.find('span', class_='author')
            author = author_elem.get_text(strip=True) if author_elem else ''

            return {
                'title': title,
                'link': link,
                'summary': summary,
                'publish_time': publish_time,
                'author': author,
                'crawl_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            }

        except Exception as e:
            print(f"Failed to extract news data: {e}")
            return None
  
    def save_to_csv(self, news_list, filename='news.csv'):
        """Save the data to a CSV file."""
        if not news_list:
            print("No data to save")
            return

        with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['title', 'link', 'summary', 'publish_time', 'author', 'crawl_time']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

            writer.writeheader()
            for news in news_list:
                writer.writerow(news)

        print(f"Data saved to {filename}")

# Example usage
if __name__ == "__main__":
    spider = NewsSpider()
    news_list = spider.get_news_list('https://example-news-site.com')
    spider.save_to_csv(news_list)
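
Since `https://example-news-site.com` is only a placeholder, the parsing logic can be checked offline against a small HTML sample. The following sketch applies the same class names (`news-list`, `news-item`, `title`, `summary`, `author`) with CSS selectors; the helper names `text_of` and `parse_items` and the sample markup are invented for this illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://example-news-site.com"  # placeholder from the example above

SAMPLE_HTML = """
<div class="news-list">
  <div class="news-item">
    <a class="title" href="/a/1">First headline</a>
    <p class="summary">Short summary.</p>
    <span class="author">Zhang San</span>
  </div>
  <div class="news-item">
    <a class="title" href="/a/2">Second headline</a>
  </div>
  <div class="news-item"><p class="summary">No title here</p></div>
</div>
"""

def text_of(parent, selector):
    """Return the stripped text of the first match, or '' if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else ""

def parse_items(html, base_url=BASE_URL):
    """Parse news items from an HTML string into a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for item in soup.select("div.news-list div.news-item"):
        title_link = item.select_one("a.title")
        if title_link is None:  # skip items without a title link
            continue
        items.append({
            "title": title_link.get_text(strip=True),
            "link": urljoin(base_url, title_link.get("href", "")),
            "summary": text_of(item, "p.summary"),
            "author": text_of(item, "span.author"),
        })
    return items

items = parse_items(SAMPLE_HTML)
for entry in items:
    print(entry)
```

Testing the parser on a saved sample like this makes it possible to adjust selectors without sending repeated requests to the target site.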

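8. Handling special cases

8.1 Handling encoding problems

import requests
from bs4 import BeautifulSoup
import chardet  # third-party package: pip install chardet

def get_soup_with_encoding(url):
    """Detect the page encoding and build a soup object."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Detect the encoding from the raw bytes; chardet may return None
    detected = chardet.detect(response.content)
    encoding = detected['encoding'] or 'utf-8'

    # Decode with the detected encoding, replacing undecodable bytes
    html = response.content.decode(encoding, errors='replace')

    return BeautifulSoup(html, 'html.parser')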
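
If installing a detection library is not desirable, two other approaches are possible: trying a short list of likely encodings, or handing the raw bytes to BeautifulSoup together with `from_encoding`. The sketch below shows both on GBK-encoded sample bytes; the helper `decode_html` and the candidate list are assumptions suited to Chinese-language pages, not a general rule.

```python
from bs4 import BeautifulSoup

# Sample bytes standing in for response.content of a GBK-encoded page
raw = "<html><body><p>中文内容</p></body></html>".encode("gbk")

def decode_html(raw_bytes, candidates=("utf-8", "gb18030")):
    """Try each candidate encoding in order; fall back to lossy UTF-8."""
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw_bytes.decode("utf-8", errors="replace"), "utf-8"

text, used_encoding = decode_html(raw)
soup_from_text = BeautifulSoup(text, "html.parser")

# Alternatively, pass bytes and tell BeautifulSoup which encoding to use
soup_from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")

print(used_encoding)
print(soup_from_text.p.get_text())
print(soup_from_bytes.p.get_text())
```

GB18030 is a superset of GBK and GB2312, which is why it is used as the fallback candidate here.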

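8.2 Handling JavaScript-generated content

# BeautifulSoup only parses the static HTML it is given; it does not run JavaScript.
# Content generated by JavaScript requires a browser automation tool such as Selenium.
# The function below is a rough heuristic for spotting such pages.

def check_js_content(soup):
    """Heuristically check whether inline scripts write content into the page."""
    scripts = soup.find_all('script')

    for script in scripts:
        if script.string and 'document.write' in script.string:
            print("The page contains JavaScript-generated content")
            return True

    return False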
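
Some pages that render with JavaScript still ship their data inside a `<script>` tag as JSON, in which case no browser is needed. The following sketch assumes such a tag exists with the hypothetical id `initial-data`; real sites use their own ids and structures, so inspect the page source first.

```python
import json
from bs4 import BeautifulSoup

html = """
<html><body>
<div id="app"></div>
<script id="initial-data" type="application/json">
{"items": [{"title": "First"}, {"title": "Second"}]}
</script>
<script>document.getElementById("app").textContent = "rendered later";</script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# The container is empty because BeautifulSoup never runs the script
app_is_empty = soup.select_one("#app").get_text(strip=True) == ""

# The embedded JSON, however, is plain text that can be parsed directly
data_tag = soup.find("script", id="initial-data")
data = json.loads(data_tag.string) if data_tag is not None else {"items": []}
titles = [item["title"] for item in data["items"]]

print(app_is_empty, titles)
```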

8.3 Handling table data

def parse_table(soup, table_selector):
    """Parse an HTML table into a list of dicts keyed by the header row."""
    table = soup.select_one(table_selector)
    if not table:
        return []
  
    rows = []
  
    # Read the header cells from the first row
    headers = []
    header_row = table.find('tr')
    if header_row:
        for th in header_row.find_all(['th', 'td']):
            headers.append(th.get_text(strip=True))
  
    # Read the data rows
    for row in table.find_all('tr')[1:]:  # skip the header row
        row_data = {}
        cells = row.find_all(['td', 'th'])
    
        for i, cell in enumerate(cells):
            if i < len(headers):
                row_data[headers[i]] = cell.get_text(strip=True)
    
        if row_data:
            rows.append(row_data)
  
    return rows
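
To see the table approach end to end, here is a compact, self-contained sketch that reads a small made-up table into dictionaries and writes it as CSV to an in-memory buffer. It is a simplified variant that assumes the first row is the header and that there are no merged cells.

```python
import csv
import io
from bs4 import BeautifulSoup

html = """
<table id="prices">
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Pen</td><td>1.50</td></tr>
  <tr><td>Book</td><td>12.00</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("#prices")

# One list of cell texts per row; the first row is treated as the header
all_rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
header_cells, *body_rows = all_rows
records = [dict(zip(header_cells, row)) for row in body_rows]

# Write to an in-memory buffer; swap in open(..., newline='') for a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header_cells)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(records)
print(csv_text)
```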

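9. Performance tips

9.1 Choosing a suitable parser

# Parser characteristics:
# html.parser: built into Python, no extra install, reasonably lenient, moderate speed
# lxml: fast and feature-rich; requires `pip install lxml`
# xml: parses XML only, fast; also requires lxml
# html5lib: the most lenient parser, but slow

# lxml is the recommended choice
soup = BeautifulSoup(html, 'lxml')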

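9.2 Limiting what gets parsed

# Parse only the parts you need (`html` is the markup string to parse)
# Note: parse_only is not supported by the html5lib parser
from bs4 import BeautifulSoup, SoupStrainer

# Parse only div tags
parse_only = SoupStrainer("div")
soup = BeautifulSoup(html, "lxml", parse_only=parse_only)

# Parse only elements with a specific class
parse_only = SoupStrainer("div", class_="content")
soup = BeautifulSoup(html, "lxml", parse_only=parse_only)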
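
The effect of `parse_only` can be checked on a small document. This sketch uses the built-in `html.parser` so that it runs without lxml; the markup is invented for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<html><body>
<div class="content"><p>Keep me</p></div>
<div class="ads"><p>Skip me</p></div>
<span>Also skipped</span>
</body></html>
"""

# Only <div class="content"> elements and their contents enter the tree
only_content = SoupStrainer("div", class_="content")
soup = BeautifulSoup(html, "html.parser", parse_only=only_content)

paragraph_texts = [p.get_text() for p in soup.find_all("p")]
span = soup.find("span")

print(paragraph_texts, span)
```

Everything outside the matching elements is dropped during parsing, so later searches run on a much smaller tree.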

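10. Practice exercises

Exercise 1: Scrape product information

Write a program that scrapes the product list of an e-commerce site and extracts the product name, price, rating, and similar fields.

Exercise 2: Parse forum posts

Scrape the post list of a forum and extract the title, author, reply count, and publication time.

Exercise 3: Extract table data

Extract structured data from a web page that contains a table and save it in CSV format.

11. Lesson summary

In this lesson we covered:

Basic usage of the BeautifulSoup library
The tree structure of an HTML document
The various element search methods
Using CSS selectors
Navigating the document tree
Data extraction techniques
A case study and handling of special cases

12. Next lesson preview

In the next lesson we will cover:

Using XPath expressions
Advanced features of the lxml library
Processing XML documents
More complex data extraction scenarios

13. Homework

Use BeautifulSoup to scrape the article list of a news site
Practice locating elements with the various CSS selectors
Write a general-purpose table data extractor
Handle web pages that contain special characters and non-default encodings

Tip: BeautifulSoup is a core tool for extracting data from web pages. The key is to become fluent with its selectors and navigation methods.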

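I hope this helps beginners. I am a small-time programmer dedicated to office automation.

A free follow would be much appreciated ❤️ Thank you!

Please 🤞 follow 🤞 + ❤️ like ❤️ + 👍 bookmark 👍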

