Six Common Python Web Scraping Methods | 极客日志

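# Method 1: requests + BeautifulSoup for static pages
import requests
from bs4 import BeautifulSoup

# Send the HTTP request; fail fast on timeouts and HTTP error statuses
url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the page title (soup.title is None if the page has no <title>)
title = soup.title.get_text(strip=True) if soup.title else ''
print(f'Page title: {title}')

# Extract the href of every link
for link in soup.find_all('a'):
    print(link.get('href'))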
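
The href values printed above are often relative paths, and link.get('href') returns None for anchors without an href. The following is a minimal sketch of one way to normalize them using only the standard library; the helper name absolute_links and the sample hrefs are hypothetical and not part of the original article.

```python
from urllib.parse import urljoin, urlparse

def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep only http(s) links."""
    resolved = []
    for href in hrefs:
        if not href:
            # Skip anchors without an href (None) and empty strings
            continue
        full = urljoin(base_url, href)
        if urlparse(full).scheme in ('http', 'https'):
            resolved.append(full)
    return resolved

# Hypothetical hrefs, as they might come out of soup.find_all('a')
sample_hrefs = ['/about', 'news/today.html', 'https://other.example/x',
                None, 'mailto:me@example.com']
links = absolute_links('https://example.com/blog/', sample_hrefs)
for link in links:
    print(link)
```

Filtering on the scheme drops mailto: and similar links, which is usually what a crawler wants before queueing URLs to fetch.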

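# Method 2: requests + regular expressions
import re
import requests

# Send the HTTP request
url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()

# Extract email addresses; every dot in the domain must be followed by a
# label, so a sentence-ending period is not captured as part of the address
emails = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+', response.text)
print(emails)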
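
To see how such a pattern behaves without making a network request, it can be run against an in-memory string. This is a sketch under stated assumptions: the sample text is made up, and the pattern is a slightly stricter variant that requires each dot in the domain to be followed by a label. It is a pragmatic approximation, not a full validator for email addresses.

```python
import re

# Pragmatic email pattern: local part, "@", then dot-separated domain labels
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+')

# Made-up sample text with a duplicate and a sentence-ending period
sample = ('Contact alice@example.com or bob.smith@mail.example.org. '
          'Duplicate: alice@example.com')

found = EMAIL_RE.findall(sample)

# Remove duplicates while keeping first-seen order
unique = list(dict.fromkeys(found))
print(unique)
```

Pages often repeat the same address in the header and footer, so deduplicating the matches is usually worthwhile.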

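# Method 3: Selenium for dynamically rendered pages
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Start the browser (requires a local Chrome installation)
driver = webdriver.Chrome()

try:
    # Open the page
    url = 'https://example.com'
    driver.get(url)

    # Wait up to 10 seconds for the dynamically loaded element to appear
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.dynamic-content'))
    )
    print(element.text)
finally:
    # Always close the browser, even if the lookup fails
    driver.quit()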
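
Dynamically loaded content may not exist yet at the moment the page finishes loading, so a lookup is usually paired with a wait that polls until the content appears. The sketch below illustrates that polling idea with the standard library only. It is not Selenium's implementation, and wait_until and ready are hypothetical names introduced for illustration.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)

# Simulated page state: the content becomes available on the third check
calls = {'n': 0}

def ready():
    calls['n'] += 1
    return 'loaded' if calls['n'] >= 3 else None

status = wait_until(ready, timeout=2, poll_interval=0.01)
print(status)
```

With a real browser, the condition would be a function that tries to locate the element and returns it once it is present.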

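# Method 4: Scrapy project

# Install Scrapy (shell)
pip install scrapy

# Create a new project (shell)
scrapy startproject myproject

# Spider definition; save it inside the project's spiders/ directory
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # the name passed to "scrapy crawl"
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract the page title with a CSS selector
        title = response.css('title::text').get()
        yield {'title': title}

# Run the spider and write the scraped items to output.json (shell);
# -o appends if the file already exists
scrapy crawl myspider -o output.json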

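# Method 5: PyQuery (jQuery-style selectors)
import requests
from pyquery import PyQuery as pq

# Send the HTTP request
url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML
doc = pq(response.text)

# Extract the page title
title = doc('title').text()
print(f'Page title: {title}')

# Extract the href of every link
links = doc('a')
for link in links.items():
    print(link.attr('href'))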

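# Method 6: calling an API directly
import requests

# API endpoint
url = 'https://api.example.com/data'

# Send the request; params is encoded into the URL query string
params = {'key': 'your_api_key', 'q': 'search_query'}
response = requests.get(url, params=params, timeout=10)
response.raise_for_status()

# Parse the JSON response body
data = response.json()
print(data)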
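
The params dictionary ends up in the URL as a query string. The following sketch reproduces that encoding with the standard library to show what the final URL looks like. build_url is a hypothetical helper, the endpoint and key are the placeholders from the example above, and the query value contains a space here to show how it is escaped.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_url(base_url, params):
    """Append params to base_url as a URL-encoded query string."""
    return f'{base_url}?{urlencode(params)}'

request_url = build_url('https://api.example.com/data',
                        {'key': 'your_api_key', 'q': 'search query'})
print(request_url)

# Decoding the query string recovers the original values
decoded = parse_qs(urlsplit(request_url).query)
print(decoded)
```

Passing a dictionary rather than concatenating strings by hand avoids escaping mistakes with spaces and special characters.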

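Method	Use case	Strengths	Weaknesses
requests + BeautifulSoup	Scraping static pages	Simple and easy to use	Cannot handle content rendered by JavaScript
requests + regular expressions	Extracting data with a specific format	Flexible	Patterns are hard to write and maintain
Selenium	Scraping dynamic pages	Handles dynamically loaded content	Slow and resource-intensive
Scrapy	Large-scale crawling	Full-featured framework; distributed crawling is possible with extensions such as scrapy-redis	Steeper learning curve
PyQuery	HTML parsing for developers familiar with jQuery syntax	Concise syntax	Relatively limited feature set
API	Sites that provide an API	Efficient and stable	Requires API access (a key or permissions)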

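1. Scraping static pages with requests + BeautifulSoup

Example code: the first listing above.

Use case: scraping static web pages.

2. Extracting data with requests + regular expressions

Example code: the second listing above.

Use case: extracting data with a specific format, such as email addresses.

3. Scraping dynamic pages with Selenium

Example code: the third listing above.

Use case: scraping pages whose content is loaded dynamically.

4. Building a crawler project with Scrapy

Example code: the fourth listing above (install, create the project, define the spider, run it).

Use case: large-scale crawling.

5. Parsing HTML with PyQuery

Example code: the fifth listing above.

Use case: HTML parsing for developers familiar with jQuery syntax.

6. Fetching data through an API

Example code: the sixth listing above.

Use case: sites that provide an API.

Summary

The comparison table above lists each method's use case, strengths, and weaknesses.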
