Website Anti-Scraping Mechanisms and Counter-Strategies Explained | 极客日志

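import urllib.request

url = 'http://www.baidu.com'
# Present a common desktop browser User-Agent instead of the default "Python-urllib/x.y"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

req = urllib.request.Request(url, headers=headers)
# A timeout keeps the script from hanging if the server never answers
with urllib.request.urlopen(req, timeout=10) as response:
    html = response.read().decode('utf-8')
print(html[:500])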
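
Referer checks (hotlink protection) can usually be handled the same way as the User-Agent check: send the header the site expects. The following is a minimal sketch, assuming the site only verifies that Referer points at one of its own pages; the helper name `build_request` and both URLs are hypothetical.

```python
import urllib.request


def build_request(url, referer, user_agent='Mozilla/5.0'):
    """Return a Request that carries both a User-Agent and a Referer header."""
    headers = {
        'User-Agent': user_agent,
        'Referer': referer,  # the page a browser would have come from
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical resource and referring page on the same site
req = build_request('https://example.com/images/photo.jpg',
                    'https://example.com/gallery')
print(req.get_header('Referer'))
```

The request object can then be passed to `urllib.request.urlopen` exactly as in the User-Agent sample.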

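import requests

session = requests.Session()
# Log in with POST so the form data is sent as the request body
resp = session.post('https://example.com/login', data={'user': 'admin'})
# Later requests on the same session automatically carry the cookies set at login
resp = session.get('https://example.com/profile')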

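import urllib.request
import random

# Sample proxy addresses from the original; they may no longer be reachable
proxy_list = [
    'http://121.193.143.249:88',
    'http://112.126.65.193:88',
    'http://122.96.59.184:82'
]

# Pick one proxy at random for this run
proxy_url = random.choice(proxy_list)
# With this mapping only http:// URLs are routed through the proxy
proxy_support = urllib.request.ProxyHandler({'http': proxy_url})
opener = urllib.request.build_opener(proxy_support)
# Make urlopen() use this opener for all later calls
urllib.request.install_opener(opener)

try:
    response = urllib.request.urlopen('http://www.whatismyip.com.tw', timeout=10)
    html = response.read().decode('utf-8')
    print(html)
except Exception as e:
    print(f'Error: {e}')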
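
A single random proxy fails whenever that proxy is down. One possible extension, not taken from the original, is to try the pool in random order until one proxy answers. In this sketch `make_opener`, `fetch_via_pool`, and the `opener_factory` parameter are hypothetical names introduced for illustration.

```python
import random
import urllib.request


def make_opener(proxy_url):
    """Build an opener that routes HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_pool(url, proxies, timeout=10, opener_factory=make_opener):
    """Try the proxies in random order and return the first successful body."""
    last_error = None
    for proxy_url in random.sample(proxies, len(proxies)):
        try:
            with opener_factory(proxy_url).open(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            last_error = exc
    raise RuntimeError(f'all proxies failed: {last_error}')
```

Unlike `install_opener`, this keeps each opener local, so one dead proxy does not affect other requests in the same process.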

import time
import random

for i in range(5):
    # Wait a random 1 to 5 seconds so requests do not arrive at a fixed rhythm
    time.sleep(random.uniform(1, 5))
    # Perform the request here
    pass
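
A fixed random pause does not react when the site starts refusing requests. A common complement, not part of the original, is exponential backoff with jitter: wait longer after each consecutive failure. The names `backoff_delay` and `fetch_with_backoff` and the default values below are assumptions chosen for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Return a random wait in seconds whose upper bound doubles per attempt."""
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)


def fetch_with_backoff(fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch() until it succeeds, sleeping longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))
```

Here `fetch` is any zero-argument callable that performs one request and raises on failure, for example a lambda wrapping `urllib.request.urlopen`.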

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

options = Options()
options.add_argument('--headless')  # headless mode: no browser window is shown
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)

try:
    driver.get('https://example.com')
    time.sleep(3)  # crude fixed wait for JavaScript-rendered content to load
    html = driver.page_source
finally:
    driver.quit()  # always release the browser process, even on errors

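I. Anti-Scraping Based on Request Headers

1. User-Agent Spoofing

2. Referer Hotlink Protection

3. Cookie Management

II. Anti-Scraping Based on User Behavior

1. IP Proxy Pool

2. Request Rate Control

3. Account Rotation

III. Anti-Scraping for Dynamic Pages

1. Analyzing Network Requests

2. Selenium Browser Automation

3. Handling CAPTCHAs

IV. Summary and Legal Compliance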