Python OCR Text Recognition: A pytesseract Installation and Configuration Guide

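pytesseract is a Python wrapper for OCR (optical character recognition) that extracts text from images. It does not perform recognition itself: the Tesseract OCR engine must be installed first. This guide covers the setup on Windows.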

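Version Requirements

pytesseract depends on the Tesseract OCR engine:

Component	Recommended version or name	Python version	Notes
pytesseract	0.3.10	3.7+	Python wrapper library
Tesseract-OCR	5.x	-	OCR engine
Chinese language pack	chi_sim	-	Simplified Chinese recognition (optional)
English language pack	eng	-	English recognition (included by default)

Note: pytesseract is only a wrapper. It cannot work until the Tesseract OCR engine is installed.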

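Problems You May Hit During Installation

Problem 1: Tesseract engine not installed

import pytesseract
pytesseract.image_to_string('test.jpg')
# TesseractNotFoundError: tesseract is not installed or it's not in your PATH

Only the pytesseract package was installed; the Tesseract OCR engine itself is missing.

Problem 2: Path not configured

import pytesseract
pytesseract.image_to_string('test.jpg')
# pytesseract.pytesseract.TesseractNotFoundError

Tesseract is installed, but its directory is not on PATH, so pytesseract cannot find the executable. The path has to be set manually.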
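
Problems 1 and 2 raise the same exception, so it helps to test the engine on its own before involving pytesseract. The sketch below is one possible check, not part of pytesseract: the helper name tesseract_version is made up for illustration, and it relies only on the engine's --version flag and the standard library.

```python
import subprocess


def tesseract_version(command="tesseract"):
    """Return the first line printed by `<command> --version`, or None if it cannot run."""
    try:
        result = subprocess.run(
            [command, "--version"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers "file not found" and "permission denied"
        return None
    # Read whichever stream carries the version banner
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("Tesseract is not installed or not on PATH")
    else:
        print(f"Found: {version}")
```

If this prints a version but pytesseract still fails, the Python process is probably running with a different PATH than the terminal. Passing the full path to tesseract.exe as command tests that specific install.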

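Problem 3: Garbled Chinese output

text = pytesseract.image_to_string('chinese_text.jpg')
print(text)
# Output: garbled text or nothing

The call does not pass lang, so Tesseract applies its default English model to Chinese text. Pass lang='chi_sim', which in turn requires the Chinese language pack chi_sim.traineddata to be installed.

Problem 4: Low recognition accuracy

The result contains many errors. The usual causes are poor image quality and missing preprocessing.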

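Manual Installation

Step 1: Install the Tesseract OCR engine

Download: Tesseract OCR engine download page

Pick the latest release (for example tesseract-ocr-w64-setup-5.3.3.20231005.exe), download it, and run the installer.

During installation:

Check "Additional language data" → select "Chinese - Simplified"
Note the install path (default: C:\Program Files\Tesseract-OCR)

Step 2: Configure the environment variable (optional)

Add Tesseract to the system PATH:

Right-click "This PC" → Properties → Advanced system settings → Environment Variables
Under "System variables", select "Path" and click Edit
Click New and enter: C:\Program Files\Tesseract-OCR
Click OK to save, then open a new terminal so the change takes effect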

Step 3: Install pytesseract

pip install pytesseract pillow

Step 4: Configure the Tesseract path

If Tesseract is not on PATH, set its location in Python code:

import pytesseract
# Full path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
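
A hard-coded path breaks as soon as the script runs on a machine with a different install location. One way around that, sketched here with a hypothetical helper find_tesseract, is to look on PATH first and fall back to the default Windows location only if nothing is found. The fallback list is an assumption; adjust it to the install paths you actually use.

```python
import os
import shutil

# Default location used by the Windows installer
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(command="tesseract", fallbacks=(DEFAULT_WINDOWS_PATH,)):
    """Return a usable path to the Tesseract executable, or None if none is found."""
    # shutil.which searches PATH the same way the shell would
    found = shutil.which(command)
    if found:
        return found
    # Otherwise try each known install location in order
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    print(find_tesseract())
```

The result can then be assigned to pytesseract.pytesseract.tesseract_cmd when it is not None.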

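Step 5: Download the Chinese language pack (if it was not installed)

Download: language pack download page

Download chi_sim.traineddata (Simplified Chinese) and place it in:

C:\Program Files\Tesseract-OCR\tessdata\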
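
To confirm the file landed in the right folder, a script can check for the .traineddata files directly. This is a minimal sketch using only the standard library; missing_languages is an illustrative name, and the check assumes each language code maps to a file named <code>.traineddata in the tessdata folder, as in the step above.

```python
from pathlib import Path


def missing_languages(tessdata_dir, lang):
    """Return the codes in a 'chi_sim+eng' style string that have no .traineddata file."""
    folder = Path(tessdata_dir)
    return [
        code
        for code in lang.split("+")
        if not (folder / f"{code}.traineddata").is_file()
    ]


if __name__ == "__main__":
    tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
    missing = missing_languages(tessdata, "chi_sim+eng")
    print("Missing language packs:", missing or "none")
```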

Verifying the Installation

Basic test

import pytesseract
from PIL import Image

# Set the path (only needed if Tesseract is not on PATH)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Test English recognition
img = Image.open('test_eng.jpg')
text = pytesseract.image_to_string(img)
print(f"English result:\n{text}")

# Test Chinese recognition
img_cn = Image.open('test_chn.jpg')
text_cn = pytesseract.image_to_string(img_cn, lang='chi_sim')
print(f"Chinese result:\n{text_cn}")

Checking available languages

import pytesseract
# List the installed language packs
print(pytesseract.get_languages())
# e.g. ['chi_sim', 'eng', ...]

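Practical Examples

Example 1: ID card recognition

import cv2
import pytesseract

# Read the ID card image.
# On Windows, cv2.imread may return None for non-ASCII paths, so use an ASCII file name.
img = cv2.imread('id_card.jpg')
if img is None:
    raise FileNotFoundError('id_card.jpg could not be read')

# Preprocess to improve recognition
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # grayscale
blur = cv2.GaussianBlur(gray, (5, 5), 0)  # denoise
_, binary = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarize (Otsu)

# OCR
text = pytesseract.image_to_string(binary, lang='chi_sim')
print(f"ID card text:\n{text}")

# Extract specific fields (illustrative).
# The labels stay in Chinese because they are matched against the text printed on the card:
# '姓名' is the name label and '公民身份号码' is the ID number label.
for line in text.split('\n'):
    if '姓名' in line:
        print(f"Name: {line.split('姓名')[-1].strip()}")
    if '公民身份号码' in line:
        print(f"ID number: {line.split('公民身份号码')[-1].strip()}")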
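
Matching on the printed label fails whenever OCR misreads the label itself. A complementary approach, sketched below, is to search the text for anything shaped like an 18-character ID number and keep only candidates whose check character is correct under the MOD 11-2 scheme used for resident ID numbers. The function names are illustrative and the sample string is a made-up OCR result; treat this as a sketch rather than a complete validator (it does not check the region code or birth date).

```python
import re

# Weights for the first 17 digits and the check characters indexed by (sum mod 11)
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def is_valid_id_number(candidate):
    """Check the shape (17 digits plus a digit or X) and the check character."""
    if not re.fullmatch(r"[0-9]{17}[0-9X]", candidate):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(candidate[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == candidate[17]


def find_id_numbers(ocr_text):
    """Return every checksum-valid ID number found in OCR output."""
    # OCR often splits the number with spaces and may read the final X as lowercase
    compact = re.sub(r"[ \t]+", "", ocr_text).upper()
    candidates = re.findall(r"(?<![0-9])[0-9]{17}[0-9X](?![0-9])", compact)
    return [c for c in candidates if is_valid_id_number(c)]


if __name__ == "__main__":
    sample = "公民身份号码 110105 19491231 002x"  # hypothetical OCR output
    print(find_id_numbers(sample))
```

A candidate that fails the checksum usually means a digit was misread, which is a useful signal to retry with different preprocessing.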

Example 2: Extracting text from a screenshot

import pytesseract
from PIL import ImageGrab

# Capture the full screen
screenshot = ImageGrab.grab()
screenshot.save('screenshot.png')

# Recognize the text
text = pytesseract.image_to_string(screenshot, lang='chi_sim+eng')
print(f"Screenshot text:\n{text}")

# Save to a file
with open('extracted_text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
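
Raw OCR output from mixed Chinese and English screenshots tends to need tidying before it is saved: stray spaces can appear between Chinese characters, and empty regions become runs of blank lines. The helper below is a minimal cleanup sketch under those assumptions. The name tidy_ocr_text and the chosen CJK range (the basic block U+4E00 to U+9FFF) are illustrative choices, not anything pytesseract provides.

```python
import re

# Basic CJK Unified Ideographs block; extend if rarer characters matter
_CJK = r"[\u4e00-\u9fff]"


def tidy_ocr_text(text):
    """Remove spaces between CJK characters and collapse runs of blank lines."""
    # Drop spaces or tabs that sit directly between two CJK characters
    text = re.sub(rf"(?<={_CJK})[ \t]+(?={_CJK})", "", text)
    # Strip trailing whitespace from every line
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    print(tidy_ocr_text("你 好 世 界 hello world\n\n\n\nOK  \n"))
```

Spaces between Latin words are left alone, so English sentences keep their word boundaries.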

Example 3: CAPTCHA recognition

import cv2
import numpy as np
import pytesseract

# Read the CAPTCHA image
img = cv2.imread('captcha.jpg')

# Preprocess (the key step for CAPTCHAs)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

# Denoise with a morphological opening
kernel = np.ones((2, 2), np.uint8)
opening = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# Recognize, allowing only digits and letters
custom_config = r'--oem 3 --psm 6 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
text = pytesseract.image_to_string(opening, config=custom_config)
print(f"CAPTCHA: {text.strip()}")
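
Even with a whitelist, the returned string can still carry whitespace or come back with the wrong number of characters. A small post-filter, sketched here with a hypothetical clean_captcha helper, makes that failure explicit so the caller can retry instead of submitting a bad guess. The expected_length parameter is an assumption about the CAPTCHA being solved.

```python
import string

# Same character set as the whitelist in the example above
ALLOWED = set(string.ascii_letters + string.digits)


def clean_captcha(raw, expected_length=None):
    """Keep only ASCII letters and digits; return None if the length is not as expected."""
    cleaned = "".join(ch for ch in raw if ch in ALLOWED)
    if expected_length is not None and len(cleaned) != expected_length:
        return None
    return cleaned


if __name__ == "__main__":
    print(clean_captcha(" a B-3\n"))
    print(clean_captcha("ab1", expected_length=4))
```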

Example 4: Batch PDF-to-text conversion

import pytesseract
from pdf2image import convert_from_path  # pdf2image requires Poppler to be installed

def pdf_to_text(pdf_path, output_txt):
    # Render each PDF page to an image
    images = convert_from_path(pdf_path, dpi=300)
    pages = []
    for i, img in enumerate(images, start=1):
        print(f"Processing page {i}...")
        text = pytesseract.image_to_string(img, lang='chi_sim+eng')
        pages.append(f"\n========== Page {i} ==========\n{text}")
    # Save the result
    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write("".join(pages))
    print(f"Done. Text saved to {output_txt}")

# Usage
# pdf_to_text('scanned.pdf', 'extracted.txt')

Improving Recognition Accuracy

1. Image preprocessing

import cv2

# Read the image
img = cv2.imread('test.jpg')

# Grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Binarize with a fixed threshold
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Denoise
denoised = cv2.fastNlMeansDenoising(gray, None, 10, 7, 21)

# Resize (upscaling small text can improve recognition)
resized = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
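
A fixed threshold of 127 only works when lighting is even. Otsu's method, which the ID card example selects with cv2.THRESH_OTSU, instead picks the threshold that best separates dark and bright pixels in the histogram. The pure-Python sketch below illustrates the idea on a flat list of gray values. It is a teaching aid rather than OpenCV's implementation, and the sample values are invented.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that best splits pixels into <= t and > t."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    weight_low = 0  # number of pixels with value <= t
    sum_low = 0     # sum of those pixel values
    for t in range(256):
        weight_low += hist[t]
        sum_low += t * hist[t]
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # no dark class yet
        if weight_high == 0:
            break     # no bright class left
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Between-class variance, up to a constant factor
        score = weight_low * weight_high * (mean_low - mean_high) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(pixels, threshold):
    """Map values above the threshold to 255 and the rest to 0."""
    return [255 if value > threshold else 0 for value in pixels]


# Hypothetical sample: dark text pixels around 30-40, bright paper around 200-220
sample = [30, 35, 40, 32, 210, 205, 220, 200, 215, 38]
t = otsu_threshold(sample)
print(t, binarize(sample, t))
```

When several thresholds separate the two groups equally well, this sketch returns the lowest one; other implementations may pick a different value inside the same gap.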

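2. OCR parameters

import pytesseract
from PIL import Image

img = Image.open('test.jpg')

# Common options
# --psm: page segmentation mode
#   3: fully automatic page segmentation (default)
#   6: a single uniform block of text
#   7: a single text line
#   8: a single word
#   11: sparse text
# --oem: OCR engine mode
#   0: legacy engine only
#   1: neural-network LSTM engine only
#   3: default (based on what is available)

# Example: single-line recognition
text = pytesseract.image_to_string(img, lang='chi_sim', config='--psm 7')

# Example: digits only
text = pytesseract.image_to_string(img, config='--psm 6 -c tessedit_char_whitelist=0123456789')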
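
Config strings are easy to mistype when assembled by hand. One option is a small builder that validates the mode numbers before producing the string. The helper below is a sketch: build_config is not part of pytesseract, and it assumes the whitelist contains no spaces.

```python
def build_config(psm=3, oem=3, whitelist=None):
    """Build a Tesseract config string such as '--oem 3 --psm 7'."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # Restrict recognition to the given characters
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_config(psm=7))
    print(build_config(psm=6, whitelist="0123456789"))
```

The returned string is passed as the config argument of image_to_string.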

3. Choosing the right language

# English only
text = pytesseract.image_to_string(img, lang='eng')

# Simplified Chinese only
text = pytesseract.image_to_string(img, lang='chi_sim')

# Mixed Chinese and English
text = pytesseract.image_to_string(img, lang='chi_sim+eng')

# Traditional Chinese (requires chi_tra.traineddata)
text = pytesseract.image_to_string(img, lang='chi_tra')

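FAQ

Q: What should I do about TesseractNotFoundError?

Confirm that the Tesseract OCR engine is installed
Set the path in code:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Q: Why is Chinese recognition returning only garbled text?

Check that the chi_sim language pack is installed
Make sure the call passes lang='chi_sim'
Check the language pack location: C:\Program Files\Tesseract-OCR\tessdata\chi_sim.traineddata

Q: What can I do about very low accuracy?

Preprocess the image: grayscale, binarization, denoising
Increase the image resolution (300 DPI or higher)
Choose a suitable PSM mode
Keep the text sharp and the background simple

Q: Can it recognize handwriting?

Tesseract performs poorly on handwriting. A deep-learning model such as PaddleOCR is a better choice.

Q: Is commercial use free?

Tesseract is released under the Apache 2.0 license and is free for commercial use.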

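Q: How does pytesseract compare with other OCR options?

Option	Pros	Cons	Best for
pytesseract	Free, offline, lightweight	Moderate accuracy, poor on handwriting	Printed text
Baidu OCR API	High accuracy, supports handwriting	Paid, needs network access, rate limited	Commercial projects
PaddleOCR	High accuracy, free	Large models, complex setup	High-accuracy needs
EasyOCR	Multilingual, easy to use	Slower	Multilingual scenarios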

Common Features

Getting text positions

import pytesseract
from PIL import Image

img = Image.open('test.jpg')

# Get each recognized word together with its bounding box
data = pytesseract.image_to_data(img, lang='chi_sim', output_type=pytesseract.Output.DICT)
for i, text in enumerate(data['text']):
    if text.strip():
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        print(f"Text: {text}, position: ({x}, {y}), size: {w}x{h}")
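
The dictionary returned by image_to_data also carries page_num, block_num, par_num, and line_num for every entry, which makes it possible to rebuild whole lines from individual words. The sketch below works on a plain dictionary of that shape so it can run without Tesseract. The sample values are invented, and words_by_line is an illustrative name.

```python
from collections import defaultdict


def words_by_line(data, separator=" "):
    """Group recognized words into lines, ordered left to right within each line."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip the empty rows that describe blocks and lines
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append((data["left"][i], word))
    return [
        separator.join(word for _, word in sorted(items, key=lambda item: item[0]))
        for _, items in sorted(lines.items())
    ]


# Hypothetical data in the shape produced by output_type=pytesseract.Output.DICT
sample = {
    "text": ["", "Hello", "world", "", "Second", "line"],
    "page_num": [1, 1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [1, 1, 1, 2, 2, 2],
    "left": [10, 10, 60, 10, 10, 70],
}
print(words_by_line(sample))
```

For Chinese text, passing separator="" avoids inserting spaces between characters.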

Saving as PDF

import pytesseract
from PIL import Image

img = Image.open('scan.jpg')
# Returns the PDF content as bytes
pdf = pytesseract.image_to_pdf_or_hocr(img, lang='chi_sim', extension='pdf')
with open('output.pdf', 'wb') as f:
    f.write(pdf)

Confidence check

import pytesseract
from PIL import Image

img = Image.open('test.jpg')
data = pytesseract.image_to_data(img, lang='chi_sim', output_type=pytesseract.Output.DICT)
for i, text in enumerate(data['text']):
    # conf can be a number or a string depending on the version; -1 marks rows without text
    confidence = float(data['conf'][i])
    if confidence >= 0 and text.strip():
        print(f"Text: {text}, confidence: {confidence:.0f}%")
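
Confidence values are most useful as a filter: words below some cutoff can be dropped or flagged for manual review. The sketch below shows one way to do that on a plain dictionary shaped like the image_to_data output. The cutoff of 60 is an arbitrary assumption to tune per project, and the sample data is invented.

```python
def confident_words(data, min_conf=60.0):
    """Return (word, confidence) pairs whose confidence is at least min_conf."""
    kept = []
    for word, conf in zip(data["text"], data["conf"]):
        score = float(conf)  # conf may arrive as int, float, or string
        if score >= min_conf and word.strip():
            kept.append((word, score))
    return kept


# Hypothetical data; "-1" marks a row that carries no text
sample = {
    "text": ["", "Invoice", "N0.", "2024"],
    "conf": ["-1", 96, "41.5", 88.0],
}
print(confident_words(sample))
```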
