Processing unstructured data and converting it to structured formats with the Python unstructured library

pip install unstructured

import unstructured
print(unstructured.__version__)  # Example output: 0.16.17

# Mac
brew install libmagic
# Ubuntu
sudo apt-get install libmagic1

pip install unstructured-client

pip install "unstructured[docx]"

pip install "unstructured[local-inference]"

pip install unstructured

docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
docker exec -it unstructured bash

from unstructured.partition.auto import partition

# Parse a PDF file
elements = partition(filename="example.pdf")
for element in elements[:5]:
    print(f"{element.category}: {element.text}")

Title: Introduction
NarrativeText: This is the first paragraph...
ListItem: - Item 1

from unstructured.partition.pdf import partition_pdf

# High-resolution parsing (including tables)
elements = partition_pdf(filename="example.pdf", strategy="hi_res")

from unstructured.cleaners.core import clean, remove_punctuation

text = "Hello, World!!! This is a test..."
cleaned_text = clean(text, lowercase=True)  # Lowercase and clean the text
cleaned_text = remove_punctuation(cleaned_text)  # Remove punctuation
print(cleaned_text)  # Output: hello world this is a test

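from unstructured.partition.auto import partition
from unstructured.staging.base import convert_to_dict

# Convert the elements to a list of JSON-serializable dicts
elements = partition(filename="example.docx")
json_data = convert_to_dict(elements)
print(json_data[:2])  # Print the first two elements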

[{"type":"Title","text":"Introduction","metadata":{...}},{"type":"NarrativeText","text":"This is the first paragraph...","metadata":{...}}]
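The element dictionaries in the sample output are plain Python data, so they can be post-processed without the library. As a minimal sketch, assuming only the `type` and `text` keys shown in that output (the helper name `group_by_title` is hypothetical and not part of unstructured), the following groups body text under the nearest preceding title:

```python
from typing import Dict, List


def group_by_title(element_dicts: List[dict]) -> Dict[str, List[str]]:
    """Group element texts under the most recent Title element.

    Hypothetical helper, not part of unstructured. It relies only on the
    "type" and "text" keys shown in the sample output.
    """
    sections: Dict[str, List[str]] = {}
    current = "(untitled)"  # bucket for text that appears before any title
    for item in element_dicts:
        if item.get("type") == "Title":
            current = item.get("text", "")
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(item.get("text", ""))
    return sections


sample = [
    {"type": "Title", "text": "Introduction", "metadata": {}},
    {"type": "NarrativeText", "text": "This is the first paragraph...", "metadata": {}},
    {"type": "ListItem", "text": "- Item 1", "metadata": {}},
]
print(group_by_title(sample))
```

In practice the input would be the list returned by `convert_to_dict(elements)` rather than the hand-written `sample`.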

from langchain_unstructured import UnstructuredLoader

# Load locally
loader = UnstructuredLoader(file_path="example.pdf")
docs = loader.load()
print(docs[0].page_content[:100])  # Print the extracted text

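# Use the Serverless API (partition_via_api=True sends the file to the API
# instead of partitioning it locally)
loader = UnstructuredLoader(
    file_path="example.pdf",
    api_key="your_api_key",
    partition_via_api=True,
    strategy="hi_res"
)
docs = loader.load()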

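from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

client = UnstructuredClient(api_key_auth="your_api_key")
with open("example.pdf", "rb") as f:
    files = shared.Files(content=f.read(), file_name="example.pdf")
# Request shape used by recent unstructured-client releases (assumption: older versions differ)
params = shared.PartitionParameters(files=files, strategy=shared.Strategy.HI_RES)
response = client.general.partition(request=operations.PartitionRequest(partition_parameters=params))
print(response.elements[:2])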

# Run the Python script inside the container
from unstructured.partition.auto import partition

elements = partition(filename="/data/example.pdf")
print([str(el) for el in elements[:5]])

from unstructured.partition.auto import partition
from unstructured.staging.base import convert_to_dict
from langchain_unstructured import UnstructuredLoader
import json

# Parse the document
elements = partition(filename="report.pdf", strategy="hi_res")
json_data = convert_to_dict(elements)

# Save as JSON
with open("output.json", "w") as f:
    json.dump(json_data, f, indent=2)

# Load into LangChain
loader = UnstructuredLoader(file_path="report.pdf")
docs = loader.load()

# Example: use an LLM for question answering or summarization
from langchain_openai import OpenAI  # requires the langchain-openai package
llm = OpenAI(api_key="your_openai_key")
response = llm.invoke(f"Summarize: {docs[0].page_content[:500]}")
print(response)
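The prompt in this example uses only the first 500 characters of `docs[0]`. If the loader returns one document per element, that first document may be little more than a title. A minimal sketch, assuming plain strings taken from each `page_content` (the helper `build_context` is hypothetical and not part of LangChain or unstructured), joins texts until a character budget is reached:

```python
from typing import Iterable


def build_context(texts: Iterable[str], max_chars: int = 500) -> str:
    """Join non-empty texts with newlines without exceeding max_chars.

    Hypothetical helper for illustration; it stops at the first text
    that would push the result over the budget.
    """
    parts = []
    used = 0
    for text in texts:
        text = text.strip()
        if not text:
            continue
        extra = len(text) + (1 if parts else 0)  # +1 for the joining newline
        if used + extra > max_chars:
            break
        parts.append(text)
        used += extra
    return "\n".join(parts)


print(build_context(["Introduction", "", "First paragraph."], max_chars=40))
```

In the pipeline it could be called as `build_context(d.page_content for d in docs)` and the result placed in the prompt.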

from unstructured.partition.auto import partition
from unstructured.cleaners.core import clean, remove_punctuation
from unstructured.staging.base import convert_to_dict
from langchain_unstructured import UnstructuredLoader
import json

# Configure logging (using loguru)
from loguru import logger
logger.add("app.log", rotation="1 MB", level="INFO")

# Parse the PDF
logger.info("Starting PDF processing")
try:
    elements = partition(filename="sample.pdf", strategy="hi_res")
except Exception:
    # logger.exception records the traceback of the active exception
    logger.exception("Failed to process PDF")
    raise

# Clean the text
cleaned_elements = []
for element in elements:
    text = clean(element.text, lowercase=True)
    text = remove_punctuation(text)
    cleaned_elements.append({"type": element.category, "text": text})

logger.info("Text cleaning completed")

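# Save as JSON. cleaned_elements is already a list of plain dicts, so
# convert_to_dict (which expects Element objects) is not applied here.
json_data = cleaned_elements
with open("output.json", "w") as f:
    json.dump(json_data, f, indent=2)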

logger.info("JSON output saved")

# LangChain integration
loader = UnstructuredLoader(file_path="sample.pdf", strategy="hi_res")
docs = loader.load()
logger.info(f"Loaded {len(docs)} documents")

# Print the first 100 characters
print(docs[0].page_content[:100])

2025-05-09T01:33:56.123 | INFO | Starting PDF processing
2025-05-09T01:33:57.124 | INFO | Text cleaning completed
2025-05-09T01:33:57.125 | INFO | JSON output saved
2025-05-09T01:33:57.126 | INFO | Loaded 1 documents

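Python unstructured library: processing and preprocessing unstructured data

1. What the unstructured library does

2. Installation and environment requirements

3. Core features and usage

3.1 Partitioning (Partitioning Bricks)

3.2 Cleaning (Cleaning Bricks)

3.3 Staging (Staging Bricks)

3.4 LangChain integration

3.5 Using the Serverless API

3.6 Docker deployment

4. Performance and characteristics

5. Practical use cases

6. Deployment and scaling

7. Caveats

8. Comprehensive example

9. Resources and documentation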
