Processors transform and enrich your data before it’s stored in knowledge bases.

Available Processors

Document Parser

Extracts text from documents in the formats listed under Supported Formats.
result = client.processors.parse(
    file_url="https://example.com/document.pdf",
    output_format="markdown"
)

Text Chunker

Splits long text into chunks for embedding, with a configurable chunk size and overlap.
chunks = client.processors.chunk(
    text="Your long document text...",
    strategy="semantic",
    chunk_size=512,
    overlap=50
)
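
To make the chunk_size and overlap parameters concrete, here is a minimal local sketch of fixed-size chunking with overlap. It is not the service's implementation: the "semantic" strategy shown above presumably chooses boundaries differently, and this sketch counts size in characters, which is an assumption because the unit is not stated here. The function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows whose neighbors share `overlap` characters.

    Illustration only: sizes are counted in characters (an assumption),
    and no semantic boundary detection is attempted.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, overlap=3)
print(pieces)
```

With a chunk size of 10 and an overlap of 3, each window starts 7 characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next.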

Metadata Extractor

Extracts the metadata fields you request, such as title, summary, keywords, and entities, from document content.
metadata = client.processors.extract_metadata(
    content="Document content...",
    extract=["title", "summary", "keywords", "entities"]
)
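
As a rough illustration of what requesting specific fields means, the sketch below returns only the fields named in extract, using deliberately simple heuristics. It is a local stand-in under stated assumptions, not how the service works: naive_metadata, STOP_WORDS, and top_k are hypothetical names, and only title and keywords are handled.

```python
import re
from collections import Counter

# Small, illustrative stop-word list (an assumption; real extractors use fuller lists).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def naive_metadata(content: str, extract: list[str], top_k: int = 3) -> dict:
    """Return only the requested fields, computed with simple heuristics."""
    result = {}
    if "title" in extract:
        # The first non-empty line stands in for the title.
        lines = [line.strip() for line in content.splitlines() if line.strip()]
        result["title"] = lines[0] if lines else ""
    if "keywords" in extract:
        # Rank lowercase words by frequency, skipping stop words and very short words.
        words = re.findall(r"[a-z]+", content.lower())
        counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
        result["keywords"] = [word for word, _ in counts.most_common(top_k)]
    return result


doc = "Solar Panels\nSolar panels convert sunlight. Solar panels need care. Panels last long."
meta = naive_metadata(doc, ["title", "keywords"], top_k=2)
print(meta)
```

Fields that are not requested are simply absent from the result, which keeps the output small when only one or two fields are needed.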

Processing Pipeline

You can chain processors together:
# Define a pipeline
pipeline = client.pipelines.create(
    name="document-pipeline",
    steps=[
        {"processor": "document_parser", "config": {"output": "markdown"}},
        {"processor": "metadata_extractor", "config": {"extract": ["title", "summary"]}},
        {"processor": "chunker", "config": {"strategy": "semantic"}}
    ]
)

# Run the pipeline
result = client.pipelines.run(
    pipeline_id=pipeline.id,
    input_url="https://example.com/doc.pdf"
)
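
Conceptually, chaining means each step receives the output of the step before it. The sketch below models that locally with plain Python functions, assuming steps run in the order listed. The fake_* functions, STEP_REGISTRY, and run_steps are hypothetical stand-ins and not part of the client library; the real service may pass richer data between steps.

```python
from typing import Any, Callable


# Hypothetical stand-ins for processors: each takes the previous output and a config.
def fake_parser(data: Any, config: dict) -> dict:
    # Pretend the input is already text and just wrap it.
    return {"text": str(data), "format": config.get("output", "text")}


def fake_metadata_extractor(data: dict, config: dict) -> dict:
    # Use the first line as the title if "title" was requested.
    meta = {}
    if "title" in config.get("extract", []):
        meta["title"] = data["text"].splitlines()[0] if data["text"] else ""
    return {**data, "metadata": meta}


def fake_chunker(data: dict, config: dict) -> dict:
    # Split on blank lines as a crude placeholder for a real strategy.
    chunks = [p.strip() for p in data["text"].split("\n\n") if p.strip()]
    return {**data, "chunks": chunks}


STEP_REGISTRY: dict[str, Callable[[Any, dict], Any]] = {
    "document_parser": fake_parser,
    "metadata_extractor": fake_metadata_extractor,
    "chunker": fake_chunker,
}


def run_steps(steps: list[dict], initial_input: Any) -> Any:
    """Apply each step in list order, feeding its output to the next step."""
    data = initial_input
    for step in steps:
        processor = STEP_REGISTRY[step["processor"]]
        data = processor(data, step.get("config", {}))
    return data


steps = [
    {"processor": "document_parser", "config": {"output": "markdown"}},
    {"processor": "metadata_extractor", "config": {"extract": ["title", "summary"]}},
    {"processor": "chunker", "config": {"strategy": "semantic"}},
]
output = run_steps(steps, "Quarterly Report\n\nRevenue grew.\n\nCosts fell.")
print(output["metadata"], len(output["chunks"]))
```

Because the metadata step sits before the chunker in this sketch, it sees the whole parsed text instead of individual chunks, which is one reason step order matters.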

Supported Formats

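| Format   | Parser support | Notes                          |
|----------|----------------|--------------------------------|
| PDF      | Yes            | OCR available for scanned docs |
| DOCX     | Yes            | Preserves formatting           |
| PPTX     | Yes            | Extracts slide content         |
| HTML     | Yes            | Cleans and extracts text       |
| Markdown | Yes            | Direct processing              |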
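
A client-side check against the supported formats can catch unsupported files before a request is sent. The sketch below is one way to do that, assuming the format can be inferred from the URL's file extension. The extension mapping and the detect_format helper are hypothetical; whether variants such as .htm or .markdown are accepted is not stated here.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extensions inferred from the supported formats; the exact mapping is an assumption.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "DOCX",
    ".pptx": "PPTX",
    ".html": "HTML",
    ".md": "Markdown",
}


def detect_format(file_url: str) -> Optional[str]:
    """Return the format name for a URL, or None if the extension is not recognized."""
    path = urlparse(file_url).path  # drops any query string or fragment
    suffix = PurePosixPath(path).suffix.lower()
    return SUPPORTED_EXTENSIONS.get(suffix)


print(detect_format("https://example.com/document.pdf"))
print(detect_format("https://example.com/archive.zip"))
```

A None result only means the extension is not in this local mapping; the service itself remains the authority on what it accepts.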