Skip to content

Images, PDFs & audio

Agents aren't limited to text. You can send the model an image to look at, a PDF to read, or an audio clip to listen to — as long as the model you're using supports it. This is often called "multimodal" input (more than one mode of content).

Sending more than text

Normally a request is just a string. To include media, send a list of parts instead — text mixed with images, documents, or audio:

from agentix import TextPart, ImagePart

await agent.run([
    TextPart("What's in this picture?"),
    ImagePart.from_path("cat.png"),
])

The part types are TextPart, ImagePart, DocumentPart (for PDFs), and AudioPart. You can build each from a local file, raw bytes, a URL, or base64 data:

ImagePart.from_path("diagram.png")          # a file (type detected automatically)
ImagePart.from_url("https://example.com/x.jpg")
DocumentPart.from_path("report.pdf")

Each provider takes what it supports

Not every model accepts every kind of media. agentix translates each part into the right format for your chosen provider — and if a provider can't handle something (for example, Anthropic doesn't take audio), it raises a clear error instead of silently dropping it, so you're never confused about what happened.

Plain text still works exactly as before — you only use parts when you have media to send.

→ Runnable example: examples/22_multimodal.py