Structured Output | Athena | API Reference Docs

Use the Structured Data Extractor to extract structured data from text chunks using JSON schemas. This is useful for parsing unstructured text into well-defined data structures.

Key features:

Define output structure using JSON Schema (draft 2020-12)
Process multiple text chunks with a map-reduce pattern
Get validated, structured output matching your schema

Install Package

!pip install -U athena-intelligence

Set Up Client

from athena import Athena, Chunk, ChunkContentItem_Text

athena = Athena(api_key="<YOUR_API_KEY>")
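
Hardcoding the key works for a quick test, but in a shared notebook it is usually safer to read it from the environment. The following is a minimal sketch, assuming the key is stored in an environment variable named `ATHENA_API_KEY`. Both that variable name and the `load_api_key` helper are illustrative assumptions, not part of the athena-intelligence package.

```python
import os


def load_api_key(var_name: str = "ATHENA_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The helper and the default variable name are hypothetical; they are
    not provided by the athena-intelligence package.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage (requires the athena-intelligence package):
# athena = Athena(api_key=load_api_key())
```

Failing early with a clear message avoids sending a request with an empty key.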

Define Your Schema

Create a JSON schema that describes the structure you want to extract:

person_schema = {
    "title": "Person",
    "description": "Information about a person",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"}
    },
    "required": ["name"]
}
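
To make the meaning of `required` and the per-property `type` entries concrete, here is a small local check of a record against the schema. It is only a sketch: `check_record` is a hypothetical helper that covers just these two keywords, not a full JSON Schema validator and not something the Athena SDK provides. The schema is repeated (without `title` and `description`) so the block runs on its own.

```python
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"},
    },
    "required": ["name"],
}

# Maps the JSON Schema type names used above to Python types.
JSON_TYPES = {"string": str, "integer": int, "object": dict}


def check_record(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record passed.

    Hypothetical helper: only `required` and top-level property `type`
    keywords are checked.
    """
    problems = []
    for field in schema.get("required", []):
        if field not in record:
            problems.append(f"missing required field: {field}")
    for field, value in record.items():
        spec = schema.get("properties", {}).get(field)
        if spec is None:
            continue  # fields not listed in the schema are ignored here
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # type keyword not covered by this sketch
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field}: expected {spec['type']}")
    return problems


print(check_record({"name": "John Smith", "age": 35}, person_schema))
# []
print(check_record({"age": "35"}, person_schema))
# ['missing required field: name', 'age: expected integer']
```

Because only `name` is required, a record without `age` or `email` still passes.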

Extract Structured Data

Pass text chunks and your schema to the structured data extractor:

response = athena.tools.structured_data_extractor.invoke(
    chunks=[
        Chunk(
            chunk_id="1",
            content=[
                ChunkContentItem_Text(
                    text="John Smith is a 35 year old software developer. Contact him at john.smith@example.com"
                )
            ]
        ),
        Chunk(
            chunk_id="2",
            content=[
                ChunkContentItem_Text(
                    text="Jane Doe is a 28 year old data scientist. Her email is jane.doe@example.com"
                )
            ]
        )
    ],
    json_schema=person_schema,
    reduce=True  # combine the per-chunk results into a single object
)

print(response.reduced_data)

{
    "name": "John Smith",
    "age": 35,
    "email": "john.smith@example.com"
}
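
This page does not specify how the service merges per-chunk results into `reduced_data`. To illustrate the general map-reduce idea only, the sketch below treats the map step as producing one partial record per chunk and the reduce step as folding those records together. The first-value-wins merge and the `merge_first_wins` and `reduce_records` names are assumptions made for illustration, not Athena's documented behavior.

```python
from functools import reduce


def merge_first_wins(merged: dict, partial: dict) -> dict:
    """Fold one per-chunk record into the running result.

    A field already set in `merged` is kept; later chunks only fill gaps.
    (Assumed strategy, chosen for illustration.)
    """
    result = dict(merged)
    for field, value in partial.items():
        if result.get(field) is None and value is not None:
            result[field] = value
    return result


def reduce_records(partials: list) -> dict:
    """Reduce a list of partial records to one record."""
    return reduce(merge_first_wins, partials, {})


# Stand-ins for what a map step might extract from two chunks
# that describe the same person.
partials = [
    {"name": "John Smith", "age": 35},
    {"name": "John Smith", "email": "john.smith@example.com"},
]
print(reduce_records(partials))
# {'name': 'John Smith', 'age': 35, 'email': 'john.smith@example.com'}
```

Under a merge like this one, chunks describing different people would have their fields mixed, so reducing is best suited to chunks about a single entity.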

Access Chunk-by-Chunk Results

To get extracted data from each chunk individually, set reduce=False:

response = athena.tools.structured_data_extractor.invoke(
    chunks=[
        Chunk(
            chunk_id="1",
            content=[
                ChunkContentItem_Text(
                    text="John Smith is a 35 year old software developer."
                )
            ]
        ),
        Chunk(
            chunk_id="2",
            content=[
                ChunkContentItem_Text(
                    text="Jane Doe is a 28 year old data scientist."
                )
            ]
        )
    ],
    json_schema=person_schema,
    reduce=False  # keep one extracted result per chunk
)

# Each result carries the ID of the chunk it was extracted from.
for chunk_result in response.chunk_by_chunk_data:
    print(f"Chunk {chunk_result.chunk_id}: {chunk_result.data}")

Chunk 1: {'name': 'John Smith', 'age': 35}
Chunk 2: {'name': 'Jane Doe', 'age': 28}
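
Per-chunk results are often easier to work with once indexed by chunk ID. The sketch below uses `SimpleNamespace` stand-ins shaped like the items printed above (a `chunk_id` and a `data` attribute), and assumes `data` behaves like a plain dict, as the printed output suggests. In real code you would iterate over `response.chunk_by_chunk_data` instead.

```python
from types import SimpleNamespace

# Stand-ins shaped like the items of response.chunk_by_chunk_data;
# real code would use the API response directly.
chunk_results = [
    SimpleNamespace(chunk_id="1", data={"name": "John Smith", "age": 35}),
    SimpleNamespace(chunk_id="2", data={"name": "Jane Doe", "age": 28}),
]

# Index the extracted records by chunk ID for direct lookup.
people_by_chunk = {r.chunk_id: r.data for r in chunk_results}

# Keep only records that contain the schema's required "name" field.
named = [d for d in people_by_chunk.values() if d.get("name")]

print(people_by_chunk["2"]["name"])     # Jane Doe
print(sorted(p["age"] for p in named))  # [28, 35]
```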