Structured Output

The Structured Data Extractor pulls structured data out of text chunks according to a JSON schema you supply. This is useful for parsing unstructured text into well-defined data structures.

Key features:

  • Define output structure using JSON Schema (draft 2020-12)
  • Process multiple text chunks with a map-reduce pattern
  • Get validated, structured output matching your schema

Install Package

!pip install -U athena-intelligence

Set Up Client

from athena import Athena, Chunk, ChunkContentItem_Text

athena = Athena(api_key="<YOUR_API_KEY>")
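Hard-coding the key is fine for a quick test, but a common alternative is to read it from the environment. The sketch below assumes an environment variable named ATHENA_API_KEY; both that name and the load_api_key helper are illustrative choices for this example, not part of the athena package.

```python
import os


def load_api_key(env=None, var_name="ATHENA_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing.

    ATHENA_API_KEY is a hypothetical variable name chosen for this example.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage with the client set up above (not executed here):
# athena = Athena(api_key=load_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error from a later request.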

Define Your Schema

Create a JSON schema that describes the structure you want to extract:

person_schema = {
    "title": "Person",
    "description": "Information about a person",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"}
    },
    "required": ["name"]
}
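Before sending a schema to the extractor, it can help to check locally what the schema accepts and rejects. The following is a minimal sketch that checks only `required` and the top-level property `type` keywords; check_record is a hypothetical helper written for this example and is a simplified stand-in, not a full JSON Schema validator.

```python
person_schema = {
    "title": "Person",
    "description": "Information about a person",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"},
    },
    "required": ["name"],
}

# Maps JSON Schema type names to Python types (only the ones used here).
_JSON_TYPES = {"string": str, "integer": int}


def check_record(record, schema):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in schema.get("required", []):
        if field not in record:
            problems.append(f"missing required field: {field}")
    for field, value in record.items():
        expected = schema.get("properties", {}).get(field, {}).get("type")
        py_type = _JSON_TYPES.get(expected)
        if py_type is None:
            continue  # unknown field or unsupported type: skipped in this sketch
        # bool is a subclass of int in Python, so reject it for "integer"
        is_bool_as_int = expected == "integer" and isinstance(value, bool)
        if not isinstance(value, py_type) or is_bool_as_int:
            problems.append(f"{field}: expected {expected}")
    return problems


print(check_record({"name": "John Smith", "age": 35}, person_schema))  # []
print(check_record({"age": "35"}, person_schema))
# ['missing required field: name', 'age: expected integer']
```

For complete draft 2020-12 validation, a dedicated library such as jsonschema is likely a better fit than a hand-rolled check like this one.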

Extract Structured Data

Pass text chunks and your schema to the structured data extractor:

response = athena.tools.structured_data_extractor.invoke(
    chunks=[
        Chunk(
            chunk_id="1",
            content=[
                ChunkContentItem_Text(
                    text="John Smith is a 35 year old software developer. Contact him at john.smith@example.com"
                )
            ]
        ),
        Chunk(
            chunk_id="2",
            content=[
                ChunkContentItem_Text(
                    text="Jane Doe is a 28 year old data scientist. Her email is jane.doe@example.com"
                )
            ]
        )
    ],
    json_schema=person_schema,
    reduce=True  # combine the per-chunk results into one object
)

print(response.reduced_data)

Output:

{
  "name": "John Smith",
  "age": 35,
  "email": "john.smith@example.com"
}
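The output above suggests that reduced_data can be read like a dictionary keyed by the schema's property names. Because the schema marks only "name" as required, "age" and "email" may be absent. As a sketch under that assumption, the result can be converted into a typed object; the Person dataclass and to_person helper are hypothetical names introduced for this example, and a literal dict stands in for response.reduced_data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: str
    age: Optional[int] = None    # optional in person_schema
    email: Optional[str] = None  # optional in person_schema


def to_person(data):
    """Build a Person from a dict shaped like the output above."""
    if "name" not in data:
        raise ValueError("'name' is required by person_schema")
    return Person(name=data["name"], age=data.get("age"), email=data.get("email"))


# Stand-in for response.reduced_data, copied from the output shown above.
reduced = {"name": "John Smith", "age": 35, "email": "john.smith@example.com"}
person = to_person(reduced)
print(person.name, person.age)  # John Smith 35
```

Using .get() for the optional fields avoids a KeyError when the source text does not mention an age or email address.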

Access Chunk-by-Chunk Results

To get extracted data from each chunk individually, set reduce=False:

response = athena.tools.structured_data_extractor.invoke(
    chunks=[
        Chunk(
            chunk_id="1",
            content=[
                ChunkContentItem_Text(
                    text="John Smith is a 35 year old software developer."
                )
            ]
        ),
        Chunk(
            chunk_id="2",
            content=[
                ChunkContentItem_Text(
                    text="Jane Doe is a 28 year old data scientist."
                )
            ]
        )
    ],
    json_schema=person_schema,
    reduce=False  # keep one result per chunk
)

# Each result carries the chunk_id it was extracted from.
for chunk_result in response.chunk_by_chunk_data:
    print(f"Chunk {chunk_result.chunk_id}: {chunk_result.data}")

Output:

Chunk 1: {'name': 'John Smith', 'age': 35}
Chunk 2: {'name': 'Jane Doe', 'age': 28}
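Once the per-chunk results are available, it is often convenient to index them by chunk ID so each record can be traced back to its source text. The sketch below uses SimpleNamespace objects as stand-ins for the items in response.chunk_by_chunk_data, assuming only the chunk_id and data attributes shown in the loop above; index_by_chunk is a hypothetical helper.

```python
from types import SimpleNamespace

# Stand-ins for the items in response.chunk_by_chunk_data, mirroring the output above.
chunk_results = [
    SimpleNamespace(chunk_id="1", data={"name": "John Smith", "age": 35}),
    SimpleNamespace(chunk_id="2", data={"name": "Jane Doe", "age": 28}),
]


def index_by_chunk(results):
    """Map each chunk_id to its extracted data, skipping chunks with no data."""
    return {r.chunk_id: r.data for r in results if r.data}


people = index_by_chunk(chunk_results)
print(people["2"]["name"])  # Jane Doe
print(sorted(p["age"] for p in people.values()))  # [28, 35]
```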