Invoke

Extract structured data.

tl;dr:

pass a valid JSON schema in json_schema
pass the page chunks as a list of Chunk objects, by default: {"type": "text", "content": "..."}
leave all other fields as default

Detailed configuration (only relevant for complex use cases):

The structured data extractor’s architecture follows the map-reduce pattern: the asset is divided into chunks, data matching the schema is extracted from each chunk, and the per-chunk results are then reduced to a single structured data object.

In some applications, you may not want to:

map (if your input asset is small enough)
reduce (if your output object is large enough to overflow the output length, if you’re extracting a long list of entities, or if you want to extract all instances of the schema).

You can configure these behaviors with the map and reduce fields.
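The flow described above can be sketched locally in plain Python. This only illustrates the map-reduce pattern and is not the service's implementation: `extract_from_chunk` and `merge_results` are hypothetical stand-ins for the model calls, and the `do_map` and `do_reduce` flags merely mimic the role of the map and reduce fields.

```python
import re


def extract_from_chunk(text: str) -> dict:
    """Hypothetical stand-in for the per-chunk extraction (the map step)."""
    result = {}
    email = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
    age = re.search(r"(\d+) year old", text)
    if email:
        result["email"] = email.group(0)
    if age:
        result["age"] = int(age.group(1))
    return result


def merge_results(results: list[dict]) -> dict:
    """Hypothetical stand-in for the reduce step: later values overwrite earlier ones."""
    merged: dict = {}
    for result in results:
        merged.update(result)
    return merged


def extract(chunks: list[str], do_map: bool = True, do_reduce: bool = True):
    # Without the map step, the whole asset is treated as a single chunk.
    pieces = chunks if do_map else [" ".join(chunks)]
    per_chunk = [extract_from_chunk(piece) for piece in pieces]
    # Without the reduce step, the per-chunk results are returned as a list.
    return merge_results(per_chunk) if do_reduce else per_chunk


chunks = [
    "John Smith is a 35 year old developer. You can reach him at john.smith@example.com",
    "Jane Doe is a 25 year old developer. You can reach her at jane@example.com",
]
per_chunk = extract(chunks, do_reduce=False)  # one dict per chunk
merged = extract(chunks)                      # a single merged dict
```

With this naive merge the second person overwrites the first, which is one reason to skip the reduce step when every instance of the schema should be kept.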

Request

This endpoint expects an object.

chunks: list of objects, Required

The chunks from which to extract structured data.

json_schema: map from strings to any, Required

The JSON schema to use for validation (version draft 2020-12). See the docs here.
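As a rough illustration of what it means for an object to match json_schema, the sketch below checks a candidate against the `required` and `properties` keywords only. `check_object` is a hypothetical helper that covers a small subset of JSON Schema; full draft 2020-12 validation would need a dedicated validator library.

```python
# Map JSON Schema type names to Python types (a small subset).
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_object(candidate: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the subset check passed."""
    problems = []
    for key in schema.get("required", []):
        if key not in candidate:
            problems.append(f"missing required property: {key}")
    for key, rule in schema.get("properties", {}).items():
        type_name = rule.get("type")
        # Only simple string type names are handled in this sketch.
        expected = JSON_TYPES.get(type_name) if isinstance(type_name, str) else None
        if key not in candidate or expected is None:
            continue
        value = candidate[key]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_bool_mismatch = isinstance(value, bool) and type_name != "boolean"
        if is_bool_mismatch or not isinstance(value, expected):
            problems.append(f"{key} should be of type {type_name}")
    return problems


person_schema = {
    "title": "Person",
    "description": "A person",
    "type": "object",
    "properties": {
        "age": {"type": "integer"},
        "email": {"type": "string"},
        "name": {"type": "string"},
    },
    "required": ["name"],
}

problems_ok = check_object({"name": "John Smith", "age": 35}, person_schema)
problems_bad = check_object({"age": "35"}, person_schema)
```

The first candidate passes the subset check, while the second is reported as missing `name` and as carrying a string where an integer is expected.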

chunk_messages: list of objects, Optional

The prompt to use for the data extraction over each individual chunk. It must be a list of messages. The chunk content will be appended as a list of human messages.

reduce: boolean, Optional, Defaults to true

When map is enabled, whether to reduce the per-chunk results to a single structured object (true) or return the full list (false). Use true unless you want to preserve duplicates from each page or expect the object to overflow the output context.

reduce_messages: list of objects, Optional

The prompt to use for the reduce steps. It must be a list of messages. The two extraction attempts will be appended as a list of human messages.

Response

Successful Response

chunk_by_chunk_data: list of objects or null

The extracted structured data for each chunk. A list where each element is guaranteed to match json_schema.

reduced_data: map from strings to any, or null

If reduce is true, the reduced structured data; otherwise null. Guaranteed to match json_schema.

from athena import Athena, Chunk, ChunkContentItem_Text

client = Athena(
    api_key="YOUR_API_KEY",
)
client.tools.structured_data_extractor.invoke(
    chunks=[
        Chunk(
            chunk_id="1",
            content=[
                ChunkContentItem_Text(
                    text="John Smith is a 35 year old developer. You can reach him at john.smith@example.com",
                )
            ],
        ),
        Chunk(
            chunk_id="2",
            content=[
                ChunkContentItem_Text(
                    text="Jane Doe is a 25 year old developer. You can reach her at jane@example.com",
                )
            ],
        ),
    ],
    json_schema={
        "description": "A person",
        "properties": {
            "age": {"type": "integer"},
            "email": {"type": "string"},
            "name": {"type": "string"},
        },
        "required": ["name"],
        "title": "Person",
        "type": "object",
    },
)

{
  "chunk_by_chunk_data": [
    {
      "chunk_id": "1",
      "chunk_result": {
        "age": 35,
        "email": "john.smith@example.com",
        "name": "John Smith"
      }
    },
    {
      "chunk_id": "2",
      "chunk_result": {
        "age": 25,
        "email": "jane@example.com",
        "name": "Jane Doe"
      }
    }
  ],
  "reduced_data": {
    "age": 25,
    "email": "jane@example.com",
    "name": "Jane Doe"
  }
}
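
Because either response field can be null, client code may want to handle both cases. The sketch below assumes the response is available as a plain dict shaped like the JSON above (the SDK may return a typed object instead); `pick_results` is a hypothetical helper, not part of the SDK.

```python
def pick_results(response: dict) -> list[dict]:
    """Return the extracted objects as a list, whichever field is populated."""
    reduced = response.get("reduced_data")
    if reduced is not None:
        # reduce was true: a single merged object is available.
        return [reduced]
    # reduce was false: fall back to the per-chunk results.
    per_chunk = response.get("chunk_by_chunk_data") or []
    return [item["chunk_result"] for item in per_chunk]


reduced_response = {
    "chunk_by_chunk_data": [
        {"chunk_id": "1", "chunk_result": {"age": 35, "name": "John Smith"}},
        {"chunk_id": "2", "chunk_result": {"age": 25, "name": "Jane Doe"}},
    ],
    "reduced_data": {"age": 25, "name": "Jane Doe"},
}
unreduced_response = {**reduced_response, "reduced_data": None}

single = pick_results(reduced_response)
everyone = pick_results(unreduced_response)
```

Here `single` holds only the reduced object, while `everyone` keeps one entry per chunk.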
