Extract structured data.
tl;dr:
json_schema
Chunk
objects, by default: {"type": "text", "content": "..."}
Detailed configuration (only relevant for complex use cases):
The structured data extractor’s architecture follows the map-reduce pattern, where the asset is divided into chunks, the schema is extracted from each chunk, and the chunks are then reduced to a single structured data object.
In some applications, you may not want to:
You can configure these behaviors with the map
and reduce
fields.
The chunks from which to extract structured data.
The JSON schema to use for validation (version draft 2020-12). See the docs here.
The prompt to use for the data extraction over each individual chunk. It must be a list of messages. The chunk content will be appended as a list of human messages.
If map
, whether to reduce the chunks to a single structured object (true) or return the full list (false). Use True unless you want to preserve duplicates from each page or expect the object to overflow the output context.
The prompt to use for the reduce steps. It must be a list of messages. The two extraction attempts will be appended as a list of human messages.
The extracted structured data for each chunk. A list where each element is guaranteed to match json_schema
.
If reduce is True, the reduced structured data, otherwise null. Guaranteed to match json_schema
.