Load Data Frames

Set up environment

!pip install -U athenaintel

import os

from athena.client import Athena

# Read the API key from the environment rather than hard-coding it.
ATHENA_API_KEY = os.environ["ATHENA_API_KEY"]

athena = Athena(
    api_key=ATHENA_API_KEY,
)

Load a JSON-serialisable data frame

Call tools.data_frame() to load a data frame from a CSV or Excel file:

df = athena.tools.data_frame(
    asset_id='doc_9249292-d118-42d3-95b4-00eccfe0754f'
)
df

Athena returns a simple pandas DataFrame representation with the default parsing options. You can adjust the following options:

row_limit (int): number of rows to load
index_column (int): column to use as an index
columns (list[str | int]): indices or names of columns to include
sheet_name (str | int): name of the sheet to load, only applicable to Excel files
separator (str): separator to use when parsing, only applicable to CSV files

For example, when working with large datasets, it can be useful to examine the first five rows before loading everything:

df_head = athena.tools.data_frame(
    asset_id='doc_9249292-d118-42d3-95b4-00eccfe0754f',
    row_limit=5
)
df_head
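
The options listed above can also be combined in a single call. The following is a minimal sketch, assuming the options are passed as keyword arguments in the same way as row_limit. The helper name load_preview, the column names, and the separator are hypothetical and only serve as an illustration.

```python
from typing import Any


def load_preview(client: Any, asset_id: str, rows: int = 5) -> Any:
    """Load only the first `rows` rows and two columns of a CSV asset.

    `client` is expected to be an Athena client like the one created in
    the setup step. Column names and separator are placeholders.
    """
    return client.tools.data_frame(
        asset_id=asset_id,
        row_limit=rows,              # number of rows to load
        columns=["date", "amount"],  # hypothetical column names
        separator=";",               # only applicable to CSV files
    )
```

With the client from the setup step, this would be called as load_preview(athena, 'doc_9249292-d118-42d3-95b4-00eccfe0754f').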

Load a large or complex data frame

The tools.data_frame() method is sufficient for handling well-formatted, medium-sized data frames, and it provides an interface that is agnostic to the SDK version (a sister method is available in the TypeScript SDK).

However, if your Excel files include values that cannot be JSON-serialized, can only be serialized with a loss of precision, or contain additional metadata, you may prefer to use the tools.read_data_frame() method. This method skips the JSON serialization step and provides a raw byte stream to the pandas read_csv or read_excel methods, as appropriate.
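
To make the serialization concern concrete, here is a small standard-library sketch. It does not call Athena and only demonstrates general JSON behavior in Python; the sample values are made up for illustration.

```python
import json
from datetime import date
from decimal import Decimal

# A Decimal keeps every digit, but the json module cannot serialize it directly.
price = Decimal("0.1000000000000000055")
try:
    json.dumps({"price": price})
    decimal_serializable = True
except TypeError:
    decimal_serializable = False

# Converting to float makes the value serializable, at the cost of precision.
round_tripped = json.loads(json.dumps({"price": float(price)}))["price"]
precision_lost = Decimal(str(round_tripped)) != price

# Dates are another common cell type with no native JSON representation.
try:
    json.dumps({"day": date(2024, 1, 31)})
    date_serializable = True
except TypeError:
    date_serializable = False

print(decimal_serializable, precision_lost, date_serializable)
# prints: False True False
```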

The keyword arguments provided to read_data_frame will be passed to the underlying read_csv/read_excel, depending on the file type.

import numpy as np

# dtype is forwarded to pandas read_csv/read_excel.
df_head = athena.tools.read_data_frame(
    asset_id='doc_9249292-d118-42d3-95b4-00eccfe0754f',
    dtype={"a": np.float64, "b": np.int32}
)
df_head
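
Because the keyword arguments are forwarded to pandas, it can help to try them locally before applying them to a real asset. The sketch below uses an in-memory byte stream as a stand-in for the asset and calls pandas directly; it assumes pandas and NumPy are installed, and the sample data is made up.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the raw bytes of a CSV asset (made-up sample data).
raw = io.BytesIO(b"a,b\n1.5,2\n2.5,3\n")

# The same dtype mapping as above, passed straight to pandas.read_csv.
df_local = pd.read_csv(raw, dtype={"a": np.float64, "b": np.int32})
print(df_local.dtypes)
```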

Load a data frame with another package

If you prefer to use another data frame implementation, you can access the raw byte stream object using the tools.get_file() method, which accepts a single argument, the document identifier (asset_id). The resulting object complies with the io.BytesIO interface and can be used with most data frame libraries, for example:

import polars as pl

bytes_io = athena.tools.get_file(
    asset_id='doc_9249292-d118-42d3-95b4-00eccfe0754f',
)
df = pl.read_csv(bytes_io)
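
Since the returned object follows the io.BytesIO interface, it can also be read without any data frame library. The following is a minimal sketch using only the standard library, with an in-memory io.BytesIO standing in for the result of tools.get_file(). The sample bytes and the UTF-8 encoding are assumptions.

```python
import csv
import io

# Stand-in for the stream returned by tools.get_file() (made-up sample data).
bytes_io = io.BytesIO(b"name,score\nada,3\ngrace,5\n")

# Wrap the byte stream so the csv module can read decoded text from it.
text_stream = io.TextIOWrapper(bytes_io, encoding="utf-8", newline="")
rows = list(csv.DictReader(text_stream))
print(rows)
```

Note that csv.DictReader returns every value as a string, so any numeric conversion is left to the caller.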