DataFrame

DataFrame represents a query against a specific table. Unlike dataframes in computation frameworks such as pandas or Dask, a Pixeltable DataFrame does not hold data, nor can it be used to update data (use insert/update/delete on the table for that purpose). Another difference from pandas is that query execution must be initiated explicitly (e.g., via collect()) in order to return results.
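For example, a query is assembled by chaining the clauses described below, and nothing executes until collect() (or show(), head(), tail()) is called. A minimal sketch, assuming a working Pixeltable installation and a hypothetical films table with title, revenue, and budget columns:

```python
import pixeltable as pxt

films = pxt.get_table('films')  # hypothetical existing table

# Chaining clauses builds a DataFrame; no query runs yet.
query = (
    films
    .where(films.budget > 100.0)
    .select(films.title, profit=films.revenue - films.budget)
    .order_by(films.title)
    .limit(10)
)

# Execution happens only here.
result = query.collect()
```

The table name and columns above are illustrative, not part of the API.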

Overview

Query Construction
select Select output expressions
where Filter table rows
group_by Group table rows in order to apply aggregate functions
order_by Order output rows
limit Limit the number of output rows
Query Execution
collect Return all output rows
show Return a number of output rows
head Return the oldest rows
tail Return the most recently added rows
Data Export
to_pytorch_dataset Return the query result as a pytorch IterableDataset
to_coco_dataset Return the query result as a COCO dataset

pixeltable.DataFrame

DataFrame(
    tbl: TableVersionPath,
    select_list: Optional[List[Tuple[Expr, Optional[str]]]] = None,
    where_clause: Optional[Predicate] = None,
    group_by_clause: Optional[List[Expr]] = None,
    grouping_tbl: Optional[TableVersion] = None,
    order_by_clause: Optional[List[Tuple[Expr, bool]]] = None,
    limit: Optional[int] = None,
)

collect

collect() -> DataFrameResultSet

group_by

group_by(*grouping_items: Any) -> DataFrame

Add a group-by clause to this DataFrame. Variants:

  • group_by(&lt;base_table&gt;): group a component view by its respective base table rows

  • group_by(&lt;expr&gt;, ...): group by the given expressions

head

head(n: int = 10) -> DataFrameResultSet

limit

limit(n: int) -> DataFrame

order_by

order_by(*expr_list: Expr, asc: bool = True) -> DataFrame

select

select(*items: Any, **named_items: Any) -> DataFrame

show

show(n: int = 20) -> DataFrameResultSet

tail

tail(n: int = 10) -> DataFrameResultSet

to_pytorch_dataset

to_pytorch_dataset(
    image_format: str = "pt",
) -> "torch.utils.data.IterableDataset"

Convert the dataframe to a pytorch IterableDataset suitable for parallel loading with torch.utils.data.DataLoader.

This method requires pyarrow >= 13, torch and torchvision to work.

This method serializes data so it can be read from disk efficiently and repeatedly without re-executing the query. This data is cached to disk for future re-use.

Parameters:

  • image_format (str, default: 'pt' ) –

    format of the images. Can be 'pt' (pytorch tensor) or 'np' (numpy array). With 'np', image columns are returned as RGB uint8 arrays of shape HxWxC; with 'pt', they are returned as CxHxW tensors of type torch.float32 with values in [0, 1] (the format produced by torchvision.transforms.ToTensor()).

Returns:

  • 'torch.utils.data.IterableDataset'

    A pytorch IterableDataset: Columns become fields of the dataset, where rows are returned as a dictionary compatible with torch.utils.data.DataLoader default collation.

Constraints

The default collate_fn for torch.utils.data.DataLoader cannot represent null values within a pytorch tensor when forming batches; such values will raise an exception while running the dataloader.

If your data contains None values, you can work around them by providing a custom collate_fn to the DataLoader (and having your model handle them). Alternatively, if these values are not meaningful within a minibatch, you can modify or remove them through selections and filters prior to calling to_pytorch_dataset().
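A custom collate_fn along these lines could substitute a sentinel for None values before batching; the function name, sentinel, and field names below are illustrative, not part of any library API:

```python
# Sketch of a collate function that replaces None field values with a
# sentinel, so that batches contain no nulls. In real use it would be passed
# as collate_fn to torch.utils.data.DataLoader.

def collate_skip_none(batch, sentinel=-1):
    """Collate a list of row dicts into a dict of lists, replacing None with `sentinel`."""
    out = {}
    for key in batch[0]:
        out[key] = [row[key] if row[key] is not None else sentinel for row in batch]
    return out

rows = [
    {"id": 1, "score": 0.5},
    {"id": 2, "score": None},  # a null that default collation would reject
]
print(collate_skip_none(rows))  # {'id': [1, 2], 'score': [0.5, -1]}
```

Whether a sentinel is appropriate depends on your model; filtering such rows out with where() before calling to_pytorch_dataset() is often the simpler option.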

to_coco_dataset

to_coco_dataset() -> Path

Convert the dataframe to a COCO dataset. This dataframe must return a single json-typed output column in the following format:

{
    'image': PIL.Image.Image,
    'annotations': [
        {
            'bbox': [x: int, y: int, w: int, h: int],
            'category': str | int,
        },
        ...
    ],
}
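To make the expected row shape concrete, here is a sketch that assembles one such row in plain Python; the helper function is hypothetical, and a placeholder stands in for the PIL.Image.Image that a real row would carry:

```python
# Build one output row in the format to_coco_dataset() expects:
# a dict with an 'image' and a list of {'bbox', 'category'} annotations.

def make_coco_row(image, annotations):
    """Assemble a row dict from an image and (bbox, category) pairs."""
    return {
        "image": image,
        "annotations": [
            {"bbox": list(bbox), "category": category}
            for bbox, category in annotations
        ],
    }

row = make_coco_row(
    image=None,  # placeholder for a PIL.Image.Image
    annotations=[((10, 20, 50, 80), "dog"), ((5, 5, 30, 30), 2)],
)
print(row["annotations"][0])  # {'bbox': [10, 20, 50, 80], 'category': 'dog'}
```

Note that bbox uses [x, y, w, h] (top-left corner plus width and height), and a category may be either a string label or an integer id.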

Returns:

  • Path

    Path to the COCO dataset file.

where

where(pred: Predicate) -> DataFrame