DataFrame
DataFrame represents a query against a specific table. Unlike computation container frameworks such as pandas or Dask,
Pixeltable dataframes do not hold data and do not let you update data (use insert/update/delete for that purpose).
Another difference from pandas is that query execution must be initiated explicitly in order to return results.
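The deferred-execution idea can be illustrated with a toy model. The sketch below is not Pixeltable code: `ToyQuery` and its fields are hypothetical names invented for illustration, and it only mimics the behavior described above, namely that building a query returns a new description of the query and nothing runs until an execution method such as `collect()` is called.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class ToyQuery:
    """Toy stand-in for a lazy query object (hypothetical, not Pixeltable's DataFrame)."""

    rows: Tuple[Row, ...]                            # source rows; never modified by the query
    predicate: Optional[Callable[[Row], bool]] = None
    max_rows: Optional[int] = None

    def where(self, predicate: Callable[[Row], bool]) -> "ToyQuery":
        # Returns a new query description; no rows are touched here.
        return replace(self, predicate=predicate)

    def limit(self, n: int) -> "ToyQuery":
        # Also only records the limit; nothing is executed.
        return replace(self, max_rows=n)

    def collect(self) -> List[Row]:
        # Execution happens only here, when explicitly requested.
        out = [dict(r) for r in self.rows
               if self.predicate is None or self.predicate(r)]
        return out if self.max_rows is None else out[: self.max_rows]


source_rows = (
    {"id": 1, "score": 0.2},
    {"id": 2, "score": 0.9},
    {"id": 3, "score": 0.7},
)
query = ToyQuery(rows=source_rows)
filtered = query.where(lambda r: r["score"] > 0.5).limit(1)  # still just a description
results = filtered.collect()                                  # now the query runs
```

Because each construction step returns a new object, the original `query` is unaffected by the `where` and `limit` calls and can be reused.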
Overview
Query Construction | |
---|---|
select | Select output expressions |
where | Filter table rows |
group_by | Group table rows in order to apply aggregate functions |
order_by | Order output rows |
limit | Limit the number of output rows |

Query Execution | |
---|---|
collect | Return all output rows |
show | Return a number of output rows |
head | Return the oldest rows |
tail | Return the most recently added rows |

Data Export | |
---|---|
to_pytorch_dataset | Return the query result as a pytorch IterableDataset |
to_coco_dataset | Return the query result as a COCO dataset |
pixeltable.DataFrame
DataFrame(
    tbl: TableVersionPath,
    select_list: Optional[List[Tuple[Expr, Optional[str]]]] = None,
    where_clause: Optional[Expr] = None,
    group_by_clause: Optional[List[Expr]] = None,
    grouping_tbl: Optional[TableVersion] = None,
    order_by_clause: Optional[List[Tuple[Expr, bool]]] = None,
    limit: Optional[int] = None,
)
collect
collect() -> DataFrameResultSet
group_by
group_by(*grouping_items: Any) -> DataFrame
Add a group-by clause to this DataFrame.
Variants:
- group_by(
head
head(n: int = 10) -> DataFrameResultSet
show
show(n: int = 20) -> DataFrameResultSet
tail
tail(n: int = 10) -> DataFrameResultSet
to_pytorch_dataset
to_pytorch_dataset(
    image_format: str = "pt",
) -> "torch.utils.data.IterableDataset"
Convert the dataframe to a pytorch IterableDataset suitable for parallel loading with torch.utils.data.DataLoader.
This method requires pyarrow >= 13, torch and torchvision to work.
This method serializes data so it can be read from disk efficiently and repeatedly without re-executing the query. This data is cached to disk for future re-use.
Parameters:
- image_format (str, default: 'pt') – Format of the images. Can be 'pt' (pytorch tensor) or 'np' (numpy array). 'np' means image columns are returned as an RGB uint8 array of shape HxWxC. 'pt' means image columns are returned as a CxHxW tensor of type torch.float32 with values in [0,1] (the format output by torchvision.transforms.ToTensor()).
Returns:
- 'torch.utils.data.IterableDataset' – A pytorch IterableDataset. Columns become fields of the dataset, and each row is returned as a dictionary compatible with the default collation of torch.utils.data.DataLoader.
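The two image layouts selected by image_format differ in both axis order and value range. As a minimal sketch of that relationship, the function below converts an HxWxC image with values in 0..255 (the 'np' layout) into a CxHxW image with values in [0,1] (the 'pt' layout). It uses plain nested lists so that it runs without numpy or torch; the function name is hypothetical and this is not how Pixeltable performs the conversion.

```python
from typing import List


def hwc_uint8_to_chw_float(image: List[List[List[int]]]) -> List[List[List[float]]]:
    """Reorder HxWxC values in 0..255 into CxHxW values in [0, 1]."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    return [
        [[image[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A 1x2 RGB image: one red pixel and one green pixel (H=1, W=2, C=3).
np_style = [[[255, 0, 0], [0, 255, 0]]]
pt_style = hwc_uint8_to_chw_float(np_style)
# pt_style has one plane per channel (C=3), each of shape H=1 by W=2.
```

With real arrays the same reordering is what torchvision.transforms.ToTensor() applies to an HxWxC uint8 input.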
Constraints
The default collate_fn for torch.utils.data.DataLoader cannot represent null values as part of a pytorch tensor when forming batches. Such values raise an exception while the dataloader is running.
If your data contains None values, you can work around them by passing a custom collate_fn to the DataLoader (and having your model handle the result). Alternatively, if these values are not meaningful within a minibatch, you can modify or remove them through selections and filters before calling to_pytorch_dataset().
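One possible shape for such a custom collate_fn is sketched below. It drops every row that contains a None value and hands the remaining rows to a base collate function. The names `make_none_dropping_collate` and `columns_as_lists` are hypothetical, and the sketch assumes each row arrives as a dictionary, as described under Returns above. The stand-in base function is used only so the example runs without torch.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


def make_none_dropping_collate(
    base_collate: Callable[[List[Row]], Any],
) -> Callable[[List[Row]], Optional[Any]]:
    """Wrap base_collate so that rows containing None are removed first."""

    def collate(batch: List[Row]) -> Optional[Any]:
        kept = [row for row in batch
                if all(value is not None for value in row.values())]
        if not kept:
            # Every row had a None value; the caller must skip this batch.
            return None
        return base_collate(kept)

    return collate


def columns_as_lists(rows: List[Row]) -> Dict[str, List[Any]]:
    """Stand-in base collate: turn a list of row dicts into a dict of lists."""
    return {key: [row[key] for row in rows] for key in rows[0]}


collate = make_none_dropping_collate(columns_as_lists)
batch = [
    {"x": 1, "label": "a"},
    {"x": None, "label": "b"},   # dropped: contains a None value
    {"x": 3, "label": "c"},
]
result = collate(batch)
empty_result = collate([{"x": None, "label": "d"}])
```

With torch installed, the wrapper could be given torch.utils.data.default_collate (available in recent torch versions) as base_collate and then passed to the DataLoader through its collate_fn argument. Batches then vary in size, and the training loop has to skip a batch that comes back as None.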
to_coco_dataset
to_coco_dataset() -> Path
Convert the dataframe to a COCO dataset. This dataframe must return a single json-typed output column in the following format:
{
    'image': PIL.Image.Image,
    'annotations': [
        {
            'bbox': [x: int, y: int, w: int, h: int],
            'category': str | int,
        },
        ...
    ],
}
Returns:
- Path – Path to the COCO dataset file.
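Before exporting, it can help to check that each output value matches the format required by to_coco_dataset(). The helper below is a hypothetical sketch, not part of Pixeltable: it inspects one value as a plain dictionary and lists the ways it deviates from that format. It only checks that an 'image' key is present and does not verify that the value is a PIL.Image.Image, so it runs without Pillow.

```python
from typing import Any, Dict, List


def coco_row_problems(row: Dict[str, Any]) -> List[str]:
    """Return a list of deviations from the expected COCO input format."""
    problems: List[str] = []
    if "image" not in row:
        problems.append("missing 'image'")
    annotations = row.get("annotations")
    if not isinstance(annotations, list):
        problems.append("'annotations' must be a list")
        return problems
    for i, ann in enumerate(annotations):
        if not isinstance(ann, dict):
            problems.append(f"annotations[{i}] must be a dict")
            continue
        bbox = ann.get("bbox")
        bbox_ok = (
            isinstance(bbox, list)
            and len(bbox) == 4
            and all(isinstance(v, int) for v in bbox)
        )
        if not bbox_ok:
            problems.append(f"annotations[{i}]: 'bbox' must be four ints [x, y, w, h]")
        if not isinstance(ann.get("category"), (str, int)):
            problems.append(f"annotations[{i}]: 'category' must be str or int")
    return problems


good_row = {
    "image": object(),  # placeholder for a PIL.Image.Image
    "annotations": [{"bbox": [10, 20, 30, 40], "category": "cat"}],
}
bad_row = {"annotations": [{"bbox": [10, 20, 30], "category": 1.5}]}

good_problems = coco_row_problems(good_row)
bad_problems = coco_row_problems(bad_row)
```

An empty list means the value has the expected structure; `bad_row` is flagged for the missing image, the three-element bbox, and the float category.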