DataFrame
A DataFrame represents a query against a specific table. Unlike computation container frameworks such as pandas or Dask, Pixeltable DataFrames do not hold data or allow you to update data (use insert/update/delete for that purpose). Another difference from pandas is that query execution must be initiated explicitly in order to return results.
Overview

Query Construction | |
---|---|
select | Select output expressions |
join | Join to another table |
where | Filter table rows |
group_by | Group table rows in order to apply aggregate functions |
order_by | Order output rows |
limit | Limit the number of output rows |

Query Execution | |
---|---|
collect | Return all output rows |
show | Return a number of output rows |
head | Return the oldest rows |
tail | Return the most recently added rows |

Data Export | |
---|---|
to_pytorch_dataset | Return the query result as a pytorch IterableDataset |
to_coco_dataset | Return the query result as a COCO dataset |
pixeltable.DataFrame
DataFrame(
from_clause: Optional[FromClause] = None,
select_list: Optional[list[tuple[Expr, Optional[str]]]] = None,
where_clause: Optional[Expr] = None,
group_by_clause: Optional[list[Expr]] = None,
grouping_tbl: Optional[TableVersion] = None,
order_by_clause: Optional[list[tuple[Expr, bool]]] = None,
limit: Optional[int] = None,
)
collect
collect() -> DataFrameResultSet
Execute the query and return all output rows as a DataFrameResultSet.
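Examples:
Given a table t with columns 'name' and 'age' (illustrative names), execute a query and materialize all matching rows:
>>> result = t.where(t.age > 30).select(t.name, t.age).collect()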
group_by
group_by(*grouping_items: Any) -> DataFrame
Add a group-by clause to this DataFrame.
Note that grouping is applied to the rows and takes effect when used with an aggregation function such as sum() or count().
Parameters:
-
grouping_items
(Any
, default:()
) –expressions to group by
Returns:
-
DataFrame
–A new DataFrame with the specified group-by clause.
Raises:
-
Error
–If the group-by clause is already specified, if any of the specified expressions is invalid or refers to tables not in the DataFrame, or if the DataFrame is the result of a join.
Examples:
Given the DataFrame book from a table t with all its columns and rows:
>>> book = t.select()
Group the above DataFrame book by the 'genre' column (referenced in table t):
>>> df = book.group_by(t.genre)
Group the DataFrame book by genre and count the number of books for each 'genre':
>>> df = book.group_by(t.genre).select(t.genre, count=count(t.genre)).show()
Group the DataFrame book by genre and compute the total price of books for each 'genre':
>>> df = book.group_by(t.genre).select(t.genre, total=sum(t.price)).show()
head
head(n: int = 10) -> DataFrameResultSet
Return the first n rows of the DataFrame, in insertion order of the underlying Table.
head() is not supported for joins.
Parameters:
-
n
(int
, default:10
) –Number of rows to select. Default is 10.
Returns:
-
DataFrameResultSet
–A DataFrameResultSet with the first n rows of the DataFrame.
Raises:
-
Error
–If the DataFrame is the result of a join or if the DataFrame has an order_by clause.
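Examples:
Given a table t (illustrative), return the 5 oldest rows in insertion order:
>>> first_rows = t.select().head(5)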
limit
limit(n: int) -> DataFrame
Limit the number of rows in the DataFrame.
Parameters:
-
n
(int
) –Number of rows to select.
Returns:
-
DataFrame
–A new DataFrame with the specified limited rows.
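Examples:
Given a table t (illustrative), return at most 100 rows:
>>> result = t.select().limit(100).collect()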
order_by
order_by(*expr_list: Expr, asc: bool = True) -> DataFrame
Add an order-by clause to this DataFrame.
Parameters:
-
expr_list
(Expr
, default:()
) –expressions to order by
-
asc
(bool
, default:True
) –whether to order in ascending order (True) or descending order (False). Default is True.
Returns:
-
DataFrame
–A new DataFrame with the specified order-by clause.
Raises:
-
Error
–If the order-by clause is already specified, or if any of the specified expressions is invalid or refers to tables not in the DataFrame.
Examples:
Given the DataFrame book from a table t with all its columns and rows:
>>> book = t.select()
Order the above DataFrame book by two columns (price, pages) in descending order:
>>> df = book.order_by(t.price, t.pages, asc=False)
Order the above DataFrame book by price in descending order, and by pages in ascending order:
>>> df = book.order_by(t.price, asc=False).order_by(t.pages)
select
select(*items: Any, **named_items: Any) -> DataFrame
Select columns or expressions from the DataFrame.
Parameters:
-
items
(Any
, default:()
) –expressions to be selected
-
named_items
(Any
, default:{}
) –named expressions to be selected
Returns:
-
DataFrame
–A new DataFrame with the specified select list.
Raises:
-
Error
–If the select list is already specified, or if any of the specified expressions are invalid, or refer to tables not in the DataFrame.
Examples:
Given the DataFrame person from a table t with all its columns and rows:
>>> person = t.select()
Select the columns 'name' and 'age' (referenced in table t) from the DataFrame person:
>>> df = person.select(t.name, t.age)
Select the column 'name' (referenced in table t) from the DataFrame person, and a named column 'is_adult' computed from the expression age >= 18, where 'age' is another column in table t:
>>> df = person.select(t.name, is_adult=(t.age >= 18))
join
join(
other: Table,
on: Optional[Union[Expr, Sequence[ColumnRef]]] = None,
how: LiteralType = "inner",
) -> DataFrame
Join this DataFrame with a table.
Parameters:
-
other
(Table
) –the table to join with
-
on
(Optional[Union[Expr, Sequence[ColumnRef]]]
, default:None
) –the join condition, which can be either a) references to one or more columns or b) a boolean expression.
- column references: imply an equality predicate that matches columns in both this DataFrame and other by name.
  - column in other: a column with the same name must be present in this DataFrame, and it must be unique (otherwise the join is ambiguous).
  - column in this DataFrame: a column with the same name must be present in other.
- boolean expression: the expression must be valid in the context of the joined tables.
-
how
(LiteralType
, default:'inner'
) –the type of join to perform:
- 'inner': only keep rows that have a match in both
- 'left': keep all rows from this DataFrame and only matching rows from the other table
- 'right': keep all rows from the other table and only matching rows from this DataFrame
- 'full_outer': keep all rows from both this DataFrame and the other table
- 'cross': Cartesian product; no on condition allowed
Returns:
-
DataFrame
–A new DataFrame.
Examples:
Perform an inner join between t1 and t2 on the column id:
>>> join1 = t1.join(t2, on=t2.id)
Perform a left outer join of join1 with t3, also on id (note that we can't specify on=t3.id
here,
because that would be ambiguous, since both t1 and t2 have a column named id):
>>> join2 = join1.join(t3, on=t2.id, how='left')
Do the same, but now with an explicit join predicate:
>>> join2 = join1.join(t3, on=t2.id == t3.id, how='left')
Join t with d, which has a composite primary key (columns pk1 and pk2, with corresponding foreign key columns d1 and d2 in t):
>>> df = t.join(d, on=(t.d1 == d.pk1) & (t.d2 == d.pk2), how='left')
show
show(n: int = 20) -> DataFrameResultSet
Return n output rows of the DataFrame as a DataFrameResultSet. Default is 20.
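Examples:
Given a table t (illustrative), return 5 output rows:
>>> t.select().show(5)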
tail
tail(n: int = 10) -> DataFrameResultSet
Return the last n rows of the DataFrame, in insertion order of the underlying Table.
tail() is not supported for joins.
Parameters:
-
n
(int
, default:10
) –Number of rows to select. Default is 10.
Returns:
-
DataFrameResultSet
–A DataFrameResultSet with the last n rows of the DataFrame.
Raises:
-
Error
–If the DataFrame is the result of a join or if the DataFrame has an order_by clause.
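Examples:
Given a table t (illustrative), return the 3 most recently added rows:
>>> latest = t.select().tail(3)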
to_pytorch_dataset
to_pytorch_dataset(
image_format: str = "pt",
) -> "torch.utils.data.IterableDataset"
Convert the dataframe to a pytorch IterableDataset suitable for parallel loading with torch.utils.data.DataLoader.
This method requires pyarrow >= 13, torch and torchvision to work.
This method serializes data so it can be read from disk efficiently and repeatedly without re-executing the query. This data is cached to disk for future re-use.
Parameters:
-
image_format
(str
, default:'pt'
) –format of the images. Can be 'pt' (pytorch tensor) or 'np' (numpy array). 'np' means image columns return as an RGB uint8 array of shape HxWxC. 'pt' means image columns return as a CxHxW tensor with values in [0,1] and type torch.float32. (the format output by torchvision.transforms.ToTensor())
Returns:
-
'torch.utils.data.IterableDataset'
–A pytorch IterableDataset: Columns become fields of the dataset, where rows are returned as a dictionary compatible with torch.utils.data.DataLoader default collation.
Constraints
The default collate_fn for torch.utils.data.DataLoader cannot represent null values as part of a pytorch tensor when forming batches. These values will raise an exception while running the dataloader.
If you have null values, you can work around them by providing a custom collate_fn to the DataLoader (and having your model handle them). Or, if they are not meaningful values within a minibatch, you can modify or remove them through selections and filters prior to calling to_pytorch_dataset().
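Examples:
A sketch (assuming a table t with an image column 'img' and an int column 'label'; requires torch and torchvision):
>>> ds = t.select(t.img, t.label).to_pytorch_dataset(image_format='pt')
>>> loader = torch.utils.data.DataLoader(ds, batch_size=32)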
to_coco_dataset
to_coco_dataset() -> Path
Convert the dataframe to a COCO dataset. This dataframe must return a single json-typed output column in the following format:
{
    'image': PIL.Image.Image,
    'annotations': [
        {
            'bbox': [x: int, y: int, w: int, h: int],
            'category': str | int,
        },
        ...
    ],
}
Returns:
-
Path
–Path to the COCO dataset file.
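Examples:
A sketch (assuming a table t with an image column 'img' and a json column 'detections' that already matches the required format; the inline dict in select is illustrative):
>>> path = t.select({'image': t.img, 'annotations': t.detections}).to_coco_dataset()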
where
where(pred: Expr) -> DataFrame
Filter rows based on a predicate.
Parameters:
-
pred
(Expr
) –the predicate to filter rows
Returns:
-
DataFrame
–A new DataFrame with the specified predicate as its where-clause.
Raises:
-
Error
–If the predicate is not a Pixeltable expression, or if it does not return a boolean value, or refers to tables not in the DataFrame.
Examples:
Given the DataFrame person from a table t with all its columns and rows:
>>> person = t.select()
Filter the above DataFrame person to only include rows where the column 'age' (referenced in table t) is greater than 30:
>>> df = person.where(t.age > 30)