pixeltable.io
create_label_studio_project
create_label_studio_project(
t: Table,
label_config: str,
name: Optional[str] = None,
title: Optional[str] = None,
media_import_method: Literal["post", "file", "url"] = "post",
col_mapping: Optional[dict[str, str]] = None,
sync_immediately: bool = True,
s3_configuration: Optional[dict[str, Any]] = None,
**kwargs: Any
) -> SyncStatus
Create a new Label Studio project and link it to the specified Table.
- A tutorial notebook with fully worked examples can be found here: Using Label Studio for Annotations with Pixeltable
The required parameter label_config specifies the Label Studio project configuration,
in XML format, as described in the Label Studio documentation. The linked project will
have one column for each data field in the configuration; for example, if the
configuration has an entry
<Image name="image_obj" value="$image"/>
then the linked project will have a column named image. In addition, the linked project
will always have a JSON-typed column annotations representing the output.
By default, Pixeltable will link each of these columns to a column of the specified Table
with the same name. If any of the data fields are missing, an exception will be raised. If
the annotations column is missing, it will be created. The default names can be overridden
by specifying an optional col_mapping, with Pixeltable column names as keys and Label
Studio field names as values. In all cases, the Pixeltable columns must have types that are
consistent with their corresponding Label Studio fields; otherwise, an exception will be raised.
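The mapping rule above can be illustrated without a Label Studio server. The following is a minimal sketch, not Pixeltable's own implementation: it assumes that a data field is any `value` attribute beginning with `$`, and the helper names `data_fields` and `resolve_columns` are hypothetical. It lists the fields a configuration declares and shows how a col_mapping (Pixeltable column names as keys, Label Studio field names as values) would pair them with table columns.

```python
import xml.etree.ElementTree as ET


def data_fields(label_config: str) -> list[str]:
    """Return the data fields referenced as value="$field" (assumed rule)."""
    root = ET.fromstring(label_config)
    fields = []
    for elem in root.iter():
        value = elem.get('value', '')
        if value.startswith('$') and value[1:] not in fields:
            fields.append(value[1:])
    return fields


def resolve_columns(fields, table_columns, col_mapping=None):
    """Pair each Label Studio field with a table column.

    col_mapping has Pixeltable column names as keys and Label Studio
    field names as values; unmapped fields default to the same name.
    """
    col_mapping = col_mapping or {}
    field_to_col = {field: col for col, field in col_mapping.items()}
    resolved = {}
    for field in fields:
        col = field_to_col.get(field, field)
        if col not in table_columns:
            raise ValueError(f'no column {col!r} for data field {field!r}')
        resolved[field] = col
    return resolved


config = """
<View>
  <Image name="image_obj" value="$image"/>
  <Choices name="kind" toName="image_obj">
    <Choice value="cat"/>
    <Choice value="dog"/>
  </Choices>
</View>"""

fields = data_fields(config)
# The table stores its images in a column called 'frame', so it is mapped
# explicitly to the Label Studio field 'image'.
columns = resolve_columns(fields, {'frame', 'caption'}, col_mapping={'frame': 'image'})
```

Without the col_mapping argument, the same call would raise, because the hypothetical table has no column named image.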
The API key and URL for a valid Label Studio server must be specified in Pixeltable config. Either:
- Set the LABEL_STUDIO_API_KEY and LABEL_STUDIO_URL environment variables; or
- Specify api_key and url fields in the label-studio section of $PIXELTABLE_HOME/config.yaml.
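As a minimal sketch of the environment-variable option, the two settings can be supplied from Python before Pixeltable is used. The URL and key below are placeholders, and the assumption that Pixeltable picks the variables up when it initializes is not confirmed by this page.

```python
import os

# Placeholder values: substitute the URL and API key of your own server.
os.environ.setdefault('LABEL_STUDIO_URL', 'http://localhost:8080')
os.environ.setdefault('LABEL_STUDIO_API_KEY', 'your-api-key')

# Fail early with a clear message if either setting is empty.
missing = [
    key
    for key in ('LABEL_STUDIO_URL', 'LABEL_STUDIO_API_KEY')
    if not os.environ.get(key)
]
if missing:
    raise RuntimeError(f'missing Label Studio settings: {missing}')
```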
Requirements:
- pip install label-studio-sdk
- pip install boto3 (if using S3 import storage)
Parameters:
- t (Table) – The table to link to.
- label_config (str) – The Label Studio project configuration, in XML format.
- name (Optional[str], default: None) – An optional name for the new project in Pixeltable. If specified, must be a valid Pixeltable identifier and must not be the name of any other external data store linked to t. If not specified, a default name will be used of the form ls_project_0, ls_project_1, etc.
- title (Optional[str], default: None) – An optional title for the Label Studio project. This is the title that annotators will see inside Label Studio. Unlike name, it does not need to be an identifier and does not need to be unique. If not specified, the table name t.name will be used.
- media_import_method (Literal['post', 'file', 'url'], default: 'post') – The method to use when transferring media files to Label Studio:
  - post: Media will be sent to Label Studio via HTTP post. This should generally only be used for prototyping; due to restrictions in Label Studio, it can only be used with projects that have just one data field, and does not scale well.
  - file: Media will be sent to Label Studio as a file on the local filesystem. This method can be used if Pixeltable and Label Studio are running on the same host.
  - url: Media will be sent to Label Studio as externally accessible URLs. This method cannot be used with local media files or with media generated by computed columns.
  The default is post.
- col_mapping (Optional[dict[str, str]], default: None) – An optional mapping of local column names to Label Studio fields.
- sync_immediately (bool, default: True) – If True, immediately perform an initial synchronization by exporting all rows of the table as Label Studio tasks.
- s3_configuration (Optional[dict[str, Any]], default: None) – If specified, S3 import storage will be configured for the new project. This can only be used with media_import_method='url', and if media_import_method='url' and any of the media data is referenced by s3:// URLs, then it must be specified in order for such media to display correctly in the Label Studio interface.
  The items in the s3_configuration dictionary correspond to kwarg parameters of the Label Studio connect_s3_import_storage method, as described in the Label Studio connect_s3_import_storage docs. bucket must be specified; all other parameters are optional. If credentials are not specified explicitly, Pixeltable will attempt to retrieve them from the environment (such as from ~/.aws/credentials). If a title is not specified, Pixeltable will use the default 'Pixeltable-S3-Import-Storage'. All other parameters use their Label Studio defaults.
- kwargs (Any, default: {}) – Additional keyword arguments are passed to the start_project method in the Label Studio SDK, as described in the Label Studio start_project docs.
Returns:
- SyncStatus – A SyncStatus representing the status of any synchronization operations that occurred.
Examples:
Create a Label Studio project whose tasks correspond to videos stored in the video_col column of the table tbl:
>>> config = """
    <View>
        <Video name="video_obj" value="$video_col"/>
        <Choices name="video-category" toName="video_obj" showInLine="true">
            <Choice value="city"/>
            <Choice value="food"/>
            <Choice value="sports"/>
        </Choices>
    </View>"""
>>> create_label_studio_project(tbl, config)
Create a Label Studio project with the same configuration, using media_import_method='url',
whose media are stored in an S3 bucket:
>>> create_label_studio_project(
        tbl,
        config,
        media_import_method='url',
        s3_configuration={'bucket': 'my-bucket', 'region_name': 'us-east-2'}
    )
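The constraints on s3_configuration described under Parameters can be checked before calling create_label_studio_project. The helper below is a hypothetical pre-flight check, not part of Pixeltable; it only encodes the rules stated on this page (the 'url' requirement, the mandatory bucket, and the default title).

```python
from typing import Any, Optional


def check_s3_configuration(
    media_import_method: str,
    s3_configuration: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Validate s3_configuration and return a copy with the default title filled in."""
    if s3_configuration is None:
        return None
    if media_import_method != 'url':
        raise ValueError("s3_configuration requires media_import_method='url'")
    if 'bucket' not in s3_configuration:
        raise ValueError("s3_configuration must specify 'bucket'")
    checked = dict(s3_configuration)
    # Documented default title, used when none is given.
    checked.setdefault('title', 'Pixeltable-S3-Import-Storage')
    return checked


s3 = check_s3_configuration('url', {'bucket': 'my-bucket', 'region_name': 'us-east-2'})
```

Running the check first surfaces a missing bucket or a mismatched import method before any project is created.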
import_csv
import_csv(
tbl_name: str,
filepath_or_buffer,
schema_overrides: Optional[dict[str, ColumnType]] = None,
primary_key: Optional[Union[str, list[str]]] = None,
num_retained_versions: int = 10,
comment: str = "",
**kwargs
) -> Table
import_excel
import_excel(
tbl_name: str,
io,
*args,
schema_overrides: Optional[dict[str, ColumnType]] = None,
primary_key: Optional[Union[str, list[str]]] = None,
num_retained_versions: int = 10,
comment: str = "",
**kwargs
) -> Table
Creates a new base table from an Excel (.xlsx) file. This is a convenience method and is
equivalent to calling import_pandas(tbl_name, pd.read_excel(io, *args, **kwargs), schema_overrides=schema_overrides).
See the Pandas documentation for read_excel for more details.
Returns:
import_huggingface_dataset
import_huggingface_dataset(
table_path: str,
dataset: Union[Dataset, DatasetDict],
*,
column_name_for_split: Optional[str] = None,
schema_overrides: Optional[dict[str, Any]] = None,
**kwargs: Any
) -> Table
Creates a new base table from a Hugging Face dataset, or from a dataset dict with multiple splits.
Requires the datasets library to be installed.
Parameters:
- table_path (str) – Path to the table.
- dataset (Union[Dataset, DatasetDict]) – Hugging Face datasets.Dataset or datasets.DatasetDict to insert into the table.
- column_name_for_split (Optional[str], default: None) – Column name to use for split information. If None, no split information will be stored.
- schema_overrides (Optional[dict[str, Any]], default: None) – If specified, then for each (name, type) pair in schema_overrides, the column with name name will be given type type, instead of being inferred from the Dataset or DatasetDict. The keys in schema_overrides should be the column names of the Dataset or DatasetDict (whether or not they are valid Pixeltable identifiers).
- kwargs (Any, default: {}) – Additional arguments to pass to create_table.
Returns:
import_json
import_json(
tbl_path: str,
filepath_or_url: str,
*,
schema_overrides: Optional[dict[str, ColumnType]] = None,
primary_key: Optional[Union[str, list[str]]] = None,
num_retained_versions: int = 10,
comment: str = "",
**kwargs: Any
) -> Table
Creates a new base table from a JSON file. This is a convenience method and is
equivalent to calling import_rows(tbl_path, json.loads(file_contents, **kwargs), ...),
where file_contents is the contents of the specified filepath_or_url.
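To illustrate how the extra keyword arguments reach json.loads, the sketch below parses a small JSON document with parse_int=float, a standard json.loads option. The JSON content is made up for illustration, and the sketch assumes the file holds a list of objects, which is the shape that import_rows() accepts; the table itself is not created here.

```python
import json

# Made-up file contents: one score is an integer literal, the other a float.
file_contents = '[{"name": "a", "score": 1}, {"name": "b", "score": 2.5}]'

# parse_int=float converts integer literals to floats, so every score
# value has the same Python type before any schema inference happens.
rows = json.loads(file_contents, parse_int=float)
score_types = {type(row['score']) for row in rows}
```

Whether this changes the inferred column type depends on Pixeltable's inference rules, which this page does not spell out.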
Parameters:
- tbl_path (str) – The name of the table to create.
- filepath_or_url (str) – The path or URL of the JSON file.
- schema_overrides (Optional[dict[str, ColumnType]], default: None) – If specified, then columns in schema_overrides will be given the specified types (see import_rows()).
- primary_key (Optional[Union[str, list[str]]], default: None) – The primary key of the table (see create_table()).
- num_retained_versions (int, default: 10) – The number of retained versions of the table (see create_table()).
- comment (str, default: '') – A comment to attach to the table (see create_table()).
- kwargs (Any, default: {}) – Additional keyword arguments to pass to json.loads.
Returns:
import_pandas
import_pandas(
tbl_name: str,
df: DataFrame,
*,
schema_overrides: Optional[dict[str, ColumnType]] = None,
primary_key: Optional[Union[str, list[str]]] = None,
num_retained_versions: int = 10,
comment: str = ""
) -> Table
Creates a new base table from a Pandas DataFrame, with the
specified name. The schema of the table will be inferred from the DataFrame.
The column names of the new table will be identical to those in the DataFrame, as long as they are valid Pixeltable identifiers. If a column name is not a valid Pixeltable identifier, it will be normalized according to the following procedure:
- first replace any non-alphanumeric characters with underscores;
- then, preface the result with the letter 'c' if it begins with a number or an underscore;
- then, if there are any duplicate column names, suffix the duplicates with '_2', '_3', etc., in column order.
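The normalization procedure can be sketched in a few lines of Python. This is an illustration of the three documented steps, not Pixeltable's actual code; in particular, the definition of a valid identifier used here (an ASCII letter followed by letters, digits, or underscores) is an assumption, and `normalize_column_names` is a hypothetical name.

```python
import re

# Assumption: a valid identifier starts with an ASCII letter and contains
# only ASCII letters, digits, and underscores.
_VALID = re.compile(r'[A-Za-z][A-Za-z0-9_]*')


def normalize_column_names(names: list[str]) -> list[str]:
    normalized = []
    counts: dict[str, int] = {}
    for name in names:
        ident = name
        if not _VALID.fullmatch(name):
            # Step 1: replace non-alphanumeric characters with underscores.
            ident = re.sub(r'[^A-Za-z0-9]', '_', name)
            # Step 2: prefix 'c' if the result starts with a digit or underscore.
            if not ident or ident[0].isdigit() or ident[0] == '_':
                ident = 'c' + ident
        # Step 3: suffix duplicates with _2, _3, ... in column order.
        counts[ident] = counts.get(ident, 0) + 1
        n = counts[ident]
        normalized.append(ident if n == 1 else f'{ident}_{n}')
    return normalized


result = normalize_column_names(['first name', 'first_name', '2024', '_id', 'ok'])
```

The sketch does not guard against a suffixed name colliding with another existing column, which a real implementation would need to consider.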
Parameters:
- tbl_name (str) – The name of the table to create.
- df (DataFrame) – The Pandas DataFrame.
- schema_overrides (Optional[dict[str, ColumnType]], default: None) – If specified, then for each (name, type) pair in schema_overrides, the column with name name will be given type type, instead of being inferred from the DataFrame. The keys in schema_overrides should be the column names of the DataFrame (whether or not they are valid Pixeltable identifiers).
Returns:
import_parquet
import_parquet(
table_path: str,
*,
parquet_path: str,
schema_overrides: Optional[Dict[str, ColumnType]] = None,
**kwargs: Any
) -> Table
Creates a new base table from a Parquet file or set of files. Requires pyarrow to be installed.
Parameters:
- table_path (str) – Path to the table.
- parquet_path (str) – Path to an individual Parquet file or directory of Parquet files.
- schema_overrides (Optional[Dict[str, ColumnType]], default: None) – If specified, then for each (name, type) pair in schema_overrides, the column with name name will be given type type, instead of being inferred from the Parquet dataset. The keys in schema_overrides should be the column names of the Parquet dataset (whether or not they are valid Pixeltable identifiers).
- kwargs (Any, default: {}) – Additional arguments to pass to create_table.
Returns:
import_rows
import_rows(
tbl_path: str,
rows: list[dict[str, Any]],
*,
schema_overrides: Optional[dict[str, ColumnType]] = None,
primary_key: Optional[Union[str, list[str]]] = None,
num_retained_versions: int = 10,
comment: str = ""
) -> Table
Creates a new base table from a list of dictionaries. The dictionaries must be of the
form {column_name: value, ...}. Pixeltable will attempt to infer the schema of the table from the
supplied data, using the most specific type that can represent all the values in a column.
If schema_overrides is specified, then for each entry (column_name, type) in schema_overrides,
Pixeltable will force the specified column to the specified type (and will not attempt any type inference
for that column).
All column types of the new table will be nullable unless explicitly specified as non-nullable in
schema_overrides.
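A minimal usage sketch follows, assuming Pixeltable is installed and that import_rows is reachable as pxt.io.import_rows (this page documents the pixeltable.io module). The row data, the table name 'chapters', and the helper `load` are made up for illustration. No schema_overrides are passed, so every column is left to type inference and is nullable.

```python
from typing import Any

# Each dictionary has the form {column_name: value, ...}.
rows: list[dict[str, Any]] = [
    {'title': 'Intro', 'pages': 12, 'rating': 4.5},
    {'title': 'Methods', 'pages': 30, 'rating': None},
]


def load(tbl_path: str, rows: list[dict[str, Any]]):
    """Create a base table from rows; requires Pixeltable to be installed."""
    # Imported lazily so the data preparation above runs without Pixeltable.
    import pixeltable as pxt

    return pxt.io.import_rows(tbl_path, rows, comment='Imported from a list of dicts')
```

Calling `load('chapters', rows)` would then create the table and return it.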
Parameters:
- tbl_path (str) – The qualified name of the table to create.
- rows (list[dict[str, Any]]) – The list of dictionaries to import.
- schema_overrides (Optional[dict[str, ColumnType]], default: None) – If specified, then columns in schema_overrides will be given the specified types as described above.
- primary_key (Optional[Union[str, list[str]]], default: None) – The primary key of the table (see create_table()).
- num_retained_versions (int, default: 10) – The number of retained versions of the table (see create_table()).
- comment (str, default: '') – A comment to attach to the table (see create_table()).
Returns: