Data
ArrowData
Bases: DataInterface
Source code in python/opsml/data/_data.pyi
__init__(data=None, data_splits=None, dependent_vars=None, sql_logic=None, data_profile=None)
Define a data interface

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | Table \| None | PyArrow Table | None |
| dependent_vars | DependentVars \| List[str] \| List[int] \| None | List of dependent variables to associate with data | None |
| data_splits | DataSplits \| List[DataSplit] | Optional list of `DataSplit` | None |
| sql_logic | SqlLogic \| None | Sql logic used to generate data, represented as a dictionary | None |
| data_profile | DataProfile \| None | Data profile | None |
Source code in python/opsml/data/_data.pyi
load(path, metadata, load_kwargs=None)
Load the data from a file

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to load the data from | required |
| metadata | DataInterfaceSaveMetadata | Metadata associated with the data | required |
| load_kwargs | DataLoadKwargs | Additional kwargs to pass in | None |
Acceptable load kwargs
- columns (list | None): If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'. If empty, no columns will be read. Default is None.
- use_threads (bool): Perform multi-threaded column reads. Default is True.
- schema (Schema | None): Optionally provide the Schema for the parquet dataset, in which case it will not be inferred from the source. Default is None.
- use_pandas_metadata (bool): If True and file has custom pandas schema metadata, ensure that index columns are also loaded. Default is False.
- read_dictionary (list | None): List of names or column paths (for nested types) to read directly as DictionaryArray. Only supported for BYTE_ARRAY storage. Default is None.
- memory_map (bool): If the source is a file path, use a memory map to read file, which can improve performance in some environments. Default is False.
- buffer_size (int): If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered. Default is 0.
- partitioning (pyarrow.dataset.Partitioning | str | list of str): The partitioning scheme for a partitioned dataset. Default is 'hive'.
- filesystem (FileSystem | None): If nothing passed, will be inferred based on path. Default is None.
- filters (pyarrow.compute.Expression | list[tuple] | list[list[tuple]] | None): Rows which do not match the filter predicate will be removed from scanned data. Default is None.
- use_legacy_dataset (bool | None): Deprecated and has no effect from PyArrow version 15.0.0. Default is None.
- ignore_prefixes (list | None): Files matching any of these prefixes will be ignored by the discovery process. Default is ['.', '_'].
- pre_buffer (bool): Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). Default is True.
- coerce_int96_timestamp_unit (str | None): Cast timestamps that are stored in INT96 format to a particular resolution (e.g. 'ms'). Default is None.
- decryption_properties (FileDecryptionProperties | None): File-level decryption properties. Default is None.
- thrift_string_size_limit (int | None): If not None, override the maximum total string size allocated when decoding Thrift structures. Default is None.
- thrift_container_size_limit (int | None): If not None, override the maximum total size of containers allocated when decoding Thrift structures. Default is None.
- page_checksum_verification (bool): If True, verify the checksum for each page read from the file. Default is False.
Additional Information
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
Source code in python/opsml/data/_data.pyi
save(path, save_kwargs=None)
Saves pyarrow table to parquet via write_table

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to save the data to | required |
| save_kwargs | DataSaveKwargs | Additional kwargs to pass in | None |
Acceptable save kwargs
- row_group_size (int | None): Maximum number of rows in each written row group. If None, the row group size will be the minimum of the Table size and 1024 * 1024. Default is None.
- version ({'1.0', '2.4', '2.6'}): Determine which Parquet logical types are available for use. Default is '2.6'.
- use_dictionary (bool | list): Specify if dictionary encoding should be used in general or only for some columns. Default is True.
- compression (str | dict): Specify the compression codec, either on a general basis or per-column. Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'}. Default is 'snappy'.
- write_statistics (bool | list): Specify if statistics should be written in general or only for some columns. Default is True.
- use_deprecated_int96_timestamps (bool | None): Write timestamps to INT96 Parquet format. Default is None.
- coerce_timestamps (str | None): Cast timestamps to a particular resolution. Valid values: {None, 'ms', 'us'}. Default is None.
- allow_truncated_timestamps (bool): Allow loss of data when coercing timestamps to a particular resolution. Default is False.
- data_page_size (int | None): Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). Default is None.
- flavor ({'spark'} | None): Sanitize schema or set other compatibility options to work with various target systems. Default is None.
- filesystem (FileSystem | None): Filesystem object to use when reading the parquet file. Default is None.
- compression_level (int | dict | None): Specify the compression level for a codec, either on a general basis or per-column. Default is None.
- use_byte_stream_split (bool | list): Specify if the byte_stream_split encoding should be used in general or only for some columns. Default is False.
- column_encoding (str | dict | None): Specify the encoding scheme on a per column basis. Default is None.
- data_page_version ({'1.0', '2.0'}): The serialized Parquet data page format version to write. Default is '1.0'.
- use_compliant_nested_type (bool): Whether to write compliant Parquet nested type (lists). Default is True.
- encryption_properties (FileEncryptionProperties | None): File encryption properties for Parquet Modular Encryption. Default is None.
- write_batch_size (int | None): Number of values to write to a page at a time. Default is None.
- dictionary_pagesize_limit (int | None): Specify the dictionary page size limit per row group. Default is None.
- store_schema (bool): By default, the Arrow schema is serialized and stored in the Parquet file metadata. Default is True.
- write_page_index (bool): Whether to write a page index in general for all columns. Default is False.
- write_page_checksum (bool): Whether to write page checksums in general for all columns. Default is False.
- sorting_columns (Sequence[SortingColumn] | None): Specify the sort order of the data being written. Default is None.
- store_decimal_as_integer (bool): Allow decimals with 1 <= precision <= 18 to be stored as integers. Default is False.
- **kwargs: Additional options for ParquetWriter.
Additional Information
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
Source code in python/opsml/data/_data.pyi
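A minimal usage sketch (not part of the generated reference): it assumes the classes are importable from `opsml.data`, matching the source path above, and that the `data` dictionary on `DataSaveKwargs` is forwarded to `pyarrow.parquet.write_table` as documented.

```python
from pathlib import Path

import pyarrow as pa
from opsml.data import ArrowData, DataSaveKwargs  # assumed import path

# Build a small PyArrow table and wrap it in the interface
table = pa.table({"feature": [1.0, 2.0, 3.0], "target": [0, 1, 0]})
interface = ArrowData(data=table, dependent_vars=["target"])

# Kwargs in the `data` dict are forwarded to pyarrow.parquet.write_table
interface.save(Path("artifacts"), DataSaveKwargs(data={"compression": "zstd"}))
```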
ColumnSplit
Source code in python/opsml/data/_data.pyi
__init__(column_name, column_value, column_type=ColType.Builtin, inequality=None)
Define a column split

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| column_name | str | The name of the column | required |
| column_value | Union[str, float, int] | The value of the column. Can be a string, float, or int. If timestamp, convert to isoformat (str) and specify timestamp coltype | required |
| column_type | ColType | The type of the column. Defaults to ColType.Builtin. If providing ColType.Timestamp, the column_value should be a float | Builtin |
| inequality | Optional[Union[str, Inequality]] | The inequality of the column | None |
Source code in python/opsml/data/_data.pyi
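A brief sketch of a column-based split, assuming the classes are importable from `opsml.data`:

```python
from opsml.data import ColumnSplit, DataSplit  # assumed import path

# Rows where `age` >= 30 form the "train" split in this sketch
col_split = ColumnSplit(
    column_name="age",
    column_value=30,
    inequality=">=",
)
split = DataSplit(label="train", column_split=col_split)
```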
DataInterface
Source code in python/opsml/data/_data.pyi
data (property, writable)
Returns the data

data_profile (property, writable)
Return the data profile

Returns:

| Type | Description |
|---|---|
| Optional[DataProfile] | The data profile |

data_splits (property, writable)
Returns the data splits.

data_type (property)
Return the data type.

dependent_vars (property, writable)
Returns the dependent variables.

schema (property, writable)
Returns the feature map.

sql_logic (property)
Returns the sql logic.
__init__(data=None, data_splits=None, dependent_vars=None, sql_logic=None, data_profile=None)
Define a data interface

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | Any | Data. Can be a pyarrow table, pandas dataframe, polars dataframe or numpy array | None |
| dependent_vars | DependentVars | List of dependent variables to associate with data | None |
| data_splits | DataSplits | Optional list of `DataSplit` | None |
| sql_logic | SqlLogic | SqlLogic class used to generate data | None |
| data_profile | DataProfile | Data profile | None |
Source code in python/opsml/data/_data.pyi
add_sql_logic(name, query=None, filepath=None)
Add sql logic to the data interface

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the sql logic | required |
| query | Optional[str] | The optional query to use | None |
| filepath | Optional[str] | The optional filepath to open the query from | None |
Source code in python/opsml/data/_data.pyi
create_data_profile(bin_size=20, compute_correlations=False)
Create a data profile

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| bin_size | int | The bin size for the data profile | 20 |
| compute_correlations | bool | Whether to compute correlations | False |
Source code in python/opsml/data/_data.pyi
load(path, metadata, load_kwargs=None)
Load the data from a file

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to load the data from | required |
| metadata | DataInterfaceSaveMetadata | Metadata associated with the data | required |
| load_kwargs | DataLoadKwargs | Additional kwargs to pass in | None |
Source code in python/opsml/data/_data.pyi
save(path, save_kwargs=None)
Saves all data interface components to the given path. This is used as part of saving a DataCard.

Methods called in save:
- save_sql: Saves all sql logic to file(s)
- create_schema: Creates a FeatureSchema from the associated data
- save_data: Saves the data to a file

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | The path to save the data interface components to | required |
| save_kwargs | DataSaveKwargs | The save kwargs to use | None |
Source code in python/opsml/data/_data.pyi
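The base-class methods are typically used through a concrete subclass. A short sketch, assuming `PandasData` is importable from `opsml.data` and behaves as documented above:

```python
import pandas as pd
from opsml.data import PandasData  # assumed import path; any DataInterface subclass works

df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
interface = PandasData(data=df)

# Attach SQL lineage and a profile before the interface is saved with a DataCard
interface.add_sql_logic(name="source_query", query="SELECT x, y FROM training_table")
interface.create_data_profile(bin_size=10, compute_correlations=True)
```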
DataInterfaceMetadata
Source code in python/opsml/data/_data.pyi
__init__(save_metadata, schema, extra_metadata, sql_logic, interface_type, data_splits, dependent_vars, data_type)
Instantiate DataInterfaceMetadata object

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| save_metadata | DataInterfaceSaveMetadata | The save metadata | required |
| schema | FeatureSchema | The schema | required |
| extra_metadata | dict[str, str] | Extra metadata | required |
| sql_logic | SqlLogic | Sql logic | required |
| interface_type | DataInterfaceType | The interface type | required |
| data_splits | DataSplits | The data splits | required |
| dependent_vars | DependentVars | Dependent variables | required |
| data_type | DataType | The data type | required |
Source code in python/opsml/data/_data.pyi
__str__()

model_dump_json()
DataInterfaceSaveMetadata
Source code in python/opsml/data/_data.pyi
__init__(data_uri, data_profile_uri=None, sql_uri=None, extra=None, save_kwargs=None)
Define interface save metadata

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data_uri | Path | The data uri | required |
| data_profile_uri | Optional[Path] | The data profile uri | None |
| sql_uri | Optional[Path] | The sql uri | None |
| extra | Optional[ExtraMetadata] | Extra metadata | None |
| save_kwargs | Optional[DataSaveKwargs] | Save kwargs | None |
Source code in python/opsml/data/_data.pyi
DataLoadKwargs
Source code in python/opsml/data/_data.pyi
__init__(data=None)
Optional arguments to pass when loading data

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | Dict | Optional data arguments to use when loading | None |
DataSaveKwargs
Source code in python/opsml/data/_data.pyi
__init__(data=None)
Optional arguments to pass when saving data

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | Dict | Optional data arguments to use when saving | None |
DataSplit
Source code in python/opsml/data/_data.pyi
__init__(label, column_split=None, start_stop_split=None, indice_split=None)
Define a data split

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| label | str | The label of the split | required |
| column_split | Optional[ColumnSplit] | The column split | None |
| start_stop_split | Optional[StartStopSplit] | The start stop split | None |
| indice_split | Optional[IndiceSplit] | The indice split | None |
Source code in python/opsml/data/_data.pyi
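A small sketch of labeled splits, assuming the classes are importable from `opsml.data`:

```python
from opsml.data import DataSplit, IndiceSplit, StartStopSplit  # assumed import path

# Each DataSplit carries a label plus one split definition
train = DataSplit(label="train", start_stop_split=StartStopSplit(start=0, stop=800))
test = DataSplit(label="test", indice_split=IndiceSplit(indices=[800, 801, 802]))
```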
DataSplits
Source code in python/opsml/data/_data.pyi
splits (property, writable)
Return the splits

__init__(splits)

__str__()
split_data(data, data_type, dependent_vars)
Split the data

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | Any | The data to split | required |
| data_type | DataType | The data type | required |
| dependent_vars | DependentVars | Dependent variables to associate with the data | required |

Returns:

| Type | Description |
|---|---|
| Dict[str, Data] | A dictionary of data splits |
Source code in python/opsml/data/_data.pyi
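A sketch of splitting a dataframe, assuming the classes are importable from `opsml.data`; the `DataType` enum location (`opsml.types`) and the `DataType.Pandas` member name are assumptions for illustration.

```python
import pandas as pd
from opsml.data import DataSplit, DataSplits, DependentVars, StartStopSplit  # assumed import path
from opsml.types import DataType  # assumed location and member name

df = pd.DataFrame({"x": range(10), "y": range(10)})
splits = DataSplits(
    [
        DataSplit(label="train", start_stop_split=StartStopSplit(start=0, stop=8)),
        DataSplit(label="test", start_stop_split=StartStopSplit(start=8, stop=10)),
    ]
)

# Returns a dict keyed by split label, per the Returns table above
result = splits.split_data(df, DataType.Pandas, DependentVars(column_names=["y"]))
```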
DataSplitter
Source code in python/opsml/data/_data.pyi
split_data(split, data, data_type, dependent_vars)
staticmethod

Create a split

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| split | DataSplit | The data split to use to split the data | required |
| data | Any | The data to split | required |
| data_type | DataType | The data type | required |
| dependent_vars | DependentVars | Dependent variables to associate with the data | required |

Returns:

| Type | Description |
|---|---|
| Data | A Data object |
Source code in python/opsml/data/_data.pyi
DependentVars
Source code in python/opsml/data/_data.pyi
column_indices (property, writable)
Return the column indices

column_names (property, writable)
Return the column names
__init__(column_names=None, column_indices=None)
Define dependent variables for the data interface. User can specify either column names or column indices.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| column_names | Optional[List[str]] | The column names of the dependent variables | None |
| column_indices | Optional[List[int]] | The column indices of the dependent variables | None |
Source code in python/opsml/data/_data.pyi
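A brief sketch, assuming `DependentVars` is importable from `opsml.data`:

```python
from opsml.data import DependentVars  # assumed import path

# Name-based targets for dataframe-like data...
by_name = DependentVars(column_names=["target"])

# ...or index-based targets for positional data such as numpy arrays
by_index = DependentVars(column_indices=[3])
```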
IndiceSplit
Source code in python/opsml/data/_data.pyi
__init__(indices)
Define an indice split

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| indices | List[int] | The indices of the split | required |
NumpyData
Bases: DataInterface
Source code in python/opsml/data/_data.pyi
__init__(data=None, data_splits=None, dependent_vars=None, sql_logic=None, data_profile=None)
Define a data interface

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | NDArray \| None | Numpy array | None |
| dependent_vars | DependentVars \| List[str] \| List[int] \| None | List of dependent variables to associate with data | None |
| data_splits | DataSplits \| List[DataSplit] | Optional list of `DataSplit` | None |
| sql_logic | SqlLogic \| None | Sql logic used to generate data, represented as a dictionary | None |
| data_profile | DataProfile \| None | Data profile | None |
Source code in python/opsml/data/_data.pyi
load(path, metadata, load_kwargs=None)
Load the data via numpy.load

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to load the data from | required |
| metadata | DataInterfaceSaveMetadata | Metadata associated with the data | required |
| load_kwargs | DataLoadKwargs | Additional kwargs to use when loading | None |
Acceptable load kwargs:
see: https://numpy.org/doc/stable/reference/generated/numpy.load.html
mmap_mode:
If not None, then memory-map the file, using the given mode
allow_pickle (bool):
Allow loading pickled object arrays stored in npy files
fix_imports (bool):
If fix_imports is True, pickle will try to map the old Python 2 names to the new names used in Python 3.
encoding (str):
What encoding to use when reading Python 2 strings. Only useful when py3k is True.
max_header_size (int):
The maximum size of the file header
Source code in python/opsml/data/_data.pyi
save(path, save_kwargs=None)
Save data using numpy save format

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to save the data to | required |
| save_kwargs | DataSaveKwargs | Additional kwargs to pass in | None |
Acceptable save kwargs:
see: https://numpy.org/doc/stable/reference/generated/numpy.save.html
allow_pickle (bool):
Allow saving object arrays using Python pickles.
fix_imports (bool):
The fix_imports flag is deprecated and has no effect
Source code in python/opsml/data/_data.pyi
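A usage sketch, assuming the classes are importable from `opsml.data` and that the `data` dictionary is forwarded to `numpy.save` as documented above:

```python
from pathlib import Path

import numpy as np
from opsml.data import DataSaveKwargs, NumpyData  # assumed import path

array = np.random.rand(100, 4)
interface = NumpyData(data=array, dependent_vars=[3])  # last column as the target

# Saved with numpy.save; allow_pickle is forwarded as a save kwarg
interface.save(Path("artifacts"), DataSaveKwargs(data={"allow_pickle": False}))
```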
PandasData
Bases: DataInterface
Source code in python/opsml/data/_data.pyi
__init__(data=None, data_splits=None, dependent_vars=None, sql_logic=None, data_profile=None)
Define a data interface

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame \| None | Pandas dataframe | None |
| dependent_vars | DependentVars \| List[str] \| List[int] \| None | List of dependent variables to associate with data | None |
| data_splits | DataSplits \| List[DataSplit] | Optional list of `DataSplit` | None |
| sql_logic | SqlLogic \| None | Sql logic used to generate data, represented as a dictionary | None |
| data_profile | DataProfile \| None | Data profile | None |
Source code in python/opsml/data/_data.pyi
load(path, metadata, load_kwargs=None)
Load the pandas dataframe from a parquet dataset via read_parquet

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to load the data from | required |
| metadata | DataInterfaceSaveMetadata | Metadata associated with the data | required |
| load_kwargs | DataLoadKwargs | Additional kwargs to pass in | None |
Acceptable load kwargs
- engine ({'auto', 'pyarrow', 'fastparquet'}): Parquet library to use. If 'auto', then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try 'pyarrow', falling back to 'fastparquet' if 'pyarrow' is unavailable. Default is 'auto'.
- columns (list | None): If not None, only these columns will be read from the file. Default is None.
- storage_options (dict | None): Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://" and "gcs://") the key-value pairs are forwarded to fsspec.open. Default is None.
- use_nullable_dtypes (bool): If True, use dtypes that use pd.NA as missing value indicator for the resulting DataFrame (only applicable for the pyarrow engine). As new dtypes are added that support pd.NA in the future, the output with this option will change to use those dtypes. Note: this is an experimental option, and behaviour (e.g. additional support dtypes) may change without notice. Default is False. Deprecated since version 2.0.
- dtype_backend ({'numpy_nullable', 'pyarrow'}): Back-end data type applied to the resultant DataFrame (still experimental). "numpy_nullable" returns a nullable-dtype-backed DataFrame (default); "pyarrow" returns a pyarrow-backed nullable ArrowDtype DataFrame. Default is 'numpy_nullable'.
- filesystem (fsspec | pyarrow filesystem | None): Filesystem object to use when reading the parquet file. Only implemented for engine="pyarrow". Default is None.
- filters (list[tuple] | list[list[tuple]] | None): To filter out data. Filter syntax: [[(column, op, val), …],…] where op is [==, =, >, >=, <, <=, !=, in, not in]. The innermost tuples are transposed into a set of filters applied through an AND operation. The outer list combines these sets of filters through an OR operation. A single list of tuples can also be used, meaning that no OR operation between set of filters is to be conducted. Using this argument will NOT result in row-wise filtering of the final partitions unless engine="pyarrow" is also specified. For other engines, filtering is only performed at the partition level, that is, to prevent the loading of some row-groups and/or files. Default is None.
- **kwargs: Any additional kwargs are passed to the engine.
Additional Information
https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html
Source code in python/opsml/data/_data.pyi
save(path, save_kwargs=None)
Saves pandas dataframe as parquet file via to_parquet

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to save the data to | required |
| save_kwargs | DataSaveKwargs | Additional kwargs to pass in | None |
Acceptable save kwargs
- engine ({'auto', 'pyarrow', 'fastparquet'}): Parquet library to use. If 'auto', then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try 'pyarrow', falling back to 'fastparquet' if 'pyarrow' is unavailable. Default is 'auto'.
- compression (str | None): Name of the compression to use. Use None for no compression. Supported options: 'snappy', 'gzip', 'brotli', 'lz4', 'zstd'. Default is 'snappy'.
- index (bool | None): If True, include the dataframe's index(es) in the file output. If False, they will not be written to the file. If None, similar to True the dataframe's index(es) will be saved. However, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn't require much space and is faster. Other indexes will be included as columns in the file output. Default is None.
- partition_cols (list | None): Column names by which to partition the dataset. Columns are partitioned in the order they are given. Must be None if path is not a string. Default is None.
- storage_options (dict | None): Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://" and "gcs://") the key-value pairs are forwarded to fsspec.open. Default is None.
- **kwargs: Any additional kwargs are passed to the engine.
Additional Information
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html
Source code in python/opsml/data/_data.pyi
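A usage sketch, assuming the classes are importable from `opsml.data` and that the `data` dictionary is forwarded to `pandas.DataFrame.to_parquet` as documented above:

```python
from pathlib import Path

import pandas as pd
from opsml.data import ColumnSplit, DataSaveKwargs, DataSplit, PandasData  # assumed import path

df = pd.DataFrame({"age": [25, 32, 47], "label": [0, 1, 1]})
interface = PandasData(
    data=df,
    dependent_vars=["label"],
    data_splits=[
        DataSplit(
            label="train",
            column_split=ColumnSplit(column_name="age", column_value=40, inequality="<"),
        )
    ],
)

# The `data` dict is forwarded to pandas.DataFrame.to_parquet
interface.save(Path("artifacts"), DataSaveKwargs(data={"compression": "zstd", "index": False}))
```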
PolarsData
Bases: DataInterface
Source code in python/opsml/data/_data.pyi
__init__(data=None, data_splits=None, dependent_vars=None, sql_logic=None, data_profile=None)
Define a data interface

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame \| None | Polars dataframe | None |
| dependent_vars | DependentVars \| List[str] \| List[int] \| None | List of dependent variables to associate with data | None |
| data_splits | DataSplits \| List[DataSplit] | Optional list of `DataSplit` | None |
| sql_logic | SqlLogic \| None | Sql logic used to generate data, represented as a dictionary | None |
| data_profile | DataProfile \| None | Data profile | None |
Source code in python/opsml/data/_data.pyi
load(path, metadata, load_kwargs=None)
Load the data from a file

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to load the data from | required |
| metadata | DataInterfaceSaveMetadata | Metadata associated with the data | required |
| load_kwargs | DataLoadKwargs | Additional kwargs to pass in | None |
Acceptable load kwargs
- columns (list[int] | list[str] | None): Columns to load. Default is None.
- n_rows (int | None): Number of rows to load. Default is None.
- row_index_name (str | None): Name of the row index. Default is None.
- row_index_offset (int): Offset for the row index. Default is 0.
- parallel (ParallelStrategy): Parallel strategy to use. Default is 'auto'.
- use_statistics (bool): Whether to use statistics. Default is True.
- hive_partitioning (bool | None): Whether to use hive partitioning. Default is None.
- glob (bool): Whether to use glob pattern matching. Default is True.
- schema (SchemaDict | None): Schema to use. Default is None.
- hive_schema (SchemaDict | None): Hive schema to use. Default is None.
- try_parse_hive_dates (bool): Whether to try parsing hive dates. Default is True.
- rechunk (bool): Whether to rechunk the data. Default is False.
- low_memory (bool): Whether to use low memory mode. Default is False.
- storage_options (dict[str, Any] | None): Additional storage options. Default is None.
- credential_provider (CredentialProviderFunction | Literal['auto'] | None): Credential provider function. Default is 'auto'.
- retries (int): Number of retries for loading. Default is 2.
- use_pyarrow (bool): Whether to use PyArrow for loading. Default is False.
- pyarrow_options (dict[str, Any] | None): Additional options for PyArrow. Default is None.
- memory_map (bool): Whether to use memory mapping. Default is True.
- include_file_paths (str | None): File paths to include. Default is None.
- allow_missing_columns (bool): Whether to allow missing columns. Default is False.
See Also
https://docs.pola.rs/api/python/dev/reference/api/polars.read_parquet.html
Source code in python/opsml/data/_data.pyi
save(path, save_kwargs=None)
Saves polars dataframe to parquet dataset via write_parquet

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to save the data to | required |
| save_kwargs | DataSaveKwargs | Additional kwargs to pass in | None |
Acceptable save kwargs
- compression (ParquetCompression): Compression codec to use for writing.
- compression_level (int | None): Compression level to use. Default is None.
- statistics (bool | str | dict[str, bool]): Whether to write statistics. Default is True.
- row_group_size (int | None): Number of rows per row group. Default is None.
- data_page_size (int | None): Size of data pages. Default is None.
- use_pyarrow (bool): Whether to use PyArrow for writing. Default is False.
- pyarrow_options (dict[str, Any] | None): Additional options for PyArrow. Default is None.
- partition_by (str | Sequence[str] | None): Columns to partition by. Default is None.
- partition_chunk_size_bytes (int): Size of partition chunks in bytes. Default is 4294967296.
- storage_options (dict[str, Any] | None): Additional storage options. Default is None.
- credential_provider (CredentialProviderFunction | Literal['auto'] | None): Credential provider function. Default is 'auto'.
- retries (int): Number of retries for writing. Default is 2.
See Also
https://docs.pola.rs/api/python/dev/reference/api/polars.DataFrame.write_parquet.html
Source code in python/opsml/data/_data.pyi
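A usage sketch, assuming the classes are importable from `opsml.data` and that the `data` dictionary is forwarded to `polars.DataFrame.write_parquet` as documented above:

```python
from pathlib import Path

import polars as pl
from opsml.data import DataSaveKwargs, PolarsData  # assumed import path

df = pl.DataFrame({"feature": [1.0, 2.0, 3.0], "target": [0, 1, 0]})
interface = PolarsData(data=df, dependent_vars=["target"])

# The `data` dict is forwarded to polars.DataFrame.write_parquet
interface.save(Path("artifacts"), DataSaveKwargs(data={"compression": "zstd", "statistics": True}))
```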
SqlData
Source code in python/opsml/data/_data.pyi
__init__(sql_logic, data_profile=None)
Define a sql data interface

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sql_logic | SqlLogic | Sql logic used to generate data, represented as a dictionary | required |
| data_profile | DataProfile \| None | Data profile | None |
Source code in python/opsml/data/_data.pyi
save(path, save_kwargs=None)
Save the sql logic to a file

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | The path to save the sql logic to | required |
| save_kwargs | DataSaveKwargs | Additional kwargs to pass in | None |
Source code in python/opsml/data/_data.pyi
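A brief sketch, assuming the classes are importable from `opsml.data`:

```python
from pathlib import Path

from opsml.data import SqlData, SqlLogic  # assumed import path

# A data interface that records only the SQL used to produce the data
interface = SqlData(sql_logic=SqlLogic(queries={"training": "SELECT * FROM training_table"}))
interface.save(Path("artifacts"))
```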
SqlLogic
Source code in python/opsml/data/_data.pyi
queries (property, writable)
Return the queries
__getitem__(key)
Get the query by key

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| key | str | The key to get the query by | required |

Returns:

| Type | Description |
|---|---|
| str | The query |
__init__(queries)
Define sql logic

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| queries | Dict[str, str] | Sql logic used to generate data, represented as a dictionary. Key is the name to assign to the sql logic and value is either a sql query or a path to a .sql file. | required |
Source code in python/opsml/data/_data.pyi
__str__()
add_sql_logic(name, query=None, filepath=None)
Add sql logic to existing queries

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name to associate with the sql logic | required |
| query | Optional[str] | SQL query | None |
| filepath | Optional[str] | Filepath to SQL query | None |
Source code in python/opsml/data/_data.pyi
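A brief sketch, assuming `SqlLogic` is importable from `opsml.data`; the query strings and the `queries/labels.sql` path are illustrative only:

```python
from opsml.data import SqlLogic  # assumed import path

sql = SqlLogic(queries={"features": "SELECT id, age FROM users"})

# Queries can also be registered from .sql files on disk
sql.add_sql_logic(name="labels", filepath="queries/labels.sql")

print(sql["features"])  # lookup by name via __getitem__
```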
StartStopSplit
Source code in python/opsml/data/_data.pyi
__init__(start, stop)
Define a start stop split

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| start | int | The start of the split | required |
| stop | int | The stop of the split | required |
TorchData
Bases: DataInterface
Source code in python/opsml/data/_data.pyi
__init__(data=None, data_splits=None, dependent_vars=None, sql_logic=None, data_profile=None)
Define a data interface

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | Tensor \| None | Torch tensor | None |
| dependent_vars | DependentVars \| List[str] \| List[int] \| None | List of dependent variables to associate with data | None |
| data_splits | DataSplits \| List[DataSplit] | Optional list of `DataSplit` | None |
| sql_logic | SqlLogic \| None | Sql logic used to generate data, represented as a dictionary | None |
| data_profile | DataProfile \| None | Data profile | None |
Source code in python/opsml/data/_data.pyi
load(path, metadata, load_kwargs=None)
Load the torch tensor from file

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to load the data from | required |
| metadata | DataInterfaceSaveMetadata | Metadata associated with the data | required |
| load_kwargs | DataLoadKwargs | Additional kwargs to pass in | None |
Acceptable load kwargs
- map_location: A function, torch.device, string or a dict specifying how to remap storage locations.
- pickle_module: Module used for unpickling metadata and objects (has to match the pickle_module used to serialize file).
- weights_only: Indicates whether unpickler should be restricted to loading only tensors, primitive types, dictionaries and any types added via torch.serialization.add_safe_globals().
- mmap: Indicates whether the file should be mmaped rather than loading all the storages into memory. Typically, tensor storages in the file will first be moved from disk to CPU memory, after which they are moved to the location that they were tagged with when saving, or specified by map_location. This second step is a no-op if the final location is CPU. When the mmap flag is set, instead of copying the tensor storages from disk to CPU memory in the first step, f is mmaped.
- pickle_load_args: (Python 3 only) optional keyword arguments passed over to pickle_module.load() and pickle_module.Unpickler(), e.g., errors=....
Additional Information
https://pytorch.org/docs/stable/generated/torch.load.html
Source code in python/opsml/data/_data.pyi
save(path, save_kwargs=None)
Saves torch tensor to a file

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Base path to save the data to | required |
| save_kwargs | DataSaveKwargs | Additional kwargs to pass in | None |
Acceptable save kwargs
- pickle_module (Any): Module used for pickling metadata and objects.
- pickle_protocol (int): Can be specified to override the default protocol.
Additional Information
https://pytorch.org/docs/main/generated/torch.save.html
Source code in python/opsml/data/_data.pyi
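A usage sketch, assuming the classes are importable from `opsml.data` and that the `data` dictionary is forwarded to `torch.save` as documented above:

```python
from pathlib import Path

import torch
from opsml.data import DataSaveKwargs, TorchData  # assumed import path

tensor = torch.rand(100, 4)
interface = TorchData(data=tensor)

# The `data` dict is forwarded to torch.save
interface.save(Path("artifacts"), DataSaveKwargs(data={"pickle_protocol": 4}))
```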
generate_feature_schema(data, data_type)
Generate a feature schema

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | Any | Data to generate the feature schema from | required |
| data_type | DataType | The data type | required |

Returns:

| Type | Description |
|---|---|
| FeatureSchema | A feature map |
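A brief sketch, assuming `generate_feature_schema` is importable from `opsml.data`; the `DataType` enum location (`opsml.types`) and the `DataType.Pandas` member name are assumptions for illustration.

```python
import pandas as pd
from opsml.data import generate_feature_schema  # assumed import path
from opsml.types import DataType  # assumed location and member name

df = pd.DataFrame({"age": [25, 32], "score": [0.7, 0.4]})

# Produces a FeatureSchema describing the columns of the dataframe
schema = generate_feature_schema(df, DataType.Pandas)
```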