Data

unimol_tools.data contains functions and classes for loading, holding, and scaling data and features.

DataHub

Classes and functions from unimol_tools.data.datahub.py.

Datareader

Classes and functions from unimol_tools.data.datareader.py.

class unimol_tools.data.datareader.MolDataReader

A class for reading molecular data.

read_data(data=None, is_train=True, **params)

Reads and preprocesses molecular data from various input formats for model training or prediction.

Target columns are parsed as follows:

  1. If target_cols is not None, use target_cols as the target columns.

  2. If target_cols is None, use all columns with the prefix ‘target_col_prefix’ as the target columns.

  3. For prediction, use the given target_cols as placeholder target columns with value -1.0.

Parameters:
  • data – The input molecular data. Can be a file path (str), a dictionary, or a list of SMILES strings.

  • is_train – (bool) A flag indicating if the operation is for training. Determines data processing steps.

  • params – A dictionary of additional parameters for data processing.

Returns:

A dictionary containing processed data and related information for model consumption.

Raises:

ValueError – If the input data type is not supported or if any SMILES string is invalid (when strict).
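The target-column rules described for read_data can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper names resolve_target_cols and add_prediction_placeholders, the default prefix "TARGET", and the choice to fill only missing targets with -1.0 are all assumptions, and the real reader presumably operates on a DataFrame rather than a list of dictionaries.

```python
def resolve_target_cols(columns, target_cols=None, target_col_prefix="TARGET"):
    """Rules 1 and 2: explicit target_cols win, otherwise match by prefix."""
    if target_cols is not None:
        return list(target_cols)
    return [c for c in columns if c.startswith(target_col_prefix)]


def add_prediction_placeholders(rows, target_cols, placeholder=-1.0):
    """Rule 3 (assumed reading): at prediction time, give every row a value
    for each target column, using the placeholder where none is present."""
    return [
        {**row, **{c: row.get(c, placeholder) for c in target_cols}}
        for row in rows
    ]


columns = ["SMILES", "TARGET_0", "TARGET_1"]
by_prefix = resolve_target_cols(columns)
explicit = resolve_target_cols(columns, target_cols=["TARGET_1"])

# Prediction input usually has no labels, so placeholders are filled in.
predict_rows = add_prediction_placeholders([{"SMILES": "CCO"}], by_prefix)
```

Here by_prefix picks up both prefixed columns, explicit keeps only the column that was named, and predict_rows carries -1.0 for each target.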

check_smiles(smi, is_train, smi_strict)

Validates a SMILES string and decides whether it should be included based on training mode and strictness.

Parameters:
  • smi – (str) The SMILES string to check.

  • is_train – (bool) Indicates if this check is happening during training.

  • smi_strict – (bool) If true, invalid SMILES strings raise an error, otherwise they’re logged and skipped.

Returns:

(bool) True if the SMILES string is valid, False otherwise.

Raises:

ValueError – If the SMILES string is invalid and strict mode is on.
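The strict versus non-strict behaviour can be sketched as follows. The names check_smiles_sketch and toy_parser are hypothetical, and toy_parser is only a stand-in that checks for a non-empty string with balanced parentheses; a real check would use a cheminformatics parser. The is_train flag is left out because this page does not spell out how it changes the decision.

```python
import logging

logger = logging.getLogger(__name__)


def toy_parser(smi):
    """Hypothetical stand-in for a real SMILES parser: accepts non-empty
    strings whose parentheses are balanced."""
    depth = 0
    for ch in smi:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smi) and depth == 0


def check_smiles_sketch(smi, smi_strict, is_parsable=toy_parser):
    """Return True for a valid string; otherwise raise in strict mode or
    log and return False so the caller can skip the entry."""
    if is_parsable(smi):
        return True
    if smi_strict:
        raise ValueError(f"Invalid SMILES: {smi}")
    logger.warning("Skipping invalid SMILES: %s", smi)
    return False


candidates = ["CC(=O)O", "CC(", "c1ccccc1"]
kept = [s for s in candidates if check_smiles_sketch(s, smi_strict=False)]
```

With RDKit installed, a predicate such as `lambda s: Chem.MolFromSmiles(s) is not None` could be passed as is_parsable, since MolFromSmiles returns None for strings it cannot parse.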

smi2scaffold(smi)

Converts a SMILES string to its corresponding scaffold.

Parameters:

smi – (str) The SMILES string to convert.

Returns:

(str) The scaffold of the SMILES string, or the original SMILES if conversion fails.

anomaly_clean(data, task, target_cols)

Performs anomaly cleaning on the data based on the specified task.

Parameters:
  • data – (DataFrame) The dataset to be cleaned.

  • task – (str) The type of task which determines the cleaning strategy.

  • target_cols – (list) The list of target columns to consider for cleaning.

Returns:

(DataFrame) The cleaned dataset.

Raises:

ValueError – If the provided task is not recognized.

anomaly_clean_regression(data, target_cols)

Performs anomaly cleaning specifically for regression tasks using a 3-sigma threshold.

Parameters:
  • data – (DataFrame) The dataset to be cleaned.

  • target_cols – (list) The list of target columns to consider for cleaning.

Returns:

(DataFrame) The cleaned dataset after applying the 3-sigma rule.
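A minimal sketch of a 3-sigma filter, using only the standard library. It assumes rows are dictionaries rather than DataFrame rows, that the sample standard deviation is used, and that a row is dropped when any target column falls outside mean ± 3σ; the library's exact choices are not stated on this page, and the name three_sigma_clean is hypothetical.

```python
from statistics import mean, stdev


def three_sigma_clean(rows, target_cols, n_sigma=3.0):
    """Keep only rows whose targets all lie within mean +/- n_sigma * std."""
    bounds = {}
    for col in target_cols:
        values = [row[col] for row in rows]
        mu = mean(values)
        # stdev needs at least two points; treat a single row as zero spread.
        sigma = stdev(values) if len(values) > 1 else 0.0
        bounds[col] = (mu - n_sigma * sigma, mu + n_sigma * sigma)
    return [
        row
        for row in rows
        if all(bounds[c][0] <= row[c] <= bounds[c][1] for c in target_cols)
    ]


# Twenty ordinary values and one extreme outlier.
rows = [{"TARGET": 1.0} for _ in range(20)] + [{"TARGET": 100.0}]
cleaned = three_sigma_clean(rows, ["TARGET"])
```

In this example the outlier lies more than four standard deviations from the mean, so cleaned contains only the twenty ordinary rows. Note that with very few rows no single point can exceed three sample standard deviations, so the filter removes nothing.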

Datascaler

Classes and functions from unimol_tools.data.datascaler.py.

class unimol_tools.data.datascaler.TargetScaler(ss_method, task, load_dir=None)

A class to scale the target.

__init__(ss_method, task, load_dir=None)

Initializes the TargetScaler object for scaling target values.

Parameters:
  • ss_method – (str) The scaling method to be used.

  • task – (str) The type of machine learning task (e.g., ‘classification’, ‘regression’).

  • load_dir – (str, optional) Directory from which to load an existing scaler.

transform(target)

Transforms the target values using the appropriate scaling method.

Parameters:

target – (array-like) The target values to be transformed.

Returns:

(array-like) The transformed target values.

fit(target, dump_dir)

Fits the scaler to the target values and optionally saves the scaler to disk.

Parameters:
  • target – (array-like) The target values to fit the scaler.

  • dump_dir – (str) Directory where the fitted scaler will be saved.

scaler_choose(method, target)

Selects the appropriate scaler based on the scaling method and fits it to the target.

Parameters:
  • method

    (str) The scaling method to be used.

    Currently supported:

    • ‘minmax’: MinMaxScaler

    • ‘standard’: StandardScaler

    • ‘robust’: RobustScaler

    • ‘maxabs’: MaxAbsScaler

    • ‘quantile’: QuantileTransformer

    • ‘power_trans’: PowerTransformer

    • ‘normalizer’: Normalizer

    • ‘log1p’: FunctionTransformer

  • target – (array-like) The target values to fit the scaler.

Returns:

The fitted scaler object.
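The class names in the list of supported methods match those in scikit-learn's sklearn.preprocessing module. Assuming that is where they come from, a name-to-scaler lookup could look like the sketch below. The SCALER_FACTORIES table, the choose_scaler function, the default constructor arguments, and the use of np.expm1 as the inverse for ‘log1p’ are assumptions, not the library's confirmed configuration.

```python
import numpy as np
from sklearn.preprocessing import (
    FunctionTransformer,
    MaxAbsScaler,
    MinMaxScaler,
    Normalizer,
    PowerTransformer,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

# Hypothetical lookup table: method name -> zero-argument scaler factory.
SCALER_FACTORIES = {
    "minmax": MinMaxScaler,
    "standard": StandardScaler,
    "robust": RobustScaler,
    "maxabs": MaxAbsScaler,
    "quantile": QuantileTransformer,
    "power_trans": PowerTransformer,
    "normalizer": Normalizer,
    "log1p": lambda: FunctionTransformer(np.log1p, inverse_func=np.expm1),
}


def choose_scaler(method, target):
    """Build the scaler registered under `method` and fit it to `target`."""
    if method not in SCALER_FACTORIES:
        raise ValueError(f"Unknown scaling method: {method!r}")
    scaler = SCALER_FACTORIES[method]()
    # scikit-learn scalers expect a 2-D array: (n_samples, n_targets).
    scaler.fit(np.asarray(target, dtype=float))
    return scaler


target = np.array([[1.0], [2.0], [3.0], [4.0]])
standard = choose_scaler("standard", target)
scaled = standard.transform(target)

log_scaler = choose_scaler("log1p", target)
round_trip = log_scaler.inverse_transform(log_scaler.transform(target))
```

A dictionary of factories keeps the method names in one place and makes an unsupported name fail with a clear error instead of silently falling back to a default.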

inverse_transform(target)

Inverse transforms the scaled target values back to their original scale.

Parameters:

target – (array-like) The target values to be inverse transformed.

Returns:

(array-like) The target values in their original scale.
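The fit, transform, and inverse_transform methods form a round trip: fit on training targets and save the scaler, reload it later from the same directory, scale, and map predictions back. The sketch below illustrates that life cycle with a hand-written standard scaler. The class StandardTargetScalerSketch, the file name target_scaler.pkl, and the use of pickle are hypothetical choices for illustration and say nothing about how TargetScaler stores its state.

```python
import os
import pickle
import tempfile
from statistics import mean, pstdev


class StandardTargetScalerSketch:
    FILENAME = "target_scaler.pkl"  # hypothetical file name

    def __init__(self, load_dir=None):
        self.params = None
        # Reuse a previously fitted scaler if one was saved in load_dir.
        if load_dir is not None:
            path = os.path.join(load_dir, self.FILENAME)
            if os.path.exists(path):
                with open(path, "rb") as fh:
                    self.params = pickle.load(fh)

    def fit(self, target, dump_dir=None):
        # Fall back to 1.0 so constant targets do not divide by zero.
        self.params = {"mean": mean(target), "std": pstdev(target) or 1.0}
        if dump_dir is not None:
            os.makedirs(dump_dir, exist_ok=True)
            with open(os.path.join(dump_dir, self.FILENAME), "wb") as fh:
                pickle.dump(self.params, fh)
        return self

    def transform(self, target):
        return [(y - self.params["mean"]) / self.params["std"] for y in target]

    def inverse_transform(self, target):
        return [y * self.params["std"] + self.params["mean"] for y in target]


train_target = [1.0, 2.0, 3.0]
with tempfile.TemporaryDirectory() as tmp:
    StandardTargetScalerSketch().fit(train_target, dump_dir=tmp)  # training
    reloaded = StandardTargetScalerSketch(load_dir=tmp)           # prediction
    scaled = reloaded.transform(train_target)
    restored = reloaded.inverse_transform(scaled)
```

After the round trip, restored matches train_target up to floating-point error, which is the property that lets predictions made in the scaled space be reported in the original units.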

is_skewed(target)

Determines whether the target values are skewed based on skewness and kurtosis metrics.

Parameters:

target – (array-like) The target values to be checked for skewness.

Returns:

(bool) True if the target is skewed, False otherwise.
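A skewness and kurtosis check can be sketched with the standard moment formulas. The function name is_skewed_sketch, both threshold values, and the rule of flagging the target when either statistic exceeds its threshold are assumptions made for illustration; this page does not state the cut-offs the library uses.

```python
def is_skewed_sketch(target, skew_threshold=2.0, kurtosis_threshold=7.0):
    """Flag a target whose skewness or excess kurtosis is large in magnitude."""
    n = len(target)
    mu = sum(target) / n
    # Central moments of order 2, 3, and 4 (population form).
    m2 = sum((y - mu) ** 2 for y in target) / n
    if m2 == 0:
        return False  # constant target: no spread, so nothing to flag
    m3 = sum((y - mu) ** 3 for y in target) / n
    m4 = sum((y - mu) ** 4 for y in target) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return (
        abs(skewness) > skew_threshold
        or abs(excess_kurtosis) > kurtosis_threshold
    )


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
heavy_tail = [1.0] * 50 + [1000.0]
flags = (is_skewed_sketch(symmetric), is_skewed_sketch(heavy_tail))
```

The symmetric sample has zero skewness and is not flagged, while the sample with one extreme value is. A check like this is typically used to decide whether a transform suited to skewed data should be preferred over plain standardization.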

Conformer

Classes and functions from unimol_tools.data.conformer.py.