Data
unimol_tools.data contains functions and classes for loading, holding, and scaling data and features.
DataHub
Classes and functions from unimol_tools.data.datahub.py.
Datareader
Classes and functions from unimol_tools.data.datareader.py.
- class unimol_tools.data.datareader.MolDataReader
A class for reading molecular data.
- read_data(data=None, is_train=True, **params)
Reads and preprocesses molecular data from various input formats for model training or prediction.
Target columns are parsed as follows:
1. If target_cols is not None, use target_cols as the target columns.
2. If target_cols is None, use all columns whose names start with the prefix given by target_col_prefix as the target columns.
3. For prediction, use the given target_cols as placeholder target columns with value -1.0.
- Parameters:
data – The input molecular data. Can be a file path (str), a dictionary, or a list of SMILES strings.
is_train – (bool) A flag indicating if the operation is for training. Determines data processing steps.
params – A dictionary of additional parameters for data processing.
- Returns:
A dictionary containing processed data and related information for model consumption.
- Raises:
ValueError – If the input data type is not supported or if any SMILES string is invalid (when strict).
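The target-column rules described for read_data can be illustrated with a small stand-alone sketch. It is not the library's code: the helper names resolve_target_cols and add_prediction_placeholders are hypothetical, the default prefix "TARGET" is an assumption, and rows are plain dictionaries rather than a DataFrame. The sketch also assumes that placeholders are added only for columns missing from the input, which the text above does not specify.

```python
def resolve_target_cols(columns, target_cols=None, target_col_prefix="TARGET"):
    """Pick target columns following rules 1 and 2 (hypothetical helper)."""
    if target_cols is not None:
        # Rule 1: explicit target columns win.
        return list(target_cols)
    # Rule 2: fall back to every column that starts with the prefix.
    return [c for c in columns if c.startswith(target_col_prefix)]


def add_prediction_placeholders(rows, target_cols, placeholder=-1.0):
    """Rule 3: at prediction time, add absent target columns as placeholders.

    Assumption: columns already present in a row are left untouched.
    """
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's data is not mutated
        for col in target_cols:
            new_row.setdefault(col, placeholder)
        filled.append(new_row)
    return filled


# Training-style input: targets are discovered from the prefix.
train_cols = resolve_target_cols(["SMILES", "TARGET_0", "TARGET_1"])

# Prediction-style input: no labels, so placeholders are added.
pred_rows = add_prediction_placeholders([{"SMILES": "CCO"}], ["logp"])
```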
- check_smiles(smi, is_train, smi_strict)
Validates a SMILES string and decides whether it should be included based on training mode and strictness.
- Parameters:
smi – (str) The SMILES string to check.
is_train – (bool) Indicates if this check is happening during training.
smi_strict – (bool) If true, invalid SMILES strings raise an error, otherwise they’re logged and skipped.
- Returns:
(bool) True if the SMILES string is valid, False otherwise.
- Raises:
ValueError – If the SMILES string is invalid and strict mode is on.
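The strict versus lenient behaviour described for check_smiles can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: check_smiles_sketch and looks_valid are hypothetical names, and looks_valid is a toy stand-in (non-empty string with balanced parentheses) for a real SMILES parser. The text does not spell out how is_train changes the outcome, so the sketch only records it in the log message.

```python
import logging

logger = logging.getLogger(__name__)


def looks_valid(smi):
    """Toy validity check (assumption): non-empty with balanced parentheses."""
    depth = 0
    for ch in smi:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smi) and depth == 0


def check_smiles_sketch(smi, is_train, smi_strict, is_valid=looks_valid):
    """Return True for a valid string; raise or skip an invalid one."""
    if is_valid(smi):
        return True
    if smi_strict:
        # Strict mode: an invalid string is an error.
        raise ValueError(f"Invalid SMILES: {smi}")
    # Lenient mode: log the problem and tell the caller to skip the entry.
    logger.info("Skipping invalid SMILES (is_train=%s): %s", is_train, smi)
    return False


# Lenient filtering keeps only the entries that pass the check.
candidates = ["CC(=O)O", "CC(=O", "c1ccccc1"]
kept = [s for s in candidates if check_smiles_sketch(s, True, False)]
```

In a real pipeline the is_valid argument would be replaced by a call into a cheminformatics toolkit; the control flow around it stays the same.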
- smi2scaffold(smi)
Converts a SMILES string to its corresponding scaffold.
- Parameters:
smi – (str) The SMILES string to convert.
- Returns:
(str) The scaffold of the SMILES string, or the original SMILES if conversion fails.
- anomaly_clean(data, task, target_cols)
Performs anomaly cleaning on the data based on the specified task.
- Parameters:
data – (DataFrame) The dataset to be cleaned.
task – (str) The type of task which determines the cleaning strategy.
target_cols – (list) The list of target columns to consider for cleaning.
- Returns:
(DataFrame) The cleaned dataset.
- Raises:
ValueError – If the provided task is not recognized.
- anomaly_clean_regression(data, target_cols)
Performs anomaly cleaning specifically for regression tasks using a 3-sigma threshold.
- Parameters:
data – (DataFrame) The dataset to be cleaned.
target_cols – (list) The list of target columns to consider for cleaning.
- Returns:
(DataFrame) The cleaned dataset after applying the 3-sigma rule.
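A 3-sigma rule keeps only the rows whose target value lies within three standard deviations of the mean. The sketch below shows the idea on plain dictionaries instead of a DataFrame. The function name three_sigma_clean is hypothetical, and the use of the sample standard deviation and of inclusive bounds are assumptions, since the text does not say which variants the library uses.

```python
from statistics import mean, stdev


def three_sigma_clean(rows, target_cols):
    """Drop rows whose target lies outside mean +/- 3 * std (hypothetical helper)."""
    for col in target_cols:
        values = [row[col] for row in rows]
        if len(values) < 2:
            continue  # stdev needs at least two data points
        mu, sigma = mean(values), stdev(values)  # sample std (assumption)
        low, high = mu - 3 * sigma, mu + 3 * sigma
        # Filter on this column before moving on to the next one.
        rows = [row for row in rows if low <= row[col] <= high]
    return rows


# Nineteen values close to 1.0 plus one extreme value.
data = [{"TARGET": v} for v in [0.9, 1.0, 1.1] * 6 + [1.0, 100.0]]
cleaned = three_sigma_clean(data, ["TARGET"])
```

Note that with very few rows a single outlier inflates the standard deviation so much that it can never fall outside the 3-sigma band, so this kind of cleaning is only meaningful on reasonably sized datasets.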
Datascaler
Classes and functions from unimol_tools.data.datascaler.py.
- class unimol_tools.data.datascaler.TargetScaler(ss_method, task, load_dir=None)
A class to scale the target.
- __init__(ss_method, task, load_dir=None)
Initializes the TargetScaler object for scaling target values.
- Parameters:
ss_method – (str) The scaling method to be used.
task – (str) The type of machine learning task (e.g., ‘classification’, ‘regression’).
load_dir – (str, optional) Directory from which to load an existing scaler.
- transform(target)
Transforms the target values using the appropriate scaling method.
- Parameters:
target – (array-like) The target values to be transformed.
- Returns:
(array-like) The transformed target values.
- fit(target, dump_dir)
Fits the scaler to the target values and optionally saves the scaler to disk.
- Parameters:
target – (array-like) The target values to fit the scaler.
dump_dir – (str) Directory where the fitted scaler will be saved.
- scaler_choose(method, target)
Selects the appropriate scaler based on the scaling method and fits it to the target.
- Parameters:
method – (str) The scaling method to be used. Currently supported:
‘minmax’: MinMaxScaler,
‘standard’: StandardScaler,
‘robust’: RobustScaler,
‘maxabs’: MaxAbsScaler,
‘quantile’: QuantileTransformer,
‘power_trans’: PowerTransformer,
‘normalizer’: Normalizer,
‘log1p’: FunctionTransformer.
target – (array-like) The target values to fit the scaler.
- Returns:
The fitted scaler object.
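Choosing a scaler by name is commonly done with a lookup table that maps each method string to a scaler class, which is then fitted to the target. The sketch below illustrates that pattern with three small pure-Python stand-ins. None of these classes come from the library: MinMaxSketch, StandardSketch, Log1pSketch and scaler_choose_sketch are hypothetical names, only three of the listed methods are covered, and raising ValueError for an unknown name is an assumption.

```python
import math


class MinMaxSketch:
    """Stand-in for a min-max scaler: maps values to [0, 1]."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0  # avoid division by zero
        return [(v - self.lo) / span for v in values]


class StandardSketch:
    """Stand-in for a standard scaler: zero mean, unit population variance."""

    def fit(self, values):
        self.mu = sum(values) / len(values)
        var = sum((v - self.mu) ** 2 for v in values) / len(values)
        self.sigma = math.sqrt(var) or 1.0  # avoid division by zero
        return self

    def transform(self, values):
        return [(v - self.mu) / self.sigma for v in values]


class Log1pSketch:
    """Stand-in for a stateless log1p transform."""

    def fit(self, values):
        return self  # nothing to learn

    def transform(self, values):
        return [math.log1p(v) for v in values]


SCALERS = {"minmax": MinMaxSketch, "standard": StandardSketch, "log1p": Log1pSketch}


def scaler_choose_sketch(method, target):
    """Look up the scaler class for `method` and fit it to `target`."""
    try:
        scaler_cls = SCALERS[method]
    except KeyError:
        raise ValueError(f"Unknown scaling method: {method!r}") from None
    return scaler_cls().fit(target)


scaled = scaler_choose_sketch("minmax", [2.0, 4.0, 6.0]).transform([2.0, 4.0, 6.0])
```

A lookup table keeps the set of supported names in one place, so adding a method is a one-line change and unsupported names fail early with a clear message.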
Conformer
Classes and functions from unimol_tools.data.conformer.py.