Data

unimol_tools.data contains functions and classes for loading, storing, and scaling data, and for generating features.

DataHub

Classes and functions from unimol_tools.data.datahub.py.

class unimol_tools.data.datahub.DataHub(data=None, is_train=True, save_path=None, **params)[source]

The DataHub class is responsible for storing and preprocessing data for machine learning tasks. It initializes with configuration options to handle different types of tasks such as regression, classification, and others. It also supports data scaling and handling molecular data.

__init__(data=None, is_train=True, save_path=None, **params)[source]

Initializes the DataHub instance with data and configuration for the ML task.

Parameters:
  • data – Initial dataset to be processed.

  • is_train – (bool) Indicates if the DataHub is being used for training.

  • save_path – (str) Path to save any necessary files, like scalers.

  • params – Additional parameters for data preprocessing and model configuration.

_init_data(**params)[source]

Initializes and preprocesses the data based on the task and parameters provided.

This method handles reading raw data, scaling targets, and transforming data for use with molecular inputs. It tailors the preprocessing steps based on the task type, such as regression or classification.

Parameters:

params – Additional parameters for data processing.

Raises:

ValueError – If the task type is unknown.

save_mol2sdf(data, mols, params)[source]

Saves the conformers to an SDF file.

Parameters:
  • data – DataFrame containing the raw data.

  • mols – List of RDKit molecule objects.

Datareader

Classes and functions from unimol_tools.data.datareader.py.

class unimol_tools.data.datareader.MolDataReader[source]

A class to read molecular data.

read_data(data=None, is_train=True, **params)[source]

Reads and preprocesses molecular data from various input formats for model training or prediction.

Target columns are parsed as follows:

  1. If target_cols is not None, use target_cols as the target columns.

  2. If target_cols is None, use all columns with the prefix ‘target_col_prefix’ as target columns.

  3. For prediction, use the given target_cols as placeholder target columns with value -1.0.

Parameters:
  • data – The input molecular data. Can be a file path (str), a dictionary, or a list of SMILES strings.

  • is_train – (bool) A flag indicating if the operation is for training. Determines data processing steps.

  • params – A dictionary of additional parameters for data processing.

Returns:

A dictionary containing processed data and related information for model consumption.

Raises:

ValueError – If the input data type is not supported or if any SMILES string is invalid (when strict).
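The three target-column rules described for read_data can be sketched in plain Python. This is an illustration under assumptions, not the library's code: `resolve_target_cols` is a hypothetical helper, rows are plain dictionaries standing in for a DataFrame, and the default prefix `"TARGET"` is a guess.

```python
def resolve_target_cols(rows, target_cols=None, target_col_prefix="TARGET", is_train=True):
    """Return (target_cols, rows) following the three rules for target columns.

    Hypothetical helper: `rows` is a list of dicts standing in for a DataFrame.
    """
    columns = list(rows[0].keys()) if rows else []
    if target_cols is None:
        # Rule 2: fall back to every column whose name starts with the prefix.
        target_cols = [c for c in columns if c.startswith(target_col_prefix)]
    else:
        # Rule 1: use the explicitly requested columns.
        target_cols = list(target_cols)
    if not is_train:
        # Rule 3: at prediction time, missing targets become -1.0 placeholders.
        rows = [{**row, **{c: row.get(c, -1.0) for c in target_cols}} for row in rows]
    return target_cols, rows


train_rows = [
    {"SMILES": "CCO", "TARGET_logp": 0.3},
    {"SMILES": "c1ccccc1", "TARGET_logp": 2.1},
]
cols, _ = resolve_target_cols(train_rows)

predict_rows = [{"SMILES": "CCN"}]
pred_cols, filled = resolve_target_cols(
    predict_rows, target_cols=["TARGET_logp"], is_train=False
)
```

In the training call the target column is discovered from its prefix; in the prediction call the requested column is absent from the input, so it is filled with the placeholder value.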

check_smiles(smi, is_train, smi_strict)[source]

Validates a SMILES string and decides whether it should be included based on training mode and strictness.

Parameters:
  • smi – (str) The SMILES string to check.

  • is_train – (bool) Indicates if this check is happening during training.

  • smi_strict – (bool) If true, invalid SMILES strings raise an error, otherwise they’re logged and skipped.

Returns:

(bool) True if the SMILES string is valid, False otherwise.

Raises:

ValueError – If the SMILES string is invalid and strict mode is on.
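The raise-or-skip behaviour of check_smiles can be sketched as follows. Several things here are assumptions: `check_smiles_sketch` and `looks_like_smiles` are hypothetical names, the validity test is a toy stand-in for a real SMILES parser, and the way `is_train` enters the decision (skipping is only allowed during training) is a guess rather than documented behaviour.

```python
import logging

logger = logging.getLogger(__name__)


def looks_like_smiles(smi):
    """Toy validity check (assumption): non-empty with balanced parentheses."""
    depth = 0
    for ch in smi:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smi) and depth == 0


def check_smiles_sketch(smi, is_train, smi_strict, is_valid=looks_like_smiles):
    """Return True for a valid string; raise or skip an invalid one."""
    if is_valid(smi):
        return True
    if smi_strict or not is_train:
        # Strict mode (and, by assumption, prediction mode) refuses bad input.
        raise ValueError(f"Invalid SMILES: {smi!r}")
    # Non-strict training mode: log the problem and tell the caller to skip it.
    logger.warning("Skipping invalid SMILES: %s", smi)
    return False


kept = [s for s in ["CC(=O)O", "CC(=O"] if check_smiles_sketch(s, True, False)]
```

Passing the validity test as a parameter keeps the sketch independent of any cheminformatics toolkit; a real parser can be substituted for `looks_like_smiles`.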

smi2scaffold(smi)[source]

Converts a SMILES string to its corresponding scaffold.

Parameters:

smi – (str) The SMILES string to convert.

Returns:

(str) The scaffold of the SMILES string, or the original SMILES if conversion fails.

anomaly_clean(data, task, target_cols)[source]

Performs anomaly cleaning on the data based on the specified task.

Parameters:
  • data – (DataFrame) The dataset to be cleaned.

  • task – (str) The type of task which determines the cleaning strategy.

  • target_cols – (list) The list of target columns to consider for cleaning.

Returns:

(DataFrame) The cleaned dataset.

Raises:

ValueError – If the provided task is not recognized.

anomaly_clean_regression(data, target_cols)[source]

Performs anomaly cleaning specifically for regression tasks using a 3-sigma threshold.

Parameters:
  • data – (DataFrame) The dataset to be cleaned.

  • target_cols – (list) The list of target columns to consider for cleaning.

Returns:

(DataFrame) The cleaned dataset after applying the 3-sigma rule.
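A 3-sigma filter of the kind described here can be sketched with the standard library. The helper name `three_sigma_filter` is hypothetical, rows are plain dictionaries rather than a DataFrame, and the use of the population standard deviation is an assumption, since the documentation does not say which estimator is used.

```python
from statistics import mean, pstdev


def three_sigma_filter(rows, target_cols, n_sigma=3.0):
    """Keep rows whose targets lie within n_sigma standard deviations of the mean.

    Columns are processed one after another, so statistics for later columns
    are computed on the rows that survived the earlier ones.
    """
    kept = list(rows)
    for col in target_cols:
        values = [row[col] for row in kept]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            # Constant column: nothing can be an outlier.
            continue
        kept = [row for row in kept if abs(row[col] - mu) <= n_sigma * sigma]
    return kept


rows = [{"y": 1.0} for _ in range(19)] + [{"y": 100.0}]
cleaned = three_sigma_filter(rows, ["y"])
```

Note that a single extreme value can only exceed three standard deviations when the sample is large enough, which is why the example uses twenty rows.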

Datascaler

Classes and functions from unimol_tools.data.datascaler.py.

class unimol_tools.data.datascaler.TargetScaler(ss_method, task, load_dir=None)[source]

A class to scale the target.

__init__(ss_method, task, load_dir=None)[source]

Initializes the TargetScaler object for scaling target values.

Parameters:
  • ss_method – (str) The scaling method to be used.

  • task – (str) The type of machine learning task (e.g., ‘classification’, ‘regression’).

  • load_dir – (str, optional) Directory from which to load an existing scaler.

transform(target)[source]

Transforms the target values using the appropriate scaling method.

Parameters:

target – (array-like) The target values to be transformed.

Returns:

(array-like) The transformed target values.

fit(target, dump_dir)[source]

Fits the scaler to the target values and optionally saves the scaler to disk.

Parameters:
  • target – (array-like) The target values to fit the scaler.

  • dump_dir – (str) Directory where the fitted scaler will be saved.

scaler_choose(method, target)[source]

Selects the appropriate scaler based on the scaling method and fits it to the target.

Parameters:
  • method

    (str) The scaling method to be used.

    Currently supported:

    • ’minmax’: MinMaxScaler,

    • ’standard’: StandardScaler,

    • ’robust’: RobustScaler,

    • ’maxabs’: MaxAbsScaler,

    • ’quantile’: QuantileTransformer,

    • ’power_trans’: PowerTransformer,

    • ’normalizer’: Normalizer,

    • ’log1p’: FunctionTransformer,

  • target – (array-like) The target values to fit the scaler.

Returns:

The fitted scaler object.
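The select-then-fit pattern of scaler_choose can be sketched as a name-to-class lookup. The class names in the list above match scikit-learn transformers; to stay dependency-free, this sketch swaps in two small pure-Python stand-ins (`MinMaxSketch`, `Log1pSketch`), which are hypothetical. Raising ValueError for an unknown method, and inverting ‘log1p’ with `expm1`, are assumptions as well.

```python
import math


class MinMaxSketch:
    """Stand-in for a min-max scaler on a flat list of numbers."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0  # avoid division by zero
        return [(v - self.lo) / span for v in values]

    def inverse_transform(self, values):
        span = (self.hi - self.lo) or 1.0
        return [v * span + self.lo for v in values]


class Log1pSketch:
    """Stand-in for a stateless log1p transformer."""

    def fit(self, values):
        return self  # nothing to learn

    def transform(self, values):
        return [math.log1p(v) for v in values]

    def inverse_transform(self, values):
        return [math.expm1(v) for v in values]


SCALERS = {"minmax": MinMaxSketch, "log1p": Log1pSketch}


def scaler_choose_sketch(method, target):
    """Look up the scaler class by name and return it fitted to `target`."""
    try:
        scaler_cls = SCALERS[method]
    except KeyError:
        raise ValueError(f"Unknown scaler method: {method}") from None
    return scaler_cls().fit(target)


target = [0.0, 5.0, 10.0]
scaler = scaler_choose_sketch("minmax", target)
scaled = scaler.transform(target)
restored = scaler.inverse_transform(scaled)
```

The same fitted object serves both transform and inverse_transform, which is why the fitted scaler is returned rather than the transformed values.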

inverse_transform(target)[source]

Inverse transforms the scaled target values back to their original scale.

Parameters:

target – (array-like) The target values to be inverse transformed.

Returns:

(array-like) The target values in their original scale.

is_skewed(target)[source]

Determines whether the target values are skewed based on skewness and kurtosis metrics.

Parameters:

target – (array-like) The target values to be checked for skewness.

Returns:

(bool) True if the target is skewed, False otherwise.
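A skewness-and-kurtosis check can be sketched with moment-based estimators. The function names are hypothetical, and both the thresholds and the way the two metrics are combined (either one exceeding its threshold) are assumptions; the documentation does not state the values the library uses.

```python
from statistics import mean


def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    if m2 == 0:
        # Constant data has no shape to measure.
        return 0.0, 0.0
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def is_skewed_sketch(values, skew_threshold=1.0, kurt_threshold=3.0):
    """Flag the data when either metric exceeds its (hypothetical) threshold."""
    skew, kurt = skewness_and_kurtosis(values)
    return abs(skew) > skew_threshold or abs(kurt) > kurt_threshold


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
long_tailed = [1.0] * 9 + [100.0]
flags = (is_skewed_sketch(symmetric), is_skewed_sketch(long_tailed))
```

A check like this is typically used to decide whether a non-linear transform is a better fit for the targets than a linear scaler.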

Conformer

Classes and functions from unimol_tools.data.conformer.py.

class unimol_tools.data.conformer.ConformerGen(**params)[source]

This class is designed to generate conformers for molecules represented as SMILES strings using the provided parameters and configurations. The transform method uses multiprocessing to speed up conformer generation.

__init__(**params)[source]

Initializes the ConformerGen object and its features based on the provided parameters.

Parameters:
  • model_name – (str) The name of the model to initialize.

  • params – Additional parameters for model configuration.

Returns:

An instance of the specified neural network model.

Raises:

ValueError – If the model name is not recognized.

_init_features(**params)[source]

Initializes the features of the ConformerGen object based on provided parameters.

Parameters:

params – Arbitrary keyword arguments for feature configuration. These can include the random seed, maximum number of atoms, data type, generation method, generation mode, and whether to remove hydrogens.

single_process(smiles)[source]

Processes a single SMILES string to generate conformers using the specified method.

Parameters:

smiles – (str) The SMILES string representing the molecule.

Returns:

A unimolecular data representation (dictionary) of the molecule.

Raises:

ValueError – If the conformer generation method is unrecognized.

unimol_tools.data.conformer.inner_smi2coords(smi, seed=42, mode='fast', remove_hs=True, return_mol=False)[source]

This function is responsible for converting a SMILES (Simplified Molecular Input Line Entry System) string into 3D coordinates for each atom in the molecule. It also allows for the generation of 2D coordinates if 3D conformation generation fails, and optionally removes hydrogen atoms and their coordinates from the resulting data.

Parameters:
  • smi – (str) The SMILES representation of the molecule.

  • seed – (int, optional) The random seed for conformation generation. Defaults to 42.

  • mode – (str, optional) The mode of conformation generation, ‘fast’ for quick generation, ‘heavy’ for more attempts. Defaults to ‘fast’.

  • remove_hs – (bool, optional) Whether to remove hydrogen atoms from the final coordinates. Defaults to True.

Returns:

A tuple containing the list of atom symbols and their corresponding 3D coordinates.

Raises:

AssertionError – If no atoms are present in the molecule or if the coordinates do not align with the atom count.

unimol_tools.data.conformer.inner_coords(atoms, coordinates, remove_hs=True)[source]

Processes a list of atoms and their corresponding coordinates to remove hydrogen atoms if specified. This function takes a list of atom symbols and their corresponding coordinates and optionally removes hydrogen atoms from the output. It includes assertions to ensure the integrity of the data and uses numpy for efficient processing of the coordinates.

Parameters:
  • atoms – (list) A list of atom symbols (e.g., [‘C’, ‘H’, ‘O’]).

  • coordinates – (list of tuples or list of lists) Coordinates corresponding to each atom in the atoms list.

  • remove_hs – (bool, optional) A flag to indicate whether hydrogen atoms should be removed from the output. Defaults to True.

Returns:

A tuple containing two elements: the filtered list of atom symbols and their corresponding coordinates. If remove_hs is False, the original lists are returned.

Raises:

AssertionError – If the length of atoms list does not match the length of coordinates list.
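The hydrogen filtering described for inner_coords can be sketched without numpy. The helper name `remove_hydrogens` is hypothetical, and plain lists are used where the library works with arrays.

```python
def remove_hydrogens(atoms, coordinates, remove_hs=True):
    """Drop hydrogen atoms and their coordinates, keeping both lists aligned."""
    assert len(atoms) == len(coordinates), "atoms and coordinates must have the same length"
    if not remove_hs:
        # Nothing to filter: hand back the inputs unchanged.
        return atoms, coordinates
    kept = [(a, c) for a, c in zip(atoms, coordinates) if a != "H"]
    new_atoms = [a for a, _ in kept]
    new_coords = [c for _, c in kept]
    return new_atoms, new_coords


atoms = ["C", "H", "O"]
coords = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.4, 0.0)]
heavy_atoms, heavy_coords = remove_hydrogens(atoms, coords)
```

Filtering atoms and coordinates in a single pass over the zipped pairs guarantees that the two outputs stay the same length.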

unimol_tools.data.conformer.coords2unimol(atoms, coordinates, dictionary, max_atoms=256, remove_hs=True, **params)[source]

Converts atom symbols and coordinates into a unified molecular representation.

Parameters:
  • atoms – (list) List of atom symbols.

  • coordinates – (ndarray) Array of atomic coordinates.

  • dictionary – (Dictionary) An object that maps atom symbols to unique integers.

  • max_atoms – (int) The maximum number of atoms to consider for the molecule.

  • remove_hs – (bool) Whether to remove hydrogen atoms from the representation.

  • params – Additional parameters.

Returns:

A dictionary containing the molecular representation with tokens, distances, coordinates, and edge types.
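The shape of the returned representation can be illustrated with a small sketch. Everything below is an assumption made for illustration: `coords_to_features` is a hypothetical name, a plain dict stands in for the Dictionary object, atoms beyond max_atoms are simply truncated, no special tokens are added, and the pair encoding `token_i * vocab_size + token_j` for edge types is one common choice rather than a documented detail.

```python
import math


def coords_to_features(atoms, coordinates, vocab, max_atoms=256):
    """Build tokens, pairwise distances, and pair-wise edge types for a molecule."""
    atoms = atoms[:max_atoms]
    coordinates = coordinates[:max_atoms]
    # Map each atom symbol to its integer token.
    tokens = [vocab[a] for a in atoms]
    # Euclidean distance between every pair of atoms.
    distances = [[math.dist(p, q) for q in coordinates] for p in coordinates]
    # One integer per ordered pair of atom types.
    edge_types = [[ti * len(vocab) + tj for tj in tokens] for ti in tokens]
    return {
        "tokens": tokens,
        "coordinates": coordinates,
        "distances": distances,
        "edge_types": edge_types,
    }


vocab = {"C": 0, "O": 1}
features = coords_to_features(["C", "O"], [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)], vocab)
```

The distance matrix is symmetric with a zero diagonal, while the edge-type matrix is not symmetric because the encoding distinguishes the order of the pair.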

class unimol_tools.data.conformer.UniMolV2Feature(**params)[source]

This class is responsible for generating features for molecules represented as SMILES strings. It uses the ConformerGen class to generate conformers for the molecules and converts the resulting atom symbols and coordinates into a unified molecular representation.

__init__(**params)[source]

Initializes the UniMolV2Feature object and its features based on the provided parameters.

Parameters:
  • model_name – (str) The name of the model to initialize.

  • params – Additional parameters for model configuration.

Returns:

An instance of the specified neural network model.

Raises:

ValueError – If the model name is not recognized.

_init_features(**params)[source]

Initializes the features of the UniMolV2Feature object based on provided parameters.

Parameters:

params – Arbitrary keyword arguments for feature configuration. These can include the random seed, maximum number of atoms, data type, generation method, generation mode, and whether to remove hydrogens.

single_process(smiles)[source]

Processes a single SMILES string to generate conformers using the specified method.

Parameters:

smiles – (str) The SMILES string representing the molecule.

Returns:

A unimolecular data representation (dictionary) of the molecule.

Raises:

ValueError – If the conformer generation method is unrecognized.

unimol_tools.data.conformer.create_mol_from_atoms_and_coords(atoms, coordinates)[source]

Creates an RDKit molecule object from a list of atom symbols and their corresponding coordinates.

Parameters:
  • atoms – (list) Atom symbols for the molecule.

  • coordinates – (list) Atomic coordinates for the molecule.

Returns:

RDKit molecule object.

unimol_tools.data.conformer.mol2unimolv2(mol, max_atoms=128, remove_hs=True, **params)[source]

Converts atom symbols and coordinates into a unified molecular representation.

Parameters:
  • mol – (rdkit.Chem.Mol) The molecule object containing atom symbols and coordinates.

  • max_atoms – (int) The maximum number of atoms to consider for the molecule.

  • remove_hs – (bool) Whether to remove hydrogen atoms from the representation. This must be True for UniMolV2.

  • params – Additional parameters.

Returns:

Batched data containing the molecular representation.

unimol_tools.data.conformer.safe_index(l, e)[source]

Returns the index of element e in list l. If e is not present, returns the last index.
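The described fallback can be written in a few lines. The name `safe_index_sketch` is hypothetical, and the idea that the last entry of the list acts as a catch-all category (here ‘misc’) is an assumption about how such a helper is typically used.

```python
def safe_index_sketch(l, e):
    """Return the position of e in l, or the last index when e is absent."""
    try:
        return l.index(e)
    except ValueError:
        return len(l) - 1


# The final entry plays the role of an "anything else" bucket.
allowed_symbols = ["C", "N", "O", "misc"]
known = safe_index_sketch(allowed_symbols, "N")
unknown = safe_index_sketch(allowed_symbols, "Xe")
```

With this convention every value maps to a valid index, so downstream feature vectors never contain an out-of-range entry.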

unimol_tools.data.conformer.atom_to_feature_vector(atom)[source]

Converts an RDKit atom object to a feature list of indices.

Parameters:

atom – The RDKit atom object.

Returns:

(list) The feature list of indices.

unimol_tools.data.conformer.bond_to_feature_vector(bond)[source]

Converts an RDKit bond object to a feature list of indices.

Parameters:

bond – The RDKit bond object.

Returns:

(list) The feature list of indices.

unimol_tools.data.conformer.get_graph(mol)[source]

Converts a molecule to a graph data object.

Parameters:

mol – The molecule to convert.

Returns:

The graph object.

Split

Classes and functions from unimol_tools.data.split.py, which manages the methods for splitting the dataset.

class unimol_tools.data.split.Splitter(method='random', kfold=5, seed=42, **params)[source]

The Splitter class is responsible for splitting a dataset into train and test sets based on the specified method.

__init__(method='random', kfold=5, seed=42, **params)[source]

Initializes the Splitter with a specified split method and random seed.

Parameters:
  • method – (str) The method for splitting the dataset. Defaults to ‘random’.

  • kfold – (int) The number of folds. Defaults to 5.

  • seed – (int) Random seed for reproducibility in random splitting. Defaults to 42.

_init_split()[source]

Initializes the actual splitter object based on the specified method.

Returns:

The initialized splitter object.

Raises:

ValueError – If an unknown splitting method is specified.

split(smiles, target=None, group=None, scaffolds=None, **params)[source]

Splits the dataset into train and test sets based on the initialized method.

Parameters:
  • smiles – The SMILES strings of the dataset to be split.

  • target – (optional) Target labels for stratified splitting. Defaults to None.

  • group – (optional) Group labels for group-based splitting. Defaults to None.

Returns:

An iterator yielding train and test set indices for each fold.

Raises:

ValueError – If the splitter method does not support the provided parameters.
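An iterator over per-fold train and test indices can be sketched for the random case with the standard library. The function name `kfold_random_split` is hypothetical, and the round-robin assignment of shuffled indices to folds is an assumption; the library may delegate to an existing k-fold implementation, and group-based or scaffold-based methods would partition differently.

```python
import random


def kfold_random_split(n_samples, kfold=5, seed=42):
    """Yield (train_indices, test_indices) for each of `kfold` random folds."""
    indices = list(range(n_samples))
    # A dedicated generator keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into kfold disjoint folds.
    folds = [indices[i::kfold] for i in range(kfold)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


smiles = ["C" * (i + 1) for i in range(10)]  # ten toy SMILES strings
splits = list(kfold_random_split(len(smiles), kfold=5, seed=42))
```

Each sample appears in exactly one test fold, and the same seed always reproduces the same partition.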