Interface

Train

unimol_tools.train.py trains a Uni-Mol model.

class unimol_tools.train.MolTrain(task='classification', data_type='molecule', epochs=10, learning_rate=0.0001, batch_size=16, early_stopping=5, metrics='none', split='random', split_group_col='scaffold', kfold=5, save_path='./exp', remove_hs=False, smiles_col='SMILES', target_cols=None, target_col_prefix='TARGET', target_anomaly_check='filter', smiles_check='filter', target_normalize='auto', max_norm=5.0, use_cuda=True, use_amp=True, freeze_layers=None, freeze_layers_reversed=False, load_model_dir=None, model_name='unimolv1', model_size='84m', **params)

The MolTrain class provides the interface for training on molecular data.

__init__(task='classification', data_type='molecule', epochs=10, learning_rate=0.0001, batch_size=16, early_stopping=5, metrics='none', split='random', split_group_col='scaffold', kfold=5, save_path='./exp', remove_hs=False, smiles_col='SMILES', target_cols=None, target_col_prefix='TARGET', target_anomaly_check='filter', smiles_check='filter', target_normalize='auto', max_norm=5.0, use_cuda=True, use_amp=True, freeze_layers=None, freeze_layers_reversed=False, load_model_dir=None, model_name='unimolv1', model_size='84m', **params)

Initialize a MolTrain class.

Parameters:
  • task – str, default=’classification’, currently support classification, regression, multiclass, multilabel_classification, multilabel_regression.

  • data_type – str, default=’molecule’, currently support molecule, oled.

  • epochs – int, default=10, number of epochs to train.

  • learning_rate – float, default=1e-4, learning rate of optimizer.

  • batch_size – int, default=16, batch size of training.

  • early_stopping – int, default=5, early stopping patience.

  • metrics

    str, default=’none’, metrics to evaluate model performance.

    currently support:

    • classification: auc, auprc, log_loss, acc, f1_score, mcc, precision, recall, cohen_kappa.

    • regression: mae, pearsonr, spearmanr, mse, r2.

    • multiclass: log_loss, acc.

    • multilabel_classification: auc, auprc, log_loss, acc, mcc.

    • multilabel_regression: mae, mse, r2.

  • split

    str, default=’random’, split method of training dataset. currently support: random, scaffold, group, stratified, select.

    • random: random split.

    • scaffold: split by scaffold.

    • group: split by group. split_group_col should be specified.

    • stratified: stratified split. split_group_col should be specified.

    • select: use split_group_col to manually select the split group. Column values of split_group_col should range from 0 to kfold-1 to indicate the split group.

  • split_group_col – str, default=’scaffold’, column name of group split.

  • kfold

    int, default=5, number of folds for k-fold cross validation.

    • 1: no split. all data will be used for training.

  • save_path – str, default=’./exp’, path to save training results.

  • remove_hs – bool, default=False, whether to remove hydrogens from molecules.

  • smiles_col – str, default=’SMILES’, column name of SMILES.

  • target_cols – list or str, default=None, column names of target values.

  • target_col_prefix – str, default=’TARGET’, prefix of target column name.

  • target_anomaly_check – str, default=’filter’, how to deal with anomaly target values. currently support: filter, none.

  • smiles_check – str, default=’filter’, how to deal with invalid SMILES. currently support: filter, none.

  • target_normalize – str, default=’auto’, how to normalize target values. ‘auto’ means the normalization strategy is chosen automatically. currently support: auto, minmax, standard, robust, log1p, none.

  • max_norm – float, default=5.0, max norm of gradient clipping.

  • use_cuda – bool, default=True, whether to use GPU.

  • use_amp – bool, default=True, whether to use automatic mixed precision.

  • freeze_layers – str or list, default=None, name prefixes of the layers to freeze. For example, [‘encoder’, ‘gbf’] freezes all layers whose names start with ‘encoder’ or ‘gbf’.

  • freeze_layers_reversed – bool, default=False, whether to invert the selection of frozen layers.

  • params – dict, default=None, other parameters.

  • load_model_dir – str, default=None, path to load model for transfer learning.

  • model_name – str, default=’unimolv1’, currently support unimolv1, unimolv2.

  • model_size – str, default=’84m’, model size. Only takes effect when model_name is unimolv2. available: 84m, 164m, 310m, 570m, 1.1B.
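
The select split expects a column of fold indices between 0 and kfold-1. The following is a minimal sketch of preparing such a column and passing the matching arguments to MolTrain. The helpers assign_folds, write_training_csv and run_training, the column name fold, and the round-robin assignment are illustrative assumptions and not part of unimol_tools; the keyword arguments come from the signature above. The sketch also assumes that a column named TARGET is picked up through the default target_col_prefix.

```python
import csv

KFOLD = 5


def assign_folds(n_rows, kfold=KFOLD):
    """Hypothetical helper: round-robin fold indices in the range 0..kfold-1."""
    return [i % kfold for i in range(n_rows)]


def write_training_csv(path, smiles, targets, kfold=KFOLD):
    """Write SMILES, TARGET and a 'fold' column (the name 'fold' is our choice)."""
    folds = assign_folds(len(smiles), kfold)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["SMILES", "TARGET", "fold"])
        writer.writerows(zip(smiles, targets, folds))
    return folds


# Keyword arguments taken from the MolTrain signature documented above.
train_kwargs = dict(
    task="regression",
    metrics="mse",
    split="select",          # use the fold column instead of a random split
    split_group_col="fold",  # must match the column written above
    kfold=KFOLD,
    save_path="./exp",
)


def run_training(csv_path):
    """Train on the CSV; requires unimol_tools to be installed."""
    from unimol_tools import MolTrain  # lazy import so the helpers work without it

    trainer = MolTrain(**train_kwargs)
    trainer.fit(csv_path)
    return trainer
```

After writing the file with write_training_csv, calling run_training on the same path starts the cross-validation run. In practice the fold assignment would follow whatever grouping the experiment calls for rather than a round-robin.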

fit(data)

Fit the model to the given training data. Multiple data sources are supported, including a SMILES csv file and custom coordinate data.

For example, with custom coordinate data:

from unimol_tools import MolTrain
import numpy as np

# 100 toy molecules with random binary labels and random 3D coordinates
custom_data = {
    'target': np.random.randint(2, size=100),
    'atoms': [['C', 'C', 'H', 'H', 'H', 'H'] for _ in range(100)],
    'coordinates': [np.random.randn(6, 3) for _ in range(100)],
}

clf = MolTrain()
clf.fit(custom_data)
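
Custom coordinate data is easy to get subtly wrong, for instance when one molecule has more atom symbols than coordinates. The sketch below checks the dict layout used in the example above before it is passed to fit. The function check_custom_data is a hypothetical helper and not part of unimol_tools, and the checks reflect only the layout shown in the example.

```python
def check_custom_data(data):
    """Hypothetical sanity check for the atoms/coordinates dict layout.

    Returns the number of molecules; raises ValueError on a mismatch.
    """
    atoms = data["atoms"]
    coords = data["coordinates"]
    if len(atoms) != len(coords):
        raise ValueError("atoms and coordinates must list the same number of molecules")
    if "target" in data and len(data["target"]) != len(atoms):
        raise ValueError("target must have one entry per molecule")
    for idx, (mol_atoms, mol_coords) in enumerate(zip(atoms, coords)):
        if len(mol_atoms) != len(mol_coords):
            raise ValueError(
                f"molecule {idx}: {len(mol_atoms)} atoms but {len(mol_coords)} coordinates"
            )
        if any(len(xyz) != 3 for xyz in mol_coords):
            raise ValueError(f"molecule {idx}: every coordinate needs x, y and z")
    return len(atoms)


# Two toy molecules with made-up coordinates, for illustration only.
toy = {
    "target": [0, 1],
    "atoms": [["O", "H", "H"], ["C", "O"]],
    "coordinates": [
        [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
        [[0.0, 0.0, 0.0], [1.13, 0.0, 0.0]],
    ],
}
n_molecules = check_custom_data(toy)
```

The helper relies only on len, so it accepts plain lists as well as NumPy arrays.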

update_and_save_config()

Update and save config file.

Predict

unimol_tools.predict.py predicts with a Uni-Mol model.

class unimol_tools.predict.MolPredict(load_model=None)

The MolPredict class provides the interface for predicting on molecular data.

__init__(load_model=None)

Initialize a MolPredict class.

Parameters:
  • load_model – str, default=None, path of model to load.

predict(data, save_path=None, metrics='none')

Predict molecular data.

Parameters:
  • data

    str or pandas.DataFrame or dict of atoms and coordinates, input data for prediction.

    • str: path of csv file.

    • pandas.DataFrame: dataframe of data.

    • dict: dict of atoms and coordinates, e.g. {‘atoms’: [‘C’, ‘C’, ‘C’], ‘coordinates’: [[0, 0, 0], [0, 0, 1], [0, 0, 2]]}

  • save_path – str, default=None, path to save predict result.

  • metrics

    str, default=’none’, metrics to evaluate model performance.

    currently support:

    • classification: auc, auprc, log_loss, acc, f1_score, mcc, precision, recall, cohen_kappa.

    • regression: mae, pearsonr, spearmanr, mse, r2.

    • multiclass: log_loss, acc.

    • multilabel_classification: auc, auprc, log_loss, acc, mcc.

    • multilabel_regression: mae, mse, r2.

Returns:

y_pred – numpy.ndarray, prediction result.
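
Because the accepted metric names depend on the task, it can help to check the name before loading a model. The sketch below copies the metric lists from the parameter description above into a lookup table and wraps the MolPredict call. The names SUPPORTED_METRICS, is_supported_metric and run_prediction are illustrative and not part of unimol_tools, and treating 'none' as always valid is an assumption based on it being the default.

```python
# Metric names per task, copied from the list documented above.
SUPPORTED_METRICS = {
    "classification": {"auc", "auprc", "log_loss", "acc", "f1_score", "mcc",
                       "precision", "recall", "cohen_kappa"},
    "regression": {"mae", "pearsonr", "spearmanr", "mse", "r2"},
    "multiclass": {"log_loss", "acc"},
    "multilabel_classification": {"auc", "auprc", "log_loss", "acc", "mcc"},
    "multilabel_regression": {"mae", "mse", "r2"},
}


def is_supported_metric(task, metric):
    """Hypothetical check; 'none' (the default) is assumed to be always accepted."""
    return metric == "none" or metric in SUPPORTED_METRICS.get(task, set())


def run_prediction(model_dir, data, task, metric="none", save_path=None):
    """Load a trained model and predict; requires unimol_tools to be installed."""
    if not is_supported_metric(task, metric):
        raise ValueError(f"metric {metric!r} is not listed for task {task!r}")
    from unimol_tools.predict import MolPredict  # lazy import, documented module path

    predictor = MolPredict(load_model=model_dir)
    return predictor.predict(data, save_path=save_path, metrics=metric)
```

A call such as run_prediction('./exp', 'test.csv', 'regression', metric='mae') would then return the y_pred array described above, where './exp' and 'test.csv' are placeholder paths.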

save_predict(data, dir, prefix)

Save predict result to csv file.

Parameters:
  • data – pandas.DataFrame, predict result.

  • dir – str, directory to save predict result.

  • prefix – str, prefix of predict result file name.

Uni-Mol representation

unimol_tools.predictor.py gets the Uni-Mol representation.

class unimol_tools.predictor.MolDataset(*args: Any, **kwargs: Any)

The MolDataset class provides the interface for molecular datasets.

__init__(data, label=None)

class unimol_tools.predictor.UniMolRepr(data_type='molecule', remove_hs=False, model_name='unimolv1', model_size='84m', use_gpu=True)

The UniMolRepr class provides the interface for computing molecular representations with Uni-Mol.

__init__(data_type='molecule', remove_hs=False, model_name='unimolv1', model_size='84m', use_gpu=True)

Initialize a UniMolRepr class.

Parameters:
  • data_type – str, default=’molecule’, currently support molecule, oled.

  • remove_hs – bool, default=False, whether to remove hydrogens from molecules.

  • use_gpu – bool, default=True, whether to use gpu.

  • model_name – str, default=’unimolv1’, currently support unimolv1, unimolv2.

  • model_size – str, default=’84m’, model size of unimolv2.

get_repr(data=None, return_atomic_reprs=False)

Get molecular representation by unimol.

Parameters:
  • data

    str, dict or list, default=None, input data for unimol.

    • str: smiles string or path to a smiles file.

    • dict: custom conformers, should take atoms and coordinates as input.

    • list: list of smiles strings.

  • return_atomic_reprs – bool, default=False, whether to return atomic representations.

Returns:

dict of molecular representation.
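
A common use of the returned representations is comparing molecules. The sketch below embeds a list of SMILES strings and compares two vectors by cosine similarity. The functions cosine_similarity and embed_smiles are illustrative and not part of unimol_tools. The dictionary key 'cls_repr' for the molecule-level vectors is an assumption, since the reference above only states that a dict is returned; inspect the keys of the result before relying on it.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def embed_smiles(smiles_list, repr_key="cls_repr"):
    """Return one vector per SMILES; requires unimol_tools to be installed.

    ASSUMPTION: molecule-level vectors are stored under 'cls_repr'.
    The documented return value is only "dict of molecular representation".
    """
    from unimol_tools.predictor import UniMolRepr  # lazy import, documented module path

    model = UniMolRepr(data_type="molecule", remove_hs=False)
    reprs = model.get_repr(smiles_list, return_atomic_reprs=False)
    return [list(vec) for vec in reprs[repr_key]]
```

With unimol_tools installed, the two vectors returned by embed_smiles(['CCO', 'CCN']) could be passed to cosine_similarity to obtain a rough similarity score.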