Interface

Train

The unimol_tools.train module trains a Uni-Mol model.

class unimol_tools.train.MolTrain(task='classification', data_type='molecule', epochs=10, learning_rate=0.0001, batch_size=16, early_stopping=5, metrics='none', split='random', split_group_col='scaffold', kfold=5, save_path='./exp', remove_hs=False, smiles_col='SMILES', target_cols=None, target_col_prefix='TARGET', target_anomaly_check=False, smiles_check='filter', target_normalize='auto', max_norm=5.0, use_cuda=True, use_amp=True, use_ddp=False, use_gpu='all', freeze_layers=None, freeze_layers_reversed=False, load_model_dir=None, model_name='unimolv1', model_size='84m', conf_cache_level=1, **params)[source]

The MolTrain class provides the interface for training on molecular data.

__init__(task='classification', data_type='molecule', epochs=10, learning_rate=0.0001, batch_size=16, early_stopping=5, metrics='none', split='random', split_group_col='scaffold', kfold=5, save_path='./exp', remove_hs=False, smiles_col='SMILES', target_cols=None, target_col_prefix='TARGET', target_anomaly_check=False, smiles_check='filter', target_normalize='auto', max_norm=5.0, use_cuda=True, use_amp=True, use_ddp=False, use_gpu='all', freeze_layers=None, freeze_layers_reversed=False, load_model_dir=None, model_name='unimolv1', model_size='84m', conf_cache_level=1, **params)[source]

Initialize a MolTrain instance.

Parameters:

task – str, default='classification', currently supports classification, regression, multiclass, multilabel_classification, multilabel_regression.
data_type – str, default='molecule', currently supports molecule, oled.
epochs – int, default=10, number of epochs to train.
learning_rate – float, default=1e-4, learning rate of optimizer.
batch_size – int, default=16, batch size of training.
early_stopping – int, default=5, early stopping patience.
metrics –
str, default='none', metrics used to evaluate model performance.

Currently supported:
- classification: auc, auprc, log_loss, acc, f1_score, mcc, precision, recall, cohen_kappa.
- regression: mae, pearsonr, spearmanr, mse, r2.
- multiclass: log_loss, acc.
- multilabel_classification: auc, auprc, log_loss, acc, mcc.
- multilabel_regression: mae, mse, r2.
split –
str, default='random', split method for the training dataset. Currently supports: random, scaffold, group, stratified, select.
- random: random split.
- scaffold: split by scaffold.
- group: split by group. split_group_col must be specified.
- stratified: stratified split. split_group_col must be specified.
- select: use split_group_col to manually assign the split group. Values in the split_group_col column should range from 0 to kfold-1 to indicate the split group.
split_group_col – str, default='scaffold', column name used for the group split.
kfold –
int, default=5, number of folds for k-fold cross validation.
- 1: no split. All data will be used for training.
save_path – str, default='./exp', path to save training results.
remove_hs – bool, default=False, whether to remove hydrogens from molecules.
smiles_col – str, default='SMILES', column name of SMILES.
target_cols – list or str, default=None, column names of target values.
target_col_prefix – str, default='TARGET', prefix of target column names.
target_anomaly_check – str, default=False, how to handle anomalous target values. Currently supports: filter, none.
smiles_check – str, default='filter', how to handle invalid SMILES. Currently supports: filter, none.
target_normalize – str, default='auto', how to normalize target values. 'auto' means the normalization strategy is chosen automatically. Currently supports: auto, minmax, standard, robust, log1p, none.
max_norm – float, default=5.0, max norm of gradient clipping.
use_cuda – bool, default=True, whether to use GPU.
use_amp – bool, default=True, whether to use automatic mixed precision.
use_ddp – bool, default=False, whether to use distributed data parallel.
use_gpu – str, default='all', which GPU to use. 'all' means use all GPUs. '0,1,2' means use GPU 0, 1, 2.
freeze_layers – str or list, default=None, layers to freeze, selected by name prefix. ['encoder', 'gbf'] will freeze all layers whose names start with 'encoder' or 'gbf'.
freeze_layers_reversed – bool, default=False, invert the selection of frozen layers.
params – dict, default=None, additional parameters.
load_model_dir – str, default=None, path to load model for transfer learning.
model_name – str, default='unimolv1', currently supports unimolv1, unimolv2.
model_size – str, default='84m', model size. Only takes effect when model_name is unimolv2. Available: 84m, 164m, 310m, 570m, 1.1B.
conf_cache_level –
int, optional [0, 1, 2], default=1, cache level for saving conformers to an SDF file.
- 0: no caching.
- 1: cache if not exists.
- 2: always cache.
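
As a rough illustration of the split='select' option described above, the sketch below writes a CSV whose fold column holds values from 0 to kfold-1 and then passes that column as split_group_col. The helper names (assign_folds, write_training_csv, train_with_manual_folds) and the column name "fold" are hypothetical, chosen only for this example. It also assumes that a column named TARGET is picked up through the default target_col_prefix.

```python
import csv


def assign_folds(n_rows, kfold):
    """Round-robin fold ids in 0..kfold-1, the range split='select' expects."""
    return [i % kfold for i in range(n_rows)]


def write_training_csv(path, smiles, targets, kfold, fold_col="fold"):
    """Write SMILES, TARGET and a manual fold column (hypothetical layout)."""
    folds = assign_folds(len(smiles), kfold)
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["SMILES", "TARGET", fold_col])
        writer.writerows(zip(smiles, targets, folds))
    return folds


def train_with_manual_folds(csv_path, kfold=5, fold_col="fold"):
    """Train a regression model using the manually assigned folds."""
    # Imported lazily so the helpers above work without unimol_tools installed.
    from unimol_tools import MolTrain

    trainer = MolTrain(
        task="regression",
        metrics="mse",
        split="select",
        split_group_col=fold_col,  # values must lie in 0..kfold-1
        kfold=kfold,
        save_path="./exp",
    )
    trainer.fit(csv_path)
    return trainer
```

Round-robin assignment is only one way to fill the fold column. Any scheme works as long as every value falls in the range 0 to kfold-1.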

fit(data)[source]

Fit the model to the given training data. Multiple data sources are supported, including a CSV file of SMILES strings and custom coordinate data.

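Example using custom coordinate data:

from unimol_tools import MolTrain
import numpy as np

n_mols = 100
# Toy data: random binary targets, six atoms per molecule, random 3D coordinates.
custom_data = {
    'target': np.random.randint(2, size=n_mols),
    'atoms': [['C', 'C', 'H', 'H', 'H', 'H'] for _ in range(n_mols)],
    'coordinates': [np.random.randn(6, 3) for _ in range(n_mols)],
}

clf = MolTrain()  # default task is 'classification'
clf.fit(custom_data)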

update_and_save_config()[source]

Update and save the config file.

Predict

The unimol_tools.predict module makes predictions with a trained Uni-Mol model.

class unimol_tools.predict.MolPredict(load_model=None)[source]

The MolPredict class provides the interface for running predictions on molecular data.

__init__(load_model=None)[source]

Initialize a MolPredict instance.

Parameters:

load_model – str, default=None, path of the model to load.

predict(data, save_path=None, metrics='none')[source]

Predict molecular data.

Parameters:

data –
str, pandas.DataFrame, or dict of atoms and coordinates, input data for prediction.
- str: path of a CSV file.
- pandas.DataFrame: dataframe of data.
- dict: dict of atoms and coordinates, e.g. {'atoms': ['C', 'C', 'C'], 'coordinates': [[0, 0, 0], [0, 0, 1], [0, 0, 2]]}
save_path – str, default=None, path to save prediction results.
metrics –
str, default='none', metrics used to evaluate model performance.

Currently supported:
- classification: auc, auprc, log_loss, acc, f1_score, mcc, precision, recall, cohen_kappa.
- regression: mae, pearsonr, spearmanr, mse, r2.
- multiclass: log_loss, acc.
- multilabel_classification: auc, auprc, log_loss, acc, mcc.
- multilabel_regression: mae, mse, r2.

Returns:

y_pred – numpy.ndarray, prediction results.
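
The following sketch shows one way predict might be called with a dict of atoms and coordinates. It assumes the dict holds one atom list and one coordinate array per molecule, mirroring the layout used in the fit example. The helper names check_conformers and predict_from_conformers are hypothetical and not part of unimol_tools. The consistency check is a plain sanity test run before the model is loaded.

```python
def check_conformers(data):
    """Return True if every molecule has one (x, y, z) triple per atom."""
    atoms, coords = data["atoms"], data["coordinates"]
    if len(atoms) != len(coords):
        return False
    return all(
        len(mol_atoms) == len(mol_coords)
        and all(len(xyz) == 3 for xyz in mol_coords)
        for mol_atoms, mol_coords in zip(atoms, coords)
    )


def predict_from_conformers(model_dir, data, metrics="none"):
    """Load a trained model from model_dir and predict on custom conformers."""
    # Imported lazily so check_conformers works without unimol_tools installed.
    from unimol_tools.predict import MolPredict

    if not check_conformers(data):
        raise ValueError("atoms and coordinates do not line up")
    predictor = MolPredict(load_model=model_dir)
    return predictor.predict(data, metrics=metrics)


# Toy input: a single two-atom "molecule" with made-up coordinates.
toy = {
    "atoms": [["C", "O"]],
    "coordinates": [[[0.0, 0.0, 0.0], [0.0, 0.0, 1.2]]],
}
```

A call such as predict_from_conformers('./exp', toy) would then return the y_pred array, assuming ./exp contains a model saved by MolTrain.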

save_predict(data, dir, prefix)[source]

Save prediction results to a CSV file.

Parameters:

data – pandas.DataFrame, prediction results.
dir – str, directory in which to save the prediction results.
prefix – str, prefix of the prediction result file name.

Uni-Mol representation

The unimol_tools.predictor module extracts Uni-Mol representations.

class unimol_tools.predictor.MolDataset(*args: Any, **kwargs: Any)[source]

The MolDataset class provides the dataset interface for molecular data.

__init__(data, label=None)[source]

class unimol_tools.predictor.UniMolRepr(data_type='molecule', batch_size=32, remove_hs=False, model_name='unimolv1', model_size='84m', use_cuda=True, use_ddp=False, use_gpu='all', save_path=None, **kwargs)[source]

The UniMolRepr class provides the interface for computing molecular representations with Uni-Mol.

__init__(data_type='molecule', batch_size=32, remove_hs=False, model_name='unimolv1', model_size='84m', use_cuda=True, use_ddp=False, use_gpu='all', save_path=None, **kwargs)[source]

Initialize a UniMolRepr instance.

Parameters:

data_type – str, default='molecule', currently supports molecule, oled.
batch_size – int, default=32, batch size used when computing representations.
remove_hs – bool, default=False, whether to remove hydrogens from molecules.
model_name – str, default='unimolv1', currently supports unimolv1, unimolv2.
model_size – str, default='84m', model size of unimolv2. Available: 84m, 164m, 310m, 570m, 1.1B.
use_cuda – bool, default=True, whether to use GPU.
use_ddp – bool, default=False, whether to use distributed data parallel.
use_gpu – str, default='all', which GPU to use.

get_repr(data=None, return_atomic_reprs=False)[source]

Get molecular representations from Uni-Mol.

Parameters:

data –
str, dict or list, default=None, input data for Uni-Mol.
- str: SMILES string or path to a SMILES file.
- dict: custom conformers, which should provide atoms and coordinates as input.
- list: list of SMILES strings.
return_atomic_reprs – bool, default=False, whether to return atomic representations.

Returns:

dict of molecular representations.
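
A minimal sketch of calling get_repr on a list of SMILES strings follows. The helpers dedupe_smiles and embed_smiles are hypothetical names introduced here. Removing duplicates before embedding is an optional convenience and not something the library requires. The keys of the returned dict are not listed in this reference, so the sketch returns the dict as-is without assuming any of them.

```python
def dedupe_smiles(smiles):
    """Drop repeated SMILES strings while keeping first-seen order."""
    return list(dict.fromkeys(smiles))


def embed_smiles(smiles, atomic=False):
    """Compute Uni-Mol representations for a list of SMILES strings."""
    # Imported lazily so dedupe_smiles works without unimol_tools installed.
    from unimol_tools.predictor import UniMolRepr

    encoder = UniMolRepr(
        data_type="molecule",
        remove_hs=False,
        model_name="unimolv1",
        use_cuda=False,  # run on CPU; set True to use a GPU
    )
    unique = dedupe_smiles(smiles)
    # Returns a dict of representations for the de-duplicated list.
    return encoder.get_repr(unique, return_atomic_reprs=atomic)
```

Because duplicates are removed, the returned representations correspond to dedupe_smiles(smiles) rather than to the original list. Callers that need one entry per input row should skip that step.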