Interface
Train
unimol_tools.train.py trains a Uni-Mol model.
- class unimol_tools.train.MolTrain(task='classification', data_type='molecule', epochs=10, learning_rate=0.0001, batch_size=16, early_stopping=5, metrics='none', split='random', split_group_col='scaffold', kfold=5, save_path='./exp', remove_hs=False, smiles_col='SMILES', target_cols=None, target_col_prefix='TARGET', target_anomaly_check=False, smiles_check='filter', target_normalize='auto', max_norm=5.0, use_cuda=True, use_amp=True, use_ddp=False, use_gpu='all', freeze_layers=None, freeze_layers_reversed=False, load_model_dir=None, model_name='unimolv1', model_size='84m', conf_cache_level=1, **params)[source]
A MolTrain class is responsible for the interface of the training process for molecular data.
- __init__(task='classification', data_type='molecule', epochs=10, learning_rate=0.0001, batch_size=16, early_stopping=5, metrics='none', split='random', split_group_col='scaffold', kfold=5, save_path='./exp', remove_hs=False, smiles_col='SMILES', target_cols=None, target_col_prefix='TARGET', target_anomaly_check=False, smiles_check='filter', target_normalize='auto', max_norm=5.0, use_cuda=True, use_amp=True, use_ddp=False, use_gpu='all', freeze_layers=None, freeze_layers_reversed=False, load_model_dir=None, model_name='unimolv1', model_size='84m', conf_cache_level=1, **params)[source]
Initialize a MolTrain class.
- Parameters:
task – str, default=’classification’, currently support classification, regression, multiclass, multilabel_classification, multilabel_regression.
data_type – str, default=’molecule’, currently support molecule, oled.
epochs – int, default=10, number of epochs to train.
learning_rate – float, default=1e-4, learning rate of optimizer.
batch_size – int, default=16, batch size of training.
early_stopping – int, default=5, early stopping patience.
metrics –
str, default=’none’, metrics to evaluate model performance.
currently support:
classification: auc, auprc, log_loss, acc, f1_score, mcc, precision, recall, cohen_kappa.
regression: mae, pearsonr, spearmanr, mse, r2.
multiclass: log_loss, acc.
multilabel_classification: auc, auprc, log_loss, acc, mcc.
multilabel_regression: mae, mse, r2.
split –
str, default=’random’, split method of training dataset. currently support: random, scaffold, group, stratified, select.
random: random split.
scaffold: split by scaffold.
group: split by group. split_group_col should be specified.
stratified: stratified split. split_group_col should be specified.
select: use split_group_col to manually select the split group. Values in the split_group_col column should range from 0 to kfold-1 to indicate the split group.
split_group_col – str, default=’scaffold’, column name of group split.
kfold –
int, default=5, number of folds for k-fold cross validation.
1: no split. all data will be used for training.
save_path – str, default=’./exp’, path to save training results.
remove_hs – bool, default=False, whether to remove hydrogens from molecules.
smiles_col – str, default=’SMILES’, column name of SMILES.
target_cols – list or str, default=None, column names of target values.
target_col_prefix – str, default=’TARGET’, prefix of target column name.
target_anomaly_check – str, default=False, how to deal with anomaly target values. currently support: filter, none.
smiles_check – str, default=’filter’, how to deal with invalid SMILES. currently support: filter, none.
target_normalize – str, default=’auto’, how to normalize target values. ‘auto’ means the normalization strategy is chosen automatically. currently support: auto, minmax, standard, robust, log1p, none.
max_norm – float, default=5.0, max norm of gradient clipping.
use_cuda – bool, default=True, whether to use GPU.
use_amp – bool, default=True, whether to use automatic mixed precision.
use_ddp – bool, default=False, whether to use distributed data parallel.
use_gpu – str, default=’all’, which GPU to use. ‘all’ means use all GPUs. ‘0,1,2’ means use GPU 0, 1, 2.
freeze_layers – str or list, default=None, name prefixes of the layers to freeze. [‘encoder’, ‘gbf’] will freeze all the layers whose names start with ‘encoder’ or ‘gbf’.
freeze_layers_reversed – bool, default=False, invert the selection of frozen layers.
params – dict, default=None, other parameters.
load_model_dir – str, default=None, path to load model for transfer learning.
model_name – str, default=’unimolv1’, currently support unimolv1, unimolv2.
model_size – str, default=’84m’, model size. Takes effect only when model_name is unimolv2. Available: 84m, 164m, 310m, 570m, 1.1B.
conf_cache_level –
int, optional [0, 1, 2], default=1, configuration cache level to save the conformers to sdf file.
0: no caching.
1: cache if not exists.
2: always cache.
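The select split described above expects a column whose values run from 0 to kfold-1. A minimal sketch of preparing and checking such a column follows. The column name `fold` and the round-robin assignment are illustrative assumptions only; in practice the fold indices would come from whatever grouping you want to impose.

```python
from collections import Counter


def assign_folds(n_rows, kfold=5):
    """Return one fold index per row, cycling through 0..kfold-1."""
    if kfold < 1:
        raise ValueError("kfold must be at least 1")
    return [i % kfold for i in range(n_rows)]


def check_fold_column(values, kfold=5):
    """Raise if any value is not an integer in 0..kfold-1; return fold sizes."""
    bad = [v for v in values if not (isinstance(v, int) and 0 <= v < kfold)]
    if bad:
        raise ValueError(f"fold indices outside 0..{kfold - 1}: {bad[:5]}")
    return Counter(values)


# Toy rows; 'fold' is a hypothetical column name chosen for this sketch.
rows = [
    {"SMILES": "CCO", "TARGET": 0},
    {"SMILES": "CCN", "TARGET": 1},
    {"SMILES": "CCC", "TARGET": 0},
    {"SMILES": "c1ccccc1", "TARGET": 1},
]
for row, fold in zip(rows, assign_folds(len(rows), kfold=2)):
    row["fold"] = fold

fold_sizes = check_fold_column([row["fold"] for row in rows], kfold=2)
```

With a column like this in the training data, the parameter descriptions above suggest passing split='select', split_group_col='fold' and the matching kfold to MolTrain.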
- fit(data)[source]
Fit the model according to the given training data with multi datasource support, including SMILES csv file and custom coordinate data.
Example with custom coordinate data:
    from unimol_tools import MolTrain
    import numpy as np

    # 100 toy molecules, each with six atoms and random 3D coordinates
    custom_data = {
        'target': np.random.randint(2, size=100),
        'atoms': [['C', 'C', 'H', 'H', 'H', 'H'] for _ in range(100)],
        'coordinates': [np.random.randn(6, 3) for _ in range(100)],
    }
    clf = MolTrain()
    clf.fit(custom_data)
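Before handing a custom dictionary to fit, it can help to confirm that its pieces line up. The sketch below is not part of unimol_tools; `validate_custom_data` is a hypothetical helper, and the layout it checks (one atom list and one N x 3 coordinate block per molecule, plus an optional target per molecule) is inferred from the example above rather than from a documented schema.

```python
def validate_custom_data(data):
    """Check per-molecule consistency of atoms, coordinates and targets.

    Returns the number of molecules, or raises ValueError on a mismatch.
    """
    atoms = data["atoms"]
    coords = data["coordinates"]
    if len(atoms) != len(coords):
        raise ValueError("atoms and coordinates must list the same number of molecules")
    for idx, (mol_atoms, mol_coords) in enumerate(zip(atoms, coords)):
        if len(mol_atoms) != len(mol_coords):
            raise ValueError(f"molecule {idx}: atom count differs from coordinate rows")
        for xyz in mol_coords:
            if len(xyz) != 3:
                raise ValueError(f"molecule {idx}: each coordinate needs x, y and z")
    if "target" in data and len(data["target"]) != len(atoms):
        raise ValueError("target must have one entry per molecule")
    return len(atoms)


# Two toy molecules written with plain lists; NumPy arrays also work here
# because only len() and iteration are used.
example = {
    "target": [0, 1],
    "atoms": [["O", "H", "H"], ["C", "O"]],
    "coordinates": [
        [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
        [[0.0, 0.0, 0.0], [1.13, 0.0, 0.0]],
    ],
}
n_molecules = validate_custom_data(example)
```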
Predict
unimol_tools.predict.py predicts with a trained Uni-Mol model.
- class unimol_tools.predict.MolPredict(load_model=None)[source]
A MolPredict class is responsible for the interface of the prediction process for molecular data.
- __init__(load_model=None)[source]
Initialize a MolPredict class.
- Parameters:
load_model – str, default=None, path of model to load.
- predict(data, save_path=None, metrics='none')[source]
Predict molecular data.
- Parameters:
data –
str or pandas.DataFrame or dict of atoms and coordinates, input data for prediction.
str: path of csv file.
pandas.DataFrame: dataframe of data.
dict: dict of atoms and coordinates, e.g. {‘atoms’: [‘C’, ‘C’, ‘C’], ‘coordinates’: [[0, 0, 0], [0, 0, 1], [0, 0, 2]]}
save_path – str, default=None, path to save predict result.
metrics –
str, default=’none’, metrics to evaluate model performance.
currently support:
classification: auc, auprc, log_loss, acc, f1_score, mcc, precision, recall, cohen_kappa.
regression: mae, pearsonr, spearmanr, mse, r2.
multiclass: log_loss, acc.
multilabel_classification: auc, auprc, log_loss, acc, mcc.
multilabel_regression: mae, mse, r2.
- Return y_pred:
numpy.ndarray, predict result.
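The metric lists above can be encoded as a small lookup so that a mismatched task and metric is caught before a model is loaded. This is a sketch under assumptions: `check_metric` and `run_prediction` are hypothetical helpers, the value 'none' is passed through unchanged because its exact effect is not described here, and `run_prediction` is only defined, not executed.

```python
# Metric names copied from the parameter description above.
SUPPORTED_METRICS = {
    "classification": {"auc", "auprc", "log_loss", "acc", "f1_score",
                       "mcc", "precision", "recall", "cohen_kappa"},
    "regression": {"mae", "pearsonr", "spearmanr", "mse", "r2"},
    "multiclass": {"log_loss", "acc"},
    "multilabel_classification": {"auc", "auprc", "log_loss", "acc", "mcc"},
    "multilabel_regression": {"mae", "mse", "r2"},
}


def check_metric(task, metric):
    """Return the metric if it is listed for the task, else raise ValueError."""
    if metric == "none":
        return metric  # default value; left to the library to interpret
    if task not in SUPPORTED_METRICS:
        raise ValueError(f"unknown task: {task!r}")
    if metric not in SUPPORTED_METRICS[task]:
        allowed = ", ".join(sorted(SUPPORTED_METRICS[task]))
        raise ValueError(f"{metric!r} is not listed for {task}; choose from: {allowed}")
    return metric


def run_prediction(model_dir, csv_path, task, metric="none"):
    """Hypothetical wrapper around MolPredict; needs unimol_tools and a trained model."""
    from unimol_tools.predict import MolPredict  # module path as given in the class entry above

    check_metric(task, metric)
    predictor = MolPredict(load_model=model_dir)
    return predictor.predict(data=csv_path, metrics=metric)


checked = check_metric("regression", "mae")
```

A call such as run_prediction('./exp', 'test.csv', 'regression', 'mae') would then return the numpy.ndarray described above; both paths here are placeholders.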
Uni-Mol representation
unimol_tools.predictor.py gets the Uni-Mol representation.
- class unimol_tools.predictor.MolDataset(*args: Any, **kwargs: Any)[source]
A MolDataset class is responsible for the interface of a molecular dataset.
- class unimol_tools.predictor.UniMolRepr(data_type='molecule', batch_size=32, remove_hs=False, model_name='unimolv1', model_size='84m', use_cuda=True, use_ddp=False, use_gpu='all', save_path=None, **kwargs)[source]
A UniMolRepr class is responsible for the interface of molecular representation by Uni-Mol.
- __init__(data_type='molecule', batch_size=32, remove_hs=False, model_name='unimolv1', model_size='84m', use_cuda=True, use_ddp=False, use_gpu='all', save_path=None, **kwargs)[source]
Initialize a UniMolRepr class.
- Parameters:
data_type – str, default=’molecule’, currently support molecule, oled.
batch_size – int, default=32, batch size used when computing representations.
remove_hs – bool, default=False, whether to remove hydrogens from molecules.
model_name – str, default=’unimolv1’, currently support unimolv1, unimolv2.
model_size – str, default=’84m’, model size of unimolv2. Available: 84m, 164m, 310m, 570m, 1.1B.
use_cuda – bool, default=True, whether to use gpu.
use_ddp – bool, default=False, whether to use distributed data parallel.
use_gpu – str, default=’all’, which gpu to use.
- get_repr(data=None, return_atomic_reprs=False)[source]
Get molecular representation by unimol.
- Parameters:
data –
str, dict or list, default=None, input data for unimol.
str: smiles string or path to a smiles file.
dict: custom conformers, should take atoms and coordinates as input.
list: list of smiles strings.
return_atomic_reprs – bool, default=False, whether to return atomic representations.
- Returns:
dict of molecular representation.
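The three input forms accepted by get_repr can be told apart with a small dispatcher, sketched below. `describe_input` and `embed` are hypothetical helpers written for illustration; `embed` is defined but not run here, since it needs unimol_tools and model weights. The keys of the returned dictionary are not listed in this section, so the sketch does not assume any.

```python
def describe_input(data):
    """Classify data into one of the three input forms listed above."""
    if isinstance(data, str):
        return "smiles string or path to a smiles file"
    if isinstance(data, dict):
        missing = {"atoms", "coordinates"} - set(data)
        if missing:
            raise ValueError(f"custom conformers need keys: {sorted(missing)}")
        return "custom conformers"
    if isinstance(data, list) and all(isinstance(item, str) for item in data):
        return "list of smiles strings"
    raise TypeError(f"unsupported input type: {type(data).__name__}")


def embed(data, return_atomic_reprs=False):
    """Hypothetical wrapper: build a UniMolRepr and return its representation dict."""
    from unimol_tools.predictor import UniMolRepr  # module path as given in the class entry above

    describe_input(data)  # fail early on an unsupported input form
    model = UniMolRepr(data_type="molecule", remove_hs=False)
    return model.get_repr(data, return_atomic_reprs=return_atomic_reprs)


kind = describe_input(["CCO", "c1ccccc1"])
```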