Models
unimol_tools.models contains the models of Uni-Mol.
Uni-Mol
unimol_tools.models.unimol.py contains the UniMolModel, which is the backbone of the Uni-Mol model.
- class unimol_tools.models.unimol.UniMolModel(*args: Any, **kwargs: Any)[source]
UniMolModel is a specialized model for molecular, protein, crystal, or MOF (Metal-Organic Frameworks) data. It dynamically configures its architecture based on the type of data it is intended to work with. The model supports multiple data types and incorporates various architecture configurations and pretrained weights.
- Attributes:
output_dim: The dimension of the output layer.
data_type: The type of data the model is designed to handle.
remove_hs: Flag to indicate whether hydrogen atoms are removed in molecular data.
pretrain_path: Path to the pretrained model weights.
dictionary: The dictionary object used for tokenization and encoding.
mask_idx: Index of the mask token in the dictionary.
padding_idx: Index of the padding token in the dictionary.
embed_tokens: Embedding layer for token embeddings.
encoder: Transformer encoder backbone of the model.
gbf_proj, gbf: Layers for Gaussian basis functions or numerical embeddings.
classification_head: The final classification head of the model.
- __init__(output_dim=2, data_type='molecule', **params)[source]
Initializes the UniMolModel with specified parameters and data type.
- Parameters:
output_dim – (int) The number of output dimensions (classes).
data_type – (str) The type of data (e.g., ‘molecule’, ‘protein’).
params – Additional parameters for model configuration.
- load_pretrained_weights(path, strict=False)[source]
Loads pretrained weights into the model.
- Parameters:
path – (str) Path to the pretrained weight file.
- classmethod build_model(args)[source]
Class method to build a new instance of the UniMolModel.
- Parameters:
args – Arguments for model configuration.
- Returns:
An instance of UniMolModel.
- forward(src_tokens, src_distance, src_coord, src_edge_type, return_repr=False, return_atomic_reprs=False, **kwargs)[source]
Defines the forward pass of the model.
- Parameters:
src_tokens – Tokenized input data.
src_distance – Pairwise distance features between the input positions.
src_coord – Coordinates of the input positions.
src_edge_type – Edge (pair) type indices for the input.
gas_id – Optional environmental features for MOFs.
gas_attr – Optional environmental features for MOFs.
pressure – Optional environmental features for MOFs.
temperature – Optional environmental features for MOFs.
return_repr – (bool) Whether to return intermediate representations instead of logits. Defaults to False.
return_atomic_reprs – (bool) Whether to return atom-level intermediate representations. Defaults to False.
- Returns:
Output logits or requested intermediate representations.
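As a rough illustration of how the geometric inputs relate, the sketch below derives a pairwise distance matrix from a list of 3D coordinates. It assumes that src_distance holds Euclidean distances between the positions in src_coord; the helper name pairwise_distances is hypothetical, and plain Python lists stand in for the batched tensors the model actually consumes.

```python
import math

def pairwise_distances(coords):
    """Return an n x n matrix of Euclidean distances between 3D points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

# Three hypothetical atom positions (units follow whatever the coordinates use).
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
dist = pairwise_distances(coords)
```

The matrix is symmetric with a zero diagonal, so it carries no information about absolute orientation, only about relative geometry.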
- class unimol_tools.models.unimol.LinearHead(*args: Any, **kwargs: Any)[source]
Linear head.
- class unimol_tools.models.unimol.ClassificationHead(*args: Any, **kwargs: Any)[source]
Head for sentence-level classification tasks.
- __init__(input_dim, inner_dim, num_classes, activation_fn, pooler_dropout)[source]
Initialize the classification head.
- Parameters:
input_dim – Dimension of input features.
inner_dim – Dimension of the inner layer.
num_classes – Number of classes for classification.
activation_fn – Activation function name.
pooler_dropout – Dropout rate for the pooling layer.
- class unimol_tools.models.unimol.NonLinearHead(*args: Any, **kwargs: Any)[source]
A neural network module used for simple classification tasks. It consists of a two-layered linear network with a nonlinear activation function in between.
- Attributes:
linear1: The first linear layer.
linear2: The second linear layer that outputs to the desired dimensions.
activation_fn: The nonlinear activation function.
- __init__(input_dim, out_dim, activation_fn, hidden=None)[source]
Initializes the NonLinearHead module.
- Parameters:
input_dim – Dimension of the input features.
out_dim – Dimension of the output.
activation_fn – The activation function to use.
hidden – Dimension of the hidden layer; defaults to the same as input_dim if not provided.
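The two-layer structure described above can be sketched without PyTorch. This is only an illustration of the linear1, activation, linear2 ordering: the function names, the explicit weight lists, and the choice of GELU as the example activation are assumptions, not the library's implementation.

```python
import math

def gelu(x):
    """Exact GELU, used here as an example activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def non_linear_head(x, w1, b1, w2, b2, activation=gelu):
    """linear1 -> activation -> linear2, mirroring the documented attributes."""
    hidden = [activation(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

# input_dim = 2, hidden = 2 (the documented default: same as input_dim), out_dim = 1
w1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 1.0]]
b2 = [0.5]
out = non_linear_head([0.0, 0.0], w1, b1, w2, b2)
```

With a zero input and zero first-layer bias, the hidden activations are zero and the output reduces to the second-layer bias.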
- unimol_tools.models.unimol.gaussian(x, mean, std)
Gaussian function implemented for PyTorch tensors.
- Parameters:
x – The input tensor.
mean – The mean for the Gaussian function.
std – The standard deviation for the Gaussian function.
- Returns:
The output tensor after applying the Gaussian function.
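For reference, a scalar version of a Gaussian function is sketched below. It assumes the function evaluates the normal probability density with the given mean and standard deviation; the library version operates on PyTorch tensors and may differ in details such as the constant used for pi.

```python
import math

def gaussian(x, mean, std):
    """Normal probability density evaluated at a scalar x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)

peak = gaussian(0.0, 0.0, 1.0)   # density at the mean of a standard normal
```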
- unimol_tools.models.unimol.get_activation_fn(activation)[source]
Returns the activation function corresponding to the name given in activation.
- class unimol_tools.models.unimol.GaussianLayer(*args: Any, **kwargs: Any)[source]
A neural network module implementing a Gaussian layer, useful in graph neural networks.
- Attributes:
K: Number of Gaussian kernels.
means, stds: Embeddings for the means and standard deviations of the Gaussian kernels.
mul, bias: Embeddings for scaling and bias parameters.
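The idea of a Gaussian layer can be illustrated by expanding a single distance into K features, one per kernel. The sketch below is an assumption-laden simplification: the means, stds, mul, and bias values are fixed numbers here, whereas the attributes above describe them as embeddings, and gaussian_basis is a hypothetical name.

```python
import math

def gaussian_basis(distance, means, stds, mul=1.0, bias=0.0):
    """Expand one distance into K features, one per Gaussian kernel."""
    x = mul * distance + bias  # affine rescaling before the kernels
    feats = []
    for mean, std in zip(means, stds):
        z = (x - mean) / std
        feats.append(math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std))
    return feats

# K = 4 hypothetical kernels centred at 0, 1, 2 and 3 with unit width
features = gaussian_basis(2.0, means=[0.0, 1.0, 2.0, 3.0], stds=[1.0] * 4)
```

The kernel whose mean is closest to the rescaled distance responds most strongly, which gives downstream layers a smooth, localized encoding of distance.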
- class unimol_tools.models.unimol.NumericalEmbed(*args: Any, **kwargs: Any)[source]
Numerical embedding module, typically used for embedding edge features in graph neural networks.
- Attributes:
K: Output dimension for embeddings.
mul, bias, w_edge: Embeddings for transformation parameters.
proj: Projection layer to transform inputs.
ln: Layer normalization.
- __init__(K=128, edge_types=1024, activation_fn='gelu')[source]
Initializes the NumericalEmbed module.
- Parameters:
K – Output dimension for embeddings. Defaults to 128.
edge_types – Number of edge types. Defaults to 1024.
activation_fn – The activation function to use. Defaults to 'gelu'.
Model
unimol_tools.models.nnmodel.py contains the NNModel, which is responsible for initializing the model.
- class unimol_tools.models.nnmodel.NNModel(data, trainer, **params)[source]
An NNModel class is responsible for initializing the model.
- __init__(data, trainer, **params)[source]
Initializes the neural network model with the given data and parameters.
- Parameters:
data – (dict) Contains the dataset information, including features and target scaling.
trainer – (object) An instance of a training class, responsible for managing training processes.
params – Various additional parameters used for model configuration.
The model is configured based on the task type and specific parameters provided.
- _init_model(model_name, **params)[source]
Initializes the neural network model based on the provided model name and parameters.
- Parameters:
model_name – (str) The name of the model to initialize.
params – Additional parameters for model configuration.
- Returns:
An instance of the specified neural network model.
- Raises:
ValueError – If the model name is not recognized.
- collect_data(X, y, idx)[source]
Collects and formats the training or validation data.
- Parameters:
X – (np.ndarray or dict) The input features, either as a numpy array or a dictionary of tensors.
y – (np.ndarray) The target values as a numpy array.
idx – Indices to select the specific data samples.
- Returns:
A tuple containing processed input data and target values.
- Raises:
ValueError – If X is neither a numpy array nor a dictionary.
- run()[source]
Executes the training process of the model. This involves data preparation, model training, validation, and computing metrics for each fold in cross-validation.
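The per-fold loop mentioned above can be pictured with a small index splitter. How the library actually obtains its folds is not described here, so the contiguous, unshuffled split and the name kfold_indices are assumptions made purely for illustration.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size

folds = list(kfold_indices(n_samples=5, n_folds=2))
```

A training routine would then fit on the train indices and compute metrics on the valid indices for each pair in turn.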
- dump(data, dir, name)[source]
Saves the specified data to a file.
- Parameters:
data – The data to be saved.
dir – (str) The directory where the data will be saved.
name – (str) The name of the file to save the data.
- unimol_tools.models.nnmodel.NNDataset(data, label=None)[source]
Creates a dataset suitable for use with PyTorch models.
- Parameters:
data – The input data.
label – Optional labels corresponding to the input data.
- Returns:
An instance of TorchDataset.
- class unimol_tools.models.nnmodel.TorchDataset(*args: Any, **kwargs: Any)[source]
A custom dataset class for PyTorch that handles data and labels. This class is compatible with PyTorch’s Dataset interface and can be used with a DataLoader for efficient batch processing. It’s designed to work with both numpy arrays and PyTorch tensors.
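The Dataset interface mentioned above boils down to supporting len() and integer indexing. The stand-in below shows that protocol without importing PyTorch; ListDataset and batches are hypothetical names, and the placeholder labels used when label is None are an assumption rather than the library's behaviour.

```python
class ListDataset:
    """Minimal map-style dataset: supports len() and integer indexing."""

    def __init__(self, data, label=None):
        self.data = data
        # Fall back to placeholder labels when none are given (an assumption).
        self.label = label if label is not None else [None] * len(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.label[idx]

def batches(dataset, batch_size):
    """Yield lists of samples, a stand-in for what a DataLoader does."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

ds = ListDataset(["CCO", "c1ccccc1", "O"], label=[0, 1, 0])
first_batch = next(batches(ds, batch_size=2))
```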
Loss
unimol_tools.models.loss.py contains different loss functions.
- class unimol_tools.models.loss.GHM_Loss(*args: Any, **kwargs: Any)[source]
A GHM_Loss class.
- __init__(bins=10, alpha=0.5)[source]
Initializes the GHM_Loss module with the specified number of bins and alpha value.
- Parameters:
bins – (int) The number of bins to divide the gradient. Defaults to 10.
alpha – (float) The smoothing parameter for updating the last bin count. Defaults to 0.5.
- _g2bin(g)[source]
Maps gradient values to corresponding bin indices.
- Parameters:
g – (torch.Tensor) Gradient tensor.
- Returns:
(torch.Tensor) Bin indices for each gradient value.
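A plain-Python sketch of the binning step is shown below. It assumes gradient norms lie in [0, 1] and that the bins are of equal width; the small offset that keeps a value of exactly 1.0 inside the last bin follows common GHM implementations and is an assumption about this library.

```python
import math

def g2bin(g, bins=10):
    """Map gradient norms in [0, 1] to integer bin indices in [0, bins - 1]."""
    # Scaling by slightly less than `bins` keeps g == 1.0 inside the last bin.
    return [math.floor(v * (bins - 1e-4)) for v in g]

indices = g2bin([0.0, 0.05, 0.5, 0.99, 1.0])
```

A side effect of the offset is that values sitting exactly on a bin boundary, such as 0.5 with ten bins, land in the lower bin.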
- _custom_loss(x, target, weight)[source]
Custom loss function to be implemented in subclasses.
- Parameters:
x – (torch.Tensor) Predicted values.
target – (torch.Tensor) Ground truth labels.
weight – (torch.Tensor) Weights for the loss.
- Raises:
NotImplementedError – Indicates that the method should be implemented in subclasses.
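To show how bins and alpha might interact, the sketch below turns bin indices into per-sample weights in the spirit of gradient harmonizing: samples in crowded bins are down-weighted. The function name ghm_weights, the moving-average update, and the clamping constant are assumptions based on the general GHM recipe, not a transcription of this class.

```python
def ghm_weights(bin_idx, bins=10, last_bin_count=None, alpha=0.5):
    """Return per-sample weights and the (smoothed) bin counts."""
    n = len(bin_idx)
    counts = [0.0] * bins
    for b in bin_idx:
        counts[b] += 1.0
    if last_bin_count is not None:
        # Exponential moving average across batches, controlled by alpha.
        counts = [alpha * old + (1.0 - alpha) * new
                  for old, new in zip(last_bin_count, counts)]
    nonempty = sum(1 for c in counts if c > 0)
    # Gradient density of a bin is approximated by count * number of non-empty bins.
    weights = [n / max(counts[b] * nonempty, 1e-4) for b in bin_idx]
    return weights, counts

weights, counts = ghm_weights([0, 0, 0, 9], bins=10)
```

Here three samples share bin 0 and one sits alone in bin 9, so the lone sample receives the larger weight. A subclass would pass such weights to its own loss function.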
- class unimol_tools.models.loss.GHMC_Loss(*args: Any, **kwargs: Any)[source]
Inherits from GHM_Loss. GHM loss for classification.
- __init__(bins, alpha)[source]
Initializes the GHMC_Loss with specified number of bins and alpha value.
- Parameters:
bins – (int) Number of bins for gradient division.
alpha – (float) Smoothing parameter for bin count updating.
- class unimol_tools.models.loss.GHMR_Loss(*args: Any, **kwargs: Any)[source]
Inherits from GHM_Loss. GHM loss for regression.
- __init__(bins, alpha, mu)[source]
Initializes the GHMR_Loss with specified number of bins, alpha value, and mu parameter.
- Parameters:
bins – (int) Number of bins for gradient division.
alpha – (float) Smoothing parameter for bin count updating.
mu – (float) Parameter used in the GHMR loss formula.
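The role of mu can be illustrated with the smoothed L1 form commonly used for GHM regression, where the loss is sqrt(d^2 + mu^2) - mu for an error d. Whether this class uses exactly that formula is an assumption; the sketch only shows why such a loss yields a gradient norm that can be binned.

```python
import math

def ghmr_loss_and_grad_norm(pred, target, mu):
    """Smoothed L1 loss and its gradient norm for one sample."""
    d = pred - target
    root = math.sqrt(d * d + mu * mu)
    loss = root - mu            # close to |d| for large errors, d^2 / (2 mu) near zero
    grad_norm = abs(d) / root   # lies in [0, 1), so it can be assigned to a bin
    return loss, grad_norm

loss, grad_norm = ghmr_loss_and_grad_norm(pred=3.0, target=0.0, mu=4.0)
```

Smaller values of mu make the loss behave more like plain L1 and push gradient norms toward 1 for all but the smallest errors.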
- unimol_tools.models.loss.MAEwithNan(y_pred, y_true)[source]
Calculates the Mean Absolute Error (MAE) loss, ignoring NaN values in the target.
- Parameters:
y_pred – (torch.Tensor) Predicted values.
y_true – (torch.Tensor) Ground truth values, may contain NaNs.
- Returns:
(torch.Tensor) MAE loss computed only on non-NaN elements.
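The masking idea can be shown on plain lists: positions whose target is NaN are dropped before averaging. This is a sketch only; mae_ignoring_nan is a hypothetical name, and returning NaN when no target is valid is an assumption about edge-case behaviour.

```python
import math

def mae_ignoring_nan(y_pred, y_true):
    """Mean absolute error over positions where the target is not NaN."""
    pairs = [(p, t) for p, t in zip(y_pred, y_true) if not math.isnan(t)]
    if not pairs:
        return float("nan")  # no valid targets; behaviour here is an assumption
    return sum(abs(p - t) for p, t in pairs) / len(pairs)

loss = mae_ignoring_nan([1.0, 2.0, 3.0], [1.5, float("nan"), 2.0])
```

Only the first and third positions contribute, so missing labels in multi-task data do not distort the average.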
- unimol_tools.models.loss.FocalLoss(y_pred, y_true, alpha=0.25, gamma=2)[source]
Calculates the Focal Loss, used to address class imbalance by focusing on hard examples.
- Parameters:
y_pred – (torch.Tensor) Predicted probabilities.
y_true – (torch.Tensor) Ground truth labels.
alpha – (float) Weighting factor for balancing positive and negative examples. Defaults to 0.25.
gamma – (float) Focusing parameter to scale the loss. Defaults to 2.
- Returns:
(torch.Tensor) Computed focal loss.
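For intuition about alpha and gamma, the following sketch implements the standard binary focal loss on plain lists of probabilities. It assumes the usual formulation, -alpha_t * (1 - p_t)^gamma * log(p_t) averaged over samples; the library's reduction, clamping, and alpha convention may differ, and the eps argument is an addition made here for numerical safety.

```python
import math

def focal_loss(y_pred, y_true, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over predicted probabilities."""
    total = 0.0
    for p, y in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)             # avoid log(0)
        p_t = p if y == 1 else 1.0 - p              # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_pred)

easy = focal_loss([0.9], [1])   # confident and correct
hard = focal_loss([0.1], [1])   # confident and wrong
```

The (1 - p_t)^gamma factor shrinks the contribution of well-classified examples, so the hard example dominates the loss far more than it would under plain cross-entropy.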
- unimol_tools.models.loss.FocalLossWithLogits(y_pred, y_true, alpha=0.25, gamma=2.0)[source]
Calculates the Focal Loss using predicted logits (raw scores), automatically applying the sigmoid function.
- Parameters:
y_pred – (torch.Tensor) Predicted logits.
y_true – (torch.Tensor) Ground truth labels, may contain NaNs.
alpha – (float) Weighting factor for balancing positive and negative examples. Defaults to 0.25.
gamma – (float) Focusing parameter to scale the loss. Defaults to 2.0.
- Returns:
(torch.Tensor) Computed focal loss.
- unimol_tools.models.loss.myCrossEntropyLoss(y_pred, y_true)[source]
Calculates the cross-entropy loss between predictions and targets.
- Parameters:
y_pred – (torch.Tensor) Predicted logits or probabilities.
y_true – (torch.Tensor) Ground truth labels.
- Returns:
(torch.Tensor) Computed cross-entropy loss.
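As a reminder of what cross-entropy computes from logits, here is a single-sample sketch using the log-sum-exp trick. It assumes logits as input and an integer class index as target; the library function may expect a different label shape or apply extra preprocessing.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

uniform = cross_entropy([0.0, 0.0], target=0)
```

Equal logits over two classes give a loss of log(2), the value expected from an uninformed binary classifier.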
Transformers
unimol_tools.models.transformers.py contains a custom Transformer Encoder module that extends PyTorch’s nn.Module.
- unimol_tools.models.transformers.softmax_dropout(input, dropout_prob, is_training=True, mask=None, bias=None, inplace=True)[source]
Applies softmax followed by dropout. The mask and bias arguments are optional.
- Parameters:
input – (torch.Tensor) Input tensor.
dropout_prob – (float) Dropout probability.
is_training – (bool, optional) Whether the module is in training mode. Defaults to True.
mask – (torch.Tensor, optional) The mask tensor, applied as input + mask. Defaults to None.
bias – (torch.Tensor, optional) The bias tensor, applied as input + bias. Defaults to None.
- Returns:
(torch.Tensor) The result after softmax and dropout.
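The order of operations can be sketched on a single row of attention scores: add the optional mask and bias, apply softmax, then apply dropout in training mode. This is a plain-Python illustration under those assumptions; the rescaling of kept values (inverted dropout) is the usual convention and is assumed here, not confirmed for this function.

```python
import math
import random

def softmax_dropout(row, dropout_prob, is_training=True, mask=None, bias=None):
    """Softmax over one row of scores, followed by inverted dropout."""
    scores = list(row)
    if mask is not None:
        scores = [s + m for s, m in zip(scores, mask)]   # e.g. -inf hides a position
    if bias is not None:
        scores = [s + b for s, b in zip(scores, bias)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]           # shift for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    if is_training and dropout_prob > 0.0:
        keep = 1.0 - dropout_prob
        probs = [p / keep if random.random() < keep else 0.0 for p in probs]
    return probs

probs = softmax_dropout([1.0, 1.0, 1.0], 0.1, is_training=False,
                        mask=[0.0, 0.0, float("-inf")])
```

With is_training=False the result is deterministic, and the masked third position receives zero probability while the other two share the mass equally.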
- unimol_tools.models.transformers.get_activation_fn(activation)[source]
Returns the activation function corresponding to the name given in activation.
- class unimol_tools.models.transformers.TransformerEncoderLayer(*args: Any, **kwargs: Any)[source]
Implements a Transformer Encoder Layer used in BERT/XLM style pre-trained models.
- class unimol_tools.models.transformers.TransformerEncoderWithPair(*args: Any, **kwargs: Any)[source]
A custom Transformer Encoder module that extends PyTorch’s nn.Module. This encoder is designed for tasks that require understanding pair relationships in sequences. It includes standard transformer encoder layers along with additional normalization and dropout features.
- Attributes:
emb_dropout: Dropout rate applied to the embedding layer.
max_seq_len: Maximum length of the input sequences.
embed_dim: Dimensionality of the embeddings.
attention_heads: Number of attention heads in the transformer layers.
emb_layer_norm: Layer normalization applied to the embedding layer.
final_layer_norm: Optional final layer normalization.
final_head_layer_norm: Optional layer normalization for the attention heads.
layers: A list of transformer encoder layers.
- Methods:
forward: Performs the forward pass of the module.
- __init__(encoder_layers: int = 6, embed_dim: int = 768, ffn_embed_dim: int = 3072, attention_heads: int = 8, emb_dropout: float = 0.1, dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.0, max_seq_len: int = 256, activation_fn: str = 'gelu', post_ln: bool = False, no_final_head_layer_norm: bool = False) None[source]
Initializes and configures the layers and other components of the transformer encoder.
- Parameters:
encoder_layers – (int) Number of encoder layers in the transformer.
embed_dim – (int) Dimensionality of the input embeddings.
ffn_embed_dim – (int) Dimensionality of the feedforward network model.
attention_heads – (int) Number of attention heads in each encoder layer.
emb_dropout – (float) Dropout rate for the embedding layer.
dropout – (float) Dropout rate for the encoder layers.
attention_dropout – (float) Dropout rate for the attention mechanisms.
activation_dropout – (float) Dropout rate for activations.
max_seq_len – (int) Maximum sequence length the model can handle.
activation_fn – (str) The activation function to use (e.g., “gelu”).
post_ln – (bool) If True, applies layer normalization after the feedforward network.
no_final_head_layer_norm – (bool) If True, does not apply layer normalization to the final attention head.
- forward(emb: torch.Tensor, attn_mask: torch.Tensor | None = None, padding_mask: torch.Tensor | None = None) torch.Tensor[source]
Conducts the forward pass of the transformer encoder.
- Parameters:
emb – (torch.Tensor) The input tensor of embeddings.
attn_mask – (Optional[torch.Tensor]) Attention mask to specify positions to attend to.
padding_mask – (Optional[torch.Tensor]) Mask to indicate padded elements in the input.
- Returns:
(torch.Tensor) The output tensor after passing through the transformer encoder layers. It also returns tensors related to pair representation and normalization losses.