Models

unimol_tools.models contains the models of Uni-Mol.

Uni-Mol

unimol_tools.models.unimol.py contains the UniMolModel, which is the backbone of the Uni-Mol model.

class unimol_tools.models.unimol.UniMolModel(*args: Any, **kwargs: Any)[source]

UniMolModel is a specialized model for molecular, protein, crystal, or MOF (metal-organic framework) data. It configures its architecture dynamically based on the data type it is built for, and it supports multiple architecture configurations and pretrained weights.

Attributes:
  • output_dim: The dimension of the output layer.

  • data_type: The type of data the model is designed to handle.

  • remove_hs: Flag to indicate whether hydrogen atoms are removed in molecular data.

  • pretrain_path: Path to the pretrained model weights.

  • dictionary: The dictionary object used for tokenization and encoding.

  • mask_idx: Index of the mask token in the dictionary.

  • padding_idx: Index of the padding token in the dictionary.

  • embed_tokens: Embedding layer for token embeddings.

  • encoder: Transformer encoder backbone of the model.

  • gbf_proj, gbf: Layers for Gaussian basis functions or numerical embeddings.

  • classification_head: The final classification head of the model.

__init__(output_dim=2, data_type='molecule', **params)[source]

Initializes the UniMolModel with specified parameters and data type.

Parameters:
  • output_dim – (int) The number of output dimensions (classes).

  • data_type – (str) The type of data (e.g., ‘molecule’, ‘protein’).

  • params – Additional parameters for model configuration.

load_pretrained_weights(path)[source]

Loads pretrained weights into the model.

Parameters:

path – (str) Path to the pretrained weight file.

classmethod build_model(args)[source]

Class method to build a new instance of the UniMolModel.

Parameters:

args – Arguments for model configuration.

Returns:

An instance of UniMolModel.

forward(src_tokens, src_distance, src_coord, src_edge_type, return_repr=False, return_atomic_reprs=False, **kwargs)[source]

Defines the forward pass of the model.

Parameters:
  • src_tokens – Tokenized input data.

  • src_distance – Pairwise distances between atoms.

  • src_coord – Atomic coordinates.

  • src_edge_type – Edge (atom-pair) types.

  • gas_id – Optional environmental features for MOFs.

  • gas_attr – Optional environmental features for MOFs.

  • pressure – Optional environmental features for MOFs.

  • temperature – Optional environmental features for MOFs.

  • return_repr – (bool) If True, return intermediate representations instead of logits. Defaults to False.

  • return_atomic_reprs – (bool) If True, also return atom-level representations. Defaults to False.

Returns:

Output logits or requested intermediate representations.
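
The relationship between src_coord and src_distance can be illustrated with a short sketch. It assumes that src_distance holds the pairwise Euclidean distances between the atoms listed in src_coord; the helper name pairwise_distances is hypothetical and is not part of unimol_tools.

```python
import math

def pairwise_distances(coords):
    """Return the N x N matrix of Euclidean distances for N 3D points.

    Hypothetical helper for illustration; not part of unimol_tools.
    """
    return [[math.dist(a, b) for b in coords] for a in coords]

# Three atoms placed on the x- and y-axes (coordinates in angstroms).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
dist = pairwise_distances(coords)
```

The resulting matrix is symmetric with zeros on the diagonal, which is the shape of input a distance-based pair encoding would expect.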

batch_collate_fn(samples)[source]

Custom collate function for batch processing non-MOF data.

Parameters:

samples – A list of sample data.

Returns:

A tuple containing a batch dictionary and labels.
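
As a rough illustration of what a collate function of this kind does, the sketch below pads variable-length token lists to a common length and returns a batch dictionary plus labels. The sample layout (a dict with a "src_tokens" key paired with a label) and the helper names are assumptions made for the example, not the library's actual data format.

```python
def pad_batch(token_lists, padding_idx=0):
    """Right-pad variable-length token lists to a common length."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [padding_idx] * (max_len - len(tokens)) for tokens in token_lists]

def collate(samples, padding_idx=0):
    """Turn [(features, label), ...] into (batch_dict, labels).

    Assumes each `features` is a dict with a "src_tokens" list. A real
    collate function would also need to pad distances, coordinates and
    edge types in the same way.
    """
    features, labels = zip(*samples)
    batch = {"src_tokens": pad_batch([f["src_tokens"] for f in features], padding_idx)}
    return batch, list(labels)

samples = [({"src_tokens": [1, 5, 6, 2]}, 0.3), ({"src_tokens": [1, 5, 2]}, 1.2)]
batch, labels = collate(samples)
```

Padding with a dedicated index lets the model mask out the filler positions later, which is why padding_idx is tracked as a model attribute.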

class unimol_tools.models.unimol.ClassificationHead(*args: Any, **kwargs: Any)[source]

Head for sentence-level classification tasks.

__init__(input_dim, inner_dim, num_classes, activation_fn, pooler_dropout)[source]

Initialize the classification head.

Parameters:
  • input_dim – Dimension of input features.

  • inner_dim – Dimension of the inner layer.

  • num_classes – Number of classes for classification.

  • activation_fn – Activation function name.

  • pooler_dropout – Dropout rate for the pooling layer.

forward(features, **kwargs)[source]

Forward pass for the classification head.

Parameters:

features – Input features for classification.

Returns:

Output from the classification head.

class unimol_tools.models.unimol.NonLinearHead(*args: Any, **kwargs: Any)[source]

A neural network module used for simple classification tasks. It consists of a two-layered linear network with a nonlinear activation function in between.

Attributes:
  • linear1: The first linear layer.

  • linear2: The second linear layer that outputs to the desired dimensions.

  • activation_fn: The nonlinear activation function.

__init__(input_dim, out_dim, activation_fn, hidden=None)[source]

Initializes the NonLinearHead module.

Parameters:
  • input_dim – Dimension of the input features.

  • out_dim – Dimension of the output.

  • activation_fn – The activation function to use.

  • hidden – Dimension of the hidden layer; defaults to the same as input_dim if not provided.

forward(x)[source]

Forward pass of the NonLinearHead.

Parameters:

x – Input tensor to the module.

Returns:

Tensor after passing through the network.

unimol_tools.models.unimol.gaussian(x, mean, std)

Gaussian function implemented for PyTorch tensors.

Parameters:
  • x – The input tensor.

  • mean – The mean for the Gaussian function.

  • std – The standard deviation for the Gaussian function.

Returns:

The output tensor after applying the Gaussian function.
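
For reference, the scalar form of a normalized Gaussian density is sketched below in plain Python. It assumes the function computes the standard normal density with the given mean and standard deviation; the tensor version would apply the same arithmetic elementwise. The name gaussian_pdf is hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Normalized Gaussian density evaluated at a scalar x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)

peak = gaussian_pdf(0.0, mean=0.0, std=1.0)       # density at the mean
one_sigma = gaussian_pdf(1.0, mean=0.0, std=1.0)  # density one std away
```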

unimol_tools.models.unimol.get_activation_fn(activation)[source]

Returns the activation function corresponding to the name passed as activation.

class unimol_tools.models.unimol.GaussianLayer(*args: Any, **kwargs: Any)[source]

A neural network module implementing a Gaussian layer, useful in graph neural networks.

Attributes:
  • K: Number of Gaussian kernels.

  • means, stds: Embeddings for the means and standard deviations of the Gaussian kernels.

  • mul, bias: Embeddings for scaling and bias parameters.

__init__(K=128, edge_types=1024)[source]

Initializes the GaussianLayer module.

Parameters:
  • K – Number of Gaussian kernels.

  • edge_types – Number of different edge types to consider.

Returns:

A GaussianLayer instance configured with K Gaussian kernels and the given number of edge types.

forward(x, edge_type)[source]

Forward pass of the GaussianLayer.

Parameters:
  • x – Input tensor representing distances or other features.

  • edge_type – Tensor indicating types of edges in the graph.

Returns:

Tensor transformed by the Gaussian layer.
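
The idea of a Gaussian basis expansion can be sketched for a single scalar distance. The example assumes the layer first applies a per-edge-type scale and shift (mul and bias) and then evaluates K Gaussian kernels with their own means and standard deviations; this is an illustration of the technique, not the library's implementation, and all names are hypothetical.

```python
import math

def gaussian_basis(distance, mul, bias, means, stds):
    """Expand one scalar distance into len(means) Gaussian features.

    mul and bias stand in for the per-edge-type scale and shift that an
    embedding lookup on edge_type would return (an assumption).
    """
    x = mul * distance + bias
    features = []
    for mean, std in zip(means, stds):
        z = (x - mean) / std
        features.append(math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std))
    return features

# K = 4 kernels with centers spread over 0..3 and a shared width.
means = [0.0, 1.0, 2.0, 3.0]
stds = [0.5, 0.5, 0.5, 0.5]
features = gaussian_basis(2.0, mul=1.0, bias=0.0, means=means, stds=stds)
```

The kernel whose center is closest to the transformed distance responds most strongly, so a single number becomes a smooth K-dimensional feature vector.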

class unimol_tools.models.unimol.NumericalEmbed(*args: Any, **kwargs: Any)[source]

Numerical embedding module, typically used for embedding edge features in graph neural networks.

Attributes:
  • K: Output dimension for embeddings.

  • mul, bias, w_edge: Embeddings for transformation parameters.

  • proj: Projection layer to transform inputs.

  • ln: Layer normalization.

__init__(K=128, edge_types=1024, activation_fn='gelu')[source]

Initializes the NumericalEmbed module.

Parameters:
  • K – Output dimension for embeddings. Defaults to 128.

  • edge_types – Number of different edge types to consider. Defaults to 1024.

  • activation_fn – The activation function to use. Defaults to 'gelu'.

forward(x, edge_type)[source]

Forward pass of the NumericalEmbed module.

Parameters:
  • x – Input tensor of numerical features to embed.

  • edge_type – Tensor indicating types of edges in the graph.

Returns:

The embedded output tensor.

Model

unimol_tools.models.nnmodel.py contains the NNModel, which is responsible for initializing the model.

class unimol_tools.models.nnmodel.NNModel(data, trainer, **params)[source]

The NNModel class is responsible for initializing the model.

__init__(data, trainer, **params)[source]

Initializes the neural network model with the given data and parameters.

Parameters:
  • data – (dict) Contains the dataset information, including features and target scaling.

  • trainer – (object) An instance of a training class, responsible for managing training processes.

  • params – Various additional parameters used for model configuration.

The model is configured based on the task type and specific parameters provided.

_init_model(model_name, **params)[source]

Initializes the neural network model based on the provided model name and parameters.

Parameters:
  • model_name – (str) The name of the model to initialize.

  • params – Additional parameters for model configuration.

Returns:

An instance of the specified neural network model.

Raises:

ValueError – If the model name is not recognized.

collect_data(X, y, idx)[source]

Collects and formats the training or validation data.

Parameters:
  • X – (np.ndarray or dict) The input features, either as a numpy array or a dictionary of tensors.

  • y – (np.ndarray) The target values as a numpy array.

  • idx – Indices to select the specific data samples.

Returns:

A tuple containing processed input data and target values.

Raises:

ValueError – If X is neither a numpy array nor a dictionary.

run()[source]

Executes the training process of the model. This involves data preparation, model training, validation, and computing metrics for each fold in cross-validation.

dump(data, dir, name)[source]

Saves the specified data to a file.

Parameters:
  • data – The data to be saved.

  • dir – (str) The directory where the data will be saved.

  • name – (str) The name of the file to save the data.

evaluate(trainer=None, checkpoints_path=None)[source]

Evaluates the model by making predictions on the test set and averaging the results.

Parameters:
  • trainer – An optional trainer instance to use for prediction.

  • checkpoints_path – (str) The path to the saved model checkpoints.

count_parameters(model)[source]

Counts the number of trainable parameters in the model.

Parameters:

model – The model whose parameters are to be counted.

Returns:

(int) The number of trainable parameters.

unimol_tools.models.nnmodel.NNDataset(data, label=None)[source]

Creates a dataset suitable for use with PyTorch models.

Parameters:
  • data – The input data.

  • label – Optional labels corresponding to the input data.

Returns:

An instance of TorchDataset.

class unimol_tools.models.nnmodel.TorchDataset(*args: Any, **kwargs: Any)[source]

A custom dataset class for PyTorch that handles data and labels. This class is compatible with PyTorch’s Dataset interface and can be used with a DataLoader for efficient batch processing. It’s designed to work with both numpy arrays and PyTorch tensors.

__init__(data, label=None)[source]

Initializes the dataset with data and labels.

Parameters:
  • data – The input data.

  • label – The target labels for the input data.

Loss

unimol_tools.models.loss.py contains different loss functions.

class unimol_tools.models.loss.GHM_Loss(*args: Any, **kwargs: Any)[source]

Base class for Gradient Harmonizing Mechanism (GHM) losses. Subclasses implement _custom_loss and _custom_loss_grad.

__init__(bins=10, alpha=0.5)[source]

Initializes the GHM_Loss module with the specified number of bins and alpha value.

Parameters:
  • bins – (int) The number of bins into which gradient values are divided. Defaults to 10.

  • alpha – (float) The smoothing parameter for updating the last bin count. Defaults to 0.5.

_g2bin(g)[source]

Maps gradient values to corresponding bin indices.

Parameters:

g – (torch.Tensor) Gradient tensor.

Returns:

(torch.Tensor) Bin indices for each gradient value.

_custom_loss(x, target, weight)[source]

Custom loss function to be implemented in subclasses.

Parameters:
  • x – (torch.Tensor) Predicted values.

  • target – (torch.Tensor) Ground truth labels.

  • weight – (torch.Tensor) Weights for the loss.

Raises:

NotImplementedError – Indicates that the method should be implemented in subclasses.

_custom_loss_grad(x, target)[source]

Custom gradient computation function to be implemented in subclasses.

Parameters:
  • x – (torch.Tensor) Predicted values.

  • target – (torch.Tensor) Ground truth labels.

Raises:

NotImplementedError – Indicates that the method should be implemented in subclasses.

forward(x, target)[source]

Forward pass for computing the GHM loss.

Parameters:
  • x – (torch.Tensor) Predicted values.

  • target – (torch.Tensor) Ground truth labels.

Returns:

(torch.Tensor) Computed GHM loss.
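
The gradient-harmonizing idea behind this forward pass can be sketched in plain Python. The example assumes gradient norms lie in [0, 1], that they are mapped to equal-width bins, and that each sample is weighted inversely to the population of its bin, following the scheme described in the GHM literature. The alpha smoothing of bin counts is omitted, and the function names are hypothetical.

```python
import math

def g2bin(g, bins=10):
    """Map a gradient norm in [0, 1] to a bin index in [0, bins - 1]."""
    return int(math.floor(g * (bins - 0.0001)))

def ghm_weights(grad_norms, bins=10):
    """Weight each sample inversely to how crowded its gradient bin is."""
    counts = [0] * bins
    indices = [g2bin(g, bins) for g in grad_norms]
    for i in indices:
        counts[i] += 1
    non_empty = sum(1 for c in counts if c > 0)
    n = len(grad_norms)
    return [n / (counts[i] * non_empty) for i in indices]

# Three easy samples (small gradient) and one hard sample (large gradient).
grad_norms = [0.01, 0.02, 0.03, 0.95]
weights = ghm_weights(grad_norms)
```

Here the three easy samples share one bin and are down-weighted, while the lone hard sample receives a larger weight. The weights would then be passed to the subclass's _custom_loss.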

class unimol_tools.models.loss.GHMC_Loss(*args: Any, **kwargs: Any)[source]

GHM loss for classification. Inherits from GHM_Loss.

__init__(bins, alpha)[source]

Initializes the GHMC_Loss with specified number of bins and alpha value.

Parameters:
  • bins – (int) Number of bins for gradient division.

  • alpha – (float) Smoothing parameter for bin count updating.

_custom_loss(x, target, weight)[source]

Custom loss function for GHM classification loss.

Parameters:
  • x – (torch.Tensor) Predicted values.

  • target – (torch.Tensor) Ground truth labels.

  • weight – (torch.Tensor) Weights for the loss.

Returns:

Binary cross-entropy loss with logits.

_custom_loss_grad(x, target)[source]

Custom gradient function for GHM classification loss.

Parameters:
  • x – (torch.Tensor) Predicted values.

  • target – (torch.Tensor) Ground truth labels.

Returns:

Gradient of the loss.

class unimol_tools.models.loss.GHMR_Loss(*args: Any, **kwargs: Any)[source]

GHM loss for regression. Inherits from GHM_Loss.

__init__(bins, alpha, mu)[source]

Initializes the GHMR_Loss with specified number of bins, alpha value, and mu parameter.

Parameters:
  • bins – (int) Number of bins for gradient division.

  • alpha – (float) Smoothing parameter for bin count updating.

  • mu – (float) Parameter used in the GHMR loss formula.

_custom_loss(x, target, weight)[source]

Custom loss function for GHM regression loss.

Parameters:
  • x – (torch.Tensor) Predicted values.

  • target – (torch.Tensor) Ground truth values.

  • weight – (torch.Tensor) Weights for the loss.

Returns:

GHMR loss.

_custom_loss_grad(x, target)[source]

Custom gradient function for GHM regression loss.

Parameters:
  • x – (torch.Tensor) Predicted values.

  • target – (torch.Tensor) Ground truth values.

Returns:

Gradient of the loss.
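
To make the role of mu concrete, the sketch below shows the smoothed L1 form commonly used for GHM regression: the loss is sqrt(d² + mu²) − mu and the gradient norm is |d| / sqrt(d² + mu²), where d is the prediction error. Whether GHMR_Loss uses exactly this form is an assumption; the function names and the value mu = 0.02 are illustrative only.

```python
import math

def asl1_loss(pred, target, mu):
    """Smoothed L1 loss: close to d**2 / (2 * mu) near zero, close to |d| far away."""
    d = pred - target
    return math.sqrt(d * d + mu * mu) - mu

def asl1_grad_norm(pred, target, mu):
    """Magnitude of the loss gradient with respect to pred; always in [0, 1)."""
    d = pred - target
    return abs(d) / math.sqrt(d * d + mu * mu)

small_error = asl1_grad_norm(1.001, 1.0, mu=0.02)  # well below 1
large_error = asl1_grad_norm(5.0, 1.0, mu=0.02)    # approaches 1
```

Because the gradient norm is bounded by 1, it can be binned in the same way as the classification gradient.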

unimol_tools.models.loss.MAEwithNan(y_pred, y_true)[source]

Calculates the Mean Absolute Error (MAE) loss, ignoring NaN values in the target.

Parameters:
  • y_pred – (torch.Tensor) Predicted values.

  • y_true – (torch.Tensor) Ground truth values, may contain NaNs.

Returns:

(torch.Tensor) MAE loss computed only on non-NaN elements.
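
The masking behaviour can be shown with a plain-Python sketch that skips positions whose target is NaN before averaging. The function name is hypothetical, and returning NaN when no valid target exists is an assumption of this example.

```python
import math

def mae_ignoring_nan(y_pred, y_true):
    """Mean absolute error over positions where the target is not NaN."""
    errors = [abs(p - t) for p, t in zip(y_pred, y_true) if not math.isnan(t)]
    if not errors:
        return float("nan")  # No valid targets; this choice is an assumption.
    return sum(errors) / len(errors)

# The second target is missing, so only the first and third positions count.
loss = mae_ignoring_nan([1.0, 2.0, 3.0], [1.5, float("nan"), 2.0])
```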

unimol_tools.models.loss.FocalLoss(y_pred, y_true, alpha=0.25, gamma=2)[source]

Calculates the Focal Loss, used to address class imbalance by focusing on hard examples.

Parameters:
  • y_pred – (torch.Tensor) Predicted probabilities.

  • y_true – (torch.Tensor) Ground truth labels.

  • alpha – (float) Weighting factor for balancing positive and negative examples. Defaults to 0.25.

  • gamma – (float) Focusing parameter to scale the loss. Defaults to 2.

Returns:

(torch.Tensor) Computed focal loss.
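
The effect of alpha and gamma is easiest to see on single predictions. The sketch below implements the standard binary focal loss for one probability and one label; how the library weights the negative class and how it reduces over a batch may differ, and the function name is hypothetical.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p and label y in {0, 1}."""
    eps = 1e-12  # guards against log(0)
    p = min(max(p, eps), 1.0 - eps)
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)

easy = focal_loss(0.9, 1)  # confident and correct: heavily down-weighted
hard = focal_loss(0.1, 1)  # confident and wrong: keeps most of its loss
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy.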

unimol_tools.models.loss.FocalLossWithLogits(y_pred, y_true, alpha=0.25, gamma=2.0)[source]

Calculates the Focal Loss using predicted logits (raw scores), automatically applying the sigmoid function.

Parameters:
  • y_pred – (torch.Tensor) Predicted logits.

  • y_true – (torch.Tensor) Ground truth labels, may contain NaNs.

  • alpha – (float) Weighting factor for balancing positive and negative examples. Defaults to 0.25.

  • gamma – (float) Focusing parameter to scale the loss. Defaults to 2.0.

Returns:

(torch.Tensor) Computed focal loss.
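
A logits-based variant has to convert scores to probabilities and skip missing labels. The following sketch shows one way to do both in plain Python, using a numerically stable sigmoid and a mean over the valid entries. The reduction and the handling of an all-NaN batch are assumptions, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def focal_loss_with_logits(logits, labels, alpha=0.25, gamma=2.0):
    """Mean binary focal loss over entries whose label is not NaN."""
    losses = []
    for z, y in zip(logits, labels):
        if math.isnan(y):
            continue  # skip missing labels
        p = sigmoid(z)
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        a_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        losses.append(-a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12)))
    return sum(losses) / len(losses) if losses else float("nan")

# The NaN label is ignored, so both calls average the same two terms.
loss = focal_loss_with_logits([2.0, -1.0, 0.5], [1, float("nan"), 0])
masked = focal_loss_with_logits([2.0, 0.5], [1, 0])
```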

unimol_tools.models.loss.myCrossEntropyLoss(y_pred, y_true)[source]

Calculates the cross-entropy loss between predictions and targets.

Parameters:
  • y_pred – (torch.Tensor) Predicted logits or probabilities.

  • y_true – (torch.Tensor) Ground truth labels.

Returns:

(torch.Tensor) Computed cross-entropy loss.

Transformers

unimol_tools.models.transformers.py contains a custom Transformer Encoder module that extends PyTorch’s nn.Module.

unimol_tools.models.transformers.softmax_dropout(input, dropout_prob, is_training=True, mask=None, bias=None, inplace=True)[source]

Applies softmax followed by dropout. The mask and bias arguments are optional.

Parameters:
  • input – (torch.Tensor) Input tensor.

  • dropout_prob – (float) Dropout probability.

  • is_training – (bool, optional) Whether the model is in training mode. Defaults to True.

  • mask – (torch.Tensor, optional) Mask tensor, applied as input + mask. Defaults to None.

  • bias – (torch.Tensor, optional) Bias tensor, applied as input + bias. Defaults to None.

Returns:

(torch.Tensor) The result after softmax.
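
The order of operations can be sketched for a single row of attention scores: add the optional mask and bias, take the softmax, then apply dropout when training. This plain-Python version assumes inverted dropout (kept values are rescaled by 1 / (1 − dropout_prob)) and is only an illustration; the name softmax_dropout_row and the rng argument are hypothetical.

```python
import math
import random

def softmax_dropout_row(row, dropout_prob, is_training=True, mask=None, bias=None, rng=random):
    """Softmax over one row of scores, then inverted dropout when training."""
    scores = list(row)
    if mask is not None:
        scores = [s + m for s, m in zip(scores, mask)]
    if bias is not None:
        scores = [s + b for s, b in zip(scores, bias)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    if is_training and dropout_prob > 0.0:
        keep = 1.0 - dropout_prob
        probs = [p / keep if rng.random() < keep else 0.0 for p in probs]
    return probs

# A mask entry of -inf removes that position from the softmax.
probs = softmax_dropout_row([1.0, 2.0, 3.0], 0.1, is_training=False,
                            mask=[0.0, 0.0, float("-inf")])
```

Adding the mask before the softmax, rather than zeroing entries afterwards, keeps the remaining probabilities normalized.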

unimol_tools.models.transformers.get_activation_fn(activation)[source]

Returns the activation function corresponding to the name passed as activation.

class unimol_tools.models.transformers.SelfMultiheadAttention(*args: Any, **kwargs: Any)[source]

__init__(embed_dim, num_heads, dropout=0.1, bias=True, scaling_factor=1)[source]

class unimol_tools.models.transformers.TransformerEncoderLayer(*args: Any, **kwargs: Any)[source]

Implements a Transformer Encoder Layer used in BERT/XLM style pre-trained models.

__init__(embed_dim: int = 768, ffn_embed_dim: int = 3072, attention_heads: int = 8, dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.0, activation_fn: str = 'gelu', post_ln=False) → None[source]

forward(x: torch.Tensor, attn_bias: torch.Tensor | None = None, padding_mask: torch.Tensor | None = None, return_attn: bool = False) → torch.Tensor[source]

LayerNorm is applied either before or after the self-attention/ffn modules similar to the original Transformer implementation.
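
The two placements of LayerNorm can be contrasted with a small sketch. It models one residual sub-layer on a plain list of floats, with a stand-in function in place of attention or the feedforward network; the layer norm has no learned scale or shift, and all names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer, post_ln):
    """Apply one residual sub-layer with LayerNorm placed before or after it."""
    if post_ln:
        # Post-LN: normalize the sum of the input and the sub-layer output.
        return layer_norm([a + b for a, b in zip(x, sublayer(x))])
    # Pre-LN: normalize the input, then add the sub-layer output to the raw input.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

double = lambda v: [2.0 * t for t in v]  # stand-in for attention or the FFN
x = [1.0, 2.0, 3.0]
pre = residual_block(x, double, post_ln=False)
post = residual_block(x, double, post_ln=True)
```

In the pre-LN case the residual path carries the unnormalized input through unchanged, while in the post-LN case the block output itself is normalized.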

class unimol_tools.models.transformers.TransformerEncoderWithPair(*args: Any, **kwargs: Any)[source]

A custom Transformer Encoder module that extends PyTorch’s nn.Module. This encoder is designed for tasks that require understanding pair relationships in sequences. It includes standard transformer encoder layers along with additional normalization and dropout features.

Attributes:
  • emb_dropout: Dropout rate applied to the embedding layer.

  • max_seq_len: Maximum length of the input sequences.

  • embed_dim: Dimensionality of the embeddings.

  • attention_heads: Number of attention heads in the transformer layers.

  • emb_layer_norm: Layer normalization applied to the embedding layer.

  • final_layer_norm: Optional final layer normalization.

  • final_head_layer_norm: Optional layer normalization for the attention heads.

  • layers: A list of transformer encoder layers.

Methods:

forward: Performs the forward pass of the module.

__init__(encoder_layers: int = 6, embed_dim: int = 768, ffn_embed_dim: int = 3072, attention_heads: int = 8, emb_dropout: float = 0.1, dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.0, max_seq_len: int = 256, activation_fn: str = 'gelu', post_ln: bool = False, no_final_head_layer_norm: bool = False) → None[source]

Initializes and configures the layers and other components of the transformer encoder.

Parameters:
  • encoder_layers – (int) Number of encoder layers in the transformer.

  • embed_dim – (int) Dimensionality of the input embeddings.

  • ffn_embed_dim – (int) Dimensionality of the feedforward network model.

  • attention_heads – (int) Number of attention heads in each encoder layer.

  • emb_dropout – (float) Dropout rate for the embedding layer.

  • dropout – (float) Dropout rate for the encoder layers.

  • attention_dropout – (float) Dropout rate for the attention mechanisms.

  • activation_dropout – (float) Dropout rate for activations.

  • max_seq_len – (int) Maximum sequence length the model can handle.

  • activation_fn – (str) The activation function to use (e.g., “gelu”).

  • post_ln – (bool) If True, applies layer normalization after the self-attention and feedforward modules instead of before them.

  • no_final_head_layer_norm – (bool) If True, does not apply layer normalization to the final attention head.

forward(emb: torch.Tensor, attn_mask: torch.Tensor | None = None, padding_mask: torch.Tensor | None = None) → torch.Tensor[source]

Conducts the forward pass of the transformer encoder.

Parameters:
  • emb – (torch.Tensor) The input tensor of embeddings.

  • attn_mask – (Optional[torch.Tensor]) Attention mask to specify positions to attend to.

  • padding_mask – (Optional[torch.Tensor]) Mask to indicate padded elements in the input.

Returns:

(torch.Tensor) The output tensor after passing through the transformer encoder layers. It also returns tensors related to pair representation and normalization losses.