Models and Featurizations

Models
Standard classification model obtained by applying the logistic function to a weighted linear combination of the input features.
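As a minimal sketch of the prediction step (the weights below are random stand-ins rather than fitted values):

```python
import numpy as np

def predict_proba(X, w, b):
    """Logistic regression: sigmoid of a weighted linear combination."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))      # 4 samples, 3 input features
w = rng.normal(size=3)           # learned weights (stand-ins here)
b = 0.0
print(predict_proba(X, w, b))    # class probabilities in (0, 1)
```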
Standard classification and regression method based on an ensemble of decision trees, each trained on a different subsampled version of the original dataset.
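To make the ensemble idea concrete, here is a sketch of bootstrap aggregation over decision trees (scikit-learn's DecisionTreeClassifier as the base learner; the dataset, tree count, and depth are illustrative assumptions, and production forests also subsample features at each split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                  # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(int)         # toy labels

# each tree is trained on a different resampled version of the dataset
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    trees.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))

# predictions are averaged (majority vote) across the ensemble
votes = np.mean([t.predict(X) for t in trees], axis=0)
print(((votes > 0.5).astype(int) == y).mean())
```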
Standard neural network prediction method designed for multitask settings. Input features are processed through multiple shared fully-connected layers and then fed into separate linear classifiers/regressors, one for each task. On a single-task dataset it reduces to a vanilla neural network model.

A modified version of the multitask network designed for uncorrelated tasks. On top of the multitask network structure, it adds "bypass" layers that directly connect the input features to each individual task, increasing explanatory power when samples vary in ways unrelated to the shared tasks.
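A minimal PyTorch sketch of the bypass variant (dropping the bypass branches recovers the plain multitask network); the feature, hidden, bypass, and task sizes are arbitrary assumptions:

```python
import torch
import torch.nn as nn

class BypassMultitaskNet(nn.Module):
    """Shared fully-connected trunk with one head per task; each task
    also receives a 'bypass' branch fed directly from the inputs."""
    def __init__(self, n_features=1024, n_hidden=512, n_tasks=12):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.bypasses = nn.ModuleList(
            nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
            for _ in range(n_tasks))
        self.heads = nn.ModuleList(
            nn.Linear(n_hidden + 64, 1) for _ in range(n_tasks))

    def forward(self, x):
        h = self.shared(x)
        # each task combines the shared representation with its own bypass
        return torch.cat(
            [head(torch.cat([h, bp(x)], dim=1))
             for head, bp in zip(self.heads, self.bypasses)], dim=1)

logits = BypassMultitaskNet()(torch.randn(32, 1024))  # shape (32, 12)
```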
Refined K-nearest neighbour classifier. Based on the hypothesis that compounds with similar substructures have similar functionality, it makes predictions by combining the labels of the top K compounds most similar to the query sample.
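A minimal sketch of the underlying vote, using Tanimoto similarity over binary fingerprints (the fingerprints and K below are toy assumptions; a refined model can additionally learn how the neighbour influences are combined):

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def knn_predict(query, train_fps, train_labels, k=5):
    """Average the labels of the K most similar training compounds,
    weighted by their similarity to the query."""
    sims = np.array([tanimoto(query, fp) for fp in train_fps])
    top = np.argsort(sims)[-k:]
    if sims[top].sum() == 0.0:
        return train_labels[top].mean()
    return np.average(train_labels[top], weights=sims[top])

rng = np.random.default_rng(0)
train_fps = rng.integers(0, 2, size=(100, 64))   # toy fingerprints
train_labels = rng.integers(0, 2, size=100)      # toy binary labels
print(knn_predict(rng.integers(0, 2, size=64), train_fps, train_labels))
```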
A graph-based model (paired with the Weave featurizer described below) that likewise treats molecules as undirected graphs. Instead of performing convolutions locally (over the central atom and its neighbouring atoms), it applies global convolutions to the central atom and all other atoms in the molecule, together with their corresponding pair features.
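A rough numpy sketch of one such global atom update, assuming toy sizes and random stand-in weights (the full model also updates the pair features):

```python
import numpy as np

rng = np.random.default_rng(0)
n_atoms, d_atom, d_pair = 5, 8, 4                # toy sizes
A = rng.normal(size=(n_atoms, d_atom))           # atom feature vectors
P = rng.normal(size=(n_atoms, n_atoms, d_pair))  # pair feature matrix
W_a = rng.normal(size=(d_atom, d_atom)) * 0.1    # stand-in weights
W_p = rng.normal(size=(d_pair, d_atom)) * 0.1

def global_atom_update(A, P):
    """Each atom aggregates messages built from its pair features with
    ALL other atoms, not just its bonded neighbours."""
    pair_msgs = np.tanh(P @ W_p).sum(axis=1)     # (n_atoms, d_atom)
    return np.tanh(A @ W_a + pair_msgs)

print(global_atom_update(A, P).shape)            # (5, 8)
```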
An alternative graph-based method that applies to directed graphs. The model regards each molecule as a set of directed acyclic graphs, each rooted at a different atom. Results from all such graphs of a molecule are calculated and averaged to yield molecule-level properties.
A learnable extension of the Coulomb Matrix featurizer. Nuclear charges (atom types) are mapped to feature vectors, which are then iteratively updated based on the distance matrix and the neighbouring atoms. The final states of all atoms' feature vectors are mapped to outputs and summed to predict molecular properties.
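A rough numpy sketch of this interaction scheme under heavy assumptions (toy molecule, random stand-in weights, tanh nonlinearities; the real model learns these parameters and uses more elaborate distance gating):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = np.array([8, 1, 1])                          # nuclear charges (water)
R = np.array([[0.0, 0.0, 0.0],
              [0.96, 0.0, 0.0],
              [-0.24, 0.93, 0.0]])               # toy coordinates
D = np.linalg.norm(R[:, None] - R[None, :], axis=-1)  # distance matrix

d = 8                                            # embedding size (assumption)
embed = {z: rng.normal(size=d) for z in set(Z.tolist())}
V = np.stack([embed[z] for z in Z])              # atom-type feature vectors

W_cf = rng.normal(size=(d, d)) * 0.1             # atom features -> interaction
W_df = rng.normal(size=(1, d)) * 0.1             # distances     -> interaction
W_fc = rng.normal(size=(d, d)) * 0.1             # interaction   -> update
w_out = rng.normal(size=d) * 0.1                 # per-atom readout
mask = (1.0 - np.eye(len(Z)))[..., None]         # no self-interaction

for _ in range(3):                               # a few update passes
    gate = np.tanh(D[..., None] * W_df)          # distance gating, (n, n, d)
    msgs = np.tanh((V @ W_cf)[None, :, :] * gate) @ W_fc
    V = V + (msgs * mask).sum(axis=1)            # update from other atoms

print((V @ w_out).sum())                         # summed atomic contributions
```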
A learnable version of the circular fingerprint, replacing fixed hash functions with differentiable network layers. In graph convolutional models, molecules are treated as undirected graphs: atoms as nodes and bonds as edges. Each convolutional layer extends the feature vector of the central atom by applying a convolutional function (a network layer) to the atom itself and its neighbours (the nodes connected to it by edges).
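A minimal numpy sketch of one such layer, assuming a toy 4-atom adjacency matrix and random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],                    # toy undirected graph:
                [1, 0, 1, 1],                    # atoms as nodes,
                [0, 1, 0, 0],                    # bonds as edges
                [0, 1, 0, 0]], dtype=float)
H = rng.normal(size=(4, 8))                      # initial atom features
W_self = rng.normal(size=(8, 8)) * 0.1           # differentiable layers in
W_nbr = rng.normal(size=(8, 8)) * 0.1            # place of fixed hashes

def graph_conv(H, adj):
    """Each atom combines its own features with the summed features
    of its bonded neighbours."""
    return np.tanh(H @ W_self + adj @ H @ W_nbr)

fingerprint = graph_conv(H, adj).sum(axis=0)     # pooled molecule vector
print(fingerprint.shape)                         # (8,)
```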
Message passing neural network (MPNN) is a generalized graph-based model. Its prediction process is separated into two phases: a message passing phase (an edge-dependent neural network) and a readout phase (a seq2seq model for sets).

Featurizations

The molecule is decomposed into segments of variable size, each originating from a heavy atom (e.g. C, N, O). All segments are then assigned unique identifiers, which are hashed together into a fixed-length binary fingerprint.
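Circular fingerprints of this kind can be computed with RDKit; the molecule, radius, and bit count below are arbitrary choices:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
# radius-2 circular fingerprint hashed into 1024 bits (ECFP4-style)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())
```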
The Coulomb Matrix encodes nuclear charges and the corresponding Cartesian coordinates into a matrix, with diagonal elements representing nuclear charges and off-diagonal elements representing pairwise Coulomb repulsions.
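A minimal implementation of this encoding, using the conventional 0.5 * Z^2.4 diagonal term:

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Diagonal: nuclear charge term; off-diagonal: Coulomb repulsion
    Z_i * Z_j / |R_i - R_j| between atom pairs."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

Z = np.array([8, 1, 1])                          # water: O, H, H
R = np.array([[0.0, 0.0, 0.0],
              [0.9572, 0.0, 0.0],
              [-0.2400, 0.9266, 0.0]])           # coordinates in angstrom
print(coulomb_matrix(Z, R))
```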
Symmetry functions are another encoding of Cartesian coordinates, focused on preserving the rotational and permutational symmetry of the system. They introduce a series of radial and angular symmetry functions with different distance and angle cutoffs.
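A minimal sketch of a radial symmetry function with a smooth cosine cutoff, in the style of Behler-Parrinello descriptors (the eta, r_s, and cutoff values are arbitrary):

```python
import numpy as np

def cutoff(r, r_c=6.0):
    """Smooth cosine cutoff that decays to zero at radius r_c."""
    return np.where(r <= r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def radial_symmetry(R, i, eta=1.0, r_s=2.0, r_c=6.0):
    """Radial symmetry function for atom i: a sum of Gaussians over its
    distances to all other atoms, invariant to rotation and permutation."""
    d = np.delete(np.linalg.norm(R - R[i], axis=1), i)  # exclude atom i
    return np.sum(np.exp(-eta * (d - r_s) ** 2) * cutoff(d, r_c))

R = np.array([[0.0, 0.0, 0.0],
              [0.96, 0.0, 0.0],
              [-0.24, 0.93, 0.0]])               # toy coordinates
# varying (eta, r_s) yields a family of features for each atom
print([radial_symmetry(R, i) for i in range(len(R))])
```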
The Grid Featurizer, initially built for PDBbind, relies on the detailed structure of a protein-ligand pair to summarize intermolecular forces. It incorporates fingerprints of both the protein and the ligand, as well as an enumeration of interactions such as salt bridges and hydrogen bonds.
Using the same per-atom feature vectors as the graph convolutions featurizer, the Weave featurizer elaborates the neighbour list into a matrix of pair feature vectors, each representing the connectivity and distance between a pair of atoms.
The molecule is represented by a neighbour list and a set of initial feature vectors, each corresponding to a single atom. A feature vector summarizes the atom's local chemical environment, including atom type, hybridization type, and valence structure.
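A toy version of such a featurization with RDKit (real featurizers one-hot encode these descriptors and include several more):

```python
from rdkit import Chem

def atom_features(atom):
    """Summarize an atom's local chemical environment."""
    return [atom.GetAtomicNum(),                 # atom type
            int(atom.GetHybridization()),        # hybridization type
            atom.GetTotalValence(),              # valence
            atom.GetDegree()]                    # number of neighbours

mol = Chem.MolFromSmiles("CCO")                  # ethanol
features = [atom_features(a) for a in mol.GetAtoms()]
# neighbour list read off the bonds of the undirected molecular graph
neighbours = [[n.GetIdx() for n in a.GetNeighbors()] for a in mol.GetAtoms()]
print(features, neighbours)
```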