Dataset Collection

Quantum Mechanics

  • QM7/QM7b (structure): Electronic properties (atomization energy, HOMO/LUMO, etc.) determined using ab initio density functional theory (DFT). Task type: regression. Data type: 3D coordinates.

  • QM8 (structure): Electronic spectra and excited-state energies of small molecules calculated by multiple quantum mechanical methods. Task type: regression. Data type: 3D coordinates.

  • QM9 (structure): Geometric, energetic, electronic and thermodynamic properties of DFT-modelled small molecules. Task type: regression. Data type: 3D coordinates.

Physical Chemistry

  • ESOL: Water solubility data (log solubility in mol/L) for common organic small molecules. Task type: regression.

  • FreeSolv: Experimental and calculated hydration free energies of small molecules in water. Task type: regression.

  • Lipophilicity: Experimental results for the octanol/water distribution coefficient (logD at pH 7.4). Task type: regression.

Biophysics

  • PCBA: Selected from PubChem BioAssay, consisting of measured biological activities of small molecules generated by high-throughput screening. Task type: classification.

  • MUV: Subset of PubChem BioAssay selected by applying a refined nearest-neighbor analysis, designed for validation of virtual screening techniques. Task type: classification.

  • HIV: Experimentally measured abilities to inhibit HIV replication. Task type: classification.

  • PDBbind: Binding affinities for biomolecular complexes; structures of both proteins and ligands are provided. Task type: regression. Data type: 3D coordinates.

  • BACE: Quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1). Task type: classification.

Physiology

  • BBBP: Binary labels of blood-brain barrier penetration (permeability). Task type: classification.

  • Tox21: Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways. Task type: classification.

  • ToxCast: Toxicology data for a large library of compounds based on in vitro high-throughput screening, including experiments on over 600 tasks. Task type: classification.

  • SIDER: Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes. Task type: classification.

  • ClinTox: Qualitative data on drugs approved by the FDA and those that have failed clinical trials for toxicity reasons. Task type: classification.

a All MoleculeNet datasets are split into training, validation and test subsets at an 80/10/10 ratio. Different splitting methods are recommended depending on each dataset's contents. For details of the splitting methods, please refer to the paper.
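
As an illustration of the 80/10/10 ratio only, the sketch below performs a plain random split in standard-library Python. It is not the MoleculeNet implementation: the function name `split_80_10_10` and the `seed` parameter are hypothetical, and random splitting is just one possible method, which may not be the one recommended for a given dataset.

```python
import random


def split_80_10_10(items, seed=0):
    """Randomly partition items into train/valid/test lists (80/10/10).

    Hypothetical helper for illustration; assumes a random split, which
    may differ from the split recommended for a particular dataset.
    """
    shuffled = list(items)
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test
```

Calling `split_80_10_10(range(100))` yields three disjoint lists of 80, 10 and 10 items that together cover the input.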

b Different classification and regression metrics are recommended based on previous work and each dataset's contents:
          ROC-AUC:  Area under the receiver operating characteristic curve
          PRC-AUC:  Area under the precision-recall curve
          RMSE: Root-mean-square error
          MAE: Mean absolute error
    For details of the metrics, please refer to the paper.
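
To make the metric definitions above concrete, here is a minimal standard-library sketch of RMSE, MAE and ROC-AUC. The function names are hypothetical and this is not the benchmark's own evaluation code. ROC-AUC is computed through its pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive example scores higher, with ties counted as one half), assuming binary labels 0 and 1. PRC-AUC is omitted for brevity.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    Assumes labels are 0 or 1 and that both classes are present.
    Quadratic in the number of examples, so suitable only for small inputs.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

Lower values are better for RMSE and MAE, while higher values are better for ROC-AUC, where 0.5 corresponds to random ranking.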

Dataset Details