Dataset Collection
- ESOL: Water solubility data (log solubility in moles per litre) for common organic small molecules.
- FreeSolv: Experimental and calculated hydration free energy of small molecules in water.
- Lipophilicity: Experimental results of octanol/water distribution coefficient (logD at pH 7.4).
- BBBP: Binary labels of blood-brain barrier penetration (permeability).
- Tox21: Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways.
- ToxCast: Toxicology data for a large library of compounds based on in vitro high-throughput screening, including experiments on over 600 tasks.
- SIDER: Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes.
- ClinTox: Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons.
- PCBA: Selected from PubChem BioAssay, consisting of measured biological activities of small molecules generated by high-throughput screening.
- MUV: Subset of PubChem BioAssay selected by applying a refined nearest neighbor analysis, designed for validation of virtual screening techniques.
- HIV: Experimentally measured abilities to inhibit HIV replication.
- PDBbind: Binding affinities for biomolecular complexes; structures of both proteins and ligands are provided.
- BACE: Quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1).

a All MoleculeNet datasets are split into training, validation, and test subsets following an 80/10/10 ratio. The recommended splitting method differs by dataset, depending on its contents. For details of the splitting methods, please refer to the paper.
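As a rough illustration of the 80/10/10 ratio, the sketch below partitions sample indices at random. It covers only a plain random split, not the other splitting methods recommended for particular datasets, and the function name `random_split` and its parameters are hypothetical, not part of any MoleculeNet API.

```python
import random


def random_split(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Hypothetical helper: shuffle indices and cut them 80/10/10.

    Returns three disjoint lists of indices (train, valid, test).
    The test subset receives whatever remains after train and valid.
    """
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)

    n_train = round(frac_train * n_samples)
    n_valid = round(frac_valid * n_samples)

    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


train_idx, valid_idx, test_idx = random_split(100)
print(len(train_idx), len(valid_idx), len(test_idx))  # 80 10 10
```

Fixing the seed matters when results are compared across models, since every model should then see the same three subsets.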
b Different classification and regression metrics are recommended, based on previous work and each dataset's contents:
- ROC-AUC: Area under the receiver operating characteristic curve
- PRC-AUC: Area under the precision-recall curve
- RMSE: Root-mean-square error
- MAE: Mean absolute error
For details of the metrics, please refer to the paper.
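To make three of these metrics concrete, here is a minimal pure-Python sketch of RMSE, MAE, and ROC-AUC. The function names are hypothetical, and the ROC-AUC is computed with the pairwise-ranking formulation (the fraction of positive/negative pairs ranked correctly, ties counting one half), which is equivalent to the area under the ROC curve but is not necessarily how any particular library computes it. PRC-AUC is omitted because its value depends on the interpolation convention chosen.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a
    random negative. Quadratic in the number of samples, so this is
    suitable for illustration only."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Lower is better for RMSE and MAE (regression), while higher is better for ROC-AUC (classification), where 0.5 corresponds to random ranking.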
Dataset Details