QSA(P)R and machine learning methods rely on finding correlations between molecular properties with known values for a given set of molecules, in order to predict those properties for another set of closely related molecules. The properties to be correlated are known as ‘descriptors’: values that depend on the structure of a molecule and can be either measured (pKa, refractive index, boiling point, etc.) or calculated (HOMO/LUMO gap, polarizability, electronic energy, etc.). Sometimes descriptors are very abstract features with little to no physicochemical meaning, but since we’re interested in finding correlation rather than causation, these descriptors can still be quite useful.
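
To make the idea of correlating a descriptor with a property concrete, here is a minimal sketch in plain Python. It takes the number of carbon atoms in the first five linear alkanes as a deliberately simple descriptor and their approximate normal boiling points as the property. The function name `pearson_r` and the choice of example are mine and not part of any QSPR package.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Descriptor: number of carbon atoms (methane through pentane).
carbon_count = [1, 2, 3, 4, 5]
# Property: approximate normal boiling points in degrees Celsius.
boiling_point_c = [-161.5, -88.6, -42.1, -0.5, 36.1]

r = pearson_r(carbon_count, boiling_point_c)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 says the descriptor tracks the property well within this small family, which is all a predictive model asks for; it says nothing about why.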

There are many codes available for calculating descriptors for a set of molecules, but I’ve recently come across Mordred, a Python-based code which does precisely that with ease. If you’re interested in more details about Mordred, you can read the original paper in the Journal of Cheminformatics (DOI: 10.1186/s13321-018-0258-y), and their GitHub is here.

Below, I gather the workflow needed to work with Mordred. As with other posts, these instructions aren’t thorough, but they will let you quickly get descriptors on the fly for your molecules.

Mordred is a Python-based code, so the first step is to have Anaconda installed. Download the Anaconda installer here and run the .sh file:

$ bash Anaconda-installer-filename.sh

The default settings are a good start; you can change them later if needed. Close the terminal and open it again. To test the installation, type the following command to get a list of installed packages:

$ conda list

Now we need to install RDKit with Anaconda (conda), in its own environment:

$ conda create -c conda-forge -n my-rdkit-env rdkit
$ conda activate my-rdkit-env

With the environment active, install Mordred itself, using either conda or pip (you only need one of the two):

$ conda install -c rdkit -c mordred-descriptor mordred
$ pip install 'mordred[full]'

To calculate descriptors for a given molecule you need its SMILES string, stored in a .smi file. With the environment active, run Mordred on that file:

$ conda activate
$ conda activate my-rdkit-env
$ python3 -m mordred example.smi -o example.csv
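
If you don’t have a .smi file at hand, the following sketch writes one. It assumes the common convention of one SMILES string per line followed by an optional name separated by a space; the three molecules and the helper name `write_smi` are only illustrative.

```python
from pathlib import Path

# Illustrative molecules as (SMILES, name) pairs; the names are optional labels.
molecules = [
    ("CCO", "ethanol"),
    ("c1ccccc1", "benzene"),
    ("CC(=O)O", "acetic_acid"),
]


def write_smi(path, entries):
    """Write one 'SMILES name' pair per line and return the number of lines."""
    lines = [f"{smiles} {name}" for smiles, name in entries]
    Path(path).write_text("\n".join(lines) + "\n")
    return len(lines)


count = write_smi("example.smi", molecules)
print(f"Wrote {count} molecules to example.smi")
```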

The example.smi file can contain one or several SMILES strings for a batch calculation; all results are written to the output file (-o filename.csv), which you can later open with various programs, including MS Excel if you have to. When finished, the command line may print several warnings or errors, which I still haven’t quite figured out, but so far they can be dismissed without much risk.
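
Since some descriptors may fail for some molecules, it can be handy to check which columns of the output are fully numeric before using them. The sketch below does that with the standard library only. It assumes the first column holds the molecule identifier and that a failed descriptor shows up as a cell that is empty or not a finite number; I have not checked exactly what Mordred writes in those cells, and `numeric_columns` is a name I made up.

```python
import csv
import math


def numeric_columns(csv_path, id_column=0):
    """Return header names of columns whose every value is a finite number."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
    header, data = rows[0], rows[1:]

    def is_number(text):
        # Empty cells and error messages raise ValueError; nan/inf are rejected.
        try:
            return math.isfinite(float(text))
        except ValueError:
            return False

    keep = []
    for index, name in enumerate(header):
        if index == id_column:
            continue  # skip the molecule identifier column
        if data and all(index < len(row) and is_number(row[index]) for row in data):
            keep.append(name)
    return keep


# Example call once example.csv exists:
# usable = numeric_columns("example.csv")
```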

You can always access the help page from the command line with:

$ python3 -m mordred --help

Finally, deactivate your environment to return your terminal to normal if needed.

$ conda deactivate

You can ask Mordred for a specific kind or number of descriptors for each molecule; the previous example yields all 1614 available descriptors for each molecule. A later step such as principal component analysis (PCA) is needed to single out the descriptors that are really useful for the given set.
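
Mordred can also be used as a Python library, which is one way of requesting only some descriptor families. The sketch below follows the `Calculator` and `register` usage shown in Mordred’s documentation, as far as I understand it; the choice of the `ABCIndex` and `RingCount` modules is arbitrary and the function name `selected_descriptors` is mine. It must be run inside the environment where RDKit and Mordred are installed.

```python
def selected_descriptors(smiles):
    """Compute only the ABCIndex and RingCount descriptor families for one SMILES."""
    # Imported here so the function can be defined without RDKit/Mordred present.
    from rdkit import Chem
    from mordred import Calculator, ABCIndex, RingCount

    calc = Calculator()        # start empty instead of loading every descriptor
    calc.register(ABCIndex)    # register all descriptors defined in a module
    calc.register(RingCount)

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")

    # The result maps descriptor names to their calculated values.
    return calc(mol).asdict()


# Example call (requires RDKit and Mordred):
# for name, value in selected_descriptors("c1ccccc1").items():
#     print(name, value)
```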

Again, this isn’t a thorough lesson on chemoinformatics, but I hope it helps you calculate descriptors in a cheap and fast way.