Virtual Compound Screening

Bringing computer aided drug design to the class room

This project was developed in the scope of the Life Science Hackathon 2019 in London.

This interactive notebook is meant to be used for teaching purposes and we will lead you through a (simplified) virtual molecular compound screening pipeline. Before we can look at the biological background in more detail, we have to import a couple of prerequistes. We separated some complicated looking code out into functions, you can have a peak here if you want to learn more about how the code works under the hood :)

In [3]:
# ngl molecular visualisation package
import nglview as nv 
from func import *

Let's get started!

Proteins are the molecular machinery of cells

Every cell in our body has a big range of tasks to fulfill to stay alife, grow, move and replicate. All of these tasks are facilitated by proteins and we can group them into different classes depending on their function:

  1. Defence
  2. Communication
  3. Enzymatic Reactions
  4. Transport
  5. Storage
  6. Structure

To get a good overview over proteins have a look at this video by the protein data bank.

Misbehaving proteins lead to disease

If a protein does no longer fulfill it's function this can have a big impact on a cell, a tissue and ultimate the whole body. Many diseases we commonly know can be attributed to misfunctioning proteins, like cystic fibrosis, alzheimer's disease and many types of cancer.

One concrete example is breast cancer. The biggest subclass of breast cancer is dependent on a protein called estrogen receptor (ER for short; view on protein data bank: 1G50). More information on the connection between ER and breast cancer can be found on this wikipedia page.

But we also have a little summary picture for you:

title

When estrogen binds to estrogen receptor the protein locates into the cell nucleus and binds to our genetic material. This leads to an increase in expression of a wide range of target genes which in turn leads to an increase in cell growth. This ultimately leads to tumour formation.

You can play around with the 3D structure of estrogen receptor by running the code below.

In [4]:
view_estrogen_receptor()

Finding new drugs against proteins to target disease

To be able to help people suffering from a protein related disease we want to develop drugs which inhibit the underlying protein. Thousands of scientists world wide dedicate their research to finding new durg targets and of course big pharma companys aid in this process.

The first step of finding a new drug is to identify a new chemical compound which can inhibit a given protein target. This is called a compound screening which narrows a library of hundred thousand molecules down to a few promising lead compounds. We visualized this in the figure below.

Drawing

However, this process is very labour and time intensive in a laboratory setting. This is where computers come into play.

Computer Aided Drug Design

The field of Computer Aided Drug Design or short CADD encompasses a wide range of methodologies in all stages of the drug discovery pipeline. This tutorial is giving an example of a virtual compound screen, which is basically translating the laboratory compound screen into the computer.

We supply you with a dataset of compounds which have been shown to bind to estrogen receptor (you remember? the protein responsible for breast cancer). This dataset which we downloaded from a big database called ChEMBL, comes with a key value for each compound - the so called inhibitory concentration 50 or IC50. This is a measure of the effectiveness of a substance in inhibiting a protein.

We then use a machine learning approach to predict how well a new set of molecules would inhibit estrogen receptor.

If you follow the instructions on the next slides and run the code, you can identify your own lead compounds.

Step 1: Data Cleaning

To train a machine learning tool to predict how well a new molecule is going to inhibit ER, we need training data:

In [5]:
# Read in our training dataset 
training_data = pd.read_csv("Data/training_data_raw.csv")
# This is showing you how the dataset looks. It is essentially a big table
training_data.head()
Out[5]:
Compound ID Structure IC50
0 CHEMBL3623004 C[C@@H]1Cc2c([nH]c3ccccc23)[C@H](N1CC(C)(C)F)c... 0.138
1 CHEMBL195515 CC\C(=C(/c1ccc(O)cc1)\c2ccc(\C=C\C(=O)O)cc2)\c... 0.150
2 CHEMBL198803 C[C@@H](COc1ccc(cc1)C(=O)c2c(sc3cc(O)ccc23)c4c... 0.200
3 CHEMBL180873 Cc1cccc(c1)C(=O)N2CCN(CC2)c3ccc(cc3)C(=O)c4c(s... 0.270
4 CHEMBL181368 C[C@H]1[C@@H]([C@@H](Oc2ccc(O)c(F)c12)c3ccc(OC... 0.300

We need to clean this data to get rid of duplicated molecules and rows which have missing data.

In [6]:
training_data = clean_data(training_data)
Cleaning data...done!

Step 2: Train a Random Forest Predictor

Now we train a random forest predictor which is a type of machine learning algorithm.

In [7]:
random_forest = train_random_forest(training_data)
Training random forest...done!

Step 3: Use the Random Forest to predict IC50 values

Now we can use the trained random forest to predict the inhibitory potential of some new structures and identify promising compounds.

In [8]:
# Read in the dataset of new molecules 
test_data = pd.read_csv("Data/test_data_clean.csv")
# The random forest algorithm predicts IC50 values for each new compound.
test_data = predict_ic50(random_forest, test_data)
# This is showing you how the dataset looks.
test_data.head()
Predicting IC50 values using random forest...done!
Out[8]:
Compound ID Structure Predicted IC50
84 CHEMBL183467 C[C@@H]1CN(CCOc2ccc(cc2)[C@@H]3Oc4ccc(O)cc4S[C... 0.551612
41 CHEMBL182794 CC(C)[C@H](COc1ccc(cc1)[C@@H]2Oc3ccc(O)cc3S[C@... 0.631991
89 CHEMBL181369 C[C@H]1[C@@H]([C@@H](Oc2ccc(O)c(F)c12)c3ccc(OC... 0.706330
48 CHEMBL3774647 C\C(=C(\c1ccc(O)cc1)/c2ccc(OCCN3CCCCC3)cc2)\c4... 3.645439
3 CHEMBL3427394 OC(=O)\C=C\c1ccc(CC2=C(C(=O)Oc3cc(O)ccc23)c4cc... 6.295644

Congrats! You have done a virtual compound screen!!!

Let's look at these results :)

Step 4: Visualise the Top Targets

To get more of an idea of how your identified lead compounds look we will plot their structure.

In [9]:
# Your top hit is loaded into a molecule object
mol = load_molecule_from_smiles(test_data["Structure"].iloc[1])
In [10]:
# This function produces a 2D picture of the compound
draw_pretty_2D_pics(mol)
Out[10]:
O O OH S OH N
In [11]:
# If you run this code you get a 3D plot of your compound. You can even play around with it :) 
mol.RemoveAllConformers()
show_molecule_3D_structure(mol)

If you want to investigate other lead molecules, simply change the number in the square brackets behind iloc in the cell above like this: \ mol = load_molecule_from_smiles(test_data["Structure"].iloc[2])

Have fun exploring!