This interactive notebook is meant to be used for teaching purposes and we will lead you through a (simplified) virtual molecular compound screening pipeline. Before we can look at the biological background in more detail, we have to import a couple of prerequistes. We separated some complicated looking code out into functions, you can have a peak here if you want to learn more about how the code works under the hood :)
# ngl molecular visualisation package
import nglview as nv
from func import *
Let's get started!
Every cell in our body has a big range of tasks to fulfill to stay alife, grow, move and replicate. All of these tasks are facilitated by proteins and we can group them into different classes depending on their function:
To get a good overview over proteins have a look at this video by the protein data bank.
If a protein does no longer fulfill it's function this can have a big impact on a cell, a tissue and ultimate the whole body. Many diseases we commonly know can be attributed to misfunctioning proteins, like cystic fibrosis, alzheimer's disease and many types of cancer.
One concrete example is breast cancer. The biggest subclass of breast cancer is dependent on a protein called estrogen receptor (ER for short; view on protein data bank: 1G50). More information on the connection between ER and breast cancer can be found on this wikipedia page.
But we also have a little summary picture for you:
When estrogen binds to estrogen receptor the protein locates into the cell nucleus and binds to our genetic material. This leads to an increase in expression of a wide range of target genes which in turn leads to an increase in cell growth. This ultimately leads to tumour formation.
You can play around with the 3D structure of estrogen receptor by running the code below.
view_estrogen_receptor()
To be able to help people suffering from a protein related disease we want to develop drugs which inhibit the underlying protein. Thousands of scientists world wide dedicate their research to finding new durg targets and of course big pharma companys aid in this process.
The first step of finding a new drug is to identify a new chemical compound which can inhibit a given protein target. This is called a compound screening which narrows a library of hundred thousand molecules down to a few promising lead compounds. We visualized this in the figure below.
However, this process is very labour and time intensive in a laboratory setting. This is where computers come into play.
The field of Computer Aided Drug Design or short CADD encompasses a wide range of methodologies in all stages of the drug discovery pipeline. This tutorial is giving an example of a virtual compound screen, which is basically translating the laboratory compound screen into the computer.
We supply you with a dataset of compounds which have been shown to bind to estrogen receptor (you remember? the protein responsible for breast cancer). This dataset which we downloaded from a big database called ChEMBL, comes with a key value for each compound - the so called inhibitory concentration 50 or IC50. This is a measure of the effectiveness of a substance in inhibiting a protein.
We then use a machine learning approach to predict how well a new set of molecules would inhibit estrogen receptor.
If you follow the instructions on the next slides and run the code, you can identify your own lead compounds.
To train a machine learning tool to predict how well a new molecule is going to inhibit ER, we need training data:
# Read in our training dataset
training_data = pd.read_csv("Data/training_data_raw.csv")
# This is showing you how the dataset looks. It is essentially a big table
training_data.head()
Compound ID | Structure | IC50 | |
---|---|---|---|
0 | CHEMBL3623004 | C[C@@H]1Cc2c([nH]c3ccccc23)[C@H](N1CC(C)(C)F)c... | 0.138 |
1 | CHEMBL195515 | CC\C(=C(/c1ccc(O)cc1)\c2ccc(\C=C\C(=O)O)cc2)\c... | 0.150 |
2 | CHEMBL198803 | C[C@@H](COc1ccc(cc1)C(=O)c2c(sc3cc(O)ccc23)c4c... | 0.200 |
3 | CHEMBL180873 | Cc1cccc(c1)C(=O)N2CCN(CC2)c3ccc(cc3)C(=O)c4c(s... | 0.270 |
4 | CHEMBL181368 | C[C@H]1[C@@H]([C@@H](Oc2ccc(O)c(F)c12)c3ccc(OC... | 0.300 |
We need to clean this data to get rid of duplicated molecules and rows which have missing data.
training_data = clean_data(training_data)
Cleaning data...done!
Now we train a random forest predictor which is a type of machine learning algorithm.
random_forest = train_random_forest(training_data)
Training random forest...done!
Now we can use the trained random forest to predict the inhibitory potential of some new structures and identify promising compounds.
# Read in the dataset of new molecules
test_data = pd.read_csv("Data/test_data_clean.csv")
# The random forest algorithm predicts IC50 values for each new compound.
test_data = predict_ic50(random_forest, test_data)
# This is showing you how the dataset looks.
test_data.head()
Predicting IC50 values using random forest...done!
Compound ID | Structure | Predicted IC50 | |
---|---|---|---|
84 | CHEMBL183467 | C[C@@H]1CN(CCOc2ccc(cc2)[C@@H]3Oc4ccc(O)cc4S[C... | 0.551612 |
41 | CHEMBL182794 | CC(C)[C@H](COc1ccc(cc1)[C@@H]2Oc3ccc(O)cc3S[C@... | 0.631991 |
89 | CHEMBL181369 | C[C@H]1[C@@H]([C@@H](Oc2ccc(O)c(F)c12)c3ccc(OC... | 0.706330 |
48 | CHEMBL3774647 | C\C(=C(\c1ccc(O)cc1)/c2ccc(OCCN3CCCCC3)cc2)\c4... | 3.645439 |
3 | CHEMBL3427394 | OC(=O)\C=C\c1ccc(CC2=C(C(=O)Oc3cc(O)ccc23)c4cc... | 6.295644 |
Let's look at these results :)
To get more of an idea of how your identified lead compounds look we will plot their structure.
# Your top hit is loaded into a molecule object
mol = load_molecule_from_smiles(test_data["Structure"].iloc[1])
# This function produces a 2D picture of the compound
draw_pretty_2D_pics(mol)
# If you run this code you get a 3D plot of your compound. You can even play around with it :)
mol.RemoveAllConformers()
show_molecule_3D_structure(mol)
If you want to investigate other lead molecules, simply change the number in the square brackets behind iloc in the cell above like this: \
mol = load_molecule_from_smiles(test_data["Structure"].iloc[2])
Have fun exploring!