Machine learning strategies for identifying repurposed drugs for cancer therapy

Introduction

The 2018 Nobel Prize in Physiology or Medicine was awarded to two researchers for their discovery of immune checkpoint proteins [ ]. Although the discovery initiated the development of revolutionary targeted immunotherapies, critical challenges remain to be addressed, and these therapies are far from perfect [ ]. Limitations were observed earlier in chemotherapies, where cytotoxic compounds are administered to directly kill tumor cells and slow the progression of disease [ ]. Other types of targeted therapies, where small molecules interfere with tumor-specific molecular abnormalities, have also been actively studied for over half a century and used in parallel with other types of anticancer therapies. While advances in our understanding of cancer biology and in treatment options have made some types of cancer manageable, challenges remain in all types of anticancer therapies. First, efficacy is often limited to a small subset of patients, and it is difficult to predict patients' responses to treatment [ ]. Furthermore, it is now well known that many anticancer drugs, including both immunotherapeutic and other targeted therapeutic agents, cause serious side effects [ ]. Natural redundancy and diversity in biological networks, such as feedback mechanisms, enhance the robustness of phenotypic outcomes against perturbations and make it difficult to pinpoint the targets that effectively stop tumor progression. In addition, multidrug resistance develops over time, rendering tumors resistant to previously effective treatments [ ]. More effective and safer drugs are needed to battle and eventually conquer cancers.

Despite the urgent need, drug discovery is generally a lengthy and costly process with a high risk of failure. A variety of risk factors delay the entry of new drugs into the market and increase the chance of withdrawal, raising the overall cost of drug development [ ]. Drug-induced side effects and toxicity are among the key issues behind the high rate of drug attrition [ ]. The limited success of drug development suggests flaws in the long-standing paradigm of one drug–one target–one disease, where the goal is to design a molecule that inhibits a biological target (e.g., a protein) known to be crucial for the disease. Under this paradigm, drug candidate molecules are optimized to interact with the intended target (on-target) without proactively characterizing off-target interactions, and safety and toxicity tests are left to the later stages of the pipeline. Off-target interactions, most of them unexpected, may cause undesirable outcomes, even fatal ones in extreme cases, causing irreparable damage to patients and to the business [ ]. On the other hand, the treatment of complex diseases may benefit from off-target interactions if on- and off-target interactions synergistically reverse the pathological processes. Indeed, it is now known that many approved drugs interact with more than one biological target [ , ], and the importance of understanding the complex biological interactions of such multitargeting drug molecules is emphasized in a new paradigm of drug discovery, polypharmacology [ , ].

Rather than treating off-target interactions only as a source of unexpected outcomes, polypharmacology attempts to understand drug action as the result of multiple interactions involving the drug molecule and to maximize the benefit from those interactions. Indeed, it has been suggested that multitargeting drugs are superior to highly selective single-target drugs [ ]. Under the new paradigm, off-target interactions of existing drugs can be used to repurpose them for new indications. Protein–ligand interaction profiles for new ligands can be computationally predicted, and the chemical scaffolds active against multiple targets of interest can be integrated into a single molecule to maximize therapeutic effects and minimize adverse events [ ]. New discoveries in biological studies reveal previously overlooked anticancer targets, and new drugs can be designed to play multiple roles in the biological network [ , ]. Therefore, an early understanding of the drug–target interaction profile across the whole genome is essential for the development of new, more effective, and safer drugs. However, our knowledge of the intermolecular interactions in which drug molecules take part is limited. It is prohibitively expensive to experimentally evaluate all possible drug interactions. Drugs and drug candidate molecules are typically screened against a subset of potential biological targets, resulting in biased, noisy, sparse, and incomplete interaction profiles. At present, no experimental technique is affordable and scalable enough to screen compounds for their complete bioactivity profile in humans.

Although not yet scalable to the whole human genome, high-throughput experimental methods have produced a tremendous amount of compound bioactivity data, providing rich resources for data-driven knowledge discovery. To explore the data rapidly and systematically and to discover hidden knowledge, various types of computational methods have been developed and applied to predict potential interactions of drug molecules. Among the many classes of computational methods applicable to drug discovery, those specifically designed to predict unknown protein–ligand interactions are particularly suitable for filling the gaps in sparse bioactivity knowledge. This review discusses recent advances in machine learning methods for the prediction of protein–ligand interactions as well as the efforts to collect and curate experimental data. While some methods directly aim to discover new anticancer therapies, other methods also provide opportunities to discover anticancer therapies when appropriate data sets and validation steps are incorporated. We aim to guide readers who are interested in computer-aided drug discovery by providing information about collecting data, preprocessing the data, and methods that can use the processed data for inference. We first discuss the major biopharmaceutical databases that are frequently used to train and evaluate computational methods. Then, different types of computational approaches to predicting intermolecular interactions are discussed with examples. Fig. 3.1 illustrates a general workflow for computational protein–ligand interaction prediction projects.

Figure 3.1. Illustration of a general workflow for computational protein–ligand interaction prediction projects.

It should be emphasized that drug action is a complex process. Genome-wide protein–ligand interactions alone may be insufficient to predict clinical end points, such as therapeutic efficacy and side effects. A systems biology approach is needed to model the collective behavior of biomolecular interactions (e.g., among DNA, RNA, proteins, metabolites, and drugs), which is beyond the scope of this chapter. Interested readers are referred to other publications [ ].

Open-access databases for computational drug discovery projects

High-quality and large-scale compound activity data (e.g., protein–ligand interactions) are indispensable in artificial intelligence–based drug discovery projects. To train a computational prediction method, often called a computational model, known protein–ligand interactions are provided in a form the model can accept. Many public databases that curate experimentally measured protein–ligand binding affinities can be used for this purpose. The protein–ligand pairs are numerically represented using fingerprinting techniques described below, and the model is trained and evaluated on the numerically processed data sets. Therefore, it is important to choose and use the databases appropriately. Databases that provide compound bioactivity data or proteomic information are especially useful for protein–ligand interaction prediction projects. While they are roughly divided into bioactivity-centric and proteomic databases, many current databases are actively maintained and updated to integrate data from multiple resources, providing richer and more comprehensive data sets. Biological network data can also be integrated into computational models to improve prediction. The following sections describe the commonly used bioactivity and proteomic databases for artificial intelligence–based drug discovery and the standard training and evaluation methods (Table 3.1). Other types of large-scale omics data include genomics, transcriptomics, metabolomics, and biological pathway information, and relevant databases are discussed elsewhere [ ].

Table 3.1

Commonly used databases for artificial intelligence–based drug discovery projects.

Category	Name (reference)	Features	Link
Bioactivity-centric	ZINC [ ]	Large-scale compound library, compound 3D structures ready for docking	https://zinc.docking.org
	ChEMBL [ ]	Continuous-valued bioactivities, large-scale bioassays from publications	https://www.ebi.ac.uk/chembl
	PubChem [ ]	Large-scale compound library with activities	https://pubchem.ncbi.nlm.nih.gov
	BindingDB [ ]	Continuous-valued bioactivities with focus on potential drug targets	https://www.bindingdb.org/bind/index.jsp
	LINCS [ ]	Kinase-specific bioactivities	https://lincs.hms.harvard.edu
	STITCH [ ]	Protein–ligand and ligand–ligand interaction data, including predicted activities	http://stitch.embl.de
	BioGRID [ ]	Protein–ligand and protein–protein interaction data from publications	https://thebiogrid.org
	SIDER [ ]	Drug-induced side effect data	http://sideeffects.embl.de
	KEGG [ ]	Bioactivity and biological pathway data	https://www.genome.jp/kegg/kegg1.html
	DrugBank [ ]	Drug–protein interactions for approved and investigational drugs	https://www.drugbank.ca
	ExCAPE-DB [ ]	Example of systematic data integration	https://doi.org/10.5281/zenodo.2543724
	MUV [ ]	Example of systematic data split	https://omictools.com/muv-tool
	DUD-E [ ]	Example of systematic data split	http://dude.docking.org
Proteomic	UniProt [ ]	Primary amino acid sequences and functional domain information	https://www.uniprot.org
	PDB [ ]	Largest existing protein 3D structure database to date	https://www.rcsb.org
	STRING [ ]	Protein–protein interaction with functional annotations, including predictions	https://string-db.org
	The Human Protein Atlas [ ]	Human protein classifications based on functions and phenotypes	https://www.proteinatlas.org
	Harmonizome [ ]	Multiple categories of data relevant to genes, proteins, cell lines, and pathways	http://amp.pharm.mssm.edu/Harmonizome

Bioactivity-centric databases

Bioactivity-centric databases are rich sources of protein–ligand interaction data, which can serve as the known interactions on which computational models are trained. Although these databases are large, their interaction profiles are incomplete, and computational models are used to fill in the missing entries. The list of databases introduced here is not exhaustive, but each of them is updated frequently to maintain quality and extend coverage. The bioactivity data that they provide are not necessarily mutually exclusive.
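
To make the notion of an incomplete interaction profile concrete, the following minimal Python sketch arranges a handful of bioactivity records into a compound-by-target profile, reports the fraction of pairs that have a measurement, and lists the unmeasured pairs that a prediction model would be asked to fill in. The record layout, identifiers, and activity values are invented for illustration and do not follow the schema of any database discussed here.

```python
from collections import defaultdict

# Hypothetical toy records: (compound_id, target_id, activity value).
# All identifiers and values are invented for illustration only.
records = [
    ("cmpd_A", "target_1", 7.2),
    ("cmpd_A", "target_2", 5.1),
    ("cmpd_B", "target_1", 6.4),
    ("cmpd_C", "target_3", 8.0),
]


def build_profile(records):
    """Map each compound to a {target: activity} dict of measured pairs."""
    profile = defaultdict(dict)
    for compound, target, value in records:
        profile[compound][target] = value
    return dict(profile)


def all_targets(profile):
    """Sorted list of every target seen in at least one record."""
    return sorted({t for row in profile.values() for t in row})


def coverage(profile):
    """Fraction of compound-target pairs that have a measurement."""
    total = len(profile) * len(all_targets(profile))
    measured = sum(len(row) for row in profile.values())
    return measured / total if total else 0.0


def unmeasured_pairs(profile):
    """Pairs with no measurement, i.e., candidates for prediction."""
    targets = all_targets(profile)
    return [(c, t) for c in sorted(profile) for t in targets
            if t not in profile[c]]


profile = build_profile(records)
print(f"coverage: {coverage(profile):.2f}")
print("unmeasured:", unmeasured_pairs(profile))
```

In real data the measured fraction is far smaller than in this toy case, and the measured pairs are not a random sample of all pairs, which is why the text describes the profiles as biased as well as sparse.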

ZINC is a free database of commercially available compounds for virtual screening tasks [ ]. Currently, it contains over 700 million purchasable compounds, of which over 200 million have 3D structures, making the database especially useful for virtual docking experiments. ChEMBL, a part of the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), is a publicly available database of bioactivities from multiple sources [ ]. The ChEMBL database (version 22) contains approximately 14 million bioactivity values from more than 1 million assays, covering over 1.6 million unique compounds and 9000 proteins. While the major sources of its bioactivity samples are publications, it also contains data sets deposited by both nonprofit and commercial organizations. PubChem, a chemical information database at the US National Center for Biotechnology Information, contains about 250 million bioactivities for over 200 million substances and 17,000 protein targets from over 30 million publications and 3 million patents [ ]. BindingDB contains over 1.7 million measured protein–ligand binding affinities and focuses mainly on small, druglike molecules and potentially druggable target proteins [ ]. The Library of Integrated Network–based Cellular Signatures (LINCS) is a database of cell-based perturbation-response signatures, which contains data samples for small molecules, cells, genes, and proteins categorized by assay type [ ]. STITCH is a protein–ligand interaction database containing over 400,000 chemicals and their interacting protein targets [ ]. Its protein–ligand interaction data contain computationally predicted samples as well as samples from other databases. BioGRID contains protein and genetic interactions as well as chemical interactions with posttranslational modification information [ ]. SIDER, which is also part of EMBL, is a database of known drug–side effect associations [ ]. KEGG contains a broad spectrum of biochemical and biomedical data sets, including protein–ligand interactions for approved drugs, gene–biological pathway associations, and biomolecular functions (KEGG Orthology) [ ]. DrugBank provides rich resources for approved as well as investigational drugs with their known interactions with protein targets [ ]. DrugBank is also a valuable source of drug–drug interactions. ExCAPE-DB is a relatively new database providing open access to the combined bioactivities from the ChEMBL and PubChem databases [ ]. While its bioactivity samples are not new, ExCAPE-DB provides a unified set of samples from both databases with appropriate data integration steps.

Proteomic databases

UniProt is a large-scale database maintained by EMBL-EBI, containing over 120 million proteins across all branches of life, with their primary sequences [ ]. The RCSB Protein Data Bank (PDB) is a database containing 3D structures for biomolecules, including proteins and protein–ligand complexes [ ]. The STRING database allows researchers to connect these proteins into protein–protein interaction networks [ ]. The Human Protein Atlas project aims to map all human proteins in cells, tissues, and organs using various experimental and computational techniques [ ]. It provides category information for genes and proteins based on their functions, compartments, and relevant diseases and drugs. Harmonizome contains data sets for genes and proteins with their associations with other biomolecules, expression in cells and tissues, and knockout phenotypes [ ]. Because Harmonizome is a collection of data sets from multiple sources, users may also obtain other types of data sets from it.

Data preparation for training and evaluating computational models

The databases above provide rich data sets on chemical compound structures, their bioactivities, and their targets, including primary sequences and 3D structures of proteins and other biomolecules. The differences among these data sets and their variety make the appropriate preparation and division of samples a critical and challenging step in machine learning and artificial intelligence–based drug development projects. It may be too simplistic to train a computational model on one database and evaluate its performance on another. The databases mentioned above are regularly maintained and refined, and they often include samples from other databases for better coverage. As a result, databases of similar kinds share a significant number of overlapping samples, making it necessary to combine and filter them properly. ExCAPE-DB is an example where appropriate merging and filtering steps are applied [ ]. Once the databases are combined, a proper data split strategy must be applied so that the samples used to train the model are not included in the test set. One possible data preparation procedure is to integrate multiple databases of interest (e.g., ChEMBL, PubChem, and ZINC for unique chemicals, proteins, and chemical–protein association pairs), filter out redundant samples, and randomly divide the samples into training and evaluation sets. However, this naïve random split strategy often leads to overly optimistic evaluation of computational models.
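
The naïve procedure described above (integrate, remove redundant samples, split at random) can be sketched in a few lines of Python. The sketch is illustrative only: the two sources, the compound and protein keys, and the activity values are hypothetical, and averaging duplicate measurements is just one possible way to resolve redundancy, not a step prescribed by any of the databases.

```python
import random

# Hypothetical records from two sources: (compound_key, target_id, value).
# In practice the compound key would be a standardized structure identifier
# and the target key a protein accession; both are assumptions here.
source_a = [("c1", "P1", 6.0), ("c2", "P1", 7.5), ("c3", "P2", 5.2)]
source_b = [("c2", "P1", 7.3), ("c4", "P2", 8.1), ("c5", "P3", 4.9)]


def merge_and_deduplicate(*sources):
    """Pool records and keep one entry per (compound, target) pair.

    Duplicate measurements are averaged; this is an illustrative choice.
    """
    pooled = {}
    for source in sources:
        for compound, target, value in source:
            pooled.setdefault((compound, target), []).append(value)
    return {pair: sum(vals) / len(vals) for pair, vals in pooled.items()}


def random_split(pairs, test_fraction=0.2, seed=0):
    """Shuffle the unique pairs and split them into disjoint train/test lists."""
    ordered = sorted(pairs)
    random.Random(seed).shuffle(ordered)
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[n_test:], ordered[:n_test]


merged = merge_and_deduplicate(source_a, source_b)
train, test = random_split(merged)

# No (compound, target) pair may appear on both sides of the split.
assert not set(train) & set(test)
print(len(merged), "unique pairs;", len(train), "train;", len(test), "test")
```

Note that the split is disjoint only at the level of compound–target pairs. The same compound, or a close structural analogue, can still appear on both sides, which is one way a random split can make a model look better than it is.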

A better data preparation strategy is to add property-matching filters to the naïve random split so that the training and evaluation sets are not overrepresented by factors that are unlikely to differentiate binding ligands from the others. The Maximum Unbiased Validation (MUV) [ ] and Database of Useful Decoys-Enhanced (DUD-E) [ ] data sets are frequently used to avoid such overrepresentation issues and provide better-prepared data sets for computational drug development projects. The MUV data set was designed to overcome artificial enrichment and analogue bias, two major biases in virtual screening data sets. Artificial enrichment refers to the bias that arises when a simple molecular feature, such as molecular weight, separates the actives from the decoys. Computational models suffer from analogue bias when they are trained on data sets in which some substructures are overrepresented among the actives. A computational model trained on such data sets is likely to make biased predictions based on the simple features or overrepresented chemical scaffolds. DUD-E takes a similar approach to the two biases but focuses on protein–ligand pairs with available 3D structures, making it especially useful for virtual docking or structure-based models. The two data sets, however, have limitations. MUV excludes frequent hitters, the ligands that are active in most tested assays, making it less suitable for projects where the aim is to predict all unknown associations for drugs. DUD-E samples include activities for nonhuman homologs and exclude mutated targets, which makes the data set susceptible to species-specific or mutant-specific activities. Furthermore, despite these efforts, DUD-E samples are still not free from biases, as shown in a recent study of 3D structure–based protein–ligand binding prediction methods [ ]. Both data sets are also relatively small, partly because of when they were developed and their limited structural coverage. As a result, data sets and split strategies are often study-specific, although performance on these two data sets may be reported for comparison. A gold-standard data set for artificial intelligence–based protein–ligand binding projects is needed.
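
As a minimal illustration of property matching against artificial enrichment, the Python sketch below selects, for each active, decoys whose molecular weight lies within a tolerance of that active, so that molecular weight alone cannot separate the two classes. It is not the MUV or DUD-E procedure: the published data sets apply more elaborate criteria, and the molecules, weights, tolerance, and function names here are all hypothetical.

```python
import random

# Hypothetical molecules as (id, molecular_weight); values are invented.
actives = [("a1", 310.0), ("a2", 455.0), ("a3", 389.0)]
candidates = [
    ("d1", 150.0), ("d2", 305.0), ("d3", 318.0), ("d4", 450.0),
    ("d5", 462.0), ("d6", 395.0), ("d7", 700.0), ("d8", 380.0),
]


def match_decoys(actives, candidates, tolerance=15.0, per_active=2, seed=0):
    """For each active, pick unused candidates with a similar molecular weight."""
    rng = random.Random(seed)
    used = set()
    matched = {}
    for active_id, weight in actives:
        # Candidates not yet assigned whose weight is within the tolerance.
        pool = [c for c in candidates
                if c[0] not in used and abs(c[1] - weight) <= tolerance]
        rng.shuffle(pool)
        chosen = pool[:per_active]
        used.update(c[0] for c in chosen)
        matched[active_id] = chosen
    return matched


matched = match_decoys(actives, candidates)
for active_id, chosen in matched.items():
    print(active_id, "->", sorted(d for d, _ in chosen))
```

Candidates far outside the weight range of the actives (d1 and d7 in this toy set) are never selected. Analogue bias is a separate issue and would call for a different step, such as grouping structurally related compounds before splitting.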
