[Note] Modeling Polypharmacy Side Effects with Graph Convolutional Networks

multi-type link prediction, multimodal graph, GCN...

Posted by Hao Xu on January 4, 2019

Modeling Polypharmacy Side Effects with Graph Convolutional Networks

Home Page: http://snap.stanford.edu/decagon

1. Dataset

The original dataset is a huge multimodal graph.

Node:

Each node represents either a drug (D) or a protein (P).

Edge with label:

Edges are undirected and labeled: $E=\{(\text{node}_1, \text{label}, \text{node}_2)\}$. There are three types of edges:

  1. D-P link $(n_d, t, n_p)$: drug $d$ targets protein $p$
  2. P-P link $(n_{p1}, b, n_{p2})$: proteins $p_1$ and $p_2$ physically bind
  3. D-D link $(n_{d1}, r_i, n_{d2})$: drugs $d_1$ and $d_2$ can cause polypharmacy side effect $r_i$

This means all D-P / P-P links share the same label $t$ / $b$, while the label of a D-D link is chosen from $\{r_i\}_{i=1,\dots,N}$, where each $r_i$ represents a side effect. Note that between a pair of drugs there may be more than one link, each with a different label (a pair of drugs might cause more than one side effect).

Let $L$ denote the set of all labels.
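
As a concrete illustration, here is a minimal Python sketch (with made-up node names) of how the graph can be stored as labeled triples and grouped into per-relation neighbor sets:

```python
from collections import defaultdict

# Toy example of the multimodal graph as labeled triples (node names are illustrative).
# Labels: 't' for drug-protein targeting, 'b' for protein-protein binding,
# and side-effect ids such as 'r_7' for drug-drug edges.
edges = [
    ("D1", "t", "P1"),     # drug D1 targets protein P1
    ("P1", "b", "P2"),     # proteins P1 and P2 physically bind
    ("D1", "r_7", "D2"),   # drugs D1 and D2 cause side effect r_7
    ("D1", "r_42", "D2"),  # the same drug pair can carry several side-effect labels
]

# Per-relation neighbor sets; edges are undirected, so add both directions.
neighbors = defaultdict(lambda: defaultdict(set))  # neighbors[label][node] -> set of nodes
for u, label, v in edges:
    neighbors[label][u].add(v)
    neighbors[label][v].add(u)

print(sorted(neighbors["r_7"]["D2"]))  # ['D1']
```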

2. Method

Data Split:

Drug-drug edges: 80% training, 10% validation, 10% testing.
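
A minimal sketch of this split (the 80/10/10 proportions are from above; the toy `dd_edges` array of (drug, side effect, drug) index triples is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical drug-drug edges as (drug_i, side_effect_id, drug_j) index triples.
dd_edges = np.array([(0, 7, 1), (0, 42, 1), (2, 7, 3), (1, 3, 2), (0, 3, 3),
                     (2, 42, 4), (1, 7, 4), (3, 3, 4), (0, 7, 4), (2, 3, 1)])

perm = rng.permutation(len(dd_edges))
n_train = int(0.8 * len(dd_edges))
n_val = int(0.1 * len(dd_edges))
train, val, test = np.split(dd_edges[perm], [n_train, n_train + n_val])
```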

GCN Encoder:

A network is designed to embed nodes. For node $n_i$, the input is its feature vector $\mathbf{x}_i$ and the output is a $d$-dimensional embedding vector $\mathbf{z}_i$.

Layer: There are $K$ layers in this embedding network, where $K$ is less than the maximum neighborhood order of the graph. Let $h_i^{(k)}$ denote the hidden state of node $n_i$ in the $k$-th layer, and let $\mathcal{N}_r^i$ denote the set of neighbors of node $n_i$ under relation $r$. The input to the first layer is the node feature vector, $h_i^{(0)} = \mathbf{x}_i$, and the output of the $K$-th layer is the final embedding of each node, $\mathbf{z}_i = h_i^{(K)}$. The update rule for the $k$-th layer is:

\[h_i^{(k+1)} = \phi\left(\sum_r \sum_{j\in \mathcal{N}_r^i} c_r^{ij} \mathbf{W}_r^{(k)} \mathbf{h}_j^{(k)} + c_r^i \mathbf{h}_i^{(k)}\right)\tag{1}\]

where $\phi$ is a non-linear element-wise activation function (e.g., ReLU), and the normalization constants are

\[c_r^{ij} = \cfrac{1}{\sqrt{|\mathcal{N}_r^i| |\mathcal{N}_r^j|}}\ ,\qquad \qquad c_r^i = \cfrac{1}{\sqrt{|\mathcal{N}_r^i|}}\]
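
The following is a minimal NumPy sketch of one propagation step of Eq. (1). It is not the authors' implementation; for simplicity it assumes square weight matrices so that the un-weighted self-connection term has the same dimension as the neighbor messages:

```python
import numpy as np

def gcn_layer(H, W_by_rel, neigh_by_rel, phi=lambda x: np.maximum(x, 0.0)):
    """One layer of Eq. (1): H is (n, d) with rows h_i^(k); W_by_rel maps each
    relation r to a (d, d) matrix W_r^(k); neigh_by_rel maps r to the neighbor
    index lists N_r^i. Returns the rows h_i^(k+1)."""
    n, d = H.shape
    out = np.zeros((n, d))
    for r, W in W_by_rel.items():
        neigh = neigh_by_rel[r]
        deg = np.array([len(nb) for nb in neigh], dtype=float)
        for i in range(n):
            if deg[i] == 0:  # no neighbors under relation r: no contribution
                continue
            for j in neigh[i]:  # normalized neighbor messages c_r^{ij} W_r^{(k)} h_j^{(k)}
                out[i] += (W @ H[j]) / np.sqrt(deg[i] * deg[j])
            out[i] += H[i] / np.sqrt(deg[i])  # self-connection term c_r^i h_i^{(k)}
    return phi(out)

# Toy usage: 3 nodes, one relation 'b', chain 0-1-2, one-hot input features.
H0 = np.eye(3)
W = {"b": 0.1 * np.random.default_rng(0).normal(size=(3, 3))}
nb = {"b": [[1], [0, 2], [1]]}
H1 = gcn_layer(H0, W, nb)
```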

Tensor Factorization Decoder (DEDICOM):

For any $l \in L$, the probability that the edge $(n_i, l, n_j)$ exists is computed by applying two functions in sequence.

Tensor factorization:

\[g(\mathbf{z}_i, l, \mathbf{z}_j)=\begin{cases} \mathbf{z}_i^\top \mathbf{D}_l \mathbf{R} \mathbf{D}_l \mathbf{z}_j & \text{if $n_i$ and $n_j$ are drugs} \\ \mathbf{z}_i^\top \mathbf{M}_l \mathbf{z}_j & \text{otherwise} \end{cases} \tag{2}\]

Sigmoid function:

\[p_l^{ij} = p((n_i, l, n_j) \in \mathcal{R}) = \sigma(g(\mathbf{z}_i, l, \mathbf{z}_j)) \tag{3}\]

where $\mathcal{R}$ is the set of D-D edges.
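
A small NumPy sketch of the decoder in Eqs. (2)-(3); the parameter container and names are illustrative, not the paper's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_probability(z_i, z_j, l, params, both_drugs):
    """p_l^{ij} = sigma(g(z_i, l, z_j)) from Eqs. (2)-(3).

    params['R']    : (d, d) global drug-drug interaction matrix R
    params['D'][l] : (d, d) diagonal matrix D_l for side effect l
    params['M'][l] : (d, d) matrix M_l for the D-P / P-P relation l
    """
    if both_drugs:
        D_l = params["D"][l]
        g = z_i @ D_l @ params["R"] @ D_l @ z_j
    else:
        g = z_i @ params["M"][l] @ z_j
    return sigmoid(g)

# Toy usage with random 4-dimensional embeddings and one side-effect relation 'r_7'.
d, rng = 4, np.random.default_rng(1)
params = {"R": rng.normal(size=(d, d)),
          "D": {"r_7": np.diag(rng.normal(size=d))},
          "M": {"t": rng.normal(size=(d, d))}}
p = edge_probability(rng.normal(size=d), rng.normal(size=d), "r_7", params, both_drugs=True)
```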

Loss Function:

The cross-entropy loss for a single edge $(n_i, l, n_j)$:

\[J_l(i,j) = -\log p_l^{ij} - \mathbb{E}_{n\sim P_l(j)} \log\left(1-p_l^{in}\right)\tag{4}\]

The final loss function is

\[J = \sum_{(n_i, l, n_j)\in \mathcal{R}} J_l(i, j) \tag{5}\]
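
In practice the expectation in Eq. (4) is approximated by negative sampling: for each observed edge $(n_i, l, n_j)$ a corrupted tail node $n$ is drawn from $P_l(j)$ and its predicted probability is pushed toward zero. A minimal sketch, assuming a single sampled negative per positive edge (a simplification of this illustration):

```python
import numpy as np

def edge_loss(p_pos, p_neg, eps=1e-12):
    """Cross-entropy of Eq. (4) for one observed edge, approximating the
    expectation with a single sampled negative.

    p_pos : predicted probability p_l^{ij} of the observed edge (n_i, l, n_j)
    p_neg : predicted probability p_l^{in} of the corrupted edge, with n ~ P_l(j)
    """
    return -np.log(p_pos + eps) - np.log(1.0 - p_neg + eps)

# Eq. (5): sum the per-edge losses over the (sampled) drug-drug edges, e.g.
# J = sum(edge_loss(p_pos, p_neg) for p_pos, p_neg in minibatch)
```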

Optimization:

Parameters: $\mathbf{D}_l,\ \mathbf{R},\ \mathbf{M}_l,\ \mathbf{W}_r^{(k)}$

Optimizer: Adam(learning_rate=0.001, max_epoch=100, early_stopping=2)

Initialization: Xavier initialization (Glorot and Bengio, 2010); node feature vectors are normalized accordingly.

Dropout: regular dropout (Srivastava et al., 2014) applied to the hidden-layer units in Eq. (1).

Gradient Descent: mini-batching, i.e., each update samples a fixed number of contributions to the loss function in Eq. (5).