Gat Introduction - Shinning

背景：ICLR2018 论文，Graph Attention Network在GNN中非常重要，再之前图卷积网络GCN的基础之上引入了注意力机制，非常实用。

论文地址：https://arxiv.org/abs/1710.10903

代码地址： https://github.com/Diego999/pyGAT

文章出处：

2018_ICLR_GRAPH ATTENTION NETWORKS

https://blog.csdn.net/weixin_35479108/article/details/84978996

https://blog.csdn.net/weixin_36474809/article/details/89401552

一、背景与概览

图注意力网络(GAT)Graph attention network缩写为GAT，若按照首字母大写，会与对抗生成网络GAN混淆。所以后面GAT即本文的图注意力网络。

1.1 相关研究

下面三篇论文递进关系：

Semi-Supervised Classification with Graph Convolutional Networks，ICLR 2017，图卷积网络 https://arxiv.org/abs/1609.02907

Graph Attention Networks，ICLR 2018. 图注意力网络，就是此篇文章所解析的论文 https://arxiv.org/abs/1710.10903

Relational Graph Attention Networks ，ICLR2019 关联性图注意力网络，整合了GCN+Attention+Relational

1.2 贡献点

引入masked self-attentional layers 来改进前面图卷积graph convolution的缺点

对不同的相邻节点分配相应的权重，既不需要矩阵运算，也不需要事先知道图结构

四个数据集上达到state of the art的准确率Cora、Citeseer、Pubmed、protein interaction

1.3 相关工作

对待图结构的数据有两种方法，谱方法和非谱方法

谱方法 spectral approaches

即Semi-Supervised Classification with Graph Convolutional Networks，ICLR 2017这篇文章中的方法

解析：GCN (Graph Convolutional Network) 图卷积网络概览

Finally, Kipf & Welling (2017) simplified the previous method by restricting the filters to operate in a 1-step neighborhood around each node.在每个节点周围对卷积核做一阶邻接近似。但是此方法也有一些缺点：

必须基于相应的图结构才能学到拉普拉斯矩阵L

对于一个图结构训练好的模型，不能运用于另一个图结构（所以此文称自己为半监督的方法）

非谱方法 non-spectral approaches

One of the challenges of these approaches is to define an operator which works with different sized neighborhoods and maintains the weight sharing property of CNNs. 即每个节点的相邻链接数量是不一样的，如何能设置相应的卷积尺寸来保证CNN能够对不同的相邻节点进行操作。下文这种方法运用GraphSAGE, 取得了不错的结果

William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on largegraphs. Neural Information Processing Systems (NIPS), 2017.

这种方法是将相邻节点设置为固定的长度，然后进行specific aggregator，例如mean over all the sampled neighbors’ feature vectors, or the result of feeding them through a recurrent neural network。这种方法在几个大的benchmarks上取得了非常好的效果。

注意力机制 self-attention

优点：可以处理任意大小输入的问题，并且关注最具有影响能力的输入。

注意力机制再RNN与CNN之中，都取得了不错的效果，并且可以达到state of the art的性能。

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017

二、方法（重点）

一整套推导下来，并不算难，没有过于复杂的推导。下面每个公式以及每个变量我都明确的给出了其中的意义。

2.1 方法特性

The idea is to compute the hidden representations of each node in the graph, by attending over its neighbors, following a self-attention strategy. 针对每一个节点运算相应的隐藏信息，在运算其相邻节点的时候引入注意力机制：

高效：针对相邻的节点对，并且可以并行运算

灵活：针对有不同度（degree）的节点，可以运用任意大小的weight与之对应。（这里我们解释一个概念，节点的度degree：表示的是与这个节点相连接的节点的个数）

可移植：可以将模型应用于从未见过的图结构数据，不需要与训练集相同。

2.2 图注意力层Graph Attention layer

输入与输出

输入

N为节点的个数，F为feature的个数，这表示输入为N个节点的每个节点的F个feature

输出

表示对这N个节点的 F’ 个输出，输出位N个节点的每个节点的F’个feature

这里我们明确一下，针对的是N个节点，按照其输入的feature预测输出的feature。

输入是一个图中每个节点的特征，输出是经过注意力层变换之后的节点特征。

特征提取与注意力机制

为了得到相应的输入与输出的转换，我们需要根据输入的feature至少一次线性变换得到输出的feature，所以我们需要对所有节点训练一个权值矩阵：

这个权值矩阵就是输入与输出的F个feature与输出的F’个feature之间的关系。

We then perform self-attention on the nodes—a shared attentional mechanism，针对每个节点实行self-attention的注意力机制，机制为

注意力互相关系数为attention coefficients：

作者通过masked attention将这个注意力机制引入图结构之中，masked attention的含义：只计算节点 i 的相邻的节点 j

节点 j 为

其中Ni为节点i的所有相邻节点。为了使得互相关系数更容易计算和便于比较，

我们引入了softmax对所有的i的相邻节点j进行正则化：

实验之中，注意力机制a是一个单层的前馈神经网络，通过权值向量来确定

，并且加入了 LeakyRelu的非线性激活，这里小于零斜率为0.2。（这里我们回顾下几种Relu函数，relu:小于0就是0，大于零斜率为1；LRelu:小于零斜率固定一个值，大于零斜率为1；PRelu:小于零斜率可变，大于零斜率为1；还有CRelu,Elu,SELU）。注意力机制如下：

也是我们前面需要得到的注意力互相关系数

在模型中应用相互注意机制a（Whi，Whj），通过权重向量 a 参数化，应用 LeakyReLU 激活

模型权重为

转置表示为T

concatenation 用

表示

公式含义就是权值矩阵与F’个特征相乘，然后节点相乘后并列在一起，与权重

相乘，LRelu激活后指数操作得到softmax的分子

 def forward(self, x):

​    \# [B_batch,N_nodes,C_channels]

​    B, N, C = x.size()

​    \# h = torch.bmm(x, self.W.expand(B, self.in_features, self.out_features)) # [B,N,C]

​    h = torch.matmul(x, self.W) # [B,N,C]

​    a_input = torch.cat([h.repeat(1, 1, N).view(B, N * N, C), h.repeat(1, N, 1)], dim=2).view(B, N, N,

​                                                 2 * self.out_features) # [B,N,N,2C]

​    \# temp = self.a.expand(B, self.out_features * 2, 1)

​    \# temp2 = torch.matmul(a_input, self.a)

​    attention = self.leakyrelu(torch.matmul(a_input, self.a).squeeze(3)) # [B,N,N]

​    attention = F.softmax(attention, dim=2) # [B,N,N]

​    attention = F.dropout(attention, self.dropout, training=self.training)

​    h_prime = torch.bmm(attention, h) # [B,N,N]*[B,N,C]-> [B,N,C]

​    out = F.elu(h_prime + self.beta * h)

​    return out

-——————–

作者：邢翔瑞

来源：CSDN

原文：https://blog.csdn.net/weixin_36474809/article/details/89401552

Output features

通过上面，运算得到了正则化后的不同节点之间的注意力互相关系数normalized attention coefficients，可以用来预测每个节点的output feature：

我们再回顾一下含义，W为与feature相乘的权值矩阵

a为前面算得的注意力互相关系数

sigema为非线性激活

遍历的j 表示所有与i 相邻的节点

这个公式表示就是，该节点的输出feature与与之相邻的所有节点有关，是他们的线性和的非线性激活

这个线性和的线性系数是前面求得的注意力互相关系数

multi-head attention

multi-head attention与下面这个工作类似：

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.

在上面的output feature加入计算multi-head的运算公式:

concate操作为
第k个注意力机制为

共大K个注意力机制需要考虑，小k表示大K中的第k个
输入特征的线性变换表示为

最终的输出为h’ 共由KF’ 个特征影响

例如，K=3时候，结构如下

例如此图，节点1在邻域中具有多端注意机制，不同的箭头样式表示独立的注意力计算，通过连接或平均每个head获取 h1

对于最终的输出，concate操作可能不那么敏感了，所以我们直接用K平均来取代concate操作，得到最终的公式：

2.3 与同类工作的对比

因为我们比较重点关注于运用，这部分我们粗略的看一下就好，需要的话再来细看更新。

算法复杂度低

GAT运算得到F’个特征需要的算法复杂度

F’：为输出特征的个数

F：为输入特征的个数

：节点的个数

E	：节点之间连接的个数

并且引入K之后，对于每个head的运算都独立并且可以并行

更好鲁棒性

与GCN的不同在于，GAT针对不同的相邻节点的重要性进行预测，模型具有更好的性能并且对于扰动更加鲁棒。

不需要整张Graph

引入注意力机制之后，只与相邻节点有关，即共享边的节点有关，无需得到整张graph的信息。

即使丢失了i,j之间的链接，则不计算即可

可以将模型运用于inductive learning，更好解释性，即使graph完全看不到completely unseen，也可以运行训练过程

比LSTM更强

The recently published inductive method of Hamilton et al. (2017) samples a fixed-size neighborhood of each node, in order to keep its computational footprint consistent; this does not allow it access to the entirety of the neighborhood while performing inference. Moreover, this technique achieved some of its strongest results when an LSTM (Hochreiter & Schmidhuber, 1997)-based neighborhood aggregator is used. This assumes the existence of a consistent sequential node ordering across neighborhoods, and the authors have rectified it by consistently feeding randomly-ordered sequences to the LSTM. Our technique does not suffer from either of these issues—it works with the entirety of the neighborhood (at the expense of a variable computational footprint, which is still on-par with methods like the GCN), and does not assume any ordering within it.

2017年Hamilton提出的inductive method为每一个node都抽取一个固定尺寸的neighborhood，为了计算的时候footprint是一致的（指的应该是计算的时候处理neighborhood的模式是固定的，不好改变，因此每次都抽样出固定数量的neighbor参与计算），这样，在计算的时候就不是所有的neighbor都能参与其中。此外，Hamilton的这个模型在使用一些基于LSTM的方法的时候能得到最好的结果，这样就是假设了每个node的neighborhood的node一直存在着一个顺序，使得这些node成为一个序列。但是本文提出的方法就没有这个问题，每次都可以将neighborhood所有的node都考虑进来，而且不需要事先假定一个neighborhood的顺序

与MoNet的对比

As mentioned in Section 1, GAT can be reformulated as a particular instance of MoNet (Monti et al., 2016). More specifically, setting the pseudo-coordinate function to be u(x, y) = f(x)kf(y), where f(x) represent (potentially MLP-transformed) features of node x and k is concatenation; and the weight function to be wj(u) = softmax(MLP(u)) (with the softmax performed over the entire neighborhood of a node) would make MoNet’s patch operator similar to ours. Nevertheless, one should note that, in comparison to previously considered MoNet instances, our model uses node features for similarity computations, rather than the node’s structural properties (which would assume knowing the graph structure upfront). GAT可以看做一个MoNet的特例。

三、实验与评估

实验分成两部分，transductive learning（半监督学习）和inductive learning（归纳学习）。模型用了两层的GAT

3.1 数据集

图结构的数据集，以及数据集之中的信息如下：

3.2 半监督学习transductive learning

两层 GAT

在Cora 数据集上优化网络结构的超参数，应用到Citeseer 数据集

第一层 8 head, F`=8 , ELU 作为非线性函数

第二层为分类层，一个 attention head 特征数C，后跟 softmax 函数,为了应对小训练集，正则化（L2)

两层都采用 0.6 的dropout,相当于计算每个node位置的卷积时都是随机的选取了一部分近邻节点参与卷积

两个细节暂时不细看，贴出结果，均取得了state-of-the-art的结果

3.3 归纳学习inductive learning

三层GAT 模型

前两层 K=4, F1=256 ,ELU作为非线性函数

最后一层用来分类 K=6, F`=121 ，激活函数为sigmoid

该任务中，训练集足够大不需要使用正则化和 dropout

两个任务都是用Glorot初始化初始的，并且是用Adam SGD来最小化交叉熵进行优化

四、结论与个人总结

引用了注意力机制，并且模型性能达到state of the art.

运算相邻节点，更加具有鲁棒性，不需要整张图。

更具有可解释性，公式也更直观。

来自 https://blog.csdn.net/weixin_36474809/article/details/89401552