Class 6: Measuring Robustness of ML Models
Motivation
In what seems to be an endless back-and-forth between new adversarial attacks and new defenses against those attacks, we would like a means of formally verifying the robustness of machine learning algorithms to adversarial attacks. In the privacy domain, there is the idea of a differential privacy budget, which quantifies privacy over all possible attacks. In the following three papers, we see attempts at deriving an equivalent benchmark for security, one that will allow the evaluation of defenses against all possible attacks instead of just a specific one.
Provably Minimally Distorted Adversarial Examples
Nicholas Carlini, Guy Katz, Clark Barrett, David L. Dill. Provably Minimally-Distorted Adversarial Examples. arXiv preprint arXiv:1709.10207. September 2017.
Guy Katz, Clark Barrett, David Dill, Kyle Julian, Mykel Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. International Conference on Computer Aided Verification. 2017. [PDF]
Many attack techniques can effectively generate adversarial examples against deep learning models, such as FGSM, JSMA, DeepFool, and the Carlini & Wagner attacks. But most of them cannot verify the absence of adversarial examples even when they fail to find one.
Researchers have started to borrow ideas and techniques from the program verification field to verify the robustness of neural networks. In order to verify the correctness of a program, we can encode it as satisfiability-modulo-theories (SMT) formulas and use an off-the-shelf solver (e.g., Microsoft Z3) to check a correctness property. An SMT solver produces sound and complete results: it either tells you the program never violates the property, or it gives you a specific counterexample. However, we may not be able to get results for large programs in our lifetime, because the underlying problem is NP-complete.
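To make the encoding idea concrete, here is a minimal sketch using Z3's Python bindings (the `z3-solver` package). It encodes a toy one-neuron "network" with the ReLU expressed as an if-then-else, and asks whether any input within \(\delta\) of a given point can push the output below zero. This only illustrates how a robustness query becomes an SMT formula; it is not the authors' encoding and not the Reluplex algorithm.

```python
# Toy robustness query encoded as an SMT problem with Z3 (pip install z3-solver).
# This illustrates the encoding idea only, not the Reluplex procedure itself.
from z3 import Real, If, Solver, sat

x = Real('x')        # perturbed input
y = Real('y')        # pre-activation of a single linear unit
out = Real('out')    # output after the ReLU

x0, delta = 1.0, 0.3     # point to verify and allowed L_inf distortion

s = Solver()
s.add(x >= x0 - delta, x <= x0 + delta)  # perturbation region around x0
s.add(y == 2 * x - 1)                    # a linear layer: y = 2x - 1
s.add(out == If(y > 0, y, 0))            # ReLU encoded as an if-then-else
s.add(out <= 0)                          # negation of the desired property

if s.check() == sat:
    print("counterexample found:", s.model())   # an adversarial input exists
else:
    print("verified: no adversarial input within delta")
```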
Similarly, current neural network verification techniques cannot yet handle deep learning models of arbitrary size, but one prototype, named Reluplex, has produced promising results on the MNIST dataset.
Reluplex is an extension of the simplex algorithm. It introduces a new domain theory solver to handle the ReLU activation function, since simplex only deals with linear real arithmetic. You can find more details in the Reluplex paper [7].
The paper we discuss here uses Reluplex to verify the local adversarial robustness of two MNIST classification models. We say a model is \(\delta\)-locally-robust at input \(x\) if for every \(x'\) such that \( \|x-x'\|_p \le \delta \), the model predicts the same label for \(x\) and \(x'\). Local robustness is certified for individual inputs, which is substantially different from global robustness, which certifies the whole input space. The paper performs a binary search on \(\delta\) to find the minimal distortion to a given precision, and each \(\delta\) corresponds to one execution of Reluplex. The paper only considers the \(L_\infty\) norm and the \(L_1\) norm because these constraints are easier to encode with Reluplex. For example, the \(L_1\) norm can be encoded as \( |x| = \max(x, -x) = \mathrm{ReLU}(2x) - x \).
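The binary search itself is straightforward; a minimal sketch is below, where `finds_adversarial(x, label, delta)` is a hypothetical stand-in for one Reluplex query that either finds an adversarial example within distance \(\delta\) or proves that none exists.

```python
def minimal_distortion(finds_adversarial, x, label, hi=1.0, precision=1e-3):
    """Binary search for the minimal adversarial distortion.

    finds_adversarial(x, label, delta) stands in for one Reluplex query:
    True if an adversarial example exists within distance delta of x,
    False if the solver proves that none exists.
    """
    lo = 0.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if finds_adversarial(x, label, mid):
            hi = mid   # adversarial example found: the true minimum is <= mid
        else:
            lo = mid   # verified robust at mid: the true minimum is > mid
    return hi          # minimal distortion, up to the requested precision
```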
The evaluation is conducted on a fully-connected, 3-layer network with 20k weights and fewer than 100 hidden neurons. The model's test accuracy is 97%. It is smaller than most state-of-the-art models and has inferior accuracy, but it is good enough for a verification prototype. The authors arbitrarily selected 10 source images with known labels from the MNIST test set, which yields 90 targeted attack instances in total (9 target labels per image).
Even though Reluplex is faster than most existing general-purpose SMT solvers, it is still not fast enough for verifying the MNIST classification models. For every configuration of target model and \(L_p\) norm constraint, Reluplex timed out on some of the 90 instances. The experiments with the \(L_1\) constraint were generally slower than those with the \(L_\infty\) constraint because the \(L_1\) encoding introduces more ReLU components. Nevertheless, the successful verification instances still yield some interesting results.
The paper compared the minimally-distorted adversarial examples found by Reluplex with those generated by the CW attack and concluded that iterative optimization-based attacks are effective: on these specific models, the CW attack produced adversarial examples within 12% of the provably minimal distortion.
The paper also evaluated a defense proposed by Madry et al., an adversarial training method that trains against the PGD attack and enlarges the model capacity. The paper concluded that PGD-based adversarial training increased robustness to adversarial examples by 4.2x on the examined samples. Even though this result does not guarantee efficacy at larger scale, it shows that, on those samples, the defense increases robustness against all possible attacks even though it was only trained with the PGD attack.
Evaluating the Robustness of Neural Networks
Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, Luca Daniel. Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach. ICLR 2018. January 2018 [PDF]
Little work has been done towards developing a comprehensive measure of robustness for neural networks, primarily because networks become mathematically complex as the number of layers increases. The authors contribute a lower bound on the minimal perturbation needed to generate an adversarial example, which they compute by using extreme value theory to estimate the local Lipschitz constant around a given sample.
The work is motivated by the success of a particular attack by Carlini and Wagner against several previous defense strategies such as defensive distillation, adversarial training, and model ensembles [2]. This highlights the need for a means of evaluating a defense's effectiveness against all attacks rather than just the ones it is tested against.
Previous attempts at deriving such a lower bound have shortcomings of their own. Szegedy et al. compute the product of the global Lipschitz constants of each layer of the network to derive a metric of the instability of a deep neural network's output; however, the global Lipschitz constant is a loose bound [3].
Hein and Andriushchenko derived a closed-form bound using the local Lipschitz constant, but such a bound is only feasible for networks with one hidden layer. Several other approaches used linear programming to verify properties of neural networks, but they are also infeasible for large networks [4].
The most similar work uses a linear programming formulation to derive an upper bound on the minimum distortion. An upper bound does not certify that no smaller adversarial perturbation exists, so it is less useful as a universal robustness metric, and the approach is also infeasible for large networks due to the computational complexity of linear programming.
Robustness metric
In order to derive a formal robustness guarantee, the authors formulate a lower bound on the minimum adversarial distortion needed for a perturbed example to be successful (success here meaning fooling the classifier, thus becoming adversarial). They accomplish this by relating the p-norm of the perturbation \(\lVert \delta \rVert_p \) to the local Lipschitz constant \( L_q^j \), where local means defined over a ball around the input. They use properties of Lipschitz continuous functions to prove the following:
$$ \lVert \delta \rVert_p \le \min_{j \ne c} \frac{f_c(x_0) - f_j(x_0)}{L_q^j} $$
That is, if the p-norm of the perturbation is at most the smallest gap between the score of the predicted class \(c\) and any other class \(j\) at the input \(x_0\), divided by the corresponding local Lipschitz constant, then the perturbation cannot change the classifier's prediction. The right-hand side is therefore a lower bound on the minimum adversarial distortion.
The authors go on to provide a similar guarantee for networks whose activation function is not differentiable everywhere, for example ReLU networks. In such cases, the local Lipschitz constant can be replaced with the supremum of the directional derivatives over directions approaching the non-differentiable points.
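To make the bound concrete, here is a small numeric sketch (all numbers made up for illustration, and `certified_radius` is not from the paper): given the classifier's outputs at \(x_0\) and local Lipschitz constants for each competing class, the certified radius is the smallest margin-to-constant ratio.

```python
import numpy as np

def certified_radius(logits, true_class, lipschitz):
    """Lower bound on the minimal adversarial distortion:
    min over j != c of (f_c(x0) - f_j(x0)) / L_q^j."""
    c = true_class
    return min((logits[c] - logits[j]) / lipschitz[j]
               for j in range(len(logits)) if j != c)

# Made-up example: three classes, class 0 predicted at x0.
logits = np.array([2.5, 1.0, 0.4])       # f_j(x0) for each class j
lipschitz = np.array([1.0, 3.0, 2.0])    # L_q^j (entry for j = 0 is unused)
print(certified_radius(logits, 0, lipschitz))   # -> 0.5
```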
CLEVER Robustness Metric
Since the above formulations of the lower bound are difficult to compute exactly, the authors present a technique for estimating it that is much more computationally feasible. They make use of extreme value theory, which states that the maximum of a large number of samples of a random variable, suitably normalized, follows one of a small family of extreme value distributions. In this case, the random variable is the norm of the gradient (in the dual norm \(q\)) at a point sampled around the input. The authors assume the limiting distribution is a reverse Weibull distribution and verify empirically that this is a reasonable assumption.
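For reference, the reverse Weibull family used here has a cumulative distribution function of the standard form below (this is the textbook parameterization, not necessarily the paper's notation):

$$ G(g) = \begin{cases} \exp\!\left(-\left(\frac{a_W - g}{b_W}\right)^{c_W}\right) & g < a_W, \\ 1 & g \ge a_W, \end{cases} $$

where \(a_W\), \(b_W > 0\), and \(c_W > 0\) are the location, scale, and shape parameters. Because \(a_W\) is the finite right endpoint of the support, it provides a natural estimate of the largest gradient norm in the ball, i.e., the local Lipschitz constant.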
To apply the theory, they repeatedly sample batches of points in a ball around the given input and calculate the gradient norm at each point. Collecting the maximum gradient norm from each batch, they fit the distribution parameters to those batch maxima by maximum likelihood estimation. It should be noted that this approach assumes the random samples cover the ball well enough that their maxima come close to the true worst-case gradient norm.
The resulting CLEVER score is an estimated lower bound on the distortion needed for any attack to succeed.
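A rough sketch of the estimation procedure is below, assuming the margin function \(g(x) = f_c(x) - f_j(x)\) for one attack target class and its gradient are available as callables (`margin` and `grad_margin` are hypothetical interfaces), and using `scipy.stats.weibull_max` for the reverse Weibull fit. This follows the spirit of the procedure rather than the exact sampling scheme and hyperparameters of the paper, and uses the \(L_2\) norm.

```python
import numpy as np
from scipy.stats import weibull_max   # the reverse Weibull distribution

def clever_score(x0, margin, grad_margin, radius=0.5,
                 n_batches=50, batch_size=100, seed=0):
    """CLEVER-style estimate of the robustness lower bound for one target
    class.  x0 is a flat numpy array; margin(x) and grad_margin(x) are
    hypothetical callables for f_c(x) - f_j(x) and its gradient."""
    rng = np.random.default_rng(seed)
    d = x0.size
    batch_maxima = []
    for _ in range(n_batches):
        # Sample batch_size points uniformly from the L2 ball around x0.
        dirs = rng.normal(size=(batch_size, d))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        radii = radius * rng.uniform(size=(batch_size, 1)) ** (1.0 / d)
        points = x0 + radii * dirs
        # Keep the largest gradient norm of the margin seen in this batch.
        batch_maxima.append(max(np.linalg.norm(grad_margin(p)) for p in points))
    # Fit a reverse Weibull to the batch maxima; its location parameter is
    # the estimate of the local (cross) Lipschitz constant.
    _, loc, _ = weibull_max.fit(batch_maxima)
    return min(margin(x0) / loc, radius)
```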
Experiments and Results
To evaluate how well the CLEVER score indicates robustness, the authors conducted experiments on the CIFAR-10, MNIST, and ImageNet datasets, pairing each dataset with a set of neural network architectures and corresponding popular defenses for those networks.
They estimate the Weibull distribution parameters and conduct a goodness-of-fit test to verify that the distribution fits the data empirically. They then apply two recent state-of-the-art attacks to their models: the Carlini and Wagner attack [2], covered in a previous blog post, and I-FGSM, an iterative variant of FGSM [5].
To evaluate the CLEVER score's effectiveness, the authors compare it with the average \( L_2 \) and \( L_\infty \) distortions of the adversarial examples generated by each type of attack. Since the score estimates the lower bound on the minimal distortion, if the estimate is accurate, then no attack should succeed with a distortion smaller than the score. Their results show that this holds under both distortion metrics.
These results also show that the Carlini and Wagner attack produces adversarial examples with distortion much closer to the minimum than I-FGSM, which demonstrates the CLEVER score's ability to evaluate the strength of attacks themselves. The score also serves as a metric for the effectiveness of defenses, since it was shown to increase for defensively distilled networks. Overall, the authors contribute a means of providing theoretical guarantees about neural networks that is not limited by the size of the network.
However, there does not seem to be a correlation between the CLEVER score and the distortion needed by the Carlini and Wagner attack. If the score truly indicated how hard it is to generate adversarial examples, we would expect networks with higher CLEVER scores to have higher average distortions than networks with lower scores, but this was not always the case.
Lower bounds on the Robustness to Adversarial Perturbations
Jonathan Peck, Joris Roels, Bart Goossens, Yvan Saeys. Lower bounds on the robustness to adversarial perturbations. NIPS 2017.
In this paper, the authors propose theoretical lower bounds on the adversarial perturbation for different types of layers in neural networks. They then combine these layer-by-layer lower bounds to compute a theoretical lower bound on the adversarial perturbation for the entire network. In contrast to previous work on this topic, the authors derive the lower bounds directly in terms of the model parameters, which makes the bounds applicable to real-world DNN models.
Lower bounds on different classifiers
The authors use a modular approach to find the lower bound for a feedforward neural network. They derive the bound for a particular layer by working backward from the output layer toward the input layer. More precisely, consider a layer that takes input \(y\) and computes a function \(h\), and suppose the layers that follow it are known to be robust to perturbations of their input up to norm \(k\). The goal at the current layer is then to find the smallest perturbation \(r\) of \(y\) that can change the layer's output by \(k\), i.e.,

$$\|h(y+r) - h(y)\| = k$$

Any adversarial perturbation \(q\) to that layer's input is then guaranteed to satisfy \(\|q\| \geq \|r\|\).
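Under these assumptions, the composition of the per-layer bounds can be sketched as a backward pass over the network. The helpers `forward`, `softmax_bound`, and `layer_bound` below are hypothetical stand-ins for the closed-form per-layer bounds derived in the paper (listed in the next subsections); the sketch only shows how they compose.

```python
def network_lower_bound(layers, x, forward, softmax_bound, layer_bound):
    """Compose per-layer robustness bounds backward through a feedforward net.

    forward(layers, x) returns the input seen by each layer on a normal
    forward pass; softmax_bound(layer, y) and layer_bound(layer, y, k) are
    hypothetical helpers for the closed-form bounds of each layer type
    (softmax, fully connected, convolutional, pooling).
    """
    layer_inputs = forward(layers, x)
    # Bound at the softmax output layer, computed from its own input.
    k = softmax_bound(layers[-1], layer_inputs[-1])
    # Walk backward: at each layer, find the smallest input perturbation
    # that could produce a change of size k at that layer's output.
    for layer, y in zip(reversed(layers[:-1]), reversed(layer_inputs[:-1])):
        k = layer_bound(layer, y, k)
    return k   # lower bound on the adversarial distortion for the whole network
```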
The lower bounds for the different types of layers proposed in the paper are summarized below. The proofs can be found in the supplemental material provided by the authors of the paper.
Softmax output layers
Let \(r\) be the smallest perturbation to the input \(x\) of the softmax layer such that \(f(x+r) \neq f(x)\). The authors derive a closed-form condition that any such \(r\) must satisfy; the expression is given in the paper.
Fully connected layers
Here, the assumption is that the next layer has a robustness bound of \(k\). Under that assumption, the authors derive a closed-form condition that the minimal perturbation \(r\) must satisfy; the expression is given in the paper. In that condition, \(J(x)\) is the Jacobian matrix of \(h_L\) at \(x\), and \(M\) is a bound on the second-order derivative of \(h_L\).
Convolutional layers
For a convolutional layer with filter tensor \(W \in \mathbb{R}^{k \times c \times q \times q}\) and input tensor \(X\), the authors derive a condition that the adversarial perturbation \(R\) must satisfy; again, the expression is given in the paper.
Pooling layers
For MAX and average pooling layers, and separately for \(L_p\) pooling, the authors derive analogous conditions that the adversarial perturbation \(R\) must satisfy; the expressions are given in the paper.
Results
The authors conducted their experiments on the MNIST and CIFAR-10 datasets using the LeNet architecture. To generate adversarial perturbations for testing the theoretical bounds, they used the fast gradient sign method (FGSM) [5], with a binary search to find the smallest FGSM perturbation parameter that results in misclassification of a sample, as sketched below.
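A minimal sketch of that search follows; `loss_grad` and `predict` are hypothetical stand-ins for the model's loss gradient and prediction functions, and the returned norm of the found perturbation is what gets compared against the theoretical bound.

```python
import numpy as np

def smallest_fgsm_perturbation(x, y, loss_grad, predict, hi=1.0, tol=1e-3):
    """Bisection on the FGSM step size epsilon.

    Returns the smallest epsilon (up to tol) for which the FGSM example
    x + epsilon * sign(grad) is misclassified, together with the L2 norm
    of that perturbation.  loss_grad(x, y) and predict(x) are hypothetical.
    """
    sign_grad = np.sign(loss_grad(x, y))   # FGSM perturbation direction
    lo = 0.0
    while hi - lo > tol:
        eps = (lo + hi) / 2
        if predict(x + eps * sign_grad) != y:
            hi = eps   # misclassified: try a smaller step size
        else:
            lo = eps   # still classified correctly: need a larger step size
    return hi, np.linalg.norm(hi * sign_grad)
```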
In their experiments, the authors did not find any adversarial example that violated the theoretical lower bounds. (The paper reports the experimental and theoretical perturbation norms in two tables, not reproduced here.) The mean theoretical perturbation is much lower than the experimental one for both the MNIST and CIFAR-10 datasets; the authors suggest this is because FGSM is not the most efficient attack technique.
In conclusion, the authors note that it is still unclear how tight these theoretical bounds are; it will be interesting to see how close they are to the minimal distortions achieved by more efficient attack techniques. Moreover, they have only applied their method to feedforward networks, so extending it to recurrent networks is a possible future direction. Finally, they propose developing a precise characterization of the tradeoff between robustness and accuracy in neural networks.
— Team Bus:
Anant Kharkar,
Atallah Hezbor,
Bhuvanesh Murali,
Mainuddin Jonas,
Weilin Xu
References
[1] T.-W Weng, H. Zhang, P.-Y. Chen, J. Yi, D. Su, Y. Gao, C.-J. Hsieh, L. Daniel “Evaluating the robustness of neural networks: An extreme value theory approach” ICLR 2018, January 2018.
[2] N. Carlini, D. Wagner “Towards evaluating the robustness of neural networks” arXiv preprint arXiv:1608.04644, March 2017.
[3] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus. “Intriguing properties of neural networks” arXiv preprint arXiv:1312.6199, 2013.
[4] M. Hein, M. Andriushchenko. “Formal guarantees on the robustness of a classifier against adversarial manipulation” arXiv preprint arXiv:1705.08475, 2017.
[5] I.J. Goodfellow, J. Shlens, C. Szegedy. “Explaining and harnessing adversarial examples” arXiv preprint arXiv:1412.6572, March 2015.
[6] Nicholas Carlini, Guy Katz, Clark Barrett, David L. Dill “Provably Minimally-Distorted Adversarial Examples” arXiv preprint arXiv:1709.10207, February 2018.
[7] Guy Katz, Clark Barrett, David Dill, Kyle Julian, Mykel Kochenderfer “Reluplex: An efficient SMT solver for verifying deep neural networks.” International Conference on Computer Aided Verification. Springer, Cham, 2017.