Non-Technical FAQ

Neural networks achieve superhuman performance in many areas, but they are easily fooled.

In the demo above, we can force neural networks to predict anything we want. By adding nearly-invisible noise to an image, we turn "1"s into "9"s, "Stop" signs into "120 km/hr" signs, and dogs into hot dogs.

These noisy images are called adversarial examples. They break the integrity of machine learning systems, and the illusion of their superhuman performance.

Our world is becoming increasingly automated, yet these systems have strange failure modes.

If machine learning systems are not properly defended, attackers could:

- Impersonate others in facial recognition systems
- Force autonomous vehicles to misrecognize street signs & obstacles
- Bypass content moderation and spam filters in social networks
- Inject adversarial bytes into malware to bypass antivirus systems
- Digitally alter numbers on a check in a mobile banking app
- (and more)

No – adversarial examples are not limited to images. They exist for almost every machine learning task: speech recognition, text classification, fraud detection, machine translation, reinforcement learning, and more.

Moreover, all machine learning models (not just neural networks) are vulnerable. In fact, simpler models such as logistic regression are even more easily attacked.

Finally – beyond adversarial examples – there are many more adversarial attack vectors, including data poisoning, model backdooring, data extraction, and model stealing.

There are several proposed defenses, including adversarial training and admission control.

However, no defense is universal and many have proven ineffective, so work with an expert to quantify your risks and invest in defenses appropriately.

(What happens if someone can make your system predict anything they want?)

The attacks in this library require access to the model's gradients (i.e. they are white-box). However, in real life, there are effective black-box attacks as well, which do not require access to gradients.

It's also well-known that the same adversarial example can affect multiple models, even with different architectures and training sets.

The attacks essentially differ in (1) the type of valid noise, (2) what function we use to measure "adversarialness", and (3) how hard we try to maximize that function.

Stronger attacks spend more compute to maximize the adversarialness of the image.

See this answer for the technical details.

Open-source libraries help accelerate research in adversarial attacks. In fact, we are not the first; there are several great and heavily-used Python libraries for adversarial machine learning.

The purpose of adversarial.js is to showcase an easy-to-understand demo and build awareness of this failure mode of machine learning.

The attacks are implemented in TensorFlow.js. The UI is built from scratch with Skeleton.

Here's a list of good resources, in rough order of approachability:

- The full FAQ
- The directory of attacks (try running locally and playing with various settings)
- [Blog] CleverHans – start here
- [Blog] Gradient Science – start here
- [Tutorial] Adversarial Robustness - Theory and Practice
- [Paper] SoK: Towards the Science of Security and Privacy in Machine Learning
- [Paper] Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning

Last – feel free to email me questions.

Technical FAQ

MNIST uses a tiny feedforward network with 128 hidden units.

GTSRB uses a tiny CNN with 2 convolutional layers.

CIFAR-10 uses a standard CNN with 3 VGG blocks.

ImageNet uses the largest pre-trained MobileNet V2.

The general idea of most attacks is to **maximize** the model's loss w.r.t. the input image. We take gradient ascent steps that modify the image towards greater mispredictions.

Note: This is quite similar to normal training, where we **minimize** the model's loss w.r.t. weights. Just backpropagate all the way to the input.
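As a rough illustration, here's a minimal NumPy sketch of gradient ascent on the input (rather than TensorFlow.js, which adversarial.js actually uses). The "network" is a toy softmax-regression model so the gradient can be written by hand; the weights, label, and step size are all illustrative:

```python
import numpy as np

# Toy differentiable "model": softmax regression with fixed, illustrative weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))          # 10 input features -> 4 classes
x = rng.uniform(size=10)              # the "image"
y = 2                                 # true class index (illustrative)

def loss_and_grad(x):
    logits = W.T @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[y])              # cross-entropy loss
    grad = W @ (p - np.eye(4)[y])     # d loss / d x (chain rule through logits)
    return loss, grad

# Gradient *ascent* on the input: each step pushes the image toward misprediction.
losses = []
for _ in range(20):
    loss, grad = loss_and_grad(x)
    losses.append(loss)
    x = np.clip(x + 0.05 * grad, 0.0, 1.0)   # stay a valid image

# The model's loss on this input has gone up.
assert losses[-1] > losses[0]
```

In normal training the same gradient machinery updates the weights to decrease this loss; here the weights are frozen and the input is updated to increase it.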

The \(L\) norms correspond to different perturbation sets:

- \(L_\infty\) – we can change all pixels up to a maximum amount
- \(L_0\) – we can change a limited number of pixels by any amount
- \(L_2\) – the total (Euclidean) size of the change is bounded, which favors spreading tiny changes across many pixels
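For concreteness, here is how the three norms measure the same perturbation vector (the values are made up):

```python
import numpy as np

# An illustrative perturbation: two tiny changes and one large change.
delta = np.array([0.0, 0.05, -0.05, 0.0, 0.3])

l0   = np.count_nonzero(delta)   # L_0: number of pixels changed -> 3
l2   = np.linalg.norm(delta)     # L_2: total Euclidean size of the change
linf = np.max(np.abs(delta))     # L_inf: largest single-pixel change -> 0.3
```

An \(L_0\) attack budgets `l0`, an \(L_\infty\) attack budgets `linf`, and an \(L_2\) attack budgets `l2` – which is why each produces visually different noise.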

**Fast Gradient Sign Method (\(L_\infty\) attack)**

FGSM was one of the earliest methods to generate adversarial examples. It's simple, fast, and weak. It wasn't designed to produce high-quality adversarial examples, so it is not a good benchmark for robustness.

Essentially, it takes one step in the direction of the gradient (clipped so all elements have the same magnitude):

\[ x_{adv} = x + \epsilon \, \text{sign}(\nabla_x \ell(f(x), y_{true})) \]

where \(f\) is the neural network, and \(\ell\) is cross-entropy loss.
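A minimal NumPy sketch of this single step, again using a toy softmax-regression model whose gradient we can write by hand (the weights, label, and \(\epsilon\) are illustrative):

```python
import numpy as np

# FGSM on a toy softmax-regression "network" (weights are illustrative).
rng = np.random.default_rng(1)
W = rng.normal(size=(10, 4))
x = rng.uniform(0.2, 0.8, size=10)    # image, kept away from the [0,1] boundary
y_true = 1
eps = 0.1                             # perturbation budget

logits = W.T @ x
p = np.exp(logits - logits.max()); p /= p.sum()
grad = W @ (p - np.eye(4)[y_true])    # gradient of cross-entropy w.r.t. x

# One signed step of size eps, then clip back to the valid pixel range.
x_adv = np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Every pixel moved by at most eps: the L_inf budget is respected.
assert np.max(np.abs(x_adv - x)) <= eps + 1e-12
```

Taking the sign of the gradient is what makes all elements of the step have the same magnitude \(\epsilon\).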

**Basic Iterative Method (\(L_{\infty}\) attack)**

BIM is just FGSM repeated for multiple steps, with an appropriate step size. At each step, we make sure the resulting image is still within \(\epsilon\) distance (in \(L_\infty\) world) of the original.

\[ x_{adv_{t+1}} = \text{clip}_{\epsilon, x} (x_{adv_t} + \alpha \, \text{sign}(\nabla_x \ell(f(x_{adv_t}), y_{true}))) \]

\[ \text{where } x_{adv_0} = x\]

This attack is the second strongest, and succeeds most of the time.
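The iteration above can be sketched in a few lines of NumPy, with the same toy softmax-regression model standing in for a real network (weights, label, \(\epsilon\), \(\alpha\), and step count are illustrative):

```python
import numpy as np

# BIM: repeated small FGSM steps, projected back into the eps L_inf ball.
rng = np.random.default_rng(2)
W = rng.normal(size=(10, 4))
x0 = rng.uniform(0.2, 0.8, size=10)   # original image
y_true = 0
eps, alpha, steps = 0.1, 0.02, 20     # budget, step size, iterations

def grad_loss(x):
    logits = W.T @ x
    p = np.exp(logits - logits.max()); p /= p.sum()
    return W @ (p - np.eye(4)[y_true])   # d cross-entropy / d x

x_adv = x0.copy()
for _ in range(steps):
    x_adv = x_adv + alpha * np.sign(grad_loss(x_adv))  # one FGSM-style step
    x_adv = np.clip(x_adv, x0 - eps, x0 + eps)         # stay within eps of x0
    x_adv = np.clip(x_adv, 0.0, 1.0)                   # stay a valid image

assert np.max(np.abs(x_adv - x0)) <= eps + 1e-12
```

The two clips implement \(\text{clip}_{\epsilon, x}\): the result is always inside the \(\epsilon\) ball around the original and inside the valid image box.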

**Jacobian-based Saliency Map Attack (\(L_0\) attack)**

JSMA directly uses the derivative of the neural network \(\nabla_x f(x)\).

(Recall that \(f(x)\) is a vector of softmax probabilities, so \(\nabla_x f(x)\) is a Jacobian matrix.)

From the derivative, we can determine which pixel has the most impact on a class prediction. So we strategically define a measure of "saliency" – how much a pair of pixels increases the target class probability + decreases all other class probabilities.

On each iteration, we maximally modify the two highest-saliency pixels. The actual equations are unwieldy and omitted (see the paper).

This attack works well on small images, but is too memory-inefficient for large images.

(I believe JSMA can fit into the "loss maximization" framework if we view it as coordinate ascent on a loss function whose derivative is saliency.)

**Jacobian-based Saliency Map Attack, One Pixel (\(L_0\) attack)**

JSMA doesn't scale beyond CIFAR-sized images, so this is my faster and simpler variant.

We directly use \(\nabla_x f(x)\) to identify the single pixel with the highest impact on the target class, and change only that one pixel per iteration.

That's it. And surprisingly, this performs as well as normal JSMA.

(Although this avoids the quadratic space complexity of JSMA, the number of iterations needed to be effective on ImageNet still makes it too slow in practice.)
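The one-pixel idea can be sketched as follows, again on a toy softmax-regression model (not the library's implementation; weights, image, target class, and iteration count are all illustrative):

```python
import numpy as np

# One-pixel-per-iteration JSMA variant on a toy softmax-regression model.
rng = np.random.default_rng(3)
W = rng.normal(size=(10, 4))                  # 10 "pixels" -> 4 classes
x = rng.uniform(size=10)
target = 3
iters = 5

def probs(x):
    z = W.T @ x
    p = np.exp(z - z.max())
    return p / p.sum()

x_adv = x.copy()
for _ in range(iters):
    p = probs(x_adv)
    # d p[target] / d x: softmax Jacobian row chained with the linear logits
    dpt_dx = W @ (p[target] * (np.eye(4)[target] - p))
    i = int(np.argmax(np.abs(dpt_dx)))        # single most influential pixel
    x_adv[i] = 1.0 if dpt_dx[i] > 0 else 0.0  # push it to its extreme value

# An L_0 attack: at most `iters` pixels differ from the original image.
assert np.count_nonzero(x_adv != x) <= iters
```

Only the row of the Jacobian for the target class is ever needed, which is what avoids JSMA's quadratic blowup over pixel pairs.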

**Carlini & Wagner (\(L_2\) attack)**

The C&W attack is a bit complex. They start from the original constrained optimization problem:

\[\begin{aligned} &\min_\delta ||\delta||_2^2 \\ \text{s.t. } &f(x + \delta) = T \\ &x + \delta \in [0, 1]^n \end{aligned}\]

where \(T\) is the target class. This turns into the equivalent unconstrained optimization problem:

\[ \min_\delta ||\delta||_2^2 + c \cdot \max\{\max\{Z(x+\delta)_i \mid i \neq T\} - Z(x+\delta)_T, -\kappa\}\]

where \(Z\) is the logits before the softmax, and \(\kappa, c\) are hyperparameters.

However, to respect the \(x + \delta \in [0, 1]^n\) box constraint, we apply a change of variables and optimize over \(w\) instead.

\[ \min_w ||\delta(w)||_2^2 + c \cdot \max\{\max\{Z(x+\delta(w))_i \mid i \neq T\} - Z(x+\delta(w))_T, -\kappa\}\]

\[\text{where } \delta(w) = \frac{1}{2}(\tanh(w) + 1) - x\]

We minimize this with Adam, and finally get \(x_{adv} = x + \delta(w_{min})\).

This is the strongest attack, and always succeeds with reasonable speed even on large images.

(There are also \(L_0\) and \(L_\infty\) variants of C&W, but I believe \(L_2\) is the most popular and is the one implemented in adversarial.js.)
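The key trick – the \(\tanh\) change of variables – can be demonstrated in isolation. The sketch below (illustrative values, not the library's implementation) shows that for any unconstrained \(w\), the candidate image automatically lands inside the box:

```python
import numpy as np

# C&W change of variables: for ANY real-valued w, the candidate image
#   x + delta(w) = 0.5 * (tanh(w) + 1)
# automatically lies in [0, 1]^n, so the optimizer never has to
# enforce the box constraint explicitly.
rng = np.random.default_rng(4)
x = rng.uniform(size=10)                 # original image (illustrative)

def delta(w):
    return 0.5 * (np.tanh(w) + 1.0) - x  # perturbation parametrized by w

w = rng.normal(scale=10.0, size=10)      # an arbitrary unconstrained w
x_adv = x + delta(w)

# The box constraint holds for free.
assert np.all((0.0 <= x_adv) & (x_adv <= 1.0))
```

In the full attack, \(w\) is then updated with Adam to minimize the penalized objective above; this snippet only shows why the reparametrization makes the box constraint vacuous.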

I'm not sure what the generally-accepted answer is (if you know – tell me!).

One explanation is in the FGSM paper: "for high dimensional problems, we can make many infinitesimal changes to the input that add up to one large change to the output".

Another explanation is that standard machine learning methodology (e.g. ERM) assumes training and test samples are drawn I.I.D. from the same distribution. Naturally, then, ERM doesn't produce models that perform well on adversarial samples, which come from a worst-case, non-I.I.D. distribution.

(I've also heard vague comments about adversarial examples hiding in measure-zero regions.)

This is standard practice – mostly because \(L_p\) norm balls are mathematically convenient for machine learning researchers.

It's an open research question to design metrics that better match human perceptual distance, and consider robustness within that set of imperceptible perturbations.

You can argue that \(L_p\) perturbation robustness is necessary, but not sufficient, for true adversarial robustness.

There are tons of interesting topics, including:

- effective defenses (both as theoretical and engineering problems)
- more realistic perturbation sets than \(L_p\) norm balls
- the tradeoff between adversarial robustness and non-adversarial accuracy
- the side effects of robustness (e.g. interpretability, disentanglement)
- transferability of adversarial examples