Non-Technical FAQ

#### What is the demo doing?

Neural networks achieve superhuman performance in many areas, but they are easily fooled.

In the demo above, we can force neural networks to predict anything we want. By adding nearly-invisible noise to an image, we turn "1"s into "9"s, "Stop" signs into "120 km/hr" signs, and dogs into hot dogs.

These noisy images are called adversarial examples. They break the integrity of machine learning systems, and the illusion of their superhuman performance.

#### Why does this matter?

Our world is becoming increasingly automated, yet these systems have strange failure modes.

If machine learning systems are not properly defended, attackers could:

• Impersonate others in facial recognition systems
• Force autonomous vehicles to misrecognize street signs & obstacles
• Bypass content moderation and spam filters in social networks
• Inject adversarial bytes into malware to bypass antivirus systems
• Digitally alter numbers on a check in a mobile banking app
• (and more)

#### Is this limited to image classification with neural networks?

No. Adversarial examples exist for almost every machine learning task: speech recognition, text classification, fraud detection, machine translation, reinforcement learning, ....

Moreover, all machine learning models (not just neural networks) are vulnerable. In fact, simpler models such as logistic regression are even more easily attacked.

Finally – beyond adversarial examples – there are many more adversarial attack vectors, including data poisoning, model backdooring, data extraction, and model stealing.

#### How do I defend against adversarial examples?

There are several proposed defenses, including adversarial training and admission control.

However, no defense is universal and many have proven ineffective, so work with an expert to quantify your risks and invest in defenses appropriately.

(What happens if someone can make your system predict anything they want?)

#### Do these attacks require access to the model?

The attacks in this library do (i.e. they are white-box). However, in real life, there are effective black-box attacks as well, which do not require access to gradients.

It's also well-known that the same adversarial example can affect multiple models, even with different architectures and training sets.

#### What are the differences between the attacks?

The attacks essentially differ in (1) the type of valid noise, (2) what function we use to measure "adversarialness", and (3) how hard we try to maximize that function.

Stronger attacks spend more compute to maximize the adversarialness of the image.

See the Technical FAQ below for the technical details.

#### Is this library a risk?

Open-source libraries help accelerate research in adversarial attacks. In fact, we are not the first; there are several great and heavily-used Python libraries for adversarial machine learning.

The purpose of adversarial.js is to showcase an easy-to-understand demo and build awareness of this failure mode of machine learning.

#### How did you make this?

The attacks are implemented in TensorFlow.js. The UI is built from scratch with Skeleton.

#### Who made this?

Me.

Here's a list of good resources, in rough order of approachability:

Last – feel free to email me questions.

Technical FAQ

#### What model architectures do you use?

MNIST uses a tiny feedforward network with 128 hidden units.

GTSRB uses a tiny CNN with 2 convolutional layers.

CIFAR-10 uses a standard CNN with 3 VGG blocks.

ImageNet uses the largest pre-trained MobileNet V2.

#### Can you describe the attacks?

The general idea of most attacks is to maximize the model's loss w.r.t. the input image. We take gradient ascent steps that modify the image towards greater mispredictions.

Note: This is quite similar to normal training, where we minimize the model's loss w.r.t. weights. Just backpropagate all the way to the input.

The $$L_p$$ norms correspond to different perturbation sets:

• $$L_\infty$$ – we can change all pixels up to a maximum amount
• $$L_0$$ – we can change a limited number of pixels by any amount
• $$L_2$$ – we are incentivized to make tiny changes to more pixels
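As a toy illustration (plain JavaScript with made-up numbers, not the adversarial.js API), here is how the three norms score the same perturbation $$\delta$$:

```javascript
// Three ways to measure the same perturbation delta (a toy 4-pixel image).
const delta = [0.1, 0.0, -0.3, 0.0];

// L-infinity: the largest change to any single pixel.
const linf = Math.max(...delta.map(Math.abs));
// L0: how many pixels changed at all.
const l0 = delta.filter(d => d !== 0).length;
// L2: the overall "energy" of the change, spread across all pixels.
const l2 = Math.sqrt(delta.reduce((s, d) => s + d * d, 0));

console.log(linf, l0, l2); // → 0.3, 2, ≈0.316
```

An $$L_\infty$$ attack bounds `linf`, an $$L_0$$ attack bounds `l0`, and an $$L_2$$ attack penalizes `l2`.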

Fast Gradient Sign Method ($$L_\infty$$ attack)

FGSM was one of the earliest methods to generate adversarial examples. It's simple, fast, and weak. It wasn't designed to produce high-quality adversarial examples, so it is not a good benchmark for robustness.

Essentially, it takes a single step of size $$\epsilon$$ in the direction of the sign of the gradient (so every pixel changes by the same magnitude):

$x_{adv} = x + \epsilon \, \text{sign}(\nabla_x \ell(f(x), y_{true}))$

where $$f$$ is the neural network, and $$\ell$$ is cross-entropy loss.
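As a minimal sketch of the idea — using a hypothetical one-layer logistic model $$f(x) = \sigma(w \cdot x)$$ as a stand-in for a network, not the adversarial.js code — FGSM looks like this:

```javascript
// FGSM sketch on a hypothetical one-layer logistic model f(x) = σ(w·x).
const sigmoid = z => 1 / (1 + Math.exp(-z));
const dot = (a, b) => a.reduce((s, ai, i) => s + ai * b[i], 0);

const w = [2.0, -1.0, 0.5];  // fixed (already-trained) weights
const x = [0.5, 0.5, 0.5];   // clean input, true label y = 1
const eps = 0.1;             // maximum per-pixel change

// For cross-entropy with y = 1, the input gradient is ∇x ℓ = (σ(w·x) − 1) · w.
const grad = w.map(wi => (sigmoid(dot(w, x)) - 1) * wi);

// One step of size ε in the direction of the gradient's sign.
const xAdv = x.map((xi, i) => xi + eps * Math.sign(grad[i]));

const pClean = sigmoid(dot(w, x));   // confidence in the true class, before
const pAdv = sigmoid(dot(w, xAdv));  // confidence after: strictly lower
```

For a real network, `grad` comes from backpropagating the loss all the way to the input pixels; everything else is the same.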

Basic Iterative Method ($$L_{\infty}$$ attack)

BIM is just FGSM repeated for multiple steps, with an appropriate step size. At each step, we make sure the resulting image is still within $$\epsilon$$ distance (in $$L_\infty$$ world) of the original.

$x_{adv_{t+1}} = \text{clip}_{\epsilon, x} (x_{adv_t} + \alpha \, \text{sign}(\nabla_x \ell(f(x_{adv_t}), y_{true})))$

$\text{where } x_{adv_0} = x$

This attack is the second strongest, and succeeds most of the time.
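A sketch of the loop, again on a hypothetical logistic model $$f(x) = \sigma(w \cdot x)$$ rather than the adversarial.js implementation:

```javascript
// BIM sketch: repeated FGSM steps of size α, clipped back into the ε-ball.
const sigmoid = z => 1 / (1 + Math.exp(-z));
const dot = (a, b) => a.reduce((s, ai, i) => s + ai * b[i], 0);
const clip = (v, lo, hi) => Math.min(hi, Math.max(lo, v));

const w = [2.0, -1.0, 0.5];  // fixed (already-trained) weights
const x = [0.5, 0.5, 0.5];   // clean input, true label y = 1
const eps = 0.1, alpha = 0.03, steps = 10;

let xAdv = [...x];
for (let t = 0; t < steps; t++) {
  // Gradient of the loss w.r.t. the *current* adversarial image.
  const grad = w.map(wi => (sigmoid(dot(w, xAdv)) - 1) * wi);
  xAdv = xAdv.map((xi, i) =>
    clip(xi + alpha * Math.sign(grad[i]),  // small FGSM step
         x[i] - eps, x[i] + eps));         // stay within ε (L∞) of the original
}

const maxChange = Math.max(...xAdv.map((v, i) => Math.abs(v - x[i])));
```

The per-step clip is what keeps the final image within the $$\epsilon$$-ball no matter how many steps we take.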

Jacobian-based Saliency Map Attack ($$L_0$$ attack)

JSMA directly uses the derivative of the neural network $$\nabla_x f(x)$$.

(Recall that $$f(x)$$ is a vector of softmax probabilities, so $$\nabla_x f(x)$$ is a Jacobian matrix.)

From the derivative, we can determine which pixels have the most impact on a class prediction. So we strategically define a measure of "saliency" – how much a pair of pixels increases the target class probability while decreasing all other class probabilities.

On each iteration, we maximally modify the two highest-saliency pixels. The actual equations are unwieldy and omitted (see the paper).

This attack works well on small images, but is too memory-inefficient for large images.

(I believe JSMA can fit into the "loss maximization" framework if we view it as coordinate ascent on a loss function whose derivative is saliency.)
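A rough single-pixel version of that saliency score, with a made-up 3-class Jacobian (the real JSMA scores pixel *pairs*, which is where the quadratic cost comes from):

```javascript
// Toy saliency in the spirit of JSMA, scoring single pixels.
// jac[j][i] = ∂f_j/∂x_i — a hypothetical Jacobian for 3 classes, 3 pixels.
const jac = [
  [ 0.10, -0.20,  0.05],   // class 0
  [ 0.30,  0.10, -0.40],   // class 1 (the target T)
  [-0.25,  0.15,  0.20],   // class 2
];
const T = 1;

const saliency = i => {
  const alpha = jac[T][i];  // how much pixel i raises the target probability
  const beta = jac.reduce((s, row, j) => (j === T ? s : s + row[i]), 0);
  // A pixel is salient only if it raises the target AND lowers the rest.
  return alpha > 0 && beta < 0 ? alpha * Math.abs(beta) : 0;
};

const scores = jac[0].map((_, i) => saliency(i));
// Pixel 0 wins: it pushes class 1 up (0.30) while pushing the others down.
```

The attack then maximally modifies the top-scoring pixels and recomputes the Jacobian for the next iteration.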

Jacobian-based Saliency Map Attack, One Pixel ($$L_0$$ attack)

JSMA doesn't scale beyond CIFAR-sized images, so this is my faster and simpler variant.

We directly use $$\nabla_x f(x)$$ to identify the single pixel with the highest impact on the target class, and change only that one pixel per iteration.

That's it. And surprisingly, this performs as well as normal JSMA.

(Although this avoids the quadratic space complexity of JSMA, the number of iterations required to be effective on ImageNet still makes it too slow.)
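Sketched with hypothetical values (not the adversarial.js code), one iteration of this variant is just an argmax over the target-class gradient:

```javascript
// One iteration of the one-pixel variant: find the pixel whose derivative
// most increases the target-class probability, and push it to its extreme.
// gradTarget[i] = ∂f_T/∂x_i — hypothetical values for a 4-pixel image.
const gradTarget = [0.02, -0.05, 0.30, 0.11];
const x = [0.5, 0.5, 0.5, 0.5];

let best = 0;
for (let i = 1; i < gradTarget.length; i++) {
  if (gradTarget[i] > gradTarget[best]) best = i;
}
x[best] = 1.0;  // maximally modify the single highest-impact pixel
```

Repeat, recomputing the gradient each time, until the target class wins or the pixel budget runs out.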

Carlini & Wagner ($$L_2$$ attack)

The C&W attack is a bit complex. They start from the original constrained optimization problem:

\begin{aligned} &\min_\delta ||\delta||_2^2 \\ \text{s.t. } &f(x + \delta) = T \\ &x + \delta \in [0, 1]^n \end{aligned}

where $$T$$ is the target class. This turns into the equivalent unconstrained optimization problem:

$\min_\delta ||\delta||_2^2 + c \cdot \max\{\max\{Z(x+\delta)_i \mid i \neq T\} - Z(x+\delta)_T, -\kappa\}$

where $$Z$$ is the logits before the softmax, and $$\kappa, c$$ are hyperparameters.

However, to respect the $$x + \delta \in [0, 1]^n$$ box constraint, we apply a change of variables and optimize over $$w$$ instead.

$\min_w ||\delta(w)||_2^2 + c \cdot \max\{\max\{Z(x+\delta(w))_i \mid i \neq T\} - Z(x+\delta(w))_T, -\kappa\}$

$\text{where } \delta(w) = \frac{1}{2}(\tanh(w) + 1) - x$

We minimize this with Adam, and finally get $$x_{adv} = x + \delta(w_{min})$$.
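The point of the $$\tanh$$ substitution is that the box constraint becomes automatic: $$x + \delta(w) = \frac{1}{2}(\tanh(w) + 1)$$ lies in $$[0, 1]$$ for every real $$w$$, so the optimizer never needs to clip. A quick check with arbitrary values (plain JavaScript, not the adversarial.js code):

```javascript
// The tanh change of variables makes the [0, 1] box constraint automatic:
// x + δ(w) = ½(tanh(w) + 1), which is in [0, 1] for every real w.
const x = [0.2, 0.9];                                    // original pixels
const delta = w => w.map((wi, i) => 0.5 * (Math.tanh(wi) + 1) - x[i]);
const adv = w => delta(w).map((d, i) => x[i] + d);       // = ½(tanh(w) + 1)

const candidates = [[-50, 50], [0, 0], [3, -3]];         // unconstrained w
const allInBox = candidates.every(w => adv(w).every(v => v >= 0 && v <= 1));
// No clipping is ever needed, so Adam can run on w without projection steps.
```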

This is the strongest attack, and always succeeds with reasonable speed even on large images.

(There are also $$L_0$$ and $$L_\infty$$ variants of C&W, but I believe $$L_2$$ is the most popular and is the one implemented in adversarial.js.)

#### Why do adversarial examples exist?

I'm not sure what the generally-accepted answer is (if you know – tell me!).

One explanation is in the FGSM paper: "for high dimensional problems, we can make many infinitesimal changes to the input that add up to one large change to the output".

Another explanation is that standard machine learning methodology (e.g. empirical risk minimization) assumes training and test samples are drawn i.i.d. from the same distribution. Naturally, ERM doesn't produce models that perform well on adversarial samples drawn from a worst-case, non-i.i.d. distribution.

(I've also heard vague comments about adversarial examples hiding in measure-zero regions.)

#### Why do you only consider $$L_p$$ perturbations?

This is standard practice – mostly because $$L_p$$ norm balls are mathematically convenient for machine learning researchers.

It's an open research question to design metrics that better match human perceptual distance, and consider robustness within that set of imperceptible perturbations.

You can argue that $$L_p$$ perturbation robustness is necessary, but not sufficient, for true adversarial robustness.

#### What are some open areas of research?

There are tons of interesting topics, including:

• effective defenses (both as theoretical and engineering problems)
• more realistic perturbation sets than $$L_p$$ norm balls