{{+bindTo:partials.standard_nacl_article}}
<b><font color="#cc0000">
NOTE:
Deprecation of the technologies described here has been announced
for platforms other than ChromeOS.<br/>
Please visit our
<a href="/native-client/migration">migration guide</a>
for details.
</font></b>
<hr/><section id="id1">
<h1 id="id1">ARM 32-bit Sandbox</h1>
<p>Native Client for ARM is a sandboxing technology for running
programs—even malicious ones—safely, on computers that use 32-bit
ARM processors. The ARM sandbox is an extension of earlier work on
Native Client for x86 processors. Security is provided with a low
performance overhead of about 10% over regular ARM code, and as you’ll
see in this document the sandbox model is beautifully simple, meaning
that the trusted codebase is much easier to validate.</p>
<p>As an implementation detail, the Native Client 32-bit ARM sandbox is
currently used by Portable Native Client to execute code on 32-bit ARM
machines in a safe manner. The portable bitcode contained in a <strong>pexe</strong>
is translated to a 32-bit ARM <strong>nexe</strong> before execution. This may change
in the future: Portable Native Client doesn’t necessarily need this
sandbox to execute code on ARM. Note that the Portable Native Client
compiler itself is also untrusted: it too runs in the ARM sandbox
described in this document.</p>
<p>On this page, we describe how Native Client works on 32-bit ARM. We
assume no prior knowledge about the internals of Native Client, on x86
or any other architecture, but we do assume some familiarity with
assembly languages in general.</p>
<div class="contents local" id="contents" style="display: none">
<ul class="small-gap">
<li><p class="first"><a class="reference internal" href="#an-introduction-to-the-arm-architecture" id="id3">An Introduction to the ARM Architecture</a></p>
<ul class="small-gap">
<li><a class="reference internal" href="#about-arm-and-armv7-a" id="id4">About ARM and ARMv7-A</a></li>
<li><a class="reference internal" href="#arm-programmer-s-model" id="id5">ARM Programmer’s Model</a></li>
</ul>
</li>
<li><p class="first"><a class="reference internal" href="#the-native-client-approach" id="id6">The Native Client Approach</a></p>
<ul class="small-gap">
<li><p class="first"><a class="reference internal" href="#nacl-arm-pure-software-fault-isolation" id="id7">NaCl/ARM: Pure Software Fault Isolation</a></p>
<ul class="small-gap">
<li><a class="reference internal" href="#load-and-store" id="id8"><em>Load</em> and <em>Store</em></a></li>
<li><a class="reference internal" href="#the-stack-pointer-thread-pointer-and-program-counter" id="id9">The Stack Pointer, Thread Pointer, and Program Counter</a></li>
<li><a class="reference internal" href="#indirect-branch" id="id10"><em>Indirect Branch</em></a></li>
<li><a class="reference internal" href="#literal-pools-and-data-bundles" id="id11">Literal Pools and Data Bundles</a></li>
</ul>
</li>
<li><p class="first"><a class="reference internal" href="#trampolines-and-memory-layout" id="id12">Trampolines and Memory Layout</a></p>
<ul class="small-gap">
<li><a class="reference internal" href="#memory-map" id="id13">Memory Map</a></li>
<li><a class="reference internal" href="#inside-a-trampoline" id="id14">Inside a Trampoline</a></li>
</ul>
</li>
<li><p class="first"><a class="reference internal" href="#loose-ends" id="id15">Loose Ends</a></p>
<ul class="small-gap">
<li><a class="reference internal" href="#forbidden-instructions" id="id16">Forbidden Instructions</a></li>
<li><a class="reference internal" href="#coprocessors" id="id17">Coprocessors</a></li>
<li><a class="reference internal" href="#validator-code" id="id18">Validator Code</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</div><h2 id="an-introduction-to-the-arm-architecture">An Introduction to the ARM Architecture</h2>
<p>In this section, we summarize the relevant parts of the ARM processor
architecture.</p>
<h3 id="about-arm-and-armv7-a">About ARM and ARMv7-A</h3>
<p>ARM is one of the older commercial “RISC” processor designs, dating back
to the early 1980s. Today, it is used primarily in embedded systems:
everything from toys, to home automation, to automobiles. However, its
most visible use is in cellular phones, tablets and some
laptops.</p>
<p>Through the years, there have been many revisions of the ARM
architecture, written as ARMv<em>X</em> for some version <em>X</em>. Native Client
specifically targets the ARMv7-A architecture commonly used in high-end
phones and smartbooks. This revision, defined in the mid-2000s, adds a
number of useful instructions, and specifies some portions of the system
that used to be left to individual chip manufacturers. Critically,
ARMv7-A specifies the “eXecute Never” bit, or <em>XN</em>. This pagetable
attribute lets us mark memory as non-executable. Our security relies on
the presence of this feature.</p>
<p>ARMv8 adds a new 64-bit instruction set architecture called A64, while
also enhancing the 32-bit A32 ISA. For Native Client’s purposes the A32
ISA is equivalent to the ARMv7 ARM ISA, albeit with a few new
instructions. This document only discusses the 32-bit A32 instruction
set: A64 would require a different sandboxing model.</p>
<h3 id="arm-programmer-s-model">ARM Programmer’s Model</h3>
<p>While modern ARM chips support several instruction encodings, 32-bit
Native Client on ARM focuses on a single one: a fixed-width encoding
called A32 (previously, and confusingly, called simply ARM), in which
every instruction is 32 bits wide. Thumb, Thumb2 (now confusingly called
T32), Jazelle, ThumbEE and such aren’t supported by Native Client. This
dramatically simplifies some of our analyses, as we’ll see later. Nearly
every instruction can be conditionally executed based on the contents of
a dedicated condition code register.</p>
<p>ARM processors have 16 general-purpose registers used for integer and
memory operations, written <code>r0</code> through <code>r15</code>. Of these, two have
special roles baked into the hardware:</p>
<ul class="small-gap">
<li><code>r14</code> is the Link Register. The ARM <em>call</em> instruction
(<em>branch-with-link</em>) doesn’t use the stack directly. Instead, it
stashes the return address in <code>r14</code>. In other circumstances, <code>r14</code>
can be (and is!) used as a general-purpose register. When <code>r14</code> is
playing its Link Register role, it’s referred to as <code>lr</code>.</li>
<li><code>r15</code> is the Program Counter. While it can be read and written like
any other register, setting it to a new value will cause execution to
jump to a new address. Using it in some circumstances is also
undefined by the ARM architecture. Because of this, <code>r15</code> is never
used for anything else, and is referred to as <code>pc</code>.</li>
</ul>
<p>Other registers are given roles by convention. The only important
registers to Native Client are <code>r9</code> and <code>r13</code>, which are used as the
Thread Pointer location and Stack Pointer, respectively. When playing these
roles, they’re referred to as <code>tp</code> and <code>sp</code>.</p>
<p>Like other RISC-inspired designs, ARM programs use explicit <em>load</em> and
<em>store</em> instructions to access memory. All other instructions operate
only on registers, or on registers and small constants called
immediates. Because both instructions and data words are 32 bits wide, we
can’t simply embed a 32-bit number into an instruction. ARM programs use
three methods to work around this, all of which Native Client exploits:</p>
<ol class="arabic simple">
<li>Many instructions can encode a modified immediate, which is an 8-bit
number rotated right by an even number of bits.</li>
<li>The <code>movw</code> and <code>movt</code> instructions set the bottom and
top 16 bits of a register, respectively, and together can therefore
encode any 32-bit immediate.</li>
<li>For values that can’t be represented as modified immediates, ARM
programs use <code>pc</code>-relative loads to load data from inside the
code—hidden in a place where it won’t be executed such as “constant
pools”, just past the final return of a function.</li>
</ol>
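<p>The first two methods can be modelled in a few lines of Python. This is
only a sketch for experimenting with constants, not part of Native Client:
the helper names <code>ror32</code>, <code>is_modified_immediate</code> and
<code>movw_movt_halves</code> are made up for this illustration.</p>

```python
WORD_MASK = 0xFFFFFFFF


def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & WORD_MASK


def is_modified_immediate(value):
    """True if `value` is some 8-bit number rotated right by an even amount."""
    for rotation in range(0, 32, 2):
        # Rotating right by (32 - rotation) undoes a rotate-right by `rotation`.
        if ror32(value, 32 - rotation) <= 0xFF:
            return True
    return False


def movw_movt_halves(value):
    """Split a 32-bit value into the 16-bit immediates for (movw, movt)."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF
```

<p>With this model, <code>0xC0000000</code> and <code>0xC000000F</code> (two masks
that appear later on this page) both qualify as modified immediates, while a
value such as <code>0x12345678</code> does not and would need
<code>movw</code>/<code>movt</code> or a <code>pc</code>-relative load.</p>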
<p>We’ll introduce more details of the ARM instruction set later, as we
walk through the system.</p>
<h2 id="the-native-client-approach">The Native Client Approach</h2>
<p>Native Client runs an untrusted program, potentially from an unknown or
malicious source, inside a sandbox created by a trusted runtime. The
trusted runtime allows the untrusted program to “call-out” and perform
certain actions, such as drawing graphics, but prevents it from
accessing the operating system directly. This “call-out” facility,
called a trampoline, looks like a standard function call to the
untrusted program, but it allows control to escape from the sandbox in a
controlled way.</p>
<p>The untrusted program and trusted runtime inhabit the same process, or
virtual address space, maintained by the operating system. To keep the
trusted runtime behaving the way we expect, we must prevent the
untrusted program from accessing and modifying its internals. Since they
share a virtual address space, we can’t rely on the operating system for
this. Instead, we isolate the untrusted program from the trusted
runtime.</p>
<p>Unlike modern operating systems, we use a cooperative isolation
method. Native Client can’t run any off-the-shelf program compiled for
an off-the-shelf operating system. The program must be compiled to
comply with Native Client’s rules. The details vary on each platform,
but in general, the untrusted program:</p>
<ul class="small-gap">
<li>Must not attempt to use certain forbidden instructions, such as system
calls.</li>
<li>Must not attempt to modify its own code without abiding by Native
Client’s code modification rules.</li>
<li>Must not jump into the middle of an instruction group, or otherwise do
tricky things to cause instructions to be interpreted multiple ways.</li>
<li>Must use special, strictly-defined instruction sequences to perform
permitted but potentially dangerous actions. We call these sequences
pseudo-instructions.</li>
</ul>
<p>We can’t simply take the program’s word that it complies with these
rules—we call it “untrusted” for a reason! Nor do we require it to be
produced by a special compiler; in practice, we don’t trust our
compilers either. Instead, we apply a load-time validator that
disassembles the program. The validator either proves that the program
complies with our rules, or rejects it as unsafe. By keeping the rules
simple, we keep the validator simple, small, and fast. We like to put
our trust in small, simple things, and the validator is key to the
system’s security.</p>
<aside class="note">
For the computationally-inclined, all our validators scale linearly in
the size of the program.
</aside>
<h3 id="nacl-arm-pure-software-fault-isolation">NaCl/ARM: Pure Software Fault Isolation</h3>
<p>In the original Native Client system for the x86, we used unusual
hardware features of that processor (the segment registers) to isolate
untrusted programs. This was simple and fast, but won’t work on ARM,
which has nothing equivalent. Instead, we use pure software fault
isolation.</p>
<p>We use a fixed address space layout: the untrusted program gets the
lowest gigabyte, addresses <code>0</code> through <code>0x3FFFFFFF</code>. The rest of the
address space holds the trusted runtime and the operating system. We
isolate the program by requiring every <em>load</em>, <em>store</em>, and <em>indirect
branch</em> (to an address in a register) to use a pseudo-instruction. The
pseudo-instructions ensure that the address stays within the
sandbox. The <em>indirect branch</em> pseudo-instruction, in turn, ensures that
such branches won’t split up other pseudo-instructions.</p>
<p>At either side of the sandbox, we place small (8KiB) guard
regions. These are simply areas in the process’s address space that are
mapped without read, write, or execute permissions, so any attempt to
access them for any reason—<em>load</em>, <em>store</em>, or <em>jump</em>—will cause a
fault.</p>
<p>Finally, we ban the use of certain instructions, notably direct system
calls. This is to ensure that the untrusted program can be run on any
operating system supported by Native Client, and to prevent access to
certain system features that might be used to subvert the sandbox. As a
side effect, it helps to prevent programs from exploiting buggy
operating system APIs.</p>
<p>Let’s walk through the details, starting with the simplest part: <em>load</em>
and <em>store</em>.</p>
<h4 id="load-and-store"><em>Load</em> and <em>Store</em></h4>
<p>All access to memory must be through <em>load</em> and <em>store</em>
pseudo-instructions. These are simply a native <em>load</em> or <em>store</em>
instruction, preceded by a guard instruction.</p>
<p>Each <em>load</em> or <em>store</em> pseudo-instruction is similar to the <em>load</em> shown
below. We use abstract “placeholder” registers instead of specific
numbered registers for the sake of discussion. <code>rA</code> is the register
holding the address to load from. <code>rD</code> is the destination for the
loaded data.</p>
<pre>
bic rA, #0xC0000000
ldr rD, [rA]
</pre>
<p>The first instruction, <code>bic</code>, clears the top two bits of <code>rA</code>. In
this case, that means that the value in <code>rA</code> is forced to an address
inside our sandbox, between <code>0</code> and <code>0x3FFFFFFF</code>, inclusive.</p>
<p>The second instruction, <code>ldr</code>, uses the previously-sandboxed address
to load a value. This address might not be the address that the program
intended, and might cause an access to an unmapped memory location
within the sandbox: <code>bic</code> forces the address to be valid, by clearing
the top two bits. This is a no-op in a correct program.</p>
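<p>To make the effect of the guard concrete, here is a small Python model of
the masking step. It is a sketch only: <code>bic</code> and
<code>sandboxed_address</code> are illustrative helper names, and real code
performs this in a single ARM instruction.</p>

```python
SANDBOX_MASK = 0xC0000000  # the two address bits cleared by the guard


def bic(value, mask):
    """Model of ARM `bic`: clear every bit of `value` that is set in `mask`."""
    return value & ~mask & 0xFFFFFFFF


def sandboxed_address(r_a):
    """Address actually used by the `ldr` after the guard has run."""
    return bic(r_a, SANDBOX_MASK)
```

<p>An in-sandbox address such as <code>0x00001000</code> passes through
unchanged, while <code>0xC0001000</code> is silently redirected to
<code>0x00001000</code>.</p>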
<p>This illustrates a common property of all Native Client systems: we aim
for safety, not correctness. A program using an invalid address in
<code>rA</code> here is simply broken, so we are free to do whatever we want to
preserve safety. In this case the program might load an invalid (but
safe) value, or cause a segmentation fault limited to the untrusted
code.</p>
<p>Now, if we allowed arbitrary branches within the program, a malicious
program could set up carefully-crafted values in <code>rA</code>, and then jump
straight to the <code>ldr</code>. This is why we validate that programs never
split pseudo-instructions.</p>
<h5 id="alternative-sandboxing">Alternative Sandboxing</h5>
<pre>
tst rA, #0xC0000000
ldreq rD, [rA]
</pre>
<p>The first instruction, <code>tst</code>, performs a bitwise-<code>AND</code> of <code>rA</code>
and the modified immediate literal, <code>0xC0000000</code>. It sets the
condition flags based on the result, but does not write the result to a
register. In particular, it sets the <code>Z</code> condition flag if the result
was zero—if the two values had no set bits in common. In this case,
that means that the value in <code>rA</code> was an address inside our sandbox,
between <code>0</code> and <code>0x3FFFFFFF</code>, inclusive.</p>
<p>The second instruction, <code>ldreq</code>, is a conditional load if equal. As we
mentioned before, nearly all ARM instructions can be made
conditional. In assembly language, we simply stick the desired condition
on the end of the instruction’s mnemonic name. Here, the condition is
<code>EQ</code>, which causes the instruction to execute only if the <code>Z</code> flag
is set.</p>
<p>Thus, when the pseudo-instruction executes, the <code>tst</code> sets <code>Z</code> if
(and only if) the value in <code>rA</code> is an address within the bounds of the
sandbox, and then the <code>ldreq</code> loads if (and only if) it was. If <code>rA</code>
held an invalid address, the <em>load</em> does not execute, and <code>rD</code> is
unchanged.</p>
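<p>The same behaviour can be sketched in Python. This is an illustrative
model under simplifying assumptions: memory is a plain dictionary from address
to value, and <code>tst_ldreq</code> is a made-up helper name.</p>

```python
SANDBOX_MASK = 0xC0000000


def tst_ldreq(r_a, r_d, memory):
    """Return the new value of rD after `tst rA, #mask` then `ldreq rD, [rA]`."""
    z_flag = (r_a & SANDBOX_MASK) == 0  # tst: Z is set when no masked bit is set
    if z_flag:                          # ldreq: load only if Z is set
        return memory.get(r_a, 0)
    return r_d                          # load skipped, rD keeps its old value


memory = {0x00001000: 42}
in_bounds = tst_ldreq(0x00001000, 7, memory)
out_of_bounds = tst_ldreq(0xC0001000, 7, memory)
```

<p>Unlike the <code>bic</code>-based sequence, an out-of-bounds address is
not redirected into the sandbox here: the load simply does not happen.</p>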
<aside class="note">
The <code>tst</code>-based sequence is faster than the <code>bic</code>-based sequence
on modern ARM chips. It avoids a data dependency in the address
register. This is why we keep both around. The <code>tst</code>-based sequence
unfortunately leaks information on some processors, and is therefore
forbidden on certain processors. This effectively means that it cannot
be used for regular Native Client <strong>nexe</strong> files, but can be used with
Portable Native Client because the target processor is known at
translation time from <strong>pexe</strong> to <strong>nexe</strong>.
</aside>
<h5 id="addressing-modes">Addressing Modes</h5>
<p>ARM has an unusually rich set of addressing modes. We allow all but one:
register-indexed, where two registers are added to determine the
address.</p>
<p>We permit simple <em>load</em> and <em>store</em>, as shown above. We also permit
displacement, pre-index, and post-index memory operations:</p>
<pre>
bic rA, #0xC0000000
ldr rD, [rA, #1234] ; This is fine.
bic rA, #0xC0000000
ldr rD, [rA, #1234]! ; Also fine.
bic rA, #0xC0000000
ldr rD, [rA], #1234 ; Looking good.
</pre>
<p>In each case, we know <code>rA</code> points into the sandbox when the <code>ldr</code>
executes. We allow adding an immediate displacement to <code>rA</code> to
determine the final address (as in the first two examples here) because
the largest immediate displacement is ±4095 bytes, while our guard pages
are 8192 bytes wide.</p>
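<p>The arithmetic behind this argument can be checked with a short Python
sketch. The constants come from the text above; the function names are
illustrative, and a negative result simply stands for an address in the guard
region below the sandbox.</p>

```python
SANDBOX_BASE = 0x00000000
SANDBOX_TOP = 0x3FFFFFFF
GUARD_SIZE = 8192        # bytes in each guard region
MAX_DISPLACEMENT = 4095  # largest immediate displacement, in either direction


def reachable_range():
    """Lowest and highest address reachable from a masked rA plus an immediate."""
    return SANDBOX_BASE - MAX_DISPLACEMENT, SANDBOX_TOP + MAX_DISPLACEMENT


def within_sandbox_or_guards(address):
    """True if the address is inside the sandbox or one of its guard regions."""
    return SANDBOX_BASE - GUARD_SIZE <= address <= SANDBOX_TOP + GUARD_SIZE


lowest, highest = reachable_range()
```

<p>Both extremes fall inside a guard region, so the worst a displaced access
can do is fault.</p>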
<p>We also allow ARM’s more unusual <em>load</em> and <em>store</em> instructions, such
as <em>load-multiple</em> and <em>store-multiple</em>, etc.</p>
<h5 id="conditional-load-and-store">Conditional <em>Load</em> and <em>Store</em></h5>
<p>There’s one problem with the pseudo-instructions shown above: they are
unconditional (assuming <code>rA</code> is valid). ARM compilers regularly use
conditional <em>load</em> and <em>store</em>, so we should support this in Native
Client. We do so by defining alternate, predictable
pseudo-instructions. Here is a conditional <em>store</em>
(<em>store-if-greater-than</em>) using this pseudo-instruction sequence:</p>
<pre>
bicgt rA, #0xC0000000
strgt rX, [rA, #123]
</pre>
<h4 id="the-stack-pointer-thread-pointer-and-program-counter">The Stack Pointer, Thread Pointer, and Program Counter</h4>
<h5 id="stack-pointer">Stack Pointer</h5>
<p>In C-like languages, the stack is used to store return addresses during
function calls, as well as any local variables that won’t fit in
registers. This makes stack operations very common.</p>
<p>Native Client does not require guard instructions on any <em>load</em> or
<em>store</em> involving the stack pointer, <code>sp</code>. This improves performance
and reduces code size. However, ARM’s stack pointer isn’t special: it’s
just another register, called <code>sp</code> only by convention. To make it safe
to use this register as a <em>load</em> or <em>store</em> address without guards, we
add a rule: <code>sp</code> must always contain a valid address.</p>
<p>We enforce this rule by restricting the sorts of operations that
programs can use to alter <code>sp</code>. Programs can alter <code>sp</code> by adding or
subtracting an immediate, as a side-effect of a <em>load</em> or <em>store</em>:</p>
<pre>
ldr rX, [sp], #4 ; Load from stack, then add 4 to sp.
str rX, [sp, #1234]! ; Add 1234 to sp, then store to stack.
</pre>
<p>These are safe because, as we mentioned before, the largest immediate
available in a <em>load</em> or <em>store</em> is ±4095. Even after adding or
subtracting 4095, the stack pointer will still be within the sandbox or
guard regions.</p>
<p>Any other operation that alters <code>sp</code> must be followed by a guard
instruction. The most common alterations, in practice, are addition and
subtraction of arbitrary integers:</p>
<pre>
add sp, rX
bic sp, #0xC0000000
</pre>
<p>The <code>bic</code> is similar to the one we used for <em>load</em> and
<em>store</em>, and serves exactly the same purpose: after it completes, <code>sp</code>
is a valid address.</p>
<aside class="note">
Clever assembly programmers and compilers may want to use this
“trusted” property of <code>sp</code> to emit more efficient code: in a hot
loop instead of using <code>sp</code> as a stack pointer it can be temporarily
used as an index pointer (e.g. to traverse an array). This avoids the
extra <code>bic</code> whenever the pointer is updated in the loop.
</aside>
<h5 id="thread-pointer-loads">Thread Pointer Loads</h5>
<p>The thread pointer and IRT thread pointer are stored in the trusted
address space. All uses and definitions of <code>r9</code> from untrusted code
are forbidden except as follows:</p>
<pre>
ldr Rn, [r9] ; Load user thread pointer.
ldr Rn, [r9, #4] ; Load IRT thread pointer.
</pre>
<h5 id="pc-relative-loads"><code>pc</code>-relative Loads</h5>
<p>By extension, we also allow <em>load</em> through the <code>pc</code> without a
mask. The explanation is quite similar:</p>
<ul class="small-gap">
<li>Our control-flow isolation rules mean that the <code>pc</code> will always
point into the sandbox.</li>
<li>The maximum immediate displacement that can be used in a
<code>pc</code>-relative <em>load</em> is smaller than the width of the guard pages.</li>
</ul>
<p>We do not allow <code>pc</code>-relative stores, because they look suspiciously
like self-modifying code. Nor do we allow any addressing mode that would
alter the <code>pc</code> as a side effect of the <em>load</em>.</p>
<h4 id="indirect-branch"><em>Indirect Branch</em></h4>
<p>There are two types of control flow on ARM: direct and indirect. Direct
control flow instructions have an embedded target address or
offset. Indirect control flow instructions take their destination
address from a register. The <code>b</code> (branch) and <code>bl</code>
(<em>branch-with-link</em>) instructions are <em>direct branch</em> and <em>call</em>,
respectively. The <code>bx</code> (<em>branch-exchange</em>) and <code>blx</code>
(<em>branch-with-link-exchange</em>) are the indirect equivalents.</p>
<p>Because the program counter <code>pc</code> is simply another register, ARM also
has many implicit indirect control flow instructions. Programs can
operate on the <code>pc</code> using <em>add</em> or <em>load</em>, or even outlandish (and
often specified as having unpredictable-behavior) things like multiply!
In Native Client we ban all such instructions. Indirect control flow is
exclusively through <code>bx</code> and <code>blx</code>. Because all of ARM’s control
flow instructions are called <em>branch</em> instructions, we’ll use the term
<em>indirect branch</em> from here on, even though this includes things like
<em>virtual call</em>, <em>return</em>, and the like.</p>
<h5 id="the-trouble-with-indirection">The Trouble with Indirection</h5>
<p><em>Indirect branch</em> presents two problems for Native Client:</p>
<ul class="small-gap">
<li>We must ensure that they don’t send execution outside the sandbox.</li>
<li>We must ensure that they don’t break up the instructions inside a
pseudo-instruction, by landing on the second one.</li>
</ul>
<aside class="note">
On the x86 architectures we must also ensure that an <em>indirect branch</em>
doesn’t land inside an instruction. This is unnecessary on ARM, where all
instructions are 32 bits wide.
</aside>
<p>Checking both of these for <em>direct branch</em> is easy: the validator just
pulls the (fixed) target address out of the instruction and checks what
it points to.</p>
<h5 id="the-native-client-solution-bundles">The Native Client Solution: “Bundles”</h5>
<p>For <em>indirect branch</em>, we can address the first problem by simply
masking some high-order bits off the address, like we did for <em>load</em> and
<em>store</em>. The second problem is more subtle. Detecting every possible
route that every <em>indirect branch</em> might take is difficult. Instead, we
take the approach pioneered by the original Native Client: we restrict
the possible places that any <em>indirect branch</em> can land. On Native
Client for ARM, <em>indirect branch</em> can target any address that has its
bottom four bits clear—any address that’s <code>0 mod 16</code>. We call these
16-byte chunks of code “bundles”. The validator makes sure that no
pseudo-instruction straddles a bundle boundary. Compilers must pad with
<code>nop</code> to ensure that every pseudo-instruction fits entirely inside one
bundle.</p>
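<p>As a rough illustration of the padding rule, the following Python sketch
computes how many <code>nop</code> instructions a compiler would have to emit
before a pseudo-instruction. It assumes 4-byte-aligned addresses, and
<code>nops_needed</code> is a hypothetical helper, not a function from the
Native Client toolchain.</p>

```python
BUNDLE_SIZE = 16      # bytes per bundle
INSTRUCTION_SIZE = 4  # every A32 instruction is 4 bytes


def nops_needed(address, instruction_count):
    """Number of nops to emit at `address` so that a pseudo-instruction made
    of `instruction_count` instructions fits entirely inside one bundle."""
    size = instruction_count * INSTRUCTION_SIZE
    if size > BUNDLE_SIZE:
        raise ValueError("pseudo-instruction is larger than a bundle")
    offset = address % BUNDLE_SIZE
    if offset + size <= BUNDLE_SIZE:
        return 0  # already fits in the current bundle
    # Otherwise pad up to the start of the next bundle.
    return (BUNDLE_SIZE - offset) // INSTRUCTION_SIZE
```

<p>For example, a two-instruction pseudo-instruction that would start in the
last slot of a bundle (offset 12) needs one <code>nop</code> so that it begins
at the next bundle boundary instead.</p>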
<p>Here is the <em>indirect branch</em> pseudo-instruction. As you can see, it
clears the top two and bottom four bits of the address:</p>
<pre>
bic rA, #0xC000000F
bx rA
</pre>
<p>This particular pseudo-instruction (a <code>bic</code> followed by a <code>bx</code>) is
used for computed jumps in switch tables and returning from functions,
among other uses. Recall that, under ARM’s modified immediate rules, we
can fit the constant <code>0xC000000F</code> into the <code>bic</code> instruction’s
immediate field: <code>0xC000000F</code> is the 8-bit constant <code>0xFC</code>, rotated
right by 4 bits.</p>
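<p>A quick Python model shows what this mask does to a branch target. As with
the other sketches on this page, <code>indirect_branch_target</code> is an
illustrative name rather than anything from the Native Client sources.</p>

```python
INDIRECT_BRANCH_MASK = 0xC000000F  # top two bits plus bottom four bits


def indirect_branch_target(r_a):
    """Address that `bx rA` jumps to once the guard `bic` has run."""
    return r_a & ~INDIRECT_BRANCH_MASK & 0xFFFFFFFF


target = indirect_branch_target(0x8000100D)
```

<p>Whatever value the program places in <code>rA</code>, the result lies in
the lowest gigabyte and is <code>0 mod 16</code>, so it always names the start
of a bundle.</p>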
<p>The other useful variant is the <em>indirect branch-with-link</em>, which is
the ARM equivalent to <em>call</em>:</p>
<pre>
bic rA, #0xC000000F
blx rA
</pre>
<p>This is used for indirect function calls—commonly seen in C++ programs
as virtual calls, but also for calling function pointers in C.</p>
<p>Note that both <em>indirect branch</em> pseudo-instructions use <code>bic</code>, rather
than the <code>tst</code> instruction we allow for <em>load</em> and <em>store</em>. There are
two reasons for this:</p>
<ol class="arabic simple">
<li>Conditional <em>branch</em> is very common. Much more common than
conditional <em>load</em> and <em>store</em>. If we supported an alternative
<code>tst</code>-based sequence for <em>branch</em>, it would be rare.</li>
<li>There’s no performance benefit to using <code>tst</code> here on modern ARM
chips. <em>Branch</em> consumes its operands later in the pipeline than
<em>load</em> and <em>store</em> (since it doesn’t have to generate an address,
etc.), so this sequence doesn’t stall.</li>
</ol>
<aside class="note">
<p>At this point astute readers are wondering what the <code>x</code> in <code>bx</code>
and <code>blx</code> means. We told you it stood for “exchange”, but exchange
to what? ARM, for all the reduced-ness of its instruction set, can
change execution mode from A32 (ARM) to T32 (Thumb) and back with
these <em>branch</em> instructions, called <em>interworking branch</em>. Recall that
A32 instructions are 32 bits wide, and T32 instructions are a mix of
16-bit and 32-bit wide instructions. The destination address given to a
<em>branch</em> therefore cannot sensibly have its bottom bit set in either
instruction set: that would be an unaligned instruction in both cases,
and ARM simply doesn’t support this. The bottom bit for the <em>indirect
branch</em> was therefore cleverly recycled by the ARM architecture to
mean “switch to T32 mode” when set!</p>
<p>As you’ve figured out by now, Native Client’s sandbox wouldn’t be very
happy if A32 instructions were to be executed as T32 instructions: who
knows what they correspond to? A malicious person could craft valid
A32 code that’s actually very naughty T32 code, somewhat like forming
a sentence that happens to be valid in English and French but with
completely different meanings, complimenting the reader in one
language and insulting them in the other.</p>
<p>Fortunately, the bundle alignment restrictions of
the Native Client sandbox already take care of making this travesty
impossible: by masking off the bottom 4 bits of the destination the
interworking nature of ARM’s <em>indirect branch</em> is completely avoided.</p>
</aside>
<h5 id="call-and-return"><em>Call</em> and <em>Return</em></h5>
<p>On ARM, there is no <em>call</em> or <em>return</em> instruction. A <em>call</em> is simply a
<em>branch</em> that just happens to load a return address into <code>lr</code>, the link
register. If the called function is a leaf (that is, if it calls no
other functions before returning), it simply branches to the address
stored in <code>lr</code> to <em>return</em> to its caller:</p>
<pre>
bic lr, #0xC000000F
bx lr
</pre>
<p>If the function called other functions, however, it had to spill <code>lr</code>
onto the stack. On x86, this is done implicitly, but it is explicit on
ARM:</p>
<pre>
push { lr }
; Some code here...
pop { lr }
bic lr, #0xC000000F
bx lr
</pre>
<p>There are two things to note about this code.</p>
<ol class="arabic simple">
<li>As we mentioned before, we don’t allow arbitrary instructions to
write to the Program Counter, <code>pc</code>. Thus, while a traditional ARM
program might have popped directly into <code>pc</code> to end the function,
we require a pop into a register, followed by a pseudo-instruction.</li>
<li>Function returns really are just <em>indirect branch</em>, with the same
restrictions. This means that functions can only return to addresses
that are bundle-aligned: <code>0 mod 16</code>.</li>
</ol>
<p>The implication here is that a <em>call</em> (the <em>branch</em> that enters
functions) must be placed at the end of the bundle, so that the return
address it generates is <code>0 mod 16</code>. Otherwise, when we clear the
bottom four bits, the program would enter an infinite loop! (Native
Client doesn’t try to prevent infinite loops, but the validator actually
does check the alignment of calls. This is because, when we were writing
the compiler, it was annoying to find out our calls were in the wrong
place by having the program run forever!)</p>
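<p>The infinite loop is easy to reproduce in a small Python model. This is a
sketch with made-up helper names; it relies only on the facts that a
<em>branch-with-link</em> stores the address of the following instruction in
<code>lr</code> and that the return sequence masks <code>lr</code> with
<code>0xC000000F</code>.</p>

```python
BUNDLE_SIZE = 16
INSTRUCTION_SIZE = 4
RETURN_MASK = 0xC000000F


def return_address(call_address):
    """Value placed in lr by a branch-with-link located at `call_address`."""
    return call_address + INSTRUCTION_SIZE


def masked_return_target(call_address):
    """Where `bic lr, #0xC000000F; bx lr` actually lands."""
    return return_address(call_address) & ~RETURN_MASK & 0xFFFFFFFF


def call_is_correctly_placed(call_address):
    """True if the call sits in the last slot of its bundle."""
    return return_address(call_address) % BUNDLE_SIZE == 0
```

<p>A call at <code>0x100C</code> returns to <code>0x1010</code>, as intended.
A call at <code>0x1004</code> would "return" to <code>0x1000</code>, which is
before the call itself, so the call runs again and again.</p>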
<aside class="note">
Properly balancing the CPU’s <em>call</em>/<em>return</em> actually allows it to
perform much better by allowing it to speculatively execute the return
address’ code. For more information on ARM’s <em>call</em>/<em>return</em> stack see
ARM’s technical reference manual.
</aside>
<h4 id="literal-pools-and-data-bundles">Literal Pools and Data Bundles</h4>
<p>In the section where we described the ARM architecture, we mentioned
ARM’s unusual immediate forms. To restate:</p>
<ul class="small-gap">
<li>ARM instructions are fixed-length, 32 bits each, so we can’t have an
instruction that includes an arbitrary 32-bit constant.</li>
<li>Many ARM instructions can include a modified immediate constant, which
is flexible, but limited.</li>
<li>For any other value (particularly addresses), ARM programs explicitly
load constants from inside the code itself.</li>
</ul>
<aside class="note">
ARMv7 introduces some instructions, <code>movw</code> and <code>movt</code>, that try to
address this by letting us directly load larger constants. Our
toolchain uses this capability in some cases.
</aside>
<p>Here’s a typical example of the use of a literal pool. ARM assemblers
typically hide the details—this is the sort of code you’d see produced
by a disassembler, but with more comments.</p>
<pre>
; C equivalent: "table[3] = 4"
; 'table' is a static array of bytes.
ldr r0, [pc, #124] ; Load the address of the 'table',
; "124" is the offset from here
; to the constant below.
add r0, #3 ; Add the immediate array index.
mov r1, #4 ; Get the constant '4' into a register.
bic r0, #0xC0000000 ; Mask our array address.
strb r1, [r0] ; Store one byte.
; ...
.word table ; Constant referenced above.
</pre>
<p>Because <code>table</code> is a static array, the compiler knew its address at
compile-time, but the address didn’t fit in a modified immediate. (Most
don’t.) So, instead of loading an immediate into <code>r0</code> with a <code>mov</code>,
we stashed the address in the code, generated its address using <code>pc</code>,
and loaded the constant. ARM compilers will typically group all the
embedded data together into a literal pool. These typically live just
past the end of functions, where they won’t be executed.</p>
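<p>To see where a <code>pc</code>-relative load actually reads from, recall
that in A32 code reading <code>pc</code> yields the address of the current
instruction plus 8. The Python sketch below applies that rule; the
instruction address <code>0x20000</code> is an arbitrary value chosen for the
example, and <code>literal_address</code> is an illustrative name.</p>

```python
PC_READ_OFFSET = 8  # in A32 code, pc reads as the current instruction + 8


def literal_address(instruction_address, displacement):
    """Address accessed by `ldr rD, [pc, #displacement]`."""
    return instruction_address + PC_READ_OFFSET + displacement


address = literal_address(0x20000, 124)
```

<p>With the displacement of 124 used in the listing, a load at
<code>0x20000</code> fetches its constant from <code>0x20084</code>.</p>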
<p>This is an important trick in ARM code, so it’s important to support it
in Native Client... but there’s a potential flaw. If we let programs
contain arbitrary data, mingled in with the code, couldn’t they hide
malicious instructions this way?</p>
<p>The answer is no, because the validator disassembles the entire
executable region of the program, without regard to whether the
programmer said a certain chunk was code or data. But this brings the
opposite problem: what if the program needs to contain a certain
constant that just happens to encode a malicious instruction? We want
to allow this, but we have to be certain it will never be executed as
code!</p>
<h5 id="data-bundles-to-the-rescue">Data Bundles to the Rescue</h5>
<p>As we discussed in the last section, ARM code in Native Client is
structured in 16-byte bundles. We allow literal pools by putting them in
special bundles, called data bundles. Each data bundle can contain 12
bytes of arbitrary data, and the program can have as many data bundles
as it likes.</p>
<p>Each data bundle starts with a breakpoint instruction, <code>bkpt</code>. This
way, if an <em>indirect branch</em> tries to enter the data bundle, the process
will take a fault and the trusted runtime will intervene (by terminating
the program). For example:</p>
<pre>
.p2align 4
bkpt #0x5BE0 ; Must be aligned 0 mod 16!
.word 0xDEADBEEF ; Arbitrary constants are A-OK.
svc #30 ; Trying to make a syscall? OK!
str r0, [r1] ; Unmasked stores are fine too.
</pre>
<p>So, we have a way for programs to create an arbitrary, even dangerous,
chunk of data within their code. We can prevent <em>indirect branch</em> from
entering it. We can also prevent fall-through from the code just before
it, by the <code>bkpt</code>. But what about <em>direct branch</em> straight into the
middle?</p>
<p>The validator detects all data bundles (because this <code>bkpt</code> has a
special encoding) and marks them as off-limits for <em>direct branch</em>. If
it finds a <em>direct branch</em> into a data bundle, the entire program is
rejected as unsafe. Because <em>direct branch</em> cannot be modified at
runtime, the data bundles cannot be executed.</p>
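<p>The check described above can be sketched in a few lines of Python. This
is a simplified model, not the real validator: it only decodes the A32
<code>b</code>/<code>bl</code> encoding, treats the code as a list of 32-bit words loaded
at a hypothetical base address, and every name in it is invented for
illustration.</p>
<pre>
BUNDLE_BYTES = 16
DATA_BUNDLE_MARKER = 0xE125BE70  # A32 encoding of `bkpt #0x5BE0`


def data_bundle_starts(words, base):
    """Start addresses of the bundles that begin with the marker."""
    return {
        base + 4 * i
        for i in range(0, len(words), BUNDLE_BYTES // 4)
        if words[i] == DATA_BUNDLE_MARKER
    }


def direct_branch_target(word, addr):
    """Target of an A32 `b`/`bl` at addr, or None for anything else."""
    is_branch = ((word >> 25) & 0b111) == 0b101
    if not is_branch or (word >> 28) == 0b1111:
        return None
    imm24 = word & 0xFFFFFF
    if imm24 >= 0x800000:  # sign-extend the 24-bit field
        imm24 -= 0x1000000
    return (addr + 8 + 4 * imm24) & 0xFFFFFFFF


def branches_into_data(words, base):
    """Addresses of direct branches whose target lies in a data bundle."""
    data = data_bundle_starts(words, base)
    bad = []
    for i, word in enumerate(words):
        addr = base + 4 * i
        if addr - addr % BUNDLE_BYTES in data:
            continue  # data bundle contents are not checked as code
        target = direct_branch_target(word, addr)
        if target is not None and target - target % BUNDLE_BYTES in data:
            bad.append(addr)
    return bad


NOP = 0xE1A00000  # mov r0, r0
code = [
    0xEA000003, NOP, NOP, NOP,             # b base+0x14: into the data bundle
    DATA_BUNDLE_MARKER, 0xDEADBEEF, 0, 0,  # data bundle at base+0x10
]
print([hex(a) for a in branches_into_data(code, 0x20000)])  # ['0x20000']
</pre>
<p>In this model a single offending branch is enough to reject the whole
program, which mirrors the all-or-nothing behavior described above.</p>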
<aside class="note">
Clever readers may wonder: why use <code>bkpt #0x5BE0</code>? That seems
awfully specific when you just need a special “roadblock” instruction!
Quite true, young Padawan! It happens that this odd <code>bkpt</code>
instruction is encoded as <code>0xE125BE70</code> in A32. In T32, the
<code>bkpt</code> instruction is encoded as <code>0xBExx</code> (where <code>xx</code> could be
any 8-bit immediate, say <code>0x70</code>), and <code>0xE125</code> encodes the <em>branch</em>
instruction <code>b.n #0x250</code>. The special roadblock instruction
therefore doubles as a roadblock in T32, if anything were to go so
awry that we tried to execute it as a T32 instruction! Much defense,
such depth, wow!
</aside>
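<p>The encodings quoted in the note can be double-checked with a little
arithmetic. The Python sketch below builds the A32 <code>bkpt</code> encoding from
its 16-bit immediate, then splits the little-endian word into the two
halfwords a T32 decoder would see. The helper name is invented for this
example.</p>
<pre>
def encode_a32_bkpt(imm16):
    """A32 `bkpt #imm16` with the always (0b1110) condition."""
    imm12 = imm16 >> 4    # upper 12 bits land in bits 19..8
    imm4 = imm16 & 0xF    # lower 4 bits land in bits 3..0
    return 0xE1200070 | (imm12 * 0x100) | imm4


word = encode_a32_bkpt(0x5BE0)
print(hex(word))  # 0xe125be70

# In little-endian memory, T32 would read the low halfword first.
raw = word.to_bytes(4, "little")
first = int.from_bytes(raw[0:2], "little")
second = int.from_bytes(raw[2:4], "little")
print(hex(first), hex(second))  # 0xbe70 0xe125

# T32 `bkpt` is 0xBExx; 16-bit `b.n` has 0b11100 in bits 15..11.
is_t32_bkpt = (first & 0xFF00) == 0xBE00
is_t32_branch = (second >> 11) == 0b11100

# The branch sits 2 bytes into the word; T32 pc reads as that address + 4.
branch_target = 2 + 4 + 2 * (second & 0x7FF)
print(is_t32_bkpt, is_t32_branch, hex(branch_target))  # True True 0x250
</pre>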
<h3 id="trampolines-and-memory-layout">Trampolines and Memory Layout</h3>
<p>So far, the rules we’ve described make for boring programs: they can’t
communicate with the outside world!</p>
<ul class="small-gap">
<li>The program can’t call an external library, or the operating system,
even to do something simple like draw some pixels on the screen.</li>
<li>It also can’t read or write memory outside of its dedicated sandbox,
so communicating that way is right out.</li>
</ul>
<p>We fix this by allowing the untrusted program to call into the trusted
runtime using a trampoline. A trampoline is simply a short stretch of
code, placed by the trusted runtime at a known location within the
sandbox, that is permitted to do things the untrusted program can’t.</p>
<p>Even though trampolines are inside the sandbox, the untrusted program
can’t modify them: the trusted runtime marks them read-only. It also
can’t do anything clever with the special instructions inside the
trampoline—for example, call it at a slightly offset address to bypass
some checks—because the validator only allows trampolines to be
reached by <em>indirect branch</em> (or <em>branch-with-link</em>). We structure the
trampolines carefully so that they’re safe to enter at any <code>0 mod 16</code>
address.</p>
<p>The validator can detect attempts to use the trampolines because they’re
loaded at a fixed location in memory. Let’s look at the memory map of
the Native Client sandbox.</p>
<h4 id="memory-map">Memory Map</h4>
<p>The ARM sandbox is always at virtual address <code>0</code>, and is exactly 1GiB
in size. This includes the untrusted program’s code and data, the
trampolines, and a small guard region to detect null pointer
dereferences. In practice, the sandbox reserves a bit more address
space than this, because it needs additional guard regions at
either end.</p>
<table border="1" class="docutils">
<colgroup>
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Address</th>
<th class="head">Size</th>
<th class="head">Name</th>
<th class="head">Purpose</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><code>-0x2000</code></td>
<td>8KiB</td>
<td>Bottom Guard</td>
<td>Keeps negative-displacement <em>load</em> or <em>store</em> from escaping.</td>
</tr>
<tr class="row-odd"><td><code>0</code></td>
<td>64KiB</td>
<td>Null Guard</td>
<td>Catches null pointer dereferences, guards against kernel exploits.</td>
</tr>
<tr class="row-even"><td><code>0x10000</code></td>
<td>64KiB</td>
<td>Trampolines</td>
<td>Up to 2048 unique syscall entry points.</td>
</tr>
<tr class="row-odd"><td><code>0x20000</code></td>
<td>~1GiB</td>
<td>Untrusted Sandbox</td>
<td>Contains untrusted code, followed by its heap/stack/memory.</td>
</tr>
<tr class="row-even"><td><code>0x40000000</code></td>
<td>8KiB</td>
<td>Top Guard</td>
<td>Keeps positive-displacement <em>load</em> or <em>store</em> from escaping.</td>
</tr>
</tbody>
</table>
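<p>For readers who prefer code to tables, the map can be modeled as a
lookup. The Python sketch below is only an illustration: the names are
invented, and <code>mask_address</code> models the <code>bic r0, #0xC0000000</code>
from the literal pool example earlier on this page.</p>
<pre>
KIB = 1024

# (start, size, name) rows from the table above. The bottom guard is
# written as a negative address, as in the table.
REGIONS = [
    (-0x2000, 8 * KIB, "Bottom Guard"),
    (0x0, 64 * KIB, "Null Guard"),
    (0x10000, 64 * KIB, "Trampolines"),
    (0x20000, 0x40000000 - 0x20000, "Untrusted Sandbox"),
    (0x40000000, 8 * KIB, "Top Guard"),
]


def region_of(addr):
    """Name of the region containing addr, or None if it is not in the map."""
    for start, size, name in REGIONS:
        if start + size > addr >= start:
            return name
    return None


def mask_address(addr):
    """Model of `bic rX, #0xC0000000`: clear the top two address bits."""
    return addr & 0x3FFFFFFF


print(region_of(0x10020))                   # Trampolines
print(region_of(mask_address(0xDEADBEEF)))  # Untrusted Sandbox
print(region_of(0x40000000 + 100))          # Top Guard
</pre>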
<p>Within the trampolines, the untrusted program can call any address
that’s <code>0 mod 16</code>. However, only even slots are used, so useful
trampolines are always <code>0 mod 32</code>. If the program calls an odd slot,
it will fault, and the trusted runtime will shut it down.</p>
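<p>The slot arithmetic is easy to check. The following Python sketch uses
the base address and size from the table above; the function and
constant names are invented for this example.</p>
<pre>
TRAMPOLINE_BASE = 0x10000
TRAMPOLINE_BYTES = 64 * 1024
SLOT_BYTES = 16  # one bundle


def classify_call_target(addr):
    """Describe what happens when untrusted code calls addr."""
    if not (TRAMPOLINE_BASE + TRAMPOLINE_BYTES > addr >= TRAMPOLINE_BASE):
        return "not a trampoline address"
    if addr % SLOT_BYTES != 0:
        return "unreachable: call targets are always 0 mod 16"
    slot = (addr - TRAMPOLINE_BASE) // SLOT_BYTES
    return "usable trampoline" if slot % 2 == 0 else "odd slot: faults"


usable_slots = TRAMPOLINE_BYTES // (2 * SLOT_BYTES)
print(usable_slots)                   # 2048
print(classify_call_target(0x10000))  # usable trampoline
print(classify_call_target(0x10010))  # odd slot: faults
print(classify_call_target(0x10020))  # usable trampoline
</pre>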
<aside class="note">
This is a bit of speculative flexibility. While the current bundle
size of Native Client on ARM is 16 bytes, we’ve considered the
possibility of optional 32-byte bundles, to enable certain compiler
improvements. While this option isn’t available to untrusted programs
today, we’re trying to keep the system “32-byte clean”.
</aside>
<h4 id="inside-a-trampoline">Inside a Trampoline</h4>
<p>When we introduced trampolines, we mentioned that they can do things
that untrusted programs can’t. To be more specific, trampolines can jump
to locations outside the sandbox. On ARM, this is all they do. Here’s a
typical trampoline fragment on ARM:</p>
<pre>
; Even trampoline bundle:
push { r0-r3 }    ; Save arguments that may be in registers.
push { lr }       ; Save the untrusted return address,
                  ; separate step because it must be on top.
ldr r0, [pc, #4]  ; Load the destination address from
                  ; the next bundle.
blx r0            ; Go!
; The odd trampoline bundle that immediately follows:
bkpt #0x5BE0      ; Prevent entry to this data bundle.
.word address_of_routine
</pre>
<p>The only odd thing here is that we push the incoming value of <code>lr</code>,
and then use <code>blx</code>—not <code>bx</code>—to escape the sandbox. This is
because, in practice, all trampolines jump to the same routine in the
trusted runtime, called the syscall hook. It uses the return address
produced by the final <code>blx</code> instruction to determine which trampoline
was called.</p>
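<p>This page doesn’t show the syscall hook itself, so the Python sketch
below is only one plausible way to recover the trampoline number. It
assumes the layout of the fragment above, where <code>blx</code> is the last
instruction of the even bundle and so leaves <code>lr</code> pointing at the
start of the odd bundle. All names are hypothetical.</p>
<pre>
TRAMPOLINE_BASE = 0x10000
TRAMPOLINE_BYTES = 64 * 1024
BUNDLE_BYTES = 16


def trampoline_index(return_address):
    """Map the lr value left by a trampoline's `blx` to a trampoline number."""
    entry = return_address - BUNDLE_BYTES  # start of the even bundle
    offset = entry - TRAMPOLINE_BASE
    in_range = TRAMPOLINE_BYTES > offset >= 0
    if not in_range or offset % (2 * BUNDLE_BYTES) != 0:
        raise ValueError(f"{return_address:#x} is not a trampoline return address")
    return offset // (2 * BUNDLE_BYTES)


print(trampoline_index(0x10010))  # 0
print(trampoline_index(0x10030))  # 1
print(trampoline_index(0x1FFF0))  # 2047
</pre>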
<h3 id="loose-ends">Loose Ends</h3>
<h4 id="forbidden-instructions">Forbidden Instructions</h4>
<p>To complete the sandbox, the validator ensures that the program does not
try to use certain forbidden instructions.</p>
<ul class="small-gap">
<li>We forbid instructions that directly interact with the operating
system by going around the trusted runtime. We prevent this to limit
the functionality of the untrusted program, and to ensure portability
across operating systems.</li>
<li>We forbid instructions that change the processor’s execution mode to
Thumb, ThumbEE, or Jazelle. This would cause the code to be
interpreted differently than the validator’s original 32-bit ARM
disassembly, so the validator results might be invalidated.</li>
<li>We forbid instructions that aren’t available to user code (i.e. have
to be used by an operating system kernel). This is purely out of
paranoia, because the hardware should prevent the instructions from
working. Essentially, we consider it “suspicious” if a program
contains these instructions—it might be trying to exploit a hardware
bug.</li>
<li>We forbid instructions, or variants of instructions, that are
implementation-defined (“unpredictable”) or deprecated in the ARMv7-A
architecture manual.</li>
<li>Finally, we forbid a small number of instructions, such as <code>setend</code>,
purely out of paranoia. It’s easier to loosen the validator’s
restrictions than to tighten them, so we err on the side of rejecting
safe instructions.</li>
</ul>
<p>If an instruction can’t be decoded at all within the ARMv7-A instruction
set specification, it is forbidden.</p>
<aside class="note">
<p>Here is a list of instructions currently forbidden for security
reasons (that is, excluding deprecated or undefined instructions):</p>
<ul class="small-gap">
<li><code>BLX</code> (immediate): always changes to Thumb mode.</li>
<li><code>BXJ</code>: always changes to Jazelle mode.</li>
<li><code>CPS</code>: not available to user code.</li>
<li><code>LDM</code>, exception return version: not available to user code.</li>
<li><code>LDM</code>, kernel version: not available to user code.</li>
<li><code>LDR*T</code> (unprivileged load operations): theoretically harmless,
but suspicious when found in user code. Use <code>LDR</code> instead.</li>
<li><code>MSR</code>, kernel version: not available to user code.</li>
<li><code>RFE</code>: not available to user code.</li>
<li><code>SETEND</code>: theoretically harmless, but suspicious when found in
user code. May make some future validator extensions difficult.</li>
<li><code>SMC</code>: not available to user code.</li>
<li><code>SRS</code>: not available to user code.</li>
<li><code>STM</code>, kernel version: not available to user code.</li>
<li><code>STR*T</code> (unprivileged store operations): theoretically harmless,
but suspicious when found in user code. Use <code>STR</code> instead.</li>
<li><code>SVC</code>/<code>SWI</code>: allows direct operating system interaction.</li>
<li>Any unassigned hint instruction: difficult to reason about, so
treated as suspicious.</li>
</ul>
<p>More details are available in the <a class="reference external" href="http://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/validator_arm/armv7.table">ARMv7 instruction table definition</a>.</p>
</aside>
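<p>To give a feel for what such a check involves, here is a Python sketch
that recognizes four of the instructions listed above by their A32 bit
patterns. It is a tiny illustrative subset with invented names, not the
validator’s actual logic, which is driven by the instruction table.</p>
<pre>
def forbidden_reason(word):
    """Name the forbidden instruction word encodes, for the four modeled here.

    Returns None for everything else, including forbidden instructions
    that this sketch does not model.
    """
    cond = word >> 28
    if cond == 0b1111:  # unconditional instruction space
        if (word & 0x0E000000) == 0x0A000000:
            return "BLX (immediate)"
        if (word & 0xFFFFFDFF) == 0xF1010000:
            return "SETEND"
        return None
    if (word & 0x0F000000) == 0x0F000000:
        return "SVC"
    if (word & 0x0FFFFFF0) == 0x012FFF20:
        return "BXJ"
    return None


print(forbidden_reason(0xEF00001E))  # SVC  (svc #30)
print(forbidden_reason(0xFA000000))  # BLX (immediate)
print(forbidden_reason(0xE12FFF21))  # BXJ  (bxj r1)
print(forbidden_reason(0xF1010200))  # SETEND  (setend be)
print(forbidden_reason(0xE12FFF11))  # None  (bx r1)
</pre>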
<h4 id="coprocessors">Coprocessors</h4>
<p>ARM has traditionally added new instruction set features through
coprocessors. Coprocessors are accessed through a small set of
instructions, and often have their own register files. Floating point
and the NEON vector extensions are both implemented as coprocessors, as
is the MMU.</p>
<p>We’re confident that the side-effects of coprocessors in slots 10 and 11
(that is, floating point, NEON, etc.) are well-understood. These are in
the coprocessor space reserved by ARM Ltd. for their own extensions
(<code>CP8</code>–<code>CP15</code>), and are unlikely to change significantly. So, we
allow untrusted code to use coprocessors 10 and 11, and we mandate the
presence of at least VFPv3 and NEON/AdvancedSIMD. Multiprocessor
Extension, VFPv4, FP16 and other extensions are allowed but not
required, and may fail on processors that do not support them. It is
therefore the program’s responsibility to check that they are available
before executing them.</p>
<p>We don’t allow access to any other ARM-reserved coprocessor
(<code>CP8</code>–<code>CP9</code> or <code>CP12</code>–<code>CP15</code>). It’s possible that read
access to <code>CP15</code> might be useful, and we might allow it in the
future—but again, it’s easier to loosen the restrictions than tighten
them, so we ban it for now.</p>
<p>We do not, and probably never will, allow access to the vendor-specific
coprocessor space, <code>CP0</code>–<code>CP7</code>. We’re simply not confident in our
ability to model the operations on these coprocessors, given that
vendors often leave them poorly specified. Unfortunately this eliminates
some legacy floating point and vector implementations, but these are
superseded on ARMv7-A parts anyway.</p>
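<p>The coprocessor rules above boil down to a small decision function.
The Python sketch below restates them; the function name and the result
strings are invented for this example.</p>
<pre>
def coprocessor_policy(cp):
    """Summarize the sandbox's stance on coprocessor number cp (0 to 15)."""
    if cp not in range(16):
        raise ValueError("coprocessor numbers run from 0 to 15")
    if cp in (10, 11):
        return "allowed"  # floating point and NEON
    if cp >= 8:
        return "forbidden: ARM-reserved"
    return "forbidden: vendor-specific"


for cp in (7, 9, 10, 11, 15):
    print(cp, coprocessor_policy(cp))
</pre>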
<h4 id="validator-code">Validator Code</h4>
<p>By now you’re itching to see the sandbox validator’s code and dissect
it. You’ll have a disappointing read: at less than 500 lines of code,
<a class="reference external" href="http://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/validator_arm/validator.cc">validator.cc</a>
is quite simple to understand and much shorter than this document. It’s
of course dependent on the <a class="reference external" href="http://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/validator_arm/armv7.table">ARMv7 instruction table definition</a>,
which teaches it about the ARMv7 instruction set.</p>
</section>
{{/partials.standard_nacl_article}}