Reproducible Firmware Builds

If you have ever worked on a large-scale embedded project, you have probably run into situations where the build system or binary behaves differently depending on where it was compiled. For example, the project might fail to compile on one computer, while on another it compiles but crashes on boot!

In this article we will discuss what a Reproducible Build is, walk through the process of updating a firmware project so the build is reproducible, and explore how we can leverage what we have learned for other aspects of development.

Reproducible Builds

A build is said to be “Reproducible” when, given a set of source code, the exact same binary can be generated by different people on different computers. Specifically, this means:

  • code size and placement of sections are the same
  • filepaths emitted in the binary are the same
  • when a binary diff is computed between the build artifacts (e.g. the ELF or .bin), there will be zero differences

Many open source projects (such as Debian GNU/Linux) are working towards making the entire build reproducible [1].

Why does it matter if the build is Reproducible?

A common problem that arises when a build is not reproducible is the binary behaving differently for developers on the team. Debugging and supporting these kinds of issues can be a significant time sink. It’s hard to debug a problem when you can’t reproduce it yourself! Making builds reproducible can eliminate this class of problem and is the primary reason I think reproducible builds are useful for embedded projects. We’ll explore an example of this type of problem in the next section.

Having reproducible builds can also be helpful when trying to recreate the debug symbols for an image where the symbols were not preserved or to be able to re-generate a build that hasn’t been updated in a while (e.g. an image used in the factory).

Open source projects also strive for reproducible builds for security reasons – to ensure no backdoors were added into the final binary as part of the compilation process [2] [3].

Making a Build Reproducible

In this article we will examine a very simple app running FreeRTOS which passes a message between two tasks. Every time a message is sent, a 64-bit counter is incremented. We will compile the binary from two different directories. If the build is reproducible, we should find the binaries match and behave the same when flashed on target. However, what we will find out is that one of the binaries crashes on boot and one does not!

Example Project Setup

In this article we will be using:

  • the GNU Arm Embedded Toolchain 8-2019-q3-update [4] for our compiler
  • the nRF52840-DK [5] (ARM Cortex-M4F) as our development board
  • SEGGER JLinkGDBServer [6] as our GDB Server

Enforcing a Compiler Version

A first step for ensuring a build will be reproducible is to enforce that everyone use the same compiler version. It’s very easy to add a check to the build system for this. Here’s an example of what you could add to a Makefile to accomplish this:

GCC_COMPILER ?= arm-none-eabi-gcc
GCC_VERSION := $(strip $(shell $(GCC_COMPILER) -dumpversion))
EXPECTED_GCC_VERSION := 8.3.1
ifneq ($(GCC_VERSION),$(EXPECTED_GCC_VERSION))
$(error Examples were all compiled against $(EXPECTED_GCC_VERSION). You are using $(GCC_VERSION))
endif

Building the Project

Let’s clone the Interrupt repository and build the project:

$ mkdir -p /private/tmp/repos
$ cd /private/tmp/repos && git clone git@github.com:memfault/interrupt.git
$ cd interrupt/example/reproducible-build/
$ make
Compiling main.c
Compiling startup.c
Compiling minimal_heap.c
Compiling freertos_kernel/tasks.c
Compiling freertos_kernel/queue.c
Compiling freertos_kernel/list.c
Compiling freertos_kernel/timers.c
Compiling freertos_kernel/portable/GCC/ARM_CM4F/port.c
Compiling freertos_kernel/portable/MemMang/heap_1.c
Linking library
Generated build/nrf52.elf

Flashing the Project

In one terminal, you will need to start a GDB server:

$ JLinkGDBServer -if swd -device nRF52840_xxAA

In another terminal, start GDB and type continue to start running the application:

$ arm-none-eabi-gdb-py --eval-command="target remote localhost:2331" --ex="mon reset" --ex="load" --ex="mon reset" --se=build/nrf52.elf
[...]
Resetting target
Loading section .interrupts, size 0x48 lma 0x0
Loading section .gnu_build_id, size 0x24 lma 0x48
Loading section .text, size 0x17e8 lma 0x70
Loading section .data, size 0x1ec lma 0x1858
Start address 0x70, load size 6720
Transfer rate: 1640 KB/sec, 1680 bytes/write.
Resetting target
(gdb) continue
[...]

Great, looks like the app is happily running!

Building Project In Another Directory

Instead of cloning the repo in the /private/tmp/repos/ directory, let’s walk through the same steps but use the /private/tmp/dev/ directory instead.

$ mkdir -p /private/tmp/dev
$ cd /private/tmp/dev && git clone git@github.com:memfault/interrupt.git
$ cd interrupt/example/reproducible-build/
$ make
[...]
Linking library
Generated build/nrf52.elf
$ arm-none-eabi-gdb-py --eval-command="target remote localhost:2331" --ex="mon reset" --ex="load" --ex="mon reset" --se=build/nrf52.elf
[...]
(gdb) continue
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000014c in my_fault_handler_c (frame=0x200007a8 <ucHeap+1140>) at /private/tmp/dev/interrupt/example/reproducible-build/startup.c:87
87    HALT_IF_DEBUGGING();

Now when we start the application, we crash on boot. How strange!

Why does only one of the builds crash?!

A full discussion of how to debug the crash is outside the scope of this article but can be found in this post.

If we examine the Configurable Fault Status Register (CFSR), we can see that a UsageFault has taken place:

(gdb) svd SCB CFSR_UFSR_BFSR_MMFSR
Fields in SCB CFSR_UFSR_BFSR_MMFSR:
[...]
    UNALIGNED:    1  Unaligned access usage fault
[...]
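As a rough sketch of what the svd command is decoding for us: on ARMv7-M the UsageFault Status Register (UFSR) occupies the upper 16 bits of the CFSR, and UNALIGNED is bit 8 of the UFSR, so it is bit 24 of the CFSR. The helper name below is hypothetical and is not part of the example project:

```c
#include <stdbool.h>
#include <stdint.h>

// ARMv7-M layout: UFSR is CFSR[31:16]; UNALIGNED is UFSR bit 8 (CFSR bit 24).
#define CFSR_UFSR_SHIFT 16u
#define UFSR_UNALIGNED_BIT 8u

// Hypothetical helper: returns true when the given CFSR value reports an
// unaligned access UsageFault.
bool cfsr_reports_unaligned_access(uint32_t cfsr) {
  const uint32_t ufsr = cfsr >> CFSR_UFSR_SHIFT;
  return ((ufsr >> UFSR_UNALIGNED_BIT) & 1u) != 0u;
}
```

On target, the value passed in would be read from the CFSR at address 0xE000ED28.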

We can examine the frame captured in the my_fault_handler_c handler and see that the UsageFault was triggered by an unaligned store through a pointer to a 64-bit value.

(gdb) p/a *frame
$5 = {
  r0 = 0x200001ca <s_heap>,
  r1 = 0x1 <g_pfnVectors+1>,
  r2 = 0x200001ca <s_heap>,
  r3 = 0x0 <g_pfnVectors>,
  r12 = 0x0 <g_pfnVectors>,
  lr = 0x89 <prvQueuePingTask+24>,
  return_address = 0x8c <prvQueuePingTask+28>,
  xpsr = 0x81000000
}
(gdb) list *frame->return_address
0x8c is in prvQueuePingTask (/private/tmp/dev/interrupt/example/reproducible-build/main.c:33).
28    xNextWakeTime = xTaskGetTickCount();
29
30    uint64_t *total_queue_sends = minimal_heap_malloc(sizeof(uint64_t));
31
32    while (1) {
==> CRASH WAS HERE:
33      (*total_queue_sends)++;
34      vTaskDelayUntil(&xNextWakeTime, mainQUEUE_SEND_FREQUENCY_MS);
35      xQueueSend(xQueue, total_queue_sends, 0U);
36    }

The pointer which caused the bad store comes from the call to minimal_heap_malloc(). Let’s take a look at that C file:

#include "minimal_heap.h"

#include <stdbool.h>
#include <stdint.h>

#define MINIMAL_HEAP_TOTAL_SIZE 16

typedef struct {
  uint8_t heap[MINIMAL_HEAP_TOTAL_SIZE];
  bool space_free;
} sMinimalHeapContext;

static sMinimalHeapContext s_heap = {
  .space_free = true,
};

void *minimal_heap_malloc(size_t size) {
  if (!s_heap.space_free || size > MINIMAL_HEAP_TOTAL_SIZE) {
    return NULL;
  }
  s_heap.space_free = false;
  return &s_heap.heap[0];
}

void minimal_heap_free(void) {
  s_heap.space_free = true;
}

So allocations are always from the address of s_heap.heap[0]. We can take a look at where that address is in GDB:

(gdb) p &s_heap.heap
$6 = (uint8_t (*)[16]) 0x200001ca <s_heap>

Well, that’s a bug! Our minimal malloc implementation is handing out pointers that are not 8-byte aligned. We can fix the issue by either making heap a uint64_t array or leveraging the aligned attribute (__attribute__((aligned(8)))) to ensure our structure is always aligned on an 8-byte boundary. Let’s at least apply a fix for the bug we encountered:

$ cd /private/tmp/dev/interrupt/example/reproducible-build/
$ git apply 01-fix-alignment.patch
$ cd /private/tmp/repos/interrupt/example/reproducible-build/
$ git apply 01-fix-alignment.patch
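For illustration, here is a minimal sketch of the aligned-attribute variant of the fix. This is an assumption about what such a fix could look like, not the contents of 01-fix-alignment.patch, and the static assertions are an addition that is not in the original file:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MINIMAL_HEAP_TOTAL_SIZE 16

typedef struct {
  // Force the backing storage (and therefore the struct) onto an 8-byte boundary
  uint8_t heap[MINIMAL_HEAP_TOTAL_SIZE] __attribute__((aligned(8)));
  bool space_free;
} sMinimalHeapContext;

// Compile-time tripwires (an addition for this sketch): the build fails if the
// alignment guarantee is ever lost.
_Static_assert(_Alignof(sMinimalHeapContext) >= 8, "heap context must be 8-byte aligned");
_Static_assert(offsetof(sMinimalHeapContext, heap) % 8 == 0, "heap storage must start on an 8-byte boundary");

static sMinimalHeapContext s_heap = {
  .space_free = true,
};

// Hands out the single heap block if it is free and large enough
void *minimal_heap_malloc(size_t size) {
  if (!s_heap.space_free || size > MINIMAL_HEAP_TOTAL_SIZE) {
    return NULL;
  }
  s_heap.space_free = false;
  return &s_heap.heap[0];
}

// Marks the single heap block as available again
void minimal_heap_free(void) {
  s_heap.space_free = true;
}
```

With this layout, the address of s_heap no longer depends on whatever happens to be placed in front of it in .data.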

Comparing the builds

Comparing the symbol address

It’s confusing that one build crashes and the other does not. If the first build did not crash, the s_heap structure must have been word-aligned in that build. We can confirm that with the nm utility, which can be used to inspect symbol information:

$ cd /private/tmp
$ arm-none-eabi-nm -S -l dev/interrupt/example/reproducible-build/build/nrf52.elf | grep s_heap
200001ca 00000011 d s_heap	/private/tmp/dev/interrupt/example/reproducible-build/minimal_heap.c:13
$ arm-none-eabi-nm -S -l repos/interrupt/example/reproducible-build/build/nrf52.elf | grep s_heap
200001d4 00000011 d s_heap	/private/tmp/repos/interrupt/example/reproducible-build/minimal_heap.c:13
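The two addresses reported by nm can be checked with a small helper. This is a sketch with a hypothetical function name. It assumes the compiler implements the 64-bit increment with LDRD/STRD, which on the Cortex-M4 require a word-aligned (4-byte) address:

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: true if addr is a multiple of alignment.
// alignment must be a non-zero power of two.
bool address_is_aligned(uint32_t addr, uint32_t alignment) {
  return (addr & (alignment - 1u)) == 0u;
}
```

Under that assumption, 0x200001ca (the dev build) is not word-aligned and faults, while 0x200001d4 (the repos build) is word-aligned and runs. Neither address is 8-byte aligned, so the build that works does so by luck.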

It appears the addresses are different between the builds. Are the sizes of the builds the same?

Comparing the binary sizes

$ arm-none-eabi-size dev/interrupt/example/reproducible-build/build/nrf52.elf
   text    data     bss     dec     hex	filename
   6228     480   11688   18396    47dc	dev/interrupt/example/reproducible-build/build/nrf52.elf
$ arm-none-eabi-size repos/interrupt/example/reproducible-build/build/nrf52.elf
   text    data     bss     dec     hex	filename
   6228     492   11684   18404    47e4	repos/interrupt/example/reproducible-build/build/nrf52.elf

We can see that the repos build is 8 bytes larger overall. The difference comes from the .data section, which is 12 bytes larger (partly offset by a .bss section that is 4 bytes smaller).

Examining strings in the binary

When sizes differ, one of the first things I like to check is whether there are any differences in the strings within the binary. We can easily examine this with arm-none-eabi-strings. The -d option can be used to only scan data sections in the file (that is, skip over any debug info) and the -n argument can be used to control the minimum string length that will be detected. Let’s try it out!

$ arm-none-eabi-strings -d dev/interrupt/example/reproducible-build/build/nrf52.elf -n 10
/private/tmp/dev/interrupt/example/reproducible-build/freertos_kernel/tasks.c
/private/tmp/dev/interrupt/example/reproducible-build/freertos_kernel/queue.c
/private/tmp/dev/interrupt/example/reproducible-build/freertos_kernel/timers.c
/private/tmp/dev/interrupt/example/reproducible-build/freertos_kernel/portable/GCC/ARM_CM4F/port.c
/private/tmp/dev/interrupt/example/reproducible-build/freertos_kernel/portable/MemMang/heap_1.c
$ arm-none-eabi-strings -d repos/interrupt/example/reproducible-build/build/nrf52.elf -n 10
/private/tmp/repos/interrupt/example/reproducible-build/freertos_kernel/tasks.c
/private/tmp/repos/interrupt/example/reproducible-build/freertos_kernel/queue.c
/private/tmp/repos/interrupt/example/reproducible-build/freertos_kernel/timers.c
/private/tmp/repos/interrupt/example/reproducible-build/freertos_kernel/portable/GCC/ARM_CM4F/port.c
/private/tmp/repos/interrupt/example/reproducible-build/freertos_kernel/portable/MemMang/heap_1.c

So we see the entire filename path has made it into the build!

This generally happens when the __FILE__ macro is used, which is a fairly common occurrence (e.g. a lot of logging implementations and assert handlers use it). The __FILE__ macro expands to “path by which the preprocessor opened the file” [7]. In practice this is the path passed to the compiler when the .c file is compiled. You can try it out yourself by running a simple test.c file through the preprocessor.

// contents of test.c file
const char *my_file_name = __FILE__;

To only invoke the preprocessor we can use the GCC -E argument:

$ cd /private/tmp/
$ arm-none-eabi-gcc -E test.c
[...]
const char *my_file_name = "test.c";
$ arm-none-eabi-gcc -E /private/tmp/test.c
[...]
const char *my_file_name = "/private/tmp/test.c";
$ arm-none-eabi-gcc -E ../tmp/test.c
[...]
const char *my_file_name = "../tmp/test.c";
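To make the connection to assert handlers concrete, here is a hypothetical sketch of the kind of macro that pulls __FILE__ into a binary. The names are made up for illustration and this is not the FreeRTOS configASSERT implementation:

```c
#include <stddef.h>

// Hypothetical bookkeeping for the most recent assert failure
static const char *s_last_assert_file = NULL;
static int s_last_assert_line = 0;

// Records where an assert failed; a real handler would typically halt or reboot
void my_assert_record(const char *file, int line) {
  s_last_assert_file = file;
  s_last_assert_line = line;
}

// __FILE__ expands at every call site to the path that was passed to the
// compiler, so that path is stored as a string literal in the binary.
#define MY_ASSERT(expr)                        \
  do {                                         \
    if (!(expr)) {                             \
      my_assert_record(__FILE__, __LINE__);    \
    }                                          \
  } while (0)

const char *my_assert_last_file(void) { return s_last_assert_file; }
int my_assert_last_line(void) { return s_last_assert_line; }
```

Every source file that uses such a macro contributes its own path string, which matches the one path per FreeRTOS source file we saw in the strings output.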

The paths we see in nrf52.elf suggest our Makefile build commands must be passing absolute paths. Let’s inspect the Makefile:

$(BUILD_DIR)/%.o: $(ROOT_DIR)/%.c Makefile | $(BUILD_DIR) $(DEP_DIR) $(FREERTOS_PORT_ROOT)
    @echo "Compiling $*.c"
    @mkdir -p $(dir $@)
    $(Q) arm-none-eabi-gcc $(DEP_CFLAGS) $(CFLAGS) $(INCLUDE_PATHS) -c -o $@ $<

The C file name comes from $< on the last line, which is Makefile syntax for the first prerequisite of the target [8]. This is indeed an absolute path. Instead, let’s use $* to get the stem that was matched by %, which is relative to the root directory of the repo. Let’s apply the patch to fix that up:

$ cd /private/tmp/dev/interrupt/example/reproducible-build/
$ git apply 02-fix-compiler-invocation.patch
$ cd /private/tmp/repos/interrupt/example/reproducible-build/
$ git apply 02-fix-compiler-invocation.patch

Now let’s look at the string output:

$ arm-none-eabi-strings -d build/nrf52.elf -n 10
freertos_kernel/tasks.c
freertos_kernel/queue.c
freertos_kernel/timers.c
freertos_kernel/portable/GCC/ARM_CM4F/port.c
freertos_kernel/portable/MemMang/heap_1.c

Great! We no longer have absolute paths in the binary.

Comparing the Binary

At this point, what I like to do is diff the final binaries to see if they match:

$ cd /private/tmp
$ diff repos/interrupt/example/reproducible-build/build/nrf52.bin dev/interrupt/example/reproducible-build/build/nrf52.bin
Binary files repos/interrupt/example/reproducible-build/build/nrf52.bin and dev/interrupt/example/reproducible-build/build/nrf52.bin differ

Darn, they still differ. Sometimes a hexdump of the binary itself can give us clues:

$ sdiff -s <(xxd repos/interrupt/example/reproducible-build/build/nrf52.bin) <(xxd dev/interrupt/example/reproducible-build/build/nrf52.bin)
00000050: 0300 0000 474e 5500 4a8c 3651 f710 a3f6  ....GNU.J. |	00000050: 0300 0000 474e 5500 a3f4 c1f5 7500 cddb  ....GNU...
00000060: 63e7 2d9c a9c9 0aa4 38f4 9b54 0000 0000  c.-.....8. |	00000060: 5deb df50 e138 ff64 174c 6e12 0000 0000  ]..P.8.d.L

The “GNU” string in the differing rows stands out. That’s likely from the GNU Build ID we are using in the firmware to uniquely identify the compilation.

The GNU Build ID is a hash (SHA-1 by default) computed over the binary sections as well as the debug sections in an ELF.
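For reference, the Build ID is stored as a standard ELF note: three little-endian 32-bit words (name size, descriptor size, type), then the name “GNU” with its NUL terminator, then the ID bytes. Below is a minimal sketch of a parser for that layout. The function name is hypothetical, and the sketch assumes the buffer already contains the raw contents of the .gnu_build_id section:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NT_GNU_BUILD_ID 3u
#define ELF_NOTE_HEADER_SIZE 12u

// Reads a little-endian 32-bit value from a byte buffer
static uint32_t read_le32(const uint8_t *p) {
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
         ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

// Hypothetical helper: returns a pointer to the Build ID bytes inside note and
// stores their length in *id_len, or returns NULL if the buffer does not hold
// a GNU Build ID note.
const uint8_t *gnu_build_id_from_note(const uint8_t *note, size_t note_len,
                                      size_t *id_len) {
  if (note_len < ELF_NOTE_HEADER_SIZE) {
    return NULL;
  }
  const uint32_t namesz = read_le32(&note[0]);
  const uint32_t descsz = read_le32(&note[4]);
  const uint32_t type = read_le32(&note[8]);

  // The owner name is "GNU" plus a NUL terminator: 4 bytes, no extra padding
  if (type != NT_GNU_BUILD_ID || namesz != 4u) {
    return NULL;
  }
  const size_t remaining = note_len - ELF_NOTE_HEADER_SIZE;
  if (remaining < namesz || (remaining - namesz) < descsz) {
    return NULL;
  }
  if (memcmp(&note[ELF_NOTE_HEADER_SIZE], "GNU", 4) != 0) {
    return NULL;
  }
  *id_len = descsz;
  return &note[ELF_NOTE_HEADER_SIZE + namesz];
}
```

This layout is why the hexdump shows 0300 0000 (the note type) and “GNU” immediately before the bytes that change between builds.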

If we run “strings” on either of the ELF files we will find the debug sections also include absolute paths:

$ arm-none-eabi-strings dev/interrupt/example/reproducible-build/build/nrf52.elf | grep private
[...]
/private/tmp/dev/interrupt/example/reproducible-build
/private/tmp/dev/interrupt/example/reproducible-build
/private/tmp/dev/interrupt/example/reproducible-build/freertos_kernel/include
/private/tmp/dev/interrupt/example/reproducible-build/freertos_kernel/portable/GCC/ARM_CM4F
/private/tmp/dev/interrupt/example/reproducible-build

Fortunately, the GNU toolchain has us covered!

Making the ELF debug info the same

We can use -fdebug-prefix-map=old=new [9] to control what path is used to prefix debug information. We can add the following line to the CFLAGS in the Makefile to make the path relative to the root directory!

CFLAGS += \
[...]
  -fdebug-prefix-map=$(ROOT_DIR)=.

Let’s apply the fix:

$ cd /private/tmp/dev/interrupt/example/reproducible-build/
$ git apply 03-fdebug-prefix-map.patch
$ cd /private/tmp/repos/interrupt/example/reproducible-build/
$ git apply 03-fdebug-prefix-map.patch

Ensuring GDB can find source when using -fdebug-prefix-map

By default, GDB scans the current working directory and the compilation directory to look up source code. If you do not invoke GDB from the root directory of the repo, you will need to use the directory command (or --directory CLI argument) to tell GDB where the root of the source code is:

$ cd /private/tmp/repos/interrupt/example/reproducible-build/build/
$ arm-none-eabi-gdb-py --eval-command="target remote localhost:2331" --ex="mon reset" --ex="load" --ex="mon reset" --se=nrf52.elf
[...]
Reset_Handler () at startup.c:36
36	startup.c: No such file or directory.

===> Fixing source directory
(gdb) directory /private/tmp/repos/interrupt/example/reproducible-build/
Source directories searched: /private/tmp/repos/interrupt/example/reproducible-build:$cdir:$cwd
(gdb) list *Reset_Handler
0x1b0 is in Reset_Handler (startup.c:36).
34
35  void Reset_Handler(void) {
36    prv_cinit();
37
38    main();

After making the changes, let’s recompile the project. Both binaries should now have the same Build ID!

$ make -C repos/interrupt/example/reproducible-build/
[...]
$ make -C dev/interrupt/example/reproducible-build/
[...]

$ arm-none-eabi-readelf -n repos/interrupt/example/reproducible-build/build/nrf52.elf

Displaying notes found in: .gnu_build_id
  Owner                 Data size	Description
  GNU                  0x00000014	NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 6fe2c11e68fb1c664d6644322b1d4e51bd2114e8
$ arm-none-eabi-readelf -n dev/interrupt/example/reproducible-build/build/nrf52.elf

Displaying notes found in: .gnu_build_id
  Owner                 Data size	Description
  GNU                  0x00000014	NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 6fe2c11e68fb1c664d6644322b1d4e51bd2114e8

Great, the Build IDs are an exact match. We could also diff the ELF files and confirm they are identical as well:

$ diff dev/interrupt/example/reproducible-build/build/nrf52.elf repos/interrupt/example/reproducible-build/build/nrf52.elf && echo "Artifacts Match"
Artifacts Match
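If you would rather perform this comparison from a test harness than from the shell, the core of it is small. The following is a sketch with a hypothetical function name that works on buffers already loaded into memory; reading the two .bin files from disk is left out:

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: returns the offset of the first differing byte, or -1 if
// the two buffers are identical. If one buffer is a prefix of the other, the
// length of the shorter buffer is returned.
long first_difference(const uint8_t *a, size_t a_len,
                      const uint8_t *b, size_t b_len) {
  const size_t common = (a_len < b_len) ? a_len : b_len;
  for (size_t i = 0; i < common; i++) {
    if (a[i] != b[i]) {
      return (long)i;
    }
  }
  return (a_len == b_len) ? -1 : (long)common;
}
```

A reproducible build is one where this returns -1 for artifacts built on different machines or in different directories. The returned offset is a useful starting point for the hexdump and disassembly techniques used in this article.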

Checklist

In summary, here are the steps to take in order to make a build reproducible:

First, compile your binary in two directories. If the binaries match, you are probably good! Otherwise, follow these strategies:

  1. Include a GNU Build ID in your binary so it’s always easy to identify the build in use.
  2. Enforce that all builds use one version of a compiler.
  3. Compare ELFs and BINs for differences using the tips discussed in this article. Typical differences arise from including absolute paths, timestamps, or shell information in the final binary.
  4. Utilize -fdebug-prefix-map=old=new to make sure any debug information emitted does not rely on absolute paths.
  5. At this point, builds will usually match for an embedded project. If they do not, it’s likely non-determinism in one of the build tools being used is causing issues [10].

That’s it!

Applying Reproducible Build Methodology During Development

Over a project’s lifetime, you will likely wind up making changes to the product that should have no impact on the final binary produced. Some examples of this type of change include:

  1. Migrating to a new build system (e.g. Make to CMake)
  2. Refactoring a Makefile to be more readable
  3. Sorting headers in a C file to follow a coding convention
  4. Fixing a new compiler warning that shouldn’t impact the binary produced
  5. Refactoring a linker script so it’s easier to extend

Some of these parts of a codebase can be “scary” ones to touch. By applying what we have learned above to debug how binaries differ, we can make these types of changes with confidence. If we update a Makefile and don’t expect a change in the binary but something has changed, it likely means we have messed up! Let’s walk through a quick example.

Cleaning up a Makefile

In the Makefile for the example app we have been using in this article we have the following block:

INCLUDE_PATH_DIRS = \
  $(FREERTOS_PORT_ROOT) \
  $(FREERTOS_ROOT_DIR)/include \
  $(ROOT_DIR) \

INCLUDE_PATH_DIRS += $(ROOT_DIR)/include

INCLUDE_PATH_DIRS += $(ROOT_DIR)/config

Let’s say we decide to clean this up by moving all the include paths into a single list:

INCLUDE_PATH_DIRS = \
  $(FREERTOS_PORT_ROOT) \
  $(FREERTOS_ROOT_DIR)/include \
  $(ROOT_DIR) \
  $(ROOT_DIR)/config \
  $(ROOT_DIR)/include

If you want to follow along, you can apply the patch locally:

$ cd /private/tmp/dev/interrupt/example/reproducible-build/
$ git apply 04-makefile-cleanup.patch
$ cd /private/tmp/repos/interrupt/example/reproducible-build/
$ git apply 04-makefile-cleanup.patch

If we have done things right, nothing in the project should change. Let’s save a copy of the build and rebuild with the modification:

$ cd /private/tmp/repos/interrupt/example/reproducible-build/
$ cp build/nrf52.elf nrf52_original.elf
$ cp build/nrf52.bin  nrf52_original.bin
# Apply changes & rebuild
$ make

If we compare the Build IDs we see they are now different!

$ arm-none-eabi-readelf -n build/nrf52.elf

Displaying notes found in: .gnu_build_id
  Owner                 Data size	Description
  GNU                  0x00000014	NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 2b01965289ecab1bded71b8dffe0535661e7847a
$ arm-none-eabi-readelf -n nrf52_original.elf

Displaying notes found in: .gnu_build_id
  Owner                 Data size	Description
  GNU                  0x00000014	NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 6fe2c11e68fb1c664d6644322b1d4e51bd2114e8

We can compare the binaries like we did above. As we would expect, the Build IDs differ, but now we also see a third row which differs:

$ sdiff -s <(xxd nrf52_original.bin) <(xxd build/nrf52.bin)
00000050: 0300 0000 474e 5500 6fe2 c11e 68fb 1c66  ....GNU.o. |	00000050: 0300 0000 474e 5500 2b01 9652 89ec ab1b  ....GNU.+.
00000060: 4d66 4432 2b1d 4e51 bd21 14e8 0000 0000  MfD2+.NQ.! |	00000060: ded7 1b8d ffe0 5356 61e7 847a 0000 0000  ......SVa.
000000d0: dc00 0020 03be 7047 13b5 0022 0821 0120  ... ..pG.. |	000000d0: dc00 0020 03be 7047 13b5 0022 0821 6420  ... ..pG..

Comparing objdump Output

Another thing we can do if the binaries are really close is to compare the assembly. This might give us a better clue as to what changed. Let’s try it out:

$ sdiff -s <(arm-none-eabi-objdump -d nrf52_original.elf) <(arm-none-eabi-objdump -d build/nrf52.elf)
nrf52_original.elf:     file format elf32-littlearm       |	build/nrf52.elf:     file format elf32-littlearm
      de:	2001        movs	r0, #1            |       de:	2064        movs	r0, #100	; 0x6

We are loading the value of 100 into register r0 instead of 1 in the other build. We can find what function this instruction is coming from by using addr2line:

$ arm-none-eabi-addr2line 0xde -e nrf52_original.elf
 ./main.c:61

or by dumping it in GDB:

(gdb) list *0xde
0xde is in main (main.c:61).
59
60	int main(void) {
==> CHANGE HERE
61   xQueue = xQueueCreate(MAIN_QUEUE_LENGTH, sizeof(uint64_t));
62    configASSERT(xQueue != NULL);
63
64    xTaskCreate(prvQueuePongTask,
65                "Ping", configMINIMAL_STACK_SIZE,
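The changed bytes can also be decoded by hand. Thumb instructions are stored as little-endian halfwords, so the bytes 01 20 in the hexdump are the halfword 0x2001 and 64 20 is 0x2064. A minimal sketch of a decoder for this one encoding (MOVS Rd, #imm8) follows. The function name is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

// Decodes the 16-bit Thumb encoding of MOVS Rd, #imm8:
//   bits 15:11 = 0b00100, bits 10:8 = Rd, bits 7:0 = imm8
// Returns false if insn is not that encoding.
bool thumb_decode_movs_imm8(uint16_t insn, uint8_t *rd, uint8_t *imm8) {
  if ((insn >> 11) != 0x04u) {
    return false;
  }
  *rd = (uint8_t)((insn >> 8) & 0x7u);
  *imm8 = (uint8_t)(insn & 0xFFu);
  return true;
}
```

Decoding 0x2001 gives r0 = 1 and 0x2064 gives r0 = 100, matching the objdump output. The preceding halfword 0x2108 (bytes 08 21) decodes to r1 = 8, which is consistent with sizeof(uint64_t) being passed as the second argument.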

Per the ARM procedure call standard, r0 holds the first argument of a function call. This means that the movs into r0 we see above corresponds to MAIN_QUEUE_LENGTH. But how did that change between builds? Let’s grep [9] through the codebase to see where it is defined.

$ rg "define.*MAIN_QUEUE_LENGTH"
config/config.h
1:#define MAIN_QUEUE_LENGTH 100

include/config.h
1:#define MAIN_QUEUE_LENGTH 1

Error Analysis

Oh interesting, the define is provided in two headers. In main.c we just include "config.h". When the compiler resolves an include, it walks through the list of include paths we provided at compilation time, in order, and uses the first match. If we go back and re-examine the change made above, we can see that we accidentally changed the order of the include paths: config now comes before include, so config/config.h is picked up. If we change the order back, we can recompile and confirm the build matches again!

NOTE: A full discussion of C include path best practices is outside the scope of this article, but this example shows one reason a flat include structure can be dangerous. Because the path is not explicit, it’s pretty opaque which include path will be used. The bug above could have been avoided by not adding -Iconfig/ to CFLAGS and instead requiring the header be included explicitly (i.e. #include "config/config.h").
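Another option is a compile-time tripwire next to the code that depends on the value. This is a hypothetical sketch and is not part of the example project. The #define below stands in for the project’s config header so the snippet is self-contained:

```c
// Stand-in for the project's config header. In a real project this value comes
// from whichever config.h the include path resolves to.
#define MAIN_QUEUE_LENGTH 1

// Hypothetical tripwire: the build fails if a different header silently
// supplies another value for the same define.
_Static_assert(MAIN_QUEUE_LENGTH == 1,
               "MAIN_QUEUE_LENGTH changed: check which config.h is being included");

// Accessor used only to make the sketch observable from a test
unsigned int main_queue_length(void) {
  return MAIN_QUEUE_LENGTH;
}
```

The trade-off is that the expected value is written in two places, so this is best reserved for settings where an accidental change would be costly.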

Closing

I hope this post gave you a useful overview of the benefits of reproducible builds and how you might update your project to achieve this behavior.

Are the builds for your project reproducible today? Are there other matters on the topic that you would have liked to see us cover? Let us know in the discussion area below!

References

Chris Coleman is a founder and CTO at Memfault. Prior to founding Memfault, Chris worked on the embedded software teams at Sun, Pebble, and Fitbit.