<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://davidlattimore.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://davidlattimore.github.io/" rel="alternate" type="text/html" /><updated>2025-11-27T06:31:45+00:00</updated><id>https://davidlattimore.github.io/feed.xml</id><title type="html">David Lattimore’s Blog</title><subtitle>A blog about my open-source work, mostly in Rust. My interests are mostly around developer tooling, compilers, linking.</subtitle><entry><title type="html">Graph Algorithms in Rayon</title><link href="https://davidlattimore.github.io/posts/2025/11/27/graph-algorithms-in-rayon.html" rel="alternate" type="text/html" title="Graph Algorithms in Rayon" /><published>2025-11-27T00:00:00+00:00</published><updated>2025-11-27T00:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2025/11/27/graph-algorithms-in-rayon</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2025/11/27/graph-algorithms-in-rayon.html"><![CDATA[<p>The <a href="https://github.com/davidlattimore/wild">Wild linker</a> makes very extensive use of
<a href="https://docs.rs/rayon/latest/rayon/">rayon</a> for parallelism. Much of this parallelism is in the
form of
<a href="https://docs.rs/rayon/latest/rayon/iter/trait.IntoParallelRefIterator.html#tymethod.par_iter"><code class="language-plaintext highlighter-rouge">par_iter</code></a>
and friends. However, some parts of the linker don’t fit neatly because the amount of work isn’t
known in advance. For example, the linker has two places where it explores a graph. When we start,
we know some roots of that graph, but we don’t know all the nodes that we’ll need to visit. We’ve
gone through a few different approaches for how we implement such algorithms. This post covers those
approaches and what we’ve learned along the way.</p>

<h2 id="spawn-broadcast">Spawn broadcast</h2>

<p>Our first approach was to spawn a task for each thread (rayon’s
<a href="https://docs.rs/rayon/latest/rayon/struct.Scope.html#method.spawn_broadcast">spawn_broadcast</a>)
then do our own work sharing and job control between those threads. By “our own job control” I mean
that each thread would pull work from a channel and if it found no work, it’d <a href="https://doc.rust-lang.org/std/thread/fn.park.html">park the
thread</a>. If new work came up, the thread that
produced the work would wake a parked thread.</p>
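
<p>A rough sketch of the shape of this approach might look something like the following. This isn’t
Wild’s actual code - <code class="language-plaintext highlighter-rouge">WorkItem</code> and <code class="language-plaintext highlighter-rouge">process</code> are stand-ins for the linker’s real types, and for
brevity this version polls for work rather than parking and waking threads.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::atomic::{AtomicUsize, Ordering};

struct WorkItem; // Stand-in for the linker's real work type.

fn process(
    _item: WorkItem,
    _send: &amp;crossbeam_channel::Sender&lt;WorkItem&gt;,
    _outstanding: &amp;AtomicUsize,
) {
    // Real work goes here. To queue follow-up work, the real code would first
    // increment `_outstanding`, then send a new `WorkItem` on `_send`.
}

fn run(roots: Vec&lt;WorkItem&gt;) {
    let (send, recv) = crossbeam_channel::unbounded();
    let outstanding = AtomicUsize::new(roots.len());
    for item in roots {
        let _ = send.send(item);
    }

    rayon::scope(|scope| {
        scope.spawn_broadcast(|_scope, _ctx| {
            // One copy of this closure runs on each thread in the pool. Each
            // thread keeps pulling items until all outstanding work is done.
            while outstanding.load(Ordering::Acquire) &gt; 0 {
                if let Ok(item) = recv.try_recv() {
                    process(item, &amp;send, &amp;outstanding);
                    outstanding.fetch_sub(1, Ordering::AcqRel);
                }
            }
        });
    });
}
</code></pre></div></div>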

<p>This was complex. Worse, it didn’t allow us to use other rayon features while it was running. For
example, if we tried to do a par_iter from one of the threads, it’d only have the current thread to
work with because all the others were doing their own thing, possibly parked, but in any case, not
available to rayon.</p>

<h2 id="scoped-spawning">Scoped spawning</h2>

<p>Using rayon’s <a href="https://docs.rs/rayon/latest/rayon/fn.scope.html"><code class="language-plaintext highlighter-rouge">scope</code></a> or
<a href="https://docs.rs/rayon/latest/rayon/fn.in_place_scope.html"><code class="language-plaintext highlighter-rouge">in_place_scope</code></a>, we can create a scope
into which we spawn tasks.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">rayon</span><span class="p">::</span><span class="nf">scope</span><span class="p">(|</span><span class="n">scope</span><span class="p">|</span> <span class="p">{</span>  
  <span class="k">for</span> <span class="n">node</span> <span class="k">in</span> <span class="n">roots</span> <span class="p">{</span>  
    <span class="n">scope</span><span class="nf">.spawn</span><span class="p">(|</span><span class="n">scope</span><span class="p">|</span> <span class="p">{</span>  
      <span class="nf">explore_graph</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">scope</span><span class="p">);</span>  
    <span class="p">});</span>  
  <span class="p">}</span>  
<span class="p">});</span>  
</code></pre></div></div>

<p>The idea here is that we create a scope and spawn some initial tasks into that scope. Those tasks
then spawn additional tasks and so on until eventually there are no more tasks.</p>
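
<p>For concreteness, here’s a sketch of what <code class="language-plaintext highlighter-rouge">explore_graph</code> might look like under this approach. The
<code class="language-plaintext highlighter-rouge">Node</code> type here, with its visited flag and edge list, is purely illustrative - Wild’s real graph
types are different.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::atomic::{AtomicBool, Ordering};

// Purely illustrative: anything with a "visited" flag and a list of outgoing
// edges works the same way.
struct Node {
    visited: AtomicBool,
    edges: Vec&lt;&amp;'static Node&gt;,
}

fn explore_graph&lt;'scope&gt;(node: &amp;'scope Node, scope: &amp;rayon::Scope&lt;'scope&gt;) {
    for &amp;next in &amp;node.edges {
        // Claim the node before spawning so that it only gets processed once,
        // even if several tasks reach it at about the same time.
        if !next.visited.swap(true, Ordering::AcqRel) {
            scope.spawn(move |scope| explore_graph(next, scope));
        }
    }
}
</code></pre></div></div>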

<p>The rayon documentation warns that this is more expensive than other approaches, so should be
avoided if possible. The reason it’s more expensive is that it heap-allocates the task. Indeed, when
using this approach, we do see increased heap allocations.</p>

<h2 id="channel--par_bridge">Channel + par_bridge</h2>

<p>Another approach that I’ve tried recently and which arose out of the desire to reduce heap
allocations is to put work into a <a href="https://docs.rs/crossbeam-channel/latest/crossbeam_channel/fn.unbounded.html">crossbeam
channel</a>. The work
items can be an enum if there are different kinds. Our work scope is then just something like the
following:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="p">(</span><span class="n">work_send</span><span class="p">,</span> <span class="n">work_recv</span><span class="p">)</span> <span class="o">=</span> <span class="nn">crossbeam_channel</span><span class="p">::</span><span class="nf">unbounded</span><span class="p">();</span>

<span class="c1">// Add some initial work items.  </span>
<span class="k">for</span> <span class="n">node</span> <span class="k">in</span> <span class="n">roots</span> <span class="p">{</span>  
  <span class="n">work_send</span><span class="nf">.send</span><span class="p">(</span><span class="nn">WorkItem</span><span class="p">::</span><span class="nf">ProcessNode</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">work_send</span><span class="nf">.clone</span><span class="p">()));</span>  
<span class="p">}</span>

<span class="c1">// Drop sender to ensure we can terminate. Each work item has a copy of the sender.  </span>
<span class="nf">drop</span><span class="p">(</span><span class="n">work_send</span><span class="p">);</span>

<span class="n">work_recv</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.par_bridge</span><span class="p">()</span><span class="nf">.for_each</span><span class="p">(|</span><span class="n">work_item</span><span class="p">|</span> <span class="p">{</span>  
   <span class="k">match</span> <span class="n">work_item</span> <span class="p">{</span>  
      <span class="nn">WorkItem</span><span class="p">::</span><span class="nf">ProcessNode</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">work_send</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="p">{</span>  
        <span class="nf">explore_graph</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">work_send</span><span class="p">);</span>  
      <span class="p">}</span>  
   <span class="p">}</span>  
<span class="p">});</span>  
</code></pre></div></div>

<p>The trick with this approach is that each work item needs to hold a copy of the send-end of the
channel. That means that when processing work items, we can add more work to the queue. Once the
last work item completes, the last copy of the sender is dropped and the channel closes.</p>
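
<p>Under this scheme, <code class="language-plaintext highlighter-rouge">explore_graph</code> would look something like the sketch below. Again, <code class="language-plaintext highlighter-rouge">Node</code> and
<code class="language-plaintext highlighter-rouge">neighbours</code> are stand-ins rather than Wild’s real types.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Illustrative stand-ins for the linker's real types.
struct Node;

fn neighbours(_node: &amp;Node) -&gt; Vec&lt;Node&gt; {
    Vec::new()
}

enum WorkItem {
    ProcessNode(Node, crossbeam_channel::Sender&lt;WorkItem&gt;),
}

fn explore_graph(node: Node, work_send: crossbeam_channel::Sender&lt;WorkItem&gt;) {
    for next in neighbours(&amp;node) {
        // Each queued item carries its own clone of the sender, so the channel
        // only closes once the last item has been processed.
        let _ = work_send.send(WorkItem::ProcessNode(next, work_send.clone()));
    }
}
</code></pre></div></div>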

<p>This approach works OK. It does avoid the heap allocations associated with scoped spawning. It is a
little bit complex, although not as complex as doing all the job control ourselves. One downside is
that, like doing job control ourselves, it doesn’t play nicely with using <code class="language-plaintext highlighter-rouge">par_iter</code> inside worker
tasks. The reason why is kind of subtle and is due to the way rayon is implemented. What can happen
is that the <code class="language-plaintext highlighter-rouge">par_iter</code> doesn’t just process its own tasks. It can also steal work from other
threads. When it does this, it can end up blocking trying to pull another work item from the
channel. The trouble is that because the <code class="language-plaintext highlighter-rouge">par_iter</code> was called from a work item that holds a copy of
the send-end of the channel, we can end up deadlocked. The channel doesn’t close because we hold a
sender and we don’t drop the sender because we’re trying to read from the read-end of the channel.</p>

<p>Another problem with this approach that I’ve just come to realise is that it doesn’t compose well. I
had kind of imagined just adding more and more variants to my <code class="language-plaintext highlighter-rouge">WorkItem</code> enum as the scope of the
work increased. The trouble is that working with this kind of work queue doesn’t play nicely with
the borrow checker. An example might help. Suppose we have some code written with rayon’s
<a href="https://docs.rs/rayon/latest/rayon/slice/trait.ParallelSliceMut.html#method.par_chunks_mut">par_chunks_mut</a>
and we want to flatten that work into some other code that uses a channel with work items. First we
need to convert the <code class="language-plaintext highlighter-rouge">par_chunks_mut</code> code into a channel of work items.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">foo</span> <span class="o">=</span> <span class="nf">create_foo</span><span class="p">();</span>  
<span class="n">foo</span><span class="nf">.par_chunks_mut</span><span class="p">(</span><span class="n">chunk_size</span><span class="p">)</span><span class="nf">.for_each</span><span class="p">(|</span><span class="n">chunk</span><span class="p">|</span> <span class="p">{</span>  
   <span class="c1">// Do work with mutable slice `chunk`  </span>
<span class="p">});</span>  
</code></pre></div></div>

<p>If we want the creation of <code class="language-plaintext highlighter-rouge">foo</code> to be a work item and each bit of processing to also be work items,
there’s no way to do that and have the borrow checker be happy.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">match</span> <span class="n">work_item</span> <span class="p">{</span>  
   <span class="nn">WorkItem</span><span class="p">::</span><span class="n">CreateAndProcessFoo</span> <span class="k">=&gt;</span> <span class="p">{</span>  
      <span class="k">let</span> <span class="n">foo</span> <span class="o">=</span> <span class="nf">create_foo</span><span class="p">();</span>  
      <span class="c1">// Split `foo` into chunks and queue several `WorkItem::ProcessChunk`s….?  </span>
   <span class="p">}</span>  
   <span class="nn">WorkItem</span><span class="p">::</span><span class="nf">ProcessChunk</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="p">{</span>  
      <span class="c1">// Do work with mutable slice `chunk`.  </span>
   <span class="p">}</span>  
<span class="p">}</span>  
</code></pre></div></div>

<p>So that clearly doesn’t work. There’s no way for us to take our owned <code class="language-plaintext highlighter-rouge">foo</code> and split it into chunks
that can be processed as separate <code class="language-plaintext highlighter-rouge">WorkItem</code>s. The borrow checker won’t allow it.</p>

<p>Another problem arises if we’ve got two work-queue-based jobs and we’d like to combine them, but the
second job needs borrows that were taken by the first job to be released before it can run. This
runs into similar problems.</p>

<p>The kinds of code structures we end up with here feel a bit like we’re trying to write async code
without async/await. This makes me wonder if async/await could help here.</p>

<h2 id="asyncawait">Async/await</h2>

<p>I don’t know exactly what this would look like because I haven’t yet tried implementing it. But I
imagine it might look a lot like how the code is written with rayon’s scopes and spawning. Instead
of using rayon’s scopes, it’d use something like
<a href="https://crates.io/crates/async-scoped"><code class="language-plaintext highlighter-rouge">async_scoped</code></a>.</p>

<p>One problem that I have with rayon currently is, I think, solved by using async/await. That problem,
which I briefly touched on above, is described in more detail here. Suppose we have a <code class="language-plaintext highlighter-rouge">par_iter</code>
inside some other parallel work:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">outer_work</span><span class="nf">.par_iter</span><span class="p">()</span><span class="nf">.for_each</span><span class="p">(|</span><span class="n">foo</span><span class="p">|</span> <span class="p">{</span>
  <span class="k">let</span> <span class="n">foo</span> <span class="o">=</span> <span class="n">inputs</span><span class="nf">.par_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="o">...</span><span class="p">)</span><span class="nf">.collect</span><span class="p">();</span>

  <span class="c1">// &lt; Some other work with `foo` here, hence why we cannot merge the two par_iters &gt;</span>

  <span class="n">foo</span><span class="nf">.par_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="o">...</span><span class="p">)</span><span class="nf">.for_each</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="o">...</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>

<p>If the thread that we’re running this code on becomes idle during the first inner <code class="language-plaintext highlighter-rouge">par_iter</code>, that
thread will try to steal work from other threads. If it succeeds, then even though all the work of
the <code class="language-plaintext highlighter-rouge">par_iter</code> is complete, we can’t continue to the second inner <code class="language-plaintext highlighter-rouge">par_iter</code> until the stolen work
also completes. However, with async/await, tasks are not tied to a specific thread once started.
Threads steal work, but tasks don’t. The task running the above code would therefore become runnable
as soon as the <code class="language-plaintext highlighter-rouge">par_iter</code> completed, even if the thread that had originally been running that task
had stolen work - the task could just be run on another thread.</p>

<p>It’d be very interesting to see what async/await could contribute to the parallel computation space.
I don’t have any plans to actually try this at this stage, but maybe in future.</p>

<h2 id="return-to-scoped-spawning-and-future-work">Return to scoped spawning and future work</h2>

<p>In the meantime, I’m thinking I’ll return to scoped spawning. Using a channel works fine for simple
tasks and it avoids the heap allocations, but it really doesn’t compose at all well.</p>

<p>I am interested in other options for avoiding the heap allocations. Perhaps there are options for
making small changes to rayon that might achieve this, e.g. adding support for spawning tasks
without boxing, provided the closure is no larger than, say, 32 bytes. I’ve yet to explore such
options though.</p>

<h2 id="thanks">Thanks</h2>

<p>Thanks to everyone who has been <a href="https://github.com/sponsors/davidlattimore">sponsoring</a> my work on
Wild, in particular the following, who have sponsored at least $15 in the last two months:</p>

<ul>
  <li>CodeursenLiberte</li>
  <li>repi</li>
  <li>rrbutani</li>
  <li>Rafferty97</li>
  <li>wasmerio</li>
  <li>mati865</li>
  <li>Urgau</li>
  <li>mstange</li>
  <li>flba-eb</li>
  <li>bes</li>
  <li>Tudyx</li>
  <li>twilco</li>
  <li>sourcefrog</li>
  <li>simonlindholm</li>
  <li>petersimonsson</li>
  <li>marxin</li>
  <li>joshtriplett</li>
  <li>coreyja</li>
  <li>binarybana</li>
  <li>bcmyers</li>
  <li>Kobzol</li>
  <li>HadrienG2</li>
  <li>+3 anonymous</li>
</ul>

<h2 id="discussion-threads">Discussion threads</h2>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1p7omoh/thoughts_on_graph_algorithms_in_rayon/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[The Wild linker makes very extensive use of rayon for parallelism. Much of this parallelism is in the form of par_iter and friends. However, some parts of the linker don’t fit neatly because the amount of work isn’t known in advance. For example, the linker has two places where it explores a graph. When we start, we know some roots of that graph, but we don’t know all the nodes that we’ll need to visit. We’ve gone through a few different approaches for how we implement such algorithms. This post covers those approaches and what we’ve learned along the way.]]></summary></entry><entry><title type="html">Wild Linker Update - 0.6.0</title><link href="https://davidlattimore.github.io/posts/2025/09/23/wild-update-0.6.0.html" rel="alternate" type="text/html" title="Wild Linker Update - 0.6.0" /><published>2025-09-23T00:00:00+00:00</published><updated>2025-09-23T00:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2025/09/23/wild-update-0.6.0</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2025/09/23/wild-update-0.6.0.html"><![CDATA[<p>Today, we’ve released <a href="https://github.com/davidlattimore/wild/releases/tag/0.6.0">Wild version
0.6.0</a>. There were many changes and we
were probably overdue for a release, having last released in May.</p>

<p>This release saw contributions from many people:</p>

<ul>
  <li>davidlattimore: 90</li>
  <li>marxin: 69</li>
  <li>lapla-cogito: 41</li>
  <li>mati865: 28</li>
  <li>RossSmyth: 6</li>
  <li>daniel-levin: 3</li>
  <li>Noratrieb: 2</li>
  <li>lqd: 1</li>
  <li>m-hugo: 1</li>
  <li>dawnofmidnight: 1</li>
</ul>

<p>That’s the number of commits, which isn’t a great measure, but it’s something. Importantly, more
than half of the commits were made by people other than me. It’s awesome to see Wild growing
into a team project. If you’d like to contribute, come along and have a chat on the <a href="https://wild.zulipchat.com/join/bbopdeg6howwjpaiyowngyde/">Wild
Zulip</a> or have a look through the issues
for something you think you’d like to try implementing / fixing.</p>

<p>My work on the project has been reduced a bit over the last couple of months due to me speaking first at
RustForge, then at RustChinaConf. Conference preparation takes me a lot of time and I need to get
better at managing that preparation work while still getting other stuff done. In any case, the
conferences are over now and I’m looking forward to getting some solid work done.</p>

<p>The last few months we had Kei (lapla-cogito) join us for Google Summer of Code (GSoC). It was
awesome having Kei work with us. As you can see above, a lot of work got done. The project focused
on setting things up to run Mold’s test suite with wild. This is now running in CI and helps fill
some gaps in our own tests as well as highlight things that we haven’t yet implemented. Kei also did
a lot of other fixes and improvements. One of the more notable ones was implementing <code class="language-plaintext highlighter-rouge">--help</code>, which
was something we’d wanted for a while. I look forward to continuing to work with Kei going forward.</p>

<p>Martin (marxin) added initial RISC-V support to this release. There’s still probably a little bit
more that could be done on this, but it basically works. Kei is working on adding RISC-V support to
linker-diff, which will help with further work in this area.</p>

<p>With this release, we now do release builds of Wild with Wild. i.e., we’re using it in “production”.
As such, we’ve removed the language from our README that used to say not to use it in production.
That’s not to say you should do production builds with it and just put them out there. We definitely
recommend thorough testing. Wild is still intended primarily for fast development builds, but if you’d like
to use it for other things, who are we to stop you? As always, be sure to let us know if you hit any
problems.</p>

<p>With 0.6.0, we can now link the Chromium web browser. This is an interesting stress test for linkers
because it’s really big - about 1.4 GiB (a previous version of this post incorrectly said that this
was without debug info; it’s actually with debug info).</p>

<p><img src="/images/0.6.0/chromium.svg" alt="Benchmark of time to link Chromium" /></p>

<p>It’s worth noting that the relative difference between lld and mold is very different to what’s seen
in the benchmarks on the <a href="https://github.com/rui314/mold">mold repo</a>. This is likely due to the
benchmark machine being very different. i.e., my laptop has a lot fewer cores.</p>

<p>The following benchmark is for librustc-driver, which is where most of the code in the Rust compiler
goes.</p>

<p><img src="/images/0.6.0/librustc-driver.svg" alt="Benchmark of time to link librustc-driver" /></p>

<p>Our final benchmark is the bevy dylib. This is an interesting benchmark since it has a very large
version script and produces a shared object with more than half a million dynamic symbols.</p>

<p><img src="/images/0.6.0/bevy-dylib.svg" alt="Benchmark of time to link bevy dylib" /></p>

<p>My laptop has 4 cores and 8 threads. All my development work to date has been on this machine and on
it, Wild performs really well, often beating other linkers by a factor of 2 or sometimes more.
However, on machines with more cores, the performance isn’t so great. We’ve started to look into
this to see what we can do about it. One area that particularly stands out is string merging. This
is where there’s a section containing null-terminated strings that need to be deduplicated with
similar sections in other object files. Sounds easy, but getting it to perform well with multiple
threads is hard. We’ve gone through several different implementations of string merging in an
attempt to get good performance. Our current implementation is probably too complex. It also
performs badly in some cases, in particular when there aren’t many input sections but there are
lots of threads; in that case it actually gets slower the more threads it has.</p>

<p>As such, I’m considering doing another rewrite of string merging. One option I’m considering here is
to change the way string-merge sections are represented in ELF files. I suspect that with a few
tweaks to how they’re represented, we could get much better performance.</p>

<p>At a high level, the idea would be to store an additional section containing an index of the strings
to be merged. This index, similar to the symbol table, would contain the start offset of each
string. Where string relocations (references) are currently by section number + section offset, we’d
change them to be section number + string number. That means that we’d need a new relocation type.
Additionally, the string index would also store a hash of each string and the strings would be
sorted by hash.</p>
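
<p>To make this a little more concrete, an entry in the proposed index section might look something
like the following. The exact field widths here are just a guess - no such format exists yet.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Hypothetical entry in the proposed string-merge index section. One entry
/// would be emitted per string and entries would be sorted by `hash`.
#[repr(C)]
struct StringIndexEntry {
    /// Hash of the string. Sorting by hash lets the linker partition and
    /// deduplicate strings without having to re-hash them itself.
    hash: u64,
    /// Offset of the start of the string within the string-merge section.
    offset: u32,
}
</code></pre></div></div>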

<p>I’ll probably do some experiments on this front to see what’s possible. If it performs well, then we
can talk to other linker authors and compiler writers to see if there’s interest in the new
representation. I’ll try to write a blog post about the outcome, even if it doesn’t work out.</p>

<p>While string-merging is the worst offender in terms of scaling with the number of threads, it looks like
other areas are also not ideal. This needs more investigation. I suspect at least part of the issue
might be rayon.</p>

<p>One area where we know we have a problem with rayon is its <code class="language-plaintext highlighter-rouge">try_for_each_init</code> API. We use this to
allocate a per-thread arena in a couple of cases. Unfortunately, rayon runs the init block for
pretty much every work item rather than just running it once per thread. This means that we end up
generating many times more arenas than we need, which is pretty wasteful. This is a known issue in
rayon, but I think it’s perhaps not clear how to fix it with rayon’s architecture.</p>
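
<p>For reference, the usage pattern in question looks roughly like this, with illustrative stand-ins
for our real item and arena types:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use rayon::prelude::*;

// Illustrative stand-in for a per-thread arena. In the real code, creating one
// of these is relatively expensive, which is why getting one per work item
// rather than one per thread is wasteful.
struct Arena;

fn process_all(items: &amp;[u32]) -&gt; Result&lt;(), String&gt; {
    items.par_iter().try_for_each_init(
        || Arena,               // Intended to run roughly once per thread.
        |_arena, _item| Ok(()), // Each item is meant to reuse its thread's arena.
    )
}
</code></pre></div></div>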

<p>I’m keen to try alternatives to rayon to see what difference they make. In particular, I’ve been
looking at <a href="https://github.com/orxfun/orx-parallel/">orx-parallel</a>. Once it has thread-pool support
and some way to handle graph algorithms (e.g. task spawning), I’ll definitely be giving it a try.</p>

<p>Trying <a href="https://github.com/dragostis/chili">chili</a> would also be interesting, but it’s pretty low
level, so we’d need quite a few abstractions built on top of it (e.g. par_iter) before we could
reasonably use it.</p>

<p>If you’ve been following this project for a while, you might be wondering what’s happening with
incremental linking. I thought that I was ready to start on this about a year ago, but it turns out
that I underestimated how much more there was to do to get a solid linker, so fixing bugs and adding
missing features has occupied most of my time. When I started the linker, I wasn’t expecting to get
such good performance with non-incremental linking. Seeing the performance that we’ve gotten has
changed the equation a bit in terms of what seems important to work on. Anyway, I still intend to
get to incremental eventually, but I won’t promise when.</p>

<p>There are lots of other things that we may or may not work on in the coming months. Possibilities
are:</p>

<ul>
  <li>Improving linker-script support</li>
  <li>Linker-plugin LTO. Not needed for Rust LTO, but is needed for LTO of other languages.</li>
  <li>Improved symbol version support (Martin might be looking at this)</li>
  <li>Garbage collection of redundant / unused debug info. This one is a bit daunting, so we probably
won’t do it, but it’d be cool if we did, since it’s something that none of the other linkers do.</li>
  <li>Putting ELF-specific stuff behind a trait to make porting to Windows / Mac easier.</li>
</ul>

<p>Thanks to everyone who has been <a href="https://github.com/sponsors/davidlattimore">sponsoring me</a>, in
particular the following people who have sponsored at least $30 since the last release:</p>

<ul>
  <li>CodeursenLiberte</li>
  <li>pmarks</li>
  <li>mati865</li>
  <li>repi</li>
  <li>Urgau</li>
  <li>teburd</li>
  <li>flba-eb</li>
  <li>tommythorn</li>
  <li>binarybana</li>
  <li>bcmyers</li>
  <li>Kobzol</li>
  <li>HadrienG2</li>
  <li>bes</li>
  <li>twilco</li>
  <li>mstange</li>
  <li>marxin</li>
  <li>joshtriplett</li>
  <li>jonhoo</li>
  <li>+1 anonymous</li>
</ul>

<h1 id="discussions">Discussions</h1>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1no80lz/wild_linker_update_060/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[Today, we’ve released Wild version 0.6.0. There were many changes and we were probably overdue for a release, having last released in May.]]></summary></entry><entry><title type="html">Wild Performance Tricks</title><link href="https://davidlattimore.github.io/posts/2025/09/02/rustforge-wild-performance-tricks.html" rel="alternate" type="text/html" title="Wild Performance Tricks" /><published>2025-09-02T13:00:00+00:00</published><updated>2025-09-02T13:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2025/09/02/rustforge-wild-performance-tricks</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2025/09/02/rustforge-wild-performance-tricks.html"><![CDATA[<p>Last week I had the pleasure of attending RustForge in Wellington, New Zealand. I gave a talk titled
“Wild performance tricks”. You can watch a <a href="https://www.youtube.com/live/6Scgq9fBZQM?t=9246s">recording of my
talk</a>. If you’d prefer to read rather than watch,
the rest of this post will cover more or less the same material. The talk shows some linker
benchmarks, which I’ll skip here and focus instead on the optimisations, which I think are the more
interesting part of the talk.</p>

<p>The tricks here are a few of my favourites that I’ve used in the <a href="https://github.com/davidlattimore/wild">Wild
linker</a>.</p>

<h2 id="mutable-slicing-for-sharing-between-threads">Mutable slicing for sharing between threads</h2>

<p>In the linker, we have a type <code class="language-plaintext highlighter-rouge">SymbolId</code> defined as:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">SymbolId</span><span class="p">(</span><span class="nb">u32</span><span class="p">);</span>
</code></pre></div></div>

<p>We need a way to store resolutions, where one <code class="language-plaintext highlighter-rouge">SymbolId</code> resolves (maps) to another <code class="language-plaintext highlighter-rouge">SymbolId</code>. We
store these in a <code class="language-plaintext highlighter-rouge">Vec&lt;SymbolId&gt;</code> indexed by the source symbol’s ID, so if we need to look up which
symbol <code class="language-plaintext highlighter-rouge">SymbolId(5)</code> maps to, we look at index <code class="language-plaintext highlighter-rouge">5</code> in the <code class="language-plaintext highlighter-rouge">Vec</code>.
Because every symbol maps to some other symbol (possibly itself), we make use of the
entire <code class="language-plaintext highlighter-rouge">Vec</code>, i.e. it’s dense, not sparse. For a sparse mapping, a <code class="language-plaintext highlighter-rouge">HashMap</code> might be preferable.</p>

<p>The Wild linker is very multi-threaded, so we want to be able to process symbols for our input
objects in parallel. To achieve this, we make sure that all symbols for a given object get allocated
adjacent to each other. i.e. each object has <code class="language-plaintext highlighter-rouge">SymbolId</code>s in a contiguous range. This is good for
cache locality because when a thread is working with an object, all its symbols will be nearby in
memory, so more likely to be in cache. It also lets us do things like this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">parallel_process_resolutions</span><span class="p">(</span><span class="k">mut</span> <span class="n">resolutions</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[</span><span class="n">SymbolId</span><span class="p">],</span> <span class="n">objects</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="n">Object</span><span class="p">])</span> <span class="p">{</span>
   <span class="n">objects</span>
       <span class="nf">.iter</span><span class="p">()</span>
       <span class="nf">.map</span><span class="p">(|</span><span class="n">obj</span><span class="p">|</span> <span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">resolutions</span><span class="nf">.split_off_mut</span><span class="p">(</span><span class="o">..</span><span class="n">obj</span><span class="py">.num_symbols</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">()))</span>
       <span class="nf">.par_bridge</span><span class="p">()</span>
       <span class="nf">.for_each</span><span class="p">(|(</span><span class="n">obj</span><span class="p">,</span> <span class="n">object_resolutions</span><span class="p">)|</span> <span class="p">{</span>
           <span class="n">obj</span><span class="nf">.process_resolutions</span><span class="p">(</span><span class="n">object_resolutions</span><span class="p">);</span>
       <span class="p">});</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here, we’re using the Rayon crate to process the resolutions for all our objects in parallel from
multiple threads. We start by iterating over our objects, then for each object, we use
<code class="language-plaintext highlighter-rouge">split_off_mut</code> to split off a mutable slice of <code class="language-plaintext highlighter-rouge">resolutions</code> that contains the resolutions for that
object. <code class="language-plaintext highlighter-rouge">par_bridge</code> converts this regular Rust iterator into a Rayon parallel iterator. The closure
passed to <code class="language-plaintext highlighter-rouge">for_each</code> then runs in parallel on multiple threads, with each thread getting access to
the object and a mutable slice of that object’s resolutions.</p>

<h2 id="parallel-initialisation-of-the-vec">Parallel initialisation of the Vec</h2>

<p>The previous technique of using <code class="language-plaintext highlighter-rouge">split_off_mut</code> to get multiple non-overlapping mutable slices of
our Vec relies on the Vec having already been initialised. We’d like to initialise our Vec in
parallel, otherwise we’d have to wait for the main thread to fill the entire Vec with a placeholder
value only to then have our threads overwrite those placeholder values. To do this, we can use the
<code class="language-plaintext highlighter-rouge">sharded-vec-writer</code> crate, which was created for use in Wild, but which can be used for similar
purposes elsewhere.</p>

<p>First, we create a Vec with sufficient capacity to store the resolutions for all our symbols:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">resolutions</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">SymbolId</span><span class="o">&gt;</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">with_capacity</span><span class="p">(</span><span class="n">total_num_symbols</span><span class="p">);</span>
</code></pre></div></div>

<p>At this point, we’ve allocated space on the heap for the Vec, but that space is still uninitialised.
i.e. the length is still zero.</p>

<p>Next, we create a <code class="language-plaintext highlighter-rouge">VecWriter</code>, which mutably borrows the Vec, then split that writer into shards,
with each shard having a size equal to the number of symbols in the corresponding object.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">writer</span> <span class="o">=</span> <span class="nn">VecWriter</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">resolutions</span><span class="p">);</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">shards</span> <span class="o">=</span> <span class="n">writer</span><span class="nf">.take_shards</span><span class="p">(</span><span class="n">objects</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">o</span><span class="p">|</span> <span class="n">o</span><span class="py">.num_symbols</span><span class="p">));</span>
</code></pre></div></div>

<p>We can now, in parallel, iterate through our objects and their corresponding shards and initialise
the shards.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">objects</span>
   <span class="nf">.par_iter</span><span class="p">()</span>
   <span class="nf">.zip_eq</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">shards</span><span class="p">)</span>
   <span class="nf">.for_each</span><span class="p">(|(</span><span class="n">obj</span><span class="p">,</span> <span class="n">shard</span><span class="p">)|</span> <span class="p">{</span>
      <span class="k">for</span> <span class="n">symbol</span> <span class="k">in</span> <span class="n">obj</span><span class="nf">.symbols</span><span class="p">()</span> <span class="p">{</span>
         <span class="n">shard</span><span class="nf">.push</span><span class="p">(</span><span class="o">...</span><span class="p">);</span>
      <span class="p">}</span>
   <span class="p">});</span>
</code></pre></div></div>

<p>Lastly, we return the shards to the writer, which verifies that all the shards were fully
initialised and sets the Vec’s length accordingly, after which it can be used normally.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">writer</span><span class="nf">.return_shards</span><span class="p">(</span><span class="n">shards</span><span class="p">);</span>
</code></pre></div></div>

<h2 id="atomic---non-atomic-in-place-conversion">Atomic - non-atomic in-place conversion</h2>

<p>Most parts of the linker can make do with either exclusive access to part of the <code class="language-plaintext highlighter-rouge">resolutions</code> Vec,
or shared access to the entire Vec. However, there’s one part of the linker where we need to perform
random writes to the <code class="language-plaintext highlighter-rouge">resolutions</code> Vec. This is done when we have multiple symbol definitions with
the same name. Originally, I just did this work from the main thread, since I figured most of the
time there would only be a small number of symbols that had the same name. This was mostly true;
however, for large C++ binaries like Chromium, it turns out that there are actually a lot of symbols
with the same names, presumably due to C++’s use of header files, which create lots of identical
definitions.</p>

<p>To allow random writes to <code class="language-plaintext highlighter-rouge">resolutions</code>, we introduce a new type:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">AtomicSymbolId</span><span class="p">(</span><span class="n">AtomicU32</span><span class="p">);</span>
</code></pre></div></div>

<p>Being an atomic, we can write to an <code class="language-plaintext highlighter-rouge">AtomicSymbolId</code> using only a shared (non-exclusive) reference.
However, we need a way to temporarily view our <code class="language-plaintext highlighter-rouge">Vec&lt;SymbolId&gt;</code> as a <code class="language-plaintext highlighter-rouge">&amp;[AtomicSymbolId]</code>.</p>

<p>The standard library has something that might help - <code class="language-plaintext highlighter-rouge">AtomicU32::from_mut_slice</code>:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">from_mut_slice</span><span class="p">(</span><span class="n">v</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[</span><span class="nb">u32</span><span class="p">])</span> <span class="k">-&gt;</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[</span><span class="n">AtomicU32</span><span class="p">]</span>
</code></pre></div></div>

<p>However, it’s unstable (nightly only). Even if it were stable, it only works with slices of
primitive types, so we’d have to lose our newtypes (SymbolId etc).</p>

<p>Another option would be to always use atomics; however, that would quite possibly hurt the performance of
the rest of the linker, which doesn’t need atomics. It’d also hurt ergonomics, since currently our
<code class="language-plaintext highlighter-rouge">SymbolId</code>s implement <code class="language-plaintext highlighter-rouge">Copy</code>, but if they wrapped an <code class="language-plaintext highlighter-rouge">AtomicU32</code>, then they wouldn’t be able to.</p>

<p>A reasonable option at this point would be to resort to unsafe and use something like
<code class="language-plaintext highlighter-rouge">core::mem::transmute</code>. We’d need to check all the rules and make sure that we were meeting all the
requirements. This is not a bad option, but I personally like the challenge of doing things without
unsafe if I can, especially if I can do so without loss of performance.</p>
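
<p>For comparison, the unsafe version might look something like the following sketch, using a pointer
cast rather than <code class="language-plaintext highlighter-rouge">transmute</code> and assuming both types are <code class="language-plaintext highlighter-rouge">#[repr(transparent)]</code> wrappers around a
32-bit value:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::atomic::AtomicU32;

#[repr(transparent)]
struct SymbolId(u32);

#[repr(transparent)]
struct AtomicSymbolId(AtomicU32);

fn as_atomic(symbols: &amp;mut [SymbolId]) -&gt; &amp;[AtomicSymbolId] {
    // SAFETY: both types are `repr(transparent)` wrappers around a 32-bit
    // value, and `AtomicU32` has the same in-memory representation as `u32`.
    // Taking `&amp;mut` ensures nothing else can perform non-atomic accesses that
    // alias the returned shared view while it's in use.
    unsafe { &amp;*(symbols as *mut [SymbolId] as *const [AtomicSymbolId]) }
}
</code></pre></div></div>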

<p>Indeed, it turns out that we can, as follows:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">into_atomic</span><span class="p">(</span><span class="n">symbols</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">SymbolId</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">AtomicSymbolId</span><span class="o">&gt;</span> <span class="p">{</span>
   <span class="n">symbols</span>
       <span class="nf">.into_iter</span><span class="p">()</span>
       <span class="nf">.map</span><span class="p">(|</span><span class="n">s</span><span class="p">|</span> <span class="nf">AtomicSymbolId</span><span class="p">(</span><span class="nn">AtomicU32</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">s</span><span class="na">.0</span><span class="p">)))</span>
       <span class="nf">.collect</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’d be reasonable to think that this will have a runtime cost, however it doesn’t. The reason is
that the Rust standard library has a nice optimisation in it that when we consume a Vec and collect
the result into a new Vec, in many circumstances, the heap allocation of the original Vec can be
reused. This applies in this case. But even with the heap allocation being reused, aren’t we still
looping over all the elements to transform them? Because the in-memory representation of an
<code class="language-plaintext highlighter-rouge">AtomicSymbolId</code> is identical to that of a <code class="language-plaintext highlighter-rouge">SymbolId</code>, our loop becomes a no-op and is optimised
away.</p>

<p>We can verify this by looking at the assembly produced for this function:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">movups</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
<span class="nf">mov</span>     <span class="nb">rax</span><span class="p">,</span> <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">]</span>
<span class="nf">movups</span>  <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">xmm0</span>
<span class="nf">mov</span>     <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="nb">rax</span>
<span class="nf">ret</span>
</code></pre></div></div>

<p>The main takeaway from this assembly is that there’s no branching, no looping, just a few moves and
a return. If we allowed this function to be inlined into the caller, it would likely vanish to
nothing.</p>

<p>For conversion back to the non-atomic form, we can do much the same:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">into_non_atomic</span><span class="p">(</span><span class="n">atomic_symbols</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">AtomicSymbolId</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">SymbolId</span><span class="o">&gt;</span> <span class="p">{</span>
   <span class="n">atomic_symbols</span>
       <span class="nf">.into_iter</span><span class="p">()</span>
       <span class="nf">.map</span><span class="p">(|</span><span class="n">s</span><span class="p">|</span> <span class="nf">SymbolId</span><span class="p">(</span><span class="n">s</span><span class="na">.0</span><span class="nf">.into_inner</span><span class="p">()))</span>
       <span class="nf">.collect</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The main thing to note here is that we avoid doing an atomic load from the atomic and instead
consume the atomic with <code class="language-plaintext highlighter-rouge">into_inner</code>. This is easier for the compiler to optimise and if we look at
the assembly produced it’s identical to what we got for <code class="language-plaintext highlighter-rouge">into_atomic</code>.</p>

<p>To actually use these functions, we first need to get ownership of our Vec using <code class="language-plaintext highlighter-rouge">core::mem::take</code>.
This puts an empty Vec in its place. Empty Vecs don’t heap allocate, so this is very cheap. We then
call <code class="language-plaintext highlighter-rouge">into_atomic</code> to convert the Vec into the form we need.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">atomic_resolutions</span> <span class="o">=</span> <span class="nf">into_atomic</span><span class="p">(</span><span class="nn">core</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">take</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="py">.resolutions</span><span class="p">));</span>
</code></pre></div></div>

<p>We can then do whatever parallel processing we need with the Vec in its atomic form.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">process_resolutions_in_parallel</span><span class="p">(</span><span class="o">&amp;</span><span class="n">atomic_resolutions</span><span class="p">);</span>
</code></pre></div></div>

<p>Finally, we convert back to the original non-atomic form and store back where we got it from,
overwriting the empty Vec that we temporarily put in its place.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">self</span><span class="py">.resolutions</span> <span class="o">=</span> <span class="nf">into_non_atomic</span><span class="p">(</span><span class="n">atomic_resolutions</span><span class="p">);</span>
</code></pre></div></div>

<p>One thing worth noting here is that if we panic (or do an early return), we might leave
<code class="language-plaintext highlighter-rouge">self.resolutions</code> as the empty Vec. This isn’t a problem in the linker, since if we’re returning an
error or have hit a panic, then we don’t care at that point about resolutions. It would be possible
to ensure that the proper Vec was restored for use-cases where that was important, however it would
add extra complexity and might be enough to convince me that it’d be better to just use transmute.</p>

<h2 id="buffer-reuse">Buffer reuse</h2>

<p>Doing too much heap allocation tends to hurt performance. A common trick is to move heap allocations
outside of loops. For example, rather than this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">loop</span> <span class="p">{</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">buffer</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
    <span class="c1">// Do work with `buffer`.</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We might prefer to allocate buffer before the loop, then just clear it inside the loop:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">buffer</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">loop</span> <span class="p">{</span>
    <span class="n">buffer</span><span class="nf">.clear</span><span class="p">();</span>
    <span class="c1">// Do work with `buffer`.</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, if we’re storing something into a Vec that has a non-static lifetime, then we can run into
problems. Here, we have a variable <code class="language-plaintext highlighter-rouge">text</code>, which holds a <code class="language-plaintext highlighter-rouge">String</code>. We then split that string and
store the resulting string-slices into <code class="language-plaintext highlighter-rouge">buffer</code>. Even though we clear <code class="language-plaintext highlighter-rouge">buffer</code> at the end of the
loop, the compiler is unhappy. It wants <code class="language-plaintext highlighter-rouge">text</code> to outlive <code class="language-plaintext highlighter-rouge">buffer</code> because we’re storing references
to <code class="language-plaintext highlighter-rouge">text</code> into <code class="language-plaintext highlighter-rouge">buffer</code>.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">buffer</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">loop</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">text</span> <span class="o">=</span> <span class="nf">get_text</span><span class="p">();</span>
    <span class="n">buffer</span><span class="nf">.extend</span><span class="p">(</span><span class="n">text</span><span class="nf">.split</span><span class="p">(</span><span class="s">","</span><span class="p">));</span>
    <span class="c1">// Do work with `buffer`.</span>
    <span class="n">buffer</span><span class="nf">.clear</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We could at this point give up and just move our Vec creation back inside the loop. However, it
turns out that there’s another solution.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="n">reuse_vec</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">U</span><span class="o">&gt;</span><span class="p">(</span><span class="k">mut</span> <span class="n">v</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">U</span><span class="o">&gt;</span> <span class="p">{</span>
   <span class="n">v</span><span class="nf">.clear</span><span class="p">();</span>
   <span class="n">v</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">x</span><span class="p">|</span> <span class="nd">unreachable!</span><span class="p">())</span><span class="nf">.collect</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The idea of this function is to convert from a Vec of some type to an empty Vec of another type,
reusing the heap allocation. This works in a very similar way to how we converted between atomic and
non-atomic <code class="language-plaintext highlighter-rouge">SymbolId</code>s, except this time because we first clear the Vec, the body of our <code class="language-plaintext highlighter-rouge">map</code>
function is unreachable.</p>

<p>The optimisation in the Rust standard library that allows reuse of the heap allocation will only
actually work if the size and alignment of <code class="language-plaintext highlighter-rouge">T</code> and <code class="language-plaintext highlighter-rouge">U</code> are the same, so let’s verify that that’s the
case. We can do the check at compile time, so if we accidentally call this function with
incompatible <code class="language-plaintext highlighter-rouge">T</code> and <code class="language-plaintext highlighter-rouge">U</code>, we’ll get a compilation error at the call site.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="n">reuse_vec</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">U</span><span class="o">&gt;</span><span class="p">(</span><span class="k">mut</span> <span class="n">v</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">U</span><span class="o">&gt;</span> <span class="p">{</span>
   <span class="k">const</span> <span class="p">{</span>
       <span class="nd">assert!</span><span class="p">(</span><span class="nn">size_of</span><span class="p">::</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">()</span> <span class="o">==</span> <span class="nn">size_of</span><span class="p">::</span><span class="o">&lt;</span><span class="n">U</span><span class="o">&gt;</span><span class="p">());</span>
       <span class="nd">assert!</span><span class="p">(</span><span class="nn">align_of</span><span class="p">::</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">()</span> <span class="o">==</span> <span class="nn">align_of</span><span class="p">::</span><span class="o">&lt;</span><span class="n">U</span><span class="o">&gt;</span><span class="p">());</span>
   <span class="p">}</span>
   <span class="n">v</span><span class="nf">.clear</span><span class="p">();</span>
   <span class="n">v</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">_</span><span class="p">|</span> <span class="nd">unreachable!</span><span class="p">())</span><span class="nf">.collect</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Let’s verify that this optimises as we expect:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>     <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="mi">0</span>
<span class="nf">movups</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
<span class="nf">movups</span>  <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">xmm0</span>
<span class="nf">mov</span>     <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="mi">0</span>
<span class="nf">ret</span>
</code></pre></div></div>

<p>More or less the same assembly as before, except that we’re now setting the length of the Vec to 0.
Note that the loop and the panic from the use of <code class="language-plaintext highlighter-rouge">unreachable!</code> are gone.</p>

<p>We can now integrate this into our previous code as follows:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">buffer_store</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;&amp;</span><span class="nb">str</span><span class="o">&gt;</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">loop</span> <span class="p">{</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">buffer</span> <span class="o">=</span> <span class="nf">reuse_vec</span><span class="p">(</span><span class="n">buffer_store</span><span class="p">);</span>
    <span class="k">let</span> <span class="n">text</span> <span class="o">=</span> <span class="nf">get_text</span><span class="p">();</span>
    <span class="n">buffer</span><span class="nf">.extend</span><span class="p">(</span><span class="n">text</span><span class="nf">.split</span><span class="p">(</span><span class="s">","</span><span class="p">));</span>
    <span class="c1">// Do work with `buffer`.</span>
    <span class="n">buffer_store</span> <span class="o">=</span> <span class="nf">reuse_vec</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Effectively, each time around the loop we move out of <code class="language-plaintext highlighter-rouge">buffer_store</code>, converting the type of the
<code class="language-plaintext highlighter-rouge">Vec</code>, use it for a bit, then convert it back and store it again in <code class="language-plaintext highlighter-rouge">buffer_store</code>. The only time
we’ll need a new heap allocation is when our <code class="language-plaintext highlighter-rouge">Vec</code> needs to grow. The types of <code class="language-plaintext highlighter-rouge">buffer_store</code> and
<code class="language-plaintext highlighter-rouge">buffer</code> are both <code class="language-plaintext highlighter-rouge">Vec&lt;&amp;str&gt;</code>, however the lifetime of the references is different.</p>

<h2 id="deallocation-on-a-separate-thread">Deallocation on a separate thread</h2>

<p>Freeing memory is generally a lot slower than allocating it. If we’ve done a very large allocation,
it can sometimes be worthwhile passing it to another thread to free it, so that we can get on with
other work.</p>

<p>For example, if using rayon, we might use <code class="language-plaintext highlighter-rouge">rayon::spawn</code> to spawn a task that drops our buffer:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">process_buffer</span><span class="p">(</span><span class="n">buffer</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">u8</span><span class="o">&gt;</span><span class="p">)</span> <span class="p">{</span>
   <span class="c1">// Do some work with `buffer`.</span>

   <span class="nn">rayon</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(||</span> <span class="nf">drop</span><span class="p">(</span><span class="n">buffer</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">rayon::spawn</code> itself does a heap allocation, so this would only be worthwhile if
<code class="language-plaintext highlighter-rouge">buffer</code> was potentially very large. This is definitely something you’d want to benchmark to see if
it actually improves the runtime for your use-case. There is at least one place in the Wild linker
where we did this and it did give a measurable reduction in runtime.</p>

<p>Similar to buffer reuse, if our heap allocation has non-static lifetimes associated with it, we can
get rid of them using <code class="language-plaintext highlighter-rouge">reuse_vec</code>.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">process_buffer</span><span class="p">(</span><span class="n">names</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;&amp;</span><span class="p">[</span><span class="nb">u8</span><span class="p">]</span><span class="o">&gt;</span><span class="p">)</span> <span class="p">{</span>
   <span class="c1">// Do some work with `names`.</span>

   <span class="k">let</span> <span class="n">names</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;&amp;</span><span class="p">[</span><span class="nb">u8</span><span class="p">]</span><span class="o">&gt;</span> <span class="o">=</span> <span class="nf">reuse_vec</span><span class="p">(</span><span class="n">names</span><span class="p">);</span>
   <span class="nn">rayon</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(||</span> <span class="nf">drop</span><span class="p">(</span><span class="n">names</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In this case, we’re converting the <code class="language-plaintext highlighter-rouge">Vec</code> from a <code class="language-plaintext highlighter-rouge">Vec&lt;&amp;[u8]&gt;</code> to a <code class="language-plaintext highlighter-rouge">Vec&lt;&amp;'static [u8]&gt;</code>.</p>

<h2 id="bonus-strip-lifetime-with-non-trivial-drop">Bonus: Strip lifetime with non-trivial Drop</h2>

<p>This bonus tip wasn’t included in the talk; it builds on the previous tip and is in
response to a question by VorpalWay on Reddit. If you want to drop a <code class="language-plaintext highlighter-rouge">Vec&lt;T&gt;</code> where <code class="language-plaintext highlighter-rouge">T</code> has both a
non-static lifetime and a non-trivial <code class="language-plaintext highlighter-rouge">Drop</code>, then things get slightly more tricky. The trick here
is to convert to a struct that is the same as <code class="language-plaintext highlighter-rouge">T</code>, but has non-static references replaced with
<code class="language-plaintext highlighter-rouge">MaybeUninit</code>.</p>

<p>For example, suppose we have the following struct:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">Foo</span><span class="o">&lt;</span><span class="nv">'a</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="n">owned</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span>
    <span class="n">borrowed</span><span class="p">:</span> <span class="o">&amp;</span><span class="nv">'a</span> <span class="nb">str</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We can define a new struct:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">StaticFoo</span> <span class="p">{</span>
    <span class="n">owned</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span>
    <span class="n">borrowed</span><span class="p">:</span> <span class="n">MaybeUninit</span><span class="o">&lt;&amp;</span><span class="k">'static</span> <span class="nb">str</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We can then convert our Vec to the new type with zero cost and no unsafe:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">without_lifetime</span><span class="p">(</span><span class="n">foos</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Foo</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">StaticFoo</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="n">foos</span><span class="nf">.into_iter</span><span class="p">()</span>
        <span class="nf">.map</span><span class="p">(|</span><span class="n">f</span><span class="p">|</span> <span class="n">StaticFoo</span> <span class="p">{</span>
            <span class="n">owned</span><span class="p">:</span> <span class="n">f</span><span class="py">.owned</span><span class="p">,</span>
            <span class="n">borrowed</span><span class="p">:</span> <span class="nn">MaybeUninit</span><span class="p">::</span><span class="nf">uninit</span><span class="p">(),</span>
        <span class="p">})</span>
        <span class="nf">.collect</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The presence of <code class="language-plaintext highlighter-rouge">MaybeUninit::uninit()</code> tells the compiler that it’s OK to have anything there, so it
can choose to leave whatever <code class="language-plaintext highlighter-rouge">&amp;str</code> was in the original <code class="language-plaintext highlighter-rouge">Foo</code> struct. This means that it’s valid to
produce a <code class="language-plaintext highlighter-rouge">StaticFoo</code> with the same in-memory representation as the <code class="language-plaintext highlighter-rouge">Foo</code> that it replaces, allowing
the compiler to eliminate the loop. The asm for this function is:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nf">movups</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
 <span class="nf">mov</span>     <span class="nb">rax</span><span class="p">,</span> <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">]</span>
 <span class="nf">movups</span>  <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">xmm0</span>
 <span class="nf">mov</span>     <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="nb">rax</span>
 <span class="nf">ret</span>
</code></pre></div></div>

<p>i.e. the loop was indeed eliminated.</p>

<p>Now that we have a Vec with no non-static lifetimes, we can safely move it to another thread.</p>
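
<p>Putting the last two tips together, a minimal sketch of how this might be used (<code class="language-plaintext highlighter-rouge">process_foos</code> is a made-up caller, not something from the talk):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fn process_foos(foos: Vec&lt;Foo&gt;) {
    // Do some work with `foos`.

    // Strip the non-static lifetime, then hand the buffer to another
    // thread to be dropped so that this thread can get on with other work.
    let foos: Vec&lt;StaticFoo&gt; = without_lifetime(foos);
    rayon::spawn(|| drop(foos));
}
</code></pre></div></div>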

<h1 id="thanks">Thanks</h1>

<p>Thanks to my <a href="https://github.com/sponsors/davidlattimore">github sponsors</a>. Your contributions help
to make it possible for me to continue to work on this kind of stuff rather than going and getting a
“real job”.</p>

<ul>
  <li>CodeursenLiberte</li>
  <li>Urgau</li>
  <li>pmarks</li>
  <li>repi</li>
  <li>embark-studios</li>
  <li>mati865</li>
  <li>bes</li>
  <li>joshtriplett</li>
  <li>mstange</li>
  <li>bcmyers</li>
  <li>Rafferty97</li>
  <li>acshi</li>
  <li>Kobzol</li>
  <li>flba-eb</li>
  <li>jonhoo</li>
  <li>marxin</li>
  <li>tommythorn</li>
  <li>binarybana</li>
  <li>teburd</li>
  <li>bearcove</li>
  <li>yerke</li>
  <li>teh</li>
  <li>twilco</li>
  <li>Shnatsel</li>
  <li>coastalwhite</li>
  <li>wezm</li>
  <li>davidcornu</li>
  <li>gendx</li>
  <li>rrbutani</li>
  <li>nazar-pc</li>
  <li>willstott101</li>
  <li>tatsuya6502</li>
  <li>teohhanhui</li>
  <li>jkendall327</li>
  <li>EdorianDark</li>
  <li>drmason13</li>
  <li>HadrienG2</li>
  <li>jplatte</li>
  <li>rukai</li>
  <li>ymgyt</li>
  <li>dream-dasher</li>
  <li>alexkirsz</li>
  <li>Pratyush</li>
  <li>Tudyx</li>
  <li>coreyja</li>
  <li>dralley</li>
  <li>irfanghat</li>
  <li>mvolfik</li>
  <li>simtheverse</li>
</ul>

<h2 id="discussion">Discussion</h2>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1n7814i/wild_performance_tricks/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[Last week I had the pleasure of attending RustForge in Wellington, New Zealand. I gave a talk titled “Wild performance tricks”. You can watch a recording of my talk. If you’d prefer to read rather than watch, the rest of this post will cover more or less the same material. The talk shows some linker benchmarks, which I’ll skip here and focus instead on the optimisations, which I think are the more interesting part of the talk.]]></summary></entry><entry><title type="html">Audio: Compose Podcast Interview</title><link href="https://davidlattimore.github.io/posts/2025/06/02/compose-podcast-interview.html" rel="alternate" type="text/html" title="Audio: Compose Podcast Interview" /><published>2025-06-02T13:00:00+00:00</published><updated>2025-06-02T13:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2025/06/02/compose-podcast-interview</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2025/06/02/compose-podcast-interview.html"><![CDATA[<p>Last week, I had the pleasure of having a conversation with Tim McNamara for <a href="https://timclicks.dev/podcast/david-lattimore-faster-linker-faster-builds">an
episode</a> of his podcast,
Compose. We talked about the Wild linker, linking in general, Rust coding styles, contributing to
open source and a range of other topics.</p>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[Last week, I had the pleasure of having a conversation with Tim McNamara for an episode of his podcast, Compose. We talked about the Wild linker, linking in general, Rust coding styles, contributing to open source and a range of other topics.]]></summary></entry><entry><title type="html">Designing Wild’s incremental linking</title><link href="https://davidlattimore.github.io/posts/2024/11/19/designing-wilds-incremental-linking.html" rel="alternate" type="text/html" title="Designing Wild’s incremental linking" /><published>2024-11-19T14:00:00+00:00</published><updated>2024-11-19T14:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/11/19/designing-wilds-incremental-linking</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/11/19/designing-wilds-incremental-linking.html"><![CDATA[<h1 id="designing-wilds-incremental-linking">Designing Wild’s incremental linking</h1>

<p>Whenever I’m about to embark on implementing something even slightly non-trivial, I typically write
out a plan for what I’m about to do. Writing it down helps me to uncover things that I hadn’t
thought about. Usually, I write these designs only for myself, however this time, I thought I’d try
something different, writing a design and sharing it with anyone who might be interested. My hope is
that some people might have interesting ideas for variations on this design that I hadn’t
considered. If nothing else, hopefully this document will be an interesting read for someone.</p>

<p>If you’ve read my <a href="https://davidlattimore.github.io/">previous posts</a>, you’ll know that I’ve been
writing a linker, called Wild, with the goal of being a very fast incremental linker. I didn’t make
the linker incremental from the start because I wanted to get it reasonably correct and reasonably
fast before adding incremental linking to the mix. Wild is now working well enough that I’ve
switched to using it as my default linker. That’s not to say it does everything correctly, e.g.
large code model stuff, which is needed for very large executables, but it works well
enough for compiling Rust code with bits of C or C++ mixed in. It’s also relatively fast - on my
laptop, Wild can link itself 48% faster than Mold. If there’s lots of debug info however, Wild is
currently often slower. I hope to work on improving the performance of linking debug info
eventually.</p>

<p>So I feel that while there’s plenty more that could be done to improve Wild’s non-incremental
linking, now is probably a good time to start work on incremental. To that end, this document is my
plan for how I intend to go about that.</p>

<h1 id="the-end-goal">The end-goal</h1>

<p>First, let’s discuss the reason for wanting incremental linking. Mostly it’s to make linking as fast
as possible. I’d like it if when I’m making edits to a test case, I could see the pass / fail status
of that test within say 10 ms of hitting save. Incremental linking alone isn’t sufficient to reach
this goal, but it is necessary. There is lots of work needed to get to that point, including lots of
big changes to how the Rust compiler works. Let’s leave those changes for separate discussions,
since this document is about incremental linking.</p>

<p>In order to get that kind of speed, we can’t afford to reprocess all the inputs, rewrite the entire
output etc. We need to make minimal edits to update the existing binary on disk. For example, if
we’re editing our test case, then we’d like to just be rewriting the part of the executable that
contains the compiled code for that test, plus possibly a table of line numbers used by panics.</p>

<p>A further goal of incremental linking is as a step towards hot code reloading - i.e. updating a
binary while it is running. That too, however, deserves its own document, so I won’t go into it in
detail now. It does however influence the design. For example, one idea for fast incremental linking
is to do an initial link with all your dependencies, then the final link just tacks on your code.
This might be fine for the goal of fast linking, however it doesn’t help with the eventual goal of
hot code reloading.</p>

<h1 id="out-of-scope-for-first-implementation">Out-of-scope for first implementation</h1>

<p>Linkers are pretty complex bits of software even without incremental linking. Adding incremental
updates into the mix adds significantly to this. However we can make things a little easier on
ourselves by reducing the scope somewhat.</p>

<h2 id="archive-semantics">Archive semantics</h2>

<p>One area of complexity in linkers is related to archive entries - so called “archive semantics”.
Linkers ignore entries in archives unless the archive entry defines at least one symbol that a
previous input object left undefined. This is used to avoid unnecessary initialisation of subsystems
that aren’t in use. Changing which archive entries are active in an incremental link would add
substantial complexity. Our target use case is incremental linking of Rust code, and Rust code,
while it uses archives for rlibs, doesn’t make use of archive semantics. So supporting incremental
updates of archive semantics would be a lot of work for very questionable benefit towards our use
case. For that reason, the plan is to punt on it for the first implementation.</p>

<h2 id="unused-section-garbage-collection">Unused section garbage collection</h2>

<p>Most linkers support a flag <code class="language-plaintext highlighter-rouge">--gc-sections</code>, which causes them to get rid of sections that aren’t
reachable from a root or marked as must-keep. Wild supports this flag, and in fact does it by
default. However supporting this in conjunction with incremental linking would add extra complexity,
so we’ll skip this for now. The main downsides of this are that the binary will end up a bit bigger
and we might spend a bit of extra time copying data into the output file.</p>

<h2 id="removal-of-old-merged-strings">Removal of old merged strings</h2>

<p>Strings in string-merge sections may be removed in subsequent links. Removing them from the output
would require reference counting each string. Besides taking up a little extra space in the binary,
there doesn’t seem to be much downside to keeping them around, so for now, we’ll do that.</p>

<h2 id="strictly-ordered-sections">Strictly ordered sections</h2>

<p>Our approach to incremental linking depends on not moving stuff that hasn’t changed. If we need to
put input sections in a particular order in the output, then we might need to relocate unchanged
sections in order to make space. This would hurt performance. For most output sections, it’s fine to place input
sections in any order. However, a few require a specific order. For example <code class="language-plaintext highlighter-rouge">.init</code> is a section that contains a single
function and parts of that function come from different object files. The return instruction for
this function comes from <code class="language-plaintext highlighter-rouge">crtn.o</code> and it’s essential that this goes at the end of the output
section.</p>

<p>For now, we’ll just fall back to a full initial-incremental link if any sections that have strict
ordering get changed. Fortunately these aren’t the kinds of sections that you tend to edit when
iterating on code unless you’re doing something very obscure. Modern code tends not to use these
sections anyway, but rather uses <code class="language-plaintext highlighter-rouge">.init_array</code> which we discuss later in this document.</p>

<h1 id="configuring-incremental-linking">Configuring incremental linking</h1>

<p>To enable incremental linking, I’ll add a flag <code class="language-plaintext highlighter-rouge">--incremental</code>. I’ll probably also support setting
an environment variable - <code class="language-plaintext highlighter-rouge">WILD_INCREMENTAL=1</code>, since in many cases that may be easier for a user to
set than a flag.</p>

<p>I’ll likely add additional configuration, in particular a flag for what percentage of additional
space to add to each output section to allow for growth.</p>

<h1 id="object-diffing">Object diffing</h1>

<p>Ideally, when doing an incremental link, the compiler would pass only the bits that have changed to
the linker. This would be in the form of a list of updated, added and maybe deleted sections. Each
section would generally contain exactly one symbol that would point to the start of the section,
however we also need to handle sections with zero symbols or with more than one.</p>

<p>Unfortunately, for the time being, the compiler isn’t going to give us this list of updated
sections, so we’ll have to compute it ourselves. By separating incremental linking into two phases -
(1) computing a diff then (2) applying a diff - it’ll be easier to take advantage of future
compilers that can just supply us the diff directly. This separation may also open up extra options
for debugging and testing of incremental linking - e.g. testing each part separately.</p>

<p>The first stage of diffing will be determining which files have changed. We could hash the entirety
of each file, however, with lots of input files, that would be expensive, so instead the plan is to
just check to see if the modification timestamp has changed.</p>
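
<p>As a rough sketch of that check using only the standard library (the real implementation would also need to record the timestamps from the previous link somewhere):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::path::Path;
use std::time::SystemTime;

/// Returns true if `path` appears to have changed since `previous_mtime`
/// was recorded. Errors are treated as "changed" so that we err on the
/// side of re-examining the file.
fn file_changed(path: &amp;Path, previous_mtime: SystemTime) -&gt; bool {
    std::fs::metadata(path)
        .and_then(|m| m.modified())
        .map(|mtime| mtime != previous_mtime)
        .unwrap_or(true)
}
</code></pre></div></div>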

<p>Once we have a list of changed files to look at, we can open each changed file and determine which
sections have changed.</p>

<p>Matching sections between the old and new versions of the object file is slightly tricky. For
sections containing code, the section should have a name that includes the mangled symbol name of
the function. e.g. <code class="language-plaintext highlighter-rouge">.text._ZN4core3fmt9Formatter9write_fmt</code> These are easy to match, since the name
should remain unchanged. However, sections containing anonymous data are harder. They have names
like <code class="language-plaintext highlighter-rouge">.rodata..L__unnamed_75</code>, which will likely change when edits are made. My plan at this stage
is to match these sections by looking at what references them. So for example if
<code class="language-plaintext highlighter-rouge">.text._ZN4core3fmt9Formatter9write_fmt</code> in the old object file references <code class="language-plaintext highlighter-rouge">.rodata..L__unnamed_75</code>
then in the new object file it references <code class="language-plaintext highlighter-rouge">.rodata..L__unnamed_78</code>, we’d match those two sections
for the purposes of diffing.</p>

<p>In order to diff the old object file against the new object file, we need to keep a copy of the old
object file. This can be done relatively quickly by making a hard link for each input file. Files to
which we don’t have write access, or which are located on a different filesystem than our
incremental state directory, would be skipped. These are likely system libraries and are unlikely to
change. If they do end up changing, then we’d need to link from scratch.</p>
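
<p>A sketch of that snapshotting step (<code class="language-plaintext highlighter-rouge">state_dir</code> here is a hypothetical handle to the incremental state directory described in the next section):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::path::Path;

/// Hard-links `input` into the incremental state directory so that the old
/// version can later be diffed against whatever the compiler writes next.
/// Failures (different filesystem, no write access) are ignored; if such a
/// file later changes, we fall back to linking from scratch.
fn snapshot_input(input: &amp;Path, state_dir: &amp;Path) {
    if let Some(name) = input.file_name() {
        let _ = std::fs::hard_link(input, state_dir.join(name));
    }
}
</code></pre></div></div>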

<p>Rust writes each crate that we depend on as an .rlib file. These are archives which will often
contain multiple object files - one for each codegen unit. When such a crate is edited, one or more
of the codegen units within the archive will be updated. Unlike with bare object files on disk, we
can’t rely on timestamps to determine whether a file within the archive has changed. At least we
can’t at the moment because rustc doesn’t set the timestamp field. We can probably just compare the
bytes of the files directly for now.</p>

<h1 id="persistent-state">Persistent state</h1>

<p>Wild will need to write various bits of state to disk in order to support making incremental updates
to the output file. My plan at this stage is to put these into a directory with a name based on the
output file. For example, if the output file is <code class="language-plaintext highlighter-rouge">target/debug/ripgrep</code>, then the incremental
directory would be <code class="language-plaintext highlighter-rouge">target/debug/ripgrep.incr</code>.</p>
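
<p>Deriving that directory name is straightforward; a small sketch of the naming scheme:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::path::{Path, PathBuf};

/// e.g. `target/debug/ripgrep` -&gt; `target/debug/ripgrep.incr`
fn incremental_dir(output: &amp;Path) -&gt; PathBuf {
    let mut name = output
        .file_name()
        .map(|n| n.to_os_string())
        .unwrap_or_default();
    name.push(".incr");
    output.with_file_name(name)
}
</code></pre></div></div>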

<p>When accessing state files during an incremental link, we’ll often want to avoid reading the entire
file. In most cases, this will be done by using mmap to access the file. This means that the on-disk
and in-memory format will need to be the same. We don’t need to worry about things like endianness
of the data, since moving the incremental link state between machines isn’t a use-case we intend to
support.</p>
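
<p>A sketch of what opening such a state file might look like, assuming the <code class="language-plaintext highlighter-rouge">memmap2</code> crate (not necessarily what Wild will end up using):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::fs::File;
use std::path::Path;

/// Maps a state file so that it can be read in place, without
/// deserialising it. The caller reinterprets the bytes as whatever
/// table type the file holds.
fn map_state_file(path: &amp;Path) -&gt; std::io::Result&lt;memmap2::Mmap&gt; {
    let file = File::open(path)?;
    // Safety: we rely on nothing truncating or rewriting the file while
    // it's mapped.
    unsafe { memmap2::Mmap::map(&amp;file) }
}
</code></pre></div></div>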

<h2 id="previous-input-files-and-other-metadata">Previous input files and other metadata</h2>

<p>As mentioned above with regard to diffing, we’ll likely need to store copies of the old input files.
We can put these in a subdirectory of our state directory.</p>

<p>We’ll also need an index file that contains information about all of the input files and arguments
for the previous link. This file shouldn’t be large, so we can probably afford to serialise and
deserialise it each time.</p>

<p>We can store additional information here such as:</p>

<ul>
  <li>The size and capacity of each output section.</li>
  <li>The version of Wild used.</li>
  <li>Any additional small bits of information such as sizes of various tables.</li>
</ul>

<h2 id="symbol-name-to-symbol-id-map">Symbol name to symbol ID map</h2>

<p>When linking the updated code, we need to be able to quickly look up symbols by name and we don’t
want to have to rebuild the map from symbol names to symbol IDs every time we do an incremental
link. This means that we’ll need to persist our map from symbol names to symbol IDs to disk.
Currently, this is a hashmap stored in memory. The keys of this map are currently <code class="language-plaintext highlighter-rouge">&amp;[u8]</code> - i.e.
slices of bytes. These slices reference data from the original input objects. This means that when
building this hashmap, we don’t currently copy the bytes of the symbol names, we just use them
in-place. Persisting this is somewhat tricky.</p>

<p>In the short term, the easiest option is probably to just accept that when incremental linking is
enabled, we’ll need to copy the symbol names into our map. If we’re doing that, we can use some
existing crate like <code class="language-plaintext highlighter-rouge">sled</code> to store our map. Besides needing to copy our symbol names, sled does
other things that we don’t really need like transactions. But it’ll get us going quickly and we can
iterate from there.</p>

<p>Longer term, I think what will give the best performance will be something like <code class="language-plaintext highlighter-rouge">odht</code> (an on-disk
hash table), but where the keys are external to the table. So hashing or comparing a key would
involve an external lookup to fetch the bytes of the actual key.</p>

<p>It’s likely that whatever we do here, it’ll be slower than what we’re doing now. We don’t want to
slow down the linker if incremental linking is disabled, so we’ll need to keep the existing
in-memory hashmap implementation around. We should be able to switch between the in-memory and the
on-disk maps by making code that does name lookups generic over some trait.</p>
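
<p>The trait in question might look something like this (the names here are made up for illustration, not Wild’s actual API):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Hypothetical trait abstracting over where the symbol-name map lives, so
/// that lookup code doesn't care whether it's in memory or in a state file.
trait SymbolNameLookup {
    fn symbol_id(&amp;self, name: &amp;[u8]) -&gt; Option&lt;SymbolId&gt;;
}

#[derive(Clone, Copy)]
struct SymbolId(u32);

/// The non-incremental case: a hashmap whose keys borrow from the inputs.
struct InMemoryNames&lt;'data&gt; {
    map: std::collections::HashMap&lt;&amp;'data [u8], SymbolId&gt;,
}

impl SymbolNameLookup for InMemoryNames&lt;'_&gt; {
    fn symbol_id(&amp;self, name: &amp;[u8]) -&gt; Option&lt;SymbolId&gt; {
        self.map.get(name).copied()
    }
}

// An on-disk implementation (e.g. backed by `sled` or an odht-style table)
// would implement the same trait, copying or hashing keys as needed.
</code></pre></div></div>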

<h2 id="symbol-resolution-table">Symbol resolution table</h2>

<p>We currently store a table in which we can look up the address, GOT address, PLT address etc of each
symbol. This is stored in memory as a <code class="language-plaintext highlighter-rouge">Vec</code>. This can probably just be changed to an mmapped file
instead. Accessing this table should be pretty similar whether it’s backed by a Vec or by a file.
Initialising it will be different and a bit slower for file-based storage, since we’d need to
zero-initialise the file when we create it before we could mmap it, whereas currently we don’t
zero-initialise and just write the resolutions concurrently from multiple threads. We could
experiment with not using mmap for the initial write of this file, i.e. just create the Vec, then
write the bytes of the Vec to a file.</p>

<h2 id="relocation-reverse-index">Relocation reverse index</h2>

<p>When a symbol gets moved, e.g. because it’s in a section that got updated and the new version of
that section didn’t fit in the old spot, we need to update all references to that symbol to point to
the new location.</p>

<p>In order to do this efficiently, we need to store all relocations indexed by the symbol to which
they refer. Doing this efficiently from multiple threads without ending up with non-deterministic
results is somewhat tricky. Certainly creating a Vec for each symbol to hold all the references to
that symbol would likely be too expensive.</p>

<p>My current plan is, for each symbol, to store the index of the first relocation that references that
symbol. Then for each relocation, store the index of the next relocation for the same symbol. This
would mean that the list of relocations for a symbol is stored effectively as an index-based linked
list within the list of relocations.</p>

<p>This approach to storage can be done with 2 or 3 flat files that we can treat as mutable slices of
their respective data types. We can build these in-place from multiple threads provided we use
atomic compare-exchange operations to update the list heads.</p>
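
<p>A sketch of that layout, assuming two flat arrays (one head per symbol, one next-pointer per relocation); indexes are stored offset by one so that zero can act as the end-of-list marker:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::atomic::{AtomicU32, Ordering};

/// Sketch of the index-based linked list described above.
struct RelocationIndex {
    /// For each symbol, the (offset-by-one) index of the first relocation
    /// that references it. Zero means "no relocations yet".
    head_by_symbol: Vec&lt;AtomicU32&gt;,
    /// For each relocation, the (offset-by-one) index of the next
    /// relocation that references the same symbol.
    next: Vec&lt;AtomicU32&gt;,
}

impl RelocationIndex {
    /// Records that relocation `reloc` references `symbol`. Can be called
    /// from multiple threads; each per-symbol list ends up complete, though
    /// its order depends on thread timing.
    fn add(&amp;self, symbol: usize, reloc: u32) {
        let head = &amp;self.head_by_symbol[symbol];
        let mut current = head.load(Ordering::Relaxed);
        loop {
            // Point our relocation at whatever is currently first...
            self.next[reloc as usize].store(current, Ordering::Relaxed);
            // ...then try to make our relocation the new head.
            match head.compare_exchange_weak(
                current,
                reloc + 1,
                Ordering::Release,
                Ordering::Relaxed,
            ) {
                Ok(_) =&gt; return,
                Err(actual) =&gt; current = actual,
            }
        }
    }

    /// Iterates over the relocations that reference `symbol`.
    fn relocations_for(&amp;self, symbol: usize) -&gt; impl Iterator&lt;Item = u32&gt; + '_ {
        let mut current = self.head_by_symbol[symbol].load(Ordering::Acquire);
        std::iter::from_fn(move || {
            let index = current.checked_sub(1)?;
            current = self.next[index as usize].load(Ordering::Relaxed);
            Some(index)
        })
    }
}
</code></pre></div></div>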

<h2 id="dynamic-relocation-reverse-index">Dynamic relocation reverse index</h2>

<p>Input sections may contain relocations that refer to symbols provided by shared objects. Such
relocations cannot be resolved at link time and must instead be resolved at runtime. This is done by
emitting dynamic relocations. Executable code will generally not make direct use of such
relocations, but instead use the global offset table (GOT) which will then have the dynamic
relocation. Data sections however often contain vtables which will need dynamic relocations. If such
a data section gets removed or updated, then we need to make sure we remove or update any dynamic
relocations associated with the old version of that section.</p>

<p>Wild doesn’t currently have a concept of a global section ID. All storage of information about
sections is currently on a per-input-file basis. This is inconvenient for the purposes of storing a
reverse index for dynamic relocations, so probably, I’ll introduce a global input section ID, then
store a table from input section ID to the first dynamic relocation for that input section. All the
dynamic relocations for a section should be adjacent, so we can also just store the count of
relocations.</p>
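
<p>A minimal sketch of what that table might look like (names are hypothetical); the same start-plus-count representation also works for the exception frame index described below:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Hypothetical global identifier for an input section.
#[derive(Clone, Copy)]
struct InputSectionId(u32);

/// Where a section's dynamic relocations live. Because all of a section's
/// dynamic relocations are adjacent in the output, a start index plus a
/// count is enough to find (and later remove or update) them.
#[derive(Clone, Copy, Default)]
struct DynamicRelocRange {
    first: u32,
    count: u32,
}

/// Indexed by `InputSectionId`.
struct DynamicRelocIndex {
    ranges: Vec&lt;DynamicRelocRange&gt;,
}

impl DynamicRelocIndex {
    fn range(&amp;self, section: InputSectionId) -&gt; DynamicRelocRange {
        self.ranges[section.0 as usize]
    }
}
</code></pre></div></div>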

<h2 id="exception-frames">Exception frames</h2>

<p>Exception frames are needed in order for backtraces and panics to work. Information about all
executable sections is put in the <code class="language-plaintext highlighter-rouge">.eh_frame</code> section. The linker splits this input section up by
locating the individual frame description entries (FDEs) then recombining them into the output
<code class="language-plaintext highlighter-rouge">.eh_frame</code> section. The linker also needs to write a <code class="language-plaintext highlighter-rouge">.eh_frame_hdr</code> section which is a sorted
index of frame addresses and is used at runtime to do a binary search in order to locate the frame
information for a particular address.</p>

<p>When an executable section is updated or removed, we need to update or remove the corresponding
FDEs. Any change to the FDEs will require a corresponding change to the sorted <code class="language-plaintext highlighter-rouge">.eh_frame_hdr</code>
section.</p>

<p>Similar to dynamic relocations, all FDEs for an input section will be adjacent in the output file,
so a start index and a count should be sufficient to identify which FDEs belong to a particular
input section.</p>

<h2 id="string-merge-index">String merge index</h2>

<p>Input sections that have the M (merge) and S (string) bits set are string-merge sections. At link
time, we locate each string by looking for its null terminator, then deduplicate the string with
other strings that are destined for the same output section.</p>

<p>When we incrementally link, some of the strings in a string-merge section may have changed. Even if
none have changed, we still need a way to look up the address for a particular string. This means
that we need to persist, for each output string-merge section, an index of where each string is
located. In some ways this is similar to our symbol-name to symbol-ID map. As with that map, we’ll
initially use a third-party on-disk database like <code class="language-plaintext highlighter-rouge">sled</code>, then later look at more optimised options
to avoid copying the actual strings.</p>

<h1 id="logging">Logging</h1>

<p>A log of links will by default be written to the user’s <a href="https://docs.rs/directories/5.0.1/directories/struct.ProjectDirs.html#method.state_dir">state
directory</a>.
This will be able to be displayed by running <code class="language-plaintext highlighter-rouge">wild log</code> and will show a line per linker invocation
with information about whether we did a full link or an incremental link and if we did a full link,
the reason why. The intention here is to provide a way for a user to be able to diagnose why
incremental linking isn’t behaving as expected.</p>

<h1 id="algorithm">Algorithm</h1>

<p>Once incremental linking is implemented, the linker will have three modes of operation.</p>

<ul>
  <li>Non-incremental. In this mode, it’ll behave much like it does now.</li>
  <li>Initial-incremental. It’ll link from scratch but prepare for subsequent incremental linking.
Output sections will have additional space allocated so that they can grow and various state files
will be written.</li>
  <li>Incremental-update. Update the output file by making minimal changes and leaving the rest in
place. Will also need to update the state files to reflect changes that were made.</li>
</ul>

<p>The following is a rough outline of the proposed algorithm for an incremental-update. If any stage
fails, then it’ll fall back to doing initial-incremental.</p>

<ul>
  <li>Check changes in flags.</li>
  <li>Check if a previous attempt to incrementally link was interrupted or didn’t complete for some reason.</li>
  <li>Identify changed files.</li>
  <li>Diff changed files to produce section update list.</li>
  <li>Determine how much additional space needs to be used in each output section. This includes
generated sections such as the global offset table (GOT), dynamic relocations etc.</li>
  <li>Allocate addresses for each changed / added section. A section that has run out of space will
result in failure (fallback to initial-incremental), however this may be relaxed in future for
cases where we can safely create an additional section of the same type.</li>
  <li>Update symbol resolutions and record which symbols have changed their resolution.</li>
  <li>Write updated / added sections to the output file.</li>
  <li>Rewrite relocations for symbols with changed resolutions.</li>
  <li>Add / remove / update dynamic relocations</li>
  <li>Add / remove / update exception frame information</li>
  <li>Update .eh_frame_hdr by performing insertions and removals corresponding to FDEs that we added /
removed. We can either do this by sorting the list of additions / removals, then doing a single
pass over .eh_frame_hdr to merge in the added / removed index entries, or we could rebuild and
resort the entire index.</li>
  <li>Update other state files.</li>
</ul>

<h1 id="sections-that-cant-contain-gaps">Sections that can’t contain gaps</h1>

<p>Sections containing code or data are generally fine to have gaps within them. However there are some
sections that cannot contain gaps or where if there are gaps, they need special handling. For
example <code class="language-plaintext highlighter-rouge">.init_array</code> is a list of pointers to initialiser functions that get run on startup. An
uninitialised element of this array would lead to undefined behaviour (likely a crash). For a
section like this that we know contains function pointers, we could fill gaps with pointers to a
no-op function. However, custom sections can also be declared where the linker generates symbols
that point to the start and end of the section. For such custom sections, we don’t have any
reasonable filler value to put in gaps.</p>

<p>Where gaps would be left, we can probably relocate input sections from the end of the output section
to fill the gaps. This should work provided all the input sections are the same size - generally the
case when these sections actually just contain pointers. If we have input sections with different
sizes, then we might need to rewrite the whole output section, although initially, we’ll probably
just fail the incremental update and fall back to a full initial-link.</p>

<h1 id="testing">Testing</h1>

<p>Most of Wild’s tests are small programs written in C, assembly, Rust etc. These programs get
compiled then linked with both GNU ld and Wild. They then get executed to make sure they produce the
expected result. We also compare the outputs using linker-diff (part of the Wild repository) which
helps by making it more obvious what we’re getting wrong and also picks up some kinds of bugs that
just executing our test binaries might not detect.</p>

<p>In order to test incremental linking, we can extend this system by compiling multiple versions of
each input file. For C code, we could predefine some macro, for example <code class="language-plaintext highlighter-rouge">-D WILD_INC=1</code> that the
code can then use to switch between different definitions of some function or data.</p>
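
<p>For Rust test programs, an analogous switch could be a <code class="language-plaintext highlighter-rouge">--cfg</code> flag; a sketch of what such a two-version test input might look like:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Compiled once without the cfg and once with `--cfg wild_inc` to produce
// the "before" and "after" inputs for an incremental-link test.
#[cfg(not(wild_inc))]
fn value() -&gt; u32 {
    1
}

#[cfg(wild_inc)]
fn value() -&gt; u32 {
    2
}

fn main() {
    // The test harness checks that the incrementally relinked binary
    // reports the updated value.
    println!("{}", value());
}
</code></pre></div></div>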

<p>In addition to diffing the resulting binaries against the output of GNU ld, we can also diff the
incrementally linked output from Wild against a non-incremental output of Wild for the same inputs.</p>

<h1 id="feedback">Feedback</h1>

<p>Hopefully most, or at least some of that made sense. If you have any thoughts or questions, please
do reach out. My contact details can be found on my <a href="/about">about page</a> or you can comment on the
Reddit thread that I’ll link below.</p>

<h1 id="thanks">Thanks</h1>

<p>Thanks to my <a href="https://github.com/sponsors/davidlattimore">github sponsors</a>. Your contributions help
to make it possible for me to continue to work on this kind of stuff rather than going and getting a
“real job”.</p>

<ul>
  <li>bearcove</li>
  <li>repi</li>
  <li>marxin</li>
  <li>bes</li>
  <li>Urgau</li>
  <li>jonhoo</li>
  <li>Kobzol</li>
  <li>pmarks</li>
  <li>coastalwhite</li>
  <li>mstange</li>
  <li>twilco</li>
  <li>binarybana</li>
  <li>willstott101</li>
  <li>bcmyers</li>
  <li>Shnatsel</li>
  <li>Rafferty97</li>
  <li>joshtriplett</li>
  <li>teburd</li>
  <li>wezm</li>
  <li>davidcornu</li>
  <li>tommythorn</li>
  <li>flba-eb</li>
  <li>acshi</li>
  <li>gendx</li>
  <li>teh</li>
  <li>nazar-pc</li>
  <li>yerke</li>
  <li>drmason13</li>
  <li>NobodyXu</li>
  <li>jplatte</li>
  <li>ymgyt</li>
  <li>Pratyush</li>
  <li>+2 anonymous</li>
</ul>

<h1 id="discussion-threads">Discussion threads</h1>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1gvdref/designing_wilds_incremental_linking/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[Designing Wild’s incremental linking]]></summary></entry><entry><title type="html">Video: Wild linker talk at GOSIM China 2024</title><link href="https://davidlattimore.github.io/posts/2024/11/12/gosim-2024-wild-linker-talk.html" rel="alternate" type="text/html" title="Video: Wild linker talk at GOSIM China 2024" /><published>2024-11-12T13:00:00+00:00</published><updated>2024-11-12T13:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/11/12/gosim-2024-wild-linker-talk</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/11/12/gosim-2024-wild-linker-talk.html"><![CDATA[<p>In October, I attended the open source conference, GOSIM 2024 in China where I gave a talk about the
Wild linker.</p>

<p><a href="https://www.youtube.com/watch?v=XFSwmSXv2QA">Video</a></p>

<p><a href="https://www.reddit.com/r/rust/comments/1gq0x3t/video_of_wild_linker_talk_at_gosim_2024/">Discussion on
Reddit</a></p>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[In October, I attended the open source conference, GOSIM 2024 in China where I gave a talk about the Wild linker.]]></summary></entry><entry><title type="html">Rust dylib rabbit holes</title><link href="https://davidlattimore.github.io/posts/2024/08/27/rust-dylib-rabbit-holes.html" rel="alternate" type="text/html" title="Rust dylib rabbit holes" /><published>2024-08-27T14:00:00+00:00</published><updated>2024-08-27T14:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/08/27/rust-dylib-rabbit-holes</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/08/27/rust-dylib-rabbit-holes.html"><![CDATA[<p>Bevy is a popular game engine for Rust. It’s pretty large and compilation times can be an issue. To
help with this, Bevy provides an optional feature that when enabled, compiles most of Bevy as a
dynamic library. This allows for faster iteration as you don’t need to relink all the Bevy internals
each time you rebuild.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo run <span class="nt">--features</span> bevy/dynamic_linking
</code></pre></div></div>

<p>I was experimenting with this from the perspective of testing and profiling the linker I’m writing,
Wild (see <a href="https://davidlattimore.github.io/">previous posts</a>).</p>

<p>With that in mind, I was mostly looking at (a) how long it takes to link and (b) how well the
resulting .so file works.</p>

<p>Initially, I was only looking at debug builds. To speed up the build, I turned off debug info.</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[profile.dev]</span>
<span class="py">debug</span> <span class="p">=</span> <span class="kc">false</span>
</code></pre></div></div>

<p>So this was perhaps more accurately described as a non-optimised build. Having optimisations off
should make the build faster right? Probably it does, but it doesn’t necessarily make linking
faster. Here’s the times for linking this shared object:</p>

<table>
  <thead>
    <tr>
      <th>Linker</th>
      <th>Time (ms)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>lld (18)</td>
      <td>1975</td>
    </tr>
    <tr>
      <td>mold (2.32.1)</td>
      <td>1763</td>
    </tr>
    <tr>
      <td>wild</td>
      <td>895</td>
    </tr>
  </tbody>
</table>

<p>I’ll not include GNU ld because it takes more than 10 seconds, making it painful to benchmark.</p>

<p>If we now set <code class="language-plaintext highlighter-rouge">opt-level = 2</code>, then the link time drops quite dramatically:</p>

<table>
  <thead>
    <tr>
      <th>Linker</th>
      <th>Time (ms)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>lld (18)</td>
      <td>545</td>
    </tr>
    <tr>
      <td>mold (2.32.1)</td>
      <td>287</td>
    </tr>
    <tr>
      <td>wild</td>
      <td>183</td>
    </tr>
  </tbody>
</table>

<p>I sometimes wonder if Rust (or more accurately Cargo) needs a third default profile “fastbuild” that
doesn’t have debug info and is optimised for building fast. I’m sure there are a bunch of tradeoffs
between compilation speed and debuggability that currently favour the latter. I bet there are
optimisations that, if applied, would speed up the build, especially a warm build, but which are
disabled in debug builds because they might make it harder to use a debugger on the code.</p>

<p>But what really drew my attention with the non-optimised build was what it’s getting the linker to
do. We’re creating a shared object (.so file on Linux). Rustc gives instructions to the linker to
tell it which symbols need to be exported. If a symbol is exported from the shared object, then an
executable or another shared object that depends on our shared object can make use of those symbols.
If a symbol isn’t exported then it cannot be directly referenced from outside the shared object.</p>

<p>In order to control which symbols get exported from the shared object, the linker is passed a
version script which specifies which symbols should be global and then downgrades the rest to local.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  global:
    rust_metadata_bevy_dylib_2f311168f6c5d4f8;
    _ZN9hashbrown3set24HashSet$LT$T$C$S$C$A$GT$6insert17hcb8b576667efe889E;
    _ZN9hashbrown3set24HashSet$LT$T$C$S$C$A$GT$6remove17h53654c4e42de8b15E;
....

  local:
    *;
};
</code></pre></div></div>

<p>For a non-optimised build, this version script lists more than 300k symbols to export! Contrast this
with the optimised build, where it lists only 18k symbols. Looking into this a bit, the majority of
the extra symbols happen because non-optimised builds enable <code class="language-plaintext highlighter-rouge">-Z share-generics</code> by default. These
shared generics not only get exported from the crates that monomorphise them, they also get exported
from the dylib. The remainder of the extra symbols look to be functions that would have been inlined
in an optimised build. It seems somewhat surprising that a public function would be exported from
a dylib only if it didn’t get inlined.</p>

<p>But let us for the moment assume that we actually for some reason need all 300k symbols to be
exported.</p>

<p>When a dynamically linked executable or a shared object gets loaded, the runtime can look up symbols
that are provided by other shared objects. On Linux, symbol lookups can either be eager, meaning
that they happen when the binary is loaded, or lazy, meaning that the symbol is only looked up when
the function is first called. For security reasons, lazy binding is less popular these days and Rust
indeed sets linker flags to bind symbols at load time.</p>

<p>For shared objects produced by rustc, most of these non-lazy symbol lookups are done with <code class="language-plaintext highlighter-rouge">GLOB_DAT</code>
relocations. These relocations are instructions to the runtime to put the address of a symbol with a
particular name at a particular location in memory. For example, the following relocation says to
look up the symbol <code class="language-plaintext highlighter-rouge">__rust_alloc</code>, then put the address of that symbol at address <code class="language-plaintext highlighter-rouge">0x93ec698</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000093ec698  0000000700000006 R_X86_64_GLOB_DAT      0000000000000000 __rust_alloc + 0
</code></pre></div></div>

<p>If we check how many <code class="language-plaintext highlighter-rouge">GLOB_DAT</code> relocations are in our bevy shared object, we get a bit of a
surprise.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>readelf <span class="nt">-W</span> <span class="nt">-r</span> libbevy_dylib.so | <span class="nb">grep </span>GLOB_DAT | <span class="nb">wc</span> <span class="nt">-l</span>
291185
</code></pre></div></div>

<p>But <code class="language-plaintext highlighter-rouge">GLOB_DAT</code> is for resolving references to symbols that the shared object depends on, so why is
the number of outgoing references so similar to the number of symbols that the shared object
exports?</p>

<p>Indeed, it turns out that this isn’t a coincidence. The majority of the symbols with <code class="language-plaintext highlighter-rouge">GLOB_DAT</code>
relocations are for symbols that are defined by the dylib itself.</p>

<p>But why would the dylib request runtime resolution of a symbol that it itself defines? Dynamic
linking on Linux allows symbols defined by shared objects to be overridden (also known as
“interposing”). One use-case for this is to override the allocator provided by libc in order to
perform runtime checks.</p>

<p>But we don’t really want to be able to override all these symbols, we just want them to be exported
so that they can be used by our binary that uses the shared object. When the compiler builds an
object file on Linux, symbols can be local or global. Locals are only accessible within that codegen
unit, while globals can be referenced from other codegen units. Global symbols can then be further
restricted by setting their visibility, which affects how they’ll be treated when dynamic linking.</p>

<table>
  <thead>
    <tr>
      <th>Binding</th>
      <th>Visibility</th>
      <th>Accessible from other codegen units?</th>
      <th>Accessible from other dynamic objects?</th>
      <th>Can be overridden?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Local</td>
      <td> </td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Global</td>
      <td>Hidden</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Global</td>
      <td>Protected</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Global</td>
      <td>Default</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>Yes</td>
    </tr>
  </tbody>
</table>

<p>The key difference here is between default visibility and protected visibility. The latter means
that the symbol cannot be interposed (overridden). A default visibility symbol however can be
interposed, which means that if another shared object earlier in the load order, or the executable
itself defines a symbol with the same name, that will take precedence.</p>

<p>OK, so we just need to set all our symbols to protected. That way they’ll be exported from the
shared object, but won’t be permitted to be overridden.</p>

<p>I found the code in rustc that sets symbol visibility and prototyped changing it to set symbols to
have protected visibility unless the symbol was marked as <code class="language-plaintext highlighter-rouge">#[no_mangle]</code>. This worked and
drastically reduced the number of <code class="language-plaintext highlighter-rouge">GLOB_DAT</code> relocations. To test how much of a difference this
makes, I tried loading shared objects with and without this change.</p>

<ul>
  <li>Default visibility: Shared object took about 150ms to load.</li>
  <li>Protected visibility: Shared object took about 5ms to load.</li>
</ul>

<p>OK, that’s great. At that point I thought I should look for existing issues related to this and
indeed found one. The creator of the cranelift backend for rustc, bjorn3, had also attempted to
change symbols to use protected visibility, but had hit issues when linking with GNU ld.</p>

<p>GNU ld complains that direct references to protected symbols cannot be used when building a shared
object. I tried GNU ld and got the same problem.</p>

<p>But let’s think about this for a moment: why can’t a shared object have direct references to a
protected symbol? It cannot be overridden, so it should be fine to reference it directly. Right?</p>

<p>To understand what GNU ld’s objection is here, we need to look at how GCC compiles C code. We’ll
start by looking at what it does with C code that references data.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kt">int</span> <span class="n">my_value</span><span class="p">;</span> <span class="c1">// Likely from a header file</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">my_value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>First, let’s look at what the clang compiler does with this.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   <span class="err">0:</span>	<span class="err">48</span> <span class="err">8</span><span class="nf">b</span> <span class="mi">05</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> 	<span class="nv">mov</span>    <span class="mh">0x0</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">),</span><span class="o">%</span><span class="nb">rax</span>        <span class="err">#</span> <span class="mi">7</span> <span class="o">&lt;</span><span class="nv">main</span><span class="o">+</span><span class="mh">0x7</span><span class="o">&gt;</span>
			<span class="err">3:</span> <span class="nf">R_X86_64_REX_GOTPCRELX</span>	<span class="nv">my_value</span><span class="o">-</span><span class="mh">0x4</span>
   <span class="err">7:</span>	<span class="err">8</span><span class="nf">b</span> <span class="mi">00</span>                	<span class="nv">mov</span>    <span class="p">(</span><span class="o">%</span><span class="nb">rax</span><span class="p">),</span><span class="o">%</span><span class="nb">eax</span>
</code></pre></div></div>

<p>The first instruction is reading a pointer to our variable <code class="language-plaintext highlighter-rouge">my_value</code> from the GOT (global offset
table). The GOT is a table of pointers. These pointers are generally populated by the runtime at
startup to point to functions and variables that come from different shared objects.</p>

<p>The second instruction then loads the value from that pointer. This instruction sequence will work
fine even if the variable <code class="language-plaintext highlighter-rouge">my_value</code> ends up coming from a shared object.</p>

<p>If the variable ends up being statically linked into our binary, then the linker will transform this
assembly to:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="err">1130:</span>       <span class="err">48</span> <span class="err">8</span><span class="nf">d</span> <span class="mi">05</span> <span class="nv">f1</span> <span class="mi">2</span><span class="nv">e</span> <span class="mi">00</span> <span class="mi">00</span>    <span class="nv">lea</span>    <span class="mh">0x2ef1</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">),</span><span class="o">%</span><span class="nb">rax</span>        <span class="err">#</span> <span class="mi">4028</span> <span class="o">&lt;</span><span class="nv">my_value</span><span class="o">&gt;</span>
    <span class="err">1137:</span>       <span class="err">8</span><span class="nf">b</span> <span class="mi">00</span>                   <span class="nv">mov</span>    <span class="p">(</span><span class="o">%</span><span class="nb">rax</span><span class="p">),</span><span class="o">%</span><span class="nb">eax</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">lea</code> instruction here is loading the relative address of our variable, which is now known at
link time. That means that there’s no access to the global offset table.</p>

<p>Now, let’s look at what GCC does:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   <span class="err">4:</span>	<span class="err">8</span><span class="nf">b</span> <span class="mi">05</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span>    	<span class="nv">mov</span>    <span class="mh">0x0</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">),</span><span class="o">%</span><span class="nb">eax</span>        <span class="err">#</span> <span class="nv">a</span> <span class="o">&lt;</span><span class="nv">main</span><span class="o">+</span><span class="mh">0xa</span><span class="o">&gt;</span>
			<span class="err">6:</span> <span class="nf">R_X86_64_PC32</span>	<span class="nv">my_value</span><span class="o">-</span><span class="mh">0x4</span>
</code></pre></div></div>

<p>It’s using a PC32 relocation to access the variable <code class="language-plaintext highlighter-rouge">my_value</code>. This is a direct reference, which
will only work if the address of the variable is known at link time. i.e. this won’t (or shouldn’t
IMO) work if <code class="language-plaintext highlighter-rouge">my_value</code> comes from a shared object. If we add the flag <code class="language-plaintext highlighter-rouge">-fPIC</code> to gcc, then it
produces the same code as clang.</p>

<p>So we have a trade-off. The code that directly accesses a variable that gets statically linked into our
executable is shorter and presumably more efficient, but doesn’t work if the variable ends up
coming from a shared object. The code that does work for accessing a variable from a shared object
is slightly longer and a bit less efficient. With the linker optimising away the access to
the global offset table, the efficiency difference is pretty small, but the code remains longer than
the direct-access version.</p>

<p>I said that the direct access approach doesn’t work if the variable ends up coming from a shared
object. Unfortunately that’s not entirely true. Linkers apply a horrible hack called
copy-relocations in order to make it work. When they encounter a direct access to a variable that’s
defined by a shared object, they allocate space for that variable in BSS (a zero-initialised section
that doesn’t take up space in the file on disk), then at runtime the bytes of the variable get
copied from the shared object that defined it into that space. That copy then overrides the
definition provided by the shared object.</p>

<p><img src="/images/protected/copy-relocation.svg" alt="Diagram of a copy relocation" /></p>

<p>But what if the symbol definition in the shared object has protected visibility? That means it can’t
be overridden right? GCC chose to interpret “can’t be overridden” as “can only be overridden by a
copy relocation”.</p>

<p>For a shared object to work correctly when one of its symbols is overridden, there can’t be direct
references to the symbol within the shared object. Here we get to a point of incompatibility between
the GCC / GNU ld world and the LLVM / LLD world.</p>

<p>If we now look at the code that each compiler produces for putting into a shared object, we can see
the other side of this difference. Here’s our C code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span><span class="p">((</span><span class="n">visibility</span><span class="p">(</span><span class="s">"protected"</span><span class="p">)))</span>
<span class="kt">int</span> <span class="n">my_value</span> <span class="o">=</span> <span class="mi">42</span><span class="p">;</span>

<span class="kt">int</span> <span class="nf">get_my_value</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">return</span> <span class="n">my_value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We tell both compilers that we might put this into a shared object by compiling with <code class="language-plaintext highlighter-rouge">-fPIC</code>.</p>

<p>GCC produces the following assembly for the variable access.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="err">19:</span>	<span class="err">48</span> <span class="err">8</span><span class="nf">b</span> <span class="mi">05</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> 	<span class="nv">mov</span>    <span class="mh">0x0</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">),</span><span class="o">%</span><span class="nb">rax</span>        <span class="err">#</span> <span class="mi">20</span> <span class="o">&lt;</span><span class="nv">get_my_value</span><span class="o">+</span><span class="mh">0xf</span><span class="o">&gt;</span>
			<span class="err">1</span><span class="nl">c:</span> <span class="nf">R_X86_64_REX_GOTPCRELX</span>	<span class="nv">my_value</span><span class="o">-</span><span class="mh">0x4</span>
  <span class="err">20:</span>	<span class="err">8</span><span class="nf">b</span> <span class="mi">00</span>                	<span class="nv">mov</span>    <span class="p">(</span><span class="o">%</span><span class="nb">rax</span><span class="p">),</span><span class="o">%</span><span class="nb">eax</span>
</code></pre></div></div>

<p>That is, even though the variable is protected, the generated code still accesses it via the GOT.</p>

<p>Clang however produces a more efficient direct access to the variable.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="err">14:</span>	<span class="err">8</span><span class="nf">b</span> <span class="mi">05</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span>    	<span class="nv">mov</span>    <span class="mh">0x0</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">),</span><span class="o">%</span><span class="nb">eax</span>        <span class="err">#</span> <span class="mi">1</span><span class="nv">a</span> <span class="o">&lt;</span><span class="nv">get_my_value</span><span class="o">+</span><span class="mh">0xa</span><span class="o">&gt;</span>
			<span class="err">16:</span> <span class="nf">R_X86_64_PC32</span>	<span class="nv">my_value</span><span class="o">-</span><span class="mh">0x4</span>
</code></pre></div></div>

<p>So when building an executable, GCC ends up directly referencing all symbols, even those that might
be protected symbols from a shared object. In order to make that work, it then uses indirect
references when building shared objects.</p>

<p>Clang does the opposite, using indirect references when building an executable, but then allows
direct references to protected symbols when building a shared object.</p>

<p>Mixing these two different and incompatible models of when it’s OK to directly reference something
can lead to problems. If your shared object is built by LLVM with direct access to protected
variables, and your main binary is built by GCC with direct access to all variables, we end up with
two separate copies of our variable. If the variable is mutable, then a change made in the main
binary won’t be seen by the shared object and vice versa.</p>

<p>In order to protect against this, GNU ld detects direct access to protected variables and refuses to
link the shared object. But the shared object would have worked fine so long as it was only used by
a binary compiled with LLVM (Clang).</p>

<p>This can be seen if we try to compile a shared object with Clang and link it with GNU ld:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clang <span class="nt">-fPIC</span> <span class="nt">-shared</span> b.c <span class="nt">-o</span> libb.so
/usr/bin/ld: /tmp/b-09dfbd.o: relocation R_X86_64_PC32 against protected symbol <span class="sb">`</span>my_value<span class="sb">`</span> can not be used when making a shared object
/usr/bin/ld: final <span class="nb">link </span>failed: bad value
</code></pre></div></div>

<p>The examples so far used protected symbols that were data, not functions; however, the same problem
occurs with functions. The only real difference is that the linker won’t do a copy relocation for a
function. Instead, it synthesises a PLT entry (a small bit of machine code that jumps to the actual
function) and uses that to override the function definition in the shared object.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span><span class="p">((</span><span class="n">visibility</span><span class="p">(</span><span class="s">"protected"</span><span class="p">)))</span>
<span class="kt">int</span> <span class="nf">f1</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">return</span> <span class="mi">42</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">typedef</span> <span class="nf">int</span> <span class="p">(</span><span class="o">*</span><span class="n">int_fn_t</span><span class="p">)(</span><span class="kt">void</span><span class="p">);</span>

<span class="n">int_fn_t</span> <span class="nf">get_f1_ptr2</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">return</span> <span class="o">&amp;</span><span class="n">f1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiling this code with clang causes a link failure with GNU ld:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clang <span class="nt">-shared</span> <span class="nt">-fPIC</span> x.c
/usr/bin/ld: /tmp/x-f06305.o: relocation R_X86_64_PC32 against protected symbol <span class="sb">`</span>f1<span class="sb">`</span> can not be used when making a shared object
/usr/bin/ld: final <span class="nb">link </span>failed: bad value
</code></pre></div></div>

<p>This might seem like it’s just a trade-off between optimising code in the executable (GCC) or
optimising code in the shared object (LLD), in which case we should presumably choose to optimise the
executable, since for many uses that’s where the bulk of the code lives. However, this choice relies
on copy relocations, which are, in my opinion, a hack. Like many hacks, they have a number of
downsides.</p>

<ul>
  <li>They make the size of a variable part of its ABI. i.e. a shared object that defines a symbol now
cannot change the size of that symbol without breaking the ABI.</li>
  <li>They require that the variable gets copied into writable memory. If a shared library embeds a
large bit of data, say a 100MiB machine learning model, and a copy relocation occurs, then at
startup, that 100MiB will need to be copied. Furthermore, if there are several copies of the
binary running, we’re now going to have several independent copies of that 100MiB in RAM, whereas
without a copy relocation, that 100MiB could be shared read-only between all the running
processes (see the sketch after this list).</li>
</ul>
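
<p>To make the second of those downsides concrete, here’s a minimal sketch in Rust. The crate and
symbol names are made up for illustration; the same issue applies to any language that exports a
large variable from a shared library.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical `cdylib` crate that exports a large blob of data.
// If an executable built in the GCC / GNU ld style references this symbol
// directly, the linker emits a copy relocation: space for the whole array is
// reserved in the executable's BSS and the bytes are copied there at startup,
// rather than being shared read-only between all processes that use the
// library.
#[no_mangle]
pub static MODEL_WEIGHTS: [u8; 100 * 1024 * 1024] = [1; 100 * 1024 * 1024];
</code></pre></div></div>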

<p>The Rust compiler, by default, uses LLVM to perform codegen. So when we change rustc to emit all
Rust-mangled symbols with protected visibility, LLVM does the same as Clang above and emits direct
relocations to those symbols. This is fine provided we stay in the LLVM / LLD world; however, if we
try to link using GNU ld, the link gets rejected because it doesn’t fit GNU’s model of relying on copy
relocations for shared-object variable access from the main binary.</p>

<p>All of this came about because of GCC trying to simultaneously produce optimal code for executables
while not knowing at compile time whether a symbol might come from a shared object. On Windows, a
different path was taken. There, symbols that might come from a shared object (DLL on Windows) must
be annotated in the source code with <code class="language-plaintext highlighter-rouge">__declspec(dllimport)</code>. This allows the compiler to emit
optimised, direct-access instructions for all other symbols.</p>

<p>An alternative to annotating the source to indicate whether a symbol will come from a shared object
or be linked statically is to give the compiler access to the things we’re going to link against, so
that it can find where the definition comes from and make an appropriate decision. This would never
fly in the C world, where it’s expected that you can compile code with only access to the header
file, but in more modern languages like Rust, giving the compiler access to your dependencies so it
can make this kind of decision is a more realistic option. Rust doesn’t currently do this, but it
should be possible for Rust to always make the optimal choice between a direct or an indirect
reference because it has all the information it needs to make that decision. Thanks to Reddit user
u/Zoxc32 for the correction that Rust doesn’t currently do this.</p>

<p>Using default visibility for symbols in shared objects affects not only load time for those shared
objects (150ms vs 5ms), but it also likely affects runtime performance, since all those variables
now need to be accessed via the global offset table, which means an extra pointer hop to get to the
data. There’s a good chance it also prevents LLVM from making various optimisations, since by using
default visibility, we’re effectively telling it that any of these variables or functions might be
swapped out for alternative definitions at runtime.</p>

<h1 id="some-good-news">Some good news</h1>

<p>I do my development on a system that’s based on Ubuntu 22.04, which has binutils version 2.38. Only
after writing most of this blog post did I think to try checking the behaviour of more recent
versions of GNU ld. As it turns out, binutils 2.40 fixes this problem in GNU ld.</p>

<p>Linking shared objects that have direct references to protected symbols is no longer an error.
Kudos to LLD maintainer Maskray for making this change!</p>

<p>Instead, building an executable that would require a copy relocation for a protected symbol is now
an error.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/bin/ld: /tmp/cciOjHc4.o: copy relocation against non-copyable protected symbol `my_value' in libb.so
collect2: error: ld returned 1 exit status
</code></pre></div></div>

<p>The error is now reported where it should be - when trying to build a binary that uses a shared
object with protected symbols and the compiler emitted direct references to those symbols. The fix
for that error is to compile the executable with <code class="language-plaintext highlighter-rouge">-fPIC</code> or switch to clang.</p>

<p>GCC maintains its behaviour of emitting direct relocations to variables and functions unless you
compile with <code class="language-plaintext highlighter-rouge">-fPIC</code>, but that’s much less of a problem for Rust and other languages than the
previous GNU ld behaviour.</p>

<h1 id="where-to-from-here">Where to from here?</h1>

<p>The fix to GNU ld is in binutils 2.40, which is in Ubuntu version 23.04 and later. However systems
built on 22.04 will be around for a while, so I don’t think we can just switch to protected symbols
and cause link errors on those older systems.</p>

<p>Work has been done to use lld by default for linking on Linux. This is currently on nightly versions
of rustc. If we add a flag to enable emitting of protected symbols, then we could enable that flag
when lld is being used as the linker.</p>

<p>It’s reasonable to ask: might creating shared objects with protected symbols cause those shared
objects to be unusable from programs compiled with GCC? I believe the answer is no, since we’d only
be marking Rust-mangled symbols as protected and they shouldn’t be getting referenced from code
compiled by GCC.</p>
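
<p>As a rough illustration (the function names here are invented), only the first function below gets a
Rust-mangled symbol; the second is the kind of unmangled, C-ABI symbol that code compiled by GCC
would actually reference, and it would keep default visibility:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical cdylib crate. Under the proposed change, the mangled symbol
// for `rust_only_helper` would become protected, while the `#[no_mangle]`
// C-ABI symbol below keeps default visibility and so remains usable from
// code compiled by GCC.
pub fn rust_only_helper(x: u32) -&gt; u32 {
    x + 1
}

#[no_mangle]
pub extern "C" fn c_api_entry(x: u32) -&gt; u32 {
    rust_only_helper(x)
}
</code></pre></div></div>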

<h1 id="further-resources">Further resources</h1>

<ul>
  <li>LLD maintainer, Maskray has an excellent <a href="https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected">blog
post</a>
about this topic.</li>
  <li>Removal of problematic error from GNU ld. Not sure what to link to, but you can search for “x86:
Make protected symbols local for -shared”.</li>
  <li><a href="https://sourceware.org/bugzilla/show_bug.cgi?id=28875">Disallow invalid relocation against protected symbol</a></li>
  <li>Related rustc issues:
    <ul>
      <li><a href="https://github.com/rust-lang/rust/issues/105518">Use protected visibility by default on ELF platforms</a></li>
      <li><a href="https://github.com/rust-lang/rust/issues/37530">stop exporting every symbol</a></li>
      <li><a href="https://github.com/rust-lang/rust/issues/33221">linking staticlib files into shared libraries exports all of std::</a></li>
    </ul>
  </li>
</ul>

<h1 id="thanks">Thanks</h1>

<p>Thanks to my <a href="https://github.com/sponsors/davidlattimore">github sponsors</a>. Your contributions help
to make it possible for me to continue to work on this kind of stuff rather than going and getting a
“real job”.</p>

<ul>
  <li>bearcove</li>
  <li>repi</li>
  <li>marxin</li>
  <li>bes</li>
  <li>Urgau</li>
  <li>jonhoo</li>
  <li>Kobzol</li>
  <li>coastalwhite</li>
  <li>mstange</li>
  <li>bcmyers</li>
  <li>Shnatsel</li>
  <li>Rafferty97</li>
  <li>joshtriplett</li>
  <li>teburd</li>
  <li>wezm</li>
  <li>davidcornu</li>
  <li>tommythorn</li>
  <li>flba-eb</li>
  <li>acshi</li>
  <li>teh</li>
  <li>yerke</li>
  <li>alexkirsz</li>
  <li>NobodyXu</li>
  <li>jplatte</li>
  <li>ymgyt</li>
  <li>Pratyush</li>
  <li>ethanmsl</li>
  <li>+2 anonymous</li>
</ul>

<h1 id="discussion-threads">Discussion threads</h1>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1f2s7ot/rust_dylib_rabbit_holes/">Reddit</a></li>
  <li><a href="https://news.ycombinator.com/item?id=41375491">Hacker News</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[Bevy is a popular game engine for Rust. It’s pretty large and compilation times can be an issue. To help with this, Bevy provides an optional feature that when enabled, compiles most of Bevy as a dynamic library. This allows for faster iteration as you don’t need to relink all the Bevy internals each time you rebuild.]]></summary></entry><entry><title type="html">Testing a linker</title><link href="https://davidlattimore.github.io/posts/2024/07/17/testing-a-linker.html" rel="alternate" type="text/html" title="Testing a linker" /><published>2024-07-17T00:00:00+00:00</published><updated>2024-07-17T00:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/07/17/testing-a-linker</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/07/17/testing-a-linker.html"><![CDATA[<p>I’ve been writing a linker, called Wild (see <a href="https://davidlattimore.github.io/">previous posts</a>).
Today, I’m going to talk about my approach to testing the linker. I think this is an interesting
case study in its own right, but also there are aspects of the approach that can likely be applied to
other projects.</p>

<p>The properties that I like the tests for my projects to have are:</p>

<ul>
  <li>I want to feel confident that they will pick up bugs if I introduce them when refactoring.</li>
  <li>They should be fast to run.</li>
  <li>They should be easy to diagnose what’s wrong when they fail.</li>
  <li>They should be easy to maintain. When I refactor code, I should need to change tests as little as
possible, or maybe not at all.</li>
</ul>

<p>These priorities are sometimes in conflict with each other. For example merging several tests
together into a single test might make the test suite as a whole faster, but might also make
diagnosing what’s wrong harder. Whether I choose to split or merge integration tests depends on
circumstances. Sometimes splitting is the right approach, especially if there’s common work done by
each separate test that can be cached, thus regaining the speed. Often, however, I prefer to merge.
I’m more often running tests that pass than diagnosing tests that fail, so I’d prefer the speed.
Also, often with extra tooling, diagnosing what’s wrong can be made easier, even in a large
integration test that is doing many things.</p>

<p>Unit tests can be very fast; however, when you refactor your code and change an interface that is
unit tested, the test needs updating or even rewriting. Unit tests can also very easily miss bugs
when interfaces don’t change, but assumptions about which part of the code does what do change.</p>

<p>I’ve been on projects that have relied entirely on unit tests and even with a high percentage of the
code covered by those unit tests, in the absence of good integration tests, the system has felt
incredibly fragile.</p>

<p>For these reasons, I generally focus first on integration tests, then resort to unit testing to fill
in gaps where I don’t think the integration tests are sufficient or would be too slow to cover all
the cases. I then build tooling in and around the integration tests to make them easier to diagnose
and maintain.</p>

<p>To provide some specific examples, I’ll now go into how the integration tests for the Wild linker
work.</p>

<p>When I started writing Wild, the first integration tests I wrote were of the form:</p>

<ul>
  <li>Compile a small C program using GCC</li>
  <li>Link the program using GNU ld</li>
  <li>Link the program again using Wild</li>
  <li>Run the binaries produced by both linkers and make sure they both exit with the expected exit
code.</li>
</ul>

<p>Linking with GNU ld is important in order to ensure that the test itself is correct. We want the
program to behave the same when linked with both linkers.</p>

<p>Already here we can see some opportunity to speed up our test slightly with caching. Generally when
we rerun our test it’ll be because we made a change to the linker. However GCC and GNU ld are
unlikely to have changed. So if the C program and the argument we’re passing didn’t change, then we
can skip rerunning GCC and GNU ld. This can be a significant saving, since GNU ld is really slow -
it often takes 10 to 30 times as long as Wild to link the same program.</p>
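
<p>A minimal sketch of that caching idea follows. The file naming and the helper function are invented
for illustration: hash the inputs, and only re-run GCC and GNU ld when the fingerprint changes.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::Path;

// Returns true if the cached output for `source` is still valid, i.e. neither
// the source bytes nor the compiler arguments have changed since it was made.
fn is_cached(source: &amp;Path, args: &amp;[String], cached_output: &amp;Path) -&gt; std::io::Result&lt;bool&gt; {
    let mut hasher = DefaultHasher::new();
    fs::read(source)?.hash(&amp;mut hasher);
    args.hash(&amp;mut hasher);
    let fingerprint = hasher.finish().to_string();

    // The fingerprint from the previous run is stored next to the cached output.
    let stamp = cached_output.with_extension("fingerprint");
    let unchanged = fs::read_to_string(stamp).map(|s| s == fingerprint).unwrap_or(false);
    Ok(unchanged &amp;&amp; cached_output.exists())
}
</code></pre></div></div>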

<p>Integration tests in Rust are typically put in a separate <code class="language-plaintext highlighter-rouge">tests</code> directory. Cargo will compile each
file in this directory as a separate binary. So if you have lots of completely separate integration
tests, this can get slow. For that reason, I generally only ever have a single integration test file
and do all my integration testing from that one file. It’s fine however to have multiple tests in
that file.</p>

<p>The Wild integration test compiles many small C, assembly and Rust programs, links them and runs
them. I include instructions for the test runner inline in the test in the form of specially
formatted comments.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//#Object:exit.c</span>
<span class="c1">//#ExpectSym: _start .text</span>
<span class="c1">//#ExpectSym: exit_syscall .text</span>
<span class="c1">//#EnableLinker:lld</span>

<span class="err">#</span><span class="n">include</span> <span class="s">"exit.h"</span>

<span class="n">void</span> <span class="nf">_start</span><span class="p">(</span><span class="n">void</span><span class="p">)</span> <span class="p">{</span>
   <span class="nf">exit_syscall</span><span class="p">(</span><span class="mi">42</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the example here, the first line tells the test runner to compile exit.c as an object file and
include that in the link. Then there’s a couple of assertions to check that some symbols are in the
correct output sections. The last instruction tells the test runner to enable linking with lld. This
is in addition to GNU ld and Wild, which are always enabled for all tests.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//#AbstractConfig:default</span>
<span class="c1">//#DiffIgnore:section.tdata.alignment</span>

<span class="c1">//#Config:llvm-static:default</span>
<span class="c1">//#CompArgs:--target x86_64-unknown-linux-musl -C relocation-model=static -C target-feature=+crt-static -C debuginfo=2</span>

<span class="c1">//#Config:cranelift-static:default</span>
<span class="c1">//#CompArgs:-Zcodegen-backend=cranelift --target x86_64-unknown-linux-musl -C relocation-model=static -C target-feature=+crt-static -C debuginfo=2 --cfg cranelift</span>

<span class="c1">//#Config:llvm-dynamic:default</span>
<span class="c1">//#CompArgs:-C debuginfo=2</span>
<span class="c1">//#DiffIgnore:.dynamic.DT_JMPREL</span>
</code></pre></div></div>

<p>In this more complex example, we’ve defined an abstract config in which we provide some default
settings. Then we have several configurations that inherit from that config and override various
properties. Each config has a unique name that is used for naming output files and when reporting
test failures. This test has a configuration that statically links with musl libc, one that uses the
cranelift backend and one that dynamically links.</p>

<p>Early on when developing the linker, if a test failed, it was generally necessary to step through
running the program in a debugger. I would step through both the output from GNU ld and the output
from my linker and see where they would diverge. The replay debugger <code class="language-plaintext highlighter-rouge">rr</code> was great for this as it
lets you step backwards in addition to forwards. However even with awesome tools like <code class="language-plaintext highlighter-rouge">rr</code>, this was
still a slow process. Fortunately it’s something I rarely need to do anymore.</p>

<p>The reason for that is that I now make extensive use of diffing against the output of GNU ld using a
tool I created called linker-diff. The binaries produced by different linkers are not byte-for-byte
identical and I wouldn’t want to try to make them so. However there’s lots of things we can diff,
even if the layout of the file is different. e.g.:</p>

<ul>
  <li>Values of many of the header fields.
    <ul>
      <li>Even when the actual value of the header field is different, we can often interpret it in a way
that can make it the same. e.g. when we look at the header field that contains the entry point
for the program, the addresses will be different because the layout of the files is different;
however, if we look to see what symbol names point to those addresses, we’d expect them to be the
same (a rough sketch of this idea follows the list).</li>
    </ul>
  </li>
  <li>We can disassemble global functions and check that the instructions match.
    <ul>
      <li>This is complicated somewhat because the instructions will often contain relative offsets to
other functions, or absolute values that are expected to be different depending on how the
linker laid out the binary. Similar to what we did with the entry point in the header, we can
allow these instructions to match provided they point to a symbol with the same name.</li>
    </ul>
  </li>
</ul>
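
<p>The entry-point comparison mentioned above might look something like the sketch below. The data
structures here are invented for illustration; linker-diff’s real model is considerably richer.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Two entry points "match" if they resolve to the same symbol name in their
// respective files, even though the raw addresses will usually differ. Each
// symbol table is assumed to be a list of (start_address, name) pairs, sorted
// by address.
fn symbol_at(symbols: &amp;[(u64, String)], address: u64) -&gt; Option&lt;&amp;str&gt; {
    symbols
        .iter()
        .take_while(|(start, _)| *start &lt;= address)
        .last()
        .map(|(_, name)| name.as_str())
}

fn entry_points_match(
    symbols_a: &amp;[(u64, String)],
    entry_a: u64,
    symbols_b: &amp;[(u64, String)],
    entry_b: u64,
) -&gt; bool {
    symbol_at(symbols_a, entry_a) == symbol_at(symbols_b, entry_b)
}
</code></pre></div></div>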

<p>Diffing linker outputs is non-trivial. Like linkers themselves, there are lots of corner cases. It
can be challenging to avoid false positives, while still detecting actual differences that we care
about. There’s still more that can be improved in the diff support, but it has already proved
incredibly valuable in diagnosing problems.</p>

<p>linker-diff is integrated into the integration tests. This means that generally now if I’m changing
how something works and I accidentally break something, rather than a mysterious and opaque test
failure when the binary produces the wrong result, I get a diff report showing where I did something
different to GNU ld.</p>

<p>One complication that arises is when GNU ld does something suboptimal. I observed this
when GNU ld didn’t apply a particular optimisation if a symbol in our output binary was
referenced by a shared object that we were linking against. Trying to replicate GNU ld’s behaviour
here would have made our output binary link slower, run slower and would have added significant complexity to
our linker. Fortunately lld had better behaviour in this case. So what I ended up doing for my tests
was diffing Wild’s output against both the output of GNU ld and lld. For each thing we diff, e.g.
each instruction, header field etc, if Wild matches either GNU ld or lld’s output, then we accept it
as correct.</p>

<p>This is what typical output from linker-diff looks like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wild: /wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.wild.so
ld: /wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.ld.so
lld: /wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.lld.so
asm.get_weak_var
                  endbr64
                  push %rbp
                  mov %rsp,%rbp
  
  wild 0x00402429 48 8d 05 b0 10 00 00 lea 0x10BF,%rax  // weak_var
  ld   0x000011b2 48 8b 05 1f 2e 00 00 mov 0x2E2E,%rax  // DYNAMIC(weak_var)
  lld  0x00001a12 48 8b 05 3f 12 00 00 mov 0x124E,%rax  // DYNAMIC(weak_var)
  ORIG            48 8b 05 00 00 00 00 mov 7,%rax  // R_X86_64_REX_GOTPCRELX -&gt; `weak_var`
  TRACE           relaxation=MovIndirectToLea value_flags=ADDRESS resolution_flags=DIRECT
  
                  mov (%rax),%eax
                  pop %rbp
                  ret
</code></pre></div></div>

<p>Here we can see the disassembly of the function <code class="language-plaintext highlighter-rouge">get_weak_var</code>. At the top and bottom are
instructions that are the same in the output of all three linkers.</p>

<p>In the middle is an instruction that is different. First we have a row for each of the three
linkers, wild, GNU ld and lld. We can see that GNU ld and lld both produced relative move
instructions that reference a dynamic relocation for a variable called <code class="language-plaintext highlighter-rouge">weak_var</code>. Wild however is
loading a relative address directly with no dynamic relocation. This may in fact still run
correctly, but only if this variable isn’t overridden at runtime by the main executable or another
shared object. So this is, or rather was, a bug in Wild.</p>

<p>When diagnosing failures like this, it’s very helpful to be able to see what was in the input file.
I used to find this manually; however, it was somewhat time-consuming. So I added support to the linker
to write layout information to a .layout file. linker-diff then uses this to find where a particular
instruction came from in an input file and display that. That is shown on the line prefixed with
<code class="language-plaintext highlighter-rouge">ORIG</code>. The relocation type <code class="language-plaintext highlighter-rouge">GOTPCRELX</code> is especially useful in diagnosing what’s happening.</p>

<p>It’s often useful to be able to log the values of variables from the code in the linker. Matching
these log statements up to the output of the linker can be tricky. To help fix this, the linker can
associate tracing log statements with particular addresses in the output file. If linker-diff finds
any log messages associated with any of the bytes for an instruction that has a diff, then it’ll
display them. This is shown on the <code class="language-plaintext highlighter-rouge">TRACE</code> line above. The code in the linker that emitted this,
then looks like this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="k">let</span> <span class="n">_span</span> <span class="o">=</span> <span class="nn">tracing</span><span class="p">::</span><span class="nd">span!</span><span class="p">(</span>
      <span class="nn">tracing</span><span class="p">::</span><span class="nn">Level</span><span class="p">::</span><span class="n">TRACE</span><span class="p">,</span> <span class="s">"relocation"</span><span class="p">,</span> <span class="n">address</span> <span class="o">=</span> <span class="n">place</span><span class="p">)</span><span class="nf">.entered</span><span class="p">();</span>
  <span class="o">...</span>
  <span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">((</span><span class="n">relaxation</span><span class="p">,</span> <span class="n">r_type</span><span class="p">))</span> <span class="o">=</span>
      <span class="nn">Relaxation</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">r_type</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="n">offset_in_section</span><span class="p">,</span> <span class="n">value_flags</span><span class="p">,</span> <span class="n">output_kind</span><span class="p">)</span>
  <span class="p">{</span>
      <span class="nn">tracing</span><span class="p">::</span><span class="nd">trace!</span><span class="p">(</span><span class="o">?</span><span class="n">relaxation</span><span class="p">,</span> <span class="o">%</span><span class="n">value_flags</span><span class="p">,</span> <span class="o">%</span><span class="n">resolution_flags</span><span class="p">);</span>
      <span class="o">...</span>
  <span class="p">}</span>
</code></pre></div></div>

<p>The first line creates the variable <code class="language-plaintext highlighter-rouge">_span</code>. Until this variable goes out of scope, all uses of
<code class="language-plaintext highlighter-rouge">tracing::trace!</code> will be associated with the address specified when we created the span.</p>

<p>When a test fails, it’s useful to be able to rerun the failing linker invocation outside of the
context of the test. If the bug is in linker-diff, then it’s useful to be able to rerun that. So
when a test fails, I print out the command lines to do both of these. I can then copy and paste
whichever I’d like to work on into my terminal.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
Error: Validation failed.

WILD_WRITE_LAYOUT=1 WILD_WRITE_TRACE=1 OUT=/home/david/work/wild/wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.wild.so /home/david/work/wild/wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.wild.save/run-with cargo run --bin wild --

 To revalidate:

cargo run --bin linker-diff -- --wild-defaults --ignore '.got.plt,.dynamic.DT_PLTGOT,.dynamic.DT_JMPREL,.dynamic.DT_NEEDED,.dynamic.DT_PLTREL,.dynamic.DT_FLAGS,.dynamic.DT_FLAGS_1,section.plt.entsize,section.relro_padding' --ref /home/david/work/wild/wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.ld.so --ref /home/david/work/wild/wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.lld.so /home/david/work/wild/wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.wild.so
</code></pre></div></div>

<p>When I find a program that misbehaves when linked with Wild, the first thing I want to do is try to
figure out what Wild is getting wrong. To help with that, I’ve integrated support for running linker
diff into Wild itself. This is done by setting the environment variable <code class="language-plaintext highlighter-rouge">WILD_REFERENCE_LINKER</code> to
the name of a reference linker to invoke.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">WILD_REFERENCE_LINKER</span><span class="o">=</span>ld <span class="nv">RUSTFLAGS</span><span class="o">=</span><span class="s2">"-Clinker=clang -Clink-args=--ld-path=wild"</span> cargo <span class="nb">test</span>
</code></pre></div></div>

<p>When set, Wild will run the reference linker (GNU ld) with the same arguments as those it was
invoked with, but change the output file. It’ll then invoke linker-diff to check for unexpected
differences, then fail the link if any are found.</p>

<p>Once I’ve identified the part that Wild is getting wrong, I can try to add something similar to one
of my existing test programs.</p>

<p>Wild’s tests still have lots more that needs doing. I’ve mostly focussed on the happy path so far,
since getting even that right is tricky. Soon I’ll probably need to start looking at testing error
conditions. I’ll likely follow a somewhat similar approach of having some test programs and making
sure that both the reference linker and Wild reject them and that each linker includes some specific
string in the error output - e.g. the name of a symbol that was unresolved.</p>

<p>At some point in the future, I’m interested in trying fuzzing as a testing strategy. Profile-guided
fuzzing could find interesting inputs that hit corner cases in the linker not covered by regular
tests.</p>

<p>The eventual plan for Wild is to make it incremental. When it comes time to start working on this, I
think linker-diff will again be useful. My plan is to test as follows:</p>

<ul>
  <li>Link a test program with wild. Call this output A.</li>
  <li>Make a random change to the input objects (possibly via fuzzing), then link this with wild. Call
this output B.</li>
  <li>Undo the random change we made and incrementally link. Call this output C.</li>
  <li>A and C should be semantically the same, so if we diff them with linker-diff, it should report no
differences.</li>
</ul>

<p>Another strategy I’m keen to employ is mutant testing (see <a href="https://mutants.rs/">mutants.rs</a>). This
makes random changes to your code that should change behaviour - e.g. inverting a comparison - then
checks if any of your tests pick up the change. Not only does this have the potential to pick up
gaps in testing, but it may also help find bits of code that are unnecessary. I’d also be interested
in seeing if it could be used to rank tests by how many problems they detect that other tests miss.
Tests that only detect a subset of the bugs detected by other tests would be candidates for removal.</p>

<p>I hope this look into how I approach testing and in particular testing of the Wild linker has given
you some ideas for your own projects.</p>

<h1 id="thanks">Thanks</h1>

<p>Thanks to my <a href="https://github.com/sponsors/davidlattimore">github sponsors</a>. Your contributions help
to make it possible for me to continue to work on this kind of stuff rather than going and getting a
“real job”.</p>

<ul>
  <li>bearcove</li>
  <li>repi</li>
  <li>bes</li>
  <li>Urgau</li>
  <li>jonhoo</li>
  <li>Kobzol</li>
  <li>coastalwhite</li>
  <li>mstange</li>
  <li>bcmyers</li>
  <li>Shnatsel</li>
  <li>Rafferty97</li>
  <li>joshtriplett</li>
  <li>tommythorn</li>
  <li>flba-eb</li>
  <li>acshi</li>
  <li>teh</li>
  <li>yerke</li>
  <li>alexkirsz</li>
  <li>NobodyXu</li>
  <li>Pratyush</li>
  <li>ethanmsl</li>
  <li>+2 anonymous</li>
</ul>

<h1 id="discussion-threads">Discussion threads</h1>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1e54pml/testing_the_wild_linker/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[I’ve been writing a linker, called Wild (see previous posts). Today, I’m going to talk about my approach to testing the linker. I think this is an interesting case study in its own right, but also there’s aspects of the approach that can likely be applied to other projects.]]></summary></entry><entry><title type="html">Speeding up rustc by being lazy</title><link href="https://davidlattimore.github.io/posts/2024/06/05/speeding-up-rustc-by-being-lazy.html" rel="alternate" type="text/html" title="Speeding up rustc by being lazy" /><published>2024-06-05T13:00:00+00:00</published><updated>2024-06-05T13:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/06/05/speeding-up-rustc-by-being-lazy</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/06/05/speeding-up-rustc-by-being-lazy.html"><![CDATA[<p>I’ve been busy working on the Wild linker (see <a href="/">previous posts</a>), but wanted to divert for a moment to
look at some other compilation speed things that I’ve been thinking about. This post discusses
various thoughts about moving Rust codegen, monomorphisation and inlining later in compilation and
some of the ways this might reduce both from-scratch and incremental build times.</p>

<h1 id="dead-code">Dead code</h1>

<p>Dead code is code that gets compiled, but isn’t needed for the final binary. This might come from
crates in our dependency tree where we’re only using part of the crate. It might also be from impls
that we’re not using - e.g. lots of Debug and Clone impls that aren’t actually used. The amount of
dead code that we compile varies quite a bit by crate.</p>

<p>In order to assess how much code is getting compiled then discarded during linking, I <a href="https://github.com/davidlattimore/wild/blob/main/wild_lib/src/gc_stats.rs">added
support</a> to the Wild
linker to print garbage collection statistics. If I run this on ripgrep, which has a pretty lean and
well-tuned build, we find that 17% of the executable code compiled is discarded.</p>

<p>For a less well-tuned binary, let’s pick on one of my own crates, the evcxr REPL. It shows that 35%
of compiled code is discarded by the linker.</p>

<p>There’s already been work done in Rustc to support MIR-only rlibs. This would defer codegen to later
in compilation. A lot of that work has been motivated by wanting to support compiling libstd with
different options. Depending on how it’s done, we may be able to take advantage of it to make
codegen happen on-demand. If codegen is deferred until link time, we know what is and isn’t
referenced. e.g. we can start from main and see what is referenced. We can then perform codegen only
for those functions that are referenced.</p>
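
<p>A sketch of that kind of reachability walk is shown below. The names and data structures are
invented for illustration; rustc’s real representation is of course different.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::collections::{HashMap, HashSet, VecDeque};

// Starting from `main`, walk the "who references whom" graph and collect the
// set of functions that actually need codegen.
fn reachable(references: &amp;HashMap&lt;String, Vec&lt;String&gt;&gt;, root: &amp;str) -&gt; HashSet&lt;String&gt; {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(root.to_string());
    queue.push_back(root.to_string());
    while let Some(function) = queue.pop_front() {
        for callee in references.get(&amp;function).into_iter().flatten() {
            // Anything not yet seen gets queued; everything else is dead code
            // and never reaches the codegen stage.
            if seen.insert(callee.clone()) {
                queue.push_back(callee.clone());
            }
        }
    }
    seen
}
</code></pre></div></div>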

<h1 id="repeated-monomorphisations">Repeated monomorphisations</h1>

<p>Another source of wastage is duplicate monomorphisations. Generic code, such as
<code class="language-plaintext highlighter-rouge">std::Vec::&lt;T&gt;::push</code> can’t be compiled to native code until the type parameter T is substituted.
This means that it happens when building the crate that calls the function. But there could be
multiple crates or codegen units that make use of the same monomorphisation. Repeating codegen for
each of them is wasteful.</p>
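
<p>As a contrived illustration (the crate and function names are invented), the generic function below
is instantiated with the same type parameter from two different crates, and each crate can end up
with its own copy of the resulting machine code:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// shared_lib/src/lib.rs
pub fn largest&lt;T: Ord + Copy&gt;(items: &amp;[T]) -&gt; Option&lt;T&gt; {
    items.iter().copied().max()
}

// crate_a/src/lib.rs
pub fn largest_a(items: &amp;[u32]) -&gt; Option&lt;u32&gt; {
    shared_lib::largest(items) // instantiates largest::&lt;u32&gt; in crate_a
}

// crate_b/src/lib.rs
pub fn largest_b(items: &amp;[u32]) -&gt; Option&lt;u32&gt; {
    shared_lib::largest(items) // instantiates largest::&lt;u32&gt; again in crate_b
}
</code></pre></div></div>

<p>Whether those copies actually get deduplicated depends on the build configuration, as discussed
below.</p>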

<p>I did an investigation into duplicate functions by <a href="https://github.com/davidlattimore/duplicate-function-checker">creating a
tool</a> that determines what percentage
of the executable bytes in a binary are excess due to duplicate functions. For many build
configurations, about 5-10% of the machine code going into your executable is likely excess copies
of duplicated functions and most of that is due to repeating the same monomorphisation. You can read
more in the tool’s README.</p>

<p>This is not only wasteful of compilation time, but also binary size. For release builds, various
options such as setting <code class="language-plaintext highlighter-rouge">codegen-units=1</code> and fat LTO reduce this duplication, however these options
also hurt build times, so we need another solution.</p>

<p>There are two different sources of repeated monomorphisations. The first is between codegen units
within a crate. This seems to mostly be an issue for release builds because the monomorphisations
are put into the codegen units they’re referenced from, in case LLVM wants to inline them.</p>

<p>The second source of repeated monomorphisations is between crates. If multiple crates all need the
same monomorphisations, then each crate produces it. These duplicates happen in both release and
debug builds.</p>

<p>When compiling C++ code, GCC and Clang emit such monomorphisations as weak symbols rather than local
symbols like rustc does. This lets the linker deduplicate them. This might be an option for reducing
binary sizes, although it’s complicated by Rust’s use of the archive format for rlibs, since if the
only symbols referenced in an archive entry are weak symbols, then the archive entry won’t be
loaded. bjorn3 points out that this could be fixed by passing <code class="language-plaintext highlighter-rouge">--whole-archive</code> to the linker. Like
setting <code class="language-plaintext highlighter-rouge">codegen-units=1</code>, this only helps the binary size-problem, not the wasted-compilation-time
problem.</p>

<p>Some work on this problem has already been done in the form of the unstable flag <code class="language-plaintext highlighter-rouge">-Zshare-generics</code>
which is on by default for non-optimised builds. This does reduce the number of duplicate
monomorphisations, but there’s still plenty of duplicates from different crates remaining.</p>

<p>Duplicates originating from the same monomorphisation in different crates are somewhat tricky to
solve, but one possibility is to do something similar to the proposal above for dead code, i.e. defer
monomorphisation to link time. Doing this would mean that we could create just one copy of each
monomorphised function.</p>

<h1 id="recompiling-dependents-on-implementation-changes">Recompiling dependents on implementation changes</h1>

<p>Another source of wastage happens when you have several crates and you’re making changes to a
library crate, then rebuilding some binary that depends on the library crate that you edited.
Currently cargo rebuilds all crates in the dependency tree between the crate that you edited and the
binary crate you’re building.</p>

<p><img src="/images/lazy/crate-graph.svg" alt="Diagram of several crates in a workspace" /></p>

<p>In the diagram above, if <code class="language-plaintext highlighter-rouge">A</code> is our binary (or a test crate) and we’re making edits to the
implementation of a function in <code class="language-plaintext highlighter-rouge">F</code>, say adding and removing print statements, then each time we
make a change, rustc needs to be invoked on all the crates with the dashed outlines. However
ideally, it should be possible to just recompile <code class="language-plaintext highlighter-rouge">F</code>, then relink <code class="language-plaintext highlighter-rouge">A</code>.</p>

<p>Currently when the rust compiler compiles a library crate, it emits an rmeta file, then later emits
an rlib containing the results of codegen. The rmeta file for a build (as opposed to a check)
currently includes the MIR of all the functions in the crate. This is currently necessary, since the
dependent crates might want to inline some of the functions.</p>

<p>If we’re delaying codegen to link time, then we can also delay inlining. This means that we don’t
need the MIR in order to compile the dependent crates. This would give us two advantages:</p>

<ul>
  <li>Pipelined compilation can work better, since we don’t need to wait for the MIR to be ready before
the dependent crates can be built.</li>
  <li>We don’t need to rerun rustc on the dependent crates when the rlibs change, only when the rmeta
changes. That means that if you edit the implementation of a function in one of your library
crates, you only need to rerun rustc for that one library crate and then relink. During relinking,
any functions that changed as well as any functions that inlined changed functions would go
through codegen.</li>
</ul>

<h1 id="parallelism">Parallelism</h1>

<p>When doing work in parallel across multiple threads or processes, if one or a few units of work
finish significantly later than the rest, things can slow down because we have CPU cores
sitting idle with nothing useful to do. I’ll call these late finishers “stragglers”.</p>

<p>Currently in the Rust compiler, codegen of one crate can happen at the same time as earlier
compilation stages of another crate. By deferring codegen until we’re building the final binary, we
introduce an extra wait-point where we can potentially get stragglers.</p>

<p>One mitigation that we already get with the changes proposed above is that a normal build becomes
more like a <code class="language-plaintext highlighter-rouge">cargo check</code> in terms of pipelining. Rather than emitting .rmeta files containing MIR,
the compiler emits .rmeta files without MIR. This means that dependent crates can start being
compiled earlier because they don’t need to wait for the MIR of their dependencies.</p>

<p>However we still need to wait for the MIR for the last crate(s) to finish being written before we
can start codegen. One potential for increased parallelism here is that rustc could make the MIR for
a crate available before it has finished checking the crate. The compiler stages might look
something like this:</p>

<ul>
  <li>Parse files and do everything that’s required to write a .rmeta file containing only what’s needed to
check dependent crates, i.e. emit interface information, type information, exported macros etc.
Once this finishes, dependent crates can start being compiled. This is similar to what currently
happens with a cargo check. Then emit a MIR-only rlib. Once that finishes for all crates needed by a
binary, codegen and linking of that binary can begin.</li>
  <li>Complete remaining error-checking of the crate. Cargo would wait for this to complete, but other
steps including codegen and linking can run concurrently with this.</li>
</ul>

<p>So this is a form of pipelined building, similar to the pipelined building that cargo and rustc currently do. For comparison, this is what currently happens during a build:</p>

<ul>
  <li>Do everything that’s required to write a .rmeta file. Unlike above, this contains MIR, since
subsequent crates might need the MIR in order to inline functions during codegen. Once this is
completed, subsequent crates can start building.</li>
  <li>Codegen crate. Once this is completed, the final binary can be built.</li>
</ul>

<h1 id="finer-grained-codegen-units">Finer-grained codegen units</h1>

<p>One way to reduce stragglers is by having smaller units of work. Currently the Rust compiler is a
bit limited as to how small it can make codegen units. Some of the things that limit the Rust
compiler are affected by changes proposed above.</p>

<p>My ideal would be if we could codegen each function separately. That maximises parallelism and also
means that when doing incremental compilation we can avoid the need to repeat codegen for other
functions that just happened to be in the same codegen unit on the previous build. However we’d need
to make sure we’re not repeating any work in multiple codegen units.</p>

<p>If this were done today, one source of repeated work would be that any generic functions called by
the function that we were going to codegen might, depending on optimisation level, be included too.
Above, this post proposed that if we’re not inlining a generic function, that we codegen it only
once and make it global rather than local. That would allow us to not include it together with the
function that we’re compiling.</p>

<p>Apparently the cranelift backend already does codegen for each function independently.</p>

<p>Writing a separate object file for each function is unlikely to be practical or efficient. There are
potential limits on how many arguments can be passed to the linker. The linker also might not be
optimised for this. Having one function per object within an archive might be a possibility,
although experimentation would be needed to see how well different linkers handled that. The
alternative would be to pack multiple function definitions into a single object file even though
they went through codegen separately.</p>

<h1 id="linker-integration">Linker integration?</h1>

<p>One option for deferring codegen would be to integrate codegen into the linker. This could take the
form of building a linker into rustc and then using rustc as the linker.</p>

<p>An alternative would be to do codegen just prior to linking.</p>

<p>Integrating a linker into rustc would have some advantages:</p>

<ul>
  <li>The linker is already doing a graph traversal, taking advantage of that avoids the need to do a
separate graph traversal in the compiler.</li>
  <li>If you have a mix of code from Rust and other languages (e.g. C or C++), then the linker has a
view of all of this. If doing the graph traversal without the help of the linker, we’d need to
assume that any function that could be called from another language is called.</li>
  <li>Caching is probably easier with tighter linker integration, since the linker can read entries
directly from a cache and we’re not constrained to putting everything in object files.</li>
</ul>

<p>However, the main disadvantage of such tight linker integration is that we then don’t get all the
benefits of this work unless we’re using the integrated linker. My linker, Wild, is still a way off
being ready for general use on Linux and I haven’t even started to look at porting to other
platforms. So I think it’s important to try to do deferred codegen without integrating the linker.</p>

<p>Doing codegen just prior to linking could be done as follows:</p>

<ul>
  <li>Compile binary crates to rlibs rather than directly invoking the linker when the binary crate gets
compiled.</li>
  <li>Have cargo invoke rustc to perform the linking step. This final rustc invocation would determine
what codegen was needed, do it, then invoke the linker on the resulting object files.</li>
</ul>

<h1 id="caching">Caching</h1>

<p>If codegen is deferred until we are building a binary, then we need to make sure that we avoid
repeating the same codegen more than once. This means that when doing a warm build, we need only do
codegen for new / changed code.</p>

<p>We also need to be careful if we’re building multiple binary crates. All binary crates need to be
able to share the codegen outputs where appropriate. One way to achieve this might be as follows.</p>

<ul>
  <li>Keep an index file in which we record which functions are in which object files.</li>
  <li>When invoking rustc to do codegen / linking, lock the index file, figure out which functions we’re
going to codegen, update the index file to indicate which object files those new functions will be
in, create those files and lock them, then release the lock on the main index file (see the sketch
after this list).</li>
  <li>That should hold the main index lock for a relatively short time after which another rustc process
can do the same.</li>
  <li>When we finish doing codegen, before we invoke the linker, make sure that none of the object files
we are going to pass to the linker are still locked, which would indicate that the rustc process
writing them was still working.</li>
</ul>
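
<p>The index-locking step of the scheme above might look roughly like the following sketch. It assumes
the <code class="language-plaintext highlighter-rouge">fs2</code> crate for advisory file locking; the index format and everything else here is invented for
illustration.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::fs::{File, OpenOptions};

use fs2::FileExt; // assumed dependency providing advisory file locks

// Hold the lock on the shared index only while we decide which functions this
// rustc invocation will codegen and record which object files they'll land
// in. Codegen itself then runs without the index lock held, so other rustc
// processes can plan their own work concurrently.
fn plan_codegen(index_path: &amp;str) -&gt; std::io::Result&lt;()&gt; {
    let index: File = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(index_path)?;

    index.lock_exclusive()?;
    // ... read the index, pick the functions that still need codegen and
    // record the object files they'll be written to ...
    index.unlock()?;

    Ok(())
}
</code></pre></div></div>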

<p>Caching is quite possibly the hardest part of all of this to get correct. Ideally we’d like to avoid
storing each compiled function twice (once in the cache and once in the object file to be linked),
but this does make things significantly more complicated, especially without causing
non-determinism.</p>

<h1 id="keeping-memory-usage-in-check">Keeping memory usage in check</h1>

<p>With all codegen being done by the one rustc process, care needs to be taken to ensure memory usage
isn’t too high. Several strategies might help here:</p>

<ul>
  <li>Store graph information (what references what) separately from the MIR so that we can do a graph
traversal without loading all the MIR.</li>
  <li>Load the MIR for each function only when we’re ready to codegen it, write the resulting machine
code into an object file then drop it and the MIR.</li>
</ul>

<h1 id="why-not-do-all-compilation-on-demand">Why not do all compilation on demand?</h1>

<p>It would be pretty hard to retrofit fully on-demand compilation to a mature compiler like rustc.
It’s also unclear how much you could actually skip. At least a bit of processing of each file is
required in order to find all trait implementations so that method resolution can give correct
results.</p>

<p>Correctness checks within function bodies could potentially be done only for functions that were
reachable. But that raises lots of questions about whether you’d want to do that. I’ve heard that
Zig doesn’t report some errors for dead code.</p>

<p>At least for now, it’s better to still do all correctness checking even for dead code.</p>

<h1 id="related-work">Related work</h1>

<p>I was somewhat inspired here by an excellent episode of the Software Unscripted podcast in which
the host, Richard Feldman, interviewed matklad. The episode is called “Incremental Compilation with
Alex Kladov”
(<a href="https://podcasts.apple.com/us/podcast/incremental-compilation-with-alex-kladov/id1602572955?i=1000647825248">link</a>).
In a much earlier episode of the same podcast, Richard interviewed Andrew Kelley, the creator of Zig
(<a href="https://podcasts.apple.com/us/podcast/open-source-with-zig-creator-andrew-kelley/id1602572955?i=1000554066581">link</a>);
Zig does a lot of its compilation in a more on-demand way.</p>

<p>Various related previous discussions:</p>

<ul>
  <li><a href="https://internals.rust-lang.org/t/laziness-in-the-compiler/19112">Laziness in the compiler</a> (July 2023)</li>
  <li><a href="https://internals.rust-lang.org/t/towards-a-second-edition-of-the-compiler/5582">Towards a second edition of the
compiler</a> (July
2017)</li>
  <li>MIR-only RLIBs (Discussions on github / Zulip from January 2017 and more in 2024!)
    <ul>
      <li>The motivations for MIR-only RLIBs are different from those of this post, but there’s
substantial cross-over.</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">-Z share-generics</code>
    <ul>
      <li><a href="https://internals.rust-lang.org/t/explicit-monomorphization-for-compilation-time-reduction/15907">Explicit monomorphization for compilation time
reduction</a></li>
    </ul>
  </li>
</ul>

<p>I’ve only linked to discussions where they’re archived, but you can find open issues etc on these
topics with a quick web search.</p>

<h1 id="next-steps">Next steps</h1>

<p>I’m busy working on the <a href="https://github.com/davidlattimore/wild">Wild linker</a>; however, I
think I may have some bandwidth to start working on at least one of the ideas here. I haven’t yet
figured out which one. If you’ve got any comments or would like to discuss this, my contact details
are on my about page.</p>

<h1 id="thanks">Thanks</h1>

<p>Thanks to bjorn3, simulacrum, Jakub Beránek, davidtwco and nora (Nilstrieb) for providing feedback
on an earlier draft of this post. Any errors or inaccuracies are mine.</p>

<p>Thanks also to my <a href="https://github.com/sponsors/davidlattimore">github sponsors</a>. Your contributions
help to make it possible for me to continue to work on this kind of stuff rather than going and
getting a “real job”.</p>

<ul>
  <li>repi</li>
  <li>bes</li>
  <li>Urgau</li>
  <li>coastalwhite</li>
  <li>mstange</li>
  <li>bcmyers</li>
  <li>Shnatsel</li>
  <li>Rafferty97</li>
  <li>joshtriplett</li>
  <li>acshi</li>
  <li>teh</li>
  <li>yerke</li>
  <li>alexkirsz</li>
  <li>Pratyush</li>
  <li>lexara-prime-ai</li>
  <li>ethanmsl</li>
  <li>+1 anonymous</li>
</ul>

<h1 id="discussion-threads">Discussion threads</h1>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1d9b36j/speeding_up_rustc_by_being_lazy/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[I’ve been busy working on the Wild linker (see previous posts), but wanted to divert for a moment to look at some other compilation speed things that I’ve been thinking about. This post discusses various thoughts about moving Rust codegen, monomorphisation and inlining later in compilation and some of the ways this might reduce both from-scratch and incremental build times.]]></summary></entry><entry><title type="html">Video: Rust Sydney - A linker in the Wild</title><link href="https://davidlattimore.github.io/posts/2024/04/17/video-rust-syd-wild-linker.html" rel="alternate" type="text/html" title="Video: Rust Sydney - A linker in the Wild" /><published>2024-04-17T13:00:00+00:00</published><updated>2024-04-17T13:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/04/17/video-rust-syd-wild-linker</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/04/17/video-rust-syd-wild-linker.html"><![CDATA[<p>This week I presented a talk at the Rust Sydney meetup about the Wild linker.</p>

<p><a href="https://www.youtube.com/watch?v=WSHt3-gwVxc">Video</a></p>

<p>There are also <a href="https://docs.google.com/presentation/d/149uYKGbT0Jn4N6tBqdGTc6DEAX1olmj3m7H7qdMJJdU/edit?usp=sharing">slides, including speaker
notes</a>
with roughly what I said, or intended to say.</p>

<p><a href="https://www.reddit.com/r/rust/comments/1c7izhz/video_a_linker_in_the_wild_rust_linker/">Discussion on
Reddit</a></p>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[This week I presented a talk at the Rust Sydney meetup about the Wild linker.]]></summary></entry></feed>