<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://davidlattimore.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://davidlattimore.github.io/" rel="alternate" type="text/html" /><updated>2025-11-27T06:31:45+00:00</updated><id>https://davidlattimore.github.io/feed.xml</id><title type="html">David Lattimore’s Blog</title><subtitle>A blog about my open-source work, mostly in Rust. My interests are mostly around developer tooling, compilers, linking.</subtitle><entry><title type="html">Graph Algorithms in Rayon</title><link href="https://davidlattimore.github.io/posts/2025/11/27/graph-algorithms-in-rayon.html" rel="alternate" type="text/html" title="Graph Algorithms in Rayon" /><published>2025-11-27T00:00:00+00:00</published><updated>2025-11-27T00:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2025/11/27/graph-algorithms-in-rayon</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2025/11/27/graph-algorithms-in-rayon.html"><![CDATA[<p>The <a href="https://github.com/davidlattimore/wild">Wild linker</a> makes very extensive use of
<a href="https://docs.rs/rayon/latest/rayon/">rayon</a> for parallelism. Much of this parallelism is in the
form of
<a href="https://docs.rs/rayon/latest/rayon/iter/trait.IntoParallelRefIterator.html#tymethod.par_iter"><code class="language-plaintext highlighter-rouge">par_iter</code></a>
and friends. However, some parts of the linker don’t fit neatly because the amount of work isn’t
known in advance. For example, the linker has two places where it explores a graph. When we start,
we know some roots of that graph, but we don’t know all the nodes that we’ll need to visit. We’ve
gone through a few different approaches for how we implement such algorithms. This post covers those
approaches and what we’ve learned along the way.</p>

<h2 id="spawn-broadcast">Spawn broadcast</h2>

<p>Our first approach was to spawn a task for each thread (rayon’s
<a href="https://docs.rs/rayon/latest/rayon/struct.Scope.html#method.spawn_broadcast">spawn_broadcast</a>)
then do our own work sharing and job control between those threads. By “our own job control” I mean
that each thread would pull work from a channel and if it found no work, it’d <a href="https://doc.rust-lang.org/std/thread/fn.park.html">park the
thread</a>. If new work came up, the thread that
produced the work would wake a parked thread.</p>
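
<p>A rough sketch of the shape of this approach might look something like the following. This isn’t
Wild’s actual code - <code class="language-plaintext highlighter-rouge">WorkItem</code> and <code class="language-plaintext highlighter-rouge">process</code> are stand-ins for the linker’s real types, and for
brevity this version polls for work rather than parking and waking threads.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::atomic::{AtomicUsize, Ordering};

struct WorkItem; // Stand-in for the linker's real work type.

fn process(
    _item: WorkItem,
    _send: &amp;crossbeam_channel::Sender&lt;WorkItem&gt;,
    _outstanding: &amp;AtomicUsize,
) {
    // Real work goes here. To queue follow-up work, the real code would first
    // increment `_outstanding`, then send a new `WorkItem` on `_send`.
}

fn run(roots: Vec&lt;WorkItem&gt;) {
    let (send, recv) = crossbeam_channel::unbounded();
    let outstanding = AtomicUsize::new(roots.len());
    for item in roots {
        let _ = send.send(item);
    }

    rayon::scope(|scope| {
        scope.spawn_broadcast(|_scope, _ctx| {
            // One copy of this closure runs on each thread in the pool. Each
            // thread keeps pulling items until all outstanding work is done.
            while outstanding.load(Ordering::Acquire) &gt; 0 {
                if let Ok(item) = recv.try_recv() {
                    process(item, &amp;send, &amp;outstanding);
                    outstanding.fetch_sub(1, Ordering::AcqRel);
                }
            }
        });
    });
}
</code></pre></div></div>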

<p>This was complex. Worse, it didn’t allow us to use other rayon features while it was running. For
example, if we tried to do a par_iter from one of the threads, it’d only have the current thread to
work with because all the others were doing their own thing, possibly parked, but in any case, not
available to rayon.</p>

<h2 id="scoped-spawning">Scoped spawning</h2>

<p>Using rayon’s <a href="https://docs.rs/rayon/latest/rayon/fn.scope.html"><code class="language-plaintext highlighter-rouge">scope</code></a> or
<a href="https://docs.rs/rayon/latest/rayon/fn.in_place_scope.html"><code class="language-plaintext highlighter-rouge">in_place_scope</code></a>, we can create a scope
into which we spawn tasks.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">rayon</span><span class="p">::</span><span class="nf">scope</span><span class="p">(|</span><span class="n">scope</span><span class="p">|</span> <span class="p">{</span>  
  <span class="k">for</span> <span class="n">node</span> <span class="k">in</span> <span class="n">roots</span> <span class="p">{</span>  
    <span class="n">scope</span><span class="nf">.spawn</span><span class="p">(|</span><span class="n">scope</span><span class="p">|</span> <span class="p">{</span>  
      <span class="nf">explore_graph</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">scope</span><span class="p">);</span>  
    <span class="p">});</span>  
  <span class="p">}</span>  
<span class="p">});</span>  
</code></pre></div></div>

<p>The idea here is that we create a scope and spawn some initial tasks into that scope. Those tasks
then spawn additional tasks and so on until eventually there are no more tasks.</p>
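
<p>For concreteness, here’s a sketch of what <code class="language-plaintext highlighter-rouge">explore_graph</code> might look like under this approach. The
<code class="language-plaintext highlighter-rouge">Node</code> type here, with its visited flag and edge list, is purely illustrative - Wild’s real graph
types are different.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::atomic::{AtomicBool, Ordering};

// Purely illustrative: anything with a "visited" flag and a list of outgoing
// edges works the same way.
struct Node {
    visited: AtomicBool,
    edges: Vec&lt;&amp;'static Node&gt;,
}

fn explore_graph&lt;'scope&gt;(node: &amp;'scope Node, scope: &amp;rayon::Scope&lt;'scope&gt;) {
    for &amp;next in &amp;node.edges {
        // Claim the node before spawning so that it only gets processed once,
        // even if several tasks reach it at about the same time.
        if !next.visited.swap(true, Ordering::AcqRel) {
            scope.spawn(move |scope| explore_graph(next, scope));
        }
    }
}
</code></pre></div></div>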

<p>The rayon documentation warns that this is more expensive than other approaches, so should be
avoided if possible. The reason it’s more expensive is that it heap-allocates the task. Indeed, when
using this approach, we do see increased heap allocations.</p>

<h2 id="channel--par_bridge">Channel + par_bridge</h2>

<p>Another approach that I’ve tried recently and which arose out of the desire to reduce heap
allocations is to put work into a <a href="https://docs.rs/crossbeam-channel/latest/crossbeam_channel/fn.unbounded.html">crossbeam
channel</a>. The work
items can be an enum if there are different kinds. Our work scope is then just something like the
following:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="p">(</span><span class="n">work_send</span><span class="p">,</span> <span class="n">work_recv</span><span class="p">)</span> <span class="o">=</span> <span class="nn">crossbeam_channel</span><span class="p">::</span><span class="nf">unbounded</span><span class="p">();</span>

<span class="c1">// Add some initial work items.  </span>
<span class="k">for</span> <span class="n">node</span> <span class="k">in</span> <span class="n">roots</span> <span class="p">{</span>  
  <span class="n">work_send</span><span class="nf">.send</span><span class="p">(</span><span class="nn">WorkItem</span><span class="p">::</span><span class="nf">ProcessNode</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">work_send</span><span class="nf">.clone</span><span class="p">()));</span>  
<span class="p">}</span>

<span class="c1">// Drop sender to ensure we can terminate. Each work item has a copy of the sender.  </span>
<span class="nf">drop</span><span class="p">(</span><span class="n">work_send</span><span class="p">);</span>

<span class="n">work_recv</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.par_bridge</span><span class="p">()</span><span class="nf">.for_each</span><span class="p">(|</span><span class="n">work_item</span><span class="p">|</span> <span class="p">{</span>  
   <span class="k">match</span> <span class="n">work_item</span> <span class="p">{</span>  
      <span class="nn">WorkItem</span><span class="p">::</span><span class="nf">ProcessNode</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">work_send</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="p">{</span>  
        <span class="nf">explore_graph</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">work_send</span><span class="p">);</span>  
      <span class="p">}</span>  
   <span class="p">}</span>  
<span class="p">});</span>  
</code></pre></div></div>

<p>The trick with this approach is that each work item needs to hold a copy of the send-end of the
channel. That means that when processing work items, we can add more work to the queue. Once the
last work item completes, the last copy of the sender is dropped and the channel closes.</p>
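
<p>Under this scheme, <code class="language-plaintext highlighter-rouge">explore_graph</code> would look something like the sketch below. Again, <code class="language-plaintext highlighter-rouge">Node</code> and
<code class="language-plaintext highlighter-rouge">neighbours</code> are stand-ins rather than Wild’s real types.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Illustrative stand-ins for the linker's real types.
struct Node;

fn neighbours(_node: &amp;Node) -&gt; Vec&lt;Node&gt; {
    Vec::new()
}

enum WorkItem {
    ProcessNode(Node, crossbeam_channel::Sender&lt;WorkItem&gt;),
}

fn explore_graph(node: Node, work_send: crossbeam_channel::Sender&lt;WorkItem&gt;) {
    for next in neighbours(&amp;node) {
        // Each queued item carries its own clone of the sender, so the channel
        // only closes once the last item has been processed.
        let _ = work_send.send(WorkItem::ProcessNode(next, work_send.clone()));
    }
}
</code></pre></div></div>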

<p>This approach works OK. It does avoid the heap allocations associated with scoped spawning. It is a
little bit complex, although not as complex as doing all the job control ourselves. One downside is
that, like doing job control ourselves, it doesn’t play nicely with using <code class="language-plaintext highlighter-rouge">par_iter</code> inside worker
tasks. The reason why is kind of subtle and is due to the way rayon is implemented. What can happen
is that the <code class="language-plaintext highlighter-rouge">par_iter</code> doesn’t just process its own tasks. It can also steal work from other
threads. When it does this, it can end up blocking trying to pull another work item from the
channel. The trouble is that because the <code class="language-plaintext highlighter-rouge">par_iter</code> was called from a work item that holds a copy of
the send-end of the channel, we can end up deadlocked. The channel doesn’t close because we hold a
sender and we don’t drop the sender because we’re trying to read from the read-end of the channel.</p>

<p>Another problem with this approach that I’ve just come to realise is that it doesn’t compose well. I
had kind of imagined just adding more and more variants to my <code class="language-plaintext highlighter-rouge">WorkItem</code> enum as the scope of the
work increased. The trouble is that working with this kind of work queue doesn’t play nicely with
the borrow checker. An example might help. Suppose we have some code written with rayon’s
<a href="https://docs.rs/rayon/latest/rayon/slice/trait.ParallelSliceMut.html#method.par_chunks_mut">par_chunks_mut</a>
and we want to flatten that work into some other code that uses a channel with work items. First we
need to convert the <code class="language-plaintext highlighter-rouge">par_chunks_mut</code> code into a channel of work items.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">foo</span> <span class="o">=</span> <span class="nf">create_foo</span><span class="p">();</span>  
<span class="n">foo</span><span class="nf">.par_chunks_mut</span><span class="p">(</span><span class="n">chunk_size</span><span class="p">)</span><span class="nf">.for_each</span><span class="p">(|</span><span class="n">chunk</span><span class="p">|</span> <span class="p">{</span>  
   <span class="c1">// Do work with mutable slice `chunk`  </span>
<span class="p">});</span>  
</code></pre></div></div>

<p>If we want the creation of <code class="language-plaintext highlighter-rouge">foo</code> to be a work item and each bit of processing to also be work items,
there’s no way to do that and have the borrow checker be happy.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">match</span> <span class="n">work_item</span> <span class="p">{</span>  
   <span class="nn">WorkItem</span><span class="p">::</span><span class="n">CreateAndProcessFoo</span> <span class="k">=&gt;</span> <span class="p">{</span>  
      <span class="k">let</span> <span class="n">foo</span> <span class="o">=</span> <span class="nf">create_foo</span><span class="p">();</span>  
      <span class="c1">// Split `foo` into chunks and queue several `WorkItem::ProcessChunk`s….?  </span>
   <span class="p">}</span>  
   <span class="nn">WorkItem</span><span class="p">::</span><span class="nf">ProcessChunk</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="p">{</span>  
      <span class="c1">// Do work with mutable slice `chunk`.  </span>
   <span class="p">}</span>  
<span class="p">}</span>  
</code></pre></div></div>

<p>So that clearly doesn’t work. There’s no way for us to take our owned <code class="language-plaintext highlighter-rouge">foo</code> and split it into chunks
that can be processed as separate <code class="language-plaintext highlighter-rouge">WorkItem</code>s. The borrow checker won’t allow it.</p>

<p>Another problem arises if we’ve got two work-queue-based jobs and we’d like to combine them, but the
second job needs borrows that were taken by the first job to be released before it can run. This
runs into similar problems.</p>

<p>The kinds of code structures we end up with here feel a bit like we’re trying to write async code
without async/await. This makes me wonder if async/await could help here.</p>

<h2 id="asyncawait">Async/await</h2>

<p>I don’t know exactly what this would look like because I haven’t yet tried implementing it. But I
imagine it might look a lot like how the code is written with rayon’s scopes and spawning. Instead
of using rayon’s scopes, it’d use something like
<a href="https://crates.io/crates/async-scoped"><code class="language-plaintext highlighter-rouge">async_scoped</code></a>.</p>

<p>One problem that I have with rayon currently is, I think, solved by using async/await. That problem,
which I briefly touched on above, is described in more detail here. Suppose we have a <code class="language-plaintext highlighter-rouge">par_iter</code>
inside some other parallel work:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">outer_work</span><span class="nf">.par_iter</span><span class="p">()</span><span class="nf">.for_each</span><span class="p">(|</span><span class="n">foo</span><span class="p">|</span> <span class="p">{</span>
  <span class="k">let</span> <span class="n">foo</span> <span class="o">=</span> <span class="n">inputs</span><span class="nf">.par_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="o">...</span><span class="p">)</span><span class="nf">.collect</span><span class="p">();</span>

  <span class="c1">// &lt; Some other work with `foo` here, hence why we cannot merge the two par_iters &gt;</span>

  <span class="n">foo</span><span class="nf">.par_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="o">...</span><span class="p">)</span><span class="nf">.for_each</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="o">...</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>

<p>If the thread that we’re running this code on becomes idle during the first inner <code class="language-plaintext highlighter-rouge">par_iter</code>, that
thread will try to steal work from other threads. If it succeeds, then even though all the work of
the <code class="language-plaintext highlighter-rouge">par_iter</code> is complete, we can’t continue to the second inner <code class="language-plaintext highlighter-rouge">par_iter</code> until the stolen work
also completes. However, with async/await, tasks are not tied to a specific thread once started.
Threads steal work, but tasks don’t. The task running the above code would therefore become runnable
as soon as the <code class="language-plaintext highlighter-rouge">par_iter</code> completed, even if the thread that had originally been running that task
had stolen work - the task could just be run on another thread.</p>

<p>It’d be very interesting to see what async/await could contribute to the parallel computation space.
I don’t have any plans to actually try this at this stage, but maybe in future.</p>

<h2 id="return-to-scoped-spawning-and-future-work">Return to scoped spawning and future work</h2>

<p>In the meantime, I’m thinking I’ll return to scoped spawning. Using a channel works fine for simple
tasks and it avoids the heap allocations, but it really doesn’t compose at all well.</p>

<p>I am interested in other options for avoiding the heap allocations. Perhaps there are options for
making small changes to rayon that might achieve this, e.g. adding support for spawning tasks
without boxing, provided the closure is no larger than, say, 32 bytes. I’ve yet to explore such
options though.</p>

<h2 id="thanks">Thanks</h2>

<p>Thanks to everyone who has been <a href="https://github.com/sponsors/davidlattimore">sponsoring</a> my work on
Wild, in particular the following, who have sponsored at least $15 in the last two months:</p>

<ul>
  <li>CodeursenLiberte</li>
  <li>repi</li>
  <li>rrbutani</li>
  <li>Rafferty97</li>
  <li>wasmerio</li>
  <li>mati865</li>
  <li>Urgau</li>
  <li>mstange</li>
  <li>flba-eb</li>
  <li>bes</li>
  <li>Tudyx</li>
  <li>twilco</li>
  <li>sourcefrog</li>
  <li>simonlindholm</li>
  <li>petersimonsson</li>
  <li>marxin</li>
  <li>joshtriplett</li>
  <li>coreyja</li>
  <li>binarybana</li>
  <li>bcmyers</li>
  <li>Kobzol</li>
  <li>HadrienG2</li>
  <li>+3 anonymous</li>
</ul>

<h2 id="discussion-threads">Discussion threads</h2>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1p7omoh/thoughts_on_graph_algorithms_in_rayon/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[The Wild linker makes very extensive use of rayon for parallelism. Much of this parallelism is in the form of par_iter and friends. However, some parts of the linker don’t fit neatly because the amount of work isn’t known in advance. For example, the linker has two places where it explores a graph. When we start, we know some roots of that graph, but we don’t know all the nodes that we’ll need to visit. We’ve gone through a few different approaches for how we implement such algorithms. This post covers those approaches and what we’ve learned along the way.]]></summary></entry><entry><title type="html">Wild Linker Update - 0.6.0</title><link href="https://davidlattimore.github.io/posts/2025/09/23/wild-update-0.6.0.html" rel="alternate" type="text/html" title="Wild Linker Update - 0.6.0" /><published>2025-09-23T00:00:00+00:00</published><updated>2025-09-23T00:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2025/09/23/wild-update-0.6.0</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2025/09/23/wild-update-0.6.0.html"><![CDATA[<p>Today, we’ve released <a href="https://github.com/davidlattimore/wild/releases/tag/0.6.0">Wild version
0.6.0</a>. There were many changes and we
were probably overdue for a release, having last released in May.</p>

<p>This release saw contributions from many people:</p>

<ul>
  <li>davidlattimore: 90</li>
  <li>marxin: 69</li>
  <li>lapla-cogito: 41</li>
  <li>mati865: 28</li>
  <li>RossSmyth: 6</li>
  <li>daniel-levin: 3</li>
  <li>Noratrieb: 2</li>
  <li>lqd: 1</li>
  <li>m-hugo: 1</li>
  <li>dawnofmidnight: 1</li>
</ul>

<p>That’s the number of commits, which isn’t a great measure, but it’s something. Importantly, more
than half of the commits were made by people other than me. It’s awesome to see Wild growing
into a team project. If you’d like to contribute, come along and have a chat on the <a href="https://wild.zulipchat.com/join/bbopdeg6howwjpaiyowngyde/">Wild
Zulip</a> or have a look through the issues
for something you think you’d like to try implementing / fixing.</p>

<p>My work on the project has been reduced a bit over the last couple of months due to me speaking first at
RustForge, then at RustChinaConf. Conference preparation takes me a lot of time and I need to get
better at managing that preparation work while still getting other stuff done. In any case, the
conferences are over now and I’m looking forward to getting some solid work done.</p>

<p>The last few months we had Kei (lapla-cogito) join us for Google Summer of Code (GSoC). It was
awesome having Kei work with us. As you can see above, a lot of work got done. The project focused
on setting things up to run Mold’s test suite with wild. This is now running in CI and helps fill
some gaps in our own tests as well as highlight things that we haven’t yet implemented. Kei also did
a lot of other fixes and improvements. One of the more notable ones was implementing <code class="language-plaintext highlighter-rouge">--help</code>, which
was something we’d wanted for a while. I look forward to continuing to work with Kei going forward.</p>

<p>Martin (marxin) added initial RISC-V support to this release. There’s still probably a little bit
more that could be done on this, but it basically works. Kei is working on adding RISC-V support to
linker-diff, which will help with further work in this area.</p>

<p>With this release, we now do release builds of Wild with Wild. i.e., we’re using it in “production”.
As such, we’ve removed the language from our README that used to say not to use it in production.
That’s not to say you should do production builds with it and just put them out there. We definitely
recommend thorough testing. Wild is still intended primarily for fast development builds, but if you’d like
to use it for other things, who are we to stop you? As always, be sure to let us know if you hit any
problems.</p>

<p>With 0.6.0, we can now link the Chromium web browser. This is an interesting stress test for linkers
because it’s really big - about 1.4 GiB (a previous version of this post incorrectly said that this
was without debug info; it’s actually with debug info).</p>

<p><img src="/images/0.6.0/chromium.svg" alt="Benchmark of time to link Chromium" /></p>

<p>It’s worth noting that the relative difference between lld and mold is very different to what’s seen
in the benchmarks on the <a href="https://github.com/rui314/mold">mold repo</a>. This is likely due to the
benchmark machine being very different. i.e., my laptop has a lot fewer cores.</p>

<p>The following benchmark is for librustc-driver, which is where most of the code in the Rust compiler
goes.</p>

<p><img src="/images/0.6.0/librustc-driver.svg" alt="Benchmark of time to link librustc-driver" /></p>

<p>Our final benchmark is the bevy dylib. This is an interesting benchmark since it has a very large
version script and produces a shared object with more than half a million dynamic symbols.</p>

<p><img src="/images/0.6.0/bevy-dylib.svg" alt="Benchmark of time to link bevy dylib" /></p>

<p>My laptop has 4 cores and 8 threads. All my development work to date has been on this machine and on
it, Wild performs really well, often beating other linkers by a factor of 2 or sometimes more.
However, on machines with more cores, the performance isn’t so great. We’ve started to look into
this to see what we can do about it. One area that particularly stands out is string merging. This
is where there’s a section containing null-terminated strings that need to be deduplicated with
similar sections in other object files. Sounds easy, but getting it to perform well with multiple
threads is hard. We’ve gone through several different implementations of string merging in an
attempt to get good performance. Our current implementation is probably too complex. It also
performs badly in some cases, in particular when there aren’t many input sections but there are
lots of threads; in that case it actually gets slower the more threads it has.</p>

<p>As such, I’m considering doing another rewrite of string merging. One option I’m considering here is
to change the way string-merge sections are represented in ELF files. I suspect that with a few
tweaks to how they’re represented, we could get much better performance.</p>

<p>At a high level, the idea would be to store an additional section containing an index of the strings
to be merged. This index, similar to the symbol table, would contain the start offset of each
string. Where string relocations (references) are currently by section number + section offset, we’d
change them to be section number + string number. That means that we’d need a new relocation type.
Additionally, the string index would also store a hash of each string and the strings would be
sorted by hash.</p>
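
<p>To make this a little more concrete, an entry in the proposed index section might look something
like the following. The exact field widths here are just a guess - no such format exists yet.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Hypothetical entry in the proposed string-merge index section. One entry
/// would be emitted per string and entries would be sorted by `hash`.
#[repr(C)]
struct StringIndexEntry {
    /// Hash of the string. Sorting by hash lets the linker partition and
    /// deduplicate strings without having to re-hash them itself.
    hash: u64,
    /// Offset of the start of the string within the string-merge section.
    offset: u32,
}
</code></pre></div></div>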

<p>I’ll probably do some experiments on this front to see what’s possible. If it performs well, then we
can talk to other linker authors and compiler writers to see if there’s interest in the new
representation. I’ll try to write a blog post about the outcome, even if it doesn’t work out.</p>

<p>While string-merging is the worst offender in terms of scaling with the number of threads, it looks like
other areas are also not ideal. This needs more investigation. I suspect at least part of the issue
might be rayon.</p>

<p>One area where we know we have a problem with rayon is its <code class="language-plaintext highlighter-rouge">try_for_each_init</code> API. We use this to
allocate a per-thread arena in a couple of cases. Unfortunately, rayon runs the init block for
pretty much every work item rather than just running it once per thread. This means that we end up
generating many times more arenas than we need, which is pretty wasteful. This is a known issue in
rayon, but I think it’s perhaps not clear how to fix it with rayon’s architecture.</p>
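
<p>For reference, the usage pattern in question looks roughly like this, with illustrative stand-ins
for our real item and arena types:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use rayon::prelude::*;

// Illustrative stand-in for a per-thread arena. In the real code, creating one
// of these is relatively expensive, which is why getting one per work item
// rather than one per thread is wasteful.
struct Arena;

fn process_all(items: &amp;[u32]) -&gt; Result&lt;(), String&gt; {
    items.par_iter().try_for_each_init(
        || Arena,               // Intended to run roughly once per thread.
        |_arena, _item| Ok(()), // Each item is meant to reuse its thread's arena.
    )
}
</code></pre></div></div>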

<p>I’m keen to try alternatives to rayon to see what difference they make. In particular, I’ve been
looking at <a href="https://github.com/orxfun/orx-parallel/">orx-parallel</a>. Once it has thread-pool support
and some way to handle graph algorithms (e.g. task spawning), I’ll definitely be giving it a try.</p>

<p>Trying <a href="https://github.com/dragostis/chili">chili</a> would also be interesting, but it’s pretty low
level, so we’d need quite a few abstractions built on top of it (e.g. par_iter) before we could
reasonably use it.</p>

<p>If you’ve been following this project for a while, you might be wondering what’s happening with
incremental linking. I thought that I was ready to start on this about a year ago, but it turns out
that I underestimated how much more there was to do to get a solid linker, so fixing bugs and adding
missing features has occupied most of my time. When I started the linker, I wasn’t expecting to get
such good performance with non-incremental linking. Seeing the performance that we’ve gotten has
changed the equation a bit in terms of what seems important to work on. Anyway, I still intend to
get to incremental eventually, but I won’t promise when.</p>

<p>There are lots of other things that we may or may not work on in the coming months. Possibilities
are:</p>

<ul>
  <li>Improving linker-script support</li>
  <li>Linker-plugin LTO. Not needed for Rust LTO, but is needed for LTO of other languages.</li>
  <li>Improved symbol version support (Martin might be looking at this)</li>
  <li>Garbage collection of redundant / unused debug info. This one is a bit daunting, so we probably
won’t do it, but it’d be cool if we did, since it’s something that none of the other linkers do.</li>
  <li>Putting ELF-specific stuff behind a trait to make porting to Windows / Mac easier.</li>
</ul>

<p>Thanks to everyone who has been <a href="https://github.com/sponsors/davidlattimore">sponsoring me</a>, in
particular the following people who have sponsored at least $30 since the last release:</p>

<ul>
  <li>CodeursenLiberte</li>
  <li>pmarks</li>
  <li>mati865</li>
  <li>repi</li>
  <li>Urgau</li>
  <li>teburd</li>
  <li>flba-eb</li>
  <li>tommythorn</li>
  <li>binarybana</li>
  <li>bcmyers</li>
  <li>Kobzol</li>
  <li>HadrienG2</li>
  <li>bes</li>
  <li>twilco</li>
  <li>mstange</li>
  <li>marxin</li>
  <li>joshtriplett</li>
  <li>jonhoo</li>
  <li>+1 anonymous</li>
</ul>

<h1 id="discussions">Discussions</h1>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1no80lz/wild_linker_update_060/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[Today, we’ve released Wild version 0.6.0. There were many changes and we were probably overdue for a release, having last released in May.]]></summary></entry><entry><title type="html">Wild Performance Tricks</title><link href="https://davidlattimore.github.io/posts/2025/09/02/rustforge-wild-performance-tricks.html" rel="alternate" type="text/html" title="Wild Performance Tricks" /><published>2025-09-02T13:00:00+00:00</published><updated>2025-09-02T13:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2025/09/02/rustforge-wild-performance-tricks</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2025/09/02/rustforge-wild-performance-tricks.html"><![CDATA[<p>Last week I had the pleasure of attending RustForge in Wellington, New Zealand. I gave a talk titled
“Wild performance tricks”. You can watch a <a href="https://www.youtube.com/live/6Scgq9fBZQM?t=9246s">recording of my
talk</a>. If you’d prefer to read rather than watch,
the rest of this post will cover more or less the same material. The talk shows some linker
benchmarks, which I’ll skip here and focus instead on the optimisations, which I think are the more
interesting part of the talk.</p>

<p>The tricks here are a few of my favourites that I’ve used in the <a href="https://github.com/davidlattimore/wild">Wild
linker</a>.</p>

<h2 id="mutable-slicing-for-sharing-between-threads">Mutable slicing for sharing between threads</h2>

<p>In the linker, we have a type <code class="language-plaintext highlighter-rouge">SymbolId</code> defined as:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">SymbolId</span><span class="p">(</span><span class="nb">u32</span><span class="p">);</span>
</code></pre></div></div>

<p>We need a way to store resolutions, where one <code class="language-plaintext highlighter-rouge">SymbolId</code> resolves (maps) to another <code class="language-plaintext highlighter-rouge">SymbolId</code>. We
store these in a <code class="language-plaintext highlighter-rouge">Vec&lt;SymbolId&gt;</code> indexed by the source symbol’s ID, so if we need to look up which
symbol <code class="language-plaintext highlighter-rouge">SymbolId(5)</code> maps to, we look at index <code class="language-plaintext highlighter-rouge">5</code> in the <code class="language-plaintext highlighter-rouge">Vec</code>.
Because every symbol maps to some other symbol (possibly itself), we make use of the
entire <code class="language-plaintext highlighter-rouge">Vec</code>, i.e. it’s dense, not sparse. For a sparse mapping, a <code class="language-plaintext highlighter-rouge">HashMap</code> might be preferable.</p>

<p>The Wild linker is very multi-threaded, so we want to be able to process symbols for our input
objects in parallel. To achieve this, we make sure that all symbols for a given object get allocated
adjacent to each other. i.e. each object has <code class="language-plaintext highlighter-rouge">SymbolId</code>s in a contiguous range. This is good for
cache locality because when a thread is working with an object, all its symbols will be nearby in
memory, so more likely to be in cache. It also lets us do things like this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">parallel_process_resolutions</span><span class="p">(</span><span class="k">mut</span> <span class="n">resolutions</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[</span><span class="n">SymbolId</span><span class="p">],</span> <span class="n">objects</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="n">Object</span><span class="p">])</span> <span class="p">{</span>
   <span class="n">objects</span>
       <span class="nf">.iter</span><span class="p">()</span>
       <span class="nf">.map</span><span class="p">(|</span><span class="n">obj</span><span class="p">|</span> <span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">resolutions</span><span class="nf">.split_off_mut</span><span class="p">(</span><span class="o">..</span><span class="n">obj</span><span class="py">.num_symbols</span><span class="p">)</span><span class="nf">.unwrap</span><span class="p">()))</span>
       <span class="nf">.par_bridge</span><span class="p">()</span>
       <span class="nf">.for_each</span><span class="p">(|(</span><span class="n">obj</span><span class="p">,</span> <span class="n">object_resolutions</span><span class="p">)|</span> <span class="p">{</span>
           <span class="n">obj</span><span class="nf">.process_resolutions</span><span class="p">(</span><span class="n">object_resolutions</span><span class="p">);</span>
       <span class="p">});</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here, we’re using the Rayon crate to process the resolutions for all our objects in parallel from
multiple threads. We start by iterating over our objects, then for each object, we use
<code class="language-plaintext highlighter-rouge">split_off_mut</code> to split off a mutable slice of <code class="language-plaintext highlighter-rouge">resolutions</code> that contains the resolutions for that
object. <code class="language-plaintext highlighter-rouge">par_bridge</code> converts this regular Rust iterator into a Rayon parallel iterator. The closure
passed to <code class="language-plaintext highlighter-rouge">for_each</code> then runs in parallel on multiple threads, with each thread getting access to
the object and a mutable slice of that object’s resolutions.</p>

<h2 id="parallel-initialisation-of-the-vec">Parallel initialisation of the Vec</h2>

<p>The previous technique of using <code class="language-plaintext highlighter-rouge">split_off_mut</code> to get multiple non-overlapping mutable slices of
our Vec relies on the Vec having already been initialised. We’d like to initialise our Vec in
parallel, otherwise we’d have to wait for the main thread to fill the entire Vec with a placeholder
value only to then have our threads overwrite those placeholder values. To do this, we can use the
<code class="language-plaintext highlighter-rouge">sharded-vec-writer</code> crate, which was created for use in Wild, but which can be used for similar
purposes elsewhere.</p>

<p>First, we create a Vec with sufficient capacity to store the resolutions for all our symbols:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">resolutions</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">SymbolId</span><span class="o">&gt;</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">with_capacity</span><span class="p">(</span><span class="n">total_num_symbols</span><span class="p">);</span>
</code></pre></div></div>

<p>At this point, we’ve allocated space on the heap for the Vec, but that space is still uninitialised.
i.e. the length is still zero.</p>

<p>Next, we create a <code class="language-plaintext highlighter-rouge">VecWriter</code>, which mutably borrows the Vec, then split that writer into shards,
with each shard having a size equal to the number of symbols in the corresponding object.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">writer</span> <span class="o">=</span> <span class="nn">VecWriter</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">resolutions</span><span class="p">);</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">shards</span> <span class="o">=</span> <span class="n">writer</span><span class="nf">.take_shards</span><span class="p">(</span><span class="n">objects</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">o</span><span class="p">|</span> <span class="n">o</span><span class="py">.num_symbols</span><span class="p">));</span>
</code></pre></div></div>

<p>We can now, in parallel, iterate through our objects and their corresponding shards and initialise
the shards.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">objects</span>
   <span class="nf">.par_iter</span><span class="p">()</span>
   <span class="nf">.zip_eq</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">shards</span><span class="p">)</span>
   <span class="nf">.for_each</span><span class="p">(|(</span><span class="n">obj</span><span class="p">,</span> <span class="n">shard</span><span class="p">)|</span> <span class="p">{</span>
      <span class="k">for</span> <span class="n">symbol</span> <span class="k">in</span> <span class="n">obj</span><span class="nf">.symbols</span><span class="p">()</span> <span class="p">{</span>
         <span class="n">shard</span><span class="nf">.push</span><span class="p">(</span><span class="o">...</span><span class="p">);</span>
      <span class="p">}</span>
   <span class="p">});</span>
</code></pre></div></div>

<p>Lastly, we return the shards to the writer, which verifies that all the shards were fully
initialised and sets the Vec’s length accordingly, after which it can be used normally.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">writer</span><span class="nf">.return_shards</span><span class="p">(</span><span class="n">shards</span><span class="p">);</span>
</code></pre></div></div>

<h2 id="atomic---non-atomic-in-place-conversion">Atomic - non-atomic in-place conversion</h2>

<p>Most parts of the linker can make do with either exclusive access to part of the <code class="language-plaintext highlighter-rouge">resolutions</code> Vec,
or shared access to the entire Vec. However, there’s one part of the linker where we need to perform
random writes to the <code class="language-plaintext highlighter-rouge">resolutions</code> Vec. This is done when we have multiple symbol definitions with
the same name. Originally, I just did this work from the main thread, since I figured most of the
time there would only be a small number of symbols that had the same name. This was mostly true;
however, for large C++ binaries like Chromium, it turns out that there are actually a lot of symbols
with the same names, presumably due to C++’s use of header files, which create lots of identical
definitions.</p>

<p>To allow random writes to <code class="language-plaintext highlighter-rouge">resolutions</code>, we introduce a new type:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">AtomicSymbolId</span><span class="p">(</span><span class="n">AtomicU32</span><span class="p">);</span>
</code></pre></div></div>

<p>Being an atomic, we can write to an <code class="language-plaintext highlighter-rouge">AtomicSymbolId</code> using only a shared (non-exclusive) reference.
However, we need a way to temporarily view our <code class="language-plaintext highlighter-rouge">Vec&lt;SymbolId&gt;</code> as a <code class="language-plaintext highlighter-rouge">&amp;[AtomicSymbolId]</code>.</p>

<p>The standard library has something that might help - <code class="language-plaintext highlighter-rouge">AtomicU32::from_mut_slice</code>:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">from_mut_slice</span><span class="p">(</span><span class="n">v</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[</span><span class="nb">u32</span><span class="p">])</span> <span class="k">-&gt;</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="p">[</span><span class="n">AtomicU32</span><span class="p">]</span>
</code></pre></div></div>

<p>However, it’s unstable (nightly only). Even if it were stable, it only works with slices of
primitive types, so we’d have to lose our newtypes (SymbolId etc).</p>

<p>Another option would be to always use atomics; however, that would quite possibly hurt the performance of
the rest of the linker, which doesn’t need atomics. It’d also hurt ergonomics, since currently our
<code class="language-plaintext highlighter-rouge">SymbolId</code>s implement <code class="language-plaintext highlighter-rouge">Copy</code>, but if they wrapped an <code class="language-plaintext highlighter-rouge">AtomicU32</code>, then they wouldn’t be able to.</p>

<p>A reasonable option at this point would be to resort to unsafe and use something like
<code class="language-plaintext highlighter-rouge">core::mem::transmute</code>. We’d need to check all the rules and make sure that we were meeting all the
requirements. This is not a bad option, but I personally like the challenge of doing things without
unsafe if I can, especially if I can do so without loss of performance.</p>
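
<p>For comparison, the unsafe version might look something like the following sketch, using a pointer
cast rather than <code class="language-plaintext highlighter-rouge">transmute</code> and assuming both types are <code class="language-plaintext highlighter-rouge">#[repr(transparent)]</code> wrappers around a
32-bit value:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::atomic::AtomicU32;

#[repr(transparent)]
struct SymbolId(u32);

#[repr(transparent)]
struct AtomicSymbolId(AtomicU32);

fn as_atomic(symbols: &amp;mut [SymbolId]) -&gt; &amp;[AtomicSymbolId] {
    // SAFETY: both types are `repr(transparent)` wrappers around a 32-bit
    // value, and `AtomicU32` has the same in-memory representation as `u32`.
    // Taking `&amp;mut` ensures nothing else can perform non-atomic accesses that
    // alias the returned shared view while it's in use.
    unsafe { &amp;*(symbols as *mut [SymbolId] as *const [AtomicSymbolId]) }
}
</code></pre></div></div>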

<p>Indeed, it turns out that we can, as follows:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">into_atomic</span><span class="p">(</span><span class="n">symbols</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">SymbolId</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">AtomicSymbolId</span><span class="o">&gt;</span> <span class="p">{</span>
   <span class="n">symbols</span>
       <span class="nf">.into_iter</span><span class="p">()</span>
       <span class="nf">.map</span><span class="p">(|</span><span class="n">s</span><span class="p">|</span> <span class="nf">AtomicSymbolId</span><span class="p">(</span><span class="nn">AtomicU32</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">s</span><span class="na">.0</span><span class="p">)))</span>
       <span class="nf">.collect</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’d be reasonable to think that this will have a runtime cost, however it doesn’t. The reason is
that the Rust standard library has a nice optimisation in it that when we consume a Vec and collect
the result into a new Vec, in many circumstances, the heap allocation of the original Vec can be
reused. This applies in this case. But even with the heap allocation being reused, aren’t we still
looping over all the elements to transform them? Because the in-memory representation of an
<code class="language-plaintext highlighter-rouge">AtomicSymbolId</code> is identical to that of a <code class="language-plaintext highlighter-rouge">SymbolId</code>, our loop becomes a no-op and is optimised
away.</p>

<p>We can verify this by looking at the assembly produced for this function:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">movups</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
<span class="nf">mov</span>     <span class="nb">rax</span><span class="p">,</span> <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">]</span>
<span class="nf">movups</span>  <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">xmm0</span>
<span class="nf">mov</span>     <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="nb">rax</span>
<span class="nf">ret</span>
</code></pre></div></div>

<p>The main takeaway from this assembly is that there’s no branching, no looping, just a few moves and
a return. If we allowed this function to be inlined into the caller, it would likely vanish to
nothing.</p>

<p>For conversion back to the non-atomic form, we can do much the same:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">into_non_atomic</span><span class="p">(</span><span class="n">atomic_symbols</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">AtomicSymbolId</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">SymbolId</span><span class="o">&gt;</span> <span class="p">{</span>
   <span class="n">atomic_symbols</span>
       <span class="nf">.into_iter</span><span class="p">()</span>
       <span class="nf">.map</span><span class="p">(|</span><span class="n">s</span><span class="p">|</span> <span class="nf">SymbolId</span><span class="p">(</span><span class="n">s</span><span class="na">.0</span><span class="nf">.into_inner</span><span class="p">()))</span>
       <span class="nf">.collect</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The main thing to note here is that we avoid doing an atomic load from the atomic and instead
consume the atomic with <code class="language-plaintext highlighter-rouge">into_inner</code>. This is easier for the compiler to optimise and if we look at
the assembly produced it’s identical to what we got for <code class="language-plaintext highlighter-rouge">into_atomic</code>.</p>

<p>To actually use these functions, we first need to get ownership of our Vec using <code class="language-plaintext highlighter-rouge">core::mem::take</code>.
This puts an empty Vec in its place. Empty Vecs don’t heap allocate, so this is very cheap. We then
call <code class="language-plaintext highlighter-rouge">into_atomic</code> to convert the Vec into the form we need.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">atomic_resolutions</span> <span class="o">=</span> <span class="nf">into_atomic</span><span class="p">(</span><span class="nn">core</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">take</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="k">self</span><span class="py">.resolutions</span><span class="p">));</span>
</code></pre></div></div>

<p>We can then do whatever parallel processing we need with the Vec in its atomic form.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">process_resolutions_in_parallel</span><span class="p">(</span><span class="o">&amp;</span><span class="n">atomic_resolutions</span><span class="p">);</span>
</code></pre></div></div>

<p>Finally, we convert back to the original non-atomic form and store back where we got it from,
overwriting the empty Vec that we temporarily put in its place.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">self</span><span class="py">.resolutions</span> <span class="o">=</span> <span class="nf">into_non_atomic</span><span class="p">(</span><span class="n">atomic_resolutions</span><span class="p">);</span>
</code></pre></div></div>

<p>One thing worth noting here is that if we panic (or do an early return), we might leave
<code class="language-plaintext highlighter-rouge">self.resolutions</code> as the empty Vec. This isn’t a problem in the linker, since if we’re returning an
error or have hit a panic, then we don’t care at that point about resolutions. It would be possible
to ensure that the proper Vec was restored for use-cases where that was important, however it would
add extra complexity and might be enough to convince me that it’d be better to just use transmute.</p>

<h2 id="buffer-reuse">Buffer reuse</h2>

<p>Doing too much heap allocation tends to hurt performance. A common trick is to move heap allocations
outside of loops. For example, rather than this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">loop</span> <span class="p">{</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">buffer</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
    <span class="c1">// Do work with `buffer`.</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We might prefer to allocate buffer before the loop, then just clear it inside the loop:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">buffer</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">loop</span> <span class="p">{</span>
    <span class="n">buffer</span><span class="nf">.clear</span><span class="p">();</span>
    <span class="c1">// Do work with `buffer`.</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, if we’re storing something into a Vec that has a non-static lifetime, then we can run into
problems. Here, we have a variable <code class="language-plaintext highlighter-rouge">text</code>, which holds a <code class="language-plaintext highlighter-rouge">String</code>. We then split that string and
store the resulting string-slices into <code class="language-plaintext highlighter-rouge">buffer</code>. Even though we clear <code class="language-plaintext highlighter-rouge">buffer</code> at the end of the
loop, the compiler is unhappy. It wants <code class="language-plaintext highlighter-rouge">text</code> to outlive <code class="language-plaintext highlighter-rouge">buffer</code> because we’re storing references
to <code class="language-plaintext highlighter-rouge">text</code> into <code class="language-plaintext highlighter-rouge">buffer</code>.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">buffer</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">loop</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">text</span> <span class="o">=</span> <span class="nf">get_text</span><span class="p">();</span>
    <span class="n">buffer</span><span class="nf">.extend</span><span class="p">(</span><span class="n">text</span><span class="nf">.split</span><span class="p">(</span><span class="s">","</span><span class="p">));</span>
    <span class="c1">// Do work with `buffer`.</span>
    <span class="n">buffer</span><span class="nf">.clear</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We could at this point give up and just move our Vec creation back inside the loop. However, it
turns out that there’s another solution.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="n">reuse_vec</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">U</span><span class="o">&gt;</span><span class="p">(</span><span class="k">mut</span> <span class="n">v</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">U</span><span class="o">&gt;</span> <span class="p">{</span>
   <span class="n">v</span><span class="nf">.clear</span><span class="p">();</span>
   <span class="n">v</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">x</span><span class="p">|</span> <span class="nd">unreachable!</span><span class="p">())</span><span class="nf">.collect</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The idea of this function is to convert from a Vec of some type to an empty Vec of another type,
reusing the heap allocation. This works in a very similar way to how we converted between atomic and
non-atomic <code class="language-plaintext highlighter-rouge">SymbolId</code>s, except this time because we first clear the Vec, the body of our <code class="language-plaintext highlighter-rouge">map</code>
function is unreachable.</p>

<p>The optimisation in the Rust standard library that allows reuse of the heap allocation will only
actually work if the size and alignment of <code class="language-plaintext highlighter-rouge">T</code> and <code class="language-plaintext highlighter-rouge">U</code> are the same, so let’s verify that that’s the
case. We can do the check at compile time, so if we accidentally call this function with
incompatible <code class="language-plaintext highlighter-rouge">T</code> and <code class="language-plaintext highlighter-rouge">U</code>, we’ll get a compilation error at the call site.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="n">reuse_vec</span><span class="o">&lt;</span><span class="n">T</span><span class="p">,</span> <span class="n">U</span><span class="o">&gt;</span><span class="p">(</span><span class="k">mut</span> <span class="n">v</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">U</span><span class="o">&gt;</span> <span class="p">{</span>
   <span class="k">const</span> <span class="p">{</span>
       <span class="nd">assert!</span><span class="p">(</span><span class="nn">size_of</span><span class="p">::</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">()</span> <span class="o">==</span> <span class="nn">size_of</span><span class="p">::</span><span class="o">&lt;</span><span class="n">U</span><span class="o">&gt;</span><span class="p">());</span>
       <span class="nd">assert!</span><span class="p">(</span><span class="nn">align_of</span><span class="p">::</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">()</span> <span class="o">==</span> <span class="nn">align_of</span><span class="p">::</span><span class="o">&lt;</span><span class="n">U</span><span class="o">&gt;</span><span class="p">());</span>
   <span class="p">}</span>
   <span class="n">v</span><span class="nf">.clear</span><span class="p">();</span>
   <span class="n">v</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">_</span><span class="p">|</span> <span class="nd">unreachable!</span><span class="p">())</span><span class="nf">.collect</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Let’s verify that this optimises as we expect:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span>     <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="mi">0</span>
<span class="nf">movups</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
<span class="nf">movups</span>  <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">xmm0</span>
<span class="nf">mov</span>     <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="mi">0</span>
<span class="nf">ret</span>
</code></pre></div></div>

<p>More or less the same assembly as before, except that we’re now setting the length of the Vec to 0.
Note that the loop and the panic from the use of <code class="language-plaintext highlighter-rouge">unreachable!</code> are gone.</p>

<p>We can now integrate this into our previous code as follows:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">buffer_store</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;&amp;</span><span class="nb">str</span><span class="o">&gt;</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">loop</span> <span class="p">{</span>
    <span class="k">let</span> <span class="k">mut</span> <span class="n">buffer</span> <span class="o">=</span> <span class="nf">reuse_vec</span><span class="p">(</span><span class="n">buffer_store</span><span class="p">);</span>
    <span class="k">let</span> <span class="n">text</span> <span class="o">=</span> <span class="nf">get_text</span><span class="p">();</span>
    <span class="n">buffer</span><span class="nf">.extend</span><span class="p">(</span><span class="n">text</span><span class="nf">.split</span><span class="p">(</span><span class="s">","</span><span class="p">));</span>
    <span class="c1">// Do work with `buffer`.</span>
    <span class="n">buffer_store</span> <span class="o">=</span> <span class="nf">reuse_vec</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Effectively, each time around the loop we move out of <code class="language-plaintext highlighter-rouge">buffer_store</code>, converting the type of the
<code class="language-plaintext highlighter-rouge">Vec</code>, use it for a bit, then convert it back and store it again in <code class="language-plaintext highlighter-rouge">buffer_store</code>. The only time
we’ll need a new heap allocation is when our <code class="language-plaintext highlighter-rouge">Vec</code> needs to grow. The types of <code class="language-plaintext highlighter-rouge">buffer_store</code> and
<code class="language-plaintext highlighter-rouge">buffer</code> are both <code class="language-plaintext highlighter-rouge">Vec&lt;&amp;str&gt;</code>, however the lifetime of the references is different.</p>

<h2 id="deallocation-on-a-separate-thread">Deallocation on a separate thread</h2>

<p>Freeing memory is generally a lot slower than allocating it. If we’ve done a very large allocation,
it can sometimes be worthwhile passing it to another thread to free it, so that we can get on with
other work.</p>

<p>For example, if using rayon, we might use <code class="language-plaintext highlighter-rouge">rayon::spawn</code> to spawn a task that drops our buffer:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">process_buffer</span><span class="p">(</span><span class="n">buffer</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">u8</span><span class="o">&gt;</span><span class="p">)</span> <span class="p">{</span>
   <span class="c1">// Do some work with `buffer`.</span>

   <span class="nn">rayon</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(||</span> <span class="nf">drop</span><span class="p">(</span><span class="n">buffer</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">rayon::spawn</code> itself does a heap allocation, so this would only be worthwhile if
<code class="language-plaintext highlighter-rouge">buffer</code> was potentially very large. This is definitely something you’d want to benchmark to see if
it actually improves the runtime for your use-case. There is at least one place in the Wild linker
where we did this and it did give a measurable reduction in runtime.</p>

<p>Similar to buffer reuse, if our heap allocation has non-static lifetimes associated with it, we can
get rid of them using <code class="language-plaintext highlighter-rouge">reuse_vec</code>.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">process_buffer</span><span class="p">(</span><span class="n">names</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;&amp;</span><span class="p">[</span><span class="nb">u8</span><span class="p">]</span><span class="o">&gt;</span><span class="p">)</span> <span class="p">{</span>
   <span class="c1">// Do some work with `names`.</span>

   <span class="k">let</span> <span class="n">names</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;&amp;</span><span class="p">[</span><span class="nb">u8</span><span class="p">]</span><span class="o">&gt;</span> <span class="o">=</span> <span class="nf">reuse_vec</span><span class="p">(</span><span class="n">names</span><span class="p">);</span>
   <span class="nn">rayon</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(||</span> <span class="nf">drop</span><span class="p">(</span><span class="n">names</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In this case, we’re converting the <code class="language-plaintext highlighter-rouge">Vec</code> from a <code class="language-plaintext highlighter-rouge">Vec&lt;&amp;[u8]&gt;</code> to a <code class="language-plaintext highlighter-rouge">Vec&lt;&amp;'static [u8]&gt;</code>.</p>

<h2 id="bonus-strip-lifetime-with-non-trivial-drop">Bonus: Strip lifetime with non-trivial Drop</h2>

<p>This bonus tip wasn’t included in the talk; it builds on the previous tip and is in
response to a question by VorpalWay on Reddit. If you want to drop a <code class="language-plaintext highlighter-rouge">Vec&lt;T&gt;</code> where <code class="language-plaintext highlighter-rouge">T</code> has both a
non-static lifetime and a non-trivial <code class="language-plaintext highlighter-rouge">Drop</code>, then things get slightly more tricky. The trick here
is to convert to a struct that is the same as <code class="language-plaintext highlighter-rouge">T</code>, but has non-static references replaced with
<code class="language-plaintext highlighter-rouge">MaybeUninit</code>.</p>

<p>For example, suppose we have the following struct:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">Foo</span><span class="o">&lt;</span><span class="nv">'a</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="n">owned</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span>
    <span class="n">borrowed</span><span class="p">:</span> <span class="o">&amp;</span><span class="nv">'a</span> <span class="nb">str</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We can define a new struct:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">StaticFoo</span> <span class="p">{</span>
    <span class="n">owned</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span>
    <span class="n">borrowed</span><span class="p">:</span> <span class="n">MaybeUninit</span><span class="o">&lt;&amp;</span><span class="k">'static</span> <span class="nb">str</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We can then convert our Vec to the new type with zero cost and no unsafe:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">without_lifetime</span><span class="p">(</span><span class="n">foos</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Foo</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">StaticFoo</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="n">foos</span><span class="nf">.into_iter</span><span class="p">()</span>
        <span class="nf">.map</span><span class="p">(|</span><span class="n">f</span><span class="p">|</span> <span class="n">StaticFoo</span> <span class="p">{</span>
            <span class="n">owned</span><span class="p">:</span> <span class="n">f</span><span class="py">.owned</span><span class="p">,</span>
            <span class="n">borrowed</span><span class="p">:</span> <span class="nn">MaybeUninit</span><span class="p">::</span><span class="nf">uninit</span><span class="p">(),</span>
        <span class="p">})</span>
        <span class="nf">.collect</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The presence of <code class="language-plaintext highlighter-rouge">MaybeUninit::uninit()</code> tells the compiler that it’s OK to have anything there, so it
can choose to leave whatever <code class="language-plaintext highlighter-rouge">&amp;str</code> was in the original <code class="language-plaintext highlighter-rouge">Foo</code> struct. This means that it’s valid to
produce a <code class="language-plaintext highlighter-rouge">StaticFoo</code> with the same in-memory representation as the <code class="language-plaintext highlighter-rouge">Foo</code> that it replaces, allowing
the compiler to eliminate the loop. The asm for this function is:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nf">movups</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">]</span>
 <span class="nf">mov</span>     <span class="nb">rax</span><span class="p">,</span> <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rsi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">]</span>
 <span class="nf">movups</span>  <span class="nv">xmmword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">],</span> <span class="nv">xmm0</span>
 <span class="nf">mov</span>     <span class="kt">qword</span><span class="p">,</span> <span class="nv">ptr</span><span class="p">,</span> <span class="p">[</span><span class="nb">rdi</span><span class="p">,</span> <span class="o">+</span><span class="p">,</span> <span class="mi">16</span><span class="p">],</span> <span class="nb">rax</span>
 <span class="nf">ret</span>
</code></pre></div></div>

<p>i.e. the loop was indeed eliminated.</p>

<p>Now that we have a Vec with no non-static lifetimes, we can safely move it to another thread.</p>
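
<p>Putting the last two tips together, a minimal sketch of how this might be used (<code class="language-plaintext highlighter-rouge">process_foos</code> is a made-up caller, not something from the talk):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fn process_foos(foos: Vec&lt;Foo&gt;) {
    // Do some work with `foos`.

    // Strip the non-static lifetime, then hand the buffer to another
    // thread to be dropped so that this thread can get on with other work.
    let foos: Vec&lt;StaticFoo&gt; = without_lifetime(foos);
    rayon::spawn(|| drop(foos));
}
</code></pre></div></div>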

<h1 id="thanks">Thanks</h1>

<p>Thanks to my <a href="https://github.com/sponsors/davidlattimore">github sponsors</a>. Your contributions help
to make it possible for me to continue to work on this kind of stuff rather than going and getting a
“real job”.</p>

<ul>
  <li>CodeursenLiberte</li>
  <li>Urgau</li>
  <li>pmarks</li>
  <li>repi</li>
  <li>embark-studios</li>
  <li>mati865</li>
  <li>bes</li>
  <li>joshtriplett</li>
  <li>mstange</li>
  <li>bcmyers</li>
  <li>Rafferty97</li>
  <li>acshi</li>
  <li>Kobzol</li>
  <li>flba-eb</li>
  <li>jonhoo</li>
  <li>marxin</li>
  <li>tommythorn</li>
  <li>binarybana</li>
  <li>teburd</li>
  <li>bearcove</li>
  <li>yerke</li>
  <li>teh</li>
  <li>twilco</li>
  <li>Shnatsel</li>
  <li>coastalwhite</li>
  <li>wezm</li>
  <li>davidcornu</li>
  <li>gendx</li>
  <li>rrbutani</li>
  <li>nazar-pc</li>
  <li>willstott101</li>
  <li>tatsuya6502</li>
  <li>teohhanhui</li>
  <li>jkendall327</li>
  <li>EdorianDark</li>
  <li>drmason13</li>
  <li>HadrienG2</li>
  <li>jplatte</li>
  <li>rukai</li>
  <li>ymgyt</li>
  <li>dream-dasher</li>
  <li>alexkirsz</li>
  <li>Pratyush</li>
  <li>Tudyx</li>
  <li>coreyja</li>
  <li>dralley</li>
  <li>irfanghat</li>
  <li>mvolfik</li>
  <li>simtheverse</li>
</ul>

<h2 id="discussion">Discussion</h2>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1n7814i/wild_performance_tricks/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[Last week I had the pleasure of attending RustForge in Wellington, New Zealand. I gave a talk titled “Wild performance tricks”. You can watch a recording of my talk. If you’d prefer to read rather than watch, the rest of this post will cover more or less the same material. The talk shows some linker benchmarks, which I’ll skip here and focus instead on the optimisations, which I think are the more interesting part of the talk.]]></summary></entry><entry><title type="html">Audio: Compose Podcast Interview</title><link href="https://davidlattimore.github.io/posts/2025/06/02/compose-podcast-interview.html" rel="alternate" type="text/html" title="Audio: Compose Podcast Interview" /><published>2025-06-02T13:00:00+00:00</published><updated>2025-06-02T13:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2025/06/02/compose-podcast-interview</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2025/06/02/compose-podcast-interview.html"><![CDATA[<p>Last week, I had the pleasure of having a conversation with Tim McNamara for <a href="https://timclicks.dev/podcast/david-lattimore-faster-linker-faster-builds">an
episode</a> of his podcast,
Compose. We talked about the Wild linker, linking in general, Rust coding styles, contributing to
open source and a range of other topics.</p>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[Last week, I had the pleasure of having a conversation with Tim McNamara for an episode of his podcast, Compose. We talked about the Wild linker, linking in general, Rust coding styles, contributing to open source and a range of other topics.]]></summary></entry><entry><title type="html">Designing Wild’s incremental linking</title><link href="https://davidlattimore.github.io/posts/2024/11/19/designing-wilds-incremental-linking.html" rel="alternate" type="text/html" title="Designing Wild’s incremental linking" /><published>2024-11-19T14:00:00+00:00</published><updated>2024-11-19T14:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/11/19/designing-wilds-incremental-linking</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/11/19/designing-wilds-incremental-linking.html"><![CDATA[<h1 id="designing-wilds-incremental-linking">Designing Wild’s incremental linking</h1>

<p>Whenever I’m about to embark on implementing something even slightly non-trivial, I typically write
out a plan for what I’m about to do. Writing it down helps me to uncover things that I hadn’t
thought about. Usually, I write these designs only for myself, however this time, I thought I’d try
something different, writing a design and sharing it with anyone who might be interested. My hope is
that some people might have interesting ideas for variations on this design that I hadn’t
considered. If nothing else, hopefully this document will be an interesting read for someone.</p>

<p>If you’ve read my <a href="https://davidlattimore.github.io/">previous posts</a>, you’ll know that I’ve been
writing a linker, called Wild, with the goal of being a very fast incremental linker. I didn’t make
the linker incremental from the start because I wanted to get it reasonably correct and reasonably
fast before adding incremental linking to the mix. Wild is now working well enough that I’ve
switched to using it as my default linker. That’s not to say it does everything correctly, e.g.
large code model stuff, which is needed for very large executables, but it works well
enough for compiling Rust code with bits of C or C++ mixed in. It’s also relatively fast - on my
laptop, Wild can link itself 48% faster than Mold. If there’s lots of debug info however, Wild is
currently often slower. I hope to work on improving the performance of linking debug info
eventually.</p>

<p>So I feel that while there’s plenty more that could be done to improve Wild’s non-incremental
linking, now is probably a good time to start work on incremental. To that end, this document is my
plan for how I intend to go about that.</p>

<h1 id="the-end-goal">The end-goal</h1>

<p>First, let’s discuss the reason for wanting incremental linking. Mostly it’s to make linking as fast
as possible. I’d like it if when I’m making edits to a test case, I could see the pass / fail status
of that test within say 10 ms of hitting save. Incremental linking alone isn’t sufficient to reach
this goal, but it is necessary. There is lots of work needed to get to that point, including lots of
big changes to how the Rust compiler works. Let’s leave those changes for separate discussions,
since this document is about incremental linking.</p>

<p>In order to get that kind of speed, we can’t afford to reprocess all the inputs, rewrite the entire
output etc. We need to make minimal edits to update the existing binary on disk. For example, if
we’re editing our test case, then we’d like to just be rewriting the part of the executable that
contains the compiled code for that test, plus possibly a table of line numbers used by panics.</p>

<p>A further goal of incremental linking is as a step towards hot code reloading - i.e. updating a
binary while it is running. That too, however, deserves its own document, so I won’t go into it in
detail now. It does however influence the design. For example, one idea for fast incremental linking
is to do an initial link with all your dependencies, then the final link just tacks on your code.
This might be fine for the goal of fast linking, however it doesn’t help with the eventual goal of
hot code reloading.</p>

<h1 id="out-of-scope-for-first-implementation">Out-of-scope for first implementation</h1>

<p>Linkers are pretty complex bits of software even without incremental linking. Adding incremental
updates into the mix adds significantly to this. However we can make things a little easier on
ourselves by reducing the scope somewhat.</p>

<h2 id="archive-semantics">Archive semantics</h2>

<p>One area of complexity in linkers is related to archive entries - so called “archive semantics”.
Linkers ignore entries in archives unless the archive entry defines at least one symbol that a
previous input object left undefined. This is used to avoid unnecessary initialisation of subsystems
that aren’t in use. Changing which archive entries are active in an incremental link would add
substantial complexity. Our target use case is incremental linking of Rust code, and Rust code,
while it uses archives for rlibs, doesn’t make use of archive semantics. So supporting incremental
updates of archive semantics would be a lot of work for very questionable benefit towards our use
case. For that reason, the plan is to punt on it for the first implementation.</p>

<h2 id="unused-section-garbage-collection">Unused section garbage collection</h2>

<p>Most linkers support a flag <code class="language-plaintext highlighter-rouge">--gc-sections</code>, which causes them to get rid of sections that aren’t
reachable from a root or marked as must-keep. Wild supports this flag, and in fact does it by
default. However supporting this in conjunction with incremental linking would add extra complexity,
so we’ll skip this for now. The main downsides of this are that the binary will end up a bit bigger
and we might spend a bit of extra time copying data into the output file.</p>

<h2 id="removal-of-old-merged-strings">Removal of old merged strings</h2>

<p>Strings in string-merge sections may be removed in subsequent links. Removing them from the output
would require reference counting each string. Besides taking up a little extra space in the binary,
there doesn’t seem to be much downside to keeping them around, so for now, we’ll do that.</p>

<h2 id="strictly-ordered-sections">Strictly ordered sections</h2>

<p>Our approach to incremental linking depends on not moving stuff that hasn’t changed. If we need to
put input sections in a particular order in the output, then we might need to relocate unchanged
sections in order to make space. This would hurt performance. For most output sections, it’s fine to place input
sections in any order. However, a few require a specific order. For example <code class="language-plaintext highlighter-rouge">.init</code> is a section that contains a single
function and parts of that function come from different object files. The return instruction for
this function comes from <code class="language-plaintext highlighter-rouge">crtn.o</code> and it’s essential that this goes at the end of the output
section.</p>

<p>For now, we’ll just fall back to a full initial-incremental link if any sections that have strict
ordering get changed. Fortunately these aren’t the kinds of sections that you tend to edit when
iterating on code unless you’re doing something very obscure. Modern code tends not to use these
sections anyway, but rather uses <code class="language-plaintext highlighter-rouge">.init_array</code> which we discuss later in this document.</p>

<h1 id="configuring-incremental-linking">Configuring incremental linking</h1>

<p>To enable incremental linking, I’ll add a flag <code class="language-plaintext highlighter-rouge">--incremental</code>. I’ll probably also support setting
an environment variable - <code class="language-plaintext highlighter-rouge">WILD_INCREMENTAL=1</code>, since in many cases that may be easier for a user to
set than a flag.</p>

<p>I’ll likely add additional configuration, in particular a flag for what percentage of additional
space to add to each output section to allow for growth.</p>

<h1 id="object-diffing">Object diffing</h1>

<p>Ideally, when doing an incremental link, the compiler would pass only the bits that have changed to
the linker. This would be in the form of a list of updated, added and maybe deleted sections. Each
section would generally contain exactly one symbol that would point to the start of the section,
however we also need to handle sections with zero symbols or with more than one.</p>

<p>Unfortunately, for the time being, the compiler isn’t going to give us this list of updated
sections, so we’ll have to compute it ourselves. By separating incremental linking into two phases -
(1) computing a diff then (2) applying a diff - it’ll be easier to take advantage of future
compilers that can just supply us the diff directly. This separation may also open up extra options
for debugging and testing of incremental linking - e.g. testing each part separately.</p>

<p>The first stage of diffing will be determining which files have changed. We could hash the entirety
of each file, however, with lots of input files, that would be expensive, so instead the plan is to
just check to see if the modification timestamp has changed.</p>
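
<p>As a rough sketch of that check using only the standard library (the real implementation would also need to record the timestamps from the previous link somewhere):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::path::Path;
use std::time::SystemTime;

/// Returns true if `path` appears to have changed since `previous_mtime`
/// was recorded. Errors are treated as "changed" so that we err on the
/// side of re-examining the file.
fn file_changed(path: &amp;Path, previous_mtime: SystemTime) -&gt; bool {
    std::fs::metadata(path)
        .and_then(|m| m.modified())
        .map(|mtime| mtime != previous_mtime)
        .unwrap_or(true)
}
</code></pre></div></div>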

<p>Once we have a list of changed files to look at, we can open each changed file and determine which
sections have changed.</p>

<p>Matching sections between the old and new versions of the object file is slightly tricky. For
sections containing code, the section should have a name that includes the mangled symbol name of
the function. e.g. <code class="language-plaintext highlighter-rouge">.text._ZN4core3fmt9Formatter9write_fmt</code> These are easy to match, since the name
should remain unchanged. However, sections containing anonymous data are harder. They have names
like <code class="language-plaintext highlighter-rouge">.rodata..L__unnamed_75</code>, which will likely change when edits are made. My plan at this stage
is to match these sections by looking at what references them. So for example if
<code class="language-plaintext highlighter-rouge">.text._ZN4core3fmt9Formatter9write_fmt</code> in the old object file references <code class="language-plaintext highlighter-rouge">.rodata..L__unnamed_75</code>
then in the new object file it references <code class="language-plaintext highlighter-rouge">.rodata..L__unnamed_78</code>, we’d match those two sections
for the purposes of diffing.</p>

<p>In order to diff the old object file against the new object file, we need to keep a copy of the old
object file. This can be done relatively quickly by making a hard link for each input file. Files to
which we don’t have write access, or which are located on a different filesystem than our
incremental state directory, would be skipped. These are likely system libraries and are unlikely to
change. If they do end up changing, then we’d need to link from scratch.</p>
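
<p>A sketch of that snapshotting step (<code class="language-plaintext highlighter-rouge">state_dir</code> here is a hypothetical handle to the incremental state directory described in the next section):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::path::Path;

/// Hard-links `input` into the incremental state directory so that the old
/// version can later be diffed against whatever the compiler writes next.
/// Failures (different filesystem, no write access) are ignored; if such a
/// file later changes, we fall back to linking from scratch.
fn snapshot_input(input: &amp;Path, state_dir: &amp;Path) {
    if let Some(name) = input.file_name() {
        let _ = std::fs::hard_link(input, state_dir.join(name));
    }
}
</code></pre></div></div>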

<p>Rust writes each crate that we depend on as an .rlib file. These are archives which will often
contain multiple object files - one for each codegen unit. When such a crate is edited, one or more
of the codegen units within the archive will be updated. Unlike with bare object files on disk, we
can’t rely on timestamps to determine whether a file within the archive has changed. At least we
can’t at the moment because rustc doesn’t set the timestamp field. We can probably just compare the
bytes of the files directly for now.</p>

<h1 id="persistent-state">Persistent state</h1>

<p>Wild will need to write various bits of state to disk in order to support making incremental updates
to the output file. My plan at this stage is to put these into a directory with a name based on the
output file. For example, if the output file is <code class="language-plaintext highlighter-rouge">target/debug/ripgrep</code>, then the incremental
directory would be <code class="language-plaintext highlighter-rouge">target/debug/ripgrep.incr</code>.</p>
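
<p>Deriving that directory name is straightforward; a small sketch of the naming scheme:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::path::{Path, PathBuf};

/// e.g. `target/debug/ripgrep` -&gt; `target/debug/ripgrep.incr`
fn incremental_dir(output: &amp;Path) -&gt; PathBuf {
    let mut name = output
        .file_name()
        .map(|n| n.to_os_string())
        .unwrap_or_default();
    name.push(".incr");
    output.with_file_name(name)
}
</code></pre></div></div>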

<p>When accessing state files during an incremental link, we’ll often want to avoid reading the entire
file. In most cases, this will be done by using mmap to access the file. This means that the on-disk
and in-memory format will need to be the same. We don’t need to worry about things like endianness
of the data, since moving the incremental link state between machines isn’t a use-case we intend to
support.</p>
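
<p>A sketch of what opening such a state file might look like, assuming the <code class="language-plaintext highlighter-rouge">memmap2</code> crate (not necessarily what Wild will end up using):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::fs::File;
use std::path::Path;

/// Maps a state file so that it can be read in place, without
/// deserialising it. The caller reinterprets the bytes as whatever
/// table type the file holds.
fn map_state_file(path: &amp;Path) -&gt; std::io::Result&lt;memmap2::Mmap&gt; {
    let file = File::open(path)?;
    // Safety: we rely on nothing truncating or rewriting the file while
    // it's mapped.
    unsafe { memmap2::Mmap::map(&amp;file) }
}
</code></pre></div></div>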

<h2 id="previous-input-files-and-other-metadata">Previous input files and other metadata</h2>

<p>As mentioned above with regard to diffing, we’ll likely need to store copies of the old input files.
We can put these in a subdirectory of our state directory.</p>

<p>We’ll also need an index file that contains information about all of the input files and arguments
for the previous link. This file shouldn’t be large, so we can probably afford to serialise and
deserialise it each time.</p>

<p>We can store additional information here such as:</p>

<ul>
  <li>The size and capacity of each output section.</li>
  <li>The version of Wild used.</li>
  <li>Any additional small bits of information such as sizes of various tables.</li>
</ul>

<h2 id="symbol-name-to-symbol-id-map">Symbol name to symbol ID map</h2>

<p>When linking the updated code, we need to be able to quickly look up symbols by name and we don’t
want to have to rebuild the map from symbol names to symbol IDs every time we do an incremental
link. This means that we’ll need to persist our map from symbol names to symbol IDs to disk.
Currently, this is a hashmap stored in memory. The keys of this map are currently <code class="language-plaintext highlighter-rouge">&amp;[u8]</code> - i.e.
slices of bytes. These slices reference data from the original input objects. This means that when
building this hashmap, we don’t currently copy the bytes of the symbol names, we just use them
in-place. Persisting this is somewhat tricky.</p>

<p>In the short term, the easiest option is probably to just accept that when incremental linking is
enabled, we’ll need to copy the symbol names into our map. If we’re doing that, we can use some
existing crate like <code class="language-plaintext highlighter-rouge">sled</code> to store our map. Besides needing to copy our symbol names, sled does
other things that we don’t really need like transactions. But it’ll get us going quickly and we can
iterate from there.</p>

<p>Longer term, I think what will give the best performance will be something like <code class="language-plaintext highlighter-rouge">odht</code> (an on-disk
hash table), but where the keys are external to the table. So hashing or comparing a key would
involve an external lookup to fetch the bytes of the actual key.</p>

<p>It’s likely that whatever we do here, it’ll be slower than what we’re doing now. We don’t want to
slow down the linker if incremental linking is disabled, so we’ll need to keep the existing
in-memory hashmap implementation around. We should be able to switch between the in-memory and the
on-disk maps by making code that does name lookups generic over some trait.</p>
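
<p>The trait in question might look something like this (the names here are made up for illustration, not Wild’s actual API):</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Hypothetical trait abstracting over where the symbol-name map lives, so
/// that lookup code doesn't care whether it's in memory or in a state file.
trait SymbolNameLookup {
    fn symbol_id(&amp;self, name: &amp;[u8]) -&gt; Option&lt;SymbolId&gt;;
}

#[derive(Clone, Copy)]
struct SymbolId(u32);

/// The non-incremental case: a hashmap whose keys borrow from the inputs.
struct InMemoryNames&lt;'data&gt; {
    map: std::collections::HashMap&lt;&amp;'data [u8], SymbolId&gt;,
}

impl SymbolNameLookup for InMemoryNames&lt;'_&gt; {
    fn symbol_id(&amp;self, name: &amp;[u8]) -&gt; Option&lt;SymbolId&gt; {
        self.map.get(name).copied()
    }
}

// An on-disk implementation (e.g. backed by `sled` or an odht-style table)
// would implement the same trait, copying or hashing keys as needed.
</code></pre></div></div>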

<h2 id="symbol-resolution-table">Symbol resolution table</h2>

<p>We currently store a table in which we can look up the address, GOT address, PLT address etc of each
symbol. This is stored in memory as a <code class="language-plaintext highlighter-rouge">Vec</code>. This can probably just be changed to an mmapped file
instead. Accessing this table should be pretty similar whether it’s backed by a Vec or by a file.
Initialising it will be different and a bit slower for file-based storage, since we’d need to
zero-initialise the file when we create it before we could mmap it, whereas currently we don’t
zero-initialise and just write the resolutions concurrently from multiple threads. We could
experiment with not using mmap for the initial write of this file, i.e. just create the Vec, then
write the bytes of the Vec to a file.</p>

<h2 id="relocation-reverse-index">Relocation reverse index</h2>

<p>When a symbol gets moved, e.g. because it’s in a section that got updated and the new version of
that section didn’t fit in the old spot, we need to update all references to that symbol to point to
the new location.</p>

<p>In order to do this efficiently, we need to store all relocations indexed by the symbol to which
they refer. Doing this efficiently from multiple threads without ending up with non-deterministic
results is somewhat tricky. Certainly creating a Vec for each symbol to hold all the references to
that symbol would likely be too expensive.</p>

<p>My current plan is, for each symbol, to store the index of the first relocation that references that
symbol. Then for each relocation, store the index of the next relocation for the same symbol. This
would mean that the list of relocations for a symbol is stored effectively as an index-based linked
list within the list of relocations.</p>

<p>This approach to storage can be done with 2 or 3 flat files that we can treat as mutable slices of
their respective data types. We can build these in-place from multiple threads provided we use
atomic compare-exchange operations to update the list heads.</p>
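
<p>A sketch of that layout, assuming two flat arrays (one head per symbol, one next-pointer per relocation); indexes are stored offset by one so that zero can act as the end-of-list marker:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::sync::atomic::{AtomicU32, Ordering};

/// Sketch of the index-based linked list described above.
struct RelocationIndex {
    /// For each symbol, the (offset-by-one) index of the first relocation
    /// that references it. Zero means "no relocations yet".
    head_by_symbol: Vec&lt;AtomicU32&gt;,
    /// For each relocation, the (offset-by-one) index of the next
    /// relocation that references the same symbol.
    next: Vec&lt;AtomicU32&gt;,
}

impl RelocationIndex {
    /// Records that relocation `reloc` references `symbol`. Can be called
    /// from multiple threads; each per-symbol list ends up complete, though
    /// its order depends on thread timing.
    fn add(&amp;self, symbol: usize, reloc: u32) {
        let head = &amp;self.head_by_symbol[symbol];
        let mut current = head.load(Ordering::Relaxed);
        loop {
            // Point our relocation at whatever is currently first...
            self.next[reloc as usize].store(current, Ordering::Relaxed);
            // ...then try to make our relocation the new head.
            match head.compare_exchange_weak(
                current,
                reloc + 1,
                Ordering::Release,
                Ordering::Relaxed,
            ) {
                Ok(_) =&gt; return,
                Err(actual) =&gt; current = actual,
            }
        }
    }

    /// Iterates over the relocations that reference `symbol`.
    fn relocations_for(&amp;self, symbol: usize) -&gt; impl Iterator&lt;Item = u32&gt; + '_ {
        let mut current = self.head_by_symbol[symbol].load(Ordering::Acquire);
        std::iter::from_fn(move || {
            let index = current.checked_sub(1)?;
            current = self.next[index as usize].load(Ordering::Relaxed);
            Some(index)
        })
    }
}
</code></pre></div></div>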

<h2 id="dynamic-relocation-reverse-index">Dynamic relocation reverse index</h2>

<p>Input sections may contain relocations that refer to symbols provided by shared objects. Such
relocations cannot be resolved at link time and must instead be resolved at runtime. This is done by
emitting dynamic relocations. Executable code will generally not make direct use of such
relocations, but instead use the global offset table (GOT) which will then have the dynamic
relocation. Data sections however often contain vtables which will need dynamic relocations. If such
a data section gets removed or updated, then we need to make sure we remove or update any dynamic
relocations associated with the old version of that section.</p>

<p>Wild doesn’t currently have a concept of a global section ID. All storage of information about
sections is currently on a per-input-file basis. This is inconvenient for the purposes of storing a
reverse index for dynamic relocations, so probably, I’ll introduce a global input section ID, then
store a table from input section ID to the first dynamic relocation for that input section. All the
dynamic relocations for a section should be adjacent, so we can also just store the count of
relocations.</p>
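
<p>A minimal sketch of what that table might look like (names are hypothetical); the same start-plus-count representation also works for the exception frame index described below:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/// Hypothetical global identifier for an input section.
#[derive(Clone, Copy)]
struct InputSectionId(u32);

/// Where a section's dynamic relocations live. Because all of a section's
/// dynamic relocations are adjacent in the output, a start index plus a
/// count is enough to find (and later remove or update) them.
#[derive(Clone, Copy, Default)]
struct DynamicRelocRange {
    first: u32,
    count: u32,
}

/// Indexed by `InputSectionId`.
struct DynamicRelocIndex {
    ranges: Vec&lt;DynamicRelocRange&gt;,
}

impl DynamicRelocIndex {
    fn range(&amp;self, section: InputSectionId) -&gt; DynamicRelocRange {
        self.ranges[section.0 as usize]
    }
}
</code></pre></div></div>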

<h2 id="exception-frames">Exception frames</h2>

<p>Exception frames are needed in order for backtraces and panics to work. Information about all
executable sections is put in the <code class="language-plaintext highlighter-rouge">.eh_frame</code> section. The linker splits this input section up by
locating the individual frame description entries (FDEs) then recombining them into the output
<code class="language-plaintext highlighter-rouge">.eh_frame</code> section. The linker also needs to write a <code class="language-plaintext highlighter-rouge">.eh_frame_hdr</code> section which is a sorted
index of frame addresses and is used at runtime to do a binary search in order to locate the frame
information for a particular address.</p>

<p>When an executable section is updated or removed, we need to update or remove the corresponding
FDEs. Any change to the FDEs will require a corresponding change to the sorted <code class="language-plaintext highlighter-rouge">.eh_frame_hdr</code>
section.</p>

<p>Similar to dynamic relocations, all FDEs for an input section will be adjacent in the output file,
so a start index and a count should be sufficient to identify which FDEs belong to a particular
input section.</p>

<h2 id="string-merge-index">String merge index</h2>

<p>Input sections that have the M (merge) and S (string) bits set are string-merge sections. At link
time, we locate each string by looking for its null terminator, then deduplicate the string with
other strings that are destined for the same output section.</p>

<p>When we incrementally link, some of the strings in a string-merge section may have changed. Even if
none have changed, we still need a way to look up the address for a particular string. This means
that we need to persist, for each output string-merge section, an index of where each string is
located. In some ways this is similar to our symbol-name to symbol-ID map. As with that map, we’ll
initially use a third-party on-disk database like <code class="language-plaintext highlighter-rouge">sled</code>, then later look at more optimised options
to avoid copying the actual strings.</p>

<h1 id="logging">Logging</h1>

<p>A log of links will by default be written to the user’s <a href="https://docs.rs/directories/5.0.1/directories/struct.ProjectDirs.html#method.state_dir">state
directory</a>.
This will be able to be displayed by running <code class="language-plaintext highlighter-rouge">wild log</code> and will show a line per linker invocation
with information about whether we did a full link or an incremental link and if we did a full link,
the reason why. The intention here is to provide a way for a user to be able to diagnose why
incremental linking isn’t behaving as expected.</p>

<h1 id="algorithm">Algorithm</h1>

<p>Once incremental linking is implemented, the linker will have three modes of operation.</p>

<ul>
  <li>Non-incremental. In this mode, it’ll behave much like it does now.</li>
  <li>Initial-incremental. It’ll link from scratch but prepare for subsequent incremental linking.
Output sections will have additional space allocated so that they can grow and various state files
will be written.</li>
  <li>Incremental-update. Update the output file by making minimal changes and leaving the rest in
place. Will also need to update the state files to reflect changes that were made.</li>
</ul>

<p>The following is a rough outline of the proposed algorithm for an incremental-update. If any stage
fails, then it’ll fall back to doing initial-incremental.</p>

<ul>
  <li>Check changes in flags.</li>
  <li>Check if a previous attempt to incrementally link was interrupted or didn’t complete for some reason.</li>
  <li>Identify changed files.</li>
  <li>Diff changed files to produce section update list.</li>
  <li>Determine how much additional space needs to be used in each output section. This includes
generated sections such as the global offset table (GOT), dynamic relocations etc.</li>
  <li>Allocate addresses for each changed / added section. A section that has run out of space will
result in failure (fallback to initial-incremental), however this may be relaxed in future for
cases where we can safely create an additional section of the same type.</li>
  <li>Update symbol resolutions and record which symbols have changed their resolution.</li>
  <li>Write updated / added sections to the output file.</li>
  <li>Rewrite relocations for symbols with changed resolutions.</li>
  <li>Add / remove / update dynamic relocations</li>
  <li>Add / remove / update exception frame information</li>
  <li>Update .eh_frame_hdr by performing insertions and removals corresponding to FDEs that we added /
removed. We can either do this by sorting the list of additions / removals, then doing a single
pass over .eh_frame_hdr to merge in the added / removed index entries, or we could rebuild and
resort the entire index.</li>
  <li>Update other state files.</li>
</ul>

<h1 id="sections-that-cant-contain-gaps">Sections that can’t contain gaps</h1>

<p>Sections containing code or data are generally fine to have gaps within them. However there are some
sections that cannot contain gaps or where if there are gaps, they need special handling. For
example <code class="language-plaintext highlighter-rouge">.init_array</code> is a list of pointers to initialiser functions that get run on startup. An
uninitialised element of this array would lead to undefined behaviour (likely a crash). For a
section like this that we know contains function pointers, we could fill gaps with pointers to a
no-op function. However, custom sections can also be declared where the linker generates symbols
that point to the start and end of the section. For such custom sections, we don’t have any
reasonable filler value to put in gaps.</p>

<p>Where gaps would be left, we can probably relocate input sections from the end of the output section
to fill the gaps. This should work provided all the input sections are the same size - generally the
case when these sections actually just contain pointers. If we have input sections with different
sizes, then we might need to rewrite the whole output section, although initially, we’ll probably
just fail the incremental update and fall back to a full initial-link.</p>

<h1 id="testing">Testing</h1>

<p>Most of Wild’s tests are small programs written in C, assembly, Rust etc. These programs get
compiled then linked with both GNU ld and Wild. They then get executed to make sure they produce the
expected result. We also compare the outputs using linker-diff (part of the Wild repository) which
helps by making it more obvious what we’re getting wrong and also picks up some kinds of bugs that
just executing our test binaries might not detect.</p>

<p>In order to test incremental linking, we can extend this system by compiling multiple versions of
each input file. For C code, we could predefine some macro, for example <code class="language-plaintext highlighter-rouge">-D WILD_INC=1</code> that the
code can then use to switch between different definitions of some function or data.</p>
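
<p>For Rust test programs, an analogous switch could be a <code class="language-plaintext highlighter-rouge">--cfg</code> flag; a sketch of what such a two-version test input might look like:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Compiled once without the cfg and once with `--cfg wild_inc` to produce
// the "before" and "after" inputs for an incremental-link test.
#[cfg(not(wild_inc))]
fn value() -&gt; u32 {
    1
}

#[cfg(wild_inc)]
fn value() -&gt; u32 {
    2
}

fn main() {
    // The test harness checks that the incrementally relinked binary
    // reports the updated value.
    println!("{}", value());
}
</code></pre></div></div>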

<p>In addition to diffing the resulting binaries against the output of GNU ld, we can also diff the
incrementally linked output from Wild against a non-incremental output of Wild for the same inputs.</p>

<h1 id="feedback">Feedback</h1>

<p>Hopefully most, or at least some of that made sense. If you have any thoughts or questions, please
do reach out. My contact details can be found on my <a href="/about">about page</a> or you can comment on the
Reddit thread that I’ll link below.</p>

<h1 id="thanks">Thanks</h1>

<p>Thanks to my <a href="https://github.com/sponsors/davidlattimore">github sponsors</a>. Your contributions help
to make it possible for me to continue to work on this kind of stuff rather than going and getting a
“real job”.</p>

<ul>
  <li>bearcove</li>
  <li>repi</li>
  <li>marxin</li>
  <li>bes</li>
  <li>Urgau</li>
  <li>jonhoo</li>
  <li>Kobzol</li>
  <li>pmarks</li>
  <li>coastalwhite</li>
  <li>mstange</li>
  <li>twilco</li>
  <li>binarybana</li>
  <li>willstott101</li>
  <li>bcmyers</li>
  <li>Shnatsel</li>
  <li>Rafferty97</li>
  <li>joshtriplett</li>
  <li>teburd</li>
  <li>wezm</li>
  <li>davidcornu</li>
  <li>tommythorn</li>
  <li>flba-eb</li>
  <li>acshi</li>
  <li>gendx</li>
  <li>teh</li>
  <li>nazar-pc</li>
  <li>yerke</li>
  <li>drmason13</li>
  <li>NobodyXu</li>
  <li>jplatte</li>
  <li>ymgyt</li>
  <li>Pratyush</li>
  <li>+2 anonymous</li>
</ul>

<h1 id="discussion-threads">Discussion threads</h1>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1gvdref/designing_wilds_incremental_linking/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[Designing Wild’s incremental linking]]></summary></entry><entry><title type="html">Video: Wild linker talk at GOSIM China 2024</title><link href="https://davidlattimore.github.io/posts/2024/11/12/gosim-2024-wild-linker-talk.html" rel="alternate" type="text/html" title="Video: Wild linker talk at GOSIM China 2024" /><published>2024-11-12T13:00:00+00:00</published><updated>2024-11-12T13:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/11/12/gosim-2024-wild-linker-talk</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/11/12/gosim-2024-wild-linker-talk.html"><![CDATA[<p>In October, I attended the open source conference, GOSIM 2024 in China where I gave a talk about the
Wild linker.</p>

<p><a href="https://www.youtube.com/watch?v=XFSwmSXv2QA">Video</a></p>

<p><a href="https://www.reddit.com/r/rust/comments/1gq0x3t/video_of_wild_linker_talk_at_gosim_2024/">Discussion on
Reddit</a></p>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[In October, I attended the open source conference, GOSIM 2024 in China where I gave a talk about the Wild linker.]]></summary></entry><entry><title type="html">Rust dylib rabbit holes</title><link href="https://davidlattimore.github.io/posts/2024/08/27/rust-dylib-rabbit-holes.html" rel="alternate" type="text/html" title="Rust dylib rabbit holes" /><published>2024-08-27T14:00:00+00:00</published><updated>2024-08-27T14:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/08/27/rust-dylib-rabbit-holes</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/08/27/rust-dylib-rabbit-holes.html"><![CDATA[<p>Bevy is a popular game engine for Rust. It’s pretty large and compilation times can be an issue. To
help with this, Bevy provides an optional feature that when enabled, compiles most of Bevy as a
dynamic library. This allows for faster iteration as you don’t need to relink all the Bevy internals
each time you rebuild.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo run <span class="nt">--features</span> bevy/dynamic_linking
</code></pre></div></div>

<p>I was experimenting with this from the perspective of testing and profiling the linker I’m writing,
Wild (see <a href="https://davidlattimore.github.io/">previous posts</a>).</p>

<p>With that in mind, I was mostly looking at (a) how long it takes to link and (b) how well the
resulting .so file works.</p>

<p>Initially, I was only looking at debug builds. To speed up the build, I turned off debug info.</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[profile.dev]</span>
<span class="py">debug</span> <span class="p">=</span> <span class="kc">false</span>
</code></pre></div></div>

<p>So this was perhaps more accurately described as a non-optimised build. Having optimisations off
should make the build faster right? Probably it does, but it doesn’t necessarily make linking
faster. Here’s the times for linking this shared object:</p>

<table>
  <thead>
    <tr>
      <th>Linker</th>
      <th>Time (ms)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>lld (18)</td>
      <td>1975</td>
    </tr>
    <tr>
      <td>mold (2.32.1)</td>
      <td>1763</td>
    </tr>
    <tr>
      <td>wild</td>
      <td>895</td>
    </tr>
  </tbody>
</table>

<p>I’ll not include GNU ld because it takes more than 10 seconds, making it painful to benchmark.</p>

<p>If we now set <code class="language-plaintext highlighter-rouge">opt-level = 2</code>, then the link time drops quite dramatically:</p>

<table>
  <thead>
    <tr>
      <th>Linker</th>
      <th>Time (ms)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>lld (18)</td>
      <td>545</td>
    </tr>
    <tr>
      <td>mold (2.32.1)</td>
      <td>287</td>
    </tr>
    <tr>
      <td>wild</td>
      <td>183</td>
    </tr>
  </tbody>
</table>

<p>I sometimes wonder if Rust (or more accurately Cargo) needs a third default profile “fastbuild” that
doesn’t have debug info and is optimised for building fast. I’m sure there are a bunch of tradeoffs
between compilation speed and debuggability that currently favour the latter. I bet there are
optimisations that, if applied, would speed up the build, especially a warm build, but which are
disabled in debug builds because they might make it harder to use a debugger on the code.</p>

<p>But what really drew my attention with the non-optimised build was what it’s getting the linker to
do. We’re creating a shared object (.so file on Linux). Rustc gives instructions to the linker to
tell it which symbols need to be exported. If a symbol is exported from the shared object, then an
executable or another shared object that depends on our shared object can make use of those symbols.
If a symbol isn’t exported then it cannot be directly referenced from outside the shared object.</p>

<p>In order to control which symbols get exported from the shared object, the linker is passed a
version script which specifies which symbols should be global and then downgrades the rest to local.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  global:
    rust_metadata_bevy_dylib_2f311168f6c5d4f8;
    _ZN9hashbrown3set24HashSet$LT$T$C$S$C$A$GT$6insert17hcb8b576667efe889E;
    _ZN9hashbrown3set24HashSet$LT$T$C$S$C$A$GT$6remove17h53654c4e42de8b15E;
....

  local:
    *;
};
</code></pre></div></div>

<p>For a non-optimised build, this version script lists more than 300k symbols to export! Contrast this
with the optimised build, where it lists only 18k symbols. Looking into this a bit, the majority of
the extra symbols happen because non-optimised builds enable <code class="language-plaintext highlighter-rouge">-Z share-generics</code> by default. These
shared generics not only get exported from the crates that monomorphise them, they also get exported
from the dylib. The remainder of the extra symbols look to be functions that would have been inlined
in an optimised build. It seems somewhat surprising that a public function would be exported from
a dylib only if it didn’t get inlined.</p>

<p>But let us for the moment assume that we actually for some reason need all 300k symbols to be
exported.</p>

<p>When a dynamically linked executable or a shared object gets loaded, the runtime can look up symbols
that are provided by other shared objects. On Linux, symbol lookups can either be eager, meaning
that they happen when the binary is loaded, or lazy, meaning that the symbol is only looked up when
the function is first called. For security reasons, lazy binding is less popular these days and Rust
indeed sets linker flags to bind symbols at load time.</p>

<p>For shared objects produced by rustc, most of these non-lazy symbol lookups are done with <code class="language-plaintext highlighter-rouge">GLOB_DAT</code>
relocations. These relocations are instructions to the runtime to put the address of a symbol with a
particular name at a particular location in memory. For example, the following relocation says to
look up the symbol <code class="language-plaintext highlighter-rouge">__rust_alloc</code>, then put the address of that symbol at address <code class="language-plaintext highlighter-rouge">0x93ec698</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000093ec698  0000000700000006 R_X86_64_GLOB_DAT      0000000000000000 __rust_alloc + 0
</code></pre></div></div>

<p>If we check how many <code class="language-plaintext highlighter-rouge">GLOB_DAT</code> relocations are in our bevy shared object, we get a bit of a
surprise.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>readelf <span class="nt">-W</span> <span class="nt">-r</span> libbevy_dylib.so | <span class="nb">grep </span>GLOB_DAT | <span class="nb">wc</span> <span class="nt">-l</span>
291185
</code></pre></div></div>

<p>But <code class="language-plaintext highlighter-rouge">GLOB_DAT</code> is for resolving references to symbols that the shared object depends on, so why is
the number of outgoing references so similar to the number of symbols that the shared object
exports?</p>

<p>Indeed, it turns out that this isn’t a coincidence. The majority of the symbols with <code class="language-plaintext highlighter-rouge">GLOB_DAT</code>
relocations are for symbols that are defined by the dylib itself.</p>

<p>But why would the dylib request runtime resolution of a symbol that it itself defines? Dynamic
linking on Linux allows symbols defined by shared objects to be overridden (also known as
“interposing”). One use-case for this is to override the allocator provided by libc in order to
perform runtime checks.</p>

<p>But we don’t really want to be able to override all these symbols, we just want them to be exported
so that they can be used by our binary that uses the shared object. When the compiler builds an
object file on Linux, symbols can be local or global. Locals are only accessible within that codegen
unit, while globals can be referenced from other codegen units. Global symbols can then be further
restricted by setting their visibility, which affects how they’ll be treated when dynamic linking.</p>

<table>
  <thead>
    <tr>
      <th>Binding</th>
      <th>Visibility</th>
      <th>Accessible from other codegen units?</th>
      <th>Accessible from other dynamic objects?</th>
      <th>Can be overridden?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Local</td>
      <td> </td>
      <td>No</td>
      <td>No</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Global</td>
      <td>Hidden</td>
      <td>Yes</td>
      <td>No</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Global</td>
      <td>Protected</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Global</td>
      <td>Default</td>
      <td>Yes</td>
      <td>Yes</td>
      <td>Yes</td>
    </tr>
  </tbody>
</table>

<p>The key difference here is between default visibility and protected visibility. The latter means
that the symbol cannot be interposed (overridden). A default visibility symbol however can be
interposed, which means that if another shared object earlier in the load order, or the executable
itself defines a symbol with the same name, that will take precedence.</p>

<p>OK, so we just need to set all our symbols to protected. That way they’ll be exported from the
shared object, but won’t be permitted to be overridden.</p>

<p>I found the code in rustc that sets symbol visibility and prototyped changing it to set symbols to
have protected visibility unless the symbol was marked as <code class="language-plaintext highlighter-rouge">#[no_mangle]</code>. This worked and
drastically reduced the number of <code class="language-plaintext highlighter-rouge">GLOB_DAT</code> relocations. To test how much of a difference this
makes, I tried loading shared objects with and without this change.</p>

<ul>
  <li>Default visibility: Shared object took about 150ms to load.</li>
  <li>Protected visibility: Shared object took about 5ms to load.</li>
</ul>

<p>OK, that’s great. At that point I thought I should look for existing issues related to this and
indeed found one. The creator of the cranelift backend for rustc, bjorn3, had also attempted to
change symbols to use protected visibility, but had hit issues when linking with GNU ld.</p>

<p>GNU ld complains that direct references to protected symbols cannot be used when building a shared
object. I tried GNU ld and got the same problem.</p>

<p>But let’s think about this for a moment: why can’t a shared object have direct references to a
protected symbol? It cannot be overridden, so it should be fine to reference it directly. Right?</p>

<p>To understand what GNU ld’s objection is here, we need to look at how GCC compiles C code. We’ll
start by looking at what it does with C code that references data.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kt">int</span> <span class="n">my_value</span><span class="p">;</span> <span class="c1">// Likely from a header file</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">my_value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>First, let’s look at what the clang compiler does with this.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   <span class="err">0:</span>	<span class="err">48</span> <span class="err">8</span><span class="nf">b</span> <span class="mi">05</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> 	<span class="nv">mov</span>    <span class="mh">0x0</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">),</span><span class="o">%</span><span class="nb">rax</span>        <span class="err">#</span> <span class="mi">7</span> <span class="o">&lt;</span><span class="nv">main</span><span class="o">+</span><span class="mh">0x7</span><span class="o">&gt;</span>
			<span class="err">3:</span> <span class="nf">R_X86_64_REX_GOTPCRELX</span>	<span class="nv">my_value</span><span class="o">-</span><span class="mh">0x4</span>
   <span class="err">7:</span>	<span class="err">8</span><span class="nf">b</span> <span class="mi">00</span>                	<span class="nv">mov</span>    <span class="p">(</span><span class="o">%</span><span class="nb">rax</span><span class="p">),</span><span class="o">%</span><span class="nb">eax</span>
</code></pre></div></div>

<p>The first instruction is reading a pointer to our variable <code class="language-plaintext highlighter-rouge">my_value</code> from the GOT (global offset
table). The GOT is a table of pointers. These pointers are generally populated by the runtime at
startup to point to functions and variables that come from different shared objects.</p>

<p>The second instruction then loads the value from that pointer. This instruction sequence will work
fine even if the variable <code class="language-plaintext highlighter-rouge">my_value</code> ends up coming from a shared object.</p>

<p>If the variable ends up being statically linked into our binary, then the linker will transform this
assembly to:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="err">1130:</span>       <span class="err">48</span> <span class="err">8</span><span class="nf">d</span> <span class="mi">05</span> <span class="nv">f1</span> <span class="mi">2</span><span class="nv">e</span> <span class="mi">00</span> <span class="mi">00</span>    <span class="nv">lea</span>    <span class="mh">0x2ef1</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">),</span><span class="o">%</span><span class="nb">rax</span>        <span class="err">#</span> <span class="mi">4028</span> <span class="o">&lt;</span><span class="nv">my_value</span><span class="o">&gt;</span>
    <span class="err">1137:</span>       <span class="err">8</span><span class="nf">b</span> <span class="mi">00</span>                   <span class="nv">mov</span>    <span class="p">(</span><span class="o">%</span><span class="nb">rax</span><span class="p">),</span><span class="o">%</span><span class="nb">eax</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">lea</code> instruction here is loading the relative address of our variable, which is now known at
link time. That means that there’s no access to the global offset table.</p>

<p>Now, let’s look at what GCC does:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   <span class="err">4:</span>	<span class="err">8</span><span class="nf">b</span> <span class="mi">05</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span>    	<span class="nv">mov</span>    <span class="mh">0x0</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">),</span><span class="o">%</span><span class="nb">eax</span>        <span class="err">#</span> <span class="nv">a</span> <span class="o">&lt;</span><span class="nv">main</span><span class="o">+</span><span class="mh">0xa</span><span class="o">&gt;</span>
			<span class="err">6:</span> <span class="nf">R_X86_64_PC32</span>	<span class="nv">my_value</span><span class="o">-</span><span class="mh">0x4</span>
</code></pre></div></div>

<p>It’s using a PC32 relocation to access the variable <code class="language-plaintext highlighter-rouge">my_value</code>. This is a direct reference, which
will only work if the address of the variable is known at link time. i.e. this won’t (or shouldn’t
IMO) work if <code class="language-plaintext highlighter-rouge">my_value</code> comes from a shared object. If we add the flag <code class="language-plaintext highlighter-rouge">-fPIC</code> to gcc, then it
produces the same code as clang.</p>

<p>So we have a trade-off. The code that directly accesses a variable that gets statically linked into our
executable is shorter and presumably more efficient, but doesn’t work if the variable ends up
coming from a shared object. The code that does work for accessing a variable from a shared object
is slightly longer and a bit less efficient. With the linker optimising away the access to
the global offset table, the efficiency difference is pretty small, but the code remains longer than
the direct-access version.</p>

<p>I said that the direct access approach doesn’t work if the variable ends up coming from a shared
object. Unfortunately that’s not entirely true. Linkers apply a horrible hack called
copy-relocations in order to make it work. When they encounter a direct access to a variable that’s
defined by a shared object, they allocate space for that variable in BSS (a zero-initialised section
that doesn’t take up space in the file on disk), then at runtime the bytes of the variable get
copied from the shared object that defined it into that space. That copy then overrides the
definition provided by the shared object.</p>

<p><img src="/images/protected/copy-relocation.svg" alt="Diagram of a copy relocation" /></p>

<p>But what if the symbol definition in the shared object has protected visibility? That means it can’t
be overridden right? GCC chose to interpret “can’t be overridden” as “can only be overridden by a
copy relocation”.</p>

<p>For a shared object to work correctly when one of its symbols is overridden, there can’t be direct
references to the symbol within the shared object. Here we get to a point of incompatibility between
the GCC / GNU ld world and the LLVM / LLD world.</p>

<p>If we now look at the code that each compiler produces for putting into a shared object, we can see
the other side of this difference. Here’s our C code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span><span class="p">((</span><span class="n">visibility</span><span class="p">(</span><span class="s">"protected"</span><span class="p">)))</span>
<span class="kt">int</span> <span class="n">my_value</span> <span class="o">=</span> <span class="mi">42</span><span class="p">;</span>

<span class="kt">int</span> <span class="nf">get_my_value</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">return</span> <span class="n">my_value</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We tell both compilers that we might put this into a shared object by compiling with <code class="language-plaintext highlighter-rouge">-fPIC</code>.</p>

<p>GCC produces the following assembly for the variable access.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="err">19:</span>	<span class="err">48</span> <span class="err">8</span><span class="nf">b</span> <span class="mi">05</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> 	<span class="nv">mov</span>    <span class="mh">0x0</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">),</span><span class="o">%</span><span class="nb">rax</span>        <span class="err">#</span> <span class="mi">20</span> <span class="o">&lt;</span><span class="nv">get_my_value</span><span class="o">+</span><span class="mh">0xf</span><span class="o">&gt;</span>
			<span class="err">1</span><span class="nl">c:</span> <span class="nf">R_X86_64_REX_GOTPCRELX</span>	<span class="nv">my_value</span><span class="o">-</span><span class="mh">0x4</span>
  <span class="err">20:</span>	<span class="err">8</span><span class="nf">b</span> <span class="mi">00</span>                	<span class="nv">mov</span>    <span class="p">(</span><span class="o">%</span><span class="nb">rax</span><span class="p">),</span><span class="o">%</span><span class="nb">eax</span>
</code></pre></div></div>

<p>That is, even though the variable is protected, the generated code still accesses it via the GOT.</p>

<p>Clang however produces a more efficient direct access to the variable.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="err">14:</span>	<span class="err">8</span><span class="nf">b</span> <span class="mi">05</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span> <span class="mi">00</span>    	<span class="nv">mov</span>    <span class="mh">0x0</span><span class="p">(</span><span class="o">%</span><span class="nv">rip</span><span class="p">),</span><span class="o">%</span><span class="nb">eax</span>        <span class="err">#</span> <span class="mi">1</span><span class="nv">a</span> <span class="o">&lt;</span><span class="nv">get_my_value</span><span class="o">+</span><span class="mh">0xa</span><span class="o">&gt;</span>
			<span class="err">16:</span> <span class="nf">R_X86_64_PC32</span>	<span class="nv">my_value</span><span class="o">-</span><span class="mh">0x4</span>
</code></pre></div></div>

<p>So when building an executable, GCC ends up directly referencing all symbols, even those that might
be protected symbols from a shared object. In order to make that work, it then uses indirect
references when building shared objects.</p>

<p>Clang does the opposite, using indirect references when building an executable, but then allows
direct references to protected symbols when building a shared object.</p>

<p>Mixing these two different and incompatible models of when it’s OK to directly reference something
can lead to problems. If your shared object is built by LLVM with direct access to protected
variables, and your main binary is built by GCC with direct access to all variables, we end up with
two separate copies of our variable. If the variable is mutable, then a change made in the main
binary won’t be seen by the shared object and vice versa.</p>

<p>In order to protect against this, GNU ld detects direct access to protected variables and refuses to
link the shared object. But the shared object would have worked fine so long as it was only used by
a binary compiled with LLVM (Clang).</p>

<p>This can be seen if we try to compile a shared object with Clang and link it with GNU ld:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clang <span class="nt">-fPIC</span> <span class="nt">-shared</span> b.c <span class="nt">-o</span> libb.so
/usr/bin/ld: /tmp/b-09dfbd.o: relocation R_X86_64_PC32 against protected symbol <span class="sb">`</span>my_value<span class="sb">`</span> can not be used when making a shared object
/usr/bin/ld: final <span class="nb">link </span>failed: bad value
</code></pre></div></div>

<p>The examples so far used protected symbols that were data, not functions; however, the same problem
occurs with functions. The only real difference is that the linker won’t do a copy relocation for a
function. Instead, it synthesises a PLT entry (a small bit of machine code that jumps to the actual
function) and uses that to override the function definition in the shared object.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__attribute__</span><span class="p">((</span><span class="n">visibility</span><span class="p">(</span><span class="s">"protected"</span><span class="p">)))</span>
<span class="kt">int</span> <span class="nf">f1</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">return</span> <span class="mi">42</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">typedef</span> <span class="nf">int</span> <span class="p">(</span><span class="o">*</span><span class="n">int_fn_t</span><span class="p">)(</span><span class="kt">void</span><span class="p">);</span>

<span class="n">int_fn_t</span> <span class="nf">get_f1_ptr2</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">return</span> <span class="o">&amp;</span><span class="n">f1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Compiling this code with clang causes a link failure with GNU ld:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clang <span class="nt">-shared</span> <span class="nt">-fPIC</span> x.c
/usr/bin/ld: /tmp/x-f06305.o: relocation R_X86_64_PC32 against protected symbol <span class="sb">`</span>f1<span class="sb">`</span> can not be used when making a shared object
/usr/bin/ld: final <span class="nb">link </span>failed: bad value
</code></pre></div></div>

<p>This might seem like it’s just a trade-off between optimising code in the executable (GCC) or
optimising code in the shared object (LLD), in which case we should presumably choose to optimise the
executable, since for many uses that’s where the bulk of the code lives. However, this choice relies
on copy relocations, which are, in my opinion, a hack. Like many hacks, they have a number of
downsides.</p>

<ul>
  <li>They make the size of a variable part of its ABI. i.e. a shared object that defines a symbol now
cannot change the size of that symbol without breaking the ABI.</li>
  <li>They require that the variable gets copied into writable memory. If a shared library embeds a
large bit of data, say a 100MiB machine learning model, and a copy relocation occurs, then at
startup, that 100MiB will need to be copied. Furthermore, if there are several copies of the
binary running, we’re now going to have several independent copies of that 100MiB in RAM, whereas
without a copy relocation, that 100MiB could be shared read-only between all the running
processes (see the sketch after this list).</li>
</ul>
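
<p>To make the second of those downsides concrete, here’s a minimal sketch in Rust. The crate and
symbol names are made up for illustration; the same issue applies to any language that exports a
large variable from a shared library.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical `cdylib` crate that exports a large blob of data.
// If an executable built in the GCC / GNU ld style references this symbol
// directly, the linker emits a copy relocation: space for the whole array is
// reserved in the executable's BSS and the bytes are copied there at startup,
// rather than being shared read-only between all processes that use the
// library.
#[no_mangle]
pub static MODEL_WEIGHTS: [u8; 100 * 1024 * 1024] = [1; 100 * 1024 * 1024];
</code></pre></div></div>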

<p>The Rust compiler, by default, uses LLVM to perform codegen. So when we change rustc to emit all
Rust-mangled symbols with protected visibility, LLVM does the same as Clang above and emits direct
relocations to those symbols. This is fine provided we stay in the LLVM / LLD world; however, if we
try to link using GNU ld, the link gets rejected because it doesn’t fit GNU’s model of relying on copy
relocations for shared-object variable access from the main binary.</p>

<p>All of this came about because of GCC trying to simultaneously produce optimal code for executables
while not knowing at compile time whether a symbol might come from a shared object. On Windows, a
different path was taken. There, symbols that might come from a shared object (DLL on Windows) must
be annotated in the source code with <code class="language-plaintext highlighter-rouge">__declspec(dllimport)</code>. This allows the compiler to emit
optimised, direct-access instructions for all other symbols.</p>

<p>An alternative to annotating the source to indicate whether a symbol will come from a shared object
or be linked statically is to give the compiler access to the things we’re going to link against, so
that it can find where the definition comes from and make an appropriate decision. This would never
fly in the C world, where it’s expected that you can compile code with only access to the header
file, but in more modern languages like Rust, giving the compiler access to your dependencies so it
can make this kind of decision is a more realistic option. Rust doesn’t currently do this, but it
should be possible for Rust to always make the optimal choice between a direct or an indirect
reference because it has all the information it needs to make that decision. Thanks to Reddit user
u/Zoxc32 for the correction that Rust doesn’t currently do this.</p>

<p>Using default visibility for symbols in shared objects affects not only load time for those shared
objects (150ms vs 5ms), but it also likely affects runtime performance, since all those variables
now need to be accessed via the global offset table, which means an extra pointer hop to get to the
data. There’s a good chance it also prevents LLVM from making various optimisations, since by using
default visibility, we’re effectively telling it that any of these variables or functions might be
swapped out for alternative definitions at runtime.</p>

<h1 id="some-good-news">Some good news</h1>

<p>I do my development on a system that’s based on Ubuntu 22.04, which has binutils version 2.38. Only
after writing most of this blog post did I think to try checking the behaviour of more recent
versions of GNU ld. As it turns out, binutils 2.40 fixes this problem in GNU ld.</p>

<p>Linking shared objects that have direct references to protected symbols is no longer an error.
Kudos to LLD maintainer Maskray for making this change!</p>

<p>Instead, building an executable that would require a copy relocation for a protected symbol is now
an error.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/bin/ld: /tmp/cciOjHc4.o: copy relocation against non-copyable protected symbol `my_value' in libb.so
collect2: error: ld returned 1 exit status
</code></pre></div></div>

<p>The error is now reported where it should be - when trying to build a binary that uses a shared
object with protected symbols and the compiler emitted direct references to those symbols. The fix
for that error is to compile the executable with <code class="language-plaintext highlighter-rouge">-fPIC</code> or switch to clang.</p>

<p>GCC maintains its behaviour of emitting direct relocations to variables and functions unless you
compile with <code class="language-plaintext highlighter-rouge">-fPIC</code>, but that’s much less of a problem for Rust and other languages than the
previous GNU ld behaviour.</p>

<h1 id="where-to-from-here">Where to from here?</h1>

<p>The fix to GNU ld is in binutils 2.40, which is in Ubuntu version 23.04 and later. However systems
built on 22.04 will be around for a while, so I don’t think we can just switch to protected symbols
and cause link errors on those older systems.</p>

<p>Work has been done to use lld by default for linking on Linux. This is currently on nightly versions
of rustc. If we add a flag to enable emitting of protected symbols, then we could enable that flag
when lld is being used as the linker.</p>

<p>It’s reasonable to ask: might creating shared objects with protected symbols cause those shared
objects to be unusable from programs compiled with GCC? I believe the answer is no, since we’d only
be marking Rust-mangled symbols as protected and they shouldn’t be getting referenced from code
compiled by GCC.</p>
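
<p>As a rough illustration (the function names here are invented), only the first function below gets a
Rust-mangled symbol; the second is the kind of unmangled, C-ABI symbol that code compiled by GCC
would actually reference, and it would keep default visibility:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical cdylib crate. Under the proposed change, the mangled symbol
// for `rust_only_helper` would become protected, while the `#[no_mangle]`
// C-ABI symbol below keeps default visibility and so remains usable from
// code compiled by GCC.
pub fn rust_only_helper(x: u32) -&gt; u32 {
    x + 1
}

#[no_mangle]
pub extern "C" fn c_api_entry(x: u32) -&gt; u32 {
    rust_only_helper(x)
}
</code></pre></div></div>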

<h1 id="further-resources">Further resources</h1>

<ul>
  <li>LLD maintainer, Maskray has an excellent <a href="https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected">blog
post</a>
about this topic.</li>
  <li>Removal of problematic error from GNU ld. Not sure what to link to, but you can search for “x86:
Make protected symbols local for -shared”.</li>
  <li><a href="https://sourceware.org/bugzilla/show_bug.cgi?id=28875">Disallow invalid relocation against protected symbol</a></li>
  <li>Related rustc issues:
    <ul>
      <li><a href="https://github.com/rust-lang/rust/issues/105518">Use protected visibility by default on ELF platforms</a></li>
      <li><a href="https://github.com/rust-lang/rust/issues/37530">stop exporting every symbol</a></li>
      <li><a href="https://github.com/rust-lang/rust/issues/33221">linking staticlib files into shared libraries exports all of std::</a></li>
    </ul>
  </li>
</ul>

<h1 id="thanks">Thanks</h1>

<p>Thanks to my <a href="https://github.com/sponsors/davidlattimore">github sponsors</a>. Your contributions help
to make it possible for me to continue to work on this kind of stuff rather than going and getting a
“real job”.</p>

<ul>
  <li>bearcove</li>
  <li>repi</li>
  <li>marxin</li>
  <li>bes</li>
  <li>Urgau</li>
  <li>jonhoo</li>
  <li>Kobzol</li>
  <li>coastalwhite</li>
  <li>mstange</li>
  <li>bcmyers</li>
  <li>Shnatsel</li>
  <li>Rafferty97</li>
  <li>joshtriplett</li>
  <li>teburd</li>
  <li>wezm</li>
  <li>davidcornu</li>
  <li>tommythorn</li>
  <li>flba-eb</li>
  <li>acshi</li>
  <li>teh</li>
  <li>yerke</li>
  <li>alexkirsz</li>
  <li>NobodyXu</li>
  <li>jplatte</li>
  <li>ymgyt</li>
  <li>Pratyush</li>
  <li>ethanmsl</li>
  <li>+2 anonymous</li>
</ul>

<h1 id="discussion-threads">Discussion threads</h1>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1f2s7ot/rust_dylib_rabbit_holes/">Reddit</a></li>
  <li><a href="https://news.ycombinator.com/item?id=41375491">Hacker News</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[Bevy is a popular game engine for Rust. It’s pretty large and compilation times can be an issue. To help with this, Bevy provides an optional feature that when enabled, compiles most of Bevy as a dynamic library. This allows for faster iteration as you don’t need to relink all the Bevy internals each time you rebuild.]]></summary></entry><entry><title type="html">Testing a linker</title><link href="https://davidlattimore.github.io/posts/2024/07/17/testing-a-linker.html" rel="alternate" type="text/html" title="Testing a linker" /><published>2024-07-17T00:00:00+00:00</published><updated>2024-07-17T00:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/07/17/testing-a-linker</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/07/17/testing-a-linker.html"><![CDATA[<p>I’ve been writing a linker, called Wild (see <a href="https://davidlattimore.github.io/">previous posts</a>).
Today, I’m going to talk about my approach to testing the linker. I think this is an interesting
case study in its own right, but also there are aspects of the approach that can likely be applied to
other projects.</p>

<p>The properties that I like the tests for my projects to have are:</p>

<ul>
  <li>I want to feel confident that they will pick up bugs if I introduce them when refactoring.</li>
  <li>They should be fast to run.</li>
  <li>They should be easy to diagnose what’s wrong when they fail.</li>
  <li>They should be easy to maintain. When I refactor code, I should need to change tests as little as
possible, or maybe not at all.</li>
</ul>

<p>These priorities are sometimes in conflict with each other. For example merging several tests
together into a single test might make the test suite as a whole faster, but might also make
diagnosing what’s wrong harder. Whether I choose to split or merge integration tests depends on
circumstances. Sometimes splitting is the right approach, especially if there’s common work done by
each separate test that can be cached, thus regaining the speed. Often, however, I prefer to merge.
I’m more often running tests that pass than diagnosing tests that fail, so I’d prefer the speed.
Also, often with extra tooling, diagnosing what’s wrong can be made easier, even in a large
integration test that is doing many things.</p>

<p>Unit tests can be very fast; however, when you refactor your code and change an interface that is
unit tested, the test needs updating or even rewriting. Unit tests can also very easily miss bugs
when interfaces don’t change, but assumptions about which part of the code does what do change.</p>

<p>I’ve been on projects that have relied entirely on unit tests and even with a high percentage of the
code covered by those unit tests, in the absence of good integration tests, the system has felt
incredibly fragile.</p>

<p>For these reasons, I generally focus first on integration tests, then resort to unit testing to fill
in gaps where I don’t think the integration tests are sufficient or would be too slow to cover all
the cases. I then build tooling in and around the integration tests to make them easier to diagnose
and maintain.</p>

<p>To provide some specific examples, I’ll now go into how the integration tests for the Wild linker
work.</p>

<p>When I started writing Wild, the first integration tests I wrote were of the form:</p>

<ul>
  <li>Compile a small C program using GCC</li>
  <li>Link the program using GNU ld</li>
  <li>Link the program again using Wild</li>
  <li>Run the binaries produced by both linkers and make sure they both exit with the expected exit
code.</li>
</ul>

<p>Linking with GNU ld is important in order to ensure that the test itself is correct. We want the
program to behave the same when linked with both linkers.</p>

<p>Already here we can see some opportunity to speed up our test slightly with caching. Generally when
we rerun our test it’ll be because we made a change to the linker. However GCC and GNU ld are
unlikely to have changed. So if the C program and the argument we’re passing didn’t change, then we
can skip rerunning GCC and GNU ld. This can be a significant saving, since GNU ld is really slow -
it often takes 10 to 30 times as long as Wild to link the same program.</p>
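
<p>A minimal sketch of that caching idea follows. The file naming and the helper function are invented
for illustration: hash the inputs, and only re-run GCC and GNU ld when the fingerprint changes.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::Path;

// Returns true if the cached output for `source` is still valid, i.e. neither
// the source bytes nor the compiler arguments have changed since it was made.
fn is_cached(source: &amp;Path, args: &amp;[String], cached_output: &amp;Path) -&gt; std::io::Result&lt;bool&gt; {
    let mut hasher = DefaultHasher::new();
    fs::read(source)?.hash(&amp;mut hasher);
    args.hash(&amp;mut hasher);
    let fingerprint = hasher.finish().to_string();

    // The fingerprint from the previous run is stored next to the cached output.
    let stamp = cached_output.with_extension("fingerprint");
    let unchanged = fs::read_to_string(stamp).map(|s| s == fingerprint).unwrap_or(false);
    Ok(unchanged &amp;&amp; cached_output.exists())
}
</code></pre></div></div>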

<p>Integration tests in Rust are typically put in a separate <code class="language-plaintext highlighter-rouge">tests</code> directory. Cargo will compile each
file in this directory as a separate binary. So if you have lots of completely separate integration
tests, this can get slow. For that reason, I generally only ever have a single integration test file
and do all my integration testing from that one file. It’s fine however to have multiple tests in
that file.</p>

<p>The Wild integration test compiles many small C, assembly and Rust programs, links them and runs
them. I include instructions for the test runner inline in the test in the form of specially
formatted comments.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//#Object:exit.c</span>
<span class="c1">//#ExpectSym: _start .text</span>
<span class="c1">//#ExpectSym: exit_syscall .text</span>
<span class="c1">//#EnableLinker:lld</span>

<span class="err">#</span><span class="n">include</span> <span class="s">"exit.h"</span>

<span class="n">void</span> <span class="nf">_start</span><span class="p">(</span><span class="n">void</span><span class="p">)</span> <span class="p">{</span>
   <span class="nf">exit_syscall</span><span class="p">(</span><span class="mi">42</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the example here, the first line tells the test runner to compile exit.c as an object file and
include that in the link. Then there’s a couple of assertions to check that some symbols are in the
correct output sections. The last instruction tells the test runner to enable linking with lld. This
is in addition to GNU ld and Wild, which are always enabled for all tests.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//#AbstractConfig:default</span>
<span class="c1">//#DiffIgnore:section.tdata.alignment</span>

<span class="c1">//#Config:llvm-static:default</span>
<span class="c1">//#CompArgs:--target x86_64-unknown-linux-musl -C relocation-model=static -C target-feature=+crt-static -C debuginfo=2</span>

<span class="c1">//#Config:cranelift-static:default</span>
<span class="c1">//#CompArgs:-Zcodegen-backend=cranelift --target x86_64-unknown-linux-musl -C relocation-model=static -C target-feature=+crt-static -C debuginfo=2 --cfg cranelift</span>

<span class="c1">//#Config:llvm-dynamic:default</span>
<span class="c1">//#CompArgs:-C debuginfo=2</span>
<span class="c1">//#DiffIgnore:.dynamic.DT_JMPREL</span>
</code></pre></div></div>

<p>In this more complex example, we’ve defined an abstract config in which we provide some default
settings. Then we have several configurations that inherit from that config and override various
properties. Each config has a unique name that is used for naming output files and when reporting
test failures. This test has a configuration that statically links with musl libc, one that uses the
cranelift backend and one that dynamically links.</p>

<p>Early on when developing the linker, if a test failed, it was generally necessary to step through
running the program in a debugger. I would step through both the output from GNU ld and the output
from my linker and see where they would diverge. The replay debugger <code class="language-plaintext highlighter-rouge">rr</code> was great for this as it
lets you step backwards in addition to forwards. However even with awesome tools like <code class="language-plaintext highlighter-rouge">rr</code>, this was
still a slow process. Fortunately it’s something I rarely need to do anymore.</p>

<p>The reason for that is that I now make extensive use of diffing against the output of GNU ld using a
tool I created called linker-diff. The binaries produced by different linkers are not byte-for-byte
identical and I wouldn’t want to try to make them so. However there’s lots of things we can diff,
even if the layout of the file is different. e.g.:</p>

<ul>
  <li>Values of many of the header fields.
    <ul>
      <li>Even when the actual value of the header field is different, we can often interpret it in a way
that can make it the same. e.g. when we look at the header field that contains the entry point
for the program, the addresses will be different because the layout of the files is different;
however, if we look to see what symbol names point to those addresses, we’d expect them to be the
same (a rough sketch of this idea follows the list).</li>
    </ul>
  </li>
  <li>We can disassemble global functions and check that the instructions match.
    <ul>
      <li>This is complicated somewhat because the instructions will often contain relative offsets to
other functions, or absolute values that are expected to be different depending on how the
linker laid out the binary. Similar to what we did with the entry point in the header, we can
allow these instructions to match provided they point to a symbol with the same name.</li>
    </ul>
  </li>
</ul>
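
<p>The entry-point comparison mentioned above might look something like the sketch below. The data
structures here are invented for illustration; linker-diff’s real model is considerably richer.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Two entry points "match" if they resolve to the same symbol name in their
// respective files, even though the raw addresses will usually differ. Each
// symbol table is assumed to be a list of (start_address, name) pairs, sorted
// by address.
fn symbol_at(symbols: &amp;[(u64, String)], address: u64) -&gt; Option&lt;&amp;str&gt; {
    symbols
        .iter()
        .take_while(|(start, _)| *start &lt;= address)
        .last()
        .map(|(_, name)| name.as_str())
}

fn entry_points_match(
    symbols_a: &amp;[(u64, String)],
    entry_a: u64,
    symbols_b: &amp;[(u64, String)],
    entry_b: u64,
) -&gt; bool {
    symbol_at(symbols_a, entry_a) == symbol_at(symbols_b, entry_b)
}
</code></pre></div></div>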

<p>Diffing linker outputs is non-trivial. Like linkers themselves, there are lots of corner cases. It
can be challenging to avoid false positives, while still detecting actual differences that we care
about. There’s still more that can be improved in the diff support, but it has already proved
incredibly valuable in diagnosing problems.</p>

<p>linker-diff is integrated into the integration tests. This means that generally now if I’m changing
how something works and I accidentally break something, rather than a mysterious and opaque test
failure when the binary produces the wrong result, I get a diff report showing where I did something
different to GNU ld.</p>

<p>One complication that arises is when GNU ld does something suboptimal. I observed this
when GNU ld didn’t apply a particular optimisation if a symbol in our output binary was
referenced by a shared object that we were linking against. Trying to replicate GNU ld’s behaviour
here would have made our output binary link slower, run slower and would have added significant complexity to
our linker. Fortunately lld had better behaviour in this case. So what I ended up doing for my tests
was diffing Wild’s output against both the output of GNU ld and lld. For each thing we diff, e.g.
each instruction, header field etc, if Wild matches either GNU ld or lld’s output, then we accept it
as correct.</p>

<p>This is what typical output from linker-diff looks like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wild: /wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.wild.so
ld: /wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.ld.so
lld: /wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.lld.so
asm.get_weak_var
                  endbr64
                  push %rbp
                  mov %rsp,%rbp
  
  wild 0x00402429 48 8d 05 b0 10 00 00 lea 0x10BF,%rax  // weak_var
  ld   0x000011b2 48 8b 05 1f 2e 00 00 mov 0x2E2E,%rax  // DYNAMIC(weak_var)
  lld  0x00001a12 48 8b 05 3f 12 00 00 mov 0x124E,%rax  // DYNAMIC(weak_var)
  ORIG            48 8b 05 00 00 00 00 mov 7,%rax  // R_X86_64_REX_GOTPCRELX -&gt; `weak_var`
  TRACE           relaxation=MovIndirectToLea value_flags=ADDRESS resolution_flags=DIRECT
  
                  mov (%rax),%eax
                  pop %rbp
                  ret
</code></pre></div></div>

<p>Here we can see the disassembly of the function <code class="language-plaintext highlighter-rouge">get_weak_var</code>. At the top and bottom are
instructions that are the same in the output of all three linkers.</p>

<p>In the middle is an instruction that is different. First we have a row for each of the three
linkers, wild, GNU ld and lld. We can see that GNU ld and lld both produced relative move
instructions that reference a dynamic relocation for a variable called <code class="language-plaintext highlighter-rouge">weak_var</code>. Wild however is
loading a relative address directly with no dynamic relocation. This may in fact still run
correctly, but only if this variable isn’t overridden at runtime by the main executable or another
shared object. So this is, or rather was, a bug in Wild.</p>

<p>When diagnosing failures like this, it’s very helpful to be able to see what was in the input file.
I used to find this manually; however, it was somewhat time-consuming. So I added support to the linker
to write layout information to a .layout file. linker-diff then uses this to find where a particular
instruction came from in an input file and display that. That is shown on the line prefixed with
<code class="language-plaintext highlighter-rouge">ORIG</code>. The relocation type <code class="language-plaintext highlighter-rouge">GOTPCRELX</code> is especially useful in diagnosing what’s happening.</p>

<p>It’s often useful to be able to log the values of variables from the code in the linker. Matching
these log statements up to the output of the linker can be tricky. To help fix this, the linker can
associate tracing log statements with particular addresses in the output file. If linker-diff finds
any log messages associated with any of the bytes for an instruction that has a diff, then it’ll
display them. This is shown on the <code class="language-plaintext highlighter-rouge">TRACE</code> line above. The code in the linker that emitted this,
then looks like this:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="k">let</span> <span class="n">_span</span> <span class="o">=</span> <span class="nn">tracing</span><span class="p">::</span><span class="nd">span!</span><span class="p">(</span>
      <span class="nn">tracing</span><span class="p">::</span><span class="nn">Level</span><span class="p">::</span><span class="n">TRACE</span><span class="p">,</span> <span class="s">"relocation"</span><span class="p">,</span> <span class="n">address</span> <span class="o">=</span> <span class="n">place</span><span class="p">)</span><span class="nf">.entered</span><span class="p">();</span>
  <span class="o">...</span>
  <span class="k">if</span> <span class="k">let</span> <span class="nf">Some</span><span class="p">((</span><span class="n">relaxation</span><span class="p">,</span> <span class="n">r_type</span><span class="p">))</span> <span class="o">=</span>
      <span class="nn">Relaxation</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">r_type</span><span class="p">,</span> <span class="n">out</span><span class="p">,</span> <span class="n">offset_in_section</span><span class="p">,</span> <span class="n">value_flags</span><span class="p">,</span> <span class="n">output_kind</span><span class="p">)</span>
  <span class="p">{</span>
      <span class="nn">tracing</span><span class="p">::</span><span class="nd">trace!</span><span class="p">(</span><span class="o">?</span><span class="n">relaxation</span><span class="p">,</span> <span class="o">%</span><span class="n">value_flags</span><span class="p">,</span> <span class="o">%</span><span class="n">resolution_flags</span><span class="p">);</span>
      <span class="o">...</span>
  <span class="p">}</span>
</code></pre></div></div>

<p>The first line creates the variable <code class="language-plaintext highlighter-rouge">_span</code>. Until this variable goes out of scope, all uses of
<code class="language-plaintext highlighter-rouge">tracing::trace!</code> will be associated with the address specified when we created the span.</p>

<p>When a test fails, it’s useful to be able to rerun the failing linker invocation outside of the
context of the test. If the bug is in linker-diff, then it’s useful to be able to rerun that. So
when a test fails, I print out the command lines to do both of these. I can then copy and paste
whichever I’d like to work on into my terminal.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
Error: Validation failed.

WILD_WRITE_LAYOUT=1 WILD_WRITE_TRACE=1 OUT=/home/david/work/wild/wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.wild.so /home/david/work/wild/wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.wild.save/run-with cargo run --bin wild --

 To revalidate:

cargo run --bin linker-diff -- --wild-defaults --ignore '.got.plt,.dynamic.DT_PLTGOT,.dynamic.DT_JMPREL,.dynamic.DT_NEEDED,.dynamic.DT_PLTREL,.dynamic.DT_FLAGS,.dynamic.DT_FLAGS_1,section.plt.entsize,section.relro_padding' --ref /home/david/work/wild/wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.ld.so --ref /home/david/work/wild/wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.lld.so /home/david/work/wild/wild/tests/build/libc-integration-0.clang-dynamic-b756cc1ceaeaa45d.wild.so
</code></pre></div></div>

<p>When I find a program that misbehaves when linked with Wild, the first thing I want to do is try to
figure out what Wild is getting wrong. To help with that, I’ve integrated support for running linker
diff into Wild itself. This is done by setting the environment variable <code class="language-plaintext highlighter-rouge">WILD_REFERENCE_LINKER</code> to
the name of a reference linker to invoke.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">WILD_REFERENCE_LINKER</span><span class="o">=</span>ld <span class="nv">RUSTFLAGS</span><span class="o">=</span><span class="s2">"-Clinker=clang -Clink-args=--ld-path=wild"</span> cargo <span class="nb">test</span>
</code></pre></div></div>

<p>When set, Wild will run the reference linker (GNU ld) with the same arguments as those it was
invoked with, but change the output file. It’ll then invoke linker-diff to check for unexpected
differences, then fail the link if any are found.</p>

<p>Once I’ve identified the part that Wild is getting wrong, I can try to add something similar to one
of my existing test programs.</p>

<p>Wild’s tests still have lots more that needs doing. I’ve mostly focussed on the happy path so far,
since getting even that right is tricky. Soon I’ll probably need to start looking at testing error
conditions. I’ll likely follow a somewhat similar approach of having some test programs and making
sure that both the reference linker and Wild reject them and that each linker includes some specific
string in the error output - e.g. the name of a symbol that was unresolved.</p>

<p>At some point in the future, I’m interested in trying fuzzing as a testing strategy. Profile-guided
fuzzing could find interesting inputs that hit corner cases in the linker not covered by regular
tests.</p>

<p>The eventual plan for Wild is to make it incremental. When it comes time to start working on this, I
think linker-diff will again be useful. My plan is to test as follows:</p>

<ul>
  <li>Link a test program with wild. Call this output A.</li>
  <li>Make a random change to the input objects (possibly via fuzzing), then link this with wild. Call
this output B.</li>
  <li>Undo the random change we made and incrementally link. Call this output C.</li>
  <li>A and C should be semantically the same, so if we diff them with linker-diff, it should report no
differences.</li>
</ul>

<p>Another strategy I’m keen to employ is mutant testing (see <a href="https://mutants.rs/">mutants.rs</a>). This
makes random changes to your code that should change behaviour - e.g. inverting a comparison - then
checks if any of your tests pick up the change. Not only does this have the potential to pick up
gaps in testing, but it may also help find bits of code that are unnecessary. I’d also be interested
in seeing if it could be used to rank tests by how many problems they detect that other tests miss.
Tests that only detect a subset of the bugs detected by other tests would be candidates for removal.</p>

<p>I hope this look into how I approach testing and in particular testing of the Wild linker has given
you some ideas for your own projects.</p>

<h1 id="thanks">Thanks</h1>

<p>Thanks to my <a href="https://github.com/sponsors/davidlattimore">github sponsors</a>. Your contributions help
to make it possible for me to continue to work on this kind of stuff rather than going and getting a
“real job”.</p>

<ul>
  <li>bearcove</li>
  <li>repi</li>
  <li>bes</li>
  <li>Urgau</li>
  <li>jonhoo</li>
  <li>Kobzol</li>
  <li>coastalwhite</li>
  <li>mstange</li>
  <li>bcmyers</li>
  <li>Shnatsel</li>
  <li>Rafferty97</li>
  <li>joshtriplett</li>
  <li>tommythorn</li>
  <li>flba-eb</li>
  <li>acshi</li>
  <li>teh</li>
  <li>yerke</li>
  <li>alexkirsz</li>
  <li>NobodyXu</li>
  <li>Pratyush</li>
  <li>ethanmsl</li>
  <li>+2 anonymous</li>
</ul>

<h1 id="discussion-threads">Discussion threads</h1>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1e54pml/testing_the_wild_linker/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[I’ve been writing a linker, called Wild (see previous posts). Today, I’m going to talk about my approach to testing the linker. I think this is an interesting case study in its own right, but also there’s aspects of the approach that can likely be applied to other projects.]]></summary></entry><entry><title type="html">Speeding up rustc by being lazy</title><link href="https://davidlattimore.github.io/posts/2024/06/05/speeding-up-rustc-by-being-lazy.html" rel="alternate" type="text/html" title="Speeding up rustc by being lazy" /><published>2024-06-05T13:00:00+00:00</published><updated>2024-06-05T13:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/06/05/speeding-up-rustc-by-being-lazy</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/06/05/speeding-up-rustc-by-being-lazy.html"><![CDATA[<p>I’ve been busy working on the Wild linker (see <a href="/">previous posts</a>), but wanted to divert for a moment to
look at some other compilation speed things that I’ve been thinking about. This post discusses
various thoughts about moving Rust codegen, monomorphisation and inlining later in compilation and
some of the ways this might reduce both from-scratch and incremental build times.</p>

<h1 id="dead-code">Dead code</h1>

<p>Dead code is code that gets compiled, but isn’t needed for the final binary. This might come from
crates in our dependency tree where we’re only using part of the crate. It might also be from impls
that we’re not using - e.g. lots of Debug and Clone impls that aren’t actually used. The amount of
dead code that we compile varies quite a bit by crate.</p>

<p>In order to assess how much code is getting compiled then discarded during linking, I <a href="https://github.com/davidlattimore/wild/blob/main/wild_lib/src/gc_stats.rs">added
support</a> to the Wild
linker to print garbage collection statistics. If I run this on ripgrep, which has a pretty lean and
well-tuned build, we find that 17% of the executable code compiled is discarded.</p>

<p>For a less well-tuned binary, let’s pick on one of my own crates, the evcxr REPL. It shows that 35%
of compiled code is discarded by the linker.</p>

<p>There’s already been work done in Rustc to support MIR-only rlibs. This would defer codegen to later
in compilation. A lot of that work has been motivated by wanting to support compiling libstd with
different options. Depending on how it’s done, we may be able to take advantage of it to make
codegen happen on-demand. If codegen is deferred until link time, we know what is and isn’t
referenced. e.g. we can start from main and see what is referenced. We can then perform codegen only
for those functions that are referenced.</p>
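
<p>A sketch of that kind of reachability walk is shown below. The names and data structures are
invented for illustration; rustc’s real representation is of course different.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::collections::{HashMap, HashSet, VecDeque};

// Starting from `main`, walk the "who references whom" graph and collect the
// set of functions that actually need codegen.
fn reachable(references: &amp;HashMap&lt;String, Vec&lt;String&gt;&gt;, root: &amp;str) -&gt; HashSet&lt;String&gt; {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(root.to_string());
    queue.push_back(root.to_string());
    while let Some(function) = queue.pop_front() {
        for callee in references.get(&amp;function).into_iter().flatten() {
            // Anything not yet seen gets queued; everything else is dead code
            // and never reaches the codegen stage.
            if seen.insert(callee.clone()) {
                queue.push_back(callee.clone());
            }
        }
    }
    seen
}
</code></pre></div></div>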

<h1 id="repeated-monomorphisations">Repeated monomorphisations</h1>

<p>Another source of wastage is duplicate monomorphisations. Generic code, such as
<code class="language-plaintext highlighter-rouge">std::Vec::&lt;T&gt;::push</code> can’t be compiled to native code until the type parameter T is substituted.
This means that it happens when building the crate that calls the function. But there could be
multiple crates or codegen units that make use of the same monomorphisation. Repeating codegen for
each of them is wasteful.</p>
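
<p>As a contrived illustration (the crate and function names are invented), the generic function below
is instantiated with the same type parameter from two different crates, and each crate can end up
with its own copy of the resulting machine code:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// shared_lib/src/lib.rs
pub fn largest&lt;T: Ord + Copy&gt;(items: &amp;[T]) -&gt; Option&lt;T&gt; {
    items.iter().copied().max()
}

// crate_a/src/lib.rs
pub fn largest_a(items: &amp;[u32]) -&gt; Option&lt;u32&gt; {
    shared_lib::largest(items) // instantiates largest::&lt;u32&gt; in crate_a
}

// crate_b/src/lib.rs
pub fn largest_b(items: &amp;[u32]) -&gt; Option&lt;u32&gt; {
    shared_lib::largest(items) // instantiates largest::&lt;u32&gt; again in crate_b
}
</code></pre></div></div>

<p>Whether those copies actually get deduplicated depends on the build configuration, as discussed
below.</p>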

<p>I did an investigation into duplicate functions by <a href="https://github.com/davidlattimore/duplicate-function-checker">creating a
tool</a> that determines what percentage
of the executable bytes in a binary are excess due to duplicate functions. For many build
configurations, about 5-10% of the machine code going into your executable is likely excess copies
of duplicated functions and most of that is due to repeating the same monomorphisation. You can read
more in the tool’s README.</p>

<p>This is not only wasteful of compilation time, but also binary size. For release builds, various
options such as setting <code class="language-plaintext highlighter-rouge">codegen-units=1</code> and fat LTO reduce this duplication, however these options
also hurt build times, so we need another solution.</p>

<p>There are two different sources of repeated monomorphisations. The first is between codegen units
within a crate. This seems to mostly be an issue for release builds because the monomorphisations
are put into the codegen units they’re referenced from, in case LLVM wants to inline them.</p>

<p>The second source of repeated monomorphisations is between crates. If multiple crates all need the
same monomorphisations, then each crate produces it. These duplicates happen in both release and
debug builds.</p>

<p>When compiling C++ code, GCC and Clang emit such monomorphisations as weak symbols rather than local
symbols like rustc does. This lets the linker deduplicate them. This might be an option for reducing
binary sizes, although it’s complicated by Rust’s use of the archive format for rlibs, since if the
only symbols referenced in an archive entry are weak symbols, then the archive entry won’t be
loaded. bjorn3 points out that this could be fixed by passing <code class="language-plaintext highlighter-rouge">--whole-archive</code> to the linker. Like
setting <code class="language-plaintext highlighter-rouge">codegen-units=1</code>, this only helps the binary size-problem, not the wasted-compilation-time
problem.</p>

<p>Some work on this problem has already been done in the form of the unstable flag <code class="language-plaintext highlighter-rouge">-Zshare-generics</code>
which is on by default for non-optimised builds. This does reduce the number of duplicate
monomorphisations, but there’s still plenty of duplicates from different crates remaining.</p>

<p>Duplicates originating from the same monomorphisation in different crates are somewhat tricky to
solve, but one possibility is to do something similar to the proposal above for dead code, i.e. defer
monomorphisation to link time. Doing this would mean that we could create just one copy of each
monomorphised function.</p>

<h1 id="recompiling-dependents-on-implementation-changes">Recompiling dependents on implementation changes</h1>

<p>Another source of wastage happens when you have several crates and you’re making changes to a
library crate, then rebuilding some binary that depends on the library crate that you edited.
Currently cargo rebuilds all crates in the dependency tree between the crate that you edited and the
binary crate you’re building.</p>

<p><img src="/images/lazy/crate-graph.svg" alt="Diagram of several crates in a workspace" /></p>

<p>In the diagram above, if <code class="language-plaintext highlighter-rouge">A</code> is our binary (or a test crate) and we’re making edits to the
implementation of a function in <code class="language-plaintext highlighter-rouge">F</code>, say adding and removing print statements, then each time we
make a change, rustc needs to be invoked on all the crates with the dashed outlines. However
ideally, it should be possible to just recompile <code class="language-plaintext highlighter-rouge">F</code>, then relink <code class="language-plaintext highlighter-rouge">A</code>.</p>

<p>Currently when the rust compiler compiles a library crate, it emits an rmeta file, then later emits
an rlib containing the results of codegen. The rmeta file for a build (as opposed to a check)
currently includes the MIR of all the functions in the crate. This is currently necessary, since the
dependent crates might want to inline some of the functions.</p>

<p>If we’re delaying codegen to link time, then we can also delay inlining. This means that we don’t
need the MIR in order to compile the dependent crates. This would give us two advantages:</p>

<ul>
  <li>Pipelined compilation can work better, since we don’t need to wait for the MIR to be ready before
the dependent crates can be built.</li>
  <li>We don’t need to rerun rustc on the dependent crates when the rlibs change, only when the rmeta
changes. That means that if you edit the implementation of a function in one of your library
crates, you only need to rerun rustc for that one library crate and then relink. During relinking,
any functions that changed as well as any functions that inlined changed functions would go
through codegen.</li>
</ul>

<h1 id="parallelism">Parallelism</h1>

<p>When doing work in parallel across multiple threads or processes, if one or a few units of work
finish significantly later than the rest, things can slow down because we have CPU cores
sitting idle with nothing useful to do. I’ll call these late finishers “stragglers”.</p>

<p>Currently in the Rust compiler, codegen of one crate can happen at the same time as earlier
compilation stages of another crate. By deferring codegen until we’re building the final binary, we
introduce an extra wait-point where we can potentially get stragglers.</p>

<p>One mitigation that we already get with the changes proposed above is that a normal build becomes
more like a <code class="language-plaintext highlighter-rouge">cargo check</code> in terms of pipelining. Rather than emitting .rmeta files containing MIR,
the compiler emits .rmeta files without MIR. This means that dependent crates can start being
compiled earlier because they don’t need to wait for the MIR of their dependencies.</p>

<p>However we still need to wait for the MIR for the last crate(s) to finish being written before we
can start codegen. One potential for increased parallelism here is that rustc could make the MIR for
a crate available before it has finished checking the crate. The compiler stages might look
something like this:</p>

<ul>
  <li>Parse files and do everything that’s required to write a .rmeta file containing only what’s needed to
check dependent crates, i.e. emit interface information, type information, exported macros etc.
Once this finishes, dependent crates can start being compiled. This is similar to what currently
happens with a cargo check. Then emit a MIR-only rlib. Once that finishes for all crates needed by a
binary, codegen and linking of that binary can begin.</li>
  <li>Complete remaining error-checking of the crate. Cargo would wait for this to complete, but other
steps including codegen and linking can run concurrently with this.</li>
</ul>

<p>So this is a form of pipelined building, similar to the pipelined building that cargo and rustc currently do. For comparison, this is what currently happens during a build:</p>

<ul>
  <li>Do everything that’s required to write a .rmeta file. Unlike above, this contains MIR, since
subsequent crates might need the MIR in order to inline functions during codegen. Once this is
completed, subsequent crates can start building.</li>
  <li>Codegen crate. Once this is completed, the final binary can be built.</li>
</ul>

<h1 id="finer-grained-codegen-units">Finer-grained codegen units</h1>

<p>One way to reduce stragglers is by having smaller units of work. Currently the Rust compiler is a
bit limited as to how small it can make codegen units. Some of the things that limit the Rust
compiler are affected by changes proposed above.</p>

<p>My ideal would be if we could codegen each function separately. That maximises parallelism and also
means that when doing incremental compilation we can avoid the need to repeat codegen for other
functions that just happened to be in the same codegen unit on the previous build. However we’d need
to make sure we’re not repeating any work in multiple codegen units.</p>

<p>If this were done today, one source of repeated work would be that any generic functions called by
the function that we were going to codegen might, depending on optimisation level, be included too.
Above, this post proposed that if we’re not inlining a generic function, that we codegen it only
once and make it global rather than local. That would allow us to not include it together with the
function that we’re compiling.</p>

<p>Apparently the cranelift backend already does codegen for each function independently.</p>

<p>Writing a separate object file for each function is unlikely to be practical or efficient. There are
potential limits on how many arguments can be passed to the linker. The linker also might not be
optimised for this. Having one function per object within an archive might be a possibility,
although experimentation would be needed to see how well different linkers handled that. The
alternative would be to pack multiple function definitions into a single object file even though
they went through codegen separately.</p>

<h1 id="linker-integration">Linker integration?</h1>

<p>One option for deferring codegen would be to integrate codegen into the linker. This could take the
form of building a linker into rustc and then using rustc as the linker.</p>

<p>An alternative would be to do codegen just prior to linking.</p>

<p>Integrating a linker into rustc would have some advantages:</p>

<ul>
  <li>The linker is already doing a graph traversal, taking advantage of that avoids the need to do a
separate graph traversal in the compiler.</li>
  <li>If you have a mix of code from Rust and other languages (e.g. C or C++), then the linker has a
view of all of this. If doing the graph traversal without the help of the linker, we’d need to
assume that any function that could be called from another language is called.</li>
  <li>Caching is probably easier with tighter linker integration, since the linker can read entries
directly from a cache and we’re not constrained to putting everything in object files.</li>
</ul>

<p>However, the main disadvantage of such tight linker integration is that we then don’t get all the
benefits of this work unless we’re using the integrated linker. My linker, Wild, is still a way off
being ready for general use on Linux and I haven’t even started to look at porting to other
platforms. So I think it’s important to try to do deferred codegen without integrating the linker.</p>

<p>Doing codegen just prior to linking could be done as follows:</p>

<ul>
  <li>Compile binary crates to rlibs rather than directly invoking the linker when the binary crate gets
compiled.</li>
  <li>Have cargo invoke rustc to perform the linking step. This final rustc invocation would determine
what codegen was needed, do it, then invoke the linker on the resulting object files.</li>
</ul>

<h1 id="caching">Caching</h1>

<p>If codegen is deferred until we are building a binary, then we need to make sure that we avoid
repeating the same codegen more than once. This means that when doing a warm build, we need only do
codegen for new / changed code.</p>

<p>We also need to be careful if we’re building multiple binary crates. All binary crates need to be
able to share the codegen outputs where appropriate. One way to achieve this might be as follows.</p>

<ul>
  <li>Keep an index file in which we record which functions are in which object files.</li>
  <li>When invoking rustc to do codegen / linking, lock the index file, figure out which functions we’re
going to codegen, update the index file to indicate which object files those new functions will be
in, create those files and lock them, then release the lock on the main index file (see the sketch
after this list).</li>
  <li>That should hold the main index lock for a relatively short time after which another rustc process
can do the same.</li>
  <li>When we finish doing codegen, before we invoke the linker, make sure that none of the object files
we are going to pass to the linker are still locked, which would indicate that the rustc process
writing them was still working.</li>
</ul>
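
<p>The index-locking step of the scheme above might look roughly like the following sketch. It assumes
the <code class="language-plaintext highlighter-rouge">fs2</code> crate for advisory file locking; the index format and everything else here is invented for
illustration.</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use std::fs::{File, OpenOptions};

use fs2::FileExt; // assumed dependency providing advisory file locks

// Hold the lock on the shared index only while we decide which functions this
// rustc invocation will codegen and record which object files they'll land
// in. Codegen itself then runs without the index lock held, so other rustc
// processes can plan their own work concurrently.
fn plan_codegen(index_path: &amp;str) -&gt; std::io::Result&lt;()&gt; {
    let index: File = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(index_path)?;

    index.lock_exclusive()?;
    // ... read the index, pick the functions that still need codegen and
    // record the object files they'll be written to ...
    index.unlock()?;

    Ok(())
}
</code></pre></div></div>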

<p>Caching is quite possibly the hardest part of all of this to get correct. Ideally we’d like to avoid
storing each compiled function twice (once in the cache and once in the object file to be linked),
but this does make things significantly more complicated, especially without causing
non-determinism.</p>

<h1 id="keeping-memory-usage-in-check">Keeping memory usage in check</h1>

<p>With all codegen being done by the one rustc process, care needs to be taken to ensure memory usage
isn’t too high. Several strategies might help here:</p>

<ul>
  <li>Store graph information (what references what) separately from the MIR so that we can do a graph
traversal without loading all the MIR.</li>
  <li>Load the MIR for each function only when we’re ready to codegen it, write the resulting machine
code into an object file then drop it and the MIR.</li>
</ul>

<h1 id="why-not-do-all-compilation-on-demand">Why not do all compilation on demand?</h1>

<p>It would be pretty hard to retrofit fully on-demand compilation to a mature compiler like rustc.
It’s also unclear how much you could actually skip. At least a bit of processing of each file is
required in order to find all trait implementations so that method resolution can give correct
results.</p>

<p>Correctness checks within function bodies could potentially be done only for functions that were
reachable. But that raises lots of questions about whether you’d want to do that. I’ve heard that
Zig doesn’t report some errors for dead code.</p>

<p>At least for now, it’s better to still do all correctness checking even for dead code.</p>

<h1 id="related-work">Related work</h1>

<p>I was somewhat inspired here by an excellent episode of the Software Unscripted podcast in which
the host, Richard Feldman, interviewed matklad. The episode is called “Incremental Compilation with
Alex Kladov”
(<a href="https://podcasts.apple.com/us/podcast/incremental-compilation-with-alex-kladov/id1602572955?i=1000647825248">link</a>).
In a much earlier episode of the same podcast, Richard interviewed Andrew Kelley, the creator of Zig
(<a href="https://podcasts.apple.com/us/podcast/open-source-with-zig-creator-andrew-kelley/id1602572955?i=1000554066581">link</a>);
Zig does a lot of its compilation in a more on-demand way.</p>

<p>Various related previous discussions:</p>

<ul>
  <li><a href="https://internals.rust-lang.org/t/laziness-in-the-compiler/19112">Laziness in the compiler</a> (July 2023)</li>
  <li><a href="https://internals.rust-lang.org/t/towards-a-second-edition-of-the-compiler/5582">Towards a second edition of the
compiler</a> (July
2017)</li>
  <li>MIR-only RLIBs (Discussions on github / Zulip from January 2017 and more in 2024!)
    <ul>
      <li>The motivations for MIR-only RLIBs are different from those of this post, but there’s
substantial cross-over.</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">-Z share-generics</code>
    <ul>
      <li><a href="https://internals.rust-lang.org/t/explicit-monomorphization-for-compilation-time-reduction/15907">Explicit monomorphization for compilation time
reduction</a></li>
    </ul>
  </li>
</ul>

<p>I’ve only linked to discussions where they’re archived, but you can find open issues etc on these
topics with a quick web search.</p>

<h1 id="next-steps">Next steps</h1>

<p>I’m busy working on the <a href="https://github.com/davidlattimore/wild">Wild linker</a>; however, I
think I may have some bandwidth to start working on at least one of the ideas here. I haven’t yet
figured out which one. If you’ve got any comments or would like to discuss this, my contact details
are on my about page.</p>

<h1 id="thanks">Thanks</h1>

<p>Thanks to bjorn3, simulacrum, Jakub Beránek, davidtwco and nora (Nilstrieb) for providing feedback
on an earlier draft of this post. Any errors or inaccuracies are mine.</p>

<p>Thanks also to my <a href="https://github.com/sponsors/davidlattimore">github sponsors</a>. Your contributions
help to make it possible for me to continue to work on this kind of stuff rather than going and
getting a “real job”.</p>

<ul>
  <li>repi</li>
  <li>bes</li>
  <li>Urgau</li>
  <li>coastalwhite</li>
  <li>mstange</li>
  <li>bcmyers</li>
  <li>Shnatsel</li>
  <li>Rafferty97</li>
  <li>joshtriplett</li>
  <li>acshi</li>
  <li>teh</li>
  <li>yerke</li>
  <li>alexkirsz</li>
  <li>Pratyush</li>
  <li>lexara-prime-ai</li>
  <li>ethanmsl</li>
  <li>+1 anonymous</li>
</ul>

<h1 id="discussion-threads">Discussion threads</h1>

<ul>
  <li><a href="https://www.reddit.com/r/rust/comments/1d9b36j/speeding_up_rustc_by_being_lazy/">Reddit</a></li>
</ul>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[I’ve been busy working on the Wild linker (see previous posts), but wanted to divert for a moment to look at some other compilation speed things that I’ve been thinking about. This post discusses various thoughts about moving Rust codegen, monomorphisation and inlining later in compilation and some of the ways this might reduce both from-scratch and incremental build times.]]></summary></entry><entry><title type="html">Video: Rust Sydney - A linker in the Wild</title><link href="https://davidlattimore.github.io/posts/2024/04/17/video-rust-syd-wild-linker.html" rel="alternate" type="text/html" title="Video: Rust Sydney - A linker in the Wild" /><published>2024-04-17T13:00:00+00:00</published><updated>2024-04-17T13:00:00+00:00</updated><id>https://davidlattimore.github.io/posts/2024/04/17/video-rust-syd-wild-linker</id><content type="html" xml:base="https://davidlattimore.github.io/posts/2024/04/17/video-rust-syd-wild-linker.html"><![CDATA[<p>This week I presented a talk at the Rust Sydney meetup about the Wild linker.</p>

<p><a href="https://www.youtube.com/watch?v=WSHt3-gwVxc">Video</a></p>

<p>There are also <a href="https://docs.google.com/presentation/d/149uYKGbT0Jn4N6tBqdGTc6DEAX1olmj3m7H7qdMJJdU/edit?usp=sharing">slides, including speaker
notes</a>
with roughly what I said, or intended to say.</p>

<p><a href="https://www.reddit.com/r/rust/comments/1c7izhz/video_a_linker_in_the_wild_rust_linker/">Discussion on
Reddit</a></p>]]></content><author><name></name></author><category term="posts" /><summary type="html"><![CDATA[This week I presented a talk at the Rust Sydney meetup about the Wild linker.]]></summary></entry></feed>