
# Garbage Collection

In a managed language, the language implementation manages memory allocation and freeing on the user’s behalf. When the program performs an operation that requires dynamic memory, the virtual machine allocates it automatically, and the programmer never has to deallocate anything. A garbage collector is what lets the implementation guarantee that any memory the program is still using sticks around for as long as it is needed.

# Reachability

How does a VM tell which memory is no longer needed? A piece of memory is still in use if it could possibly be read in the future, but the VM cannot know that in general, so it uses a conservative approximation: reachability. A value is reachable if there is some way for a user program to get a reference to it.

Some values can be directly accessed:

```
var global = "string";
{
  var local = "another";
  print global + local;
}
```

These values are available directly, on the stack or as entries in the global hashmap, and are called roots. Any value referenced by a root must also still be alive and hence is also reachable. That gives a recursive procedure for collection:

  1. Starting with the roots, traverse through object references to find the full set of reachable objects.
  2. Free all objects not in that set.

# Mark-Sweep Garbage Collection
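
Tracing the reachable set and then freeing everything else is exactly what a mark-sweep collector does: a mark phase that walks the object graph from the roots, and a sweep phase that walks the whole heap and frees anything left unmarked. Below is a minimal C sketch of that two-step procedure; the object layout, the fixed-size root array, and names like `markObject` and `sweep` are illustrative assumptions, not any particular VM's implementation.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical object layout: every heap object carries a mark bit and
 * lives on one intrusive linked list so the sweep phase can walk the
 * whole heap. */
typedef struct Obj {
    bool isMarked;
    struct Obj *next;       /* next object in the all-objects list */
    struct Obj *reference;  /* a single outgoing reference, for simplicity */
} Obj;

static Obj *allObjects = NULL;   /* every allocated object */
static Obj *roots[16];           /* stack slots, globals, ... */
static int rootCount = 0;

static Obj *newObject(void) {
    Obj *obj = calloc(1, sizeof(Obj));
    obj->next = allObjects;
    allObjects = obj;
    return obj;
}

/* Mark phase: starting from the roots, traverse references and mark
 * every object we can reach. The isMarked check terminates cycles. */
static void markObject(Obj *obj) {
    if (obj == NULL || obj->isMarked) return;
    obj->isMarked = true;
    markObject(obj->reference);
}

/* Sweep phase: walk the whole heap and free anything left unmarked,
 * clearing the mark bits of survivors for the next collection. */
static void sweep(void) {
    Obj **link = &allObjects;
    while (*link != NULL) {
        Obj *obj = *link;
        if (obj->isMarked) {
            obj->isMarked = false;
            link = &obj->next;
        } else {
            *link = obj->next;
            free(obj);
        }
    }
}

static void collectGarbage(void) {
    for (int i = 0; i < rootCount; i++) markObject(roots[i]);
    sweep();
}

int main(void) {
    Obj *reachable = newObject();
    reachable->reference = newObject();   /* reachable through a root */
    newObject();                          /* garbage: nothing refers to it */

    roots[rootCount++] = reachable;
    collectGarbage();

    int survivors = 0;
    for (Obj *o = allObjects; o != NULL; o = o->next) survivors++;
    printf("%d of 3 objects survived the collection\n", survivors);  /* 2 */
    return 0;
}
```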

# Tricolor Abstraction

Each object has a conceptual “color” that tracks what state the object is in and what work is left to do: white objects have not been visited, gray objects have been reached but their references have not been traced yet, and black objects are fully processed. This allows the GC to pause mid-collection and pick up where it left off when needed.
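
One common way to realize the tricolor abstraction is a gray worklist: newly reached objects are pushed onto it, and the collector repeatedly pops an object, marks whatever it references, and blackens it. A hedged sketch along those lines, assuming a fixed-size worklist and a toy `Obj` with at most two outgoing references (none of this is taken from a real collector):

```c
#include <stddef.h>
#include <stdio.h>

typedef enum { WHITE, GRAY, BLACK } Color;

typedef struct Obj {
    Color color;
    struct Obj *references[2];  /* outgoing edges, NULL if unused */
} Obj;

/* Gray worklist: objects that are reachable but whose references we
 * have not traced yet. The GC can stop after any number of iterations
 * and resume later; the worklist records where it left off. */
static Obj *grayStack[256];
static int grayCount = 0;

static void markGray(Obj *obj) {
    if (obj == NULL || obj->color != WHITE) return;
    obj->color = GRAY;
    grayStack[grayCount++] = obj;
}

/* Blacken one gray object: mark everything it references gray, then
 * declare it done. */
static void blackenOne(void) {
    Obj *obj = grayStack[--grayCount];
    for (size_t i = 0; i < 2; i++) markGray(obj->references[i]);
    obj->color = BLACK;
}

/* Trace until no gray objects remain; everything still WHITE afterwards
 * is unreachable and can be swept. */
static void traceReferences(void) {
    while (grayCount > 0) blackenOne();
}

int main(void) {
    Obj a = {WHITE, {NULL, NULL}};
    Obj root = {WHITE, {&a, NULL}};
    markGray(&root);      /* roots go onto the gray worklist first */
    traceReferences();    /* root and a end up BLACK; unreached objects stay WHITE */
    printf("root is %s\n", root.color == BLACK ? "black" : "not done");
    return 0;
}
```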

# Weak References

A weak reference is a reference that does not protect the referenced object from being collected by the GC. For example, every string saved in the intern table is referenced directly by the VM, but those references must be treated as weak; if they counted as roots, the GC would never be able to free an interned string.
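
One way a mark-sweep collector can honor this is to clear dead entries out of the intern table after marking and before sweeping: any string that was never marked through a real reference is about to be freed, so its table entry is dropped first. A sketch under assumed names and a simplified table layout (a real open-addressed table would leave a tombstone rather than an empty slot):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool isMarked;
    /* ... characters, length, hash, etc. ... */
} ObjString;

typedef struct {
    ObjString *key;     /* NULL marks an empty slot */
} Entry;

typedef struct {
    Entry *entries;
    size_t capacity;
} Table;

/* Run after the mark phase, before the sweep phase: drop every intern
 * table entry whose key string was never marked. The table's reference
 * alone does not keep a string alive, which is what makes it weak. */
static void tableRemoveUnmarked(Table *table) {
    for (size_t i = 0; i < table->capacity; i++) {
        Entry *entry = &table->entries[i];
        if (entry->key != NULL && !entry->key->isMarked) {
            entry->key = NULL;  /* the sweep phase will free the string */
        }
    }
}

int main(void) {
    ObjString alive = { true };          /* marked through a real reference */
    ObjString dead = { false };          /* only the intern table knows it */
    Entry entries[2] = { { &alive }, { &dead } };
    Table interned = { entries, 2 };
    tableRemoveUnmarked(&interned);      /* the entry for `dead` is cleared */
    return entries[1].key == NULL ? 0 : 1;
}
```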

# When to collect

Every managed language pays a performance price compared to explicit, user-authored deallocation. The time spent actually freeing memory is the same, but the GC spends cycles figuring out which memory to free. That is time not spent running the user’s code and doing useful work. In our implementation, that’s the entirety of the mark phase. The goal of a sophisticated garbage collector is to minimize that overhead.

# Self-adjusting heap

The collection frequency adjusts automatically based on the live size of the heap. We track the total number of bytes of managed memory the VM has allocated; when that count rises above some threshold, we trigger a GC. Afterwards we note how many bytes remain, that is, how many were not freed, and set the next threshold to some value larger than that.
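
A minimal sketch of that bookkeeping, assuming every allocation and free is funneled through one `reallocate`-style function and that the threshold is simply the surviving size times a growth factor; the names, the initial threshold, and the factor are all illustrative:

```c
#include <stddef.h>
#include <stdlib.h>

#define HEAP_GROW_FACTOR 2          /* illustrative tuning constant */

static size_t bytesAllocated = 0;   /* live managed bytes, as far as we know */
static size_t nextGC = 1024 * 1024; /* threshold that triggers a collection */

static void collectGarbage(void) {
    /* mark roots, trace references, sweep -- see the earlier sketch */
}

/* Every allocation, resize, and free goes through this one function so
 * the VM can keep an accurate running total of managed memory. */
static void *reallocate(void *pointer, size_t oldSize, size_t newSize) {
    bytesAllocated += newSize - oldSize;  /* unsigned wrap handles shrinking */

    if (newSize > oldSize && bytesAllocated > nextGC) {
        collectGarbage();
        /* After collecting, bytesAllocated reflects what survived. Set the
         * next threshold above that so we don't immediately collect again. */
        nextGC = bytesAllocated * HEAP_GROW_FACTOR;
    }

    if (newSize == 0) {
        free(pointer);
        return NULL;
    }
    return realloc(pointer, newSize);
}

int main(void) {
    char *buffer = reallocate(NULL, 0, 64);   /* counted toward the threshold */
    buffer = reallocate(buffer, 64, 128);     /* growth also counted */
    reallocate(buffer, 128, 0);               /* freeing subtracts it back */
    return 0;
}
```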

The result is that as the amount of live memory increases, we collect less frequently in order to avoid sacrificing throughput by re-traversing the growing pile of live objects. As the amount of live memory goes down, we collect more frequently so that we don’t lose too much latency by waiting too long.

# Generational GC

A collector loses throughput if it spends a long time re-visiting objects that are still alive. But it can increase latency if it avoids collecting and accumulates a large pile of garbage to wade through. If only there were some way to tell which objects were likely to be long-lived and which weren’t. Then the GC could avoid revisiting the long-lived ones as often and clean up the ephemeral ones more frequently.

The key observation is that most objects are very short-lived, but once they survive beyond a certain age, they tend to stick around for quite a long time. The longer an object has lived, the longer it is likely to continue to live.
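
Generational collectors act on this by collecting the young generation frequently and promoting survivors into an old generation that is visited rarely. The sketch below shows only that promotion step, assuming a mark phase like the earlier one has already run over the young objects; it ignores the write barrier a real generational collector needs to track old-to-young references, and the names and the promotion threshold are illustrative.

```c
#include <stdbool.h>
#include <stdlib.h>

#define PROMOTE_AFTER 2   /* survive this many young collections, then promote */

typedef struct Obj {
    bool isMarked;
    int survivedCollections;
    struct Obj *next;
} Obj;

static Obj *youngObjects = NULL;  /* newly allocated, collected often */
static Obj *oldObjects = NULL;    /* long-lived, collected rarely */

static Obj *newYoungObject(void) {
    Obj *obj = calloc(1, sizeof(Obj));
    obj->next = youngObjects;
    youngObjects = obj;
    return obj;
}

/* A "minor" collection only sweeps the young generation. Survivors age,
 * and anything that has survived enough collections is moved onto the
 * old list so future minor collections skip it entirely. */
static void sweepYoung(void) {
    Obj **link = &youngObjects;
    while (*link != NULL) {
        Obj *obj = *link;
        if (!obj->isMarked) {
            *link = obj->next;
            free(obj);
            continue;
        }
        obj->isMarked = false;
        obj->survivedCollections++;
        if (obj->survivedCollections >= PROMOTE_AFTER) {
            *link = obj->next;           /* unlink from the young list... */
            obj->next = oldObjects;      /* ...and promote to the old one */
            oldObjects = obj;
        } else {
            link = &obj->next;
        }
    }
}

int main(void) {
    Obj *keeper = newYoungObject();
    newYoungObject();                    /* garbage: never marked */
    for (int i = 0; i < PROMOTE_AFTER; i++) {
        keeper->isMarked = true;         /* pretend the mark phase ran */
        sweepYoung();
    }
    /* keeper now lives on oldObjects and is skipped by minor collections */
    return (oldObjects == keeper) ? 0 : 1;
}
```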