How Modern Programming Languages Implement Threading, and how I did it
Building a reliable thread runtime for Sunflower
Threads are one of the harder parts of language implementation, both to use correctly and to implement correctly. Writing thread-safe code is tedious and notoriously difficult to debug: race conditions, memory leaks, and subtle undefined behavior are common. Languages with thoughtful concurrency models (Go and its goroutines are a frequent example) make this problem far more manageable.
I’ve had my share of painful bugs while implementing threading in my own language, Sunflower. Below I explain the problems I faced, the design choices I considered, and the working implementation I arrived at.
# Problems I ran into
Some of the hard issues I encountered while implementing threads:
- Tracking down undefined behavior such as random segmentation faults.
- Allocating too many OS threads and causing resource exhaustion / leaks.
- Freeing resources too early and producing dangling pointers.
- A few other embarrassments I’ll spare you the details of.
# What is a thread?
A thread is a lightweight sequence of execution that runs concurrently with other threads and the main program. Each thread has its own stack and program counter and behaves like a small, independent program. For example, in a game one thread might listen for input, another maintain the network connection, and another render frames; they run concurrently and (on multicore hardware) possibly in parallel.
# The naïve approach, and why it fails
My initial approach created one OS thread for every high-level thread requested from Sunflower’s frontend. That works for a few threads but quickly hits system limits: creating thousands of OS threads leads to segmentation faults or poor performance because the OS can’t allocate stack and scheduling resources for so many threads.
This made me study how modern languages handle high numbers of concurrent tasks.
# How other languages solve it (brief)
Go (and several other runtimes) uses an N:M threading model: map N logical jobs (goroutines) onto M OS threads (workers), where M is sized according to hardware concurrency. The runtime schedules many logical tasks on a smaller pool of OS threads and thus avoids the cost and limit of one-to-one OS thread allocation.
# My approach for Sunflower
I implemented a worker pool: M long-lived OS threads (workers) that sleep when idle and wake to execute queued jobs. Workers are driven by a `condition_variable`: when a job is queued, the runtime wakes one worker (or more), and a free worker picks up and executes the job.
This differs from a strict round-robin scheduler: jobs go to whichever free worker wakes up first, which spreads load across the pool without any explicit balancing logic.
# High-level architecture
- A global queue of pending `ThreadHandle` jobs (`q_jobs`).
- M worker threads (`v_workers`) that wait on a `condition_variable`.
- `v_handles` stores active `ThreadHandle` pointers and allows reuse of slots via `idx_avl`.
- API primitives: `create`, `run`, `join`, `join_all`, and `close`.
# Initializer (create worker threads)
```cpp
static std::vector<std::thread> v_workers;

static void worker_loop (); /* defined below */

SF_API void
init_runtime_threads ()
{
  size_t n = std::thread::hardware_concurrency ();
  if (!n)
    n = 4; /* hardware_concurrency() may return 0 when unknown */
  for (size_t i = 0; i < n; ++i)
    v_workers.emplace_back (worker_loop);
}
```

`v_workers` holds the M workers that watch the job queue and execute work.
# Worker loop
```cpp
static void
worker_loop ()
{
  while (1)
    {
      ThreadHandle *th = nullptr;
      {
        std::unique_lock<std::mutex> lock (wl_mutex);
        t_cv.wait (lock, [] { return shutting_down || !q_jobs.empty (); });
        if (shutting_down && q_jobs.empty ())
          return;
        th = q_jobs.front ();
        q_jobs.pop ();
      }
      if (th == nullptr)
        return;
      /* call sunflower function (execute the job) */
      th->get_done () = true;
      if (th->get_is_closed ())
        {
          size_t id = th->get_id ();
          {
            std::lock_guard<std::mutex> lock (cre_mutex);
            delete v_handles[id];
            v_handles[id] = nullptr;
            idx_avl.push_back (id);
          }
        }
      /* local cleanup */
    }
}
```

Key point: the worker uses `std::unique_lock` + `condition_variable::wait(predicate)` to block efficiently until work arrives or shutdown is requested. This avoids busy-waiting and wasted CPU cycles.
# State variables
```cpp
static Vec<ThreadHandle *> v_handles;     /* active handles */
static Vec<size_t> idx_avl;               /* free indices */
static std::mutex cre_mutex;              /* protect create/close operations */
static std::mutex wl_mutex;               /* protect worker loop queue access */
static std::mutex thr_mutex;              /* protect pushing jobs into queue */
static std::condition_variable t_cv;
static std::queue<ThreadHandle *> q_jobs;
static std::vector<std::thread> v_workers;
static bool shutting_down = false;
```

# ThreadHandle (summary)
```cpp
class ThreadHandle
{
private:
  Object *name = nullptr;
  Object *args = nullptr;
  Module *mod = nullptr;
  bool done = false;
  bool is_closed = false;
  size_t id = 0;

public:
  ThreadHandle(Object *_Name, Object *_Args, Module *_Mod)
      : name{_Name}, args{_Args}, mod{_Mod}
  {
    IR(name);
    IR(args);
  }

  Module *&get_mod();
  Object *&get_name();
  Object *&get_args();
  inline bool &get_done();
  inline size_t &get_id();
  inline bool &get_is_closed();
  void run();

  ~ThreadHandle()
  {
    DR(name);
    DR(args);
  }
};
```

(Where IR/DR are the Sunflower runtime reference-management macros.)
# Pushing jobs to the queue
```cpp
void
ThreadHandle::run()
{
  {
    /* Guard the queue with wl_mutex — the same mutex the workers'
       cv.wait uses. Guarding one queue with two different mutexes
       would be a data race. */
    std::lock_guard<std::mutex> lock(wl_mutex);
    q_jobs.push(this);
  }
  t_cv.notify_one();
}
```

run() enqueues the ThreadHandle and signals a worker to wake and pick it up. Notifying after the lock has been released lets the woken worker acquire the mutex immediately instead of blocking on it.
# Sunflower _Native_Thread API
create
Creates a ThreadHandle and stores it in v_handles, reusing indices from idx_avl when possible:
```cpp
SF_API Object *
create(Module *mod)
{
  std::lock_guard<std::mutex> lock(cre_mutex);
  Object *o_fname = mod->get_variable("fname");
  Object *o_fargs = mod->get_variable("fargs");
  assert(o_fname->get_type() == ObjectType::FuncObject);
  assert(o_fargs->get_type() == ObjectType::ArrayObj);
  ThreadHandle *th = new ThreadHandle(
      o_fname, o_fargs,
      static_cast<FunctionObject *>(o_fname)->get_v()->get_parent());
  size_t idx;
  if (idx_avl.get_size()) {
    size_t p = idx_avl.pop_back();
    v_handles[p] = th;
    idx = p;
  } else {
    idx = v_handles.get_size();
    v_handles.push_back(th);
  }
  th->get_id() = idx;
  Object *ret = static_cast<Object *>(new ConstantObject(
      static_cast<Constant *>(new IntegerConstant(static_cast<int>(idx)))));
  IR(ret);
  return ret;
}
```

run
Starts a previously created thread by enqueueing its handle:
```cpp
SF_API Object *
run(Module *mod)
{
  Object *o_id = mod->get_variable("id");
  assert(OBJ_IS_INT(o_id));
  size_t id = static_cast<size_t>(
      static_cast<IntegerConstant *>(
          static_cast<ConstantObject *>(o_id)->get_c().get())->get_value());
  assert(id < v_handles.get_size());
  v_handles[id]->run();
  Object *ret = static_cast<Object *>(new ConstantObject(
      static_cast<Constant *>(new NoneConstant())));
  IR(ret);
  return ret;
}
```

join
Waits until the single thread has finished. (This implementation busy-waits on the done flag; it’s a simple workaround given detached execution by default.)
```cpp
SF_API Object *
join(Module *mod)
{
  Object *o_id = mod->get_variable("id");
  assert(OBJ_IS_INT(o_id));
  size_t id = static_cast<size_t>(
      static_cast<IntegerConstant *>(
          static_cast<ConstantObject *>(o_id)->get_c().get())->get_value());
  assert(id < v_handles.get_size());
  ThreadHandle *&th = v_handles[id];
  while (!th->get_done())
    ;
  Object *ret = static_cast<Object *>(new ConstantObject(
      static_cast<Constant *>(new NoneConstant())));
  IR(ret);
  return ret;
}
```

Note: the busy-wait above is functional but not optimal, and because `done` is a plain `bool` written by a worker thread, the unsynchronized spin is technically a data race (the flag should at least be `std::atomic<bool>`). A better approach is a per-handle `condition_variable` or a future/promise, so `join()` can block efficiently without spinning.
join_all
Waits for all outstanding threads:
```cpp
SF_API Object *
join_all(Module *mod)
{
  for (ThreadHandle *&th : v_handles) {
    if (th == nullptr)
      continue;
    while (!th->get_done())
      ;
  }
  Object *ret = static_cast<Object *>(new ConstantObject(
      static_cast<Constant *>(new NoneConstant())));
  IR(ret);
  return ret;
}
```

close
Closes a handle, deleting it if finished or marking it for deletion (detached) if it’s still running:
```cpp
SF_API Object *
close(Module *mod)
{
  std::lock_guard<std::mutex> lock(cre_mutex);
  Object *o_id = mod->get_variable("id");
  assert(OBJ_IS_INT(o_id));
  size_t id = static_cast<size_t>(
      static_cast<IntegerConstant *>(
          static_cast<ConstantObject *>(o_id)->get_c().get())->get_value());
  /* bounds-check id before indexing into v_handles */
  if (id < v_handles.get_size() && v_handles[id] != nullptr) {
    ThreadHandle *&th = v_handles[id];
    if (th->get_done() && !th->get_is_closed()) {
      delete th;
      th = nullptr;
      idx_avl.push_back(id);
    } else if (!th->get_done()) {
      th->get_is_closed() = true; /* detach: delete when done */
    }
  }
  Object *ret = static_cast<Object *>(new ConstantObject(
      static_cast<Constant *>(new NoneConstant())));
  IR(ret);
  return ret;
}
```

Sunflower intentionally does not provide a hard-kill thread primitive, since forcibly killing threads can cause severe undefined behavior.
# Object-oriented wrapper (Sunflower pseudo-code)
This is the high-level thread module wrapper that a Sunflower program would use:
```
import '_Native_Thread' as client

class Thread
    id = -1

    fun _init (self, fn, args)
        self.id = client.create(fn, args)

    fun run (self)
        if self.id == -1
            return? "Thread has not been initialized or is closed"
        client.run(self.id)

    fun join (self)
        if self.id == -1
            return? "Thread has not been initialized or is closed"
        return client.join(self.id)

    fun _kill (self)
        if self.id != -1
            client.close(self.id)

fun _init (fn, args)
    return Thread(fn, args)

fun join_all ()
    client.join_all()
```

# Observations and next steps
- The worker pool model avoids creating thousands of OS threads and keeps resource usage bounded by M.
- Using a `condition_variable` prevents busy-waiting and reduces CPU waste.
- The current `join()` and `join_all()` implementations busy-wait on a `done` flag. Replacing them with synchronization primitives (a per-handle `condition_variable`, `std::promise`/`std::future`, or similar) would improve efficiency and responsiveness.
- Return value handling from thread functions is intentionally minimal to avoid complex ownership and collection semantics; designing a robust mechanism for returning values (futures/promises or message passing) is a clear next enhancement.
# Closing
This is the first stable threading implementation for Sunflower: a pragmatic, C++-based runtime using a fixed worker pool, condition variables, and careful resource management. There’s room for improvement (better join semantics, richer scheduling, more advanced task stealing, etc.), but the current design addresses the main pain points I encountered early on.
I’ll be documenting more of Sunflower’s language design and runtime internals in a short series on my blog. I’d love to hear feedback, suggestions, or pointers to good resources on language runtime design and concurrency. Cheers!