How Modern Programming Languages Implement Threading, and how I did it
Building a reliable thread runtime for Sunflower
Threads are one of the harder parts of language implementation, both to use correctly and to implement correctly. Writing thread-safe code is tedious and notoriously difficult to debug: race conditions, memory leaks, and subtle undefined behavior are common. Languages with thoughtful concurrency models (Go and its goroutines are a frequent example) make this problem far more manageable.
I’ve had my share of painful bugs while implementing threading in my own language, Sunflower. Below I explain the problems I faced, the design choices I considered, and the working implementation I arrived at.
# Problems I ran into
Some of the hard issues I encountered while implementing threads:
- Tracking down undefined behavior such as random segmentation faults.
- Allocating too many OS threads and causing resource exhaustion / leaks.
- Freeing resources too early and producing dangling pointers.
- A few other embarrassments I’ll spare you the details of.
# What is a thread?
A thread is a lightweight sequence of execution that runs concurrently with other threads and the main program. Each thread has its own stack and program counter and behaves like a small, independent program. For example, in a game one thread might listen for input, another maintain the network connection, and another render frames; they run concurrently and (on multicore hardware) possibly in parallel.
# The naïve approach, and why it fails
My initial approach created one OS thread for every high-level thread requested from Sunflower’s frontend. That works for a few threads but quickly hits system limits: creating thousands of OS threads leads to segmentation faults or poor performance because the OS can’t allocate stack and scheduling resources for so many threads.
This made me study how modern languages handle high numbers of concurrent tasks.
# How other languages solve it (brief)
Go (and several other runtimes) uses an N:M threading model: map N logical jobs (goroutines) onto M OS threads (workers), where M is sized according to hardware concurrency. The runtime schedules many logical tasks on a smaller pool of OS threads and thus avoids the cost and limit of one-to-one OS thread allocation.
# My approach for Sunflower
I implemented a worker pool: M long-lived OS threads (workers) that sleep when idle and wake to execute queued jobs. Workers are driven by a `condition_variable`: when a job is queued, the runtime wakes one worker (or more), and a free worker picks up and executes the job.
This differs from a strict round-robin scheduler: jobs go to whichever free worker wakes up first, which spreads load across the pool without any explicit balancing logic.
# High-level architecture
- A global queue of pending `ThreadHandle` jobs (`q_jobs`).
- M worker threads (`v_workers`) that wait on a `condition_variable`.
- `v_handles` stores active `ThreadHandle` pointers and allows reuse of slots via `idx_avl`.
- API primitives: `create`, `run`, `join`, `join_all`, and `close`.
# Initializer (create worker threads)
```cpp
static std::vector<std::thread> v_workers;

static void worker_loop (); /* defined below */

SF_API void
init_runtime_threads ()
{
  size_t n = std::thread::hardware_concurrency ();
  if (!n)
    n = 4; /* hardware_concurrency() may return 0 when unknown */
  for (size_t i = 0; i < n; ++i)
    v_workers.emplace_back (worker_loop);
}
```

`v_workers` holds the M workers that watch the job queue and execute work.
# Worker loop
```cpp
static void
worker_loop ()
{
  while (1)
    {
      ThreadHandle *th = nullptr;
      {
        std::unique_lock<std::mutex> lock (wl_mutex);
        t_cv.wait (lock, [] { return shutting_down || !q_jobs.empty (); });
        if (shutting_down && q_jobs.empty ())
          return;
        th = q_jobs.front ();
        q_jobs.pop ();
      }
      if (th == nullptr)
        return;
      /* call sunflower function (execute the job) */
      th->get_done () = true;
      if (th->get_is_closed ())
        {
          size_t id = th->get_id ();
          {
            std::lock_guard<std::mutex> lock (cre_mutex);
            delete v_handles[id];
            v_handles[id] = nullptr;
            idx_avl.push_back (id);
          }
        }
      /* local cleanup */
    }
}
```

Key point: the worker uses `std::unique_lock` + `condition_variable::wait(predicate)` to block efficiently until work arrives or shutdown is requested. This avoids busy-waiting and wasted CPU cycles.
# State variables
```cpp
static Vec<ThreadHandle *> v_handles;     /* active handles */
static Vec<size_t> idx_avl;               /* free indices */
static std::mutex cre_mutex;              /* protect create/close operations */
static std::mutex wl_mutex;               /* protect worker loop queue access */
static std::mutex thr_mutex;              /* protect pushing jobs into queue */
static std::condition_variable t_cv;
static std::queue<ThreadHandle *> q_jobs;
static std::vector<std::thread> v_workers;
static bool shutting_down = false;
```

# ThreadHandle (summary)
```cpp
class ThreadHandle
{
private:
  Object *name = nullptr;
  Object *args = nullptr;
  Module *mod = nullptr;
  bool done = false;
  bool is_closed = false;
  size_t id = 0;

public:
  ThreadHandle(Object *_Name, Object *_Args, Module *_Mod)
      : name{_Name}, args{_Args}, mod{_Mod}
  {
    IR(name);
    IR(args);
  }

  Module *&get_mod();
  Object *&get_name();
  Object *&get_args();
  inline bool &get_done();
  inline size_t &get_id();
  inline bool &get_is_closed();
  void run();

  ~ThreadHandle()
  {
    DR(name);
    DR(args);
  }
};
```

(Where IR/DR are the Sunflower runtime reference-management macros.)
# Pushing jobs to the queue
```cpp
void
ThreadHandle::run()
{
  {
    /* Guard the queue with wl_mutex — the same mutex the workers'
       cv.wait uses. Guarding one queue with two different mutexes
       would be a data race. */
    std::lock_guard<std::mutex> lock(wl_mutex);
    q_jobs.push(this);
  }
  t_cv.notify_one();
}
```

run() enqueues the ThreadHandle and signals a worker to wake and pick it up. Notifying after the lock has been released lets the woken worker acquire the mutex immediately instead of blocking on it.
# Sunflower _Native_Thread API
create
Creates a ThreadHandle and stores it in v_handles, reusing indices from idx_avl when possible:
```cpp
SF_API Object *
create(Module *mod)
{
  std::lock_guard<std::mutex> lock(cre_mutex);
  Object *o_fname = mod->get_variable("fname");
  Object *o_fargs = mod->get_variable("fargs");
  assert(o_fname->get_type() == ObjectType::FuncObject);
  assert(o_fargs->get_type() == ObjectType::ArrayObj);
  ThreadHandle *th = new ThreadHandle(
      o_fname, o_fargs,
      static_cast<FunctionObject *>(o_fname)->get_v()->get_parent());
  size_t idx;
  if (idx_avl.get_size()) {
    size_t p = idx_avl.pop_back();
    v_handles[p] = th;
    idx = p;
  } else {
    idx = v_handles.get_size();
    v_handles.push_back(th);
  }
  th->get_id() = idx;
  Object *ret = static_cast<Object *>(new ConstantObject(
      static_cast<Constant *>(new IntegerConstant(static_cast<int>(idx)))));
  IR(ret);
  return ret;
}
```

run
Starts a previously created thread by enqueueing its handle:
```cpp
SF_API Object *
run(Module *mod)
{
  Object *o_id = mod->get_variable("id");
  assert(OBJ_IS_INT(o_id));
  size_t id = static_cast<size_t>(
      static_cast<IntegerConstant *>(
          static_cast<ConstantObject *>(o_id)->get_c().get())->get_value());
  assert(id < v_handles.get_size());
  v_handles[id]->run();
  Object *ret = static_cast<Object *>(new ConstantObject(
      static_cast<Constant *>(new NoneConstant())));
  IR(ret);
  return ret;
}
```

join
Waits until the single thread has finished. (This implementation busy-waits on the done flag; it’s a simple workaround given detached execution by default.)
```cpp
SF_API Object *
join(Module *mod)
{
  Object *o_id = mod->get_variable("id");
  assert(OBJ_IS_INT(o_id));
  size_t id = static_cast<size_t>(
      static_cast<IntegerConstant *>(
          static_cast<ConstantObject *>(o_id)->get_c().get())->get_value());
  assert(id < v_handles.get_size());
  ThreadHandle *&th = v_handles[id];
  while (!th->get_done())
    ;
  Object *ret = static_cast<Object *>(new ConstantObject(
      static_cast<Constant *>(new NoneConstant())));
  IR(ret);
  return ret;
}
```

Note: the busy-wait above is functional but not optimal, and because `done` is a plain `bool` written by a worker thread, the unsynchronized spin is technically a data race (the flag should at least be `std::atomic<bool>`). A better approach is a per-handle `condition_variable` or a future/promise, so `join()` can block efficiently without spinning.
join_all
Waits for all outstanding threads:
```cpp
SF_API Object *
join_all(Module *mod)
{
  for (ThreadHandle *&th : v_handles) {
    if (th == nullptr)
      continue;
    while (!th->get_done())
      ;
  }
  Object *ret = static_cast<Object *>(new ConstantObject(
      static_cast<Constant *>(new NoneConstant())));
  IR(ret);
  return ret;
}
```

close
Closes a handle, deleting it if finished or marking it for deletion (detached) if it’s still running:
```cpp
SF_API Object *
close(Module *mod)
{
  std::lock_guard<std::mutex> lock(cre_mutex);
  Object *o_id = mod->get_variable("id");
  assert(OBJ_IS_INT(o_id));
  size_t id = static_cast<size_t>(
      static_cast<IntegerConstant *>(
          static_cast<ConstantObject *>(o_id)->get_c().get())->get_value());
  /* bounds-check id before indexing into v_handles */
  if (id < v_handles.get_size() && v_handles[id] != nullptr) {
    ThreadHandle *&th = v_handles[id];
    if (th->get_done() && !th->get_is_closed()) {
      delete th;
      th = nullptr;
      idx_avl.push_back(id);
    } else if (!th->get_done()) {
      th->get_is_closed() = true; /* detach: delete when done */
    }
  }
  Object *ret = static_cast<Object *>(new ConstantObject(
      static_cast<Constant *>(new NoneConstant())));
  IR(ret);
  return ret;
}
```

Sunflower intentionally does not provide a hard-kill thread primitive, since forcibly killing threads can cause severe undefined behavior.
# Object-oriented wrapper (Sunflower pseudo-code)
This is the high-level thread module wrapper that a Sunflower program would use:
```
import '_Native_Thread' as client

class Thread
    id = -1

    fun _init (self, fn, args)
        self.id = client.create(fn, args)

    fun run (self)
        if self.id == -1
            return? "Thread has not been initialized or is closed"
        client.run(self.id)

    fun join (self)
        if self.id == -1
            return? "Thread has not been initialized or is closed"
        return client.join(self.id)

    fun _kill (self)
        if self.id != -1
            client.close(self.id)

fun _init (fn, args)
    return Thread(fn, args)

fun join_all ()
    client.join_all()
```

# Observations and next steps
- The worker pool model avoids creating thousands of OS threads and keeps resource usage bounded by M.
- Using a `condition_variable` prevents busy-waiting and reduces CPU waste.
- The current `join()` and `join_all()` implementations busy-wait on a `done` flag. Replacing them with synchronization primitives (a per-handle `condition_variable`, `std::promise`/`std::future`, or similar) would improve efficiency and responsiveness.
- Return value handling from thread functions is intentionally minimal to avoid complex ownership and collection semantics; designing a robust mechanism for returning values (futures/promises or message passing) is a clear next enhancement.
# Closing
This is the first stable threading implementation for Sunflower: a pragmatic, C++-based runtime using a fixed worker pool, condition variables, and careful resource management. There’s room for improvement (better join semantics, richer scheduling, more advanced task stealing, etc.), but the current design addresses the main pain points I encountered early on.
I’ll be documenting more of Sunflower’s language design and runtime internals in a short series on my blog. I’d love to hear feedback, suggestions, or pointers to good resources on language runtime design and concurrency. Cheers!