Why std::this_thread::sleep_for() is broken on ESP32

20 Feb 2025 by Steve Noonan

A curious bug appearing after upgrading to IDF v5 led me into a deep dive of how std::this_thread::sleep_for() is implemented on the ESP32. I discuss how the IDF implements pthreads and newlib to provide C++ threading functionality. The results are surprising: a simple 10 millisecond sleep was killing performance, but only in the new version of IDF due to an interaction between libstdc++ and usleep().


Why so slow IDF v5?
FreeRTOS on the ESP32
- System Tick Period
- Small Sleeps and Delays
C++: the Solution to and Cause of All My Problems
IDF Newlib
Would You Like Precision or Efficiency?
Is usleep() broken in IDF v5?
Conclusion

Why so slow IDF v5?

One of my ESP32 projects was at version v4.4 of Espressif’s IoT Development Framework (IDF). After about two days of work, the project compiled and ran in IDF v5.3. Except, something was not quite right. Things were sluggish. FreeRTOS’s system state API indicated that one of the two Xtensa LX6 processors on the ESP32 was pinned at 100% usage. The culprits were two high-priority threads, both of which had a relatively small loop with a 10 millisecond sleep between each iteration.

Sleeping for 10 milliseconds should leave plenty of time to do other things. Plus, this code worked fine in IDF v4 without an issue. What did IDF v5 do?

FreeRTOS on the ESP32

Let’s go into some background first. Experienced embedded engineers will likely find parts of this section to be review, but we need to be solid in our understanding of the fundamentals before moving on.

System Tick Period

My project is set to the default ESP32 FreeRTOS tick period of 10 milliseconds, represented by the portTICK_PERIOD_MS macro:

#define CONFIG_FREERTOS_HZ 100 /* sdkconfig.h */
#define configTICK_RATE_HZ CONFIG_FREERTOS_HZ /* FreeRTOSConfig.h */
#define portTICK_PERIOD_MS ((uint32_t)1000 / configTICK_RATE_HZ) /* portmacro.h */

This is used by the ESP32 to set up the system tick (systick) interrupt, done by vSystimerSetup() in components\freertos\port_systick.c. Notably, there are two system ticks because there are two cores. Here is the interesting part done for each core:

/* configure the timer */
uint32_t alarm_id = SYSTIMER_ALARM_OS_TICK_CORE0 + cpuid;
systimer_hal_connect_alarm_counter(&systimer_hal, alarm_id, SYSTIMER_COUNTER_OS_TICK);
systimer_hal_set_alarm_period(&systimer_hal, alarm_id, 1000000UL / CONFIG_FREERTOS_HZ);
systimer_hal_select_alarm_mode(&systimer_hal, alarm_id, SYSTIMER_ALARM_MODE_PERIOD);

Obviously, by “sleeping” a thread for a period of time, we want to yield execution to other threads. This is accomplished by the FreeRTOS scheduler. The basic unit of the scheduler is the system tick period. Other things can occur based on interrupts as well, but in terms of sleeping threads, the system tick period is what really matters.

ESP32’s FreeRTOS scheduler is running a Best Effort Round Robin time slicing algorithm. This is slightly different than the standard FreeRTOS algorithm (“best effort” instead of perfect) due to symmetric-multiprocessing. The important bit to note here is even when a thread does not explicitly yield, the scheduler will time slice amongst the highest priority ready-state threads on the system tick period intervals.

Small Sleeps and Delays

The following code yields the minimal amount of time:

vTaskDelay(1);

This will guarantee the thread will sleep at least until the next system tick interrupt.

But when is the next system tick interrupt in relation to the execution of vTaskDelay(1)? It could be in 1 nanosecond, all the way up to our system tick period of 10 milliseconds. Exactly how long the thread will sleep is unknown to this code.

So to be clear: sleeping “1 tick” does not mean it will sleep for one tick period of time. In fact, it will always sleep some amount less than one tick period, and sometimes it will not sleep at all.
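To make the timing concrete, here is a minimal host-side sketch of that behavior. It is a model, not FreeRTOS code: it assumes tick interrupts fire at exact multiples of the tick period and ignores scheduling latency. The names modeled_wake_us() and modeled_sleep_us() are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model (an assumption of this sketch, not FreeRTOS code): tick
// interrupts fire at exact multiples of tick_us and latency is ignored.
// Returns the time, in microseconds, at which a task that called
// vTaskDelay(ticks) at time call_us becomes ready again.
inline std::int64_t modeled_wake_us(std::int64_t call_us, std::int64_t ticks,
                                    std::int64_t tick_us)
{
    assert(call_us >= 0 && ticks >= 1 && tick_us > 0);
    // The first tick counted is the next boundary strictly after the call.
    const std::int64_t next_tick_us = (call_us / tick_us + 1) * tick_us;
    // Every further tick adds one full period.
    return next_tick_us + (ticks - 1) * tick_us;
}

// Time actually spent blocked under the same model.
inline std::int64_t modeled_sleep_us(std::int64_t call_us, std::int64_t ticks,
                                     std::int64_t tick_us)
{
    return modeled_wake_us(call_us, ticks, tick_us) - call_us;
}
```

With a 10,000 us tick, a vTaskDelay(1) issued 2,500 us after a tick blocks for 7,500 us in this model, and one issued 9,999 us after a tick blocks for 1 us.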

So then, is calling vTaskDelay(1) in a small loop a problem? Not intrinsically. Like most things, it’s fine if you know what you are doing. If you want to ensure a thread (technically called a Task in FreeRTOS) executes its logic on every tick, this is a way to do it. There are no problems so long as the logic is not sensitive to tick jitter and its processing only requires a small fraction of the tick period.

We could attempt to be more idiomatic by using code like this:

vTaskDelay(pdMS_TO_TICKS(10));

The macro pdMS_TO_TICKS() converts milliseconds into ticks by accounting for the FreeRTOS configuration tick period. If the period changes or, god forbid, the code is copied and pasted into a different project, the proper tick count would be calculated. If our tick period is 10 milliseconds it will call vTaskDelay(1). All is great.

Except not always. Say we want to sleep 9 milliseconds. If we write vTaskDelay(pdMS_TO_TICKS(9)) the macro will actually call vTaskDelay(0). This is because it does integer math:

typedef uint32_t TickType_t;
#define pdMS_TO_TICKS( xTimeInMs )    ( ( TickType_t ) ( ( ( TickType_t ) ( xTimeInMs ) * ( TickType_t ) configTICK_RATE_HZ ) / ( TickType_t ) 1000U ) )

So (9*100)/1000 = 0.9, but as a uint32_t, just 0. This is certainly not what we want, because passing 0 to vTaskDelay() does nothing at all.
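The truncation is easy to reproduce on a host machine. The sketch below mirrors the macro’s truncating division as a plain function so the tick rate can be passed in. TickType is a stand-in for TickType_t, the 64-bit intermediate is an addition of this sketch to sidestep overflow, and ms_to_ticks_rounding_up() is a hypothetical alternative, not a FreeRTOS API.

```cpp
#include <cassert>
#include <cstdint>

using TickType = std::uint32_t;  // stand-in for FreeRTOS's TickType_t

// Same truncating division as pdMS_TO_TICKS(), written as a function.
// The 64-bit intermediate is an addition of this sketch.
inline TickType ms_to_ticks_truncating(std::uint32_t ms, std::uint32_t tick_rate_hz)
{
    assert(tick_rate_hz > 0);
    return static_cast<TickType>(
        (static_cast<std::uint64_t>(ms) * tick_rate_hz) / 1000U);
}

// Hypothetical alternative: round up, so any non-zero request maps to at
// least one tick instead of silently becoming vTaskDelay(0).
inline TickType ms_to_ticks_rounding_up(std::uint32_t ms, std::uint32_t tick_rate_hz)
{
    assert(tick_rate_hz > 0);
    return static_cast<TickType>(
        (static_cast<std::uint64_t>(ms) * tick_rate_hz + 999U) / 1000U);
}
```

At 100 Hz, ms_to_ticks_truncating(9, 100) is 0 while ms_to_ticks_rounding_up(9, 100) is 1. Rounding up only avoids the zero-tick case; it does not change the fact that the first tick may arrive almost immediately.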

Okay, but what if we really need to wait for less than a tick period? We can busy wait. The IDF has esp_rom_delay_us(), which does just that. But this is an Internal API which we don’t have source code for, and its functionality may change or be removed in the future. ESP does give us usleep() though, and that allows precise delays. The caveat here is that the FreeRTOS scheduler is still free to context switch out of the busy loop due to the important bit I said above about time slicing. If you must absolutely guarantee timing, then you must disable the FreeRTOS scheduler (and perhaps all interrupts) for that period.

The other option is to use a free running timer and trigger an interrupt when it expires. The interrupt can signal FreeRTOS to wake the waiting task. This is great as long as your context switch overheads allow it and you have a free hardware timer to do it.

All of this is to say the following:

If you care about precision timing, details matter.
If you don’t care about precision timing but are using it, you are probably wasting resources.

C++: the Solution to and Cause of All My Problems

Alright then, so what does all this have to do with our issue related to a 10 millisecond delay? Well, I am not actually using vTaskDelay(1) directly. Instead, I am using C++ thread APIs, and therefore, the actual code causing all the CPU usage is std::this_thread::sleep_for(10ms).

Espressif provides good support for C++, including std::thread. This allows us to write portable C++ code. But it’s also where our problem lies. So how does something like an ESP32 MCU support C++ threads anyway?

`std::thread`, POSIX Threads, and FreeRTOS Tasks, oh my!

Espressif’s IDF compiles with GCC and therefore uses The GNU C++ Library (libstdc++). libstdc++’s implementation of std::thread is built on top of pthreads, but it is up to the platform to provide a pthread implementation. Espressif did just that, including things like thread local storage.

If you read the code you’ll see that IDF’s pthreads is using FreeRTOS APIs for thread creation, synchronization, and all functionality found within std::this_thread, such as sleep_for().

If that’s not enough, we also need to consider the C standard library. IDF uses newlib and has custom platform implementations of its sleeping functions. Thankfully, newlib is not precompiled and you can easily find its source code in the components/newlib directory of an IDF install.

My problem appears in IDF v5 because it upgraded GCC from 8.4 to 13.2, and libstdc++’s implementation of std::this_thread::sleep_for() changed. But is libstdc++ really to blame? Let’s see.

IDF 4.x: `std::this_thread::sleep_for()`

Let’s look at the working side of things first. When calling std::this_thread::sleep_for(10ms) we first enter here:

/// sleep_for
template<typename _Rep, typename _Period>
inline void
sleep_for(const chrono::duration<_Rep, _Period>& __rtime)
{
  if (__rtime <= __rtime.zero())
    return;
  auto __s = chrono::duration_cast<chrono::seconds>(__rtime);
  auto __ns = chrono::duration_cast<chrono::nanoseconds>(__rtime - __s);
#ifdef _GLIBCXX_USE_NANOSLEEP
  __gthread_time_t __ts =
    {
      static_cast<std::time_t>(__s.count()),
      static_cast<long>(__ns.count())
    };
  while (::nanosleep(&__ts, &__ts) == -1 && errno == EINTR)
    { }
#else
  __sleep_for(__s, __ns);
#endif
}

If you are following along in your own IDF project, you can find the std::this_thread::sleep_for() implementation in the ESP tools directory under xtensa-esp32-elf/esp-2021r2-patch5-8.4.0/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/8.4.0/thread (or similar, depending on the version).

IDF does not define _GLIBCXX_USE_NANOSLEEP, so no nanosleep(). That shouldn’t be too surprising for an MCU implementation. Therefore, we go straight to __sleep_for(__s, __ns).

Here is GCC 8.4 C++11 libstdc++’s code for __sleep_for():

namespace this_thread
{
  void
  __sleep_for(chrono::seconds __s, chrono::nanoseconds __ns)
  {
#ifdef _GLIBCXX_USE_NANOSLEEP
    __gthread_time_t __ts =
      {
        static_cast<std::time_t>(__s.count()),
        static_cast<long>(__ns.count())
      };
    while (::nanosleep(&__ts, &__ts) == -1 && errno == EINTR)
      { }
#elif defined(_GLIBCXX_HAVE_SLEEP)
# ifdef _GLIBCXX_HAVE_USLEEP
    ::sleep(__s.count());
    if (__ns.count() > 0)
      {
        long __us = __ns.count() / 1000;
        if (__us == 0)
          __us = 1;
        ::usleep(__us);
      }
# else
    ::sleep(__s.count() + (__ns.count() >= 1000000));
# endif
#elif defined(_GLIBCXX_HAVE_WIN32_SLEEP)
    unsigned long ms = __ns.count() / 1000000;
    if (__ns.count() > 0 && ms == 0)
      ms = 1;
    ::Sleep(chrono::milliseconds(__s).count() + ms);
#endif
  }
}

_GLIBCXX_END_NAMESPACE_VERSION
} // namespace std

Let’s clean it up by taking out the parts not compiled for the ESP32:

 __sleep_for(chrono::seconds __s, chrono::nanoseconds __ns)
  {
    ::sleep(__s.count());
    if (__ns.count() > 0)
      {
        long __us = __ns.count() / 1000;
        if (__us == 0)
          __us = 1;
        ::usleep(__us);
      }
  }

Pretty simple: get the number of seconds and call sleep(), then get the number of microseconds and call usleep(). Both of these are calls into newlib. We will get to those in a moment. First, let’s look at the IDF v5’s sleep_for().

IDF 5.x: `std::this_thread::sleep_for()`

Once again, sleep_for() is a wrapper for __sleep_for(). Here is the newer code (see online):

__sleep_for(chrono::seconds __s, chrono::nanoseconds __ns)
  {
    const auto target = chrono::steady_clock::now() + __s + __ns;
    while (true)
      {
        unsigned secs = __s.count();
        if (__ns.count() > 0)
          {
            long us = __ns.count() / 1000;
            if (us == 0)
              us = 1;
            ::usleep(us);
          }
        if (secs > 0)
          {
            // Sleep in a loop to handle interruption by signals:
            while ((secs = ::sleep(secs)))
              { }
          }
        const auto now = chrono::steady_clock::now();
        if (now >= target)
          break;
        __s = chrono::duration_cast<chrono::seconds>(target - now);
        __ns = chrono::duration_cast<chrono::nanoseconds>(target - (now + __s));
    }
  }

Now we have a while loop. In it, usleep() is called, but afterward, there is an explicit check to determine if the minimal amount of time has actually gone by. The only way out of the loop is if now is at or after the target time.

Why the change? This is because sleep() and usleep() may return early if a signal is delivered to the thread during its sleep. The change is in commit cfef4c3, with the relevant commit comment as the following:

Loop while sleep call is interrupted and until steady_clock
 shows requested duration has elapsed.

ESP IDF doesn’t support signals, so the change should have no impact on the ESP32. Oh, but it does!

IDF Newlib

A few things to unpack here which lead us into Newlib. Let’s start with how chrono::steady_clock::now() works.

`clock_gettime()`

Calling chrono::steady_clock::now() on the ESP32 is just a wrapper for clock_gettime() located in newlib (idf/components/newlib/time.c), where clock_id is CLOCK_MONOTONIC.

int clock_gettime(clockid_t clock_id, struct timespec *tp)
{
#if IMPL_NEWLIB_TIME_FUNCS
    if (tp == NULL) {
        errno = EINVAL;
        return -1;
    }
    struct timeval tv;
    uint64_t monotonic_time_us = 0;
    switch (clock_id) {
    case CLOCK_REALTIME:
        _gettimeofday_r(NULL, &tv, NULL);
        tp->tv_sec = tv.tv_sec;
        tp->tv_nsec = tv.tv_usec * 1000L;
        break;
    case CLOCK_MONOTONIC:
        monotonic_time_us = esp_time_impl_get_time();
        tp->tv_sec = monotonic_time_us / 1000000LL;
        tp->tv_nsec = (monotonic_time_us % 1000000LL) * 1000L;
        break;
    default:
        errno = EINVAL;
        return -1;
    }
    return 0;
#else
    errno = ENOSYS;
    return -1;
#endif
}

Calling esp_time_impl_get_time() gets a value based on a free-running hardware timer with at least 1 us accuracy. So that is how the monotonic clock is implemented on the ESP32. Note that this clock is completely independent of the timer that is controlling the FreeRTOS system tick interrupt. No problems there.
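Because the monotonic clock is independent of the tick, it is also a handy way to check what a sleep actually did. Here is a small portable sketch that times a single std::this_thread::sleep_for() call with steady_clock. The helper name measure_sleep() is made up for this example, and I am assuming the toolchain provides both std::chrono and std::this_thread.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Returns how long std::this_thread::sleep_for(requested) actually took,
// measured with the monotonic clock and truncated to whole microseconds.
template <typename Rep, typename Period>
std::chrono::microseconds
measure_sleep(const std::chrono::duration<Rep, Period>& requested)
{
    assert(requested >= requested.zero());
    const auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(requested);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
}
```

Calling measure_sleep(std::chrono::milliseconds(10)) in a loop and logging the result shows the elapsed wall time, but not how much of it was spent busy waiting. For that, the FreeRTOS run-time statistics are still needed.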

`sleep()`

ESP32 implements sleep() as a simple wrapper of usleep():

unsigned int sleep(unsigned int seconds)
{
    usleep(seconds * 1000000UL);
    return 0;
}

`usleep()`

Okay, so in both versions of IDF, usleep() is at the bottom of all sleeping calls. Here it is:

int usleep(useconds_t us)
{
    const int64_t us_per_tick = portTICK_PERIOD_MS * 1000;
    if (us < us_per_tick) {
        esp_rom_delay_us((uint32_t) us);
    } else {
        /* since vTaskDelay(1) blocks for anywhere between 0 and portTICK_PERIOD_MS,
         * round up to compensate.
         */
        vTaskDelay((us + us_per_tick - 1) / us_per_tick);
    }
    return 0;
}

Remember when I said that esp_rom_delay_us() is an internal API and we should use usleep() instead? Well, usleep() is just a wrapper to esp_rom_delay_us() for short periods (e.g. less than one tick period). We can consider usleep() the “public” exposure of esp_rom_delay_us(), but only when the specified time is less than a system tick period. As mentioned above, this is a busy wait, and since it does not disable the scheduler, it still allows other threads of equal or higher priority to run. So, the timing represents a guaranteed minimum only. More importantly, if there are other threads of lower priority, it will not context switch during this busy time. It will just sit in the thread until the wait is over.

This is all good. A guaranteed minimum is how I expect usleep() to work.

In terms of blocking vs yielding, IDF docs say:

When calling a standard libc or C++ sleep function, such as usleep defined in unistd.h, the task will only block and yield the core if the sleep time is longer than one FreeRTOS tick period. If the time is shorter, the thread will busy-wait instead of yielding to another RTOS task.

Strictly speaking, the docs should say that the task yields when the sleep time is equal to or longer than one tick period, and busy-waits otherwise. In any case, the yielding is done via vTaskDelay().

There is a problem here though. The tick calculation often yields for less time than was requested.

Let’s play out an example. If we wanted to sleep for 15 milliseconds, the calculations would give us vTaskDelay(2):

portTICK_PERIOD_MS is defined as 10. So us_per_tick is 10,000.
us is 15000
(15000 + 10000 - 1) / 10000 is 24999 / 10000, which gives 2.4999
But since we are using integers, it is just 2

But what does 2 mean? It means two occurrences of the tick interrupt. Once again, we can ask: when is the next tick interrupt from the moment that vTaskDelay(2) is called? The tick interrupt could be in 1 nanosecond all the way up to 10 milliseconds from now. After that, the second tick will arrive one full system tick period of 10 milliseconds later. So for a tick count of 2, we will actually sleep between 10 and 20 milliseconds.

Even though the comment says it is rounding up to compensate for the first tick potentially not blocking at all, the compensation does not account for the worst-case minimal timing. In the example I gave, a 15-millisecond request will sometimes only sleep for 10 milliseconds. Likewise, a 10 millisecond usleep() will sometimes sleep about 0 milliseconds. The greatest potential differential comes with calling usleep() with a multiple of the tick period. In that case, the time spent may be short by an entire tick period.
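The same arithmetic can be written as a small host-side model. It is a sketch that assumes ideal tick timing, and the SleepBounds struct and modeled_usleep_bounds() function are names invented here. It only covers the yielding branch of usleep(), so the request must be at least one tick period.

```cpp
#include <cassert>
#include <cstdint>

struct SleepBounds {
    std::int64_t ticks;   // value that would be passed to vTaskDelay()
    std::int64_t min_us;  // lower limit of the block time (first tick fires at once)
    std::int64_t max_us;  // upper limit of the block time (first tick is a full period away)
};

// Model of the yielding branch of IDF's usleep(), assuming ideal tick timing.
inline SleepBounds modeled_usleep_bounds(std::int64_t us, std::int64_t us_per_tick)
{
    assert(us_per_tick > 0 && us >= us_per_tick);
    // Same round-up as in usleep().
    const std::int64_t ticks = (us + us_per_tick - 1) / us_per_tick;
    // Only (ticks - 1) full periods are guaranteed, because the first tick
    // may arrive almost immediately after the call.
    return {ticks, (ticks - 1) * us_per_tick, ticks * us_per_tick};
}
```

For a 15,000 us request this gives 2 ticks and a range of 10,000 to 20,000 us. For a 10,000 us request it gives 1 tick and a range of 0 to 10,000 us, which is the full-tick shortfall described above.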

Whether this “shorter than specified” time matters depends on requirements. In my experience, a lot of application-level logic would probably be fine with a “10 to 20” millisecond wait. On the other hand, hardware driver logic and audio pipelines have specific realtime characteristics where this would absolutely be a problem.

According to man 3 usleep and POSIX, usleep() should always sleep at least the time specified. It is allowed to sleep more if needed.

The usleep() function shall cause the calling thread to be suspended from execution until either the number of realtime microseconds specified by the argument useconds has elapsed or a signal is delivered to the calling thread and its action is to invoke a signal-catching function or to terminate the process. The suspension time may be longer than requested due to the scheduling of other activity by the system.

The fact that ESP32’s IDF usleep() can be short by up to one system tick period means it doesn’t follow the specifications.

Would You Like Precision or Efficiency?

Let’s recap.

In IDF v4, calling std::this_thread::sleep_for() calls usleep() once. When std::this_thread::sleep_for(10ms) is executed, it calls vTaskDelay(1), and the thread will sleep between 0 and 10 milliseconds. It will usually sleep for less than the time specified.

In IDF v5, calling std::this_thread::sleep_for(10ms) almost always calls usleep() twice. The first time will use vTaskDelay(1), and it will usually sleep for less than the time specified. Then, back in libstdc++ __sleep_for(), the monotonic clock will be checked and it will be seen that some fractional component of 10 milliseconds remains, causing a second call to usleep(). The second call will always be with a time less than the system tick period. This will cause a blocking call to esp_rom_delay_us(). That blocking call is the problem.

Let’s break it down further. For a 10 millisecond delay:

(10000 + 10000 - 1) / 10000 is 19999 / 10000, which gives 1.9999
But since we are using integers, it is just 1

But once again, saying vTaskDelay(1) means “wait for 1 tick interrupt.” That will almost always be less than 10 milliseconds from now. When vTaskDelay(1) returns, somewhere between 0.000001 and 10.000000 milliseconds will have actually transpired. The newer version of __sleep_for() in IDF v5 / GCC 13 will double-check that the correct time has actually elapsed according to the monotonic clock. That check will fail, so it calls usleep() again, this time with the remainder of the 10 milliseconds. The specified time is less than a system tick period, so the blocking esp_rom_delay_us() is now called.
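The sequence above can be played out in a host-side model. This is a sketch under assumptions, not the real code: ticks fire at exact multiples of the period, code execution takes zero time, and only requests shorter than one second are modeled, so the sleep() branch never runs. SleepTrace, modeled_usleep(), and modeled_sleep_for_v5() are names invented for the illustration.

```cpp
#include <cassert>
#include <cstdint>

struct SleepTrace {
    std::int64_t yielded_us = 0;  // time blocked in vTaskDelay()
    std::int64_t busy_us = 0;     // time spun in esp_rom_delay_us()
    int usleep_calls = 0;
};

// Model of IDF's usleep(): advances now_us and records how the time was spent.
inline void modeled_usleep(std::int64_t us, std::int64_t us_per_tick,
                           std::int64_t& now_us, SleepTrace& trace)
{
    ++trace.usleep_calls;
    if (us < us_per_tick) {
        // Short request: busy wait for exactly the requested time.
        trace.busy_us += us;
        now_us += us;
    } else {
        // Long request: round up to ticks, wake on the ticks-th boundary
        // strictly after the call.
        const std::int64_t ticks = (us + us_per_tick - 1) / us_per_tick;
        const std::int64_t wake_us = (now_us / us_per_tick + ticks) * us_per_tick;
        trace.yielded_us += wake_us - now_us;
        now_us = wake_us;
    }
}

// Model of the newer __sleep_for() loop for requests under one second:
// keep calling usleep() with the remainder until the target time is reached.
inline SleepTrace modeled_sleep_for_v5(std::int64_t request_us,
                                       std::int64_t start_us,
                                       std::int64_t us_per_tick)
{
    assert(request_us > 0 && start_us >= 0 && us_per_tick > 0);
    SleepTrace trace;
    std::int64_t now_us = start_us;
    const std::int64_t target_us = start_us + request_us;
    while (now_us < target_us) {
        modeled_usleep(target_us - now_us, us_per_tick, now_us, trace);
    }
    return trace;
}
```

In this model, a 10,000 us request made 2,500 us after a tick yields for 7,500 us and then busy waits for the remaining 2,500 us. A request made 9,999 us after a tick yields for 1 us and spins for 9,999 us. Only a request that lands exactly on a tick boundary avoids the second usleep() call.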

So what about time slicing? Even if esp_rom_delay_us() blocks, the FreeRTOS scheduler can switch to another task. Firstly, if this thread is of a higher priority, no lower priorities will ever run. But even if everything is of the same priority, the CPU will just switch back to the blocking call on the next round robin, continuing the blocking wait. In our current scenario, this is horribly inefficient, unnecessary, and unexpected.

Any call to sleep_for() with a duration at or above the tick period has this problem because the tick interrupt is asynchronous to the sleep_for() call. This means that when the scheduler returns from vTaskDelay(), some random remainder of time will be spent in esp_rom_delay_us() in order to sleep for the precise amount of time requested.

The new version of sleep_for() is much more precise, but it is at the cost of computing efficiency on the ESP32 because some fraction of the tick period will be busy waited instead of yielded. That is very bad to do on an MCU.

Of course, none of this is transparent to the application code, and I doubt it was something intentional from Espressif. It is just a consequence of upgrading to a new version of GCC combined with how IDF implements usleep().

Is `usleep()` broken in IDF v5?

Did Espressif actually implement usleep() wrong? Yes. It needs to be fixed.

For periods at or longer than the system tick, usleep() can return before the specified time. It shouldn’t do that. It must err on the side of sleeping too long to ensure it never sleeps too little. So yes, it is broken in my view. libstdc++ isn’t to blame.

Since usleep() is sometimes short by 1 system tick period, we could just add another tick count to our calculations. That would sometimes cause usleep() to take longer, but never shorter than the specified time. This would prevent __sleep_for() from calling usleep() a second time with a fraction of a system tick period, so no more blocking busy wait.
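As a rough sketch of that simpler option, the tick count can be computed with one extra tick and checked against the guaranteed minimum. This is a host-side illustration with made-up names (ticks_with_margin() and guaranteed_min_us()), assuming ideal tick timing. It is not the change I proposed to IDF.

```cpp
#include <cassert>
#include <cstdint>

// Round up to whole ticks as usleep() does today, then add one more tick to
// cover the partial first tick.
inline std::int64_t ticks_with_margin(std::int64_t us, std::int64_t us_per_tick)
{
    assert(us > 0 && us_per_tick > 0);
    return (us + us_per_tick - 1) / us_per_tick + 1;
}

// Shortest possible block time for a given tick count: the first tick may
// fire immediately, so only (ticks - 1) full periods are guaranteed.
inline std::int64_t guaranteed_min_us(std::int64_t ticks, std::int64_t us_per_tick)
{
    assert(ticks >= 1 && us_per_tick > 0);
    return (ticks - 1) * us_per_tick;
}
```

With a 10,000 us tick, a 10,000 us request becomes 2 ticks with a guaranteed minimum of 10,000 us, and a 15,000 us request becomes 3 ticks with a guaranteed minimum of 20,000 us. The cost is that every sleep in this branch can run up to one extra tick period longer than strictly necessary.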

We can be a little more sophisticated by taking a cue from the newer __sleep_for() and keep checking the monotonic clock. Something like this:

int usleep(useconds_t us)
{
    const int64_t us_per_tick = portTICK_PERIOD_MS * 1000;
    if (us < us_per_tick) {
        esp_rom_delay_us((uint32_t) us);
    } else {
        /* vTaskDelay may return up to (n-1) tick periods due to the tick ISR
           being asynchronous to the call. We must sleep at least the specified
           time, or longer. Checking the monotonic clock allows making an
           additional call to vTaskDelay when needed to ensure minimal time is
           actually slept. Adding `us_per_tick - 1` prevents ever passing 0 to
           vTaskDelay().
        */
        uint64_t now_us = esp_time_impl_get_time();
        uint64_t target_us = now_us + us;
        do {
            vTaskDelay((((target_us - now_us) + us_per_tick - 1) / us_per_tick));
            now_us = esp_time_impl_get_time();
        } while (now_us < target_us);
    }
    return 0;
}

With this approach, short sleep times are precise (as they always were), and longer sleep times are non-blocking and always as long or longer than the requested time. By testing the monotonic clock, the extra tick is only applied when needed.
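To sanity check the loop logic without hardware, here is a host-side model of the yielding branch of the patched function. It assumes ticks fire at exact multiples of the period and that code execution takes zero time. PatchedResult and modeled_patched_usleep() are names invented for this sketch.

```cpp
#include <cassert>
#include <cstdint>

struct PatchedResult {
    std::int64_t elapsed_us;  // total time blocked
    int delay_calls;          // number of vTaskDelay() calls made
};

// Model of the yielding branch of the patched usleep(): delay for the
// rounded-up remainder, then re-check the clock until the target is reached.
inline PatchedResult modeled_patched_usleep(std::int64_t us, std::int64_t start_us,
                                            std::int64_t us_per_tick)
{
    assert(us_per_tick > 0 && us >= us_per_tick && start_us >= 0);
    std::int64_t now_us = start_us;
    const std::int64_t target_us = start_us + us;
    int calls = 0;
    do {
        const std::int64_t ticks =
            ((target_us - now_us) + us_per_tick - 1) / us_per_tick;
        // Wake on the ticks-th tick boundary strictly after the call.
        now_us = (now_us / us_per_tick + ticks) * us_per_tick;
        ++calls;
    } while (now_us < target_us);
    return {now_us - start_us, calls};
}
```

In this model, a 10,000 us request made 2,500 us after a tick takes two vTaskDelay(1) calls and returns after 17,500 us, with no busy waiting. A request that lands exactly on a tick boundary returns after one call and exactly 10,000 us.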

I opened a PR with this change to IDF. Hopefully it gets approved.

Since I didn’t want to work with a custom, patched IDF, I replaced all calls to std::this_thread::sleep_for() with our own function. It has the same default signature, with the option to specify a “sleep strategy.” We can now force the custom sleep_for() to busy wait or to yield:

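#include <chrono>
#include <cstdint>
#include <unistd.h>            // usleep()
#include "freertos/FreeRTOS.h" // portTICK_PERIOD_MS, TickType_t
#include "freertos/task.h"     // vTaskDelay()
#include "esp_timer.h"         // esp_timer_get_time()
#include "esp_rom_sys.h"       // esp_rom_delay_us()

enum class SleepStrategy {
    Default,            // Platform decides when to use busy wait vs. yield
    PreciseBusyWait,    // Busy wait, which is usually very precise
    EfficientYield      // Efficiently yield to other threads, but will often sleep longer than specified
};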

template<typename _Rep, typename _Period>
static void sleep_for(const std::chrono::duration<_Rep, _Period>& rtime, SleepStrategy strat = SleepStrategy::Default) {

    static constexpr int64_t us_per_tick = portTICK_PERIOD_MS * 1000;
    if (rtime <= rtime.zero()) return;
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(rtime).count();

    if (strat == SleepStrategy::Default && (us < us_per_tick)) {
        // Mimic how std::this_thread::sleep_for() would act, which is to
        // (eventually) call usleep() after going through libstdc++. We only do
        // this for periods less than a tick because for longer periods the
        // implementation is broken (often returns in shorter than specified
        // time). If `usleep()` is fixed then we will update this to always call
        // it for the default strategy.
        usleep(us);
        return;
    }

    uint64_t now_us = esp_timer_get_time();
    const uint64_t target_us = now_us + us;
    do {
        if (strat == SleepStrategy::PreciseBusyWait) {
            // This is an "internal and unstable API" according to ESP, but
            // `usleep()` is just a wrapper for it anyway. If it does change,
            // this function needs to be updated.
            esp_rom_delay_us(target_us - now_us);
        } else {
            // Ensure we never call vTaskDelay(0)
            static constexpr int prevent_zero_ticks = us_per_tick - 1;
            vTaskDelay((TickType_t)(((target_us - now_us) + prevent_zero_ticks) / us_per_tick));
        }
        // Validate against monotonic clock
        now_us = esp_timer_get_time();
    } while (now_us < target_us);
}

This approach gives us the minimal change needed to ensure things work correctly while allowing more control over how to perform the sleep when using C++.

Conclusion

I cut my teeth on bare metal C code where everything was statically allocated. No malloc(). No floating point math because there was no FPU. Custom linker scripts. Debugging using GPIO pins and an oscilloscope. Using precalculated value tables to save a few microseconds in an ISR. We ran at 24 MHz. At that time, C was luxurious because the “old” stuff was running at 1 MHz, usually in assembler.

Back then common wisdom was that C++ was just not appropriate for microcontrollers. Of course, that was a long time ago, before “modern C++” and before MCUs were clocked at hundreds of MHz with L1 caching and multistage instruction pipelines.

It seems today that using C++ for firmware brings up a lot of strong reactions. A lot of embedded people hate it. A lot of people love it. For myself, I think it can be a great tool, but it does have much complexity you need to get right, especially when using it on an MCU. This seems to be a good example of such.

I sincerely hope usleep() is fixed. Until then, don’t use std::this_thread::sleep_for() in your IDF v5 projects. It’s a waste of time!


Steve Noonan is a Principal Embedded Systems Engineer at Nomo International, Inc.

Interrupt

by Memfault