What separates C++ from being a "perfect" performance-oriented language
What this is: A subjective collection of various "performance issues" of C++.
What constitutes a "performance issue": Any quirk of the language and its ecosystem that inclines / forces developers to implement things in a suboptimal (in regards to performance) way.
The "issues" are listed in no particular order and were initially compiled as a personal note to summarize a bunch of curios quirks, however after a bit of revision it seems good enough to make for a rather interesting read.
Every "issue" comes with an attached cause and a short explanation.
Note
This does not intend to be a language critique or necessarily make an argument for the practicality of all listed points. Many of the points are largely presented as "what if"s, without concern for their pragmatic importance.
Performance issues
No destructive move
Cause: Language design.
In C++ any moved-from object is required to stay in some kind of a "valid" moved-from state. This often requires additional logic in the move constructor and in certain cases can prevent the move from being `noexcept` (which can affect the performance of standard containers, mainly `std::vector<>`). A good overview of the topic can be found in this blogpost by Jonathan Müller.
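A minimal sketch of the extra work involved (using a hypothetical `Buffer` class): because the moved-from object must remain valid and destructible, the move constructor has to reset the source, and the destructor has to handle the emptied state.

```cpp
#include <cstddef>
#include <utility>

struct Buffer {
    std::size_t size = 0;
    int*        data = nullptr;

    explicit Buffer(std::size_t n) : size(n), data(new int[n]{}) {}

    Buffer(Buffer&& other) noexcept
        : size(std::exchange(other.size, 0)),      // extra stores that a destructive
          data(std::exchange(other.data, nullptr)) // move would not need
    {}

    ~Buffer() { delete[] data; } // must handle the moved-from (nullptr) state
};
```

A destructive move would end the source object's lifetime instead, making the resets and the extra destructor run unnecessary.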
Implicit copy
Cause: Language design.
C++ is a copy-by-default rather than a move-by-default language. This makes it easy to accidentally perform a heavyweight copy, which in some cases can hide behind very innocuous syntax, such as returning a local variable created by a structured binding from a function:

```cpp
SomeHeavyClass f() {
    auto [res, err] = compute();
    // ...
    return res; // this is a copy, no copy elision takes place
}
```
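A self-contained sketch of the usual workaround (with hypothetical stand-ins for `SomeHeavyClass` and `compute()`): since a structured binding gets neither copy elision nor implicit move, requesting the move explicitly avoids the heavyweight copy.

```cpp
#include <utility>
#include <vector>

struct SomeHeavyClass { std::vector<int> payload = std::vector<int>(1000, 7); };
struct Result { SomeHeavyClass res; int err; };

Result compute() { return {}; }

SomeHeavyClass f() {
    auto [res, err] = compute();
    // ...
    return std::move(res); // explicit move instead of an implicit copy
}
```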
No `constexpr` priority
Cause: Practical concerns.
While `constexpr` functions can be evaluated both at runtime & compile time, there are no guarantees as to which option will be chosen unless we manually force a `constexpr` context:
```cpp
constexpr int heavy_function() {
    // ...
    return 0;
}

std::cout << heavy_function();
// 'heavy_function()' might be evaluated at runtime unless we force constexpr
// evaluation by assigning its result to a constexpr variable first
```
From an idealistic perspective, we always want to pre-compute as much as possible at compile time, and having to manually "force" `constexpr` evaluation puts an additional burden on the programmer. From a more pragmatic perspective, forcing `constexpr` evaluation on everything can have a very significant impact on compilation times, since compile-time evaluation is significantly slower than runtime evaluation of the same function.
Manual struct packing
Cause: ABI compatibility, C legacy
In C++ member variables are guaranteed to be stored in the order of their declaration. This can often waste space due to alignment requirements, especially when working with large classes where the "performance-optimal" order and the "readability-optimal" order might differ:

```cpp
struct Large {
    std::uint32_t a; // 4 bytes + 4 bytes of padding
    std::uint64_t b; // 8 bytes
    std::uint32_t c; // 4 bytes + 4 bytes of padding
};

struct Small {
    std::uint64_t b; // 8 bytes
    std::uint32_t a; // 4 bytes
    std::uint32_t c; // 4 bytes
};

static_assert(sizeof(Large) == 24);
static_assert(sizeof(Small) == 16);
```
Some languages that prefer static linking leave this ordering to the compiler; in the case of C++ the choice was made in favor of compatibility.
No way to tell when a critical optimization has failed
Cause: Compiler QoI (quality of implementation)
In some scenarios, performance IS correctness. This mainly concerns vectorization, which can fail due to seemingly minor changes and cause a sudden 2x-4x performance degradation (and degrading, for example, a game from 60 FPS to 15 FPS makes it effectively non-functional).
Unfortunately, this is a very complex issue with no easy solution; most of the time such matters are handled by maintaining a comprehensive benchmark suite with mandated performance degradation tests. An interesting effort was made by Unity's Burst compiler (for C#), which provides a way to declare data independence and emits compilation errors should these guarantees be violated.
`std::unordered_map` pointer stability
Cause: API design
Almost every method of `std::unordered_map` is significantly slower than it could be due to the requirement of pointer stability, which dictates a node-based implementation that never relocates stored elements.
Whether pointer stability is worth the performance cost is frequently debated. An excellent overview of existing map implementations and their performance trade-offs can be found on Martin Ankerl's website.
As of 2025 it seems that densely stored designs with open addressing & linear probing (such as `boost::unordered_flat_map` and `ankerl::unordered_dense::map`) are a good general go-to.
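A small self-contained demonstration of the guarantee itself: rehashing invalidates iterators, but pointers & references to stored elements must remain valid, which is exactly what forces the node-based layout.

```cpp
#include <string>
#include <unordered_map>

bool pointers_survive_rehash() {
    std::unordered_map<int, std::string> map;
    map[1] = "one";
    const std::string* ptr = &map.at(1);          // pointer into the map
    for (int i = 2; i < 10000; ++i) map[i] = "x"; // triggers multiple rehashes
    return ptr == &map.at(1) && *ptr == "one";    // guaranteed to be 'true'
}
```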
`<cmath>` error handling
Cause: C legacy
`<cmath>` uses a rather questionable error handling strategy which relies on modifying a global `errno` object. In many cases this global access prevents the compiler from being more aggressive with optimizations, with prevented vectorization being the biggest issue in terms of impact. For this reason many compilers have an option to disable `errno` reporting (such as `-fno-math-errno` on GCC & clang).
In addition, modifying a global variable prevented `<cmath>` functions from being `constexpr` up until C++26, which affects a lot of generic code that could otherwise be evaluated at compile time.
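A minimal sketch of the mechanism (the `checked_log` wrapper is hypothetical): on implementations where `math_errhandling` includes `MATH_ERRNO`, a domain error in `std::log()` is reported by writing to the global `errno`, and it is this global store that inhibits optimization.

```cpp
#include <cerrno>
#include <cmath>

double checked_log(double x) {
    errno = 0;
    const double res = std::log(x); // std::log(-1.0) yields NaN and, on many
                                    // implementations, sets errno to EDOM
    return res;
}
```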
`<random>` algorithms
Cause: Outdated algorithms
While the core design of `<random>` is incredibly flexible, its performance suffers from outdated PRNGs and strict algorithmic requirements for distributions. Switching to a more modern set of algorithms can frequently lead to a 2x-6x speedup with no loss of statistical quality.
A rather comprehensive overview of this topic can be found in the docs of utl::random.
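As an illustration of how small a modern PRNG can be, here is a sketch of splitmix64 wrapped to satisfy the `UniformRandomBitGenerator` requirements, so it can be plugged directly into standard `<random>` distributions (the struct name is my own):

```cpp
#include <cstdint>

struct SplitMix64 {
    std::uint64_t state; // seed

    using result_type = std::uint64_t;
    static constexpr result_type min() { return 0; }
    static constexpr result_type max() { return ~result_type(0); }

    constexpr result_type operator()() {
        std::uint64_t z = (state += 0x9E3779B97F4A7C15ull);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
        return z ^ (z >> 31);
    }
};
```

Usage is the same as with any standard engine, e.g. `std::uniform_int_distribution<int>{1, 6}(gen)` for a `SplitMix64 gen{42}`.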
`<regex>` performance
Cause: ABI stability
The standard library `<regex>` is known for its downright horrific performance, caused by a suboptimal initial implementation. At the moment standard library regex tends to be dozens or even hundreds of times slower than the modern regex engines of other languages.
At the API level there is nothing preventing `<regex>` from achieving reasonable performance; however, it fell victim to the requirement of ABI stability, which set its initial implementation in stone and prevented any meaningful improvements in the future.
No standard 128-bit types
Cause: Library support
A lot of bleeding-edge algorithms for hashing, RNG, serialization, etc. rely on wider-than-64-bit arithmetic, which is often natively supported by the architecture (most use cases only need \(64 \times 64 \rightarrow 128\) bit arithmetic instructions, which are common on modern hardware).
Since every major compiler supports such operations through extensions (GCC & clang `__int128`, MSVC `_umul128()`, etc.), this is usually worked around with a bunch of compiler-specific `#ifdef` blocks and an emulated fallback.
Such algorithms would be significantly easier and less error-prone to implement if `uint128_t` & `int128_t` were standardized across compilers. Doing so through `<cstdint>`, however, might prove challenging due to old-code compatibility concerns.
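A sketch of the typical workaround the paragraph describes: a portable "high 64 bits of a 64x64 multiplication" helper (the name `mul128_hi` is my own) with an intrinsic fast path and an emulated fallback built from 32-bit limbs.

```cpp
#include <cstdint>

// Portable 64x64 -> high 64 bits of the 128-bit product
std::uint64_t mul128_hi(std::uint64_t a, std::uint64_t b) {
#if defined(__SIZEOF_INT128__) // GCC & clang
    return static_cast<std::uint64_t>((static_cast<unsigned __int128>(a) * b) >> 64);
#elif defined(_MSC_VER) && defined(_M_X64)
    return __umulh(a, b); // requires <intrin.h>
#else // emulated fallback via four 32x32 -> 64 partial products
    const std::uint64_t a_lo = a & 0xFFFFFFFF, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFF, b_hi = b >> 32;
    const std::uint64_t lo_lo = a_lo * b_lo;
    const std::uint64_t hi_lo = a_hi * b_lo + (lo_lo >> 32);
    const std::uint64_t lo_hi = a_lo * b_hi + (hi_lo & 0xFFFFFFFF);
    return a_hi * b_hi + (hi_lo >> 32) + (lo_hi >> 32);
#endif
}
```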
No bit operations
Cause: Library support
[!] Fixed with: C++20
Many performant algorithms tend to be written in terms of bit operations present in practically all modern hardware (such as `rotl()`, `popcount()`, `bit_width()`, `countl_zero()`, `bit_cast()`, etc.). Up until C++20 `<bit>` there was no portable way to invoke such instructions; they were usually written in terms of branches & regular shifts and hopefully optimized down to the intended assembly by the compiler.
Floating point parsing & serialization
Cause: Library support
[!] Fixed with: C++17
Quickly & correctly parsing / serializing floating point numbers is a task of significant complexity, one that saw major improvements with the advancement of the Ryu / Grisu / Schubfach / Dragonbox family of algorithms (a speedup of several times together with better round-trip guarantees).
Old serialization methods (such as streams and `std::snprintf()`) are unable to benefit from these advancements due to their legacy requirements. In C++17 `<charconv>` was standardized as a performant, low-level way of parsing & serializing floats, one that should be flexible enough to accommodate future algorithmic improvements.
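A minimal sketch of the `<charconv>` API (the `serialize` / `parse` wrappers are my own): locale-independent, non-allocating, and producing the shortest round-trip representation by default.

```cpp
#include <cassert>
#include <charconv>
#include <string>

std::string serialize(double value) {
    char buffer[32]{};
    const auto [end, ec] = std::to_chars(buffer, buffer + sizeof(buffer), value);
    assert(ec == std::errc{}); // success
    return std::string(buffer, end);
}

double parse(const std::string& str) {
    double value = 0;
    const auto [ptr, ec] = std::from_chars(str.data(), str.data() + str.size(), value);
    assert(ec == std::errc{}); // success
    return value;
}
```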
Stream-based formatting
Cause: API design
[!] Fixed with: C++20, C++23 (partially due to incomplete adoption)
The "classic" way to parse / serialize / format values is based around rather heavyweight polymorphic stream objects that conflate formatting, I/O and locale manipulation (which in many cases is largely detrimental).
This approach is significantly outperformed by the design of `fmtlib`, which was partially standardized in C++20 (`std::format`) and C++23 (`std::print`).
No zero-size member optimization
Cause: Language design
[!] Fixed with: C++20 (partially due to ABI stability concerns)
In C++ all member variables are required to have a valid non-overlapping address. This can pointlessly bloat the object size when working with potentially stateless members (such as allocators, comparators, etc.) whose address is never taken:

```cpp
template <class T, class Allocator = std::allocator<T>, class Comparator = std::less<>>
struct Map {
    Allocator  alloc; // will take space even if stateless
    Comparator comp;  //
    // ...
};
```
This inefficiency used to be worked around through inheritance hacks and empty base class optimization; however, with C++20 we now have a proper attribute, `[[no_unique_address]]`, to mark those potentially stateless members. Unfortunately, MSVC still ignores it and requires its own `[[msvc::no_unique_address]]` instead.
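A sketch of both approaches for a stateless `std::less<>` member. Note that the empty-base trick is guaranteed to work for standard-layout types, while the attribute-based sizes below hold on GCC & clang but not on MSVC:

```cpp
#include <functional>

struct Padded {
    std::less<> comp; // stateless, yet occupies at least 1 byte (plus padding)
    int         x;
};

struct WithEBO : std::less<> { // the old workaround: empty base optimization
    int x;
};

struct WithAttribute {
    [[no_unique_address]] std::less<> comp; // C++20; ignored by MSVC
    int                               x;
};

static_assert(sizeof(Padded) > sizeof(int));   // wasted space
static_assert(sizeof(WithEBO) == sizeof(int)); // empty base adds no size
```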
No parallel execution guarantees
Cause: Library support
While C++17 added parallel execution policies to much of `<algorithm>`, the standard does not mandate that implementations actually respect them. For example, GCC will silently fall back to sequential execution unless linked against Intel TBB, which goes against the intuitive assumption that parallel algorithms work out of the box.
Limited reallocation
Cause: C legacy, complexity
Contiguous arrays such as `std::vector<>` (and other similar classes) should logically be able to grow in place when the memory allows it; however, due to a combination of the design flaws of C `realloc()`, the lack of a reallocation mechanism in `std::allocator<>`, and RAII requirements, such an ability never made it into the standard (or most existing libraries, for that matter).
In fact, even further gains could be made if containers could intrusively integrate with specific allocators to account for various implementation details; for example, `folly::fbvector` accounts for jemalloc's fixed-size allocation classes. Providing such mechanisms in the general case, however, has proven to be a task of significant complexity.
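A sketch contrasting the two worlds (the `grow_buffer` helper is hypothetical): for trivially copyable data, C's `realloc()` may extend the allocation in place, while `std::vector<>` growth always performs allocate + move + deallocate. What `realloc()` crucially lacks for C++ purposes is a "try to extend, but never move" mode usable with non-trivial objects.

```cpp
#include <cstdlib>

// Grows a malloc()-allocated int buffer; realloc() may extend it in place
// (something std::allocator<> offers no interface for)
int* grow_buffer(int* data, std::size_t new_count) {
    void* grown = std::realloc(data, new_count * sizeof(int));
    return static_cast<int*>(grown); // may or may not be the original address
}
```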
Pointer aliasing
Cause: Language design
In many scenarios potential pointer aliasing prevents the compiler from being more aggressive with optimization. In C we can use `restrict` to signal the lack of aliasing; C++, however, has no general solution for the problem. Many compilers provide extensions such as GCC's `__restrict__`, but those qualifiers are only applicable to raw pointers and cannot, for example, specify that two instances of `std::span<>` are non-aliasing:
```cpp
void vector_sum(std::span<double> res, std::span<double> lhs, std::span<double> rhs) {
    assert(res.size() == lhs.size());
    assert(res.size() == rhs.size());
    for (std::size_t i = 0; i < res.size(); ++i) res[i] = lhs[i] + rhs[i];
    // vectorization would be incorrect in the general case due to potential aliasing
    // when 'res' / 'lhs' / 'rhs' point to intersecting chunks of the same array
}
```
For a simple loop like this one many compilers will be able to figure out a special case and use a vectorized version when all pointers are proven to be non-aliasing (with a non-vectorized fallback for the general case). In real applications, however, dependency chains are frequently too complex to be resolved by a compiler, which leads to a significant performance loss relative to a manually annotated version.
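The raw-pointer workaround looks as follows (`__restrict__` is a GCC & clang extension; MSVC spells it `__restrict`): the qualifier promises the compiler that the three arrays never overlap, making vectorization unconditionally valid.

```cpp
#include <cstddef>

// the qualifiers promise that 'res' / 'lhs' / 'rhs' do not alias each other
void vector_sum_restrict(double* __restrict__ res, const double* __restrict__ lhs,
                         const double* __restrict__ rhs, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) res[i] = lhs[i] + rhs[i];
}
```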
Final notes
The list above was initially made as a personal note to summarize a bunch of curious quirks, and as such it does not intend to be a critique of language designers & implementers. Every passing standard makes significant strides in resolving & improving many of these points, and with C++26 bringing `std::simd` and reflection we are likely to see some excellent changes in the ecosystem.
Publication date: 2025.08.30
Last revision: 2025.09.03