Deep Engineering #17: AVX10.2 meets RVV—how to write portable SIMD in 2025 with Ivo Balbaert
A practical playbook for cross-CPU vectorization—data layout, CI diagnostics, function multiversioning, plus a hands-on Mojo tutorial.
Join us on October 22, 2025 at DevSecCon25 - Securing the Shift to AI Native
This one-day event brings together AI, App, & Security pros to discuss securing the shift to AI Native SDLCs via DevSecOps.
The summit features speakers like Nnenna Ndukwe & Bob Remeika, plus an AI Lab with hands-on workshops, AI Security Research, and a Product Track!
✍️From the editor’s desk,
Welcome to the seventeenth issue of Deep Engineering.
Vectorization just got both easier and harder—AVX10.2 is unifying x86 vectors, RISC-V RVV is surging into production, and LLVM/GCC are auto-vectorizing more by default—so your code will either fly everywhere or fragment fast.
In this issue we cut through the noise: what AVX10.2 changes for mixed precision and masking; why RVV is now a first-class target; how LLVM’s new interleave IR and GCC 15 broaden auto-vectorization; and pragmatic ways to ship one portable binary—use function multi-versioning (target_clones, ifunc) for hot kernels while keeping 95% of your code ISA-agnostic, plus checklists for layouts (SoA, alignment), CI diagnostics, and cross-ISA FP semantics.
For a fast, hands-on on-ramp, we have for you Ivo Balbaert’s (Lector at CVO Antwerpen and author of Packt introductions to Dart, Julia, Rust, and Red) latest article, Building with Mojo (Part 2): Using SIMD in Mojo—it walks you through DType, SIMD[dtype, size], masking/select, lane sizing, and turning on vectorization diagnostics, with tips that carry cleanly to AVX10.2 and RVV.
Let’s get started.
SIMD in 2025: Vectorization Amid Diverging ISAs and Toolchain Advances
In the past few months, we have seen new capabilities and new urgency emerge for SIMD programming. Intel’s AVX10.2 has extended x86’s dominance in vector computing with more flexibility and precision, while open architectures like RISC-V are rapidly catching up and even innovating in how we vectorize code.
Compilers are racing to keep pace – adding features (as seen in LLVM’s intrinsics and GCC’s support for new ISAs) and giving developers sharper tools to monitor and tune vectorization.
Teams must deliver software that runs fast everywhere, from server processors to the coming wave of RISC-V accelerators. Vectorizable software will, over the next few years, need to run efficiently on a more diverse set of CPUs. Engineers need to get comfortable writing clean, portable SIMD-friendly code, reading compiler diagnostics, and deploying runtime dispatch mechanisms to cover top-end ISA features without sacrificing compatibility.
AVX10.2: Unified Vectors and New Precision in x86-64
Intel’s latest AVX10.2 specification (Rev. 5.0, June 2025) rolled out several notable enhancements to the x86-64 vector ISA:
Mixed-precision support: 16-bit floating point (FP16) and bfloat16 operations are now fully integrated, along with helper instructions for 8-bit floating-point (FP8) conversions. This means vector code can handle lower-precision data (common in AI workloads) more efficiently without falling back to scalar routines.
Floating-point exceptions: The spec clarifies FP exception semantics. For example, all bfloat16 arithmetic under AVX10.2 treats subnormal inputs as zero and flushes subnormal outputs to zero. In practice, this default Denormals-Are-Zero (DAZ) and Flush-To-Zero (FTZ) behavior eliminates the performance traps of subnormal values — but developers must be aware of potential tiny-value differences when comparing results across architectures.
Min/Max & NaN rules: Intel introduced new min/max instructions with IEEE-754-2019 NaN propagation rules. The AVX10.2 spec explicitly supports the “minimumNumber”/“maximumNumber” semantics (ignoring quiet NaNs unless both inputs are NaN) for compatibility with modern standards like WebAssembly. This differs from legacy x86 behavior where any NaN would poison the result, improving consistency across platforms.
Unified vector ISA & masking: AVX10.2 continues Intel’s push toward a unified vector ISA, converging AVX-512 features across all cores (including efficiency cores) while allowing flexible vector lengths (256-bit on some cores, 512-bit on others). Mask registers and predication remain central: AVX10 inherits AVX-512’s masking for conditional vector ops, and even 256-bit-wide cores use those mask semantics. For developers, coding patterns using masked loads/stores or blending behave uniformly on AVX10-capable CPUs, regardless of core type.
The bottom line is that new instructions (from FP8 conversions to INT16 dot-products) in AVX10.2 bring x86 closer to feature parity with GPU tensor cores and other AI accelerators – but with the familiar x86 programming model. The changes in data types and exception handling matter for anyone writing numeric algorithms: you can mix precisions more freely, but you should also add tests for edge cases (NaNs, infinities, denormals) given the new handling rules.
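To make the masking point concrete, here is a minimal sketch (ours, not from the spec) of mask-register predication using AVX-512VL intrinsics; AVX10 carries the same masked-load/store semantics to 256-bit-wide cores. Function and array names are illustrative, and it assumes a toolchain that exposes <immintrin.h> with the relevant feature flags (e.g., -mavx512f -mavx512vl, or an AVX10-enabled compiler).

```cpp
// Sketch: conditionally scale elements above a threshold using a mask register.
// Assumes AVX-512VL (or AVX10) support; names are illustrative.
#include <immintrin.h>
#include <cstddef>

void scale_above_threshold(const float* in, float* out, std::size_t n,
                           float threshold, float scale) {
    const __m256 vthr   = _mm256_set1_ps(threshold);
    const __m256 vscale = _mm256_set1_ps(scale);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(in + i);
        // Mask of lanes where in[i] > threshold (ordered, non-signaling compare).
        __mmask8 m = _mm256_cmp_ps_mask(v, vthr, _CMP_GT_OQ);
        // Store only the selected lanes; unselected lanes of out[] are left untouched.
        _mm256_mask_storeu_ps(out + i, m, _mm256_mul_ps(v, vscale));
    }
    for (; i < n; ++i)                      // scalar tail
        if (in[i] > threshold) out[i] = in[i] * scale;
}
```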
Runtime Dispatch for AVX10.x Without Binary Sprawl
One practical challenge with an evolving ISA like AVX10 is supporting it in software without maintaining separate builds. Intel addressed this by simplifying feature detection. The AVX10 spec defines a CPUID versioning scheme: software can check a single CPUID bit for “AVX10 support” and then an integer “version” for the level (AVX10.1 vs 10.2, etc.). As Intel engineers note,
“the application developer will only need to check for two aspects:
1. A CPUID feature bit enumerating that the Intel® AVX10 ISA is supported.
2. A version number to ensure that the supported version is greater than or equal to the desired version.”
You no longer have to query dozens of feature flags (AVX512_F, AVX512_VNNI, AVX512_FP16, etc.) and combine them – a simple version check covers the baseline.
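As a rough sketch of what that two-step check could look like in C++ with GCC/Clang’s <cpuid.h>: the leaf and bit positions below (CPUID.(7,1):EDX bit 19 for the AVX10 feature flag, leaf 0x24 for the version) reflect our reading of the published spec and should be verified against Intel’s current documentation; production code would also confirm OS XSAVE support before using wide registers.

```cpp
// Hedged sketch of the two-step AVX10 check; leaf/bit positions are assumptions
// to verify against the current Intel documentation.
#include <cpuid.h>

// Returns the supported AVX10 version (1, 2, ...), or 0 if AVX10 is absent.
static unsigned avx10_version() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx)) return 0;
    if (!(edx & (1u << 19))) return 0;   // step 1: AVX10 feature bit (assumed position)
    if (!__get_cpuid_count(0x24, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return ebx & 0xFF;                   // step 2: converged-ISA version number
}

// Usage (kernel names are hypothetical):
//   if (avx10_version() >= 2) use_avx10_2_path(); else use_baseline_path();
```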
This quarter’s AVX10.2 update was also accompanied by toolchain support: for example, GCC 15 introduced the -mavx10.2 switch to enable AVX10.2 instructions, with sub-options to restrict vector length to 256 bits or allow the full 512 bits. Compiler support means teams can begin compiling AVX10-ready binaries now.
How do you deploy AVX10 code without fragmenting your binaries?
Multi-versioning critical functions (GCC): Modern compilers can create runtime-dispatch variants of the same function for different ISA levels. Richard Biener (GCC maintainer) highlights attributes like target_clones, enabling one version using AVX2 and another using AVX10.2, selected automatically at runtime (see the sketch after this list).
Ifunc-based multi-versioning (Clang/ICC): Clang and ICC support ifunc-based multi-versioning so a single binary adapts to the host: it takes the AVX10.2-optimized path on new CPUs and a baseline path elsewhere. The new CPUID scheme simplifies and future-proofs dispatch logic.
Near-term hardware payoff: Intel “Diamond Rapids” Xeon (announced for 2025) will support AVX10.2 at 512-bit width, so these dispatch strategies should yield sizable speedups on new servers.
Deployment guidance to avoid binary sprawl: Identify clear hotspots (e.g., image kernels, math routines) and apply function multi-versioning only there; keep the rest ISA-agnostic. This preserves one binary while still exploiting AVX10 gains. The caution is warranted because: (1) GCC 15 now warns that its FMV implementation is still experimental and not ACLE-conformant, so indiscriminate use is risky; (2) LLVM discussions in July 2025 detail that calls through FMV/ifunc complicate inlining, which can grow code and blunt wins if you version widely; and (3) active August 2025 threads emphasize ongoing work to curb executable size—another reason to limit FMV to proven hot paths.
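As an illustration of confining multi-versioning to a single hot kernel, here is a minimal target_clones sketch. The "avx2"/"default" clone names are accepted by current GCC and Clang; an AVX10.2 clone is only noted in a comment because the exact target string depends on your compiler version, so treat its spelling as an assumption.

```cpp
// Minimal function-multi-versioning sketch (GCC/Clang target_clones).
// The compiler emits one clone per listed target plus an ifunc resolver that
// picks the best clone at load time; the rest of the binary stays ISA-agnostic.
#include <cstddef>

__attribute__((target_clones("avx2", "default")))
void saxpy(float a, const float* __restrict__ x, float* __restrict__ y, std::size_t n) {
    // A plain loop: each clone is auto-vectorized for its own target ISA.
    // An AVX10.2 clone could be listed once your compiler accepts the target
    // string (e.g. "avx10.2" under GCC 15); treat that spelling as an assumption.
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```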
Compiler Updates: Auto-Vectorization Gets Smarter
Three shifts stand out this quarter: LLVM’s new interleave IR, GCC 15’s broader default vectorization and AVX10.2 support, and sharper diagnostics across both toolchains.
LLVM (June 2025) | interleave intrinsics: LLVM 20 added llvm.vector.interleave/llvm.vector.deinterleave (variants for 4, 6, 8). These let the compiler efficiently transform layouts (AoS ↔ SoA), enabling better auto-vectorization for interleaved data (common in multimedia/image pipelines). The vectorizer can now emit optimized shuffles instead of bailing or relying on ad-hoc byte shuffles.
GCC 15.1/15.2 | broader vectorization by default: At -O2, the loop vectorizer now handles unknown trip counts more aggressively, vectorizing unless heavy runtime checks/epilogues would dominate. GCC 15 also adds official support for new ISAs (e.g., AVX10.2, AMX), so intrinsics enabled via flags like -mavx10.2 can be scheduled/allocated correctly. On Arm, cost-model tweaks (e.g., AArch64 SVE) and other optimizations landed. Net effect: more targets supported, more loops vectorized.
Diagnostics | make vectorization visible: Both Clang and GCC improved optimization remarks. GCC’s -fopt-info-vec* now reports details such as masked epilogue vectorization, vector length, and unroll factor; Clang provides -Rpass=loop-vectorize (and related) to surface when/why a loop did or didn’t vectorize. Enable these in CI by default to catch silent de-vectorization and verify expected SIMD speedups. Richard Biener’s May 2025 patch to GCC’s vectorizer (merged in time for GCC 15) expanded the detail in vectorization reports. Now, when compiled with -fopt-info-vec flags, GCC will tell you not just “loop vectorized” but also whether it vectorized the loop’s remainder (epilogue) using masked instructions and what vector length and unroll factor were used. An example from his patch shows messages like “loop vectorized using 64 byte vectors and unroll factor 32” alongside “epilogue loop vectorized using masked 64 byte vectors and unroll factor 32”.
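To see those remarks on a concrete loop, here is a small sketch (the file name, function, and flag selection are ours):

```cpp
// vec_demo.cpp — a loop written to give the vectorizer no excuses, plus the
// flags that make its decision visible. Example invocations (adjust as needed):
//   clang++ -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c vec_demo.cpp
//   g++     -O2 -fopt-info-vec -fopt-info-vec-missed               -c vec_demo.cpp
#include <cstddef>

void add_scaled(const float* __restrict__ a, const float* __restrict__ b,
                float* __restrict__ out, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + k * b[i];   // contiguous, no aliasing, no calls
}
// Expect remarks along the lines of "vectorized loop" (Clang) or
// "loop vectorized using N byte vectors ..." (GCC); if a refactor makes the
// remark disappear, CI can flag the regression.
```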
Taken together, these changes let you rely on the compiler for more of the heavy lifting—so long as you keep vectorization remarks on in CI, profile the result, and reserve intrinsics for the few kernels the vectorizer still misses.
RISC-V Vectors Gain Momentum (and Support from Nvidia)
Two developments make RVV hard to ignore this quarter: CUDA now accepts RISC-V CPUs as hosts, and LLVM/GCC have moved RVV codegen from experimental to tuned—enough to treat RVV as a first-class SIMD target alongside x86/Arm.
Ecosystem signal | CUDA now hosts on RISC-V (July 2025): RISC-V’s vector extension (RVV) is moving from niche to serious portable-software target. Nvidia’s CUDA platform now officially supports RISC-V CPUs as hosts, joining x86 and Arm. A heterogeneous Nvidia GPU system could be driven by a RISC-V CPU, which must run the same vectorized C++ code that today targets Intel/AMD. Nvidia’s move is a major credibility boost for RISC-V in HPC, signaling it’s being groomed for heavy workloads where SIMD matters—not just microcontrollers/IoT. Developers can start by tracking RISC-V support in compilers/libraries and ensure inner loops vectorize on RVV just as they do on AVX.
Toolchains | RVV goes from “works” to “tuned”: GCC and LLVM keep improving RISC-V auto-vectorization. In LLVM, RVV codegen is enabled by default (mid-2025) and under active performance tuning. Recent LLVM work handles non-power-of-two vector lengths directly (RVV can operate on, e.g., 3- or 7-element vectors), teaches the SLP vectorizer to handle such irregular sizes without awkward 2+1 splits, and adds tail folding using the VL register so many loops no longer need a separate scalar cleanup—improving both speed and code size. The Igalia team reports ~9% geomean SPEC CPU speedup over the past 18 months of RVV-focused compiler work, indicating the RISC-V backend is closing the gap and, on some benchmarks, outperforming more mature x86/Arm backends.
Why does this matter for portability?
RISC-V vector machines are moving from labs to production, so your performance-critical code may soon run on RVV. Portability now means avoiding x86-specific assumptions and accounting for RVV’s calling convention (vector registers not preserved across calls) and variable-length vectors.
Likelihood of RVV targets: Performance-sensitive code may soon run on RISC-V servers or edge devices with vector acceleration; avoid x86-centric assumptions (e.g., fixed 512-bit widths or ISA-specific intrinsics).
Compiler help, but differences remain: Mainstream compilers increasingly make portable C++ vectorization work on RVV, yet portability still hinges on details like RVV’s calling convention not preserving vector registers across calls—so inlining (or avoiding calls) can be critical.
Action for teams: Treat RVV as production-bound: track its toolchain evolution and begin testing on RISC-V simulators or hardware now to surface portability issues early.
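A cheap way to start is to push an existing hot loop through an RVV-targeting compile and read the same vectorization remarks. A sketch follows; the target triple and flags are our assumptions for a recent Clang with RISC-V support, so adjust to your toolchain.

```cpp
// relu.cpp — a portable loop checked against an RVV target without owning hardware.
// Example cross-compile (triple/flags assumed for a recent Clang):
//   clang++ --target=riscv64-unknown-linux-gnu -march=rv64gcv -O2 \
//           -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c relu.cpp
// The same source should also vectorize when built for x86-64 or AArch64,
// which is the portability property we actually care about.
#include <cstddef>

void relu(const float* __restrict__ in, float* __restrict__ out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;   // branchless select; vectorizes cleanly
}
```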
💡Portable Vector Code: Best Practices for 2025
With multiple vector ISAs in play (x86 AVX, RISC-V RVV, ARM SVE, etc.), a key question is how to write portable and performant SIMD-friendly code. Here are some actionable guidelines, informed by recent changes:
Write SIMD-friendly loops & layouts — Prefer straight-line loops over contiguous arrays. Use SoA where possible; if you must use AoS, recent LLVM interleave intrinsics help but older compilers may not vectorize. Align data (e.g., alignas(32|64)) to avoid runtime alignment checks that still hurt at -O2. Add __restrict__ to remove aliasing barriers (a layout sketch follows this list).
Make vectorization observable in CI — Use pragmas sparingly, but always enable diagnostics: Clang -Rpass=loop-vectorize -Rpass-missed=loop-vectorize; GCC -fopt-info-vec-missed (and friends). Fail builds when critical loops de-vectorize; grep reports to catch regressions early.
Mind cross-ISA FP semantics — AVX10.2 flushes BF16 subnormals to zero and adopts IEEE-754 min/max rules; RVV may differ. If sensitive to denormals, NaNs, or rounding, test per target, prefer well-defined ops (e.g., consistent FMA), and document any use of -ffast-math. Treat small numeric deltas across ISAs as expected unless they exceed tolerances.
Use intrinsics surgically, behind an API — Keep 95% portable. For the few hot kernels the vectorizer misses, provide ISA-specific implementations (e.g., AVX10.2, RVV) plus a plain C fallback, and dispatch at runtime/compile-time (e.g., ifunc/attributes). This preserves readability, testability, and portability.
Verify on real hardware — Profile with perf, VTune, or platform equivalents; inspect counters (packed vs scalar, vector utilization). On RVV, use tools like llvm-exegesis where available. Do A/B runs (e.g., -fno-tree-vectorize) to validate actual speedups. Re-profile per target: a loop that shines on AVX-512 may behave differently on RVV (vector length, startup overhead).
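For the first checklist item, here is a small before/after layout sketch (struct and function names are illustrative) showing the AoS-to-SoA move together with alignment and aliasing hints:

```cpp
// Layout sketch: AoS vs SoA, plus alignment and __restrict__ hints.
#include <cstddef>

// AoS: fields are interleaved per element, so updating only x/y/z strides
// through memory and often pushes the vectorizer toward gathers or shuffles.
struct ParticleAoS { float x, y, z, mass; };

// SoA: one contiguous array per field, the natural shape for SIMD. Aligning the
// storage (fixed-size arrays here; aligned allocation serves the same purpose
// for dynamic data) lets the compiler skip peeling and runtime alignment checks.
constexpr std::size_t kMaxParticles = 1024;   // illustrative capacity
struct ParticlesSoA {
    alignas(64) float x[kMaxParticles];
    alignas(64) float y[kMaxParticles];
    alignas(64) float z[kMaxParticles];
};

// __restrict__ tells the compiler the velocity streams do not alias, removing a
// common "could not determine dependence" reason for missed vectorization.
void advance(ParticlesSoA& p, const float* __restrict__ vx, const float* __restrict__ vy,
             const float* __restrict__ vz, float dt, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += vx[i] * dt;
        p.y[i] += vy[i] * dt;
        p.z[i] += vz[i] * dt;
    }
}
```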
With AVX10.2 reshaping x86 semantics and RVV becoming production-ready, the durable path is code the compiler can vectorize across x86, Arm, and RISC-V. Teams should invest in data layout and measurement, and confine ISA-specific kernels behind clean runtime dispatch, to ship one portable build that keeps its speed as the hardware landscape shifts.
🧠Expert Insight
If you’re ready for some action, read Building with Mojo (Part 2): Using SIMD in Mojo by Ivo Balbaert—a hands-on walk-through of DType and fixed-size SIMD[dtype, size], sizing lanes with sys.info, using vectorize(), and verifying results with compiler remarks and perf counters, all framed for AVX10.2/RVV targets.
Building with Mojo (Part 2): Using SIMD in Mojo
This article is Part 2 of our ongoing series on the Mojo programming language. Part 1 introduced Mojo’s origins, design goals, and its promise to unify Pythonic ergonomics with systems-level performance.
🛠️Tool of the Week
Google Highway 1.3.0 (Aug 14, 2025)
A production-proven, header-only C++ library for portable, width-agnostic SIMD with clean runtime dispatch. The 1.3.0 release lands AVX10.2 target support, RVV groundwork for runtime dispatch, new FP16/BF16 helpers, perf counters, and more—squarely relevant to this issue’s portability thread.
Highlights:
New targets/types: AVX10_2, Loongson LASX/LSX, AVX3_SPR FP16 types; complex ops; BF16/FP16 interleaved loads/stores.
RVV groundwork for runtime dispatch; superoptimizer-driven RVV ops; profiling hooks and perf counters.
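For a feel of the programming model, here is a minimal static-dispatch sketch of our own against Highway’s public API (ScalableTag, Lanes, LoadU, MulAdd, StoreU); the same width-agnostic source builds for AVX2, AVX-512/AVX10, NEON, SVE, or RVV depending on the compile target.

```cpp
// Minimal Highway sketch (static dispatch): one width-agnostic kernel lowered to
// whatever the compiled-for target provides. Our example, not from the release notes.
#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

HWY_ATTR void FusedMulAdd(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
                          float* HWY_RESTRICT out, std::size_t n) {
    const hn::ScalableTag<float> d;        // "as many lanes as this target provides"
    const std::size_t lanes = hn::Lanes(d);
    std::size_t i = 0;
    for (; i + lanes <= n; i += lanes) {
        const auto va = hn::LoadU(d, a + i);
        const auto vb = hn::LoadU(d, b + i);
        hn::StoreU(hn::MulAdd(va, vb, hn::LoadU(d, out + i)), d, out + i);
    }
    for (; i < n; ++i) out[i] += a[i] * b[i];   // scalar remainder
}
```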
📎Tech Briefs
From SIMD Wrappers to SIMD Ranges - Part 1 Of 2 - Denis Yaroshevskiy & Joel Falcou - C++Now 2025: In this talk, Meta performance engineer Denis Yaroshevskiy shows how to build high-performance C++ ranges algorithms with explicit SIMD using the EVE library.
Mojo struct | SIMD: This documentation describes Mojo’s SIMD[dtype, size] type—a hardware-mapped, zero-cost vector abstraction for portable AVX/NEON-style operations—covering construction, elementwise math/logic, masking and select, shuffles/slices/interleave, reductions (reduce_add/min/max), casting/bitcasting, constraints (power-of-two sizes), and short examples.
Software Design for Performance by John Gentile: A guide to performance-oriented software design that emphasizes understanding hardware (latency, caches, NUMA) and surveys practical optimizations—memory/layout, branchless techniques, concurrency models, SIMD/GPUs and RTOS—alongside architectures, profiling/tracing tools, Linux tuning, and rich references for auto-vectorization and high-performance coding.
Lessons learned from implementing SIMD-accelerated algorithms (ChaCha20 / ChaCha12) in pure Rust by Sylvain Kerkour: This article explains how the implementation can approach hand-tuned assembly while staying safe and auditable, outlining the load–compute–store model, batching blocks across lanes, choosing AVX2/AVX-512/NEON/WASM targets, and more.
Faster substring search with SIMD in Zig: Demonstrates a SIMD-friendly substring search in Zig that scans 32–64-byte blocks by vector-matching the needle’s first/last (or rarest) characters and then verifying candidates, yielding ~60% faster performance than std.mem.indexOf (with far fewer CPU cycles and scalability to AVX-512) and remaining faster even on small inputs.
That’s all for today. Thank you for reading this issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next. Do take a moment to fill out this short survey we run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.
We’ll be back next week with more expert-led content.
Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering
If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.