SIMD scans
The element::simd module provides SIMD-accelerated scans over &[ElementId] slices. There are six primitives, all of which dispatch between a scalar fallback, AVX2 (x86_64), and NEON (aarch64), with hand-tuned inline assembly on the AVX2 path.
This is the lowest level of the crate. The lattice and predicate code consume these primitives; the user-facing API is unaware of them.
The primitives
| Function | Returns | Used by |
|---|---|---|
| contains(slice, needle) | bool; needle anywhere in slice | join's canonicalize, predicates |
| position_of(slice, needle) | Option<usize>; first index of needle | TypeBuilder::remove, TypeBuilder::replace |
| any_of_kind(slice, kind) | bool; some Element has this kind | predicates::contains_X, lattice prefilters |
| all_of_kind(slice, kind) | bool; every Element has this kind (false on empty) | predicates::is_X |
| count_of_kind(slice, kind) | usize; count of Elements with this kind | join's literal-collapse threshold |
| is_sorted_strict(slice) | bool; slice is strictly increasing under unsigned u32 order | interner's canonicality fast path |
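Before looking at the SIMD paths, it helps to pin down the scalar semantics they must reproduce exactly. A minimal sketch, assuming an illustrative ElementId newtype and a kind_of helper mirroring the shift-by-26 layout (the real definitions live in the layout chapter):

```rust
// Illustrative scalar semantics; ElementId and kind_of here are stand-ins for the real
// definitions in the layout chapter, not the crate's actual items.
use std::num::NonZeroU32;

#[derive(Copy, Clone, PartialEq, Eq)]
struct ElementId(NonZeroU32);

const KIND_SHIFT: u32 = 26; // the kind tag occupies the top bits (layout chapter)

fn kind_of(id: ElementId) -> u32 {
    id.0.get() >> KIND_SHIFT
}

fn contains(slice: &[ElementId], needle: ElementId) -> bool {
    slice.iter().any(|&e| e == needle)
}

fn all_of_kind(slice: &[ElementId], kind: u32) -> bool {
    // False on an empty slice, matching the table above.
    !slice.is_empty() && slice.iter().all(|&e| kind_of(e) == kind)
}

fn is_sorted_strict(slice: &[ElementId]) -> bool {
    // Strictly increasing under the unsigned u32 order of the underlying ids.
    slice.windows(2).all(|w| w[0].0.get() < w[1].0.get())
}
```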
The dispatch:
- x86_64 + AVX2 (runtime-detected via std::is_x86_feature_detected): 8 lanes (256 bits) per iteration. Hand-rolled inline assembly using FFmpeg's pointer-end trickery (one register holds a negative byte offset that doubles as both the load displacement and the loop counter).
- aarch64 + NEON (baseline ISA, no runtime check): 4 lanes (128 bits) per iteration. NEON intrinsics.
- Other architectures or short slices: tight scalar loop.
The thresholds are 8 lanes for AVX2 and 4 lanes for NEON. Below that, scalar wins because the SIMD setup outweighs the parallel work.
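The gating itself is ordinary Rust. A sketch of the shape, assuming the thresholds above and working on raw u32 lanes for brevity; the real entry points take &[ElementId], and the avx2/neon bodies here are stubbed with the scalar loop so the example stands alone:

```rust
// Sketch of the dispatch shape for one primitive; names and constants are illustrative.
const AVX2_THRESHOLD: usize = 8;
const NEON_THRESHOLD: usize = 4;

pub fn contains(slice: &[u32], needle: u32) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if slice.len() >= AVX2_THRESHOLD && std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 is present and the slice is at least one full chunk long.
            return unsafe { contains_avx2(slice, needle) };
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if slice.len() >= NEON_THRESHOLD {
            // NEON is baseline on aarch64: only the length gate is needed.
            // SAFETY: the slice is at least one full chunk long.
            return unsafe { contains_neon(slice, needle) };
        }
    }
    contains_scalar(slice, needle)
}

fn contains_scalar(slice: &[u32], needle: u32) -> bool {
    slice.iter().any(|&e| e == needle)
}

#[cfg(target_arch = "x86_64")]
unsafe fn contains_avx2(slice: &[u32], needle: u32) -> bool {
    contains_scalar(slice, needle) // stand-in for the hand-rolled AVX2 path
}

#[cfg(target_arch = "aarch64")]
unsafe fn contains_neon(slice: &[u32], needle: u32) -> bool {
    contains_scalar(slice, needle) // stand-in for the NEON intrinsics path
}
```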
Why ElementId-as-u32 is a perfect fit
ElementId is a NonZeroU32 (layout chapter). A slice of ElementId is a run of contiguous 32-bit lanes. Equality scans (contains, position_of) and kind scans (any_of_kind, all_of_kind, count_of_kind) reduce to the same per-chunk sequence:
- Load 8 (AVX2) or 4 (NEON) lanes into a SIMD register.
- (For kind scans) Right-shift by 26 to extract the kind tag.
- Compare-equal against a broadcasted needle (or kind value).
- Reduce to a scalar via movemask (AVX2) or maxv/minv (NEON).
- Branch on the reduction.
The per-chunk cost is constant; the per-iteration overhead is one load, two-to-three SIMD ops, one branch.
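For illustration, here is the same per-chunk sequence written with AVX2 intrinsics rather than the crate's hand-rolled assembly. It works on raw u32 lanes for brevity, and the function name is an assumption, not the crate's API:

```rust
// Illustrative AVX2-intrinsics version of the per-chunk steps (the real path is asm).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn any_of_kind_avx2(slice: &[u32], kind: u32) -> bool {
    use core::arch::x86_64::*;

    let lanes = 8;                                   // 256 bits of 32-bit lanes
    let chunks = slice.len() / lanes;
    let wanted = _mm256_set1_epi32(kind as i32);     // broadcast once, outside the loop
    for c in 0..chunks {
        // SAFETY: c * 8 + 8 <= chunks * 8 <= slice.len(); unaligned loads are allowed.
        let v = _mm256_loadu_si256(slice.as_ptr().add(c * lanes) as *const __m256i);
        let kinds = _mm256_srli_epi32::<26>(v);      // shift right by 26: the kind tag
        let eq = _mm256_cmpeq_epi32(kinds, wanted);  // all-ones lanes where the kind matches
        if _mm256_movemask_epi8(eq) != 0 {           // reduce to a scalar mask and branch
            return true;
        }
    }
    // Scalar tail for the leftover lanes after the chunked region.
    slice[chunks * lanes..].iter().any(|&e| (e >> 26) == kind)
}
```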
The FFmpeg pointer-end trick
The AVX2 paths use a register-saving technique borrowed from FFmpeg. Instead of:
mov rcx, 0
.loop:
vmovdqu ymm1, [rsi + rcx*4]
; ... compare ...
inc rcx
cmp rcx, r8 ; r8 = chunk count
jb .loop
The trick is:
lea rdi, [rsi + chunk_bytes] ; pointer to end of the chunked region
mov rcx, -chunk_bytes ; negative byte offset
.loop:
vmovdqu ymm1, [rdi + rcx]
; ... compare ...
add rcx, 32 ; byte stride = 32 (8 lanes × 4)
jl .loop
The negative offset doubles as the loop counter and the addressing-mode offset. The loop tail is add + jl (two instructions that macro-fuse into a single micro-op) instead of inc + cmp + jb (three instructions).
The savings are small per iteration but real, especially on long slices.
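A plain-Rust analogue of the loop shape may make the trick easier to follow. It is illustrative only: it uses element offsets where the assembly uses byte offsets, and a plain slice scan where the assembly uses vpcmpeqd + vpmovmskb:

```rust
// Plain-Rust analogue of the negative-offset loop; names and structure are illustrative.
fn contains_from_end(slice: &[u32], needle: u32) -> bool {
    let lanes = 8;                                            // AVX2: 8 × u32 per chunk
    let chunks = slice.len() / lanes;
    let end = unsafe { slice.as_ptr().add(chunks * lanes) };  // end of the chunked region
    let mut off: isize = -((chunks * lanes) as isize);        // negative offset, counts up to 0
    while off < 0 {
        // SAFETY: end.offset(off) stays inside the slice for every off taken by the loop.
        let chunk = unsafe { std::slice::from_raw_parts(end.offset(off), lanes) };
        if chunk.contains(&needle) {                          // stands in for the SIMD compare
            return true;
        }
        off += lanes as isize;                                // the asm's `add rcx, 32; jl .loop`
    }
    slice[chunks * lanes..].contains(&needle)                 // scalar tail
}
```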
NEON: intrinsics, not assembly
NEON lacks a per-lane movemask instruction; the scalar reduction is done with vmaxvq_u32 (max across lanes) or vminvq_u32 (min across lanes) plus a per-element-position bit-set trick. The intrinsics version is fast enough that hand-rolled assembly doesn't help.
NEON is baseline on AArch64 (every aarch64 CPU has it). No runtime detection needed; the SIMD path runs whenever the slice is long enough.
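A sketch of what the NEON path's per-chunk work looks like with intrinsics, again over raw u32 lanes and with an assumed function name; the per-element-position bit-set trick for recovering a matching lane index (needed by position_of) is omitted here, since an any-match test only needs vmaxvq_u32 on the compare result:

```rust
// Illustrative NEON-intrinsics version of the kind scan.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn any_of_kind_neon(slice: &[u32], kind: u32) -> bool {
    use core::arch::aarch64::*;

    let lanes = 4;                                   // 128 bits of 32-bit lanes
    let chunks = slice.len() / lanes;
    let wanted = vdupq_n_u32(kind);                  // broadcast once, outside the loop
    for c in 0..chunks {
        // SAFETY: c * 4 + 4 <= chunks * 4 <= slice.len(); vld1q_u32 tolerates unaligned loads.
        let v = vld1q_u32(slice.as_ptr().add(c * lanes));
        let kinds = vshrq_n_u32::<26>(v);            // shift right by 26: the kind tag
        let eq = vceqq_u32(kinds, wanted);           // all-ones lanes where the kind matches
        if vmaxvq_u32(eq) != 0 {                     // max across lanes: nonzero iff any match
            return true;
        }
    }
    // Scalar tail for the leftover lanes after the chunked region.
    slice[chunks * lanes..].iter().any(|&e| (e >> 26) == kind)
}
```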
Where the primitives are called from
The SIMD primitives are called from hot paths across the crate; a small usage sketch follows the list:
- lattice::overlaps and lattice::refines use simd::any_of_kind as a prefilter for the Negated, Intersected, Mixed, Object, etc. families: "if no Element has this kind, skip the family rule entirely".
- lattice::join's canonicalisation uses simd::contains to detect well-known dominators (MIXED, NEVER, BOOL, RESOURCE, etc.) before applying the rule.
- predicates::is_int, is_string, etc. use simd::all_of_kind as their core.
- predicates::contains_int, contains_string, etc. use simd::any_of_kind.
- intern_type's slow path uses simd::is_sorted_strict to skip the sort + dedup when the input is already canonical.
- TypeBuilder::remove and TypeBuilder::replace use simd::position_of.
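As a usage illustration only, with a stand-in scalar simd module and a hypothetical kind value (the crate's real signatures and constants may differ), the predicate and prefilter callers reduce to single primitive calls:

```rust
// Stand-in scalar simd module; illustrative, not the crate's real API surface.
mod simd {
    pub fn any_of_kind(slice: &[u32], kind: u32) -> bool {
        slice.iter().any(|&e| (e >> 26) == kind)
    }
    pub fn all_of_kind(slice: &[u32], kind: u32) -> bool {
        !slice.is_empty() && slice.iter().all(|&e| (e >> 26) == kind)
    }
}

const KIND_INT: u32 = 1; // hypothetical tag value

// predicates::is_int / contains_int shapes: one primitive call each.
fn is_int(elements: &[u32]) -> bool {
    simd::all_of_kind(elements, KIND_INT)
}

fn contains_int(elements: &[u32]) -> bool {
    simd::any_of_kind(elements, KIND_INT)
}

// Lattice-style prefilter: if neither side has the kind, the family rule cannot fire.
fn negated_family_can_apply(lhs: &[u32], rhs: &[u32], kind_negated: u32) -> bool {
    simd::any_of_kind(lhs, kind_negated) || simd::any_of_kind(rhs, kind_negated)
}
```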
When the SIMD threshold isn't met
Most analyser-side unions are small (1-5 Elements). The threshold gates ensure that the SIMD code only runs when there's enough work to pay back the setup. On short slices, the scalar fallback runs; LLVM autovectorises what it can.
Why hand-rolled assembly
The autovectoriser generates correct SIMD code, but it doesn't know:
- That the loop bound is the chunk count (not the byte count).
- That the broadcast can be done once outside the loop.
- That vpcmpeqd + vpmovmskb + test + jnz is a tighter early-exit than the equivalent generated code.
- That the FFmpeg pointer-end trick saves one micro-op per iteration.
For the AVX2 paths, hand-rolled assembly is consistently 10-30% faster than the autovectorised scalar on long slices. For the NEON paths, intrinsics with a carefully written loop are within 5% of hand-rolled assembly, and the maintenance burden is much lower.
Safety
Every SIMD function is marked unsafe fn. The public entry points are safe and gate on the threshold + the runtime feature detection (for AVX2). Inside the SIMD function:
- Unaligned loads (vmovdqu, vld1q_u32): both architectures support these without alignment requirements.
- Bounds: the function is called only when the slice is at least THRESHOLD lanes; the chunk count is len / lanes_per_chunk, so the chunked loads never read past the end of the slice.
- Tail handling: each function has a scalar tail loop for the leftover lanes after the chunked region.
The unsafe blocks are local to each function and have SAFETY comments documenting the invariants.
Performance numbers
Approximate, on a modern x86_64 desktop, for a slice of 64 Elements:
- simd::contains (hit early): ~5ns.
- simd::contains (miss to end): ~20ns.
- simd::any_of_kind (hit early): ~6ns.
- simd::any_of_kind (miss to end): ~25ns.
- simd::is_sorted_strict (already sorted, 64 elements): ~30ns.
- Scalar equivalent (64 elements): ~3-5x slower across the board.
For typical analyser unions (5-10 Elements), the scalar fallback runs and the cost is on the order of the per-Element work itself: nanoseconds per call.
See also: The ElementId tag layout, Interning and the arenas, Performance philosophy.