
Why I Rewrote DDS From Scratch in Rust

FastDDS has 300K lines of C++. CycloneDDS needs 50+ dependencies. After years of wrestling with existing DDS implementations, I decided there had to be a better way.

The Problem

DDS (Data Distribution Service) is the backbone of modern robotics and defense systems. ROS2 uses it. Military systems use it. Yet every existing implementation feels like it was designed in a different era. RTI? $$$. FastDDS? 300K lines of C++. CycloneDDS? 50+ dependencies. There had to be a better way.

```shell
# dependency hell
$ ldd /usr/lib/libfastrtps.so | wc -l
47

$ du -sh /opt/ros/humble/lib/libfastrtps*
12M     /opt/ros/humble/lib/libfastrtps.so.2.6.4
```

47 shared library dependencies. 12MB just for the RTPS layer. And that's before you add your actual application logic.

Why Rust?

DDS is fundamentally about moving data between processes without corruption, without copies, without surprises. That description maps almost perfectly to what Rust's ownership model enforces at compile time.

Zero-copy message passing? The borrow checker guarantees no one reads a buffer while someone else writes to it. Lock-free ring buffers? Rust's type system prevents data races statically. No garbage collector means no latency spikes in the middle of a 257-nanosecond publish path.

But the real reason is debuggability. When a C++ DDS stack segfaults at 3 AM in production, you get a core dump and a prayer. When Rust panics, you get a backtrace with the exact line, the exact invariant that was violated. After 20 years of C/C++ debugging, that alone was worth the rewrite.

The Constraints

I set myself three non-negotiable requirements:

  1. Zero external dependencies in the core library. Not "minimal" -- zero.
  2. Sub-microsecond latency for intra-process communication. No excuses.
  3. Actually debuggable. When something goes wrong, I want to understand why.

The Architecture

HDDS is built on a foundation of zero-copy message passing. When you publish a message, it doesn't get copied. When a subscriber reads it, no copy happens there either.

```rust
// zero_copy.rs
// Loan a sample from the writer's pool
let mut sample = writer.loan_sample()?;

// Write directly into shared memory
sample.position.x = sensor.read_x();
sample.position.y = sensor.read_y();
sample.timestamp = now();

// Publish -- no copy, just ownership transfer
writer.publish(sample)?;
```

The secret is SPSC (Single Producer Single Consumer) lock-free ring buffers. Each writer-reader pair gets its own channel. No locks, no contention, no surprises.
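The core trick can be shown in a few dozen lines. This is a minimal sketch of the idea, not HDDS's actual shared-memory implementation: because exactly one thread writes `tail` and exactly one writes `head`, two atomic counters are enough -- no locks anywhere.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const CAP: usize = 8; // power of two in practice; any capacity works here

// Single-producer single-consumer ring: `tail` is written only by the
// producer, `head` only by the consumer, so no locking is required.
pub struct Spsc<T> {
    slots: [UnsafeCell<Option<T>>; CAP],
    head: AtomicUsize, // next index the consumer will read
    tail: AtomicUsize, // next index the producer will write
}

// Safe: each slot is touched by at most one thread at a time,
// enforced by the head/tail protocol below.
unsafe impl<T: Send> Sync for Spsc<T> {}

impl<T> Spsc<T> {
    pub fn new() -> Self {
        Self {
            slots: std::array::from_fn(|_| UnsafeCell::new(None)),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    // Producer side: returns the value back when the ring is full.
    pub fn push(&self, value: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == CAP {
            return Err(value); // full: backpressure, not blocking
        }
        unsafe { *self.slots[tail % CAP].get() = Some(value) };
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    // Consumer side: returns None when the ring is empty.
    pub fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None; // empty
        }
        let value = unsafe { (*self.slots[head % CAP].get()).take() };
        self.head.store(head.wrapping_add(1), Ordering::Release);
        value
    }
}

fn main() {
    let ring = Spsc::new();
    for i in 0..CAP {
        ring.push(i).unwrap();
    }
    assert!(ring.push(99).is_err()); // full
    assert_eq!(ring.pop(), Some(0)); // FIFO order preserved
    assert!(ring.push(99).is_ok());
}
```

The Release store on `tail` paired with the Acquire load in `pop` is what makes the written slot visible to the consumer before the index advances.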

The transport layer alone is 44,000+ lines of Rust -- not because it's bloated, but because it handles 8 different transport types: UDP (unicast + multicast), TCP with TLS 1.3, QUIC for NAT traversal, shared memory with futex-based wakeup, IEEE 802.1 TSN for time-sensitive networking, a low-bandwidth mode for satellite/LoRa links (9.6 kbps to 2 Mbps with LZ4 compression), and intra-process channels. Each one is a different world with its own constraints.

The Interop War: 127 Iterations

Writing a DDS implementation is one thing. Making it talk to RTI Connext, FastDDS, CycloneDDS, and OpenDDS is another beast entirely. It took 127 iterative versions to get there. Here's what that journey looked like.

v23-v40: Discovery foundations. Getting SPDP (participant discovery) and SEDP (endpoint discovery) working from scratch. Multicast listeners, GUID routing, RTPS compliance. The boring stuff that has to be perfect.

v58: The first real milestone. After 24+ hours of straight debugging, port 7411 finally accepted user data. A status report was written at this point -- not out of pride, but out of relief.

v61: Five critical blockers, resolved in one shot. Stateful RTPS parsing for INFO_DST/INFO_TS context. Service-Request endpoints. QoS defaults (RELIABLE vs BEST_EFFORT -- getting this wrong means silent data loss). QoS propagation. Unicast addressing. 859 tests passing.

v73: The socket binding revelation. HDDS was binding to 0.0.0.0:7410 instead of the primary interface IP. On a machine with Docker bridges and multiple NICs, packets were going nowhere. A one-line fix that took hours to diagnose.
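The class of bug, sketched with std's `UdpSocket` (addresses here are illustrative -- loopback and port 0 so the sketch runs anywhere): binding to the wildcard address delegates interface selection to the kernel's routing table, which on a multi-homed host may prefer a Docker bridge over the real NIC.

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // Fragile on multi-homed hosts: 0.0.0.0 leaves the source interface
    // up to the kernel, which may route via a Docker bridge.
    let _wildcard = UdpSocket::bind("0.0.0.0:0")?;

    // Safer: bind explicitly to the chosen interface's address.
    // (127.0.0.1 here for portability; in practice this would be the
    // discovered primary NIC address.)
    let sock = UdpSocket::bind("127.0.0.1:0")?;
    assert_eq!(sock.local_addr()?.ip().to_string(), "127.0.0.1");
    Ok(())
}
```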

v89: The byte that changed everything. RTI was silently ignoring HDDS participants. After a byte-by-byte hexdump comparison between RTI and HDDS packets, the culprit was found: BUILTIN_ENDPOINT_SET was 0x3F instead of 0x0C3F. One missing byte carrying two flag bits -- the ParticipantMessage endpoints. RTI wouldn't talk to you without them, but wouldn't tell you either.

```shell
# the byte that took days
# RTI packet (working)
offset 0x88: 3f 0c 00 00

# HDDS packet (broken)
offset 0x88: 3f 00 00 00
#                ^^ missing 0x0C = ParticipantMessage endpoints
```
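Decoded, the mask looks like this (bit assignments per the RTPS BuiltinEndpointSet_t; constant names here are my own):

```rust
// Low six bits: SPDP participant announcer/detector plus the SEDP
// publication and subscription announcer/detector endpoints.
const DISCOVERY_ENDPOINTS: u32 = 0x3F; // bits 0..=5

// Bits 10 and 11: the ParticipantMessage writer/reader used for
// automatic liveliness -- the flags RTI silently required.
const PARTICIPANT_MESSAGE_WRITER: u32 = 1 << 10; // 0x0400
const PARTICIPANT_MESSAGE_READER: u32 = 1 << 11; // 0x0800

fn main() {
    let broken = DISCOVERY_ENDPOINTS;
    let fixed =
        DISCOVERY_ENDPOINTS | PARTICIPANT_MESSAGE_WRITER | PARTICIPANT_MESSAGE_READER;
    assert_eq!(broken, 0x003F); // what HDDS was advertising
    assert_eq!(fixed, 0x0C3F);  // what RTI expects
}
```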

v90: Talking to yourself. HDDS was sending ACKNACKs to its own multicast packets. The fix: GUID prefix filtering at two locations. Simple in hindsight, maddening to debug when you see valid-looking RTPS traffic that goes nowhere.
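The guard itself is conceptually tiny (function name hypothetical): compare the 12-byte GUID prefix of every incoming RTPS message against your own before the message reaches the reliability layer.

```rust
// A 12-byte RTPS GUID prefix uniquely identifies a participant.
type GuidPrefix = [u8; 12];

// Hypothetical receive-path guard: ignore our own multicast echoes.
fn should_process(own_prefix: &GuidPrefix, msg_prefix: &GuidPrefix) -> bool {
    own_prefix != msg_prefix
}

fn main() {
    let me: GuidPrefix = [1; 12];
    let peer: GuidPrefix = [2; 12];
    assert!(!should_process(&me, &me));  // own packet looped back: drop
    assert!(should_process(&me, &peer)); // real peer: process
}
```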

v106: FastDDS's surprise. FastDDS sends vendor-specific PIDs (0x0038, 0xe800) before PID_PARTICIPANT_GUID. HDDS was only checking the first PID. Switching to a full scan of all PIDs fixed FastDDS discovery instantly.
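The shape of the fix, as a sketch (PID values and the (pid, length) layout per the RTPS little-endian parameter-list encoding; the 0x0001 sentinel terminates the list): walk every parameter, skipping the ones you don't recognize, instead of inspecting only the first.

```rust
const PID_SENTINEL: u16 = 0x0001;
const PID_PARTICIPANT_GUID: u16 = 0x0050;

// Scan a little-endian RTPS parameter list for one PID, skipping any
// vendor-specific parameters that precede it.
fn find_param(buf: &[u8], wanted: u16) -> Option<&[u8]> {
    let mut off = 0;
    while off + 4 <= buf.len() {
        let pid = u16::from_le_bytes([buf[off], buf[off + 1]]);
        let len = u16::from_le_bytes([buf[off + 2], buf[off + 3]]) as usize;
        off += 4;
        if pid == PID_SENTINEL {
            return None; // end of list
        }
        if pid == wanted {
            return buf.get(off..off + len);
        }
        off += len; // not it -- skip and keep scanning
    }
    None
}

fn main() {
    // Vendor-specific PID 0xe800 first (as FastDDS sends), then the GUID
    // parameter, then the sentinel. Payload bytes are dummies.
    let mut buf = vec![0x00, 0xe8, 0x04, 0x00, 1, 2, 3, 4];
    buf.extend_from_slice(&[0x50, 0x00, 0x04, 0x00, 9, 9, 9, 9]);
    buf.extend_from_slice(&[0x01, 0x00, 0x00, 0x00]);
    assert_eq!(find_param(&buf, PID_PARTICIPANT_GUID), Some(&[9u8, 9, 9, 9][..]));
}
```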

v122-v127: The dialect detector. Each vendor has quirks. RTI uses different SEDP locator encoding. FastDDS has relaxed alignment. CycloneDDS uses big-endian in some fields. OpenDDS adds custom handshake packets. The solution: a vendor dialect detection FSM that auto-adapts the parser at runtime. Vendor ID detected, confidence 100%, protocol adjusted. No configuration needed.
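Detection is cheap at the entry point, because bytes 6..8 of every RTPS header carry a vendor ID (values from the OMG vendor-ID registry; the per-dialect parser adjustments themselves are HDDS-specific and not shown):

```rust
#[derive(Debug, PartialEq)]
enum Dialect {
    RtiConnext,
    FastDds,
    CycloneDds,
    OpenDds,
    Unknown,
}

// The RTPS header is "RTPS" (4 bytes), protocol version (2 bytes),
// then the vendor ID (2 bytes) -- enough to pick a dialect up front.
fn detect(vendor_id: [u8; 2]) -> Dialect {
    match vendor_id {
        [0x01, 0x01] => Dialect::RtiConnext,
        [0x01, 0x0F] => Dialect::FastDds,   // eProsima
        [0x01, 0x10] => Dialect::CycloneDds, // Eclipse
        [0x01, 0x03] => Dialect::OpenDds,    // OCI
        _ => Dialect::Unknown,
    }
}

fn main() {
    assert_eq!(detect([0x01, 0x0F]), Dialect::FastDds);
    assert_eq!(detect([0x01, 0x01]), Dialect::RtiConnext);
    assert_eq!(detect([0x7F, 0x7F]), Dialect::Unknown);
}
```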

End result: 9 out of 9 interop scenarios passing. 50/50 samples delivered in every direction. RTI 6.1, RTI 7.3, FastDDS, CycloneDDS, OpenDDS -- all talking to HDDS without any special configuration.

The Results

  • 246 us p50 UDP latency (64B)
  • 1.8x vs CycloneDDS
  • 4.6x improvement (v208-v212)
  • 0 dependencies

Let's be honest about the numbers. On real hardware (an i9-9900KF and a dual-Xeon box), the current UDP loopback p50 is 246 microseconds for a 64-byte message. CycloneDDS does 133 microseconds -- so HDDS is about 1.8x slower.

But it started at 1,134 microseconds. Five optimization iterations -- polling reduction, WakeNotifier with condvar, atomic fast-path with spin-before-wait, mio/epoll event-driven I/O -- brought it down 4.6x. The remaining gap is a known memcpy (temp buffer to pool) and thread handoff overhead. io_uring and zero-copy receive are on the roadmap.

The shared memory transport targets sub-microsecond latency with lock-free ring buffers and futex-based wakeup. The serialization layer (CDR2) does 50 nanoseconds for a 64-byte payload. History cache inserts: 150 nanoseconds. The raw building blocks are fast -- the integration path is where the work continues.

  • 3,347 tests passing
  • 9/9 vendor interop
  • 5 language SDKs
  • 50 ns CDR2 serialize

From ESP32 to Kubernetes

One thing I refused to do: abandon embedded. Most DDS implementations treat microcontrollers as second-class citizens. HDDS has a dedicated no_std micro implementation that runs on an ESP32 with 600 KB of flash.

The embedded story isn't theoretical -- it's validated on real hardware. Pi Zero 2W talking to a Pi Zero v1 over WiFi: 10/10 messages, 0% loss. ESP32-WROOM talking to an x86_64 Linux box: 10/10 messages. Two ESP32s chatting over a 433 MHz HC-12 radio: 18/18 messages. No heap allocations in the critical path, no floating point, custom CDR2 subset, and multi-hop mesh relay so constrained devices can be routers.

On the other end of the spectrum: Kubernetes with DNS-based discovery (zero dependencies, just Headless Services), AWS Cloud Map, Azure Service Discovery, HashiCorp Consul. And in between: automotive diagnostics. HUDS, a companion project (29,000 lines of Rust, 142 tests), implements ISO 14229 UDS over CAN with 14-microsecond median response time and zero frame loss at 80% bus utilization. It's designed to bridge into HDDS -- turning point-to-point ECU diagnostics into distributed pub/sub where N clients monitor N ECUs simultaneously. Same protocol stack, same QoS semantics, from a microcontroller to an automotive gateway to the cloud.

Security by Design

Security wasn't bolted on after the fact. DDS Security v1.1 is implemented at the protocol layer: X.509 PKI with certificate chain validation, AES-256-GCM encryption with ECDH P-256 key exchange, HKDF session key derivation, and topic-level access control via Permissions XML.

Every RTPS submessage can be encrypted with session keys derived from the ECDH handshake. The authentication handshake takes 10-50 ms per participant (one-time, during discovery). The crypto itself adds negligible overhead -- AES-GCM is hardware-accelerated on any modern CPU.

The audit logging follows ANSSI guidelines: hash-chain integrity, file and syslog backends, tamper detection. Because when you're running DDS in a defense or critical infrastructure context, "trust me, it's encrypted" isn't enough.
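Hash-chain integrity, reduced to its essence -- a toy sketch using std's `DefaultHasher` as a stand-in for the SHA-256 the real log would use: each entry commits to the hash of its predecessor, so editing or truncating any record breaks every hash after it.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Each log entry stores hash(previous hash, message), so the whole
// chain can be re-verified front to back.
fn link(prev: u64, msg: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    msg.hash(&mut h);
    h.finish()
}

fn verify(entries: &[(String, u64)]) -> bool {
    let mut prev = 0u64;
    for (msg, stored) in entries {
        if link(prev, msg) != *stored {
            return false; // tampering detected
        }
        prev = *stored;
    }
    true
}

fn main() {
    let mut log = Vec::new();
    let mut prev = 0u64;
    for msg in ["participant joined", "topic created", "key rotated"] {
        let h = link(prev, msg);
        log.push((msg.to_string(), h));
        prev = h;
    }
    assert!(verify(&log));
    log[1].0 = "topic deleted".to_string(); // tamper with one record
    assert!(!verify(&log)); // every later hash is now invalid
}
```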

The Paranoia Pipeline

I'm obsessive about code quality. So I built a release validation pipeline that would make a defense auditor smile. Every release candidate goes through 14 gates: compilation with all features, 2,321 unit tests, Clippy in strict mode (warnings = errors), a 20-layer extreme audit scan, interop tests against 7 vendors with PCAP capture, fuzzing (30 seconds per target), Valgrind for memory leaks, CVE dependency audit, golden vector verification (42 CDR2 test vectors), trust package validation (SBOM + SHA-256 checksums), documentation build, C++ SDK compilation, and a live pub/sub smoke test.

The extreme audit scanner alone runs 20 layers of analysis:

```
# extrem-audit-scan.sh -- 20 layers
Layer  0: Core validation gates (cargo check/test/clippy)
Layer  1: Anti-stub enforcement (no TODO/FIXME/HACK/XXX)
Layer  2: Type safety audit (dangerous casts)
Layer  3: Unsafe code audit (ANSSI/IGI-1300 compliance)
Layer  4: Complexity analysis (McCabe + cognitive)
Layer  5: Panic/unwrap audit
Layer  6: Memory patterns audit
Layer  7: Dependency audit
Layer  8: Clippy ultra-hardened mode (ALL lints)
Layer  9: Documentation coverage
Layer 10: Concurrency audit
Layer 11: License and copyright
Layer 12: Performance antipatterns
Layer 13: RTPS/DDS specification compliance
Layer 14: Test coverage
Layer 15: Unsafe code budget (cargo-geiger)
Layer 16: Swallowed results detection
Layer 17: Unused dependencies (cargo-udeps)
Layer 18: Secrets detection (passwords, tokens, API keys)
Layer 19: Code duplication analysis
```

The unsafe budget is tracked against ANSSI recommendations: HDDS has 13 unsafe blocks in the core library. For context, tokio has ~170, crossbeam ~90, bytes ~40. Every unsafe block has an @audit-ok marker explaining why it's necessary -- lock-free ring buffers, shared memory operations, raw socket multicast, CDR2 serialization with alignment. No exceptions.

The entire evidence package -- logs, PCAP captures, audit reports -- is bundled into a signed ZIP with SHA-256 checksums. If you need to prove to a certification body that your DDS stack passed all gates, the evidence is there, timestamped and optionally PGP-signed. Exit code 0 means "approved for release". Anything else means "do not ship".

Compliance targets: ANSSI/IGI-1300 (French defense), Common Criteria EAL4+, DO-178C Level B (airborne systems), ISO 26262 ASIL-D (automotive), OMG DDS/RTPS v2.5. Not all fully certified yet -- but the infrastructure to get there is already running on every commit.

The Full Picture

HDDS implements all 22 QoS policies from the DDS v1.4 specification -- not stubs, real implementations tested in 96 combinations. Reliability with NACK-driven retransmission. Durability with persistent storage. Deadline monitoring, liveliness tracking, ownership arbitration, partition isolation. Each policy has its own state machine, and they compose correctly.
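As one concrete example of such a state machine -- a minimal sketch, not HDDS's implementation -- DEADLINE tracks the gap between consecutive samples per instance and reports a missed deadline once the configured period elapses without a new sample:

```rust
use std::time::{Duration, Instant};

// Minimal DEADLINE monitor for one instance: a sample arrival restarts
// the period; a check past the period records a missed deadline.
struct DeadlineMonitor {
    period: Duration,
    last_sample: Instant,
    missed_total: u32,
}

impl DeadlineMonitor {
    fn on_sample(&mut self) {
        self.last_sample = Instant::now();
    }

    // Driven by a timer in a real stack (that plumbing is not shown):
    // returns true if the deadline was just missed.
    fn check(&mut self, now: Instant) -> bool {
        if now.duration_since(self.last_sample) > self.period {
            self.missed_total += 1;
            self.last_sample = now; // restart the period
            true
        } else {
            false
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let mut mon = DeadlineMonitor {
        period: Duration::from_millis(100),
        last_sample: t0,
        missed_total: 0,
    };
    assert!(!mon.check(t0 + Duration::from_millis(50))); // still on time
    assert!(mon.check(t0 + Duration::from_millis(180))); // 100 ms exceeded
    mon.on_sample();
    assert_eq!(mon.missed_total, 1);
}
```

Composition is the hard part the paragraph above alludes to: this monitor has to interact correctly with LIVELINESS, OWNERSHIP, and the reliability machinery, which is where the 96 tested combinations come in.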

The IDL codegen (hdds_gen) is 33,000 lines of Rust that parses the full OMG IDL 4.2 specification -- structs, unions, enums, bitsets, bitmasks, modules, annotations, bounded sequences, maps, even a complete C-style preprocessor with macro expansion and cycle detection. It generates type-safe code for 7 backends: Rust, C++, C, Python, TypeScript, plus two embedded variants (no_std Rust and MCU C). The first version was written in C/C++ back in 2022 -- the Rust rewrite kept the hard-won knowledge of every IDL edge case while making the codebase actually maintainable.

The real validation came from 50 production IDL files from NATO/defense systems. First run: 5 out of 50 parsed successfully. Bounded strings, struct inheritance, qualified namespaces, preprocessor directives inside modules, Windows-1252 encoding (42 out of 50 files were legacy Latin-1) -- each failure revealed a gap, each fix brought the count up. After 8 rounds of fixes: 50/50. For comparison, on a separate suite of 16 canonical test files, FastDDS's own codegen (fastddsgen) fails on 8 -- bitsets, const expressions, fixed-point decimals -- and CycloneDDS's idlc fails on 9. hdds_gen parses all 16.

And then there's the tooling -- 110,000 lines of it across two standalone applications.

HDDS Studio (31K Rust + 25K TypeScript) is a visual IDL editor built on Tauri and React Flow. You drag structs, enums, and unions onto a canvas, connect them visually, and export production-ready IDL 4.2. Round-trip fidelity is byte-identical on 57 out of 64 golden test files, validated by 1,024+ property-based fuzz tests with zero panics. All five performance SLOs are exceeded: validation under 10ms, export under 100ms, 60 FPS maintained, memory under 180 MB. The licensing uses Ed25519 signatures with SHA-256 hardware fingerprinting -- offline-first, no phone-home, no telemetry.

HDDS Viewer (53K lines of Rust, 21 egui panels) is a desktop observability tool with a custom binary capture format (.hddscap) that sustains 625 MB/s I/O. It replays captured DDS traffic, decodes CDR2 messages, renders force-directed topology graphs, and runs anomaly detection through an ensemble of 15 ONNX models trained on 43.5 million synthetic DDS frames -- with zero false positives on normal traffic (after a v2 that had 100% false positives, lesson learned). The AI assistant supports three backends: Ollama for air-gapped environments, Claude CLI, and OpenAI -- with local RAG over SQLite and tree-sitter AST context. No cloud required.

On top of that: hddsctl for CLI management, a latency probe, a stress tester, a shared memory inspector, and PCAP diff tools for wire-level interop validation. Because production systems need observability, not just performance.

Lessons Learned

Hexdumps don't lie. When RTI silently refused to talk to HDDS, no amount of log reading helped. What worked: capturing both sides with tcpdump, aligning the hex output byte by byte, and diffing. The BUILTIN_ENDPOINT_SET bug (v89) was found this way. So was the FastDDS PID ordering issue (v106). When protocol debugging fails, go to the wire.

Every vendor has a dialect. The DDS specification is precise. The implementations... interpret. RTI has 4 different SEDP variants depending on version. FastDDS is relaxed about alignment. CycloneDDS uses big-endian where others use little-endian. OpenDDS adds custom handshake packets. You can't hardcode for one vendor -- you need a dialect detector.

The boring parts take the longest. Zero-copy ring buffers? Two weeks. The reliability protocol (heartbeat/NACK/gap with selective repair, exponential backoff, out-of-order buffering)? Two months. Discovery with vendor compatibility? Ongoing. The glamorous parts are 10% of the work. The archives tell the story: 379 markdown files of design docs, debug notes, PCAP analyses, and architecture decisions. 66 WIP tracking files. 44 logged working sessions between October and December 2025 alone -- and that's just the Rust rewrite. The IDL parser (hdds_gen) started in C/C++ back in 2022.

What's Next

HDDS isn't deployed in production yet -- it's still a young project. But the foundation is solid: 314,000 lines of Rust (HDDS core + hdds_gen codegen), 3,347 tests with 0 failures, 9/9 vendor interop scenarios passing, and a release validation pipeline that would satisfy a defense auditor.

The performance gap with CycloneDDS (1.8x on UDP) is the main frontier. Zero-copy receive, io_uring kernel bypass, and DPDK/XDP are the next steps. The building blocks are already sub-microsecond -- it's the integration plumbing that needs work.

If you're building something that needs reliable, debuggable data distribution -- whether it's a ROS2 robot, an embedded sensor network, or a cloud-native microservice mesh -- give it a try. The source is available, the documentation is comprehensive, and I'm always happy to help.