Kernel · Networks · AI/ML Infrastructure

Naresh Kumar

Systems Engineer building AI/ML Infrastructure

I work at the intersection of Linux kernel internals, high-performance networking, and AI/ML accelerator infrastructure. From writing eBPF programs to tuning DPDK datapaths, my focus is on squeezing every cycle out of the hardware stack.

eBPF / XDP · DPDK · RDMA / RoCE · SR-IOV · virtio / vhost · netfilter · tc / qdisc · GPU networking · netdev · Linux kernel
→ Read my blog · Get in touch
01

AI/ML Infra Roadmap

Stages: Foundation · Networking · Accelerators · AI/ML Infra · Advanced

Linux Kernel: VFS · MM · scheduler · syscalls
C / Rust Systems: unsafe Rust · libc · perf tooling
Device Drivers: PCI · PCIe · char/net drivers
Tracing & Observability: perf · ftrace · bpftrace · BCC
eBPF / XDP: kernel bypass · packet filter · maps
DPDK: PMD · huge pages · RSS / FDIR
netfilter / tc / qdisc: conntrack · HTB · fq_codel
SR-IOV / VF / virtio: vhost-net · VFIO · passthrough
InfiniBand / RoCE: libibverbs · CM · QP management
RDMA / RoCE v2: zero-copy · SEND/RECV/WRITE/READ
GPU Architecture: CUDA streams · UVM · PCIe BAR
GPUDirect / NVLink: P2P DMA · NVSwitch topology
DMA / IOMMU: scatter-gather · SMMU · ATS
SmartNIC / DPU: DOCA · NVIDIA BlueField offload
AI Cluster Networking: Rail-optimized · fat-tree · RailOpt
NCCL / MPI: allreduce · ring-allreduce tuning
Storage for AI: NVMe-oF · Lustre · GDS
Container / Orchestration: K8s device plugins · GPU operator
Custom ASIC/FPGA: Verilog · HLS
Kernel Contributions: upstream patches
P4 / eBPF Offload: programmable data planes

Legend: Completed · In Progress / Planned · Advanced / Stretch
Overall journey: ~40% complete
02

Skills & Stack

Kernel & Systems
Linux Kernel
kernel modules · sched · VFS · memory mgmt · netdev
Proficiency: 85%
High-Perf Networking
eBPF / XDP / DPDK
XDP programs · TC hooks · BPF maps · PMD · DPDK mbuf
Proficiency: 80%
Virtualization
SR-IOV / virtio
VF management · VFIO · vhost-net · QEMU · KVM
Proficiency: 70%
RDMA / Fabric
InfiniBand / RoCE
libibverbs · QP / CQ · RDMA Write · RoCE v2 · ECN/PFC
Proficiency: 65%
Languages
C / Rust / Python
C99 · unsafe Rust · libbpf · Python scripts · Go
Proficiency: 82%
Observability
Perf & Tracing
perf · ftrace · bpftrace · BCC · strace / ltrace
Proficiency: 75%
03

Blog

eBPF · Apr 2025 · 8 min read
Writing Your First XDP Program: Packet Drop at Line Rate
How I built an in-kernel, driver-level packet filter using XDP that processes 14 Mpps on a single core, and what I learned about BPF maps, verifier errors, and the AF_XDP socket interface.
Read post →
DPDK · Mar 2025 · 10 min read
DPDK PMD Internals: How Poll Mode Drivers Kill Interrupt Overhead
A deep dive into DPDK's poll mode driver model, huge page allocation, NIC RSS configuration, and why removing interrupts is the biggest win in high-throughput packet I/O.
Read post →
RDMA · Feb 2025 · 12 min read
RDMA from Scratch: Queue Pairs, Completions, and Zero-Copy Transfers
Understanding the verbs API, how protection domains and memory regions work, and writing a simple RDMA Write benchmark that achieves sub-2μs latency between two hosts.
Read post →
AI Infra · Jan 2025 · 7 min read
Why AI Clusters Need RoCE: NCCL, Congestion, and PFC Deadlocks
How GPU-to-GPU communication in distributed training works, why RoCE v2 is often preferred over InfiniBand in hyperscale deployments, and how PFC pause frames can deadlock your whole fabric.
Read post →
Kernel · Dec 2024 · 9 min read
SR-IOV Deep Dive: Virtual Functions and the VFIO Framework
How SR-IOV splits a physical NIC into virtual functions, how the VFIO driver exposes them safely to VMs, and the performance implications vs. software switching with OVS.
Read post →
Coming Soon
GPUDirect RDMA: Bypassing the CPU for DMA Transfers to GPU Memory
How GPUDirect allows NIC ↔ GPU direct DMA, the BAR mapping trick, and benchmarking MPI collectives with vs. without GPUDirect enabled.
Draft in progress…
04

Achievements

🛠️
Built a 14 Mpps XDP Packet Classifier
Designed and implemented a multi-rule stateless packet classifier using XDP + BPF maps that handles 14 million packets per second on a single CPU core, with sub-10ns per-packet processing overhead.
Apr 2025
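For context on the 14 Mpps figure, here is a rough sketch of the per-packet time budget a single core has at that rate. The 3.0 GHz clock is an assumed value for illustration, not a measurement from my testbed, and the helper names are hypothetical.

```python
# Per-packet budget for one core at a given packet rate.
# ASSUMPTION: the 3.0 GHz clock is illustrative, not a measured value.

def per_packet_budget_ns(packets_per_second: float) -> float:
    """Average wall-clock time available per packet, in nanoseconds."""
    return 1e9 / packets_per_second

def per_packet_budget_cycles(packets_per_second: float, clock_hz: float) -> float:
    """Average CPU cycles available per packet at the given clock rate."""
    return clock_hz / packets_per_second

budget_ns = per_packet_budget_ns(14e6)
budget_cycles = per_packet_budget_cycles(14e6, 3.0e9)

print(f"{budget_ns:.1f} ns per packet")                     # 71.4 ns per packet
print(f"{budget_cycles:.0f} cycles per packet at 3.0 GHz")  # 214 cycles per packet at 3.0 GHz
```

Everything the classifier does per packet (parsing, map lookups, the verdict) has to fit inside that budget alongside the driver's own receive work.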
📄
Open Source Contribution — libbpf
Submitted a bug fix, since merged, for incorrect BTF type resolution in libbpf when dealing with anonymous structs inside unions. My first upstream, kernel-adjacent contribution.
Mar 2025
🎓
Completed LFD460: Linux Kernel Internals & Development
Finished the Linux Foundation's advanced kernel development course covering driver writing, kernel debugging, memory management, and synchronization primitives. Scored in the top 5% of the cohort.
Feb 2025
🔬
Benchmarked RDMA Write at 1.8μs Latency
Set up a dual-host RDMA testbed with Mellanox ConnectX-5 NICs, tuned ECN and PFC parameters, and achieved 1.8μs median latency on 64-byte RDMA Write operations over RoCE v2.
Jan 2025
📝
Started this Blog & Technical Journey Log
Launched this site as a public accountability log for my AI/ML Infrastructure learning path — documenting experiments, benchmarks, and deep dives as I build toward kernel + networking roles at AI companies.
Dec 2024
05

Get In Touch

Currently open to kernel & networking roles

I'm actively looking for roles in semiconductor, networking, and AI/ML infrastructure companies. If you're building low-latency systems, HPC fabrics, or AI accelerator software — let's talk. Target locations include Canada 🇨🇦 and Western Europe (Germany / Netherlands) 🇩🇪🇳🇱.