Summary

Performance Engineer specializing in inference architecture, hardware/software co-design, and E2E performance for distributed GenAI systems. Proven lead for large-scale silicon enablement, with deep expertise in optimizing the full execution stack and architecting high-utilization inference engines.

Work experience

Senior Software Engineer

2025/07 - Present
Meta - Toronto, ON
  • Lead E2E GenAI inference performance in MTIA, architecting a modern inference engine with distributed parallelism support and serving features such as disaggregated inference
  • Coordinated model enablement and performance for a flagship MTIA initiative, leading 20+ engineers to secure a 1-gigawatt hardware commitment
  • Optimized async D2H pipelining and load balancing, cutting host overhead by 20% and TTFT by 16% on Llama3
  • Developed fused compute and collective kernels to mask communication latency in distributed workloads
  • Ranked #1 in engineering velocity within Foundation Org, averaging 400+ commits and 1,000+ code reviews per half while scaling the team by 5+ engineers

Senior Staff Software Engineer - Tech Lead

2024/07 - 2025/06
Untether AI - Toronto, ON
  • Led the design and architecture of next-generation inference compiler technologies for LLMs
  • Delivered industry-leading MLPerf results across BERT Large and ResNet-50 benchmarks
  • Drove silicon spin readiness, analyzing ECO cost-risk tradeoffs and guiding compiler-related silicon decisions
  • Managed the deployment of Llama3 8B and 70B across multi-chip, rack-scale inference systems

Staff Software Engineer - Deep Learning Performance

2023/01 - 2024/07
Untether AI - Toronto, ON
  • Led hardware/software co-design and performance optimization for a generative AI inference accelerator
  • Managed and trained a team of 17 contractors, deploying CNN workloads across a spatial compute fabric
  • Designed kernel implementations and contributed to core software stack architecture
  • Developed spatial placement and routing strategies to maximize inference throughput and bandwidth
  • Collaborated with clients on custom neural network applications
  • Owned hardware bring-up planning and execution, collaborating across the entire stack to enable first-silicon readiness

Deep Learning Engineer

2021/08 - 2023/01
Untether AI - Toronto, ON
  • Developed ingestion, quantization, and post-training optimization pipelines for CNNs and transformers
  • Designed INT8 and FP8 quantization algorithms, and developed PTQ algorithms such as knowledge distillation
  • Implemented graph-level optimizations including pruning, fusion, and post-training layer swapping

Support Researcher

2020/04 - 2021/08
Huawei R&D Laboratory - Waterloo, ON
  • Co-developed and benchmarked a novel Remote Differential Compression algorithm for efficient file synchronization
  • Published and patented methods for scalable delta encoding across heterogeneous devices

Education

B.A.Sc. in Systems Design Engineering - Artificial Intelligence

2016/08 - 2021/05
University of Waterloo
  • Summa Cum Laude / Dean's Honors List

Publications

  • Bhatt, Ramón et al. "Unsupervised Detection of Lung Nodules in Chest Radiography Using Generative Adversarial Networks", EMBC Annual Meeting 2021
  • Borzov, Ramón et al. "Method and Apparatus for Replicating a Target File between Devices", World Intellectual Property Organization, WO2023000915A1 / US20230087778. Issued Jan 26, 2023
  • Kitamura, Ramón et al. "Mapping Attention Mechanisms (Transformer) Function to Spatial Architecture (SIMD or At-Memory Processing)", US Patent Office, 63/608,539. Filed Dec 22, 2023, Patent Pending