Papers & Analysis
Deep dives into research papers on efficient ML systems, distributed training, and optimization techniques for frontier models. Documenting my evolution from robotic RL (RobIN/AMRL) to scalable ML infrastructure and training efficiency.
Current Focus: Drawing inspiration from work like this deep dive on efficient ML systems — exploring distributed training optimization, quantization, memory-efficient fine-tuning, and inference acceleration.
My Publications
GigaAPI: A User-Space API for Multi-GPU Programming →
Omeed Tehrani et al. • arXiv:2504.01266 • 2025
Abstracts the complexities of multi-GPU programming behind a user-space API, making parallel computing accessible without deep CUDA expertise.
Learning Inverse Kinodynamics for Autonomous Vehicle Drifting →
Omeed Tehrani et al. • UT Austin AMRL • arXiv:2402.14928 • 2024
Data-driven kinodynamic model learning for high-speed autonomous drifting on the UT Automata platform. Selected for presentation at the Amazon AI Symposium. Achieved obstacle avoidance through learned curvature correction.
Decision Transformers for Robotic Imitation Learning →
Omeed Tehrani • UT Austin RobIN Lab • 2023
Extended the Decision Transformer for return-conditioned imitation learning on mixed-quality robomimic datasets; outperformed behavioral cloning baselines on manipulation tasks (earlier work from my graduate research).
Paper Deep Dives
Coming soon: In-depth analyses of papers on efficient ML systems, distributed training, and optimization.
I'm currently studying recent work on training efficiency, quantization techniques, and inference optimization, and will be documenting implementations and insights from papers on ZeRO, FSDP, FlashAttention, and more.
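As a preview of the format those write-ups will take, here is a minimal sketch of wrapping a model in PyTorch FSDP. It assumes a multi-GPU node launched with torchrun; the model, sizes, and training loop are placeholders rather than code from any particular paper.

```python
# Minimal FSDP sketch (assumes a torchrun launch, NCCL backend, one GPU per rank).
import os
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    torch.distributed.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real run would build or load a transformer here.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

    # With default settings FSDP fully shards parameters, gradients, and optimizer
    # state across ranks, the same idea ZeRO stage 3 implements.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).pow(2).mean()  # dummy objective, just to exercise backward
        loss.backward()
        optim.step()
        optim.zero_grad()

    torch.distributed.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launch with `torchrun --nproc_per_node=<num_gpus> train.py`.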
Current Reading List
Distributed Training & Optimization
- • "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (Rajbhandari et al.)
- • "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel" (Zhao et al.)
- • "Megatron-LM: Training Multi-Billion Parameter Models Using Model Parallelism" (Shoeybi et al.)
- • "GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism" (Huang et al.)
Efficient Fine-Tuning & Quantization
- • "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al.)
- • "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al.)
- • "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (Dettmers et al.)
- • "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (Frantar et al.)
Inference & Architecture Optimization
- • "FlashAttention: Fast and Memory-Efficient Exact Attention" (Dao et al.)
- • "Inference Optimization for Large Language Models" (various)
- • "KV Cache Optimization" and "Continuous Batching" techniques
- • Gradient compression & communication-efficient training papers
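Here is a toy sketch of the KV-cache idea that underlies much of this inference work: during autoregressive decoding, keys and values of past tokens are cached so each step only projects the newest token. Single head, no batching, random placeholder weights; production servers preallocate and page this cache (as in vLLM's PagedAttention).

```python
# Toy single-head KV cache for autoregressive decoding (illustrative only).
import torch
import torch.nn.functional as F

d_model, n_steps = 64, 5
wq = torch.randn(d_model, d_model) / d_model ** 0.5
wk = torch.randn(d_model, d_model) / d_model ** 0.5
wv = torch.randn(d_model, d_model) / d_model ** 0.5

k_cache = torch.empty(0, d_model)  # grows by one row per decoded token
v_cache = torch.empty(0, d_model)

tokens = torch.randn(n_steps, 1, d_model)  # placeholder embeddings, one per decode step
outputs = []
for x in tokens:
    q, k, v = x @ wq, x @ wk, x @ wv       # project only the newest token
    k_cache = torch.cat([k_cache, k], dim=0)
    v_cache = torch.cat([v_cache, v], dim=0)
    attn = F.softmax(q @ k_cache.T / d_model ** 0.5, dim=-1)  # attend over everything cached
    outputs.append(attn @ v_cache)
```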
Foundation (Earlier Work)
- • "Decision Transformer: Reinforcement Learning via Sequence Modeling" (Chen et al.)
- • "Attention Is All You Need" (Vaswani et al.)
- • Robotic manipulation and imitation learning papers from RobIN/AMRL work
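The Decision Transformer entry connects back to my earlier imitation-learning work above: the policy is conditioned on returns-to-go, the sum of future rewards at each timestep. A minimal sketch of that computation (gamma=1 gives the undiscounted returns-to-go the paper conditions on):

```python
# Returns-to-go used to condition a Decision Transformer-style policy.
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # reward now plus (discounted) future return
        rtg[t] = running
    return rtg

print(returns_to_go([1.0, 0.0, 2.0, 1.0]))  # [4. 3. 3. 1.]
```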