Team Information: The Seed Infrastructures team oversees distributed training, reinforcement learning frameworks, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.
Responsibilities:
- Conduct research and development on large-scale LLM training infrastructure and efficiency
- Design and optimize distributed training strategies for LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
- Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads
- Research and optimize network, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
- Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
- Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable, real-world AI infrastructure solutions
The base salary range for this position in the selected city is $244,800 - $450,000 annually.