Systems Software Engineer, AI Infrastructure

Nvidia
  • Posted On: 2026-03-18 19:25:30
  • Openings: 10
  • Applicants: 0
Job Description
What You Will Be Doing:
  • Develop and maintain large-scale AI Infrastructure systems that support critical use cases, including frontier model training, and drive reliability, operability, and scalability across global public and private clouds.
  • Collaborate on tooling for HPC, GPU training, and AI model training workflows.
  • Build tools and frameworks to improve observability, define actionable reliability metrics, and enable fast issue resolution, driving continuous improvement in system performance.
  • Establish frameworks for operational maturity, lead sustainable incident response protocols, and conduct blameless postmortems to improve team efficiency and system resilience.
  • Implement SRE fundamentals, including incident management, monitoring, and performance optimization, while designing automation tools to reduce manual processes and operational overhead.
  • Work with engineering teams to deliver innovative solutions, uphold high standards for code and infrastructure, and contribute to hiring for a diverse, high-performing team.
What We Need to See:
  • Degree in Computer Science or a related field (or equivalent experience), with 5+ years in Software Development, SRE, or Production Engineering.
  • Proficiency in Python and at least one other language (C/C++, Go, Perl, Ruby).
  • Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, Azure, GCP, or OCI).
  • Strong understanding of SRE principles, including error budgets, SLOs, and SLAs, and experience with Infrastructure as Code tools (e.g., Terraform CDK).
  • Hands-on experience with observability platforms (e.g., ELK, Prometheus, Loki) and CI/CD systems (e.g., GitLab).
  • Strong communication skills with the ability to convey technical concepts effectively to diverse audiences.
  • Commitment to fostering a culture of diversity, curiosity, and continuous improvement.
Ways to stand out from the crowd:
  • Experience in AI training, inferencing, and data infrastructure services.
  • Proficiency in deep learning and distributed computing frameworks such as PyTorch, TensorFlow, JAX, and Ray.
  • A strong background in cloud or hardware health monitoring and system reliability.
  • Hands-on expertise in operating and scaling distributed systems with stringent SLAs, ensuring high availability and performance.
  • Knowledge of incident, change, and problem management processes, fostering continuous improvement in sophisticated environments.
More Info
  • Full Time
  • English
  • Education: Any Graduate
Required Skills
C++, Linux, Problem management, Incident management, Perl, Windows, Ruby, Gaming

Contact Details
Nvidia
+91 987654567
info@nvidia.com
  • Experience: 5 years
  • Salary: Above 10 lakhs annually
  • Location for Hiring: Mumbai