Job Overview
NVIDIA is seeking a Senior Production SRE Engineer - Storage to join our dynamic team. As a Site Reliability Engineer (SRE), you will be responsible for designing, building, and maintaining large-scale production systems with high efficiency and availability. This role involves working with cutting-edge technologies and ensuring the reliability and performance of our GPU cloud services.
Key Responsibilities
- Design and Support Storage Clusters: Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
- AI/ML Workloads: Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows.
- Service Lifecycle Improvement: Collaborate with peers to improve the lifecycle of services from inception and design through deployment, operation, and refinement.
- System Health Monitoring: Maintain services by measuring and monitoring availability, latency, and overall system health, leveraging machine learning models.
- Sustainable Scaling: Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems to improve reliability and velocity.
- Incident Response: Practice sustainable incident response and conduct blameless postmortems.
- On-call Support: Be part of an on-call rotation to support production systems.
Required Qualifications
- Educational Background: BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
- Experience: At least 5 years of practical experience in a similar role.
- Technical Skills: Proficiency in algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
- Programming Languages: Experience in one or more of the following: C/C++, Java, Python, Go, Perl, or Ruby.
- Infrastructure Tools: Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
- Observability Tools: Experience using observability and tracing tools like InfluxDB, Prometheus, and the Elastic Stack.
Preferred Qualifications
- SRE Mindset: A demonstrated SRE mindset, with a customer-first approach and a focus on customer satisfaction.
- CI/CD Experience: Experience with Git, code review, pipelines, and CI/CD.
- Distributed Systems: Interest in crafting, analyzing, and fixing large-scale distributed systems.
- Cloud Systems: Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.
Why Join NVIDIA?
NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and talented people on the planet working for us. If you're creative and autonomous, we want to hear from you! Join us in a collaborative environment that encourages innovation and growth.
How to Apply
If you are interested in this exciting opportunity, please apply through our career site. We look forward to reviewing your application!
Benefits Extracted with AI
- Remote work
- Collaborative environment
- Opportunities for growth and learning