Senior Production SRE Engineer - Storage

Job Overview

NVIDIA is seeking a Senior Production SRE Engineer - Storage to join our dynamic team. As a Site Reliability Engineer (SRE), you will be responsible for designing, building, and maintaining large-scale production systems with high efficiency and availability. This role involves working with cutting-edge technologies and ensuring the reliability and performance of our GPU cloud services.

Key Responsibilities

Design and Support Storage Clusters: Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
AI/ML Workloads: Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows.
Service Lifecycle Improvement: Collaborate with peers to improve the lifecycle of services from inception and design through deployment, operation, and refinement.
System Health Monitoring: Maintain services by measuring and monitoring availability, latency, and overall system health, leveraging machine learning models.
Sustainable Scaling: Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems to improve reliability and velocity.
Incident Response: Practice sustainable incident response and conduct blameless postmortems.
On-call Support: Be part of an on-call rotation to support production systems.

Required Qualifications

Educational Background: A BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
Experience: At least 5 years of practical experience in a similar role.
Technical Skills: Proficiency in algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
Programming Languages: Experience in one or more of the following: C/C++, Java, Python, Go, Perl, or Ruby.
Infrastructure Tools: Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
Observability Tools: Experience using observability and tracing tools like InfluxDB, Prometheus, and the Elastic Stack.

Preferred Qualifications

SRE Mindset: A demonstrated SRE mindset, with a customer-first approach and a focus on customer satisfaction.
CI/CD Experience: Experience with Git, code review, pipelines, and CI/CD.
Distributed Systems: Interest in crafting, analyzing, and fixing large-scale distributed systems.
Cloud Systems: Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

Why Join NVIDIA?

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and talented people on the planet working for us. If you're creative and autonomous, we want to hear from you! Join us in a collaborative environment that encourages innovation and growth.

How to Apply

If you are interested in this exciting opportunity, please apply through our career site. We look forward to reviewing your application!

Benefits

Remote work
Collaborative environment
Opportunities for growth and learning
