Senior Production SRE Engineer - Storage

NVIDIA

Job Overview

NVIDIA is seeking a Senior Production SRE Engineer - Storage to join our dynamic team. As a Site Reliability Engineer (SRE), you will be responsible for designing, building, and maintaining large-scale production systems with high efficiency and availability. This role involves working with cutting-edge technologies and ensuring the reliability and performance of our GPU cloud services.

Key Responsibilities

  • Design and Support Storage Clusters: Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
  • AI/ML Workloads: Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows.
  • Service Lifecycle Improvement: Collaborate with peers to improve the lifecycle of services from inception and design through deployment, operation, and refinement.
  • System Health Monitoring: Maintain services by measuring and monitoring availability, latency, and overall system health, leveraging machine learning models.
  • Sustainable Scaling: Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems to improve reliability and velocity.
  • Incident Response: Practice sustainable incident response and conduct blameless postmortems.
  • On-call Support: Be part of an on-call rotation to support production systems.

Required Qualifications

  • Educational Background: BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics) or equivalent experience.
  • Experience: At least 5 years of practical experience in a similar role.
  • Technical Skills: Proficiency in algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
  • Programming Languages: Experience in one or more of the following: C/C++, Java, Python, Go, Perl, or Ruby.
  • Infrastructure Tools: Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
  • Observability Tools: Experience using observability and tracing tools like InfluxDB, Prometheus, and the Elastic Stack.

Preferred Qualifications

  • SRE Mindset: A demonstrated SRE mindset, with a customer-first approach and a focus on customer satisfaction.
  • CI/CD Experience: Experience with Git, code review, pipelines, and CI/CD.
  • Distributed Systems: Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Cloud Systems: Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

Why Join NVIDIA?

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and talented people on the planet working for us. If you're creative and autonomous, we want to hear from you! Join us in a collaborative environment that encourages innovation and growth.

How to Apply

If you are interested in this exciting opportunity, please apply through our career site. We look forward to reviewing your application!

Benefits

  • Remote work
  • Collaborative environment
  • Opportunities for growth and learning
