Senior Production SRE Engineer - Storage

NVIDIA

Job Overview

NVIDIA is seeking a Senior Production SRE Engineer - Storage to join our dynamic team. As a Site Reliability Engineer (SRE), you will be responsible for designing, building, and maintaining large-scale production systems with high efficiency and availability. This role involves working with cutting-edge technologies and ensuring the reliability and performance of our GPU cloud services.

Key Responsibilities

  • Design and Support Storage Clusters: Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
  • AI/ML Workloads: Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows.
  • Service Lifecycle Improvement: Collaborate with peers to improve the lifecycle of services from inception and design through deployment, operation, and refinement.
  • System Health Monitoring: Maintain services by measuring and monitoring availability, latency, and overall system health, using machine learning models to support that monitoring.
  • Sustainable Scaling: Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems to improve reliability and velocity.
  • Incident Response: Practice sustainable incident response and conduct blameless postmortems.
  • On-call Support: Be part of an on-call rotation to support production systems.

Required Qualifications

  • Educational Background: BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
  • Experience: At least 5 years of practical experience in a similar role.
  • Technical Skills: Proficiency in algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
  • Programming Languages: Experience in one or more of the following: C/C++, Java, Python, Go, Perl, or Ruby.
  • Infrastructure Tools: Good knowledge of infrastructure configuration management and provisioning tools such as Ansible, Chef, Puppet, and Terraform.
  • Observability Tools: Experience using observability and tracing tools such as InfluxDB, Prometheus, and the Elastic Stack.

Preferred Qualifications

  • SRE Mindset: A demonstrated SRE mindset, with a customer-first approach and a focus on customer satisfaction.
  • CI/CD Experience: Experience with Git, code review, pipelines, and CI/CD.
  • Distributed Systems: Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Cloud Systems: Experience using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

Why Join NVIDIA?

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and talented people on the planet working for us. If you're creative and autonomous, we want to hear from you! Join us in a collaborative environment that encourages innovation and growth.

How to Apply

If you are interested in this exciting opportunity, please apply through our career site. We look forward to reviewing your application!

Benefits

  • Remote work
  • Collaborative environment
  • Opportunities for growth and learning
