Senior Production SRE Engineer - Storage

NVIDIA

Job Overview

NVIDIA is seeking a Senior Production SRE Engineer - Storage to join our dynamic team. As a Site Reliability Engineer (SRE), you will be responsible for designing, building, and maintaining large-scale production systems with high efficiency and availability. This role involves working with cutting-edge technologies and ensuring the reliability and performance of our GPU cloud services.

Key Responsibilities

  • Design and Support Storage Clusters: Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
  • AI/ML Workloads: Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows.
  • Service Lifecycle Improvement: Collaborate with peers to improve the lifecycle of services from inception and design through deployment, operation, and refinement.
  • System Health Monitoring: Maintain services by measuring and monitoring availability, latency, and overall system health, leveraging machine learning models.
  • Sustainable Scaling: Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems to improve reliability and velocity.
  • Incident Response: Practice sustainable incident response and conduct blameless postmortems.
  • On-call Support: Be part of an on-call rotation to support production systems.

Required Qualifications

  • Educational Background: BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics) or equivalent experience.
  • Experience: At least 5 years of practical experience in a similar role.
  • Technical Skills: Proficiency in algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
  • Programming Languages: Experience in one or more of the following: C/C++, Java, Python, Go, Perl, or Ruby.
  • Infrastructure Tools: Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
  • Observability Tools: Experience using observability and tracing tools like InfluxDB, Prometheus, and the Elastic Stack.

Preferred Qualifications

  • SRE Mindset: A demonstrated SRE mindset, with a customer-first approach and a focus on customer satisfaction.
  • CI/CD Experience: Experience with Git, code review, pipelines, and CI/CD.
  • Distributed Systems: Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Cloud Systems: Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

Why Join NVIDIA?

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and talented people on the planet working for us. If you're creative and autonomous, we want to hear from you! Join us in a collaborative environment that encourages innovation and growth.

How to Apply

If you are interested in this exciting opportunity, please apply through our career site. We look forward to reviewing your application!

Benefits

  • Remote work
  • Collaborative environment
  • Opportunities for growth and learning
