Senior Production SRE Engineer - Storage
NVIDIA
Job Overview
NVIDIA is seeking a Senior Production SRE Engineer - Storage to join our dynamic team. As a Site Reliability Engineer (SRE), you will be responsible for designing, building, and maintaining large-scale production systems with high efficiency and availability. This role involves working with cutting-edge technologies and ensuring the reliability and performance of our GPU cloud services.
Key Responsibilities
- Design and Support Storage Clusters: Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
- AI/ML Workloads: Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows.
- Service Lifecycle Improvement: Collaborate with peers to improve the lifecycle of services from inception and design through deployment, operation, and refinement.
- System Health Monitoring: Maintain services by measuring and monitoring availability, latency, and overall system health, leveraging machine learning models.
- Sustainable Scaling: Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems to improve reliability and velocity.
- Incident Response: Practice sustainable incident response and conduct blameless postmortems.
- On-call Support: Be part of an on-call rotation to support production systems.
Required Qualifications
- Educational Background: BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
- Experience: At least 5 years of practical experience in a similar role.
- Technical Skills: Proficiency in algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
- Programming Languages: Experience in one or more of the following: C/C++, Java, Python, Go, Perl, or Ruby.
- Infrastructure Tools: Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
- Observability Tools: Experience using observability and tracing tools such as InfluxDB, Prometheus, and the Elastic Stack.
Preferred Qualifications
- SRE Mindset: A demonstrated SRE mindset, with a customer-first approach and a focus on customer satisfaction.
- CI/CD Experience: Experience with Git, code review, pipelines, and CI/CD.
- Distributed Systems: Interest in crafting, analyzing, and fixing large-scale distributed systems.
- Cloud Systems: Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.
Why Join NVIDIA?
NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and talented people on the planet working for us. If you're creative and autonomous, we want to hear from you! Join us in a collaborative environment that encourages innovation and growth.
How to Apply
If you are interested in this exciting opportunity, please apply through our career site. We look forward to reviewing your application!
Benefits Extracted with AI
- Remote work
- Collaborative environment
- Opportunities for growth and learning