Site Reliability Engineer (GenAI)

Posted on Oct 2, 2024 by Publicis Sapient
Irving, TX
Engineering
Immediate Start
Annual Salary
Full-Time
Job Description

The Site Reliability Engineer (SRE) is responsible for the reliability, scalability, and availability of services across cloud and on-prem platforms, with a focus on OpenShift and Grafana. The role combines expertise in automation, observability, and infrastructure management to optimize resource allocation and maintain service uptime. The ideal candidate has experience with both cloud (GCP) and on-prem environments, particularly in managing AI/ML and GPU-based workloads.

Responsibilities:

Automation & Scripting: Use tools such as Ansible and Python to automate provisioning, monitoring, and scaling tasks. Create reusable automation scripts for efficient infrastructure management.

Observability & Monitoring: Set up Grafana dashboards and Prometheus alerts to track service health, uptime, and performance metrics across platforms.

Infrastructure Management: Deploy and manage applications on OpenShift or other Kubernetes-based platforms, ensuring efficient application lifecycle management.

Platform & Service Monitoring: Implement and automate monitoring for both cloud and on-prem environments, ensuring compliance with SLA requirements.

Capacity Planning & Resource Management: Monitor and optimize GPU and CPU utilization, ensuring resources are allocated efficiently across workloads.

Collaboration & Sprint Planning: Participate in Agile/Scrum sprint planning, collaborating with other teams to ensure tasks are delivered on time and aligned with service-level objectives.

Process Automation: Automate manual processes such as resource requests, tenant onboarding, and lifecycle management for AI/ML platforms and other workloads.

Reference: 203065596

https://jobs.careeraddict.com/post/95712704
