Site Reliability Engineer (GenAI)
Posted on Oct 2, 2024 by Publicis Sapient
Irving, TX
Engineering
Immediate Start
Annual Salary
Full-Time
Job Description
The Site Reliability Engineer (SRE) will be responsible for ensuring the reliability, scalability, and availability of services across cloud and on-prem platforms, with a focus on OpenShift and Grafana. The role combines expertise in automation, observability, and infrastructure management to optimize resource allocation and maintain service uptime. The ideal candidate will have experience working with both cloud (GCP) and on-prem environments, particularly in managing AI/ML and GPU-based workloads.
Responsibilities:
Automation & Scripting: Use tools like Ansible and Python to automate provisioning, monitoring, and scaling tasks. Create reusable automation scripts for efficient infrastructure management.
Observability & Monitoring: Set up Grafana dashboards and Prometheus alerts to track service health, uptime, and performance metrics across platforms.
Infrastructure Management: Deploy and manage applications on OpenShift or other Kubernetes-based platforms, ensuring efficient application lifecycle management.
Platform & Service Monitoring: Implement and automate monitoring for both cloud and on-prem environments, ensuring compliance with SLA requirements.
Capacity Planning & Resource Management: Monitor and optimize GPU and CPU utilization, ensuring resources are allocated efficiently across workloads.
Collaboration & Sprint Planning: Participate in Agile/Scrum sprint planning, collaborating with other teams to ensure tasks are delivered on time and aligned with service-level objectives.
Process Automation: Automate manual processes such as resource requests, tenant onboarding, and lifecycle management for AI/ML platforms and other workloads.
Reference: 203065596
https://jobs.careeraddict.com/post/95712704