Senior Site Reliability Engineer
Posted on Nov 13, 2021 by Cititec Talent Limited
Our client empowers forward-thinking businesses to build a sustainable future by providing flexible and scalable electric vehicle charging solutions. With a focus on larger players in the industry, they are a premier cloud-native and multi-tenant Platform-as-a-Service. They offer a service that's built from the ground up to be global, future-proof, API-first, highly scalable and secure.
The core mission of the SRE team is to maximize the reliability and resilience of the platform by providing the production platform, tools, best practices and processes. The team aims to balance engineering, operational work and engagement with the product development teams.
As Senior Site Reliability Engineer, you are responsible for keeping all the production systems running smoothly. You combine pragmatic operational and software engineering skills and have a passion for operational discipline and automation.
Projects you could work on:
- Coding Google Cloud Platform infrastructure automation with Terraform.
- Improving their monitoring and alerting setup.
- Building relationships with the product development teams, helping them define their SLIs and SLOs and stay within their error budgets.
- Helping product development teams design, deploy and fix various services.
- Contributing to the Production Readiness Reviews.
- Design, build and plan the growth of the platform's infrastructure to support all current and future drivers.
- Work closely with the Product Development Teams to improve the deployment process.
- Coach, guide and inspire others within the team and the technology group.
- Use your knowledge and experience to prevent incidents from ever happening.
- Debug production issues across services.
- Run their infrastructure with Terraform and Kubernetes.
- Ensure monitoring and alerting trigger on symptoms and not on outages.
- Automate repeatable actions.
- Continuously improve the reliability of their systems.
- Establish and maintain best practices around SRE.
- Be part of the Incident Response team to respond to availability incidents.
Experience we're looking for:
- At least 3 years of professional experience with SRE practices.
- At least 5 years of experience with system administration/infrastructure engineering/cloud engineering.
- Strong experience with cloud-based infrastructure providers (preferably GCP).
- Programming/Scripting skills in at least one of the following: Go, Python, Shell.
- Familiarity with the Java ecosystem is highly preferred.
- Experience with Nginx, Docker, Kubernetes, Terraform or related technologies.
- Experience in delivering and supporting low-latency, highly available software systems.
- Experience in capacity planning and disaster recovery.
- Experience in monitoring and setting up alerting.
- Experience applying CI/CD concepts.
- Be an advocate of and apply security in every aspect of your work.
Our offer to you:
In addition to collaborating with a dynamic group of top industry and subject matter experts, they offer:
- A very competitive salary
- Relocation assistance/visa support if applicable
- The option to work from home partially
- A budget to set up your home office with a high-end laptop and related equipment
- Contribution to a private pension fund
- Discounted health insurance
- 25 days holiday
- Home-office travel compensation