Site Reliability Engineer
Posted on Nov 11, 2021 by Levy Associates Ltd
Your working environment and team:
You'll be working in a customer-focused financial organization that relies heavily on its banking applications. To keep those applications fully available, the organization is bringing you into a team where you will be an integral part of its worldwide digital banking transition.
The TouchPoint Platform is an important component of the client's strategy to become a genuinely global bank, and a crucial success factor on its journey to become a financial services platform that goes beyond banking. Through this platform, the client aims to provide a scalable framework for platform business models, allowing it to succeed in the new financial environment. The globally scalable banking platform will set the client apart from the competition.
The TouchPoint Site Reliability Engineering (SRE) team is a diverse group of senior engineers with extensive experience in application and infrastructure development and operations. Its main purpose is to improve, continuously and structurally, the reliability and maintainability of the IT environments that make up the TouchPoint Platform, which are delivered and controlled from several (international) domains.
Your main responsibilities:
* Ensure Service Level Objective (SLO) levels are set and met.
* Drive an Always Available mindset and behavior within the TouchPoint organization. Recognize shortcomings in knowledge and expertise, and provide the necessary resources, skills, guidance, and training to DevOps teams where needed.
* Define and enhance standards for logging, monitoring, and alerting, and actively monitor end-to-end platform performance through white-box and black-box monitoring tools.
* Improve incident response practices and take an active part in the response to escalated and critical incidents. On-call duty is currently not part of the job, but you should not object to it if and when it is required.
* Participate in Root Cause Analysis (RCA). Prioritize and implement the RCA recommendations through improvement plans with the responsible Squads/DevOps teams.
* Drive continuous improvement of all services in the TouchPoint Platform by analyzing the current level of service, the functional and technical setup, the code, dev/ops practices, and the underlying causes of incidents, underperformance, etc.
* Organize and coordinate platform tests such as DDoS, DR, Ceiling/Break, and penetration tests.
* Set up and maintain automated reporting and feedback loops.
* Contribute to automating Build, Test, and Deployment practices through the CI/CD pipeline.
* Contribute to tuning application resources and updating highly available deployment patterns for (mostly) container- and VM-based environments.
* Initiate and contribute to new SRE initiatives such as AI Ops, Chaos Engineering, migrations to Public Cloud, and Error Budgeting.
* Participate in and initiate experiments with new tools and concepts, and evaluate their value against set goals.
* Operations expert: 5+ years of experience working with Agile DevOps principles.
* Solid understanding of how technology setup and ITSM processes relate to service level objectives such as availability (time-based, successful call rate, response times), MTTR, and MTBF.
* Good understanding of microservices architecture and related high availability/resilience patterns, and experience building systems with multiple layers of redundancy to withstand failures in software, hardware, and network infrastructure.
* Proven experience:
o working as a Site Reliability Engineer or DevOps engineer
o scripting in at least one of the following: Ruby, Python, Bash, PowerShell
o setting up Build and Deployment pipelines in Azure DevOps (ADO)
o setting up white-box monitoring and formulating meaningful metrics for monitoring and reporting
* Able to coordinate/lead incident response and root cause analysis activities.
* Previous experience with the TouchPoint platform.
* CI/CD pipeline: Azure DevOps/Jenkins/GitLab
* Cloud computing and container orchestration: Linux VMs and Kubernetes container platforms. Knowledge of OpenShift and AKS, and related certifications, are a plus.
* Service mesh and SDKs
* Logging/monitoring/alerting: Kafka, ELK, and Prometheus. Experience with black-box monitoring tools like Rigor/Splunk and AI Ops tools like Loom is a plus.
* Backlog management: Azure Boards
* ITSM: ServiceNow (SNOW)