Reports to: Project Lead
Experience: 5+ years
Start date: 1st August 2022
Responsibilities
- Reduce toil, implement identified improvement opportunities, and handle minor enhancements and non-ticketed activities
- Define and monitor service-level metrics, including reliability metrics such as MTTD, MTTR, MTBF, MTTF, unavailability rate, and incident count
- Create rules that optimize incident response using metrics, streamline alert flows, and improve collaboration and communication across squads
- Proactively identify issues that might disrupt service in production
- Route incoming service requests to the appropriate support groups or the Jira tool
- Create and maintain alerts
- Handle change validation and change-planning requests
- Assist business stakeholders in determining SLOs or adjusting threshold limits
- Manage demand and capacity, and correct SLI/SLO threshold limits
- Gather and analyze metrics from both infrastructure and applications to assist in bug fixing
- Engage in capacity planning and performance-tuning exercises
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplifts
- Balance feature development speed against reliability using well-defined service level objectives and indicators (SLOs, SLIs)
- Debug production issues across services and levels of the stack
Required Skills and Qualifications
- Bachelor’s degree in computer science or another highly technical or scientific discipline
- Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, Ruby, C/C++, or JavaScript
- Experience with AEM and web services/APIs
- Experience working with public clouds (minimum 3 years required)
- Experience with Git or other source control systems
- Experience using tools to create and manage CI (continuous integration) and CD (continuous delivery) pipelines
- Working knowledge of service level definitions and KPI identification
- Working knowledge of the TCP/IP stack, internet routing, and load balancing
- Experience with distributed storage technologies such as NFS, HDFS, and Ceph
- Experience with observability strategy
Delivery Model: Onsite
Job Type: Full Time
Job Location: Auckland