What you’ll Do:
- Lead the design, implementation, and operation of highly available and scalable infrastructure solutions to support our organization's applications and services.
- Support services before they go live through activities such as system design consulting, capacity planning, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
- Improve the monitoring, alerting, and resilience of systems.
- Practice sustainable incident response and blameless postmortems.
What you’ll Need:
- 5-10 years of experience in a Site Reliability Engineering (SRE) role, with a proven track record of designing and implementing scalable and reliable infrastructure solutions.
- Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
- Experience in designing, analyzing, and troubleshooting microservices.
- Understanding of monitoring, logging, and tracing systems (such as ELK, Prometheus, Grafana, and Jaeger) that help teams detect problems quickly.
It’d be Great if you have:
- Linux and network administration skills for troubleshooting.
- Familiarity with a cloud platform (AWS or Google Cloud) and Kubernetes.
- Experience programming in Go or a similar language.
- Experience designing and managing MongoDB and MySQL databases.
- Knowledge of security and security testing.