What You’ll Do:
- Assist and collaborate with the Engineering team on SRE duties for the internal system.
- Design and implement tools and systems that support SRE responsibilities.
- Learn about and improve the monitoring, alerting, and resilience of systems.
- Learn and practice sustainable incident response and blameless postmortems.
What You’ll Need:
- A growth mindset and a willingness to learn new things.
- Experience with Linux and network administration, including troubleshooting.
- Experience with a cloud platform (AWS or Google Cloud) and Kubernetes.
- Understanding of monitoring, logging, and tracing systems, such as ELK, Prometheus, Grafana, and Jaeger, that help teams detect problems quickly.
- Programming experience in a language such as Go or Python, or in a scripting language such as Shell.
- A systematic approach to problem-solving, coupled with effective communication skills and a sense of drive.