Site Reliability Engineer, Cloud Platform

Engineering · Tokyo, Japan

Job description

About Woven by Toyota

Woven by Toyota, a part of the Toyota Group, is challenging the current state of mobility through human-centric innovation and empowering mobility transformation. Through our AD/ADAS technology, our automotive software development platform Arene OS, our mobility test course Toyota Woven City, and Toyota’s growth fund, Woven Capital, we are pioneering the movement of people, goods, information, and energy, weaving a future of enhanced safety, connectivity and well-being for all.

=========================================================================

TEAM

Our mission is to make software development across Woven by Toyota and the wider Toyota organization more productive and efficient. We use the latest technologies to help engineering teams go faster, with safety as our top priority. Our modern, agile, and transparent services are designed to bring to life Woven by Toyota's vision of "Mobility to Love, Safety to Live."

WHO ARE WE LOOKING FOR?

The Enterprise Technology SRE team collaborates with the product development team, sharing the same codebase, but with a primary focus on non-functional requirements. Our objective is to enhance production readiness and reliability. We are looking for an SRE with a background in software engineering, DevOps, and cloud engineering. You will be passionate about establishing SRE best practices, and you'll report to our SRE Manager. This role is hybrid, requiring on-site presence three days per week.

RESPONSIBILITIES

  • Develop software systems for improved product monitoring, reliability, and development efficiency
  • Take on-call responsibilities to monitor and respond to incidents and keep services healthy. Our 8-hour on-call rotation covers workdays, weekends, and holidays, and can be done remotely.
  • Provide guidance on reliability practices throughout the software development lifecycle, including architecture and code reviews
  • Establish SRE best practices within product teams, including capacity planning, chaos testing, and disaster recovery drills
  • Learn from incidents through blameless post-mortems and address service reliability issues through hands-on coding
  • Enhance development and operations teams' efficiency

MINIMUM QUALIFICATIONS

  • Bachelor’s degree in Computer Science, Technology, Engineering, Mathematics, or equivalent practical experience
  • 4+ years of experience in Go, Python, or a similar language. Proficient in data structures, algorithms, and software design
  • Intermediate to advanced expertise in public cloud technologies, Kubernetes, and Infrastructure as Code
  • Proficient in production on-call, troubleshooting, and incident management
  • Business-level English skills

NICE TO HAVES

  • Hands-on experience with SRE best practices, such as SLO monitoring, disaster recovery planning, chaos testing, capacity planning, automation, and toil reduction
  • Experience with APM solutions and monitoring systems such as Prometheus, Wavefront, and GCP monitoring
  • Previous experience in an SRE, DevOps, or Platform Engineering role
  • AWS, GCP, or Kubernetes certifications
  • Japanese language skills for communicating with customers
