Description
PROFILE: Looking for a SENIOR developer profile, someone who can be a COACH for the team.
PERM: Strong possibility of conversion. The manager prefers a resource who can be converted to permanent, but is open to temporary.
BILINGUALISM: Bilingualism is preferred, but the team is open to an English speaker who understands French very well. Must be able to understand a French-speaking team.

Short Description
As a Cloud Site Reliability Engineer (SRE), you are the expert in helping the organization operate the cloud effectively.

Team Mission
The squad has a four-pronged mission: Technology Architecture, Cloud Solution Architecture, GitOps-based Cloud Project Management, and Cloud Operational Governance.

Your role
You will actively participate in realizing our DevOps vision by defining and improving our operational and cloud reliability practices, and by working with the teams to adopt these practices. You will improve the observability of IT systems while optimizing task-based operational work (toil) to unblock teams with operational needs where automated or self-service solutions do not yet exist.
You focus on automating deployment pipelines while documenting processes that reduce the impact of errors. You simplify systems and ensure the adoption of robust design patterns. As a Cloud SRE expert, you are a generalist in all things DevOps and have extensive experience with SDLC software engineering practices.

Your primary responsibilities will be to
Promote a culture focused on DevOps and reliability by building strong relationships with technical and non-technical teams
Work with the broader software engineering community to design and implement highly resilient architectures and processes
Improve our cloud deployment practices for better reliability, repeatability, and security
Identify automation opportunities and failure points in application pipelines
Automate and optimize the provisioning and configuration of cloud services
Help develop chaos engineering practices to improve site reliability and internal incident response processes and tools
Help improve continuous deployment processes and tools
Help refine our security posture and automate security practices (security as code) in accordance with government and regulatory agency requirements
Write code for all of the above and, where that is not possible, document it in a wiki

Technical Skills
University degree in a related field with 5 years of experience, or equivalent experience
Official certifications required: AWS, CKA, CNCF
Proficiency in Linux / Unix administration, including scripting; knowledge of Windows Server administration (asset)
Strong technical problem-solving skills and knowledge of resilient software design patterns and architectures
Advanced knowledge of CLI/API, Terraform, Ansible, Python, Bash and Git
Good knowledge of one or more programming languages such as Python, Go, Java, Ruby, JavaScript
Good knowledge of container administration, including Docker, Kubernetes, Istio or other service mesh implementations
Experience with networking, including VPC, SDN / VLAN, DNS, routers and firewalls
Experience with SQL, using RDS-PostgreSQL or other DBMS
Experience with monitoring/alerting tools such as Grafana, Prometheus, Sysdig, Datadog
Experience with log aggregation tools such as Fluentd, ELK, Splunk
Experience with vault and secret management processes and tools
Strong communication and technical simplification skills