Deltek is looking for a Team Lead and Senior Software Engineer to join our Site Reliability Engineering team. In this role, you will be responsible for the reliability, scalability, and performance of our globally used SaaS platforms. You will bridge the gap between software engineering and infrastructure operations, building the tools, automation, and systems that keep our products running for thousands of customers and millions of users.
This is a high-ownership role in a “never-stop-learning” environment. You will work closely with development teams to embed reliability practices early in the software lifecycle, respond to production incidents, and drive continuous improvements to our observability and operational posture.
Key Responsibilities:
Site Reliability & Platform Engineering:
- Design, build, and maintain the infrastructure and tooling that underpins Deltek’s SaaS platforms at scale.
- Drive reliability improvements across the full stack, from application-level resilience patterns to infrastructure-level fault tolerance.
- Uphold and extend our IaC-first engineering culture, where all infrastructure changes are made through code and shipped to production via fully automated CI/CD pipelines.
- Build and improve CI/CD pipelines to support safe, frequent deployments with automated rollback capabilities.
- Develop internal tooling and automation to reduce toil and increase engineering self-service.
Observability & Performance:
- Design and maintain comprehensive observability solutions including logging, metrics, tracing, and alerting across our AWS-based infrastructure.
- Proactively identify performance bottlenecks and reliability risks before they impact customers.
- Conduct capacity planning and load testing to ensure systems can scale to meet demand.
Incident Management & On-Call Support:
- Own and participate in the on-call rotation: ensure fair distribution and adequate coverage across the team, and act as a first responder for production incidents affecting our SaaS platforms.
- Lead incident response: triage, coordinate cross-team resolution, communicate clearly with stakeholders, and drive issues to resolution with a sense of urgency.
- Own post-incident reviews: facilitate blameless post-mortems, identify root causes, and ensure action items are tracked and completed.
- Take pride in leaving systems better than you found them, consistently reducing the frequency and impact of incidents over time.
Team Leadership:
- Act as the technical lead for the SRE team, setting direction, priorities, and standards for how the team operates.
- Lead and facilitate team ceremonies including standups, retrospectives, and planning sessions.
- Serve as an escalation point for complex or high-severity incidents, providing guidance and support to engineers during critical moments.
- Collaborate with engineering managers and stakeholders to align SRE priorities with broader product and platform goals.
Collaboration & Engineering Culture:
- Partner with software engineering teams to review system designs and architectures with a reliability lens.
- Mentor and provide technical guidance to junior engineers on SRE practices, tooling, and operational excellence.
- Contribute to a strong team culture that is supportive, curious, and focused on doing great work while having fun.
Education:
- Bachelor’s degree in Computer Science or a related field, or equivalent experience.
Experience:
- Minimum of 7 years of overall experience in software development, infrastructure engineering, or site reliability engineering, with at least 2 years in a team lead or senior technical leadership capacity.
- Demonstrated experience leading or mentoring a team of engineers in a technical environment.
- 3+ years of hands-on experience in an SRE, DevOps, or platform engineering role in a production SaaS environment.
- 3+ years applying an automation-first approach to problem-solving using configuration management tools and scripting.
- Strong experience with AWS; familiarity with services such as EC2, EKS, RDS, S3, CloudWatch, and IAM.
Technical Skills:
- Infrastructure-as-Code expertise with Terraform.
- Proficiency in at least one scripting/programming language (Python, Node.js, or similar) for automation and tooling development.
- Strong understanding of networking fundamentals: DNS, load balancing, TLS, firewalls, and VPCs.
- Experience with CI/CD pipelines and deployment automation.
- Solid understanding of relational databases (PostgreSQL preferred) including query performance and operational concerns.
- Hands-on experience with observability tooling (e.g., Prometheus/Grafana, CloudWatch, or similar).
Soft Skills:
- Strong communication skills: able to explain complex systems clearly, write crisp incident reports, and influence technical decisions across teams.
- Proven ability to lead, coach, and grow a team of engineers, fostering a culture of accountability and continuous improvement.
- Calm under pressure and able to lead effectively during high-severity incidents.
- Passion for reliability, operational excellence, and building systems that just work.
- Commitment to reducing toil through thoughtful automation and process improvement.
- Blameless, growth-oriented mindset with a focus on continuous improvement.
Benefits and perks listed here may vary depending on the nature of employment with Deltek. Employees have access to healthcare benefits, a 401(k) plan and company match, paid vacation time and holidays, well-living programs, short-term and long-term disability coverage, basic life insurance and tuition reimbursement.
Why Join #TeamDeltek
Grow. Collaborate. Innovate.
We create innovative products and solutions that power our customers’ project success. Our market leadership is based on the work of our global and diverse team of innovators, creators and collaborators who have a passion for learning, growing and making a difference for Deltek Project Nation.


