Service Delivery Manager

Position Description

Ensure high-quality, reliable operations of the state-of-the-art HPC environment behind the AI Computing Resource (AICR) that supports the Massachusetts AI Hub. This critical role coordinates delivery of compute and data services to a diverse user base, working in close partnership with facilitation and science support staff from AI Hub organizations. The role also liaises with academic and industry stakeholders, oversees performance monitoring and continuous improvement, and contributes to strategic planning and vendor management. This is an exciting opportunity to drive service excellence in support of world-class AI research and innovation.

Pay Range: $144,505 - $195,900

Overview

We are looking for an experienced and dynamic Service Delivery Manager to oversee the operational management and continuous improvement of the AICR. The Service Delivery Manager will be responsible for ensuring the smooth, efficient, and reliable delivery of resources and services to internal and external stakeholders. This role combines technical expertise with leadership skills to ensure that technical solutions align with organizational needs and provide an exceptional user experience. As a Service Delivery Manager for AICR, you will be the primary point of contact for ensuring that service delivery meets operational targets. You will coordinate with AICR technical staff and collaborate with users to ensure services are delivered effectively and that issues are addressed promptly.

Principal Responsibilities

Service Management: Oversee the end-to-end service delivery of the HPC cluster, ensuring the systems are running efficiently and meeting performance, availability, and security requirements.

Client and Stakeholder Liaison: Act as the primary point of contact for AICR stakeholders on issues related to service delivery, ensuring their needs and expectations are understood and met. Regularly communicate service performance, enhancements, and upcoming updates.

Problem Management: Lead the identification and resolution of service disruptions, performance issues, or other incidents affecting AICR. Coordinate with relevant parties to restore services and prevent recurrence.

Continuous Improvement: Analyze service delivery performance and identify areas for improvement. Implement process improvements to enhance the efficiency, reliability, and scalability of the HPC cluster.

Capacity Planning and Resource Allocation: Work with stakeholders to understand future service demands and capacity requirements. Plan for scaling the HPC environment to meet growth needs, ensuring optimal resource allocation.

Budget and Cost Management: Work with the Executive Director to budget for HPC services, ensuring cost-effective delivery without compromising service quality. Work with finance and other stakeholders to forecast and track expenses.

Vendor Management: Collaborate with external vendors and service providers to ensure the HPC cluster is supported by the necessary hardware, software, and maintenance services. Manage vendor relationships and ensure that agreed service levels are maintained.

Risk Management: Identify potential risks in the service delivery pipeline and develop mitigation strategies. Ensure that any risks related to performance, security, or downtime are appropriately addressed.

Perform other duties as required.

Supervision Received

This position reports to the Executive Director, AI Computing Resource (AICR).

Supervision Exercised

None

Employment Type

Full-Time, Hybrid (primarily remote with occasional on-site)

Qualifications & Skills

Required

Education:

Bachelor’s degree in Computer Science, Engineering, IT, or a related field (or equivalent experience).

Experience:

Minimum of 7 years of relevant experience required.

Proven experience managing the service delivery of complex IT infrastructure or computing environments, preferably HPC clusters.

Demonstrated leadership in a service management or delivery management role, managing both people and projects.

Strong understanding of HPC technologies, including cluster management, job scheduling, parallel computing, and performance tuning.

Skills:

Strong understanding of service management best practices

Excellent communication and interpersonal skills, with the ability to engage with technical and non-technical stakeholders.

Problem-solving skills

Ability to drive continuous improvement

Preferred

Experience delivering AI-specific cluster services

Experience with cloud-based HPC solutions or hybrid environments (e.g., AWS, Azure, Google Cloud).

Familiarity with monitoring tools and metrics, such as Nagios, Prometheus, or Zabbix.

Experience with service management frameworks (e.g., ITIL).

Experience in GPU-based computing, storage management, or network configuration for HPC clusters.

Ability to lead teams effectively

Job Number: 25341