Service Delivery Manager

Position Description

Ensure high-quality, reliable operations of the state-of-the-art HPC environment behind the AI Computing Resource (AICR) that supports the Massachusetts AI Hub. This critical role will coordinate delivery of compute and data services to a diverse user base, working in close partnership with facilitation and science support staff from AI Hub organizations; liaise with academic and industry stakeholders; oversee performance monitoring and continuous improvement; and contribute to strategic planning and vendor management. This is an exciting opportunity to drive service excellence in support of world-class AI research and innovation.

Pay Range: $144,505 - $195,900

Overview

We are looking for an experienced and dynamic Service Delivery Manager to oversee the operational management and continuous improvement of the AICR. The Service Delivery Manager will be responsible for ensuring the smooth, efficient, and reliable delivery of resources and services to internal and external stakeholders. This role combines technical expertise with leadership skills to ensure that technical solutions align with organizational needs and provide an exceptional user experience.

As a Service Delivery Manager for AICR, you will be the primary point of contact for ensuring that service delivery meets operational targets. You will coordinate with AICR technical staff and collaborate with users to ensure services are delivered effectively and that issues are addressed promptly.

Principal Responsibilities

 
  • Service Management: Oversee the end-to-end service delivery of the HPC cluster, ensuring the systems are running efficiently and meeting performance, availability, and security requirements. 
  • Client and Stakeholder Liaison: Act as the primary point of contact for AICR stakeholders on service delivery issues, ensuring their needs and expectations are understood and met. Regularly communicate service performance, enhancements, and upcoming updates.
  • Problem Management: Lead the identification and resolution of service disruptions, performance issues, or other incidents affecting AICR. Coordinate with relevant parties to restore services and prevent recurrence.
  • Continuous Improvement: Analyze service delivery performance and identify areas for improvement. Implement process improvements to enhance the efficiency, reliability, and scalability of the HPC cluster. 
  • Capacity Planning and Resource Allocation: Work with stakeholders to understand future service demands and capacity requirements. Plan for scaling the HPC environment to meet growth needs, ensuring optimal resource allocation. 
  • Budget and Cost Management: Work with the Executive Director to budget for HPC services, ensuring cost-effective delivery without compromising service quality. Work with finance and other stakeholders to forecast and track expenses. 
  • Vendor Management: Collaborate with external vendors and service providers to ensure the HPC cluster is supported by the necessary hardware, software, and maintenance services. Manage vendor relationships and ensure that agreed service levels are maintained.
  • Risk Management: Identify potential risks in the service delivery pipeline and develop mitigation strategies. Ensure that any risks related to performance, security, or downtime are appropriately addressed. 
  • Perform other duties as required. 

Supervision Received 

  • This position reports to the Executive Director, AI Computing Resource (AICR) 

Supervision Exercised 

  • None 

Employment Type 

  • Full-Time, Hybrid (primarily remote with occasional on-site) 

Qualifications & Skills 

Required
  • Education:
      • Bachelor’s degree in Computer Science, Engineering, IT, or a related field (or equivalent experience).
  • Experience:
      • Minimum 7 years of relevant experience.
      • Proven experience managing the service delivery of complex IT infrastructure or computing environments, preferably HPC clusters.
      • Demonstrated leadership in a service management or delivery management role, managing both people and projects.
      • Strong understanding of HPC technologies, including cluster management, job scheduling, parallel computing, and performance tuning.
  • Skills:
      • Strong understanding of service management best practices.
      • Excellent communication and interpersonal skills, with the ability to engage with technical and non-technical stakeholders.
      • Problem-solving skills.
      • Ability to drive continuous improvement.

Preferred
  • Experience delivering AI-specific cluster services 
  • Experience with cloud-based HPC solutions or hybrid environments (e.g., AWS, Azure, Google Cloud). 
  • Familiarity with monitoring tools and metrics, such as Nagios, Prometheus, or Zabbix. 
  • Experience with service management frameworks (e.g., ITIL).
  • Experience in GPU-based computing, storage management, or network configuration for HPC clusters. 
  • Ability to lead teams effectively
Click here to apply (Job Number 25341)

100 Bigelow Street, Holyoke, MA 01040