USC's Information Sciences Institute (ISI), a unit of the university's Viterbi School of Engineering, is a world leader in the research and development of advanced artificial intelligence, information processing, computing, and communications technologies. ISI's 400 faculty, professional staff and graduate students carry out extraordinary information sciences research at three distinct locations - Marina Del Rey, CA, Arlington, VA, and Waltham, MA.
*This position is located in Waltham, MA*
ISI is seeking a Lead HPC Developer interested in helping us develop a shared compute cluster to support our language understanding research. A successful candidate will:
Collaborate with technical leadership in the design, development, installation, and maintenance of software for Linux and HPC cluster systems and ensure its scalability and fault-tolerance needs are met.
Take primary responsibility for planning, implementation, availability, performance, security, maintenance, and repair of cluster infrastructure.
Support best practices across the environment.
The candidate will coordinate closely with both the research groups using the cluster and ISI's information technology department. They will foster collaboration between researchers in an inclusive environment that values differences by building and maintaining collaborative relationships with team members, peers, and organizational leaders.
Drives the day-to-day operations for the Linux and HPC cluster systems by monitoring computing resource performance, managing configurations, and addressing security administration.
Applies revisions to system firmware and software; engages and collaborates with vendors to assist support activities as required.
Leads the development of new HPC software deployment plans, custom scripts, and testing procedures to ensure operational reliability for researchers; trains technical staff in the use of new software and hardware, either developed or acquired.
Oversees the maintenance and management of HPC researcher accounts for staff and research groups; leads the installation, modification, and maintenance of various research software applications for access on HPC clusters; acts as a trusted technical advisor for researcher support and documentation on software applications and programs.
Designs, installs, configures, and performs document management for cluster infrastructure, including operating systems, job schedulers, resource managers, provisioning managers, configuration managers, SAN devices, network devices, and other components.
Investigates, debugs, and addresses researcher inquiries and requests efficiently through a customer issue ticketing system. Implements customer-focused resolutions; communicates complex technical concepts in a simple, straightforward manner to address a broad range of stakeholders.
Education: Bachelor's degree in a relevant field such as computer science, computer information systems, etc. OR equivalent combined education, training, and experience.
Minimum Experience: 5 years of professional experience with at least 3 years of experience in high-performance computing cluster support and Linux system administration.
Multi-vendor management, security, and network/Internet protocols.
Administrating, monitoring, and maintaining secure Linux/UNIX operating systems (CentOS/RHEL, Ubuntu).
Experience with HPC cluster management tools and job schedulers (e.g., SGE, Slurm).
Experience with the planning and design of the hardware that supports an HPC cluster, including both CPU and GPU processing.
Proficiency with interconnect infrastructure such as 10GigE.
Knowledge of HPC storage (FC, SAS) principles, file systems (ZFS, etc.), and compute node storage (NFS).
Proficiency in fundamental scripting skills (Bash, Python, or similar languages).
Configuration management tools such as Salt, Ansible, or Puppet (experience in non-production environments is acceptable).
Ability to identify, troubleshoot, and resolve problems and manage system performance.
Ability to drive technical leadership and management of complex large-scale computing system projects.
Experience establishing processes for maintaining system performance and managing best-in-class standards.
Working knowledge of machine learning algorithms and software frameworks (TensorFlow, PyTorch, Keras, CUDA, cuDNN, Caffe, Theano, etc.).
Virtualization infrastructures (VMware).
Container technologies (Docker, Singularity).
Cloud computing (AWS, Azure).
The University of Southern California values diversity and is committed to equal opportunity in employment.
Minimum Education: Bachelor's degree, or combined work experience and education as equivalent
Minimum Experience: 5 years
Minimum Field of Expertise: Relevant work experience providing strong technical knowledge of programming and analysis, and senior or lead experience.
Internal Number: REQ20088674
USC is the leading private research university in Los Angeles—a global center for arts, technology and international business. With more than 47,500 students, we are located primarily in Los Angeles but also in various US and global satellite locations. As the largest private employer in Los Angeles, responsible for $8 billion annually in economic activity in the region, we offer the opportunity to work in a dynamic and diverse environment, in careers that span a broad spectrum of talents and skills across a variety of academic and professional schools and administrative units. As a USC employee and member of the Trojan Family—the faculty, staff, students, and alumni who make USC a great place to work—you will enjoy excellent benefits, including a variety of well-being programs designed to help individuals achieve work-life balance.