HPC Systems Administrator

Intel
Albuquerque, NM 87102
  • Job Code
    JR0226224
Job Description

HPC Frontier Lab / CRT-DC runs the Intel High Performance Computing benchmarking cluster called Endeavour.  Endeavour is our renowned HPC cluster showcasing Intel Architecture, supporting deals, software development, performance optimization, and much more.  We are System Integrators of future platforms. We also host other clusters supporting AI, Cloud, and Enterprise teams in the pursuit of Technology, Pathfinding, and Innovation.

 

The HPC and AI Systems Administrator has deep technical knowledge of the design and deployment of data centers and the associated subsystems.  This can include expertise in data center layout, mechanical design systems, cooling, power delivery, and other critical areas of data center design.  The deliverables of this role may take the form of designs for Intel's data centers, support for customers in designing their data centers, or the development of new products and technologies based on data center design expertise.

 

We partner closely with the Sales and Marketing team, as well as multiple Software Enabling and Optimization organizations, to deliver performant clusters at scale with unreleased and sometimes unstable hardware.  This includes Intel Xeon and discrete graphics products, both the latest generations and the yet-to-be-released versions, high performance storage systems, and the fastest fabric interconnects available.

 

The HPC and AI Systems Administrator will be responsible for, but not limited to, the following:

  • Providing support and maintenance of large cluster hardware and software for optimized performance, security, consistency, and high availability

  • Managing various Linux OS distributions

  • Supporting hardware such as rack-mounted servers and network switches

  • Supporting the latest Intel HPC data center technologies, including servers, fabric, and storage

  • Utilizing their skills in the areas of cluster debugging, Linux scripting, cluster validation tests, server expansion, file system tests, benchmarking, and job scheduling

  • Serving as a consultant for all projects and customers of the CRT Datacenter, creating and improving methodologies used in the datacenter to enhance the performance, reliability, and manageability of the CRT clusters

  • Researching emerging capabilities in external HPC and AI clusters to help set direction on where the team needs to be internally

 

The Ideal Candidate Should Exhibit the Following Behavioral Skills: 

  • Passion for working on Intel's latest technology

  • Ability to see the Big Picture and guide the team for success

  • Outstanding stakeholder and relationship management skills

  • Effective influencing skills and strong written and verbal communication abilities

  • Strong sense of urgency, acting fast, and enforcing quality


Qualifications

Minimum qualifications are required to be initially considered for this position. Preferred qualifications are in addition to the minimum requirements and are considered a plus factor in identifying top candidates.

 

Minimum Education & Experience:

 

A Bachelor's degree in Computer Science/Engineering or another directly related field and 6+ years of experience in one or more of the following:

  • Linux experience supporting complex clusters with 100+ nodes

  • Installing and managing Linux server operating systems

  • Administering Linux servers for a large corporate/enterprise company

  • Writing sh/bash scripts

 

Preferred Qualifications: 

  • Master's degree in Computer Science/Engineering or a related field and 4+ years of experience, as outlined above

  • Programming in C and Python

  • High speed Ethernet (Gigabit and faster)

  • High performance interconnects, preferably Mellanox InfiniBand or Omni-Path

  • Administering and managing HPC clusters with discrete GPUs (Nvidia, AMD, or Intel) and cluster file systems (Lustre, GPFS, etc.)

  • Container experience (Singularity, Podman, Charliecloud, Docker, Kubernetes, etc.) and containerization as it pertains to HPC / AI workloads

  • Support of AI frameworks (TensorFlow, others)

  • Arista, Extreme, or Cisco network hardware setup and configuration

  • MPI libraries, preferably Intel's

  • Writing and debugging HPC applications

  • Managing Cloud based cluster systems

  • Compiling, patching, or developing Linux kernel and Linux kernel drivers

  • A+, RHCSA, CCENT or RHCE certification, or equivalent

Inside this Business Group

The focus of Accelerated Computing Systems and Graphics (AXG) is to accelerate our execution in strategic growth areas of high-performance computing and graphics. AXG is chartered with delivering high performance computing and graphics solutions (IP, Software, Systems) for both integrated and discrete segments across client, enterprise, and data center.  Our mission is to make zetta-scale computing accessible to every human on the planet by the end of this decade and to entertain, educate, and connect billions of people with buttery smooth visual experiences.


Intel strongly encourages employees to be vaccinated against COVID-19. Intel aligns to federal, state, and local laws and, as a contractor to the U.S. Government, is subject to government mandates that may be issued. Intel policies for COVID-19, including guidance about testing and vaccination, are subject to change over time.



Posting Statement

All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, sex, national origin, ancestry, age, physical or mental disability, medical condition, genetic information, military and veteran status, marital status, pregnancy, gender, gender expression, gender identity, sexual orientation, or any other characteristic protected by local law, regulation, or ordinance.



Work Model for this Role

This role will be eligible for our hybrid work model, which allows employees to split their time between working on-site at their assigned Intel site and off-site.

Posted: 2022-06-05 Expires: 2022-07-06
