Job Code: JR0226224
HPC Frontier Lab / CRT-DC runs the Intel High Performance Computing benchmarking cluster called Endeavour. Endeavour is our renowned HPC cluster showcasing Intel Architecture, supporting deals, software development, performance optimization, and much more. We are System Integrators of future platforms. We also host other clusters supporting AI, Cloud, and Enterprise teams in the pursuit of Technology, Pathfinding, and Innovation.
The HPC and AI Systems Administrator has deep technical knowledge of the design and deployment of data centers and the associated subsystems. These can include expertise in data center layout, mechanical design systems, cooling, power delivery and other critical data center design expertise. The deliverables of this role may take the form of design of Intel's data centers, support for Customers in designing their data centers, or in the development of new products and technologies based on data center design expertise.
We partner closely with the Sales and Marketing team, as well as multiple Software Enabling and Optimization organizations, to deliver performant clusters at scale with unreleased and sometimes unstable hardware. This includes Intel Xeon and discrete graphics products, both the latest generations and yet-to-be-released versions, high-performance storage systems, and the fastest fabric interconnects available.
The HPC and AI Systems Administrator will be responsible for, but not limited to, the following:
Providing support and maintenance of large cluster hardware and software for optimized performance, security, consistency, and high availability
Managing various Linux OS distributions
Supporting hardware such as rack-mounted servers and network switches
Supporting the latest Intel HPC data center technologies, including servers, fabric, storage
Utilizing their skills in the areas of cluster debugging, Linux scripting, cluster validation tests, server expansion, file system tests, benchmarking, and job scheduling
Serving as a consultant for all projects and customers of the CRT Datacenter, creating and improving methodologies used in the datacenter to enhance the performance, reliability, and manageability of the CRT clusters
Researching emerging capabilities in external HPC and AI clusters to help set direction on where the team needs to be internally
The Ideal Candidate Should Exhibit the Following Behavioral Skills:
Passion for working on Intel's latest technology
Ability to see the Big Picture and guide the team for success
Outstanding stakeholder and relationship management skills
Effective influencing, written and verbal communication abilities
Strong sense of urgency, acting fast, and enforcing quality
Qualifications
Minimum qualifications are required to be initially considered for this position. Preferred qualifications are in addition to the minimum requirements and are considered a plus factor in identifying top candidates.
Minimum Education & Experience:
A Bachelor's degree in Computer Science/Engineering or another directly related field and 6+ years of experience in one or more of the following:
Linux experience supporting complex clusters with 100+ nodes
Installing and managing Linux server operating systems
Administering Linux servers for a large corporate/enterprise company
Writing sh/bash scripts
Preferred Qualifications:
Master's degree in Computer Science/Engineering or a related field and 4+ years of experience, as outlined above
Programming in C and Python
High speed Ethernet (Gigabit and faster)
High performance interconnects, preferably Mellanox InfiniBand or Omni-Path
Administering and managing HPC clusters with discrete GPUs (Nvidia, AMD, or Intel) and cluster file systems (Lustre, GPFS, etc.)
Container experience (Singularity, Podman, Charliecloud, Docker, Kubernetes, etc.) and containerization as it pertains to HPC / AI workloads
Support of AI frameworks (TensorFlow, others)
Arista, Extreme, or Cisco network hardware setup and configuration
MPI libraries, preferably Intel's
Writing and debugging HPC applications
Managing Cloud based cluster systems
Compiling, patching, or developing Linux kernel and Linux kernel drivers
A+, RHCSA, CCENT or RHCE certification, or equivalent
The focus of Accelerated Computing Systems and Graphics (AXG) is to accelerate our execution in strategic growth areas of high-performance computing and graphics. AXG is chartered with delivering high-performance computing and graphics solutions (IP, Software, Systems) for both integrated and discrete segments across client, enterprise, and data center. Our mission is to make zetta-scale computing accessible to every human on the planet by the end of this decade and to entertain, educate, and connect billions of people with buttery smooth visual experiences.
Intel strongly encourages employees to be vaccinated against COVID-19. Intel aligns to federal, state, and local laws and, as a contractor to the U.S. Government, is subject to government mandates that may be issued. Intel policies for COVID-19, including guidance about testing and vaccination, are subject to change over time.
Posting Statement
All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, sex, national origin, ancestry, age, physical or mental disability, medical condition, genetic information, military and veteran status, marital status, pregnancy, gender, gender expression, gender identity, sexual orientation, or any other characteristic protected by local law, regulation, or ordinance.
Work Model for this Role
This role will be eligible for our hybrid work model which allows employees to split their time between working on-site at their assigned Intel site and off-site.