
AI Cluster Software Development and Validation Engineer

Intel
Hillsboro, OR 97123
  • Job Code
    JR0208451
Job Description

In this role you are joining a highly dynamic team geared to build a best-in-class, fastest scale-up and scale-out cluster to support the AI workloads of today and the future, on Intel.

You will join a team to pull together software components and systems to validate and deliver a complete cluster/paradigm that can morph dynamically to the needs of AI workloads.

This cluster will need to have all the requisite frameworks, libraries, system drivers, and software to visualize results. Beyond that, telemetry of all parts of the stack (system, network/fabric, storage) will need to be collected and consolidated in dashboards, where you'll analyze performance issues so that you can fine-tune the cluster based on AI workload needs and bottlenecks.

Because this is a brand new endeavor, you will find missing components in the software stacks, or elements that aren't fully functional. For these you will collaborate with partner teams and external providers (e.g. network/fabric, storage, and software vendors) to request features and file bugs. At the same time, you'll have the opportunity to create shim layers or adopt workarounds so that you can keep moving quickly.

From a software development perspective, you may need to develop a variety of solutions like:

  • Scripts to extract telemetry and validate the performance of cluster components and the cluster as a whole.

  • Methods to ingest and visualize the performance of the cluster to find problems.

  • Automatic analysis of telemetry data to anticipate and/or address performance problems.

  • Shim layers to account for capabilities that don't exist yet to support AI workloads.

  • Containerization of workloads.

  • Implementation of software components (especially if open source) to address interoperability gaps.


To do this well you will need:

  • System/Cluster and software performance analysis skills.

  • Agility and creativity in finding solutions to performance and interoperability issues.

  • Grounded discipline to deliver flexible and reliable reference designs that customers will want to use.

  • Great technical curiosity.


Your integration and testing work will address many combinations of hardware components: Ethernet or fabric connectivity, other node connectivity (GPU direct RDMA), multiple storage options, and both scale-out (across systems) and scale-up (within systems) configurations.


Qualifications

You must possess the below minimum qualifications to be initially considered for this position. Preferred qualifications are in addition to the minimum requirements and are considered a plus factor in identifying top candidates. Experience would be obtained through a combination of prior education level classes, and current level school classes, projects, research, and relevant previous job and internship experience.

Minimum Qualifications:

  • Master's degree in Computer Science, Computer Engineering or any other related field and 3+ years of experience.

-OR-

  • PhD degree in Computer Science, Computer Engineering or any other related field.


Experience in:

  • 3+ years of Linux experience supporting complex servers.

  • 3+ years of experience managing heterogeneous clusters with high performance fabric interconnects and high-speed storage capabilities.

  • 3+ years of experience with workload and system performance analysis in complex clusters.

  • 3+ years of experience in AI software workloads.


Preferred Qualifications:
Experience in:

  • Programming in at least one of the following languages (C, Python or Bash).

  • Managing cluster systems with 100+ nodes.

  • Gigabit Ethernet.

  • High performance fabric interconnects.

  • Managing AI and HPC clusters with discrete GPUs (Nvidia, AMD or Intel).

  • Containers (Singularity, Podman, Charliecloud, Docker, Kubernetes, others).

  • Administering high performance cluster file systems (Lustre, GPFS, DDN, others).

  • Supporting AI frameworks (TensorFlow, others).

  • MPI libraries, preferably Intel MPI.

  • AI applications and using AI frameworks.

  • Containerization as it pertains to HPC / AI workloads.

  • Provisioning capabilities (Kubernetes, OpenShift, etc).

  • Collecting and analyzing telemetry in all parts of the HW/SW stack in a cluster.

  • oneDNN, oneVPL, oneMKL, oneCCL.

Inside this Business Group

The focus of Accelerated Computing Systems and Graphics (AXG) is to accelerate our execution in strategic growth areas of high-performance computing and graphics. AXG is chartered with delivering high performance computing and graphics solutions (IP, Software, Systems), for both integrated and discrete segments across client, enterprise and data center. Our mission is to make zetta-scale computing accessible to every human on the planet by the end of this decade and to entertain, educate and connect billions of people with buttery smooth visual experiences.



Other Locations

US, Arizona, Phoenix; US, California, Santa Clara; US, New Mexico, Albuquerque; US, Texas, Austin


Intel strongly encourages employees to be vaccinated against COVID-19. Intel aligns to federal, state, and local laws and as a contractor to the U.S. Government is subject to government mandates that may be issued. Intel policies for COVID-19 including guidance about testing and vaccination are subject to change over time.



Posting Statement

All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, sex, national origin, ancestry, age, physical or mental disability, medical condition, genetic information, military and veteran status, marital status, pregnancy, gender, gender expression, gender identity, sexual orientation, or any other characteristic protected by local law, regulation, or ordinance.

Posted: 2022-04-30 Expires: 2022-05-31
