Job Description
Job Title: HPC Systems Engineer
Contract type: W2 - Temp - 6 Months - No Sponsorship or C2C
Location: Houston, TX (On site)
Scope:
This role supports a hybrid on-premises and cloud-based High Performance Computing (HPC) environment that runs proprietary software and serves multiple production and development groups.
Specific Work Requirements:
- Minimum of 5 years of experience in a large-scale enterprise HPC environment, including extensive server infrastructure, high-capacity storage systems, and automated tape libraries.
- Proficient in installing, configuring, and managing Linux-based operating systems, preferably RHEL, CentOS, and Rocky Linux.
- Experience with cluster management software such as xCAT.
- Skilled in installation and maintenance of servers, tape drives, robotic libraries, GPUs, SSDs, and disk arrays.
- Familiarity with containerization technologies.
- Strong knowledge of networking and datacenter infrastructure, including switching, routing, high-availability systems, and network types and topologies (LAN/WAN/WLAN), as well as configuration of Ethernet, InfiniBand, and Fibre Channel SAN.
- Hands-on experience with HPC storage solutions such as HPE ClusterStor, NetApp, Dell Isilon, and Pure Storage.
- Proficient in scripting languages such as Bash, C Shell, Perl, Python, and Ruby, and with monitoring tools such as MRTG.
- Experience with PostgreSQL installation and support.
- Familiar with public cloud environments, including provisioning, image building, and automation scripting on platforms such as Google Cloud and Azure.
- Skilled in using configuration management and infrastructure-as-code tools such as Ansible and Terraform.
- Experience with backup and recovery solutions, including tools such as IBM Spectrum and Dell NetWorker.
- Solid understanding of Linux security, including endpoint protection configurations.
- Ability to assess system environments and provide recommendations for performance and operational improvements.
- Capable of troubleshooting and resolving system-level issues.
General Work Requirements:
- Follow local change management processes, ensuring changes are tested in non-production environments before deployment.
- Clearly communicate planned changes, service impacts, and dependencies to relevant stakeholders.
- Ensure deployment standards and compliance are maintained and monitored through appropriate tools.
- Meet support documentation requirements, including detailed resolution steps and knowledge base contributions.
- Maintain a thorough understanding of computing hardware and its upkeep.
- Coordinate with cross-functional internal support teams to perform systems administration tasks.
- Interface with external vendors for technical support and issue escalation.
- Provide weekly status updates and actively participate in team meetings.
- Be available for after-hours maintenance, on-call support rotation, and participation in scheduled/emergency data center activities.
- Conduct peer reviews for major deployments to maintain deployment quality standards.
- Ensure full compliance with best practices and with quality, health, safety, and environmental (QHSE) procedures relevant to the position.
Personal Traits:
- Self-motivated and capable of working independently with minimal supervision.
- Strong team player able to contribute in both leadership and support capacities.
- Excellent communication skills in written, verbal, and interpersonal contexts, suitable for collaboration with peers, stakeholders, and vendors.
- Consistent adherence to standard systems administration methodologies.
- Able to document user and operational requirements in a clear and structured manner.