Infrastructure Engineer
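100,000-150,000 yuan Hong Kong 8+ years of experience Master's degree
  • Supplementary medical insurance
  • Startup company
  • Mandatory Provident Fund (MPF)
Video Rebirth Limited 2025-02-06 21:08:33 76 followers
Job Description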
This position has not yet been verified by the site; please review it carefully before applying.
Position Overview
We are seeking an experienced Infrastructure Engineer to architect and manage our AI computing infrastructure. The ideal candidate will have extensive experience in building and scaling ML infrastructure, with particular emphasis on distributed training systems and GPU cluster management.

Key Responsibilities
  • Design and implement high-performance computing infrastructure for large-scale AI model training
  • Manage and optimize GPU clusters for distributed training workloads
  • Build and maintain container orchestration systems for ML workflows
  • Implement efficient resource allocation and scheduling systems
  • Design and maintain monitoring and alerting systems for compute infrastructure
  • Optimize infrastructure costs while maintaining performance
  • Collaborate with ML teams to support their computing needs
  • Ensure system reliability, security, and scalability

Required Qualifications
  • Master's degree in Computer Science, Systems Engineering, or a related field
  • 8+ years of experience in infrastructure engineering, with a focus on ML/AI infrastructure
  • Strong experience with:
      • GPU cluster management and optimization
      • Kubernetes and container orchestration
      • Linux system administration
      • Infrastructure as Code (IaC)
  • Proven track record in building large-scale computing systems
  • Experience with major cloud providers (AWS/GCP/Azure or Alibaba Cloud/Tencent Cloud, etc.)

Preferred Qualifications
  • Experience with ML infrastructure at major tech companies
  • Knowledge of distributed training systems (PyTorch DDP, Horovod)
  • Familiarity with ML frameworks and their infrastructure requirements
  • Experience with high-performance networking (InfiniBand, RDMA)
  • Background in performance optimization and troubleshooting
  • Understanding of ML workload characteristics
  • Bilingual proficiency (English/Chinese)

Technical Skills
Computing Infrastructure
  • GPU Clusters: NVIDIA DGX, GPU management tools
  • Distributed Systems: Slurm, Kubernetes
  • ML Platforms: Kubeflow, Ray
  • Job Scheduling: YARN, Slurm
Cloud & Networking
  • Cloud Platforms: AWS, GCP, Azure (international); Alibaba Cloud, Tencent Cloud (China)
  • Networking: InfiniBand, RDMA, TCP/IP optimization
  • Load Balancing: HAProxy, NGINX
Infrastructure Management
  • Container Technologies: Docker, Kubernetes, Singularity
  • IaC: Terraform, Ansible, CloudFormation
  • CI/CD: Jenkins, GitLab CI
  • Monitoring: Prometheus, Grafana, ELK Stack
Development
  • Languages: Python, Go, Shell scripting
  • Version Control: Git
  • Documentation: Markdown, Confluence

What We Offer
  • Opportunity to build cutting-edge AI infrastructure
  • Competitive salary and equity package
  • Access to the latest hardware and technologies
  • Professional development opportunities
  • Comprehensive health benefits
  • Learning and conference budget

Location
Hong Kong (on-site, Hong Kong Science and Technology Park)

Expected Impact
  • Design and implement next-generation AI computing infrastructure
  • Optimize resource utilization and cost efficiency
  • Improve training speed and efficiency for AI models
  • Build scalable and reliable systems

Projects You'll Work On
  • Building automated GPU cluster management systems
  • Implementing efficient resource scheduling for ML workloads
  • Optimizing distributed training infrastructure
  • Setting up monitoring and observability systems
  • Designing disaster recovery and backup solutions
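Contact Information
Note: When contacting the employer, please mention that you saw this listing on 赣州人才网 (Ganzhou Talent Network).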
Work Location
Address: Units 317-318, Building 10W, Hong Kong Science Park, Sha Tin District, Hong Kong