MLOps Engineer

Location: Zurich
Discipline: Machine Learning
Contact name: Tom Goldberg
Contact email: tom@enigma-rec.ai
Job ref: 34783
Published: about 1 month ago

The company develops advanced AI technology, using an Earth systems model to predict atmospheric events with higher accuracy and speed than traditional methods. This is particularly useful for weather-sensitive industries such as power markets, especially during extreme events like hurricanes and cyclones. As the technology evolves, additional applications such as wildfire, flood, and vegetation prediction will be introduced. The team values ambition, open communication, and rapid innovation, and offers employees exciting challenges, creative autonomy, competitive pay, and stock options.

Role Overview

The company is seeking an MLOps Engineer to help scale its infrastructure for large-scale model training, evaluation, and inference. The role involves optimizing a machine learning cluster with over 1,000 GPUs and petabyte-scale training data. The ideal candidate is driven by solving complex engineering problems in an early-stage startup environment and is motivated by the opportunity to grow that infrastructure significantly.

Responsibilities

  • Design, deploy, and maintain key components of a machine learning cluster.

  • Optimize performance at every level, from architectural design to CUDA kernel tuning.

  • Prototype and evaluate new technologies and architectures.

  • Continuously monitor system performance and resource usage.

  • Implement best practices for cluster efficiency and automation.

  • Troubleshoot and resolve technical issues across various layers of the system.

Required Skills

  • Extensive experience managing high-performance computing clusters.

  • Expertise in distributed file systems or object stores (e.g., MinIO, Lustre, Ceph).

  • Deep understanding of distributed model training, including strategies for models with large parameter counts.

  • Strong troubleshooting abilities across the application, OS, networking, and hardware layers.

  • Proactive approach to identifying and resolving performance bottlenecks.

  • Proven experience working within multidisciplinary teams.

Nice-to-Have

  • Experience with geospatial datasets.

  • Project leadership or management experience.