☁️ Cloud Training & HPO Pipeline
This page documents the cloud-based training and hyperparameter optimization (HPO) workflow for Aegear, leveraging RunPod and ClearML for scalable, automated model development.
🚀 Overview
Aegear supports launching training jobs and HPO experiments in the cloud using:
- RunPod: GPU cloud provider for scalable training.
- ClearML: Experiment tracking, orchestration, and HPO management.
This setup enables:
- Automated training job launches with custom configuration.
- Hyperparameter sweeps and grid search for model optimization.
- Experiment tracking and result analysis via ClearML.
🛠️ Key Scripts & Components
- tools/launch_runpod_training.py: Launches a single training job on RunPod with custom arguments and environment variables. ClearML integration is optional: if you omit the --task-name argument, ClearML tracking is disabled and the job runs standalone.
- tools/clearml_runpod_hpo.py: Orchestrates HPO runs by launching multiple training jobs with different hyperparameters and tracking the results in ClearML. HPO pods are automatically shut down after completion to minimize cloud costs.
- docker/run_training.sh: Entrypoint script used inside the container for reproducible training runs.
- tools/train.py: Main training script, used for both local and cloud jobs.
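The optional ClearML behaviour described above can be pictured with a small sketch. It assumes, hypothetically, that the launcher parses --task-name with argparse and treats an omitted value as "tracking off". The function names parse_launch_args and tracking_enabled are illustrative and are not taken from the actual script.

```python
import argparse


def parse_launch_args(argv):
    """Parse a minimal, illustrative subset of launcher flags."""
    parser = argparse.ArgumentParser(description="Sketch of a launcher CLI")
    # --task-name is the flag named in the docs; omitting it disables tracking.
    parser.add_argument("--task-name", default=None)
    parser.add_argument("--epochs", type=int, default=10)
    return parser.parse_args(argv)


def tracking_enabled(args):
    """Hypothetical helper: tracking is on only when a task name was given."""
    return bool(args.task_name)


with_tracking = parse_launch_args(["--task-name", "my_exp"])
standalone = parse_launch_args(["--epochs", "5"])
print(tracking_enabled(with_tracking))  # True
print(tracking_enabled(standalone))     # False
```

Keeping the decision in one small predicate would make the standalone path easy to test without a ClearML server, though the real script may structure this differently.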
📦 How It Works
- Prepare Docker Image
  - Use the published image (ljubobratovicrelja/aegear:latest) or build your own.
- Configure Training
  - Set environment variables or CLI arguments for your training job (see docker.md).
- Launch Training on RunPod
  - Use launch_runpod_training.py to start a pod with your configuration.
  - Example:
    python tools/launch_runpod_training.py --task-name my_exp --model-type efficient_unet --data-manifest /workspace/data/manifest.json --model-dir /workspace/models/unet --checkpoint-dir /workspace/models/unet/checkpoints --epochs 10 --batch-size 128
- Run HPO Experiments
  - Use clearml_runpod_hpo.py to launch a grid search or sweep over hyperparameters. HPO pods are automatically shut down after each run.
  - Example:
    python tools/clearml_runpod_hpo.py --config config/hpo_config.yaml
  - See an example HPO YAML config file here.
- Track Results in ClearML
  - All jobs and experiments are tracked in your ClearML dashboard for analysis and comparison.
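To make the grid-search step more concrete, here is a minimal sketch of how a grid of hyperparameters could be expanded into one launch command per trial. It is not the logic of clearml_runpod_hpo.py, which reads a YAML config whose schema is not shown on this page. The helpers expand_grid and build_command are hypothetical, and only flags that appear in the example above (--epochs, --batch-size) are used.

```python
import itertools
import shlex


def expand_grid(grid):
    """Yield one dict per combination of the listed hyperparameter values."""
    keys = sorted(grid)  # fixed key order keeps trial order deterministic
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def build_command(params, script="tools/launch_runpod_training.py"):
    """Turn a parameter dict into an argv list for the launcher (sketch)."""
    argv = ["python", script]
    for name, value in params.items():
        argv += [f"--{name}", str(value)]
    return argv


# Illustrative grid: 2 x 2 values give 4 trials.
grid = {"epochs": [10, 20], "batch-size": [64, 128]}
for trial in expand_grid(grid):
    print(shlex.join(build_command(trial)))
```

Building an argv list and joining it only for display avoids quoting problems if a value ever contains spaces.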
🔑 Environment & Credentials
- RunPod API Token: Required for launching pods.
- ClearML Credentials: Needed for experiment tracking and HPO.
- Docker Hub Credentials (optional): For authenticated image pulls.
Set these as environment variables before launching jobs.
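A quick preflight check can catch missing credentials before a pod is started. The sketch below assumes specific variable names: CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY are the names the ClearML SDK reads, while RUNPOD_API_KEY is an assumed convention. This page does not state which names Aegear's scripts expect, so adjust the tuple to match your setup.

```python
import os

# Assumed variable names; the names Aegear's scripts read may differ.
REQUIRED_VARS = (
    "RUNPOD_API_KEY",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All required credentials are set.")
```

Passing the environment as an argument keeps the check testable without touching the real process environment.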
📥 Data Access
Training data shards are available from:
gs://aegear-training-data/shards
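Storage clients typically take the bucket and the object prefix as separate arguments, so it can help to split the URI first. The following is a small sketch using only the standard library; split_gcs_uri is a hypothetical helper and is not part of Aegear.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split a gs:// URI into (bucket, prefix); raise on anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_gcs_uri("gs://aegear-training-data/shards")
print(bucket, prefix)  # aegear-training-data shards
```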
📖 References & Further Reading
For questions or issues, contact the maintainer or open an issue.