3.3. Experiment Management#
Making progress on data science projects requires a large number of experiments: attempts at tuning parameters, trying different data, improving code, collecting better metrics, and so on. Keeping track of all these changes is essential, as we may want to inspect them when comparing outcomes, and recovering the exact conditions of an earlier experiment is necessary to reproduce results or resume a line of work.
There are tools out there to make managing your experiments easier, but they may come with a learning curve, be too inflexible, or require a server. Try them out and see what works best for you.
3.3.1. DVC#
https://dvc.org/doc/user-guide/experiment-management
DVC allows you to track and version not only your code with git, but also large datasets and models, similar in spirit to git LFS.
PRO:
- can be run on any experiment
- language-agnostic
- easy to get started

CON:
- storing large datasets externally can require SFTP, S3 or cloud infrastructure
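As a minimal sketch of how a DVC-tracked artifact is consumed from Python, assuming a dataset has already been added with `dvc add` (the path `data/train.csv` and the revision are hypothetical):

```python
import dvc.api

# Stream a DVC-tracked file at a given Git revision without a full checkout.
# The path and rev are placeholders; adjust them to your repository.
with dvc.api.open("data/train.csv", rev="main") as f:
    print(f.readline())
```

Because every version of the data is pinned to a Git commit, checking out an old revision restores code and data together.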
3.3.2. Nextflow#
Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages. Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.
Converting your code to a Nextflow pipeline not only parallelizes the code, but also gives you some experiment management.
PRO:
- one easy config object
- language-agnostic
- easy to get started
- logs input parameters and execution
- creates separate output folders
- caching with `-resume`
- has executors for many platforms and conda/container support
- lots of existing pipelines (e.g. nf-core)

CON:
- config is Groovy/Java-based
- composition of different subworkflows can become messy
- can be difficult to debug due to lack of typing
- some performance issues when scaling up because of serialization between every step
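As a rough sketch of what a converted pipeline can look like, here is a minimal one-process workflow in DSL2 syntax; the file name `main.nf`, the `input` parameter and the `SUMMARIZE` process are made up for illustration:

```nextflow
// main.nf -- a minimal, hypothetical one-process pipeline
params.input = 'data/*.csv'             // override on the CLI with --input

process SUMMARIZE {
    publishDir 'results', mode: 'copy'  // collects outputs in a separate folder

    input:
    path csv

    output:
    path '*.summary.txt'

    script:
    """
    wc -l ${csv} > ${csv}.summary.txt
    """
}

workflow {
    Channel.fromPath(params.input) | SUMMARIZE
}
```

Running it with `nextflow run main.nf -resume` re-executes only the steps whose inputs changed and serves everything else from the cache.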
3.3.3. Hydra#
Hydra is a Python framework on top of OmegaConf, enabling you to configure complex applications, like a data processing pipeline with lots of steps and parameters. For more information, see Hydra.
PRO:
- one easy config object
- logs input parameters and execution
- creates separate output folders
- the config object can be composed from the subconfigs of the separate pipeline steps
- configs can be inherited or created from function type hints
- the submitit plugin enables HPC job submission with SLURM
- configs can be type-checked before job submission
CON:
- Python-based
- there is a learning curve
- no complete solution for workflow management or caching of results
- the structured configs add extra boilerplate code; hydra-zen can generate structured configs automatically and dynamically, but this stack becomes even more complex and difficult to debug
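A minimal sketch of a Hydra entry point, assuming a config directory `conf/` containing a root `config.yaml` (both names are illustrative):

```python
import hydra
from omegaconf import DictConfig, OmegaConf

# Hydra composes conf/config.yaml (plus subconfigs and CLI overrides) into a
# single cfg object, creates a separate output directory for the run, and
# logs the composed config there.
@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```

Any parameter can then be overridden from the command line, e.g. `python main.py optimizer.lr=0.01` (the `optimizer.lr` key is hypothetical), and each run lands in its own output folder.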
3.3.4. Hydra + DVC#
https://dvc.org/doc/user-guide/experiment-management/hydra-composition#hydra-composition
You can combine the ecosystem of DVC with the config composition of Hydra while staying language-agnostic.
PRO:
- all PROs from DVC and most from Hydra
- language-agnostic

CON:
- there is a learning curve
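With Hydra composition enabled in DVC (see the linked docs), each experiment run composes the Hydra configs into `params.yaml`, which any language can read. A small Python sketch using the DVC API; the parameter names are whatever your config defines:

```python
import dvc.api

# Read the composed parameters from params.yaml as a plain dict,
# e.g. {"train": {"lr": 0.01}} depending on your config.
params = dvc.api.params_show()
print(params)
```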