7 Disk-based pipelines

Disk-based pipelines run a sequence of scripts in different languages that communicate only via files on disk. This is useful when you have to use different languages for different parts of the analysis and don’t want the hassle of managing an in-memory interoperability setup. You can either keep the notebook format of the exploratory analysis or convert the code snippets to scripts.

7.1 Notebook pipelines

You can use Quarto to run code snippets in different languages in the same .qmd notebook; our Use-case chapter is one example of this. The polyglot powers of Quarto for executing code blocks come from tools such as knitr or Jupyter kernels, so you don’t need an in-memory interoperability setup.
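
Rendering such a notebook is a single Quarto CLI call (the file name below is a placeholder):

# Quarto dispatches each code chunk to knitr or a Jupyter kernel and
# collects all results into one rendered document.
quarto render my_notebook.qmd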

There are frameworks that keep the notebook format and build a pipeline from the notebooks themselves. The upside is that you can avoid converting the code snippets in the notebooks to scripts. The downside is that you have to adopt a specific framework, and the notebooks can become very large and unwieldy. For example, Papermill can execute parameterized Jupyter notebooks in sequence and pass variables between them. nbdev and Ploomber are similar frameworks.
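
As an illustration of Papermill's parameterization (a sketch; the notebook names are placeholders and the parameter mirrors the use case):

# Execute a parameterized notebook; -p overrides a variable defined in the
# notebook's "parameters" cell, and the executed copy is written to outputs/.
papermill compute_pseudobulk.ipynb outputs/compute_pseudobulk.ipynb -p sm_name Belinostat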

7.1.1 Execute notebooks via the CLI

You can execute notebooks from the command line as a job, so you don’t have to wait for them to finish. This is useful when running experiments, as the output and plots are saved in the executed notebook and can be inspected at a later time.

Jupyter notebook via nbconvert:

jupyter nbconvert --to notebook --execute my_notebook.ipynb --allow-errors --output-dir outputs/

Similar functionality exists for RMarkdown notebooks via Rscript:

Rscript -e "rmarkdown::render('my_notebook.Rmd',params=list(args = myarg))"
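
Either command can also be launched as a background job, so the terminal is free again while the notebook executes; a minimal sketch with nohup (the log path is an assumption):

# Execute the notebook in the background and capture all output in a log file.
nohup jupyter nbconvert --to notebook --execute my_notebook.ipynb \
  --allow-errors --output-dir outputs/ > outputs/run.log 2>&1 &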

7.2 Script pipelines

The scripts in a script pipeline collect the code snippets from the notebooks; they can be written in different languages and are executed in sequence.

These scripts have the same functionality as the code in the use-case example notebook. The first script, scripts/1_load_data.sh, downloads the dataset:

if [[ ! -f usecase/data/sc_counts_reannotated_with_counts.h5ad ]]; then
  aws s3 cp \
    --no-sign-request \
    s3://openproblems-bio/public/neurips-2023-competition/sc_counts_reannotated_with_counts.h5ad \
    usecase/data/sc_counts_reannotated_with_counts.h5ad
fi

The second script, scripts/2_compute_pseudobulk.py, loads the dataset, subsets it and computes the pseudobulk counts:

# The dataset is stored in an AnnData object, which can be loaded in Python as follows:
import anndata as ad

print("Load data")
adata = ad.read_h5ad("usecase/data/sc_counts_reannotated_with_counts.h5ad")

sm_name = "Belinostat"
control_name = "Dimethyl Sulfoxide"
cell_type = "T cells"

adata = adata[
  adata.obs["sm_name"].isin([sm_name, control_name]) &
  adata.obs["cell_type"].isin([cell_type]),
].copy()

# We will also subset the genes to the top 2000 most variable genes.

adata = adata[:, adata.var["highly_variable"]].copy()

print("Compute pseudobulk")
# Combine data in a single data frame and compute pseudobulk
import pandas as pd

combined = pd.DataFrame(
  adata.X.toarray(),
  index=adata.obs["plate_well_celltype_reannotated"],
)
combined.columns = adata.var_names
pb_X = combined.groupby(level=0).sum()

print("Construct obs for pseudobulk")

# Use 'plate_well_celltype_reannotated' as index and make sure to retain the columns 'sm_name', 'cell_type', and 'plate_name':

pb_obs = adata.obs[["sm_name", "cell_type", "plate_name", "well"]].copy()
pb_obs.index = adata.obs["plate_well_celltype_reannotated"]
pb_obs = pb_obs.drop_duplicates()

print("Create AnnData object")
pb_adata = ad.AnnData(
  X=pb_X.loc[pb_obs.index].values,
  obs=pb_obs,
  var=adata.var,
)

print("Store to disk")
pb_adata.write_h5ad("usecase/data/pseudobulk.h5ad")

The third script, scripts/3_analysis_de.R, performs the differential expression analysis with DESeq2:

cat("Loading libraries...\n")
library(anndata)
library(dplyr, warn.conflicts = FALSE)

cat("Reading data...\n")
pb_adata <- read_h5ad("usecase/data/pseudobulk.h5ad")

# Select small molecule and control:
sm_name <- "Belinostat"
control_name <- "Dimethyl Sulfoxide"

cat("Create DESeq dataset\n")
# transform counts matrix
count_data <- t(pb_adata$X)
storage.mode(count_data) <- "integer"

# create dataset
dds <- DESeq2::DESeqDataSetFromMatrix(
  countData = count_data,
  colData = pb_adata$obs,
  design = ~ sm_name + plate_name
)

cat("Run DESeq2\n")
dds <- DESeq2::DESeq(dds)

# Get results:
res <- DESeq2::results(dds, contrast=c("sm_name", sm_name, control_name)) |>
  as.data.frame()

# Preview results:
res |>
  arrange(padj) |>
  head(10)

# Write to disk:
write.csv(res, "usecase/data/de_contrasts.csv")

You can use a shell script or the language of your preference to run the pipeline sequentially. This usually requires all the dependencies to be installed in one large environment.

From Bash:

#!/bin/bash

bash scripts/1_load_data.sh
python scripts/2_compute_pseudobulk.py
Rscript scripts/3_analysis_de.R
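
Note that plain Bash keeps going when a step fails; a slightly hardened sketch of the same driver script could look like this:

#!/bin/bash
set -euo pipefail  # abort the pipeline as soon as any step fails

bash scripts/1_load_data.sh
python scripts/2_compute_pseudobulk.py
Rscript scripts/3_analysis_de.R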

From R:

system("bash scripts/1_load_data.sh")
system("python scripts/2_compute_pseudobulk.py")
# Alternatively you can run R code here instead of calling an R script
system("Rscript scripts/3_analysis_de.R")

From Python:

import subprocess

# check=True aborts the pipeline when a step exits with a non-zero status
subprocess.run("bash scripts/1_load_data.sh", shell=True, check=True)
# Alternatively you can run Python code here instead of calling a Python script
subprocess.run("python scripts/2_compute_pseudobulk.py", shell=True, check=True)
subprocess.run("Rscript scripts/3_analysis_de.R", shell=True, check=True)

7.3 Pipelines with different environments

Sometimes you might want to run scripts in different environments, as it can be too much hassle to install all dependencies in a single environment. Other reasons can be that you want to reuse existing container images or that you want to keep parts of the pipeline separate and maintainable.

You can interleave your Bash script with environment activation commands, e.g. conda activate {script_env}. To keep this reproducible, you need e.g. a conda .yaml file for each script environment. This can grow unwieldy, as Conda environments sometimes need to be adapted to the platform and to the availability of accelerators like GPUs. Another important consideration is that packages that impact the on-disk data format should be the same version across environments.
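
As a minimal sketch (the environment names are assumptions, each created from its own conda .yaml file), such an interleaved driver script could use conda run, which executes a command inside a named environment without sourcing the activation scripts:

#!/bin/bash
set -e

bash scripts/1_load_data.sh

# Each step runs in its own conda environment.
conda run -n scverse_env python scripts/2_compute_pseudobulk.py
conda run -n rverse_env Rscript scripts/3_analysis_de.R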

A simple and elegant, but very new, solution is to use the Pixi package management tool to manage the environments and tasks for you. The environments can be composed from multiple features containing dependencies, so you can have a scverse environment with only Python, an rverse environment with only R, and even an all environment with both by adding the respective features (provided such an environment is resolvable).

Run scripts in different environments with pixi:

pixi run -e bash scripts/1_load_data.sh
pixi run -e scverse scripts/2_compute_pseudobulk.py
pixi run -e rverse scripts/3_analysis_de.R

With the Pixi task runner, you can define these tasks in their respective environments and have them called in sequence by an overarching task. The relevant lines of the pixi.toml file are:

...
[feature.bash.tasks]
load_data = "bash book/disk_based/scripts/1_load_data.sh"
...
[feature.scverse.tasks]
compute_pseudobulk = "python book/disk_based/scripts/2_compute_pseudobulk.py"
...
[feature.rverse.tasks]
analysis_de = "Rscript --no-init-file book/disk_based/scripts/3_analysis_de.R"
...
[tasks]
pipeline = { depends-on = ["load_data", "compute_pseudobulk", "analysis_de"] }

Now you can run this multi-environment pipeline with a single command, and it will always run the tasks in their respective, up-to-date environments:

pixi run pipeline
Pixi task (load_data in bash): bash book/disk_based/scripts/1_load_data.sh

Pixi task (compute_pseudobulk in scverse): python book/disk_based/scripts/2_compute_pseudobulk.py
Load data
Compute pseudobulk
/app/book/disk_based/scripts/2_compute_pseudobulk.py:29: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  pb_X = combined.groupby(level=0).sum()
Construct obs for pseudobulk
Create AnnData object
Store to disk

Pixi task (analysis_de in rverse): Rscript --no-init-file book/disk_based/scripts/3_analysis_de.R
Loading libraries...
Reading data...
Create DESeq dataset
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]
Run DESeq2
estimating size factors
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]
final dispersion estimates
fitting model and testing
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not a warning or an error]
          baseMean log2FoldChange      lfcSE      stat        pvalue
BEX5      59.24944       2.187350 0.05660399  38.64304  0.000000e+00
HIST1H1D 301.38741       1.356543 0.03092962  43.85901  0.000000e+00
STMN1    234.72112       2.224633 0.04104002  54.20642  0.000000e+00
PCSK1N    64.91604       1.899149 0.05480612  34.65214 4.147855e-263
GZMM     141.39238      -1.309959 0.03806665 -34.41224 1.654371e-259
MARCKSL1  95.82726       1.423057 0.04311798  33.00380 7.163953e-239
H1FX     376.28247       1.054890 0.03221858  32.74168 3.988563e-235
HIST1H1B  30.81805       4.317984 0.14074738  30.67896 1.086254e-206
FXYD7     61.11526       2.331406 0.07725771  30.17700 4.746707e-200
ING2      79.68893       1.218777 0.04336609  28.10437 8.663682e-174
                  padj
BEX5      0.000000e+00
HIST1H1D  0.000000e+00
STMN1     0.000000e+00
PCSK1N   1.631144e-260
GZMM     5.204651e-257
MARCKSL1 1.878150e-236
H1FX     8.962871e-233
HIST1H1B 2.135848e-204
FXYD7    8.296189e-198
ING2     1.362797e-171

You can still run the tasks individually when debugging a step, and change their behavior using environment variables.
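
For example, you can run a single task in its environment, or prefix it with an environment variable that your own script reads (the variable name here is hypothetical):

# Run only the pseudobulk step in its scverse environment.
pixi run compute_pseudobulk

# Hypothetical setting picked up by the R script itself.
DE_VERBOSE=1 pixi run analysis_de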

7.4 Containerized pipelines

Containers are a great way to manage the environments for your pipeline and to make them reproducible across platforms, provided that you store the container images and keep them accessible in the long term.

You can create a Docker image with all the pixi environments and run the pipeline, with its multiple environments, in a single container. The image is ~5 GB and the pipeline can require a lot of working memory (~20 GB), so make sure to increase the RAM allocated to Docker in your settings. Note that the usecase/ and book/ folders are mounted into the Docker container, so you can interactively edit the scripts and access the data.

docker pull berombau/polygloty-docker:latest
docker run -it -v $(pwd)/usecase:/app/usecase -v $(pwd)/book:/app/book berombau/polygloty-docker:latest pixi run pipeline
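
You can also run a single task inside the container in the same way, reusing the task names from the pixi.toml above:

# Run only the differential expression step inside the container.
docker run -it -v $(pwd)/usecase:/app/usecase -v $(pwd)/book:/app/book \
  berombau/polygloty-docker:latest pixi run analysis_de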

Another approach is to use multi-package containers. Tools like Multi-Package BioContainers and Seqera Containers can make this quick and easy, by allowing for custom combinations of packages.

You can go a long way with a folder of notebooks or scripts and the right tools. But as your project grows more bespoke, it can be worth the effort to use a workflow framework like Viash, Nextflow or Snakemake to manage the pipeline for you.