7 Disk-based pipelines
Disk-based pipelines run a sequence of scripts in different languages that communicate only via files on disk. This is useful when different parts of the analysis require different languages and you want to avoid the hassle of managing an in-memory interoperability setup. You can either keep the notebook format of the exploratory analysis or convert the code snippets to scripts.
7.1 Notebook pipelines
You can use Quarto to run code snippets in different languages in the same .qmd notebook. Our Use-case chapter is one example of this. The polyglot powers of Quarto for executing code blocks come from tools such as knitr or Jupyter kernels. You don't need in-memory interoperability for this to work, but it can be combined with this approach if needed.
There are frameworks that keep the notebook format and build a pipeline from the notebooks directly. The upside is that you avoid converting the code snippets in the notebooks to scripts. The downside is that you have to adopt a specific framework, and the notebooks can become very large and unwieldy. For example, Papermill can execute parameterized Jupyter notebooks in sequence and pass variables between them. nbdev and Ploomber are similar frameworks.
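As a rough illustration of how such a notebook pipeline might look, the sketch below uses Papermill's Python API to run two parameterized notebooks in sequence, passing the file written by the first as a parameter to the second. The notebook file names and parameter names are hypothetical and only show the pattern.

import papermill as pm

# Run the first notebook, injecting values into its "parameters" cell.
# The notebook and parameter names below are hypothetical.
pm.execute_notebook(
    "notebooks/1_load_data.ipynb",
    "outputs/1_load_data.ipynb",
    parameters={"output_path": "usecase/data/raw.h5ad"},
)

# The second notebook picks up the file written by the first one.
pm.execute_notebook(
    "notebooks/2_compute_pseudobulk.ipynb",
    "outputs/2_compute_pseudobulk.ipynb",
    parameters={"input_path": "usecase/data/raw.h5ad"},
)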
7.1.1 Execute notebooks via the CLI
You can execute notebooks via the command line as a job, so you don't have to wait for them to finish. This is useful when running experiments, as the output is saved in the notebook together with the plots, ready to inspect at a later time.
Jupyter notebook via nbconvert:
jupyter nbconvert --to notebook --execute my_notebook.ipynb --allow-errors --output-dir outputs/
Similar functionality exists for RMarkdown notebooks via Rscript:
Rscript -e "rmarkdown::render('my_notebook.Rmd',params=list(args = myarg))"
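If you prefer to launch such a job from Python rather than the shell, the same functionality is available through nbconvert's programmatic API. The sketch below mirrors the CLI call above, using the same placeholder notebook name.

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Read the notebook, execute it (continuing past errors) and save the result.
with open("my_notebook.ipynb") as f:
    nb = nbformat.read(f, as_version=4)

ep = ExecutePreprocessor(timeout=600, kernel_name="python3", allow_errors=True)
ep.preprocess(nb, {"metadata": {"path": "."}})

with open("outputs/my_notebook.ipynb", "w") as f:
    nbformat.write(nb, f)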
7.2 Script pipelines
The scripts in a script pipeline are a collection of the code snippets from the notebooks and can be written in different languages and executed in sequence.
These scripts have the same functionality as the code in the use-case example notebook:
if [[ ! -f usecase/data/sc_counts_reannotated_with_counts.h5ad ]]; then
  aws s3 cp \
    --no-sign-request \
    s3://openproblems-bio/public/neurips-2023-competition/sc_counts_reannotated_with_counts.h5ad \
    usecase/data/sc_counts_reannotated_with_counts.h5ad
fi
# The dataset is stored in an AnnData object, which can be loaded in Python as follows:
import anndata as ad

print("Load data")
adata = ad.read_h5ad("usecase/data/sc_counts_reannotated_with_counts.h5ad")

sm_name = "Belinostat"
control_name = "Dimethyl Sulfoxide"
cell_type = "T cells"

adata = adata[
    adata.obs["sm_name"].isin([sm_name, control_name]) &
    adata.obs["cell_type"].isin([cell_type]),
].copy()

# We will also subset the genes to the top 2000 most variable genes.
adata = adata[:, adata.var["highly_variable"]].copy()

print("Compute pseudobulk")
# Combine data in a single data frame and compute pseudobulk
import pandas as pd

combined = pd.DataFrame(
    adata.X.toarray(),
    index=adata.obs["plate_well_celltype_reannotated"],
)
combined.columns = adata.var_names
pb_X = combined.groupby(level=0).sum()

print("Construct obs for pseudobulk")
# Use 'plate_well_celltype_reannotated' as index and make sure to retain the columns 'sm_name', 'cell_type', and 'plate_name':
pb_obs = adata.obs[["sm_name", "cell_type", "plate_name", "well"]].copy()
pb_obs.index = adata.obs["plate_well_celltype_reannotated"]
pb_obs = pb_obs.drop_duplicates()

print("Create AnnData object")
pb_adata = ad.AnnData(
    X=pb_X.loc[pb_obs.index].values,
    obs=pb_obs,
    var=adata.var,
)

print("Store to disk")
pb_adata.write_h5ad("usecase/data/pseudobulk.h5ad")
cat("Loading libraries...\n")
library(anndata)
library(dplyr, warn.conflicts = FALSE)
cat("Reading data...\n")
<- read_h5ad("usecase/data/pseudobulk.h5ad")
pb_adata
# Select small molecule and control:
<- "Belinostat"
sm_name <- "Dimethyl Sulfoxide"
control_name
cat("Create DESeq dataset\n")
# transform counts matrix
<- t(pb_adata$X)
count_data storage.mode(count_data) <- "integer"
# create dataset
<- DESeq2::DESeqDataSetFromMatrix(
dds countData = count_data,
colData = pb_adata$obs,
design = ~ sm_name + plate_name,
)
cat("Run DESeq2\n")
<- DESeq2::DESeq(dds)
dds
# Get results:
<- DESeq2::results(dds, contrast=c("sm_name", sm_name, control_name)) |>
res as.data.frame()
# Preview results:
|>
res arrange(padj) |>
head(10)
# Write to disk:
write.csv(res, "usecase/data/de_contrasts.csv")
You can use a shell script or your language of preference to run the pipeline in a sequential manner. This usually requires all the dependencies to be installed in one large environment.
From Bash:
#!/bin/bash
bash scripts/1_load_data.sh
python scripts/2_compute_pseudobulk.py
Rscript scripts/3_analysis_de.R
From R:
system("bash scripts/1_load_data.sh")
system("python scripts/2_compute_pseudobulk.py")
# Alternatively you can run R code here instead of calling an R script
system("Rscript scripts/3_analysis_de.R")
From Python:
import subprocess

subprocess.run("bash scripts/1_load_data.sh", shell=True)
# Alternatively you can run Python code here instead of calling a Python script
subprocess.run("python scripts/2_compute_pseudobulk.py", shell=True)
subprocess.run("Rscript scripts/3_analysis_de.R", shell=True)
7.3 Pipelines with different environments
Sometimes you might want to run scripts in different environments, as it's too much hassle to install all dependencies in a single environment. Other reasons can be that you want to reuse existing container images or keep parts of the pipeline separate and maintainable.
You can interleave your Bash script with environment activation commands, e.g. conda activate {script_env}. To keep this reproducible, it requires e.g. a Conda .yaml file for each script environment. This can grow unwieldy, as Conda environments sometimes need to be adapted to the platform and to the availability of accelerators like GPUs. Another important consideration is that packages which affect the on-disk data format should have the same version across environments.
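One way to script this without sourcing conda activate in every step is to call conda run for each script, here driven from Python. This is only a sketch: the environment names (load_env, scverse_env, rverse_env) are hypothetical and assume a matching Conda .yaml file exists for each of them.

import subprocess

# (environment, command) pairs; the environment names are hypothetical.
steps = [
    ("load_env", "bash scripts/1_load_data.sh"),
    ("scverse_env", "python scripts/2_compute_pseudobulk.py"),
    ("rverse_env", "Rscript scripts/3_analysis_de.R"),
]

for env, cmd in steps:
    # `conda run -n <env>` executes the command inside the named environment.
    subprocess.run(f"conda run -n {env} {cmd}", shell=True, check=True)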
A simple and elegant, but very new, solution is to use the Pixi package management tool to manage the environments and tasks for you. Environments can be composed from multiple features containing dependencies, so you can have a scverse environment with only Python, an rverse environment with only R, and even an all environment with both by adding the respective features (if such an environment is resolvable, at least).
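A sketch of how this composition might look in pixi.toml, with the feature dependencies heavily abbreviated (the actual project file contains more packages and platform configuration):

[feature.scverse.dependencies]
python = "*"

[feature.rverse.dependencies]
r-base = "*"

[environments]
bash = ["bash"]
scverse = ["scverse"]
rverse = ["rverse"]
all = ["scverse", "rverse"]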
Run scripts in different environments with pixi:
pixi run -e bash scripts/1_load_data.sh
pixi run -e scverse scripts/2_compute_pseudobulk.py
pixi run -e rverse scripts/3_analysis_de.R
With the Pixi task runner, you can define these tasks in their respective environments and have them run in sequence via an overarching task. The relevant lines of the pixi.toml file are:
...
[feature.bash.tasks]
load_data = "bash book/disk_based/scripts/1_load_data.sh"
...
[feature.scverse.tasks]
compute_pseudobulk = "python book/disk_based/scripts/2_compute_pseudobulk.py"
...
[feature.rverse.tasks]
analysis_de = "Rscript --no-init-file book/disk_based/scripts/3_analysis_de.R"
...
[tasks]
pipeline = { depends-on = ["load_data", "compute_pseudobulk", "analysis_de"] }
Now you can run this multi-environment pipeline with a single command and it will always run the tasks in their respective and up-to-date environment:
pixi run pipeline
Pixi task (load_data in bash): bash book/disk_based/scripts/1_load_data.sh
Pixi task (compute_pseudobulk in scverse): python book/disk_based/scripts/2_compute_pseudobulk.py
Load data
Compute pseudobulk
/app/book/disk_based/scripts/2_compute_pseudobulk.py:29: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
pb_X = combined.groupby(level=0).sum()
Construct obs for pseudobulk
Create AnnData object
Store to disk
Pixi task (analysis_de in rverse): Rscript --no-init-file book/disk_based/scripts/3_analysis_de.R
Loading libraries...
Reading data...
Create DESeq dataset
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not a warning or an error]
Run DESeq2
estimating size factors
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not a warning or an error]
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not a warning or an error]
final dispersion estimates
fitting model and testing
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not a warning or an error]
             baseMean log2FoldChange      lfcSE      stat        pvalue          padj
BEX5         59.24944       2.187350 0.05660399  38.64304  0.000000e+00  0.000000e+00
HIST1H1D    301.38741       1.356543 0.03092962  43.85901  0.000000e+00  0.000000e+00
STMN1       234.72112       2.224633 0.04104002  54.20642  0.000000e+00  0.000000e+00
PCSK1N       64.91604       1.899149 0.05480612  34.65214 4.147855e-263 1.631144e-260
GZMM        141.39238      -1.309959 0.03806665 -34.41224 1.654371e-259 5.204651e-257
MARCKSL1     95.82726       1.423057 0.04311798  33.00380 7.163953e-239 1.878150e-236
H1FX        376.28247       1.054890 0.03221858  32.74168 3.988563e-235 8.962871e-233
HIST1H1B     30.81805       4.317984 0.14074738  30.67896 1.086254e-206 2.135848e-204
FXYD7        61.11526       2.331406 0.07725771  30.17700 4.746707e-200 8.296189e-198
ING2         79.68893       1.218777 0.04336609  28.10437 8.663682e-174 1.362797e-171
You can also run the tasks individually when debugging a step, and change their behavior via environment variables.
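For example, to rerun only the pseudobulk step, or to pass a setting into the differential expression step (assuming the script reads a hypothetical DEBUG variable):

pixi run compute_pseudobulk
DEBUG=1 pixi run analysis_de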
7.4 Containerized pipelines
Containers are a great way to manage the environments for your pipeline and make them reproducible on different platforms, provided that you store the container images and keep them accessible in the long term.
You can create a Docker image with all the pixi environments and run the multi-environment pipeline with a single container. The image is ~5 GB and the pipeline can require a lot of working memory (~20 GB), so make sure to increase the RAM allocated to Docker in your settings. Note that the usecase/ and book/ folders are mounted into the Docker container, so you can interactively edit the scripts and access the data.
docker pull berombau/polygloty-docker:latest
docker run -it -v $(pwd)/usecase:/app/usecase -v $(pwd)/book:/app/book berombau/polygloty-docker:latest pixi run pipeline
Another approach is to use multi-package containers. Tools like Multi-Package BioContainers and Seqera Containers can make this quick and easy, by allowing for custom combinations of packages.
You can go a long way with a folder of notebooks or scripts and the right tools. But as your project grows more bespoke, it can be worth the effort to use a workflow framework like Viash, Nextflow or Snakemake to manage the pipeline for you.