Polyglot programming for single-cell analysis

Louise Deconinck
Benjamin Rombaut
Robrecht Cannoodt

2024-09-12

Introduction

  1. How do you interact with a package in another language?
  2. How do you make your package usable for developers in other languages?

We will be focusing on R & Python

Summary

Interoperability between languages allows analysts to take advantage of the strengths of different ecosystems

On-disk interoperability uses standard file formats to transfer data and is typically more reliable

In-memory interoperability transfers data directly between parallel sessions and is convenient for interactive analysis

While interoperability is currently possible, developers continue to improve the experience

Single-cell best practices: Interoperability

How do you interact with a package in another language?

  1. In-memory interoperability
  2. Disk-based interoperability

How do you make your package usable for developers in other languages?

  1. Package-based interoperability
  2. Best practices

Package-based interoperability

or: the question of reimplementation.

  • Consider the pros:

    1. Discoverability
    2. Can your package be useful in other domains?
    3. Very user-friendly
  • Consider the cons:

    1. Think twice: is it worth it?
    2. It’s a lot of work
    3. How will you keep it up to date?
    4. How will you ensure parity?

Package-based interoperability

Please learn both R & Python

Best practices

  1. Work with the standards
  2. Work with matrices, arrays and dataframes (see the sketch after this list)
  3. Provide vignettes on interoperability
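
As a sketch of point 2, a hypothetical helper (the function name is made up) that returns a plain pandas DataFrame instead of a custom class, so the result converts cleanly through reticulate or rpy2:

import numpy as np
import pandas as pd

# hypothetical function: plain arrays and data frames cross the language
# boundary without any custom conversion code
def per_cell_totals(counts: np.ndarray, cell_ids: list) -> pd.DataFrame:
    return pd.DataFrame({"cell_id": cell_ids,
                         "total_counts": counts.sum(axis=1)})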

In-memory interoperability

Overview

  1. Advantages & disadvantages
  2. Pitfalls when using Python & R
  3. Rpy2
  4. Reticulate

In-memory interoperability: advantages and disadvantages

  • advantages
    • no need to write & read intermediate results
    • useful when you only need a limited number of functions from another language
  • disadvantages
    • not all classes can be converted or accessed
    • data is duplicated in memory
    • you need to manage both environments

Pitfalls when using Python and R

Column-major vs. row-major matrices

In R, every dense matrix is stored in column-major order; NumPy arrays are row-major (C order) by default.
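
A quick way to see the difference from Python (a minimal sketch using only NumPy):

import numpy as np

a = np.arange(6).reshape(2, 3)   # NumPy defaults to row-major (C order)
print(a.flatten(order="C"))      # [0 1 2 3 4 5]  -> row by row
print(a.flatten(order="F"))      # [0 3 1 4 2 5]  -> column by column, the order R uses for dense matrices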

Indexing: Python is 0-based, R is 1-based
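
For example, in Python:

letters = ["a", "b", "c"]
letters[0]   # 'a'; the equivalent element in R is letters[1]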

Dots and underscores

  • R allows dots in identifiers where Python uses underscores; rpy2 lets you define an explicit name mapping:
from rpy2.robjects.packages import importr

d = {'package.dependencies': 'package_dot_dependencies',
     'package_dependencies': 'package_uscore_dependencies'}
tools = importr('tools', robject_translations = d)

Integers

R numbers are doubles by default; add the L suffix to pass true integers to Python.

library(reticulate)
bi <- reticulate::import_builtins()

bi$list(bi$range(0, 5))
# TypeError: 'float' object cannot be interpreted as an integer

bi$list(bi$range(0L, 5L))
# [1] 0 1 2 3 4

Rpy2

  • Accessing R from Python
    • rpy2.rinterface, the low-level interface
    • rpy2.robjects, the high-level interface
import rpy2
import rpy2.robjects as robjects

vector = robjects.IntVector([1,2,3])
rsum = robjects.r['sum']

rsum(vector)
IntVector with 1 elements.
6

Rpy2: basics

str_vector = robjects.StrVector(['abc', 'def', 'ghi'])
flt_vector = robjects.FloatVector([0.3, 0.8, 0.7])
int_vector = robjects.IntVector([1, 2, 3])
mtx = robjects.r.matrix(robjects.IntVector(range(10)), nrow=5)
print(mtx)
     [,1] [,2]
[1,]    0    5
[2,]    1    6
[3,]    2    7
[4,]    3    8
[5,]    4    9

Rpy2: numpy

import numpy as np

from rpy2.robjects import numpy2ri
from rpy2.robjects import default_converter

rd_m = np.random.random((5, 4))

with (default_converter + numpy2ri.converter).context():
    mtx = robjects.r.matrix(rd_m, nrow = 5)
    print(mtx)
[[0.92902919 0.24572561 0.13860008 0.62978886]
 [0.21979936 0.43720737 0.61998573 0.49151776]
 [0.99598827 0.35626021 0.12943936 0.15135163]
 [0.8411791  0.3295325  0.76558594 0.32489612]
 [0.2671875  0.25684477 0.88505818 0.14963159]]

Rpy2: pandas

import pandas as pd

from rpy2.robjects import pandas2ri

pd_df = pd.DataFrame({'int_values': [1,2,3],
                      'str_values': ['abc', 'def', 'ghi']})

with (default_converter + pandas2ri.converter).context():
    pd_df_r = robjects.DataFrame(pd_df)
    print(pd_df_r)
  int_values str_values
0          1        abc
1          2        def
2          3        ghi

Rpy2: sparse matrices

import scipy as sp

from anndata2ri import scipy2ri

sparse_matrix = sp.sparse.csc_matrix(rd_m)

with (default_converter + scipy2ri.converter).context():
    sp_r = scipy2ri.py2rpy(sparse_matrix)
    print(sp_r)
5 x 4 sparse Matrix of class "dgCMatrix"
                                            
[1,] 0.9290292 0.2457256 0.1386001 0.6297889
[2,] 0.2197994 0.4372074 0.6199857 0.4915178
[3,] 0.9959883 0.3562602 0.1294394 0.1513516
[4,] 0.8411791 0.3295325 0.7655859 0.3248961
[5,] 0.2671875 0.2568448 0.8850582 0.1496316

Rpy2: anndata

import anndata as ad
import scanpy.datasets as scd

import anndata2ri

adata_paul = scd.paul15()

with anndata2ri.converter.context():
    sce = anndata2ri.py2rpy(adata_paul)
    ad2 = anndata2ri.rpy2py(sce)

Rpy2: interactivity

%load_ext rpy2.ipython  # line magic that loads the rpy2 IPython extension,
                        # which enables the %%R cell magic used below

%%R -i input -o output  # cell magic placed at the start of a cell:
                        # -i names inputs (converted to R objects),
                        # -o names outputs (converted back to Python objects);
                        # the rest of the cell is run as R code

Reticulate

Reticulate: basics

library(reticulate)

bi <- reticulate::import_builtins()
rd <- reticulate::import("random")

example <- c(1,2,3)
bi$max(example)
# [1] 3
rd$choice(example)
# [1] 2
cat(bi$list(bi$reversed(example)))
# 3 2 1

Reticulate: numpy

np <- reticulate::import("numpy")

a <- np$asarray(tuple(list(1,2), list(3, 4)))
b <- np$asarray(list(5,6))
b <- np$reshape(b, newshape = tuple(1L,2L))

np$concatenate(tuple(a, b), axis=0L)
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
# [3,]    5    6

Reticulate: conversion

np <- reticulate::import("numpy", convert = FALSE)

a <- np$asarray(tuple(list(1,2), list(3, 4)))
b <- np$asarray(list(5,6))
b <- np$reshape(b, newshape = tuple(1L,2L))

np$concatenate(tuple(a, b), axis=0L)
# array([[1., 2.],
#        [3., 4.],
#        [5., 6.]])

You can explicitly convert data types:

result <- np$concatenate(tuple(a, b), axis=0L)

py_to_r(result)
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
# [3,]    5    6

result_r <- py_to_r(result)
r_to_py(result_r)
# array([[1., 2.],
#        [3., 4.],
#        [5., 6.]])

Reticulate: scanpy

library(anndata)
library(reticulate)
sc <- import("scanpy")

adata_path <- "../usecase/data/sc_counts_subset.h5ad"
adata <- anndata::read_h5ad(adata_path)

We can preprocess & analyse the data:

sc$pp$filter_cells(adata, min_genes = 200)
sc$pp$filter_genes(adata, min_cells = 3)
sc$pp$pca(adata)
sc$pp$neighbors(adata)
sc$tl$umap(adata)

adata
# AnnData object with n_obs × n_vars = 32727 × 20542
#     obs: 'dose_uM', 'timepoint_hr', 'well', 'row', 'col', 'plate_name', 'cell_id', 'cell_type', 'split', 'donor_id', 'sm_name', 'control', 'SMILES', 'sm_lincs_id', 'library_id', 'leiden_res1', 'group', 'cell_type_orig', 'plate_well_celltype_reannotated', 'cell_count_by_well_celltype', 'cell_count_by_plate_well', 'n_genes'
#     var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'n_cells'
#     uns: 'cell_type_colors', 'celltypist_celltype_colors', 'donor_id_colors', 'hvg', 'leiden_res1_colors', 'log1p', 'neighbors', 'over_clustering', 'rank_genes_groups', 'pca', 'umap'
#     obsm: 'HTO_clr', 'X_pca', 'X_umap', 'protein_counts'
#     varm: 'PCs'
#     obsp: 'connectivities', 'distances'

Disk-based interoperability

Disk-based interoperability is a strategy for achieving interoperability between tools written in different programming languages by storing intermediate results in standardized, language-agnostic file formats.

  • Upside:
    • Simple: just add reading and writing lines (see the sketch after this list)
    • Modular scripts
  • Downside:
    • increased disk usage
    • less direct interaction, harder to debug
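
A minimal sketch of the Python side, with hypothetical file names; a downstream R script can read the same files (e.g. with anndata::read_h5ad() or arrow::read_parquet()):

import anndata as ad

# hypothetical paths: any language can read these standard formats
adata = ad.read_h5ad("data/raw_counts.h5ad")

# ... preprocessing in Python ...

adata.write_h5ad("data/preprocessed.h5ad")
adata.obs.to_parquet("data/cell_metadata.parquet")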

Important features of interoperable file formats

  • Compression
  • Sparse matrix support
  • Large images
  • Lazy chunk loading (see the example after this list)
  • Remote storage
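
As an illustration of lazy loading, AnnData can open an HDF5 file in backed mode so the matrix stays on disk until it is sliced (the file path is hypothetical):

import anndata as ad

# backed mode keeps X on disk; only the requested slice is read into memory
adata = ad.read_h5ad("data/large_dataset.h5ad", backed="r")
subset = adata[:100, :].to_memory()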

General single cell file formats of interest for Python and R

Each format is compared on Python support, R support, sparse matrix support, large images, lazy chunk loading and remote storage:

  • RDS
  • Pickle
  • CSV
  • JSON
  • TIFF
  • .npy
  • Parquet
  • Feather
  • Lance
  • HDF5
  • Zarr
  • TileDB

Specialized single cell file formats of interest for Python and R

These are compared on the same criteria (Python and R support, sparse matrices, large images, lazy chunk loading, remote storage):

  • Seurat RDS
  • Indexed OME-TIFF
  • h5Seurat
  • Loom HDF5
  • AnnData h5ad
  • AnnData Zarr
  • TileDB-SOMA
  • TileDB-BioImaging
  • SpatialData Zarr

Disk-based pipelines

Script pipeline:

#!/bin/bash

bash scripts/1_load_data.sh
python scripts/2_compute_pseudobulk.py
Rscript scripts/3_analysis_de.R

Notebook pipeline:

# Every step can be a new notebook execution with inspectable output
jupyter nbconvert --to notebook --execute my_notebook.ipynb --allow-errors --output-dir outputs/

Just stay in your language and call scripts

import subprocess

subprocess.run("bash scripts/1_load_data.sh", shell=True)
# Alternatively you can run Python code here instead of calling a Python script
subprocess.run("python scripts/2_compute_pseudobulk.py", shell=True)
subprocess.run("Rscript scripts/3_analysis_de.R", shell=True)

Pipelines with different environments

  1. interleave with environment (de)activation functions (see the sketch after this list)
  2. use rvenv
  3. use Pixi
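
A minimal sketch of option 1 driven from Python, assuming two conda environments named "scverse" and "rverse" already exist; conda run selects the environment per step, avoiding manual activate/deactivate calls:

import subprocess

# each step runs inside its own environment; environment names are assumptions
subprocess.run("conda run -n scverse python scripts/2_compute_pseudobulk.py",
               shell=True, check=True)
subprocess.run("conda run -n rverse Rscript scripts/3_analysis_de.R",
               shell=True, check=True)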

Pixi to manage different environments

pixi run -e bash scripts/1_load_data.sh
pixi run -e scverse scripts/2_compute_pseudobulk.py
pixi run -e rverse scripts/3_analysis_de.R

Define tasks in Pixi

...
[feature.bash.tasks]
load_data = "bash book/disk_based/scripts/1_load_data.sh"
...
[feature.scverse.tasks]
compute_pseudobulk = "python book/disk_based/scripts/2_compute_pseudobulk.py"
...
[feature.rverse.tasks]
analysis_de = "Rscript --no-init-file book/disk_based/scripts/3_analysis_de.R"
...
[tasks]
pipeline = { depends-on = ["load_data", "compute_pseudobulk", "analysis_de"] }

Run the full pipeline with:

pixi run pipeline

Also possible to use containers

docker pull berombau/polygloty-docker:latest
docker run -it -v $(pwd)/usecase:/app/usecase -v $(pwd)/book:/app/book berombau/polygloty-docker:latest pixi run pipeline

Workflows

You can go a long way with a folder of notebooks or scripts and the right tools. But as your project grows more bespoke, it can be worth the effort to use a workflow framework like Viash, Nextflow or Snakemake to manage the pipeline for you.

See https://saeyslab.github.io/polygloty/book/workflow_frameworks/