import rpy2
import rpy2.robjects as robjects
vector = robjects.IntVector([1,2,3])
rsum = robjects.r['sum']
rsum(vector)
6 |
2024-09-12
We will be focusing on R & Python
Interoperability between languages allows analysts to take advantage of the strengths of different ecosystems
On-disk interoperability uses standard file formats to transfer data and is typically more reliable
In-memory interoperability transfers data directly between parallel sessions and is convenient for interactive analysis
While interoperability is currently possible developers continue to improve the experience
or: the question of reimplementation.
Consider the pros:
Consider the cons:
Please learn both R & Python
In R: every dense matrix is stored as column major
rpy2.rinterface
, the low-level interfacerpy2.robjects
, the high-level interfaceimport numpy as np
from rpy2.robjects import numpy2ri
from rpy2.robjects import default_converter
rd_m = np.random.random((5, 4))
with (default_converter + numpy2ri.converter).context():
mtx = robjects.r.matrix(rd_m, nrow = 5)
print(mtx)
[[0.92902919 0.24572561 0.13860008 0.62978886]
[0.21979936 0.43720737 0.61998573 0.49151776]
[0.99598827 0.35626021 0.12943936 0.15135163]
[0.8411791 0.3295325 0.76558594 0.32489612]
[0.2671875 0.25684477 0.88505818 0.14963159]]
import scipy as sp
from anndata2ri import scipy2ri
sparse_matrix = sp.sparse.csc_matrix(rd_m)
with (default_converter + scipy2ri.converter).context():
sp_r = scipy2ri.py2rpy(sparse_matrix)
print(sp_r)
5 x 4 sparse Matrix of class "dgCMatrix"
[1,] 0.9290292 0.2457256 0.1386001 0.6297889
[2,] 0.2197994 0.4372074 0.6199857 0.4915178
[3,] 0.9959883 0.3562602 0.1294394 0.1513516
[4,] 0.8411791 0.3295325 0.7655859 0.3248961
[5,] 0.2671875 0.2568448 0.8850582 0.1496316
%load_ext rpy2.ipython # line magic that loads the rpy2 ipython extension.
# this extension allows the use of the following cell magic
%%R -i input -o output # this line allows to specify inputs
# (which will be converted to R objects) and outputs
# (which will be converted back to Python objects)
# this line is put at the start of a cell
# the rest of the cell will be run as R code
np <- reticulate::import("numpy", convert = FALSE)
a <- np$asarray(tuple(list(1,2), list(3, 4)))
b <- np$asarray(list(5,6))
b <- np$reshape(b, newshape = tuple(1L,2L))
np$concatenate(tuple(a, b), axis=0L)
# array([[1., 2.],
# [3., 4.],
# [5., 6.]])
You can explicitly convert data types:
library(anndata)
library(reticulate)
sc <- import("scanpy")
adata_path <- "../usecase/data/sc_counts_subset.h5ad"
adata <- anndata::read_h5ad(adata_path)
We can preprocess & analyse the data:
sc$pp$filter_cells(adata, min_genes = 200)
sc$pp$filter_genes(adata, min_cells = 3)
sc$pp$pca(adata)
sc$pp$neighbors(adata)
sc$tl$umap(adata)
adata
# AnnData object with n_obs × n_vars = 32727 × 20542
# obs: 'dose_uM', 'timepoint_hr', 'well', 'row', 'col', 'plate_name', 'cell_id', 'cell_type', 'split', 'donor_id', 'sm_name', 'control', 'SMILES', 'sm_lincs_id', 'library_id', 'leiden_res1', 'group', 'cell_type_orig', 'plate_well_celltype_reannotated', 'cell_count_by_well_celltype', 'cell_count_by_plate_well', 'n_genes'
# var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'n_cells'
# uns: 'cell_type_colors', 'celltypist_celltype_colors', 'donor_id_colors', 'hvg', 'leiden_res1_colors', 'log1p', 'neighbors', 'over_clustering', 'rank_genes_groups', 'pca', 'umap'
# obsm: 'HTO_clr', 'X_pca', 'X_umap', 'protein_counts'
# varm: 'PCs'
# obsp: 'connectivities', 'distances'
Disk-based interoperability is a strategy for achieving interoperability between tools written in different programming languages by storing intermediate results in standardized, language-agnostic file formats.
File Format | Python | R | Sparse matrix | Large images | Lazy chunk loading | Remote storage |
---|---|---|---|---|---|---|
RDS | ○ | ● | ○ | ◐ | ○ | ○ |
Pickle | ● | ○ | ○ | ◐ | ○ | ○ |
CSV | ● | ● | ○ | ○ | ○ | ○ |
JSON | ● | ● | ○ | ○ | ○ | ○ |
TIFF | ● | ● | ○ | ◐ | ● | ◐ |
.npy | ● | ○ | ○ | ● | ○ | ○ |
Parquet | ● | ● | ○ | ○ | ● | ● |
Feather | ● | ● | ● | ○ | ● | ● |
Lance | ● | ○ | ● | ○ | ● | ● |
HDF5 | ● | ● | ○ | ● | ● | ◐ |
Zarr | ● | ● | ○ | ● | ● | ● |
TileDB | ● | ● | ● | ● | ● | ● |
File Format | Python | R | Sparse matrix | Large images | Lazy chunk loading | Remote storage |
---|---|---|---|---|---|---|
Seurat RDS | ○ | ● | ○ | ◐ | ○ | ○ |
Indexed OME-TIFF | ● | ● | ○ | ● | ● | ● |
h5Seurat | ● | ● | ○ | ◐ | ● | ◐ |
Loom HDF5 | ● | ● | ● | ○ | ● | ◐ |
AnnData h5ad | ● | ● | ● | ◐ | ● | ◐ |
AnnData Zarr | ● | ● | ● | ◐ | ● | ● |
TileDB-SOMA | ● | ● | ● | ◐ | ● | ● |
TileDB-BioImaging | ● | ● | ◐ | ● | ● | ● |
SpatialData Zarr | ● | ● | ● | ● | ● | ◐ |
Script pipeline:
#!/bin/bash
bash scripts/1_load_data.sh
python scripts/2_compute_pseudobulk.py
Rscript scripts/3_analysis_de.R
Notebook pipeline:
...
[feature.bash.tasks]
load_data = "bash book/disk_based/scripts/1_load_data.sh"
...
[feature.scverse.tasks]
compute_pseudobulk = "python book/disk_based/scripts/2_compute_pseudobulk.py"
...
[feature.rverse.tasks]
analysis_de = "Rscript --no-init-file book/disk_based/scripts/3_analysis_de.R"
...
[tasks]
pipeline = { depends-on = ["load_data", "compute_pseudobulk", "analysis_de"] }
docker pull berombau/polygloty-docker:latest
docker run -it -v $(pwd)/usecase:/app/usecase -v $(pwd)/book:/app/book berombau/polygloty-docker:latest pixi run pipeline
You can go a long way with a folder of notebooks or scripts and the right tools. But as your project grows more bespoke, it can be worth the effort to use a workflow framework like Viash, Nextflow or Snakemake to manage the pipeline for you.
See https://saeyslab.github.io/polygloty/book/workflow_frameworks/