6 Interoperable file formats
It’s important to differentiate between language-specific file formats and language-agnostic file formats. For example, most languages can serialize objects to disk. R has the .RDS file format and Python has the .pickle/.pkl file format, but these are not interoperable.
While the file formats themselves can be backwards compatible, so that old objects can still be read, newer versions of a package cannot necessarily read files written by older versions of that package (e.g. different SeuratObject versions). Conversely, older versions of Python or R can fail to read serialized objects created by the latest language version.
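The contrast between a language-specific and a language-agnostic serialization can be sketched with the Python standard library alone. The record below and its field names are hypothetical, and JSON is used only as a simple stand-in for a language-agnostic format, not as a recommendation for large datasets.

```python
import json
import pickle

# Hypothetical record standing in for a small analysis result.
record = {"cell_id": "AAACCTG", "n_genes": 1532, "counts": [0, 3, 0, 1]}

# Pickle is a Python-specific byte stream. From protocol 2 onwards the
# stream starts with the PROTO opcode (0x80) followed by the protocol number,
# and an interpreter that predates that protocol refuses to load it.
pickled = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
proto_opcode, protocol_number = pickled[0], pickled[1]

# JSON is a documented text standard with parsers in most languages,
# at the cost of supporting only a handful of basic types.
as_json = json.dumps(record, sort_keys=True)
restored = json.loads(as_json)

print(f"pickle protocol embedded in stream: {protocol_number}")
print(f"JSON text: {as_json}")
print(f"JSON round trip preserved the record: {restored == record}")
```

The JSON text could be read from R as well, for instance with the jsonlite package, whereas the pickled bytes are only meaningful to a Python interpreter that knows the embedded protocol.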
For a file format to solve these issues and be language agnostic, it should have a mature standard describing how the data is stored on disk. This standard should be implemented in multiple languages, documented and tested for compatibility. Some file formats have a reference implementation in C, which can be used to create bindings for other languages. Most also have a status page that lists the implementations and which version or how much of the standard they support.
Note that there can be a difference between the in-memory layout and the on-disk layout of the same object. Apache Arrow is a memory layout standard that is important for in-memory interoperability. An Arrow table can be saved to disk as a Feather file, but also as a Parquet file.
6.1 General file formats
Table 6.1 shows some of the features of file formats of interest for Python and R, and whether each format lacks support for a feature (○), supports it partially (◐) or supports it fully (●).
- Fast compression is important for large datasets and most of these formats support it, as it can reduce the storage requirements. Standard compression techniques like zip can always be applied to text-based formats like CSV and JSON, but standardized support within the file format leads to faster and better compression.
- Sparse matrix support is important for single-cell data, as it can also greatly reduce the storage requirements. If a part of the pipeline does not have this support and needs to make the matrix dense, the memory requirements will increase rapidly.
- Support for large images like microscopy images is important for bioimaging and spatial omics datasets. Some formats support small images or transcript coordinates, but do not support large whole-slide images or multiscale and labeled multidimensional images. This is indicated with ◐.
- For large datasets and whole-slide images, support for subsetting and lazy loading of chunks or tiles is also needed. This also allows for out-of-core and parallel processing of larger-than-memory datasets.
- Some file formats have built-in support for remote storage like a web server or S3. This can be important for large datasets that do not fit on a local disk or for distributed computing. It can also be used to share data between collaborators without having to download the whole dataset. Some formats can be accessed remotely, but only for public data (SpatialData) or with additional setup (HDF5). This is indicated with ◐. Mounting a remote folder via SFTP or rclone and downloading the data on demand only works for exploded file formats like Zarr, or when byte range requests are supported.
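The interplay between compression, chunking and byte ranges described in the list above can be sketched with a toy layout. This is a minimal illustration using only the Python standard library, not the actual on-disk layout of Zarr, HDF5 or any other format in the table. The chunk size, the file name and the helper functions `write_chunked` and `read_chunk` are hypothetical.

```python
import os
import struct
import tempfile
import zlib

CHUNK_ROWS = 100  # rows per chunk (hypothetical choice)
N_COLS = 8        # columns in the toy matrix


def write_chunked(path, rows):
    """Write rows as independently compressed chunks.

    Returns an index of (byte offset, compressed size, rows in chunk).
    """
    index = []
    with open(path, "wb") as fh:
        for start in range(0, len(rows), CHUNK_ROWS):
            block = rows[start:start + CHUNK_ROWS]
            flat = [value for row in block for value in row]
            raw = struct.pack(f"<{len(flat)}d", *flat)  # little-endian float64
            payload = zlib.compress(raw)
            index.append((fh.tell(), len(payload), len(block)))
            fh.write(payload)
    return index


def read_chunk(path, index, chunk_id):
    """Read one chunk by seeking to its byte range, leaving the rest untouched."""
    offset, nbytes, n_rows = index[chunk_id]
    with open(path, "rb") as fh:
        fh.seek(offset)
        payload = fh.read(nbytes)
    flat = struct.unpack(f"<{n_rows * N_COLS}d", zlib.decompress(payload))
    return [list(flat[i * N_COLS:(i + 1) * N_COLS]) for i in range(n_rows)]


# Toy matrix with 1000 rows, stored as 10 chunks of 100 rows each.
rows = [[float(r * N_COLS + c) for c in range(N_COLS)] for r in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "matrix.chunks")
index = write_chunked(path, rows)

chunk3 = read_chunk(path, index, 3)  # rows 300..399 only
bytes_read = index[3][1]
total_bytes = os.path.getsize(path)
print(f"read {bytes_read} of {total_bytes} bytes to get rows 300-399")
```

Because every chunk is compressed on its own and located through an index, a reader only needs the byte range of the chunk it wants. The same idea carries over to remote storage: an exploded format stores each chunk as a separate object, while a single-file format relies on the server honoring byte range requests.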
6.2 Specialized file formats
Table 6.2 shows some specialized file standards that are built on top of the more general file formats. One example is the standardized way AnnData serializes a sparse matrix within Zarr to support large single-cell data. Note that a format supporting a feature in theory does not mean the feature is available in practice. For example, the TileDB implementations do not yet fully support large spatial omics datasets in a standardized way like SpatialData does. Conversely, SpatialData does not yet fully support reading and writing using a remote private S3 dataset.
File Format | Python | R | Sparse matrix | Large images | Lazy chunk loading | Remote storage |
---|---|---|---|---|---|---|
Seurat RDS | ○ | ● | ○ | ◐ | ○ | ○ |
Indexed OME-TIFF | ● | ● | ○ | ● | ● | ● |
h5Seurat | ● | ● | ○ | ◐ | ● | ◐ |
Loom HDF5 | ● | ● | ● | ○ | ● | ◐ |
AnnData h5ad | ● | ● | ● | ◐ | ● | ◐ |
AnnData Zarr | ● | ● | ● | ◐ | ● | ● |
TileDB-SOMA | ● | ● | ● | ◐ | ● | ● |
TileDB-BioImaging | ● | ● | ◐ | ● | ● | ● |
SpatialData Zarr | ● | ● | ● | ● | ● | ◐ |
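To make the sparse matrix column more concrete, the sketch below builds a compressed sparse row (CSR) representation in plain Python and estimates the storage it saves. CSR is one common sparse layout, described by three arrays (`data`, `indices` and `indptr`), and AnnData's on-disk specification stores a CSR matrix as arrays with these names. The helper functions, the toy matrix and the dataset dimensions (100,000 cells, 20,000 genes, 5% non-zero values, 4-byte values and indices) are assumptions chosen for illustration.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # the non-zero value itself
                indices.append(col)  # its column index
        indptr.append(len(data))     # where the next row starts in data
    return data, indices, indptr


def to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays back into a dense list-of-lists matrix."""
    dense = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = data[k]
        dense.append(row)
    return dense


def dense_bytes(n_rows, n_cols, itemsize=4):
    """Bytes needed to store every entry, zero or not."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, itemsize=4, index_size=4):
    """Bytes needed for the data, indices and indptr arrays."""
    return nnz * (itemsize + index_size) + (n_rows + 1) * index_size


toy = [
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 2],
    [0, 4, 0, 0, 0],
]
data, indices, indptr = to_csr(toy)
print(data, indices, indptr)

# Hypothetical single-cell dataset: 100,000 cells x 20,000 genes, 5% non-zero.
n_cells, n_genes = 100_000, 20_000
nnz = n_cells * n_genes // 20
print(f"dense: {dense_bytes(n_cells, n_genes) / 1e9:.1f} GB")
print(f"CSR:   {csr_bytes(n_cells, nnz) / 1e9:.1f} GB")
```

Under these assumptions the dense matrix needs about ten times more space than the CSR arrays, which is why a single pipeline step that densifies the matrix can dominate the memory requirements.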