Parquet-based serialisation for AnnData/MuData

Serialise AnnData and MuData objects into directories with Parquet files.

Some imports first:

[1]:
import os
from pathlib import Path
import pytest

import mudata
from mudata import AnnData, MuData

from pqdata.io.write import write_anndata, write_mudata
from pqdata.io.read import read_anndata, read_mudata
[2]:
mudata.set_options(pull_on_update=False)
[2]:
<mudata._core.config.set_options at 0x14f174510>

Some data to work with:

[3]:
!mkdir -p data/pbmc5k_citeseq
!wget 'https://github.com/gtca/h5xx-datasets/blob/main/datasets/minipbcite.h5mu?raw=true' -O data/pbmc5k_citeseq/minipbcite.h5mu
[4]:
data = Path("data/pbmc5k_citeseq/")
mdata = mudata.read(data / "minipbcite.h5mu")

MuData Parquet-based I/O

Saving a MuData object on disk:

[5]:
mudata_pq = data / "minipbcite.pqdata"
write_mudata(mdata, mudata_pq)

Files have to be overwritten explicitly with overwrite=True:

[6]:
# check that this raises FileExistsError as expected
with pytest.raises(FileExistsError):
    write_mudata(mdata, mudata_pq)

# overwrite explicitly
write_mudata(mdata, mudata_pq, overwrite=True)

The structure resembles that of .h5mu files, but with directories instead of HDF5 groups and .parquet files instead of HDF5 datasets:

[7]:
!tree data/pbmc5k_citeseq/minipbcite.pqdata
data/pbmc5k_citeseq/minipbcite.pqdata
├── mod
│   ├── prot
│   │   ├── X.parquet
│   │   ├── layers
│   │   │   └── counts.parquet
│   │   ├── obs.parquet
│   │   ├── obsm
│   │   │   ├── X_pca.parquet
│   │   │   └── X_umap.parquet
│   │   ├── obsp
│   │   │   ├── connectivities.parquet
│   │   │   └── distances.parquet
│   │   ├── uns
│   │   │   └── pca
│   │   │       ├── variance.parquet
│   │   │       └── variance_ratio.parquet
│   │   ├── uns.json
│   │   ├── var.parquet
│   │   └── varm
│   │       └── PCs.parquet
│   └── rna
│       ├── X.parquet
│       ├── obs.parquet
│       ├── obsm
│       │   ├── X_pca.parquet
│       │   └── X_umap.parquet
│       ├── obsp
│       │   ├── connectivities.parquet
│       │   └── distances.parquet
│       ├── uns
│       │   ├── celltype_colors.parquet
│       │   ├── leiden_colors.parquet
│       │   ├── pca
│       │   │   ├── variance.parquet
│       │   │   └── variance_ratio.parquet
│       │   └── rank_genes_groups
│       │       ├── logfoldchanges.parquet
│       │       ├── names.parquet
│       │       ├── pvals.parquet
│       │       ├── pvals_adj.parquet
│       │       └── scores.parquet
│       ├── uns.json
│       ├── var.parquet
│       └── varm
│           └── PCs.parquet
├── obs.parquet
├── obsm
│   ├── X_mofa.parquet
│   ├── X_mofa_umap.parquet
│   ├── X_umap.parquet
│   └── X_wnn_umap.parquet
├── obsmap
│   ├── prot.parquet
│   └── rna.parquet
├── obsp
│   ├── connectivities.parquet
│   ├── distances.parquet
│   ├── wnn_connectivities.parquet
│   └── wnn_distances.parquet
├── pqdata.json
├── var.parquet
├── varm
│   └── LFs.parquet
└── varmap
    ├── prot.parquet
    └── rna.parquet

21 directories, 46 files
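
Since the on-disk layout consists of plain Parquet (and JSON) files, individual components can be inspected directly with standard tools. A minimal sketch using pandas (not part of the pqdata API; requires a Parquet engine such as pyarrow to be installed):

import pandas as pd

# read the global .obs table straight from the serialised MuData directory
obs = pd.read_parquet(mudata_pq / "obs.parquet")
obs.head()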

Reading data back is straightforward:

[8]:
mdata_from_pqdata = read_mudata(mudata_pq)
mdata_from_pqdata
[8]:
MuData object with n_obs × n_vars = 411 × 56
  obs:      'louvain', 'leiden', 'leiden_wnn', 'celltype'
  var:      'feature_types', 'gene_ids', 'highly_variable'
  obsm:     'X_wnn_umap', 'X_umap', 'X_mofa_umap', 'X_mofa'
  varm:     'LFs'
  obsp:     'connectivities', 'distances', 'wnn_connectivities', 'wnn_distances'
  2 modalities
    prot:   411 x 29
      var:  'gene_ids', 'feature_types', 'highly_variable'
      uns:  'neighbors', 'pca', 'umap'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      layers:       'counts'
      obsp: 'connectivities', 'distances'
    rna:    411 x 27
      obs:  'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'celltype'
      var:  'gene_ids', 'feature_types', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns:  'hvg', 'leiden', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'celltype_colors', 'leiden_colors'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      obsp: 'connectivities', 'distances'
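
As a quick sanity check that the round trip preserved the overall structure, a few attributes of the original and the re-read objects can be compared (a minimal sketch, not part of the pqdata API):

# dimensions, observation names and modalities should survive the round trip
assert mdata.shape == mdata_from_pqdata.shape
assert (mdata.obs_names == mdata_from_pqdata.obs_names).all()
assert list(mdata.mod.keys()) == list(mdata_from_pqdata.mod.keys())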

AnnData Parquet-based I/O

This also works for individual modalities. Let’s try it using the prot modality as an example.

[9]:
adata = mdata["prot"]
adata
[9]:
AnnData object with n_obs × n_vars = 411 × 29
    var: 'gene_ids', 'feature_types', 'highly_variable'
    uns: 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

Saving an AnnData object on disk:

[10]:
adata_pq = data / "minipbcite_prot.pqdata"
write_anndata(adata, adata_pq)

Files have to be overwritten explicitly with overwrite=True:

[11]:
# check that this raises FileExistsError as expected
with pytest.raises(FileExistsError):
    write_anndata(adata, adata_pq)

# overwrite explicitly
write_anndata(adata, adata_pq, overwrite=True)

The structure resembles that of .h5ad files, but with directories instead of HDF5 groups and .parquet files instead of HDF5 datasets:

[12]:
!tree data/pbmc5k_citeseq/minipbcite_prot.pqdata
data/pbmc5k_citeseq/minipbcite_prot.pqdata
├── X.parquet
├── layers
│   └── counts.parquet
├── obs.parquet
├── obsm
│   ├── X_pca.parquet
│   └── X_umap.parquet
├── obsp
│   ├── connectivities.parquet
│   └── distances.parquet
├── uns
│   └── pca
│       ├── variance.parquet
│       └── variance_ratio.parquet
├── uns.json
├── var.parquet
└── varm
    └── PCs.parquet

7 directories, 12 files
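
Array-like .uns entries are stored as Parquet files under uns/, while the remaining unstructured metadata presumably ends up in uns.json, which can be read with the standard json module (a minimal sketch; the exact contents depend on the object):

import json

# load the JSON-serialised part of .uns next to the Parquet files
with open(adata_pq / "uns.json") as f:
    uns_meta = json.load(f)
uns_meta.keys()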

Reading data back is straightforward:

[13]:
adata_from_pq = read_anndata(adata_pq)
adata_from_pq
[13]:
AnnData object with n_obs × n_vars = 411 × 29
    var: 'gene_ids', 'feature_types', 'highly_variable'
    uns: 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'
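
The same kind of round-trip check works for an individual AnnData object (a minimal sketch, not part of the pqdata API):

# dimensions and names should match the original object
assert adata.shape == adata_from_pq.shape
assert (adata.obs_names == adata_from_pq.obs_names).all()
assert (adata.var_names == adata_from_pq.var_names).all()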