Parquet-based serialisation for AnnData/MuData
Serialise AnnData and MuData objects into directories with Parquet files.
Some imports first:
[1]:
import os
from pathlib import Path
import pytest
import mudata
from mudata import AnnData, MuData
from pqdata.io.write import write_anndata, write_mudata
from pqdata.io.read import read_anndata, read_mudata
[2]:
mudata.set_options(pull_on_update=False)
[2]:
<mudata._core.config.set_options at 0x14f174510>
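Setting pull_on_update=False opts into the newer MuData behaviour where global .obs and .var annotations are not automatically pulled from the individual modalities on update; it does not affect the serialisation below.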
Some data to work with:
[3]:
!mkdir -p data/pbmc5k_citeseq
!wget 'https://github.com/gtca/h5xx-datasets/blob/main/datasets/minipbcite.h5mu?raw=true' -O data/pbmc5k_citeseq/minipbcite.h5mu
[4]:
data = Path("data/pbmc5k_citeseq/")
mdata = mudata.read(data / "minipbcite.h5mu")
MuData Parquet-based I/O
Saving a MuData object to disk:
[5]:
mudata_pq = data / "minipbcite.pqdata"
write_mudata(mdata, mudata_pq)
Files have to be overwritten explicitly with overwrite=True:
[6]:
# check that writing again raises FileExistsError as expected
with pytest.raises(FileExistsError):
    write_mudata(mdata, mudata_pq)
# overwrite explicitly
write_mudata(mdata, mudata_pq, overwrite=True)
The on-disk structure resembles an .h5mu file, but with directories instead of HDF5 groups and .parquet files instead of HDF5 datasets:
[7]:
!tree data/pbmc5k_citeseq/minipbcite.pqdata
data/pbmc5k_citeseq/minipbcite.pqdata
├── mod
│   ├── prot
│   │   ├── X.parquet
│   │   ├── layers
│   │   │   └── counts.parquet
│   │   ├── obs.parquet
│   │   ├── obsm
│   │   │   ├── X_pca.parquet
│   │   │   └── X_umap.parquet
│   │   ├── obsp
│   │   │   ├── connectivities.parquet
│   │   │   └── distances.parquet
│   │   ├── uns
│   │   │   └── pca
│   │   │       ├── variance.parquet
│   │   │       └── variance_ratio.parquet
│   │   ├── uns.json
│   │   ├── var.parquet
│   │   └── varm
│   │       └── PCs.parquet
│   └── rna
│       ├── X.parquet
│       ├── obs.parquet
│       ├── obsm
│       │   ├── X_pca.parquet
│       │   └── X_umap.parquet
│       ├── obsp
│       │   ├── connectivities.parquet
│       │   └── distances.parquet
│       ├── uns
│       │   ├── celltype_colors.parquet
│       │   ├── leiden_colors.parquet
│       │   ├── pca
│       │   │   ├── variance.parquet
│       │   │   └── variance_ratio.parquet
│       │   └── rank_genes_groups
│       │       ├── logfoldchanges.parquet
│       │       ├── names.parquet
│       │       ├── pvals.parquet
│       │       ├── pvals_adj.parquet
│       │       └── scores.parquet
│       ├── uns.json
│       ├── var.parquet
│       └── varm
│           └── PCs.parquet
├── obs.parquet
├── obsm
│   ├── X_mofa.parquet
│   ├── X_mofa_umap.parquet
│   ├── X_umap.parquet
│   └── X_wnn_umap.parquet
├── obsmap
│   ├── prot.parquet
│   └── rna.parquet
├── obsp
│   ├── connectivities.parquet
│   ├── distances.parquet
│   ├── wnn_connectivities.parquet
│   └── wnn_distances.parquet
├── pqdata.json
├── var.parquet
├── varm
│   └── LFs.parquet
└── varmap
    ├── prot.parquet
    └── rna.parquet
21 directories, 46 files
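Since every array and data frame is a standalone Parquet file, individual components can also be read with any Parquet-capable tool, independently of pqdata. A minimal sketch using pandas (not part of the pqdata API):
import pandas as pd

# read just the global .obs table from the serialised MuData directory
obs_df = pd.read_parquet(mudata_pq / "obs.parquet")
obs_df.head()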
Reading data back is straightforward:
[8]:
mdata_from_pqdata = read_mudata(mudata_pq)
mdata_from_pqdata
[8]:
MuData object with n_obs × n_vars = 411 × 56
  obs:  'louvain', 'leiden', 'leiden_wnn', 'celltype'
  var:  'feature_types', 'gene_ids', 'highly_variable'
  obsm: 'X_wnn_umap', 'X_umap', 'X_mofa_umap', 'X_mofa'
  varm: 'LFs'
  obsp: 'connectivities', 'distances', 'wnn_connectivities', 'wnn_distances'
  2 modalities
    prot: 411 x 29
      var:    'gene_ids', 'feature_types', 'highly_variable'
      uns:    'neighbors', 'pca', 'umap'
      obsm:   'X_pca', 'X_umap'
      varm:   'PCs'
      layers: 'counts'
      obsp:   'connectivities', 'distances'
    rna: 411 x 27
      obs:  'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'celltype'
      var:  'gene_ids', 'feature_types', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns:  'hvg', 'leiden', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'celltype_colors', 'leiden_colors'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      obsp: 'connectivities', 'distances'
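A quick round-trip sanity check (a sketch, assuming the reader preserves the original ordering of observations and variables):
# the round-tripped object should carry the same names and modalities
assert (mdata_from_pqdata.obs_names == mdata.obs_names).all()
assert (mdata_from_pqdata["rna"].var_names == mdata["rna"].var_names).all()
assert set(mdata_from_pqdata.mod) == {"prot", "rna"}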
AnnData Parquet-based I/O
This also works for individual modalities. Let’s try it using the prot modality as an example.
[9]:
adata = mdata["prot"]
adata
[9]:
AnnData object with n_obs × n_vars = 411 × 29
    var: 'gene_ids', 'feature_types', 'highly_variable'
    uns: 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'
Saving an AnnData object to disk:
[10]:
adata_pq = data / "minipbcite_prot.pqdata"
write_anndata(adata, adata_pq)
Files have to be overwritten explicitly with overwrite=True:
[11]:
# check that writing again raises FileExistsError as expected
with pytest.raises(FileExistsError):
    write_anndata(adata, adata_pq)
# overwrite explicitly
write_anndata(adata, adata_pq, overwrite=True)
The on-disk structure resembles an .h5ad file, but with directories instead of HDF5 groups and .parquet files instead of HDF5 datasets:
[12]:
!tree data/pbmc5k_citeseq/minipbcite_prot.pqdata
data/pbmc5k_citeseq/minipbcite_prot.pqdata
├── X.parquet
├── layers
│   └── counts.parquet
├── obs.parquet
├── obsm
│   ├── X_pca.parquet
│   └── X_umap.parquet
├── obsp
│   ├── connectivities.parquet
│   └── distances.parquet
├── uns
│   └── pca
│       ├── variance.parquet
│       └── variance_ratio.parquet
├── uns.json
├── var.parquet
└── varm
    └── PCs.parquet
7 directories, 12 files
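As above, each of these files is a regular Parquet file, so e.g. the serialised matrix X.parquet can be opened directly with pyarrow. A sketch; how the matrix is laid out across columns is an implementation detail of pqdata:
import pyarrow.parquet as pq

# inspect the Parquet file backing adata.X without going through pqdata
x_table = pq.read_table(adata_pq / "X.parquet")
x_table.num_rows, x_table.num_columns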
Reading data back is straightforward:
[13]:
adata_from_pq = read_anndata(adata_pq)
adata_from_pq
[13]:
AnnData object with n_obs × n_vars = 411 × 29
    var: 'gene_ids', 'feature_types', 'highly_variable'
    uns: 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'
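The remaining metadata lives in the uns.json and pqdata.json files seen in the listings above; being plain JSON, these can be inspected with standard tools. A small sketch (what exactly is stored there is an implementation detail of pqdata):
import json

# peek at the JSON-serialised metadata of the prot AnnData
with open(adata_pq / "uns.json") as f:
    uns_meta = json.load(f)
list(uns_meta)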