# Parquet-based serialisation for AnnData/MuData

Serialise `AnnData` and `MuData` objects into directories with [Parquet files](https://parquet.apache.org/).

Some imports first:

In [1]:
import os
from pathlib import Path
import pytest

import mudata
from mudata import AnnData, MuData

from pqdata.io.write import write_anndata, write_mudata
from pqdata.io.read import read_anndata, read_mudata

In [2]:
mudata.set_options(pull_on_update=False)

<mudata._core.config.set_options at 0x14f174510>

Some data to work with:

In [3]:
!mkdir -p data/pbmc5k_citeseq
!wget 'https://github.com/gtca/h5xx-datasets/blob/main/datasets/minipbcite.h5mu?raw=true' -O data/pbmc5k_citeseq/minipbcite.h5mu

In [4]:
data = Path("data/pbmc5k_citeseq/")
mdata = mudata.read(data / "minipbcite.h5mu")

## MuData Parquet-based I/O

Saving a `MuData` object on disk:

In [5]:
mudata_pq = data / "minipbcite.pqdata"
write_mudata(mdata, mudata_pq)

Files have to be overwritten explicitly with `overwrite=True`:

In [6]:
# check that writing to an existing path raises FileExistsError
with pytest.raises(FileExistsError):
    write_mudata(mdata, mudata_pq)

# overwrite explicitly
write_mudata(mdata, mudata_pq, overwrite=True)
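The guard behind an `overwrite` flag can be sketched with the standard library alone. The helper name `prepare_output_dir` is hypothetical, and removing the existing directory before rewriting is an assumption: this is not how `pqdata` is known to implement it, only one plausible way to get the behaviour shown above.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_output_dir(path, overwrite=False):
    """Hypothetical helper: return an empty directory at `path`, refusing to clobber by default."""
    path = Path(path)
    if path.exists():
        if not overwrite:
            raise FileExistsError(f"{path} already exists; pass overwrite=True to replace it")
        # Assumption: an existing target is removed entirely before rewriting.
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
    path.mkdir(parents=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.pqdata"

    # First write: the target does not exist yet, so it is created.
    prepare_output_dir(target)
    (target / "X.parquet").touch()

    # Second write without overwrite=True is refused.
    try:
        prepare_output_dir(target)
    except FileExistsError:
        refused = True
    else:
        refused = False

    # With overwrite=True the old contents are dropped.
    prepare_output_dir(target, overwrite=True)
    remaining = list(target.iterdir())

print(refused, remaining)
```

Failing by default keeps an accidental re-run from silently destroying a dataset that took a long time to write.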

The layout resembles an `.h5mu` file, with directories in place of HDF5 groups and `.parquet` files in place of HDF5 datasets:

In [7]:
!tree data/pbmc5k_citeseq/minipbcite.pqdata

data/pbmc5k_citeseq/minipbcite.pqdata
├── mod
│   ├── prot
│   │   ├── X.parquet
│   │   ├── layers
│   │   │   └── counts.parquet
│   │   ├── obs.parquet
│   │   ├── obsm
│   │   │   ├── X_pca.parquet
│   │   │   └── X_umap.parquet
│   │   ├── obsp
│   │   │   ├── connectivities.parquet
│   │   │   └── distances.parquet
│   │   ├── uns
│   │   │   └── pca
│   │   │       ├── variance.parquet
│   │   │       └── variance_ratio.parquet
│   │   ├── uns.json
│   │   ├── var.parquet
│   │   └── varm
│   │       └── PCs.parquet
│   └── rna
│       ├── X.parquet
│       ├── obs.parquet
│       ├── obsm
│       │   ├── X_pca.parquet
│       │   └── X_umap.parquet
│       ├── obsp
│       │   ├── connectivities.parquet
│       │   └── distances.parquet
│       ├── uns
│       │   ├── celltype_colors.parquet
│       │   ├── leiden_colors.parquet
│       
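The correspondence between HDF5 keys and files on disk can be illustrated with a small path helper. The function `parquet_path_for` is hypothetical and not part of `pqdata`. It covers only array-like elements stored as `.parquet` files, and it ignores `uns.json`, which the listing shows being stored separately.

```python
from pathlib import PurePosixPath


def parquet_path_for(root, hdf5_key):
    """Hypothetical mapping from an HDF5-style key to a Parquet file path.

    Groups become directories and the final component (the dataset)
    becomes a `.parquet` file.
    """
    parts = [p for p in hdf5_key.split("/") if p]
    if not parts:
        raise ValueError("empty key")
    *groups, dataset = parts
    return PurePosixPath(root, *groups, dataset + ".parquet")


print(parquet_path_for("minipbcite.pqdata", "/mod/prot/obsm/X_pca"))
# minipbcite.pqdata/mod/prot/obsm/X_pca.parquet
```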

Reading data back is straightforward:

In [8]:
mdata_from_pqdata = read_mudata(mudata_pq)
mdata_from_pqdata

## AnnData Parquet-based I/O

This also works for individual modalities. Let's try it using the `prot` modality as an example.

In [9]:
adata = mdata["prot"]
adata

AnnData object with n_obs × n_vars = 411 × 29
    var: 'gene_ids', 'feature_types', 'highly_variable'
    uns: 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

Saving an `AnnData` object on disk:

In [10]:
adata_pq = data / "minipbcite_prot.pqdata"
write_anndata(adata, adata_pq)

Files have to be overwritten explicitly with `overwrite=True`:

In [11]:
# check that writing to an existing path raises FileExistsError
with pytest.raises(FileExistsError):
    write_anndata(adata, adata_pq)

# overwrite explicitly
write_anndata(adata, adata_pq, overwrite=True)

The layout resembles an `.h5ad` file, with directories in place of HDF5 groups and `.parquet` files in place of HDF5 datasets:

In [12]:
!tree data/pbmc5k_citeseq/minipbcite_prot.pqdata

data/pbmc5k_citeseq/minipbcite_prot.pqdata
├── X.parquet
├── layers
│   └── counts.parquet
├── obs.parquet
├── obsm
│   ├── X_pca.parquet
│   └── X_umap.parquet
├── obsp
│   ├── connectivities.parquet
│   └── distances.parquet
├── uns
│   └── pca
│       ├── variance.parquet
│       └── variance_ratio.parquet
├── uns.json
├── var.parquet
└── varm
    └── PCs.parquet

7 directories, 12 files
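The same summary can be obtained without `tree`, using only `pathlib`. The sketch below rebuilds the listing above as empty placeholder files in a temporary directory, so it runs without the dataset. The helper `summarise` is hypothetical, and on real data `root` would simply be `adata_pq`.

```python
import tempfile
from pathlib import Path

# Relative file paths copied from the listing above.
LAYOUT = [
    "X.parquet",
    "layers/counts.parquet",
    "obs.parquet",
    "obsm/X_pca.parquet",
    "obsm/X_umap.parquet",
    "obsp/connectivities.parquet",
    "obsp/distances.parquet",
    "uns/pca/variance.parquet",
    "uns/pca/variance_ratio.parquet",
    "uns.json",
    "var.parquet",
    "varm/PCs.parquet",
]


def summarise(root):
    """Return (number of directories including the root, number of files)."""
    entries = list(Path(root).rglob("*"))
    n_files = sum(1 for e in entries if e.is_file())
    # rglob does not yield the root itself, so it is added here.
    n_dirs = 1 + sum(1 for e in entries if e.is_dir())
    return n_dirs, n_files


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "minipbcite_prot.pqdata"
    for rel in LAYOUT:
        placeholder = root / rel
        placeholder.parent.mkdir(parents=True, exist_ok=True)
        placeholder.touch()  # empty stand-in, not a valid Parquet file
    n_dirs, n_files = summarise(root)

print(f"{n_dirs} directories, {n_files} files")
# 7 directories, 12 files
```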


Reading data back is straightforward:

In [13]:
adata_from_pq = read_anndata(adata_pq)
adata_from_pq

AnnData object with n_obs × n_vars = 411 × 29
    var: 'gene_ids', 'feature_types', 'highly_variable'
    uns: 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'