[1]:
from pathlib import Path
import json
import pyarrow.parquet as pq
[2]:
import mudata
mudata.set_options(pull_on_update=False)
[2]:
<mudata._core.config.set_options at 0x35f684a40>
Parquet-based serialisation#
The anndata library implements multiple ways to serialise the data, including the default HDF5-based H5AD files as well as Zarr files. The same interfaces are available in the mudata library — for H5MU files and Zarr files.
While largely following the same format specification (AnnData, MuData), different serialisation formats can provide benefits in various applications and scenarios.
pqdata implements a similar serialisation strategy, which is quite natural and can be beneficial for memory and space efficiency as well as for integration with certain analytical tools. It is based on Parquet files, and hence the format is referred to here as pqdata files.
For a sneak peek into the usage and the benefits of Parquet files for analytics, you might check the following references:
Why parquet files are my preferred API for bulk open data (January 2023)
Extracting, converting, and querying data in local files using clickhouse-local (January 2023)
Reflecting on Apache Arrow in 2022 (December 2022)
Holy Duck! Fast Analysis with DuckDB + Pyarrow (April 2022)
SQL Query on MinIO (February 2022)
SQL Query on Parquet Files with DataFusion (January 2022)
DuckDB quacks Arrow: A zero-copy data integration between Apache Arrow and DuckDB (December 2021)
Querying Parquet with Precision using DuckDB (June 2021)
Import writers:
[3]:
from pqdata import write_anndata, write_mudata
… and readers:
[4]:
from pqdata import read_anndata, read_mudata
First, let’s download a multimodal dataset in the H5MU format:
[5]:
data = Path("data")
[6]:
import mudatasets
mdata = mudatasets.load("pbmc5k_citeseq", files=["minipbcite.h5mu"], data_dir=data, backed=False)
■ File minipbcite.h5mu from pbmc5k_citeseq has been found at data/pbmc5k_citeseq/minipbcite.h5mu
■ Checksum is validated (md5) for minipbcite.h5mu
■ Loading minipbcite.h5mu...
Writing#
Write the MuData object:
[7]:
file = data / "pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata"
write_mudata(mdata, file)
Notice that existing data is not overwritten by this writer by default:
[8]:
try:
    # need to provide overwrite=True to overwrite
    write_mudata(mdata, file)
except FileExistsError as e:
    print(e)
[Errno 17] File exists: 'data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata'
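To replace the existing file, the writer can be called with the flag from the comment above:

# overwrite an existing .pqdata directory explicitly
write_mudata(mdata, file, overwrite=True)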
File size#
We can also notice the smaller file size:
[9]:
! du -sh data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
9.0M data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
[10]:
! du -sh data/pbmc5k_citeseq/minipbcite.h5mu
16M data/pbmc5k_citeseq/minipbcite.h5mu
On larger datasets, the difference will be more noticeable. Moreover, other compression strategies can be used to get even smaller files:
[11]:
file = data / "pbmc5k_citeseq/pbmc5k_citeseq_mudata_zstd.pqdata"
write_mudata(mdata, file, compression="zstd")
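As a quick sanity check, the resulting directory sizes can also be compared from Python. A minimal sketch using only pathlib, with the two output paths from the cells above:

def dir_size_mb(path):
    # sum the sizes of all files under a .pqdata directory
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1024**2

for name in ["pbmc5k_citeseq_mudata.pqdata", "pbmc5k_citeseq_mudata_zstd.pqdata"]:
    p = data / "pbmc5k_citeseq" / name
    print(f"{name}: {dir_size_mb(p):.1f}M")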
Sometimes it’s easier to share the data as a single file, and archiving and compression will make the volume even smaller:
[12]:
! tar -czf minipbcite.tar.gz data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
[13]:
! du -sh minipbcite.tar.gz
7.9M minipbcite.tar.gz
.pqdata structure#
We can see how the dataset is stored on disk:
[14]:
! tree data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
├── mod
│ ├── prot
│ │ ├── X.parquet
│ │ ├── layers
│ │ │ └── counts.parquet
│ │ ├── obs.parquet
│ │ ├── obsm
│ │ │ ├── X_pca.parquet
│ │ │ └── X_umap.parquet
│ │ ├── obsp
│ │ │ ├── connectivities.parquet
│ │ │ └── distances.parquet
│ │ ├── uns
│ │ │ └── pca
│ │ │ ├── variance.parquet
│ │ │ └── variance_ratio.parquet
│ │ ├── uns.json
│ │ ├── var.parquet
│ │ └── varm
│ │ └── PCs.parquet
│ └── rna
│ ├── X.parquet
│ ├── obs.parquet
│ ├── obsm
│ │ ├── X_pca.parquet
│ │ └── X_umap.parquet
│ ├── obsp
│ │ ├── connectivities.parquet
│ │ └── distances.parquet
│ ├── uns
│ │ ├── celltype_colors.parquet
│ │ ├── leiden_colors.parquet
│ │ ├── pca
│ │ │ ├── variance.parquet
│ │ │ └── variance_ratio.parquet
│ │ └── rank_genes_groups
│ │ ├── logfoldchanges.parquet
│ │ ├── names.parquet
│ │ ├── pvals.parquet
│ │ ├── pvals_adj.parquet
│ │ └── scores.parquet
│ ├── uns.json
│ ├── var.parquet
│ └── varm
│ └── PCs.parquet
├── obs.parquet
├── obsm
│ ├── X_mofa.parquet
│ ├── X_mofa_umap.parquet
│ ├── X_umap.parquet
│ └── X_wnn_umap.parquet
├── obsmap
│ ├── prot.parquet
│ └── rna.parquet
├── obsp
│ ├── connectivities.parquet
│ ├── distances.parquet
│ ├── wnn_connectivities.parquet
│ └── wnn_distances.parquet
├── pqdata.json
├── var.parquet
├── varm
│ └── LFs.parquet
└── varmap
├── prot.parquet
└── rna.parquet
21 directories, 46 files
Parquet files#
High-dimensional objects that were NumPy arrays or pandas DataFrames in memory are now stored in individual Parquet files. Moreover, serialising these objects largely relies on pyarrow.
[15]:
pq.read_table(file / "obs.parquet").to_pandas().head()
[15]:
|  | louvain | leiden | leiden_wnn | celltype |
|---|---|---|---|---|
| CAGCCAGGTCTCGACG-1 | 1 | 1 | 0 | CD4+ naïve T |
| TTCTTCCTCTCGGTAA-1 | 1 | 1 | 0 | CD4+ naïve T |
| CGGGTCAAGAGAGGTA-1 | 1 | 1 | 1 | CD4+ naïve T |
| TACCCGTCATAATCCG-1 | 1 | 1 | 1 | CD4+ naïve T |
| TGGGTTAGTGAATTAG-1 | 2 | 2 | 1 | CD4+ naïve T |
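Because each component is a standalone Parquet file, its schema can be inspected without loading any data; pq.read_schema only touches the file footer:

# column names and Arrow types, read from the Parquet metadata alone
pq.read_schema(file / "obs.parquet")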
Embeddings, e.g. UMAP:
[16]:
pq.read_table(file / "obsm/X_umap.parquet")
[16]:
pyarrow.Table
1: float
2: float
----
1: [[9.66083,10.850063,12.345492,12.538155,12.528112,...,-1.9723414,-1.5232953,-1.7778809,-1.7384567,-1.5827981]]
2: [[5.779709,6.7745605,5.6794906,6.2592025,4.833545,...,-2.152379,-2.1579309,-2.3050048,-2.199118,-2.308012]]
[17]:
pq.read_table(file / "obsm/X_umap.parquet").to_pandas().to_numpy()[:10]
[17]:
array([[ 9.66083 , 5.779709 ],
[10.850063 , 6.7745605],
[12.345492 , 5.6794906],
[12.538155 , 6.2592025],
[12.528112 , 4.833545 ],
[13.152369 , 5.566309 ],
[ 9.910739 , 9.146889 ],
[ 8.418465 , 5.2574263],
[12.064896 , 5.3718014],
[11.305964 , -7.5169115]], dtype=float32)
The dense matrix with normalised values contains protein names (.mod['prot'].var_names) as column names:
[18]:
pq.read_table(file / "mod/prot/X.parquet").to_pandas().tail()
[18]:
|  | CD3_TotalSeqB | CD4_TotalSeqB | CD8a_TotalSeqB | CD11b_TotalSeqB | CD14_TotalSeqB | CD15_TotalSeqB | CD16_TotalSeqB | CD19_TotalSeqB | CD20_TotalSeqB | CD25_TotalSeqB | ... | CD86_TotalSeqB | CD127_TotalSeqB | CD137_TotalSeqB | CD197_TotalSeqB | CD274_TotalSeqB | CD278_TotalSeqB | CD335_TotalSeqB | PD-1_TotalSeqB | HLA-DR_TotalSeqB | TIGIT_TotalSeqB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 406 | -1.075961 | 1.245624 | 0.258909 | 2.032457 | 0.591571 | 1.429440 | 0.137301 | 2.361204 | 1.693235 | 1.284448 | ... | 4.252487 | 0.327570 | 1.309700 | 0.107906 | 0.710895 | 0.206146 | 2.295595 | 2.710157 | 13.345463 | 2.335479 |
| 407 | -0.038257 | 8.728498 | 2.102803 | 1.870541 | 0.782583 | 2.021527 | 2.680047 | 1.295704 | 1.716122 | 1.112039 | ... | 1.883393 | 0.701129 | 2.403748 | 2.701393 | 1.573005 | 3.247963 | 2.305770 | 2.435769 | 13.624533 | 2.287940 |
| 408 | 0.254669 | 10.289223 | 1.933862 | 2.539970 | 0.168840 | 2.607727 | 2.538013 | 0.113553 | 3.241333 | 2.223520 | ... | 6.140669 | 1.847788 | 1.193516 | 0.819007 | 1.382885 | 0.913976 | 2.043625 | 0.751728 | 14.481518 | 2.834265 |
| 409 | 0.243213 | 13.866776 | 0.072664 | 1.043179 | -0.051139 | 4.473774 | 1.771171 | 1.584222 | 4.893187 | 4.736353 | ... | 8.189964 | 2.338869 | 2.845908 | 1.140788 | -0.275009 | 2.504928 | 4.931875 | 3.187832 | 19.352989 | 0.957393 |
| 410 | 2.687545 | 14.526915 | 3.441904 | 1.239545 | 2.281008 | 2.358840 | 1.231699 | 1.176513 | 3.691090 | 1.004248 | ... | 11.924853 | 1.308427 | 0.306832 | 3.636075 | 1.498553 | 3.189132 | 0.258857 | 4.084857 | 18.031036 | 2.995944 |

5 rows × 29 columns
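Since the protein names are the column names, a subset of proteins can be read without loading the rest of the matrix. A small sketch using the columns argument of pq.read_table, with protein names taken from the table above:

# read only two protein columns from the dense matrix
pq.read_table(
    file / "mod/prot/X.parquet",
    columns=["CD3_TotalSeqB", "CD4_TotalSeqB"],
).to_pandas().head()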
The sparse count matrix is stored in the COO format:
[19]:
pq.read_table(file / "mod/prot/layers/counts.parquet").to_pandas().tail()
[19]:
|  | row | col | data |
|---|---|---|---|
| 11098 | 410 | 23 | 2.0 |
| 11099 | 410 | 24 | 6.0 |
| 11100 | 410 | 26 | 6.0 |
| 11101 | 410 | 27 | 888.0 |
| 11102 | 410 | 28 | 3.0 |
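If a scipy sparse matrix is needed outside of the readers shown below, the COO triplets can be reassembled by hand. A minimal sketch, assuming the 411 × 29 shape of the prot modality in this dataset:

from scipy.sparse import coo_matrix

triplets = pq.read_table(file / "mod/prot/layers/counts.parquet").to_pandas()
# rebuild the sparse matrix from its (row, col, data) triplets;
# the shape is taken from the prot modality (411 cells × 29 proteins)
counts = coo_matrix(
    (triplets["data"], (triplets["row"], triplets["col"])), shape=(411, 29)
)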
For example, we can count proteins with a simple SQL query:
[20]:
import duckdb
duckdb.query(f'''
SELECT COUNT(*)
FROM '{file}/mod/prot/var.parquet'
''').fetchall()
[20]:
[(29,)]
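The same approach extends to richer queries over the metadata; for example, counting cells per cell type directly from obs.parquet (the celltype column was shown above):

duckdb.query(f'''
SELECT celltype, COUNT(*) AS n_cells
FROM '{file}/obs.parquet'
GROUP BY celltype
ORDER BY n_cells DESC
''').fetchall()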
If you want to read more about the Python API of DuckDB, you can find an intro here.
JSON files#
Simpler objects such as scalar values are serialised into JSON files.
[21]:
with open(file / "mod/rna/uns.json") as uns_file:
uns = json.load(uns_file)
print(json.dumps(uns, indent=4))
{
"hvg": {
"flavor": "seurat"
},
"leiden": {
"params": {
"n_iterations": -1,
"random_state": 0,
"resolution": 0.75
}
},
"neighbors": {
"connectivities_key": "connectivities",
"distances_key": "distances",
"params": {
"method": "umap",
"metric": "euclidean",
"n_neighbors": 15,
"random_state": 0
}
},
"pca": {
"params": {
"use_highly_variable": true,
"zero_center": true
}
},
"rank_genes_groups": {
"params": {
"corr_method": "benjamini-hochberg",
"groupby": "leiden",
"method": "t-test_overestim_var",
"reference": "rest",
"use_raw": true
}
},
"umap": {
"params": {
"a": 0.5830300205483709,
"b": 1.334166992455648,
"random_state": 11
}
}
}
Reading#
The AnnData reader will return an in-memory AnnData object, and the MuData reader will return an in-memory MuData object:
[22]:
read_mudata(file)
[22]:
MuData object with n_obs × n_vars = 411 × 56
  obs:  'louvain', 'leiden', 'leiden_wnn', 'celltype'
  var:  'feature_types', 'gene_ids', 'highly_variable'
  obsm: 'X_wnn_umap', 'X_umap', 'X_mofa_umap', 'X_mofa'
  varm: 'LFs'
  obsp: 'connectivities', 'distances', 'wnn_connectivities', 'wnn_distances'
  2 modalities
    prot: 411 x 29
      var:    'gene_ids', 'feature_types', 'highly_variable'
      uns:    'neighbors', 'pca', 'umap'
      obsm:   'X_pca', 'X_umap'
      varm:   'PCs'
      layers: 'counts'
      obsp:   'connectivities', 'distances'
    rna: 411 x 27
      obs:  'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'celltype'
      var:  'gene_ids', 'feature_types', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns:  'hvg', 'leiden', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'celltype_colors', 'leiden_colors'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      obsp: 'connectivities', 'distances'
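Since each modality directory inside the .pqdata folder follows the AnnData layout shown earlier, it is plausible that a single modality can be loaded on its own with read_anndata; this is a sketch based on the on-disk structure rather than on documented behaviour:

# assumption: read_anndata accepts any AnnData-shaped .pqdata directory,
# such as the rna modality inside the MuData container
adata = read_anndata(file / "mod/rna")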
File reading speed#
Read speeds are not compromised and are actually improved:
[23]:
import mudata
[24]:
%%timeit
mudata.read("data/pbmc5k_citeseq/minipbcite.h5mu")
93.9 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[25]:
%%timeit
read_mudata(file)
45.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Together, these methods in pqdata make the Parquet-based serialisation compatible with the whole scverse ecosystem while also providing benefits for out-of-memory analytics on large datasets.