[1]:
from pathlib import Path
import json
import pyarrow.parquet as pq
[2]:
import mudata
mudata.set_options(pull_on_update=False)
[2]:
<mudata._core.config.set_options at 0x35f684a40>

Parquet-based serialisation#

The anndata library implements multiple ways to serialise data, including the default HDF5-based H5AD files as well as Zarr files.

The same interfaces are available in the mudata library, for H5MU files and Zarr files.

While they largely follow the same format specifications (AnnData, MuData), different serialisation formats can provide benefits in different applications and scenarios.

pqdata implements a similar serialisation strategy that is quite natural and can be beneficial for memory and space efficiency as well as for integration with certain analytical tools. It is based on Parquet files, hence the format is termed here pqdata files.

For a sneak peek into the usage and the benefits of Parquet files for analytics, you might check the following references:

Import writers:

[3]:
from pqdata import write_anndata, write_mudata

… and readers:

[4]:
from pqdata import read_anndata, read_mudata

First, let’s download a multimodal dataset in the H5MU format:

[5]:
data = Path("data")
[6]:
import mudatasets
mdata = mudatasets.load("pbmc5k_citeseq", files=["minipbcite.h5mu"], data_dir=data, backed=False)
■ File minipbcite.h5mu from pbmc5k_citeseq has been found at data/pbmc5k_citeseq/minipbcite.h5mu
■ Checksum is validated (md5) for minipbcite.h5mu
■ Loading minipbcite.h5mu...

Writing#

Write the MuData object:

[7]:
file = data / "pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata"
write_mudata(mdata, file)
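
write_anndata works analogously for a single modality; a minimal sketch (the rna.pqdata path is chosen here just for illustration):

[ ]:
# write one modality (an AnnData object) on its own
rna_file = data / "pbmc5k_citeseq/rna.pqdata"
write_anndata(mdata["rna"], rna_file)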

Notice that existing data is not overwritten by this writer by default:

[8]:
try:
    # need to provide overwrite=True to overwrite
    write_mudata(mdata, file)
except FileExistsError as e:
    print(e)
[Errno 17] File exists: 'data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata'
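
To replace an existing file, pass overwrite=True explicitly:

[ ]:
# overwriting has to be requested explicitly
write_mudata(mdata, file, overwrite=True)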

File size#

We can also notice the smaller file size:

[9]:
! du -sh data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
9.0M    data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
[10]:
! du -sh data/pbmc5k_citeseq/minipbcite.h5mu
 16M    data/pbmc5k_citeseq/minipbcite.h5mu

On larger datasets, the difference will be more noticeable. Moreover, other compression strategies can be used to get even smaller files:

[11]:
file = data / "pbmc5k_citeseq/pbmc5k_citeseq_mudata_zstd.pqdata"
write_mudata(mdata, file, compression="zstd")
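
We can compare the two outputs without leaving Python; a small sketch that sums the file sizes under each directory:

[ ]:
# sum the sizes of all files under each .pqdata directory
for name in ["pbmc5k_citeseq_mudata.pqdata", "pbmc5k_citeseq_mudata_zstd.pqdata"]:
    path = data / "pbmc5k_citeseq" / name
    size = sum(p.stat().st_size for p in path.rglob("*") if p.is_file())
    print(f"{name}: {size / 1024**2:.1f}M")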

Sometimes it is easier to share the data as a single file, and archiving with compression makes it even smaller:

[12]:
! tar -czf minipbcite.tar.gz data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
[13]:
! du -sh minipbcite.tar.gz
7.9M    minipbcite.tar.gz
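
The same archive can be produced from Python with the standard tarfile module; a sketch equivalent to the shell command above:

[ ]:
import tarfile

# archive the .pqdata directory into a single compressed file
pq_dir = data / "pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata"
with tarfile.open("minipbcite.tar.gz", "w:gz") as tar:
    tar.add(pq_dir, arcname=pq_dir.name)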

.pqdata structure#

We can see how the dataset is stored on disk:

[14]:
! tree data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata
├── mod
│   ├── prot
│   │   ├── X.parquet
│   │   ├── layers
│   │   │   └── counts.parquet
│   │   ├── obs.parquet
│   │   ├── obsm
│   │   │   ├── X_pca.parquet
│   │   │   └── X_umap.parquet
│   │   ├── obsp
│   │   │   ├── connectivities.parquet
│   │   │   └── distances.parquet
│   │   ├── uns
│   │   │   └── pca
│   │   │       ├── variance.parquet
│   │   │       └── variance_ratio.parquet
│   │   ├── uns.json
│   │   ├── var.parquet
│   │   └── varm
│   │       └── PCs.parquet
│   └── rna
│       ├── X.parquet
│       ├── obs.parquet
│       ├── obsm
│       │   ├── X_pca.parquet
│       │   └── X_umap.parquet
│       ├── obsp
│       │   ├── connectivities.parquet
│       │   └── distances.parquet
│       ├── uns
│       │   ├── celltype_colors.parquet
│       │   ├── leiden_colors.parquet
│       │   ├── pca
│       │   │   ├── variance.parquet
│       │   │   └── variance_ratio.parquet
│       │   └── rank_genes_groups
│       │       ├── logfoldchanges.parquet
│       │       ├── names.parquet
│       │       ├── pvals.parquet
│       │       ├── pvals_adj.parquet
│       │       └── scores.parquet
│       ├── uns.json
│       ├── var.parquet
│       └── varm
│           └── PCs.parquet
├── obs.parquet
├── obsm
│   ├── X_mofa.parquet
│   ├── X_mofa_umap.parquet
│   ├── X_umap.parquet
│   └── X_wnn_umap.parquet
├── obsmap
│   ├── prot.parquet
│   └── rna.parquet
├── obsp
│   ├── connectivities.parquet
│   ├── distances.parquet
│   ├── wnn_connectivities.parquet
│   └── wnn_distances.parquet
├── pqdata.json
├── var.parquet
├── varm
│   └── LFs.parquet
└── varmap
    ├── prot.parquet
    └── rna.parquet

21 directories, 46 files

Parquet files#

High-dimensional objects that were NumPy arrays or pandas DataFrames in memory are now stored in individual Parquet files.

Moreover, serialising these objects largely relies on pyarrow.

[15]:
pq.read_table(file / "obs.parquet").to_pandas().head()
[15]:
                   louvain leiden leiden_wnn      celltype
CAGCCAGGTCTCGACG-1       1      1          0  CD4+ naïve T
TTCTTCCTCTCGGTAA-1       1      1          0  CD4+ naïve T
CGGGTCAAGAGAGGTA-1       1      1          1  CD4+ naïve T
TACCCGTCATAATCCG-1       1      1          1  CD4+ naïve T
TGGGTTAGTGAATTAG-1       2      2          1  CD4+ naïve T
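
Since Parquet is a columnar format, individual columns can be read without scanning the whole table, e.g. just the cell type annotation:

[ ]:
# read a single column from the Parquet file
pq.read_table(file / "obs.parquet", columns=["celltype"]).to_pandas().head()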

Embeddings, e.g. UMAP:

[16]:
pq.read_table(file / "obsm/X_umap.parquet")
[16]:
pyarrow.Table
1: float
2: float
----
1: [[9.66083,10.850063,12.345492,12.538155,12.528112,...,-1.9723414,-1.5232953,-1.7778809,-1.7384567,-1.5827981]]
2: [[5.779709,6.7745605,5.6794906,6.2592025,4.833545,...,-2.152379,-2.1579309,-2.3050048,-2.199118,-2.308012]]
[17]:
pq.read_table(file / "obsm/X_umap.parquet").to_pandas().to_numpy()[:10]
[17]:
array([[ 9.66083  ,  5.779709 ],
       [10.850063 ,  6.7745605],
       [12.345492 ,  5.6794906],
       [12.538155 ,  6.2592025],
       [12.528112 ,  4.833545 ],
       [13.152369 ,  5.566309 ],
       [ 9.910739 ,  9.146889 ],
       [ 8.418465 ,  5.2574263],
       [12.064896 ,  5.3718014],
       [11.305964 , -7.5169115]], dtype=float32)

A dense matrix with normalised values is stored with protein names (.mod['prot'].var_names) as column names:

[18]:
pq.read_table(file / "mod/prot/X.parquet").to_pandas().tail()
[18]:
CD3_TotalSeqB CD4_TotalSeqB CD8a_TotalSeqB CD11b_TotalSeqB CD14_TotalSeqB CD15_TotalSeqB CD16_TotalSeqB CD19_TotalSeqB CD20_TotalSeqB CD25_TotalSeqB ... CD86_TotalSeqB CD127_TotalSeqB CD137_TotalSeqB CD197_TotalSeqB CD274_TotalSeqB CD278_TotalSeqB CD335_TotalSeqB PD-1_TotalSeqB HLA-DR_TotalSeqB TIGIT_TotalSeqB
406 -1.075961 1.245624 0.258909 2.032457 0.591571 1.429440 0.137301 2.361204 1.693235 1.284448 ... 4.252487 0.327570 1.309700 0.107906 0.710895 0.206146 2.295595 2.710157 13.345463 2.335479
407 -0.038257 8.728498 2.102803 1.870541 0.782583 2.021527 2.680047 1.295704 1.716122 1.112039 ... 1.883393 0.701129 2.403748 2.701393 1.573005 3.247963 2.305770 2.435769 13.624533 2.287940
408 0.254669 10.289223 1.933862 2.539970 0.168840 2.607727 2.538013 0.113553 3.241333 2.223520 ... 6.140669 1.847788 1.193516 0.819007 1.382885 0.913976 2.043625 0.751728 14.481518 2.834265
409 0.243213 13.866776 0.072664 1.043179 -0.051139 4.473774 1.771171 1.584222 4.893187 4.736353 ... 8.189964 2.338869 2.845908 1.140788 -0.275009 2.504928 4.931875 3.187832 19.352989 0.957393
410 2.687545 14.526915 3.441904 1.239545 2.281008 2.358840 1.231699 1.176513 3.691090 1.004248 ... 11.924853 1.308427 0.306832 3.636075 1.498553 3.189132 0.258857 4.084857 18.031036 2.995944

5 rows × 29 columns

The sparse matrix with counts is stored in the COO (coordinate) format:

[19]:
pq.read_table(file / "mod/prot/layers/counts.parquet").to_pandas().tail()
[19]:
       row  col   data
11098  410   23    2.0
11099  410   24    6.0
11100  410   26    6.0
11101  410   27  888.0
11102  410   28    3.0
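
From these triplets, the sparse matrix can be reconstructed, e.g. with scipy; a sketch of one way to do it (the shape is taken from the in-memory object here):

[ ]:
import scipy.sparse as sp

# rebuild the sparse matrix from the (row, col, data) triplets
coo = pq.read_table(file / "mod/prot/layers/counts.parquet").to_pandas()
counts = sp.coo_matrix(
    (coo["data"], (coo["row"], coo["col"])),
    shape=mdata["prot"].shape,  # the shape is not part of the triplets themselves
)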

For example, we can count proteins with a simple SQL query:

[20]:
import duckdb
duckdb.query(f'''
    SELECT COUNT(*)
    FROM '{file}/mod/prot/var.parquet'
''').fetchall()
[20]:
[(29,)]
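
Queries can also aggregate over annotation columns directly on disk, for instance counting cells per annotated cell type:

[ ]:
duckdb.query(f"""
    SELECT celltype, COUNT(*) AS n_cells
    FROM '{file}/obs.parquet'
    GROUP BY celltype
    ORDER BY n_cells DESC
""").fetchall()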

If you want to read more about the Python API of DuckDB, you can find an intro in the DuckDB documentation.

JSON files#

Simpler objects such as scalar values are serialised into JSON files.

[21]:
with open(file / "mod/rna/uns.json") as uns_file:
    uns = json.load(uns_file)

print(json.dumps(uns, indent=4))
{
    "hvg": {
        "flavor": "seurat"
    },
    "leiden": {
        "params": {
            "n_iterations": -1,
            "random_state": 0,
            "resolution": 0.75
        }
    },
    "neighbors": {
        "connectivities_key": "connectivities",
        "distances_key": "distances",
        "params": {
            "method": "umap",
            "metric": "euclidean",
            "n_neighbors": 15,
            "random_state": 0
        }
    },
    "pca": {
        "params": {
            "use_highly_variable": true,
            "zero_center": true
        }
    },
    "rank_genes_groups": {
        "params": {
            "corr_method": "benjamini-hochberg",
            "groupby": "leiden",
            "method": "t-test_overestim_var",
            "reference": "rest",
            "use_raw": true
        }
    },
    "umap": {
        "params": {
            "a": 0.5830300205483709,
            "b": 1.334166992455648,
            "random_state": 11
        }
    }
}
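
The root-level pqdata.json seen in the directory tree can be inspected the same way (its exact contents are an implementation detail of the format):

[ ]:
# peek at the top-level metadata file
with open(file / "pqdata.json") as f:
    print(json.dumps(json.load(f), indent=4))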

Reading#

The AnnData reader will return an in-memory AnnData object, and the MuData reader will return an in-memory MuData object:

[22]:
read_mudata(file)
[22]:
MuData object with n_obs × n_vars = 411 × 56
  obs:      'louvain', 'leiden', 'leiden_wnn', 'celltype'
  var:      'feature_types', 'gene_ids', 'highly_variable'
  obsm:     'X_wnn_umap', 'X_umap', 'X_mofa_umap', 'X_mofa'
  varm:     'LFs'
  obsp:     'connectivities', 'distances', 'wnn_connectivities', 'wnn_distances'
  2 modalities
    prot:   411 x 29
      var:  'gene_ids', 'feature_types', 'highly_variable'
      uns:  'neighbors', 'pca', 'umap'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      layers:       'counts'
      obsp: 'connectivities', 'distances'
    rna:    411 x 27
      obs:  'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'celltype'
      var:  'gene_ids', 'feature_types', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
      uns:  'hvg', 'leiden', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'celltype_colors', 'leiden_colors'
      obsm: 'X_pca', 'X_umap'
      varm: 'PCs'
      obsp: 'connectivities', 'distances'
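
A single modality exported with write_anndata can be loaded back with read_anndata, continuing the sketch from the writing section above:

[ ]:
# read the standalone AnnData export written earlier
adata = read_anndata(rna_file)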

File reading speed#

Reading speeds are not compromised and are actually improved:

[23]:
import mudata
[24]:
%%timeit
mudata.read("data/pbmc5k_citeseq/minipbcite.h5mu")
93.9 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[25]:
%%timeit
read_mudata(file)
45.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Together, these methods in pqdata make the Parquet-based serialisation compatible with the whole scverse ecosystem while also providing benefits for out-of-memory analytics on large datasets.