Low-level access to `.pqdata`#

Akin to the h5py and zarr libraries, pqdata implements a simple lower-level interface for accessing the contents of .pqdata directories.

Some imports first:

[1]:

import numpy as np
from pathlib import Path
from pyarrow import parquet as pa

[2]:

import pqdata

Prepare the data:

[3]:

data = Path("data")

[4]:

import mudatasets
import mudata
mudata.set_options(pull_on_update=False)
mdata = mudatasets.load("pbmc5k_citeseq", files=["minipbcite.h5mu"], data_dir=data, backed=False)

■ File minipbcite.h5mu from pbmc5k_citeseq has been found at data/pbmc5k_citeseq/minipbcite.h5mu
■ Checksum is validated (md5) for minipbcite.h5mu
■ Loading minipbcite.h5mu...

[5]:

file = data / "pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata"
pqdata.write_mudata(mdata, file)

`open()`#

open() is a simple entry point:

[6]:

f = pqdata.open(file)
f

[6]:

ParquetStorage(data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata)

Cf. zarr.open, h5py.File

“Opening” the file is fast as it doesn’t do much apart from remembering the location it is pointed at:

[7]:

f.path

[7]:

PosixPath('data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata')

It returns an object that can be traversed in a straightforward fashion.

E.g. individual tables can be reached:

[8]:

f["mod"]["rna"]["X"]

[8]:

ParquetArray(data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata/mod/rna/X.parquet): shape (411,27), type (NKG7:float, KLRC2:float, GNLY:float, IGHM:float, IRF8:float, CD8B:float, CD79A:float, CD14:float, MS4A7:float, CCL5:float, FCGR3A:float, IL4R:float, IGHD:float, S100A8:float, LYZ:float, CD8A:float, FOXP3:float, IL2RA:float, TCL1A:float, TCF4:float, ITGAM:float, TRAC:float, IL7R:float, CST3:float, ITGB1:float, MS4A1:float, KLF4:float)

as well as collections:

[9]:

list(f["mod"]["rna"]["uns"])

[9]:

['umap',
 'leiden',
 'celltype_colors',
 'leiden_colors',
 'pca',
 'neighbors',
 'hvg',
 'rank_genes_groups']

Note that simple structures and scalars stored in JSON files are read into memory during traversal:

[10]:

f["mod"]["rna"]["uns"]["pca"]["params"]

[10]:

{'use_highly_variable': True, 'zero_center': True}
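
The traversal behaviour described above can be illustrated with a minimal sketch that uses only the standard library. `LazyNode` is a hypothetical class and not part of pqdata. The sketch assumes a layout in which groups are directories, tables are `.parquet` files and simple structures are `.json` files. It is meant to show the idea (remember the path, read only small JSON values), not how pqdata is implemented.

```python
import json
import tempfile
from pathlib import Path


class LazyNode:
    """Hypothetical stand-in for a storage node: it only remembers a path."""

    def __init__(self, path):
        self.path = Path(path)  # nothing is opened or read here

    def __getitem__(self, key):
        child = self.path / key
        if child.is_dir():
            return LazyNode(child)  # group: stay lazy
        json_file = child.with_name(child.name + ".json")
        if json_file.is_file():
            # Small structure or scalar: read it into memory right away
            return json.loads(json_file.read_text())
        table_file = child.with_name(child.name + ".parquet")
        if table_file.is_file():
            return LazyNode(table_file)  # table: keep only the location
        raise KeyError(key)

    def __iter__(self):
        # List children by name, without file extensions
        return iter(sorted(p.stem for p in self.path.iterdir()))

    def __repr__(self):
        return f"LazyNode({self.path})"


# Build a tiny directory tree that follows the assumed layout
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "demo.pqdata"
    pca = root / "uns" / "pca"
    pca.mkdir(parents=True)
    (pca / "params.json").write_text(json.dumps({"zero_center": True}))
    (root / "X.parquet").write_bytes(b"")  # placeholder file, never read

    node = LazyNode(root)
    children = list(node)
    params = node["uns"]["pca"]["params"]  # a plain dict, already in memory
    table_name = node["X"].path.name  # only the location is known
```

In this sketch, indexing a table returns another lightweight node, so the cost of traversal does not depend on the size of the stored tables.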

Lightweight objects store information about the absolute (system path) and relative (to the .pqdata file) location of the data:

[11]:

print("f['mod']['rna']['X']")
print(f"  root: {f['mod']['rna']['X'].root}")
print(f"  name: {f['mod']['rna']['X'].name}")
print(f"  path: {f['mod']['rna']['X'].path}")

f['mod']['rna']['X']
  root: /rna
  name: /rna/X
  path: data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata/mod/rna/X.parquet

[12]:

table = pa.read_table(f["mod"]["rna"]["obsm"]["X_umap"].path)
umap_embedding = table.to_pandas().to_numpy()

Generally, the original type of the object that was used to make the table is stored in .schema.metadata:

[13]:

table.schema.metadata

[13]:

{b'array': b'{"shape": [411, 2], "class": {"module": "numpy", "name": "ndarray"}}'}
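
Arrow returns schema metadata as a dictionary with bytes keys and bytes values, so the entry has to be decoded before it can be used. The following is a minimal sketch that parses the exact value shown above with the standard library. `describe_array` is a hypothetical helper introduced here for illustration and is not a pqdata function.

```python
import json

# Metadata copied from the output above: both keys and values are bytes
metadata = {b'array': b'{"shape": [411, 2], "class": {"module": "numpy", "name": "ndarray"}}'}


def describe_array(meta):
    """Return (shape, qualified class name) from the b'array' entry."""
    info = json.loads(meta[b"array"].decode("utf-8"))
    cls = info["class"]
    return tuple(info["shape"]), f"{cls['module']}.{cls['name']}"


shape, class_name = describe_array(metadata)
print(shape, class_name)
```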

open() can also work on modalities embedded inside multimodal containers:

[14]:

rna = pqdata.open(f["mod"]["rna"].path)
rna

[14]:

ParquetStorage(data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata/mod/rna)

[15]:

table = pa.read_table(rna["obsm"]["X_umap"].path)
umap_embedding = table.to_pandas().to_numpy()

More generally, open() works on any part of the file hierarchy:

[16]:

rna_obsm = pqdata.open(rna["obsm"].path)
rna_obsm

[16]:

ParquetStorage(data/pbmc5k_citeseq/pbmc5k_citeseq_mudata.pqdata/mod/rna/obsm)

[17]:

table = pa.read_table(rna_obsm["X_umap"].path)
umap_embedding = table.to_pandas().to_numpy()

`read_elem()`#

Libraries like anndata store custom encoding types with the schema defined by the specification and provide the read_elem() interface to access individual objects like arrays, matrices, tables, etc.

With pqdata, such objects are stored as tables in Parquet files*, and the original object class is preserved in the metadata. There is therefore no need to define custom encoding types and schemas, as this is handled by the Parquet files themselves.

* simpler entities like scalars are stored in JSON files

[18]:

from pqdata.core import read_elem

[19]:

rna_umap = read_elem(rna_obsm["X_umap"])
assert np.array_equal(umap_embedding, rna_umap)
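
To illustrate how a class record of the form `{"module": ..., "name": ...}`, as seen in the schema metadata earlier, could be turned back into a Python class, here is a small sketch based on `importlib`. `resolve_class` is a hypothetical helper, and this is not a description of how `read_elem()` works internally. A standard-library class stands in for `numpy.ndarray` so that the example has no dependencies.

```python
import importlib


def resolve_class(class_info):
    """Import the module named in the record and return the class object."""
    module = importlib.import_module(class_info["module"])
    return getattr(module, class_info["name"])


# Assumed record shape, mirroring the "class" entry in the schema metadata
record = {"module": "collections", "name": "OrderedDict"}
cls = resolve_class(record)
restored = cls([("a", 1), ("b", 2)])
```

Looking the class up by module and name keeps the stored record small and readable, but it only works when the named module can be imported in the reading environment.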
