Note
Go to the end to download the full example code.
Command Line Interface
This example demonstrates skerch functionality directly accessible
from the command line interface (CLI).
Note that all CLI functionality can also be directly accessed from Python. Check the API docs for comprehensive information, e.g.:
To run these commands, we simply assume that skerch is accessible to the
available python interpreter. The following (standard) imports are only
needed to run this example:
import os
from tempfile import TemporaryDirectory
from skerch.__main__ import main_wrapper as skerch_main
CLI help documentation
The following command provides information about positional and optional CLI arguments:
python -m skerch -h
And the corresponding output looks like this:
skerch_main(["-h"])
usage: cli.py [-h] [--apost_n APOST_N] [--apost_err APOST_ERR] [--is_complex]
[--lop_shape LOP_SHAPE] [--hdf5dir HDF5DIR] [--lo LO] [--ro RO]
[--inner INNER] [--partsize PARTSIZE] [--dtype DTYPE]
[--in_path IN_PATH] [--out_path OUT_PATH] [--ok_flag OK_FLAG]
[--delete_subfiles]
[COMMAND]
skerch CLI
positional arguments:
[COMMAND] Determines which functionality to run:
{'create_hdf5_layout_lop', 'post_bounds',
'merge_hdf5'}
options:
-h, --help show this help message and exit
--apost_n APOST_N Number of a-posteriori measurements intended.
--apost_err APOST_ERR
A-posteriori target error, from 0 (no error) to 1 (0x
- 2x).
--is_complex If given, bounds are given for complex data.
--lop_shape LOP_SHAPE
Matrix shape in the form 'height,width' as positive
integers.
--hdf5dir HDF5DIR Directory to create the HDF5 layout.
--lo LO Number of left outer measurements.
--ro RO Number of right outer measurements.
--inner INNER Number of inner measurements.
--partsize PARTSIZE How many entries will each HDF5 sub-file have.
--dtype DTYPE Datatype of HDF5 layout to be created.
--in_path IN_PATH Input path for the file to be processed.
--out_path OUT_PATH Output path for the file to be processed.
--ok_flag OK_FLAG If given, all HDF5 flags are checked to equal this.
--delete_subfiles If given, HDF5 subfiles are deleted upon merging to
monolithic.
A-posteriori error bounds
It is possible to efficiently estimate the Frobenius distance between
any two linear operators via sketches (see e.g. skerch.a_posteriori
or e.g. Sketched Low-Rank Decompositions for more information and
examples on how to run this estimation with skerch).
In a nutshell, we apply the same random “test” sketch to both operators, and compare the distance between measurements, which becomes is a proxy for the distance between the operators.
Since this is a randomized estimation, it is subject to error, and there are probabilistic bounds that allow us to know the probability that a given error may have occurred. Interestingly, this probability does not depend on the size of the operators, but on the number of “test” measurements performed (and whether the operators are real-valued or complex).
The following CLI call allows us to quickly check these probabilistic bounds for a given configuration (30 complex measurements):
python -m skerch post_bounds --apost_n=30 --apost_err=0.5 --is_complex
The following Python code is equivalent:
skerch_main(["post_bounds", "--apost_n=30", "--apost_err=0.5", "--is_complex"])
{'LOWER: P(err<=0.5x)': 0.0030445096757934554, 'HIGHER: P(err>=1.5x)': 0.05865709397802224}
This can be interpreted as follows: If we performed 30 test measurements and got an error estimate of \(\hat{\varepsilon}\), the probability of the actual error \(\varepsilon\) being outside of the \((0.5\hat{\varepsilon}, 1.5\hat{\varepsilon})\) range is as provided.
Creating HDF5 layout for distributed sketches
HDF5 files allow to efficiently read and write
large numerical arrays in an out-of-core, distributed fashion.
This is useful to perform sketched decompositions of (very) large linear
operators, since both storage and measurements can be distributed across
different processes or machines (see skerch.hdf5 and
Out-of-core Operations via HDF5 for details on how to work with
these files using skerch).
The following skerch CLI call allows to conveniently create a HDF5
layout to store sketched measurements from a linear operator of given
lop_shape and dtype:
python -m skerch create_hdf5_layout_lop --lop_shape=100,200 \
--dtype=complex128 --partsize=10 --lo=30 --ro=30 --inner=60
Equivalent python code (up to use of tmpdir):
tmpdir = TemporaryDirectory()
skerch_main(
[
"create_hdf5_layout_lop",
f"--hdf5dir={tmpdir.name}",
"--lop_shape=100,200",
"--dtype=complex128",
"--partsize=10",
"--lo=30",
"--ro=30",
"--inner=60",
]
)
{'dirpath': '/tmp/tmp3i4i3s5l', 'lop_shape': (100, 200), 'lop_dtype': 'complex128', 'partsize': 10, 'lo_meas': 30, 'ro_meas': 30, 'inner_meas': 60}
Note that if hdf5dir is given, it must exist and be empty (if not
given, a temporary directory will be suggested).
The optional lo, ro, inner parameters determine whether layouts for
(respectively) left-outer, right-outer and inner sketches will be created.
Another important detail is that, in order to facilitate concurrent writing,
the overall HDF5 layout is divided in smaller chunks, in what is known
as a HDF5 virtual dataset.
In this example, each chunk contains partsize=10 measurements, so we
end up with 3 chunks for lo, ro and 6 for inner.
Merging distributed HDF5 sketches
Although decentralized measurement and storage via HDF5 virtual datasets has many advantages, some operations may require to process the measurements in a centralized fashion. For instance, many operative systems do not allow a single process to keep thousands of files open at the same time. Also, many numerical routines may not feature an out-of-core, in-place implementation.
The skerch solution is to merge all individual HDF5 chunks from the
virtual dataset into a single, centralized HDF5 file of the same size.
It will still have the same contents, but instead of being a collection of
HDF5 files bundled into a virtual dataset, it will be a single, monolithic
HDF5 file with contiguous data. The following command merges the previously
created left-outer measurement layout:
python -m skerch merge_hdf5 --delete_subfiles --ok_flag=initialized \
--in_path /tmp/tmp4fswvvk2/leftouter_ALL.h5
Equivalent python code (up to tmpdir):
skerch_main(
[
"merge_hdf5",
"--delete_subfiles",
"--ok_flag=initialized",
"--in_path",
os.path.join(tmpdir.name, "leftouter_ALL.h5"),
]
)
tmpdir.cleanup()
Merged all sub-files of /tmp/tmp3i4i3s5l/leftouter_ALL.h5 into monolithic /tmp/tmp3i4i3s5l/leftouter_ALL.h5
Also deleted sub-files.
Note the following:
If the
--delete_subfilesflag is provided, each “chunk” file file will be deleted after being merged. This ensures disk usage remains almost constant.If
--ok_flagis provided, the script will check that all HDF5 flags match this value before proceeding. This can be used to ensure that all distributed measurements have been performed before merging/deleting.The
--out_pathflag can also be provided to set the location of the merged HDF5 file. If none provided, the path of the former virtual dataset is used (and it becomes an actual monolithic dataset instead of a bundle of virtual references to the chunk files). In either case, this CLI call returns the path of the merged dataset.
Total running time of the script: (0 minutes 3.347 seconds)