The command-line zs tool
The zs tool can be used from the command line to create, view, and check ZS files.
The main zs command on its own isn’t very useful. It can tell you what version you have; these docs were built with:
$ zs --version
0.10.0
And it can tell you what subcommands are available:
$ zs --help
ZS: a space-efficient file format format for distributing, archiving,
and querying large data sets.
Usage:
zs <subcommand> [<args>...]
zs --version
zs --help
Available subcommands:
zs dump Get contents of a .zs file.
zs info Get general metadata about a .zs file.
zs validate Check a .zs file for validity.
zs make Create a new .zs file with specified contents.
For details, use 'zs <subcommand> --help'.
These subcommands are documented further below.
Note
If you have the Python zs package installed but somehow do not have the zs executable available on your path, then it can also be invoked as python -m zs. For example, these two commands do the same thing:
$ zs dump myfile.zs
$ python -m zs dump myfile.zs
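A script that drives the tool can apply the same fallback automatically. The following is a minimal sketch, not part of zs itself: the helper name zs_command is hypothetical, and it only assumes that the executable is called zs and that python -m zs is equivalent, as the note above states.

```python
import shutil
import sys


def zs_command(*args):
    """Return an argv list that runs the zs tool with the given arguments.

    Hypothetical helper: prefers the `zs` executable when it is on PATH,
    and otherwise falls back to `python -m zs` using the current interpreter.
    """
    exe = shutil.which("zs")
    if exe is not None:
        prefix = [exe]
    else:
        prefix = [sys.executable, "-m", "zs"]
    return prefix + list(args)


# The resulting list can be passed directly to subprocess.run().
print(zs_command("dump", "myfile.zs"))
```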
zs make
zs make allows you to create ZS files. In its simplest form, it just reads in a text file and writes out a ZS file, treating each line as a separate record.
For example, if we have this data file (a tiny excerpt from the Web 1T dataset released by Google; note that the last whitespace in each line is a tab character):
$ cat tiny-4grams.txt
not done explicitly . 42
not done extensive research 225
not done extensive testing 749
not done extensive tests 87
not done extremely well 41
not done fairly . 61
not done fast , 52
not done fast enough 71
Then we can compress it into a ZS file by running:
$ zs make '{"corpus": "doc-example"}' tiny-4grams.txt tiny-4grams.zs --codec deflate
zs: Opening new ZS file: tiny-4grams.zs
zs: Reading input file: tiny-4grams.txt
zs: Blocks written: 1
zs: Blocks written: 2
zs: Updating header...
zs: Done.
The first argument specifies some arbitrary metadata that will be saved into the ZS file, in the form of a JSON string; the second argument names the file we want to convert; and the third argument names the file we want to create.
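When the metadata comes from a program rather than being typed by hand, it is easy to get the JSON quoting wrong. The sketch below builds the argument list for a zs make call from a Python dict. The helper name make_command is hypothetical; it relies only on the argument order described above and on the --codec=CODEC spelling shown in the help text.

```python
import json
import shlex


def make_command(metadata, input_file, output_file, codec=None):
    """Build an argv list for `zs make` (hypothetical helper, not part of zs).

    The metadata must be a dict so that it serializes to a JSON object.
    """
    if not isinstance(metadata, dict):
        raise TypeError("metadata must be a dict (a JSON object)")
    argv = ["zs", "make"]
    if codec is not None:
        argv.append("--codec=" + codec)
    argv += [json.dumps(metadata), input_file, output_file]
    return argv


argv = make_command({"corpus": "doc-example"}, "tiny-4grams.txt",
                    "tiny-4grams.zs", codec="deflate")
# shlex.join (Python 3.8+) shows a shell-safe version of the same command.
print(shlex.join(argv))
```

Passing the list to subprocess.run() avoids shell quoting altogether; shlex.join is only used here to display the command.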
The --codec argument lets us choose which compression method we use; usually you should stick with the default (which is lzma), but until readthedocs.org responds to our bug report we can’t use lzma here in the docs. Sorry.
Note
You must ensure that your file is sorted before running zs make. (If you don’t, then it will error out and scold you.) GNU sort is very useful for this task, but don’t forget to set LC_ALL=C in your environment before calling sort, to make sure that it uses ASCIIbetical ordering instead of something locale-specific.
When your file is too large to fit into RAM, GNU sort will spill the data onto disk in temporary files. When your file is too large to fit onto disk, then a useful incantation is:
gunzip -c myfile.gz | env LC_ALL=C sort --compress-program=lzop \
| zs make "{...}" - myfile.zs
The --compress-program option tells sort to automatically compress and decompress the temporary files using the lzop utility, so that you never end up with uncompressed data on disk. (gzip also works, but will be slower.)
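For small inputs that fit in memory, the same ordering can be produced or checked in Python. This is a sketch under the assumption that ASCIIbetical ordering means plain byte-by-byte comparison, which is how Python compares bytes objects; the function names are illustrative and are not part of zs.

```python
def asciibetical_sort(records):
    """Sort byte strings by raw byte value (assumed to match LC_ALL=C sort)."""
    return sorted(records)


def is_sorted(records):
    """Return True if each record is <= the one after it, byte-wise."""
    return all(a <= b for a, b in zip(records, records[1:]))


records = [
    b"not done fast enough\t71",
    b"Not done\t1",  # uppercase letters sort before lowercase in byte order
    b"not done extensive tests\t87",
]
print(is_sorted(records))
print(asciibetical_sort(records))
```

Working with bytes rather than str matters here: it keeps the comparison independent of any locale or text encoding.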
Many other options are also available:
$ zs make --help
Create a new .zs file.
Usage:
zs make <metadata> <input_file> <new_zs_file>
zs make [--terminator TERMINATOR | --length-prefixed=TYPE]
[-j PARALLELISM]
[--no-spinner]
[--branching-factor=FACTOR]
[--approx-block-size=SIZE]
[--codec=CODEC] [-z COMPRESS-LEVEL]
[--no-default-metadata]
[--]
<metadata> <input_file> <new_zs_file>
zs make --help
Arguments:
<metadata> Arbitrary JSON-encoded metadata that will be stored in your
new ZS file. This must be a JSON "object", i.e., the
outermost characters have to be {}. If you're just messing
about, then you can just use "{}" here and be done, but for
any file that will live for long then we strongly recommend
adding more details about what this file is. See the
"Metadata conventions" section of the ZS manual for more
information.
<input_file> A file containing the records to be packed into the
new .zs file. Use "-" for stdin. Records must already be
sorted in ASCIIbetical order. You may want to do something
like:
cat myfile.txt | env LC_ALL=C sort | zs make - myfile.zs
<new_zs_file> The file to create. Conventionally uses the file extension
".zs".
Input file options:
--terminator=TERMINATOR Treat the input file as containing a series of
records separated by TERMINATOR. Standard Python
string escapes are supported (e.g., "\x00" for
NUL-terminated records). The default is
appropriate for standard Unix/OS X text files. If
your have a text file with Windows-style line
endings, then you'll want to use "\r\n"
instead. [default: \n]
--length-prefixed=TYPE Treat the input file as containing a series of
records containing arbitrary binary data, each
prefixed by its length in bytes, with this length
encoded according to TYPE. (Valid options:
uleb128, u64le.)
Processing options:
-j PARALLELISM The number of CPUs to use for compression.
[default: guess]
--no-spinner Disable the progress meter.
Output file options:
--branching-factor=FACTOR Number of keys in each *index* block.
[default: 1024]
--approx-block-size=SIZE Approximate *uncompressed* size of the records in
each *data* block, in bytes. [default: 393216]
--codec=CODEC Compression algorithm. (Valid options: none,
deflate, lzma.) [default: lzma]
-z COMPRESS-LEVEL, --compress-level=COMPRESS-LEVEL
Degree of compression to use. Interpretation
depends on the codec in use:
deflate: An integer between 1 and 9.
(Default: 6)
lzma: One of the strings 0, 0e, 1, or 1e.
The number (0 versus 1) indicates the history
size used in the compression -- there's no
point in using 1 or 1e unless you also
increase --approx-block-size. The presence of
the "e" turns on "extreme" mode, which is
several times slower, but may produce
substantially smaller files. (Default: 0e)
--no-default-metadata By default, 'zs make' adds an extra "build-info"
key to the metadata, recording the time, host,
user who created the file, and zs library
version. This option disables this behaviour.
zs info
zs info displays some general information about a ZS file. For example:
$ zs info tiny-4grams.zs
{
"root_index_offset": 380,
"root_index_length": 41,
"total_file_length": 421,
"codec": "deflate",
"data_sha256": "403b706aa1f8f5d1d2ffd2765507239bd5a5025bde3f89df8035f8a5b9348b11",
"metadata": {
"build-info": {
"user": "njs",
"host": "branna.vorpus.org",
"time": "2014-04-29T12:41:59.660529Z",
"version": "zs 0.9.0"
},
"corpus": "doc-example"
},
"statistics": {
"root_index_level": 1
}
}
The most interesting part of this output might be the "metadata" field, which contains arbitrary metadata describing the file. Here we see that our custom key was indeed added, and that zs make also added some default metadata. (If we wanted to suppress this we could have used the --no-default-metadata option.) The "data_sha256" field is, as you might expect, a SHA-256 hash of the data contained in this file: two ZS files will have the same value here if and only if they contain exactly the same logical records, regardless of compression and other details of physical file layout. The "codec" field tells us which kind of compression was used. The other fields have to do with more obscure technical aspects of the ZS file format; see the documentation for the ZS class and the file format specification for details.
zs info is fast, even on arbitrarily large files, because it looks at only the header and the root index; it doesn’t have to uncompress the actual data. If you find a large ZS file on the web and want to see its metadata before downloading it, you can pass an HTTP URL to zs info directly on the command line, and it will download only as much of the file as it needs to.
zs info doesn’t take many options:
$ zs info --help
Display general information from a .zs file's header.
Usage:
zs info [--metadata-only] [--] <zs_file>
zs info --help
Arguments:
<zs_file> Path or URL pointing to a .zs file. An argument beginning with
the four characters "http" will be treated as a URL.
Options:
-m, --metadata-only Output only the file's metadata, not any general
information about it.
Output will be valid JSON.
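Because the output is valid JSON, it is straightforward to consume from a script. The sketch below parses a trimmed copy of the example output shown earlier and separates the user-supplied metadata from the "build-info" key. The function name summarize_info and the shape of the returned dict are illustrative choices, not part of zs.

```python
import json

# Trimmed copy of the `zs info tiny-4grams.zs` output shown earlier.
SAMPLE_INFO = """
{
    "root_index_offset": 380,
    "root_index_length": 41,
    "total_file_length": 421,
    "codec": "deflate",
    "data_sha256": "403b706aa1f8f5d1d2ffd2765507239bd5a5025bde3f89df8035f8a5b9348b11",
    "metadata": {
        "build-info": {"user": "njs", "version": "zs 0.9.0"},
        "corpus": "doc-example"
    },
    "statistics": {"root_index_level": 1}
}
"""


def summarize_info(info_text):
    """Pick out a few fields from the JSON printed by `zs info`."""
    info = json.loads(info_text)
    user_metadata = {
        key: value
        for key, value in info["metadata"].items()
        if key != "build-info"  # added by zs make unless --no-default-metadata
    }
    return {
        "codec": info["codec"],
        "size": info["total_file_length"],
        "sha256": info["data_sha256"],
        "user_metadata": user_metadata,
    }


print(summarize_info(SAMPLE_INFO))
```

In a real script the text would come from running the command, for example with subprocess.run(["zs", "info", path], capture_output=True, text=True).stdout.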
zs dump
So zs info tells us about the contents of a ZS file, but how do we get our data back out? That’s the job of zs dump. In the simplest case, it simply dumps the whole file to standard output, with one record per line, which is the inverse of zs make. For example, this lets us “uncompress” our ZS file to recover the original file:
$ zs dump tiny-4grams.zs
not done explicitly . 42
not done extensive research 225
not done extensive testing 749
not done extensive tests 87
not done extremely well 41
not done fairly . 61
not done fast , 52
not done fast enough 71
But we can also extract just a subset of the data. For example, we can pull out a single line (notice the use of \t to specify a tab character; Python-style backslash character sequences are fully supported):
$ zs dump tiny-4grams.zs --prefix="not done extensive testing\t"
not done extensive testing 749
Or a set of related ngrams:
$ zs dump tiny-4grams.zs --prefix="not done extensive "
not done extensive research 225
not done extensive testing 749
not done extensive tests 87
Or any arbitrary range:
$ zs dump tiny-4grams.zs --start="not done ext" --stop="not done fast"
not done extensive research 225
not done extensive testing 749
not done extensive tests 87
not done extremely well 41
not done fairly . 61
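The selection rules can be modelled on an in-memory sorted list, which makes the half-open range easy to see: records equal to or greater than the start value are included, and records equal to or greater than the stop value are excluded. This is only a sketch of the semantics using the standard bisect module; it says nothing about how zs implements the lookup, and the function name select is hypothetical.

```python
from bisect import bisect_left

# The example records, in ASCIIbetical order (the last whitespace is a tab).
RECORDS = [
    b"not done explicitly .\t42",
    b"not done extensive research\t225",
    b"not done extensive testing\t749",
    b"not done extensive tests\t87",
    b"not done extremely well\t41",
    b"not done fairly .\t61",
    b"not done fast ,\t52",
    b"not done fast enough\t71",
]


def select(records, start=None, stop=None, prefix=None):
    """Return the records r with start <= r < stop that begin with prefix."""
    lo, hi = 0, len(records)
    if start is not None:
        lo = max(lo, bisect_left(records, start))
    if stop is not None:
        hi = min(hi, bisect_left(records, stop))
    if prefix is not None:
        # Every record with this prefix sorts at or after the prefix itself.
        lo = max(lo, bisect_left(records, prefix))
    selected = []
    for record in records[lo:hi]:
        if prefix is not None and not record.startswith(prefix):
            break  # sorted input: matches form one contiguous run
        selected.append(record)
    return selected


for record in select(RECORDS, start=b"not done ext", stop=b"not done fast"):
    print(record.decode("ascii"))
```

Because the input is sorted, each query needs only a binary search plus a scan over the matching run, which is the property that makes range and prefix queries cheap.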
Just like zs info, zs dump is fast: it reads only the data it needs to satisfy your query. (Of course, if you request the whole file, then it will read the whole file, but it does this in an optimized way; see the -j option if you want to tune how many CPUs it uses for decompression.) And just like zs info, zs dump can directly take an HTTP URL on the command line, and will download only as much data as it has to.
We also have several options to let us control the output format. ZS files allow records to contain arbitrary data, which means that it’s possible to have a record that contains a newline embedded in it. So we might prefer to use some other character to mark the ends of records, like NUL:
$ zs dump tiny-4grams.zs --terminator="\x00"
...but putting the output from that into these docs would be hard to read. Instead we’ll demonstrate with something sillier:
$ zs dump tiny-4grams.zs --terminator="XYZZY" --prefix="not done extensive "
not done extensive research 225XYZZYnot done extensive testing 749XYZZYnot done extensive tests 87XYZZY
Of course, this will still have a problem if any of our records contained the string “XYZZY”. In fact, our records could in theory contain anything we might choose to use as a terminator, so if we have an arbitrary ZS file whose contents we know nothing about, then none of the options we’ve seen so far is guaranteed to work. The safest approach is to instead use a format in which each record is explicitly prefixed by its length. zs dump can produce length-prefixed output with lengths encoded in either u64le or uleb128 format (see Integer representations for details about what these are).
$ zs dump tiny-4grams.zs --prefix="not done extensive " --length-prefixed=u64le | hd
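A program reading length-prefixed output has to decode each length and then slice out that many bytes. The following sketch assumes that u64le is an 8-byte unsigned little-endian integer and that uleb128 is the standard unsigned LEB128 encoding (7 bits per byte, least significant group first, high bit set on every byte except the last). All function names are illustrative; the encoders exist only so that the example can be run without a ZS file.

```python
import struct


def encode_u64le(n):
    """Encode n as an 8-byte unsigned little-endian integer."""
    return struct.pack("<Q", n)


def read_u64le(buf, pos):
    """Decode a u64le integer at pos; return (value, position after it)."""
    (value,) = struct.unpack_from("<Q", buf, pos)
    return value, pos + 8


def encode_uleb128(n):
    """Encode n as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)
            return bytes(out)


def read_uleb128(buf, pos):
    """Decode an unsigned LEB128 integer at pos; return (value, new position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def encode_records(records, encode_length):
    """Prefix each record with its encoded length and concatenate."""
    return b"".join(encode_length(len(r)) + r for r in records)


def split_records(buf, read_length):
    """Split a length-prefixed byte string back into a list of records."""
    records = []
    pos = 0
    while pos < len(buf):
        length, pos = read_length(buf, pos)
        if pos + length > len(buf):
            raise ValueError("truncated record")
        records.append(buf[pos:pos + length])
        pos += length
    return records


# Records containing newlines or would-be terminators survive the round trip.
tricky = [b"line one\nline two", b"contains XYZZY", b"\x00"]
stream = encode_records(tricky, encode_uleb128)
print(split_records(stream, read_uleb128) == tricky)
```

The uleb128 form spends one byte on the length of any record shorter than 128 bytes, while u64le always spends eight but is simpler to decode.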
Obviously this is mostly intended for when you want to read the data into another program. For example, if you had a ZS file that was compressed using the lzma codec and you wanted to convert it to the deflate codec, the easiest and safest way to do that is with a command like:
$ zs dump --length-prefixed=uleb128 myfile-lzma.zs | \
zs make --length-prefixed=uleb128 --codec=deflate \
"$(zs info -m myfile-lzma.zs)" - myfile-deflate.zs
If you’re using Python, of course, the most convenient way to read a ZS file into your program is not to use zs dump at all, but to use the zs library API directly.
Full options:
$ zs dump --help
Unpack some or all of the contents of a .zs file.
Usage:
zs dump <zs_file>
zs dump [--start=START] [--stop=STOP] [--prefix=PREFIX]
[--terminator=TERMINATOR | --length-prefixed=TYPE]
[-j PARALLELISM]
[-o FILE]
[--] <zs_file>
zs dump --help
Arguments:
<zs_file> Path or URL pointing to a .zs file. An argument beginning with
the four characters "http" will be treated as a URL.
Selection options:
--start=START Output only records which are >= START.
--stop=STOP Do not output any records which are >= STOP.
--prefix=PREFIX Output only records which begin with PREFIX.
Python string escapes (e.g., "\n", "\x00") are allowed. All comparisons
are performed using ASCIIbetical ordering.
Processing options:
-j PARALLELISM The number of CPUs to use for decompression. Note
that if you know that you are only reading a small
number of records, then -j0 may be the fastest
option, since it reduces startup overhead.
[default: guess]
Output options:
-o FILE, --output=FILE Output to the given file, or "-" for stdout.
[default: -]
Record framing options:
--terminator=TERMINATOR String used to terminate records in output. Python
string escapes are allowed (e.g., "\n", "\x00").
[default: \n]
--length-prefixed=TYPE Instead of terminating records with a marker,
prefix each record with its length, encoded as
TYPE. (Options: uleb128, u64le)
ZS files are organized as a collection of records, which may contain
arbitrary data. By default, these are output as individual lines. However,
this may not be a great idea if you have records which themselves contain
newline characters. As an alternative, you can request that they instead be
terminated by some arbitrary string, or else request that each record be
prefixed by its length, encoded in either unsigned little-endian base-128
(uleb128) format or unsigned little-endian 64-bit (u64le) format.
Warning
Due to limitations in the multiprocessing module in Python 2, zs dump can be poorly behaved if you hit control-C (e.g., refusing to exit).
On a Unix-like platform, if you have a zs dump that is ignoring control-C, then try hitting control-Z and then running kill %zs.
The easy workaround to this problem is to use Python 3 to run zs. The not so easy workaround is to implement a custom process pool manager for Python 2 (patches accepted!).
zs validate
This command can be used to fully validate a ZS file for self-consistency and compliance with the specification (see On-disk layout of ZS files); this makes it rather useful to anyone trying to write new software to generate ZS files.
It is also useful because it verifies the SHA-256 checksum and all of the per-block checksums, providing extremely strong protection against errors caused by disk failures, cosmic rays, and other such annoyances. However, this is not usually necessary, since the zs commands and the zs library interface never return any data unless it passes a 64-bit checksum. With ZS you can be sure that your results have not been corrupted by hardware errors, even if you never run zs validate at all.
Full options:
$ zs validate --help
Check a .zs file for errors or data corruption.
Usage:
zs validate [-j PARALLELISM] [--] <zs_file>
Arguments:
<zs_file> Path or URL pointing to a .zs file. An argument beginning with
the four characters "http" will be treated as a URL.
Options:
-j PARALLELISM The number of CPUs to use for decompression.
[default: guess]