# The command-line zs tool

The zs tool can be used from the command-line to create, view, and check ZS files.

The main zs command on its own isn’t very useful. It can tell you what version you have – these docs were built with:

$ zs --version
0.10.0+dev

And it can tell you what subcommands are available:

$ zs --help
ZS: a space-efficient file format format for distributing, archiving,
and querying large data sets.

Usage:
zs <subcommand> [<args>...]
zs --version
zs --help

Available subcommands:
zs dump      Get contents of a .zs file.
zs validate  Check a .zs file for validity.
zs make      Create a new .zs file with specified contents.

For details, use 'zs <subcommand> --help'.


These subcommands are documented further below.

Note

In case you have the Python zs package installed, but somehow do not have the zs executable available on your path, then it can also be invoked as python -m zs. E.g., these two commands do the same thing:

$ zs dump myfile.zs
$ python -m zs dump myfile.zs
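
When zs is driven from a script, the same fallback can be automated. The following is a minimal sketch, not part of zs itself: the helper name `zs_command` is hypothetical, and it assumes that the interpreter running the script is the one that has the zs package installed.

```python
import shutil
import sys


def zs_command(*args):
    """Build an argv list that runs zs with the given arguments.

    Prefers a `zs` executable found on the PATH; otherwise falls back to
    running the package as a module with the current interpreter.
    """
    exe = shutil.which("zs")
    if exe is not None:
        return [exe, *args]
    return [sys.executable, "-m", "zs", *args]


argv = zs_command("dump", "myfile.zs")
print(argv)
```

The resulting list can be passed directly to `subprocess.run`.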


## zs make

zs make allows you to create ZS files. In its simplest form, it just reads in a text file, and writes out a ZS file, treating each line as a separate record.

For example, if we have this data file (a tiny excerpt from the Web 1T dataset released by Google; note that the last whitespace in each line is a tab character):

$ cat tiny-4grams.txt
not done explicitly .	42
not done extensive research	225
not done extensive testing	749
not done extensive tests	87
not done extremely well	41
not done fairly .	61
not done fast ,	52
not done fast enough	71

Then we can compress it into a ZS file by running:

$ zs make '{"corpus": "doc-example"}' tiny-4grams.txt tiny-4grams.zs
zs: Opening new ZS file: tiny-4grams.zs
zs: Blocks written: 1
zs: Blocks written: 2
zs: Done.


The first argument specifies some arbitrary metadata that will be saved into the ZS file, in the form of a JSON string; the second argument names the file we want to convert; and the third argument names the file we want to create.
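
Hand-writing JSON inside shell quotes gets error-prone once the metadata grows. As a hedged sketch, the metadata argument can be generated from a Python dictionary instead; the dictionary contents and file names below are simply the ones from the example above, and `shlex.quote` is used on the assumption that the command will be run from a POSIX shell.

```python
import json
import shlex

metadata = {"corpus": "doc-example"}

# The metadata argument must be a JSON object, so serialize a dict.
metadata_json = json.dumps(metadata)

# shlex.quote makes the string safe to paste into a POSIX shell command line.
command = "zs make {} tiny-4grams.txt tiny-4grams.zs".format(
    shlex.quote(metadata_json)
)
print(command)
# zs make '{"corpus": "doc-example"}' tiny-4grams.txt tiny-4grams.zs
```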

Note

You must ensure that your file is sorted before running zs make. (If you don’t, then it will error out and scold you.) GNU sort is very useful for this task – but don’t forget to set LC_ALL=C in your environment before calling sort, to make sure that it uses ASCIIbetical ordering instead of something locale-specific.

When your file is too large to fit into RAM, GNU sort will spill the data onto disk in temporary files. When your file is too large to fit onto disk, then a useful incantation is:

gunzip -c myfile.gz | env LC_ALL=C sort --compress-program=lzop \
| zs make "{...}" - myfile.zs


The --compress-program option tells sort to automatically compress and decompress the temporary files using the lzop utility, so that you never end up with uncompressed data on disk. (gzip also works, but will be slower.)
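
To check whether a file is already in the required order before handing it to zs make, the ordering can be modelled in a few lines of Python. This is only a sketch: it assumes newline-terminated records and models "ASCIIbetical" order as plain byte-wise comparison, which is what `LC_ALL=C sort` uses. The function name `first_unsorted_record` is hypothetical, and the sketch does not settle how zs make treats equal adjacent records.

```python
def first_unsorted_record(lines):
    """Return the 1-based number of the first record that sorts before its
    predecessor under byte-wise comparison, or None if the input is sorted.

    The trailing newline is stripped before comparing, because the
    terminator is not part of the record.
    """
    previous = None
    for number, line in enumerate(lines, start=1):
        record = line[:-1] if line.endswith(b"\n") else line
        if previous is not None and record < previous:
            return number
        previous = record
    return None


sample = [
    b"not done fast ,\t52\n",
    b"not done fast enough\t71\n",
]
print(first_unsorted_record(sample))        # None: already sorted
print(first_unsorted_record(sample[::-1]))  # 2

# Byte-wise order differs from most locale orders: uppercase sorts first.
print(sorted([b"apple", b"Zebra", b"banana"]))
# [b'Zebra', b'apple', b'banana']
```

For a real file, open it in binary mode (`open("myfile.txt", "rb")`) and pass the file object directly; iterating over it yields one line at a time, so the whole file never needs to fit in memory.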

Many other options are also available:

$ zs make --help
Create a new .zs file.

Usage:
  zs make <metadata> <input_file> <new_zs_file>
  zs make [--terminator TERMINATOR | --length-prefixed=TYPE]
          [-j PARALLELISM]
          [--no-spinner]
          [--branching-factor=FACTOR]
          [--approx-block-size=SIZE]
          [--codec=CODEC] [-z COMPRESS-LEVEL]
          [--no-default-metadata]
          [--] <metadata> <input_file> <new_zs_file>
  zs make --help

Arguments:
  <metadata>      Arbitrary JSON-encoded metadata that will be stored in your
                  new ZS file. This must be a JSON "object", i.e., the
                  outermost characters have to be {}. If you're just messing
                  about, then you can just use "{}" here and be done, but for
                  any file that will live for long then we strongly recommend
                  adding more details about what this file is. See the
                  "Metadata conventions" section of the ZS manual for more
                  information.

  <input_file>    A file containing the records to be packed into the
                  new .zs file. Use "-" for stdin. Records must already be
                  sorted in ASCIIbetical order. You may want to do something
                  like:
                    cat myfile.txt | env LC_ALL=C sort | zs make - myfile.zs

  <new_zs_file>   The file to create. Conventionally uses the file extension
                  ".zs".

Input file options:
  --terminator=TERMINATOR    Treat the input file as containing a series of
                             records separated by TERMINATOR. Standard Python
                             string escapes are supported (e.g., "\x00" for
                             NUL-terminated records). The default is
                             appropriate for standard Unix/OS X text files. If
                             your have a text file with Windows-style line
                             endings, then you'll want to use "\r\n" instead.
                             [default: \n]
  --length-prefixed=TYPE     Treat the input file as containing a series of
                             records containing arbitrary binary data, each
                             prefixed by its length in bytes, with this length
                             encoded according to TYPE. (Valid options:
                             uleb128, u64le.)

Processing options:
  -j PARALLELISM             The number of CPUs to use for compression.
                             [default: guess]
  --no-spinner               Disable the progress meter.

Output file options:
  --branching-factor=FACTOR  Number of keys in each *index* block.
                             [default: 1024]
  --approx-block-size=SIZE   Approximate *uncompressed* size of the records in
                             each *data* block, in bytes. [default: 393216]
  --codec=CODEC              Compression algorithm. (Valid options: none,
                             deflate, lzma.) [default: lzma]
  -z COMPRESS-LEVEL, --compress-level=COMPRESS-LEVEL
                             Degree of compression to use. Interpretation
                             depends on the codec in use:
                               deflate: An integer between 1 and 9.
                                 (Default: 6)
                               lzma: One of the strings 0, 0e, 1, or 1e. The
                                 number (0 versus 1) indicates the history
                                 size used in the compression -- there's no
                                 point in using 1 or 1e unless you also
                                 increase --approx-block-size. The presence of
                                 the "e" turns on "extreme" mode, which is
                                 several times slower, but may produce
                                 substantially smaller files. (Default: 0e)
  --no-default-metadata      By default, 'zs make' adds an extra "build-info"
                             key to the metadata, recording the time, host,
                             user who created the file, and zs library
                             version. This option disables this behaviour.

## zs info

zs info displays some general information about a ZS file. For example:

$ zs info tiny-4grams.zs
{
    "root_index_offset": 380,
    "root_index_length": 41,
    "total_file_length": 421,
    "codec": "deflate",
    "data_sha256": "403b706aa1f8f5d1d2ffd2765507239bd5a5025bde3f89df8035f8a5b9348b11",
    "metadata": {
        "build-info": {
            "time": "2014-04-29T12:41:59.660529Z",
            "user": "njs",
            "version": "zs 0.9.0",
            "host": "branna.vorpus.org"
        },
        "corpus": "doc-example"
    },
    "statistics": {
        "root_index_level": 1
    }
}


The most interesting part of this output might be the "metadata" field, which contains arbitrary metadata describing the file. Here we see that our custom key was indeed added, and that zs make also added some default metadata. (If we wanted to suppress this we could have used the --no-default-metadata option.) The "data_sha256" field is, as you might expect, a SHA-256 hash of the data contained in this file – two ZS files will have the same value here if and only if they contain exactly the same logical records, regardless of compression and other details of physical file layout. The "codec" field tells us which kind of compression was used. The other fields have to do with more obscure technical aspects of the ZS file format; see the documentation for the ZS class and the file format specification for details.
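
Because the output is JSON, it is easy to consume from a script. The sketch below parses an abbreviated copy of the output shown above with Python's standard `json` module and separates the custom metadata from the default "build-info" entry. The literal string stands in for the real command output, and the nesting of "build-info" and "corpus" under "metadata" is taken from the description above.

```python
import json

# Abbreviated stand-in for the text printed by `zs info tiny-4grams.zs`.
info_output = """
{
    "codec": "deflate",
    "total_file_length": 421,
    "metadata": {
        "build-info": {"user": "njs", "version": "zs 0.9.0"},
        "corpus": "doc-example"
    }
}
"""

info = json.loads(info_output)

# Everything under "metadata" except the automatically added build-info.
custom = {k: v for k, v in info["metadata"].items() if k != "build-info"}

print(info["codec"])  # deflate
print(custom)         # {'corpus': 'doc-example'}
```

In practice the text would come from running zs info itself, for example by capturing its standard output with `subprocess.run`.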

zs info is fast, even on arbitrarily large files, because it looks at only the header and the root index; it doesn’t have to uncompress the actual data. If you find a large ZS file on the web and want to see its metadata before downloading it, you can pass an HTTP URL to zs info directly on the command line, and it will download only as much of the file as it needs to.

zs info doesn’t take many options:

$ zs info --help
Display general information from a .zs file's header.

Usage:
  zs info [--metadata-only] [--] <zs_file>
  zs info --help

Arguments:
  <zs_file>  Path or URL pointing to a .zs file. An argument beginning with
             the four characters "http" will be treated as a URL.

Options:
  -m, --metadata-only   Output only the file's metadata, not any general
                        information about it.

Output will be valid JSON.

## zs dump

So zs info tells us about the contents of a ZS file, but how do we get our data back out? That’s the job of zs dump. In the simplest case, it simply dumps the whole file to standard output, with one record per line – the inverse of zs make. For example, this lets us “uncompress” our ZS file to recover the original file:

$ zs dump tiny-4grams.zs
not done explicitly .	42
not done extensive research	225
not done extensive testing	749
not done extensive tests	87
not done extremely well	41
not done fairly .	61
not done fast ,	52
not done fast enough	71


But we can also extract just a subset of the data. For example, we can pull out a single line (notice the use of \t to specify a tab character – Python-style backslash character sequences are fully supported):

$ zs dump tiny-4grams.zs --prefix="not done extensive testing\t"
not done extensive testing	749

Or a set of related ngrams:

$ zs dump tiny-4grams.zs --prefix="not done extensive "
not done extensive research	225
not done extensive testing	749
not done extensive tests	87
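
Prefix queries are cheap because, in a sorted file, every record that shares a prefix sits in one contiguous run. The following sketch models that idea on the example records using Python's `bisect` module. It is an illustration of the principle only, not a description of how zs is implemented; the helper names `decode_escapes` and `prefix_slice` are hypothetical, and the escape handling is an approximation of the Python-style escapes mentioned above.

```python
import bisect

# The example records, already in byte-wise sorted order.
records = [
    b"not done explicitly .\t42",
    b"not done extensive research\t225",
    b"not done extensive testing\t749",
    b"not done extensive tests\t87",
    b"not done extremely well\t41",
    b"not done fairly .\t61",
    b"not done fast ,\t52",
    b"not done fast enough\t71",
]


def decode_escapes(text):
    """Turn backslash escapes such as \\t or \\x00 in an ASCII string into bytes."""
    return text.encode("latin-1").decode("unicode_escape").encode("latin-1")


def prefix_slice(sorted_records, prefix):
    """Return the contiguous run of records that begin with prefix."""
    # Binary search finds the first record >= prefix; in sorted data every
    # match follows immediately after that position.
    lo = bisect.bisect_left(sorted_records, prefix)
    hi = lo
    while hi < len(sorted_records) and sorted_records[hi].startswith(prefix):
        hi += 1
    return sorted_records[lo:hi]


print(prefix_slice(records, decode_escapes(r"not done extensive testing\t")))
# [b'not done extensive testing\t749']
print(len(prefix_slice(records, decode_escapes("not done extensive "))))
# 3
```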


Or any arbitrary range:

$ zs dump tiny-4grams.zs --start="not done ext" --stop="not done fast"
not done extensive research	225
not done extensive testing	749
not done extensive tests	87
not done extremely well	41
not done fairly .	61

Just like zs info, zs dump is fast – it reads only the data it needs to satisfy your query. (Of course, if you request the whole file, then it will read the whole file – but it does this in an optimized way; see the -j option if you want to tune how many CPUs it uses for decompression.) And just like zs info, zs dump can directly take an HTTP URL on the command line, and will download only as much data as it has to.

We also have several options to let us control the output format. ZS files allow records to contain arbitrary data, which means that it’s possible to have a record that contains a newline embedded in it. So we might prefer to use some other character to mark the ends of records, like NUL:

$ zs dump tiny-4grams.zs --terminator="\x00"


...but putting the output from that into these docs would be hard to read. Instead we’ll demonstrate with something sillier:

$ zs dump tiny-4grams.zs --terminator="XYZZY" --prefix="not done extensive "
not done extensive research	225XYZZYnot done extensive testing	749XYZZYnot done extensive tests	87XYZZY

Of course, this will still have a problem if any of our records contained the string “XYZZY” – in fact, our records could in theory contain anything we might choose to use as a terminator, so if we have an arbitrary ZS file whose contents we know nothing about, then none of the options we’ve seen so far is guaranteed to work. The safest approach is to instead use a format in which each record is explicitly prefixed by its length. zs dump can produce length-prefixed output with lengths encoded in either u64le or uleb128 format (see Integer representations for details about what these are).

$ zs dump tiny-4grams.zs --prefix="not done extensive " --length-prefixed=u64le | hd
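
A program on the receiving end of a length-prefixed stream has to undo this framing. The sketch below shows one way to do that in Python for both encodings. It assumes the usual definitions: u64le is an 8-byte unsigned little-endian integer, and uleb128 stores 7 bits per byte, least-significant group first, with the high bit set on every byte except the last. The Integer representations section of the ZS documentation remains the authoritative definition, and all function names here are hypothetical.

```python
import io
import struct


def encode_u64le(value):
    return struct.pack("<Q", value)


def encode_uleb128(value):
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)
            return bytes(out)


def read_u64le(stream):
    """Read one length, or return None at a clean end of stream."""
    raw = stream.read(8)
    if not raw:
        return None
    if len(raw) < 8:
        raise EOFError("truncated u64le length")
    return struct.unpack("<Q", raw)[0]


def read_uleb128(stream):
    """Read one length, or return None at a clean end of stream."""
    result, shift, first = 0, 0, True
    while True:
        raw = stream.read(1)
        if not raw:
            if first:
                return None
            raise EOFError("truncated uleb128 length")
        first = False
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7


def iter_records(stream, read_length):
    """Yield each record from a length-prefixed binary stream."""
    while True:
        length = read_length(stream)
        if length is None:
            return
        record = stream.read(length)
        if len(record) != length:
            raise EOFError("truncated record")
        yield record


# Round-trip demonstration, including a record with an embedded newline.
records = [b"plain record", b"record with\nembedded newline", b""]
for encode, read in ((encode_u64le, read_u64le), (encode_uleb128, read_uleb128)):
    framed = b"".join(encode(len(r)) + r for r in records)
    assert list(iter_records(io.BytesIO(framed), read)) == records
```

To consume real output, pass a buffered binary stream such as `sys.stdin.buffer` in place of the `io.BytesIO` object.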


Obviously this is mostly intended for when you want to read the data into another program. For example, if you had a ZS file that was compressed using the lzma codec and you wanted to convert it to the deflate codec, the easiest and safest way to do that is with a command like:

$ zs dump --length-prefixed=uleb128 myfile-lzma.zs | \
    zs make --length-prefixed=uleb128 --codec=deflate \
    "$(zs info -m myfile-lzma.zs)" - myfile-deflate.zs


If you’re using Python, of course, the most convenient way to read a ZS file into your program is not to use zs dump at all, but to use the zs library API directly.

Full options:

$ zs dump --help
Unpack some or all of the contents of a .zs file.

Usage:
  zs dump <zs_file>
  zs dump [--start=START] [--stop=STOP] [--prefix=PREFIX]
          [--terminator=TERMINATOR | --length-prefixed=TYPE]
          [-j PARALLELISM]
          [-o FILE]
          [--] <zs_file>
  zs dump --help

Arguments:
  <zs_file>  Path or URL pointing to a .zs file. An argument beginning with
             the four characters "http" will be treated as a URL.

Selection options:
  --start=START              Output only records which are >= START.
  --stop=STOP                Do not output any records which are >= STOP.
  --prefix=PREFIX            Output only records which begin with PREFIX.

  Python string escapes (e.g., "\n", "\x00") are allowed. All comparisons are
  performed using ASCIIbetical ordering.

Processing options:
  -j PARALLELISM             The number of CPUs to use for decompression. Note
                             that if you know that you are only reading a
                             small number of records, then -j0 may be the
                             fastest option, since it reduces startup
                             overhead. [default: guess]

Output options:
  -o FILE, --output=FILE     Output to the given file, or "-" for stdout.
                             [default: -]

Record framing options:
  --terminator=TERMINATOR    String used to terminate records in output.
                             Python string escapes are allowed (e.g., "\n",
                             "\x00"). [default: \n]
  --length-prefixed=TYPE     Instead of terminating records with a marker,
                             prefix each record with its length, encoded as
                             TYPE. (Options: uleb128, u64le)

ZS files are organized as a collection of records, which may contain
arbitrary data. By default, these are output as individual lines. However,
this may not be a great idea if you have records which themselves contain
newline characters. As an alternative, you can request that they instead be
terminated by some arbitrary string, or else request that each record be
prefixed by its length, encoded in either unsigned little-endian base-128
(uleb128) format or unsigned little-endian 64-bit (u64le) format.

Warning

Due to limitations in the multiprocessing module in Python 2, zs dump can be poorly behaved if you hit control-C (e.g., refusing to exit). On a Unix-like platform, if you have a zs dump that is ignoring control-C, then try hitting control-Z and then running kill %zs.

The easy workaround to this problem is to use Python 3 to run zs. The not so easy workaround is to implement a custom process pool manager for Python 2 – patches accepted!

## zs validate

This command can be used to fully validate a ZS file for self-consistency and compliance with the specification (see On-disk layout of ZS files); this makes it rather useful to anyone trying to write new software to generate ZS files.

It is also useful because it verifies the SHA-256 checksum and all of the per-block checksums, providing extremely strong protection against errors caused by disk failures, cosmic rays, and other such annoyances. However, this is not usually necessary, since the zs commands and the zs library interface never return any data unless it passes a 64-bit checksum. With ZS you can be sure that your results have not been corrupted by hardware errors, even if you never run zs validate at all.

Full options:

$ zs validate --help
Check a .zs file for errors or data corruption.

Usage:
  zs validate [-j PARALLELISM] [--] <zs_file>

Arguments:
  <zs_file>  Path or URL pointing to a .zs file. An argument beginning with
             the four characters "http" will be treated as a URL.

Options:
  -j PARALLELISM             The number of CPUs to use for decompression.
                             [default: guess]