The command-line zs tool

The zs tool can be used from the command-line to create, view, and check ZS files.

The main zs command on its own isn’t very useful – it can tell you what version you have:

$ zs --version
0.9.0

And it can tell you what subcommands are available:

$ zs --help
ZS: a space-efficient file format for distributing, archiving,
and querying large data sets.

Usage:
  zs <subcommand> [<args>...]
  zs --version
  zs --help

Available subcommands:
  zs dump      Get contents of a .zs file.
  zs info      Get general metadata about a .zs file.
  zs validate  Check a .zs file for validity.
  zs make      Create a new .zs file with specified contents.

For details, use 'zs <subcommand> --help'.

These subcommands are documented further below.

Note

If you have the Python zs package installed, but somehow do not have the zs executable available on your path, you can also invoke it as python -m zs. E.g., these two commands do the same thing:

$ zs dump myfile.zs
$ python -m zs dump myfile.zs

zs make

zs make allows you to create ZS files. In its simplest form, it just reads in a text file, and writes out a ZS file, treating each line as a separate record.

For example, if we have this data file (a tiny excerpt from the Web 1T dataset released by Google; note that the last whitespace in each line is a tab character):

$ cat tiny-4grams.txt
not done explicitly .	42
not done extensive research	225
not done extensive testing	749
not done extensive tests	87
not done extremely well	41
not done fairly .	61
not done fast ,	52
not done fast enough	71

Then we can compress it into a ZS file by running:

$ zs make '{"corpus": "doc-example"}' tiny-4grams.txt tiny-4grams.zs
zs: Opening new ZS file: tiny-4grams.zs
zs: Reading input file: tiny-4grams.txt
zs: Blocks written: 1
zs: Blocks written: 2
zs: Blocks written: 2
zs: Updating header...
zs: Done.

The first argument specifies some arbitrary metadata that will be saved into the ZS file, in the form of a JSON string; the second argument names the file we want to convert; and the third argument names the file we want to create.

Note

You must ensure that your file is sorted before running zs make. (If you don’t, then it will error out and scold you.) GNU sort is very useful for this task – but don’t forget to set LC_ALL=C in your environment before calling sort, to make sure that it uses ASCIIbetical ordering instead of something locale-specific.
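
For example, assuming an unsorted input file called unsorted-4grams.txt (the filename is just for illustration), a minimal pipeline might look like:

$ env LC_ALL=C sort unsorted-4grams.txt | zs make '{"corpus": "doc-example"}' - sorted-4grams.zs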

When your file is too large to fit into RAM, GNU sort will spill the data onto disk in temporary files. When even the uncompressed data is too large to fit onto disk, a useful incantation is:

gunzip -c myfile.gz | env LC_ALL=C sort --compress-program=lzop \
   | zs make "{...}" - myfile.zs

The --compress-program option tells sort to automatically compress and decompress the temporary files using the lzop utility, so that you never end up with uncompressed data on disk. (gzip also works, but will be slower.)

Many other options are also available:

$ zs make --help
Create a new .zs file.

Usage:
  zs make <metadata> <input_file> <new_zs_file>
  zs make [--terminator TERMINATOR | --length-prefixed=TYPE]
          [-j PARALLELISM]
          [--no-spinner]
          [--branching-factor=FACTOR]
          [--approx-block-size=SIZE]
          [--codec=CODEC] [-z COMPRESS-LEVEL]
          [--no-default-metadata]
          [--]
          <metadata> <input_file> <new_zs_file>
  zs make --help

Arguments:

  <metadata>      Arbitrary JSON-encoded metadata that will be stored in your
                  new ZS file. This must be a JSON "object", i.e., the
                  outermost characters have to be {}. If you're just messing
                  about, then you can just use "{}" here and be done, but for
                  any file that will be around for a while, we strongly recommend
                  adding more details about what this file is. See the
                  "Metadata conventions" section of the ZS manual for more
                  information.

  <input_file>    A file containing the records to be packed into the
                  new .zs file. Use "-" for stdin. Records must already be
                  sorted in ASCIIbetical order. You may want to do something
                  like:
                    cat myfile.txt | env LC_ALL=C sort | zs make "{}" - myfile.zs

  <new_zs_file>  The file to create. Conventionally uses the file extension
                 ".zs".

Input file options:
  --terminator=TERMINATOR    Treat the input file as containing a series of
                             records separated by TERMINATOR. Standard Python
                             string escapes are supported (e.g., "\x00" for
                             NUL-terminated records). The default is
                             appropriate for standard Unix/OS X text files. If
                             you have a text file with Windows-style line
                             endings, then you'll want to use "\r\n"
                             instead. [default: \n]
  --length-prefixed=TYPE     Treat the input file as containing a series of
                             records containing arbitrary binary data, each
                             prefixed by its length in bytes, with this length
                             encoded according to TYPE. (Valid options:
                             uleb128, u64le.)

Processing options:
  -j PARALLELISM             The number of CPUs to use for compression.
                             [default: guess]
  --no-spinner               Disable the progress meter.

Output file options:
  --branching-factor=FACTOR  Number of keys in each *index* block.
                             [default: 1024]
  --approx-block-size=SIZE   Approximate *uncompressed* size of the records in
                             each *data* block, in bytes. [default: 131072]
  --codec=CODEC              Compression algorithm. (Valid options: none,
                             deflate, bz2, lzma.) [default: bz2]
  -z COMPRESS-LEVEL, --compress-level=COMPRESS-LEVEL
                             Degree of compression to use. Interpretation
                             depends on the codec in use:
                               deflate: An integer between 1 and 9.
                                 (Default: 6)
                               bz2: An integer between 1 and 9. (Default: 9)
                               lzma: One of the strings 0, 0e, 1, or 1e.
                                 Note that 0 and 1 are several times faster
                                 than 0e and 1e, though at some cost in
                                 compression ratio. Note also that there is no
                                 benefit to using 1 or 1e unless you also
                                 increase --approx-block-size. (Default: 0e)
  --no-default-metadata      By default, 'zs make' adds an extra "build-info"
                             key to the metadata, recording the time, host,
                             user who created the file, and zs library
                             version. This option disables this behaviour.
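
For example, a hypothetical invocation that switches to the lzma codec and uses larger data blocks (the output filename and block size here are only illustrative) might look like:

$ zs make --codec=lzma --approx-block-size=262144 \
    '{"corpus": "doc-example"}' tiny-4grams.txt tiny-4grams-lzma.zs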

zs info

zs info displays some general information about a ZS file. For example:

$ zs info tiny-4grams.zs
{
    "root_index_offset": 397, 
    "root_index_length": 85, 
    "total_file_length": 482, 
    "codec": "bz2", 
    "data_sha256": "403b706aa1f8f5d1d2ffd2765507239bd5a5025bde3f89df8035f8a5b9348b11", 
    "metadata": {
        "corpus": "doc-example", 
        "build-info": {
            "host": "branna.vorpus.org", 
            "user": "njs", 
            "time": "2014-04-24T19:20:22.168359Z"
        }
    }, 
    "statistics": {
        "root_index_level": 1
    }
}

The most interesting part of this output might be the "metadata" field, which contains arbitrary metadata describing the file. Here we see that our custom key was indeed added, and that zs make also added some default metadata. (If we wanted to suppress this we could have used the --no-default-metadata option.) The "data_sha256" field is, as you might expect, a SHA-256 hash of the data contained in this file – two ZS files will have the same value here if and only if they contain exactly the same logical records, regardless of compression and other details of physical file layout. The "codec" field tells us which kind of compression was used (this file uses the bzip2 format); if we wanted something different we could have passed --codec to zs make. The other fields have to do with more obscure technical aspects of the ZS file format; see the documentation for the ZS class and the file format specification for details.

zs info is fast, even on arbitrarily large files, because it looks at only the header and the root index; it doesn’t have to uncompress the actual data. If you find a large ZS file on the web and want to see its metadata before downloading it, you can pass an HTTP URL to zs info directly on the command line, and it will download only as much of the file as it needs to.
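
For example, something along these lines (the URL is made up for illustration) would print the header information without fetching the whole file:

$ zs info http://example.com/corpora/big-corpus.zs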

zs info doesn’t take many options:

$ zs info --help
Display general information from a .zs file's header.

Usage:
  zs info [--metadata-only] [--] <zs_file>
  zs info --help

Arguments:
  <zs_file>  Path or URL pointing to a .zs file. An argument beginning with
             the four characters "http" will be treated as a URL.

Options:
  -m, --metadata-only   Output only the file's metadata, not any general
                        information about it.

Output will be valid JSON.

zs dump

So zs info tells us about the contents of a ZS file, but how do we get our data back out? That’s the job of zs dump. In the simplest case, it simply dumps the whole file to standard output, with one record per line – the inverse of zs make. For example, this lets us “uncompress” our ZS file to recover the original file:

$ zs dump tiny-4grams.zs
not done explicitly .	42
not done extensive research	225
not done extensive testing	749
not done extensive tests	87
not done extremely well	41
not done fairly .	61
not done fast ,	52
not done fast enough	71

But we can also extract just a subset of the data. For example, we can pull out a single line (notice the use of \t to specify a tab character – Python-style backslash character sequences are fully supported):

$ zs dump tiny-4grams.zs --prefix="not done extensive testing\t"
not done extensive testing	749

Or a set of related ngrams:

$ zs dump tiny-4grams.zs --prefix="not done extensive "
not done extensive research	225
not done extensive testing	749
not done extensive tests	87

Or any arbitrary range:

$ zs dump tiny-4grams.zs --start="not done ext" --stop="not done fast"
not done extensive research	225
not done extensive testing	749
not done extensive tests	87
not done extremely well	41
not done fairly .	61

Just like zs info, zs dump is fast – it reads only the data it needs to satisfy your query. (Of course, if you request the whole file, then it will read the whole file – but it does this in an optimized way; see the -j option if you want to tune how many CPUs it uses for decompression.) And just like zs info, zs dump can directly take an HTTP URL on the command line, and will download only as much data as it has to.
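
For example, a sketch of dumping a whole file to disk with an explicit worker count (the number of CPUs and the output filename here are arbitrary illustrations) might be:

$ zs dump -j 4 -o tiny-4grams-copy.txt tiny-4grams.zs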

We also have several options to let us control the output format. ZS files allow records to contain arbitrary data, which means that it’s possible to have a record that contains a newline embedded in it. So we might prefer to use some other character to mark the ends of records, like NUL:

$ zs dump tiny-4grams.zs --terminator="\x00"

...but putting the output from that into these docs would be hard to read. Instead we’ll demonstrate with something sillier:

$ zs dump tiny-4grams.zs --terminator="XYZZY" --prefix="not done extensive "
not done extensive research	225XYZZYnot done extensive testing	749XYZZYnot done extensive tests	87XYZZY

Of course, this will still have a problem if any of our records contained the string “XYZZY” – in fact, our records could in theory contain anything we might choose to use as a terminator, so if we have an arbitrary ZS file whose contents we know nothing about, then none of the options we’ve seen so far is guaranteed to work. The safest approach is to instead use a format in which each record is explicitly prefixed by its length. zs dump can produce length-prefixed output with lengths encoded in either u64le or uleb128 format (see Integer representations for details about what these are).

$ zs dump tiny-4grams.zs --prefix="not done extensive " --length-prefixed=u64le | hd
00000000  1f 00 00 00 00 00 00 00  6e 6f 74 20 64 6f 6e 65  |........not done|
00000010  20 65 78 74 65 6e 73 69  76 65 20 72 65 73 65 61  | extensive resea|
00000020  72 63 68 09 32 32 35 1e  00 00 00 00 00 00 00 6e  |rch.225........n|
00000030  6f 74 20 64 6f 6e 65 20  65 78 74 65 6e 73 69 76  |ot done extensiv|
00000040  65 20 74 65 73 74 69 6e  67 09 37 34 39 1b 00 00  |e testing.749...|
00000050  00 00 00 00 00 6e 6f 74  20 64 6f 6e 65 20 65 78  |.....not done ex|
00000060  74 65 6e 73 69 76 65 20  74 65 73 74 73 09 38 37  |tensive tests.87|
00000070

Obviously this is mostly intended for when you want to read the data into another program. For example, if you have a ZS file that was compressed using the bz2 codec and you want to convert it to the deflate codec, the easiest and safest way to do that is with a command like:

$ zs dump --length-prefixed=uleb128 myfile-bz2.zs | \
  zs make --length-prefixed=uleb128 --codec=deflate \
      "$(zs info -m myfile-bz2.zs)" - myfile-deflate.zs

If you’re using Python, of course, the most convenient way to read a ZS file is not to use zs dump at all, but to use the zs library API directly.
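
As a rough sketch of what that looks like (the constructor argument and the search keyword arguments should be double-checked against the ZS class documentation; records come back as byte strings):

from zs import ZS

# Open an existing .zs file (according to its docs, the ZS class can also
# open HTTP URLs).
z = ZS("tiny-4grams.zs")

# Iterate over all records sharing a prefix; each record is a bytes object.
for record in z.search(prefix=b"not done extensive "):
    print(record)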

Full options:

$ zs dump --help
Unpack some or all of the contents of a .zs file.

Usage:
  zs dump <zs_file>
  zs dump [--start=START] [--stop=STOP] [--prefix=PREFIX]
          [--terminator=TERMINATOR | --length-prefixed=TYPE]
          [-j PARALLELISM]
          [-o FILE]
          [--] <zs_file>
  zs dump --help

Arguments:
  <zs_file>  Path or URL pointing to a .zs file. An argument beginning with
             the four characters "http" will be treated as a URL.

Selection options:
  --start=START            Output only records which are >= START.
  --stop=STOP              Do not output any records which are >= STOP.
  --prefix=PREFIX          Output only records which begin with PREFIX.

  Python string escapes (e.g., "\n", "\x00") are allowed. All comparisons
  are performed using ASCIIbetical ordering.

Processing options:
  -j PARALLELISM           The number of CPUs to use for decompression. Note
                           that if you know that you are only reading a small
                           number of records, then -j0 may be the fastest
                           option, since it reduces startup overhead.
                           [default: guess]

Output options:
  -o FILE, --output=FILE   Output to the given file, or "-" for stdout.
                           [default: -]

Record framing options:
  --terminator=TERMINATOR  String used to terminate records in output. Python
                           string escapes are allowed (e.g., "\n", "\x00").
                           [default: \n]
  --length-prefixed=TYPE   Instead of terminating records with a marker,
                           prefix each record with its length, encoded as
                           TYPE. (Options: uleb128, u64le)

  ZS files are organized as a collection of records, which may contain
  arbitrary data. By default, these are output as individual lines. However,
  this may not be a great idea if you have records which themselves contain
  newline characters. As an alternative, you can request that they instead be
  terminated by some arbitrary string, or else request that each record be
  prefixed by its length, encoded in either unsigned little-endian base-128
  (uleb128) format or unsigned little-endian 64-bit (u64le) format.

Warning

Due to limitations in the multiprocessing module in Python 2, zs dump can be poorly behaved if you hit control-C (e.g., refusing to exit).

On a Unix-like platform, if you have a zs dump that is ignoring control-C, then try hitting control-Z and then running kill %zs.

The easy workaround to this problem is to use Python 3 to run zs. The not-so-easy workaround is to implement a custom process pool manager for Python 2 – patches accepted!

zs validate

This command can be used to fully validate a ZS file for self-consistency and compliance with the specification (see On-disk layout of ZS files); this makes it rather useful to anyone trying to write new software to generate ZS files.

It is also useful because it verifies the SHA-256 checksum and all of the per-block checksums, providing extremely strong protection against errors caused by disk failures, cosmic rays, and other such annoyances. However, this is not usually necessary, since the zs commands and the zs library interface never return any data unless it passes a 64-bit checksum. With ZS you can be sure that your results have not been corrupted by hardware errors, even if you never run zs validate at all.
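
For example, to check the file we made earlier:

$ zs validate tiny-4grams.zs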

Full options:

$ zs validate --help
Check a .zs file for errors or data corruption.

Usage:
  zs validate [-j PARALLELISM] [--] <zs_file>

Arguments:
  <zs_file>  Path or URL pointing to a .zs file. An argument beginning with
             the four characters "http" will be treated as a URL.

Options:
  -j PARALLELISM             The number of CPUs to use for decompression.
                             [default: guess]