The zs library for Python

Quickstart

Using the example file we created when demonstrating zs make, we can write:

In [1]: from zs import ZS

In [2]: z = ZS("example/tiny-4grams.zs")

In [3]: for record in z:
   ...:     print(record.decode("utf-8"))
   ...: 
not done explicitly .	42
not done extensive research	225
not done extensive testing	749
not done extensive tests	87
not done extremely well	41
not done fairly .	61
not done fast ,	52
not done fast enough	71

# Notice that on Python 3.x, we search using byte strings, and we get
# byte strings back.
# (On Python 2.x, byte strings are the same as regular strings.)
In [4]: for record in z.search(prefix=b"not done extensive testing\t"):
   ...:     print(record.decode("utf-8"))
   ...: 
not done extensive testing	749

In [5]: for record in z.search(prefix=b"not done extensive "):
   ...:     print(record.decode("utf-8"))
   ...: 
not done extensive research	225
not done extensive testing	749
not done extensive tests	87

In [6]: for record in z.search(start=b"not done ext", stop=b"not done fast"):
   ...:     print(record.decode("utf-8"))
   ...: 
not done extensive research	225
not done extensive testing	749
not done extensive tests	87
not done extremely well	41
not done fairly .	61

Error reporting

zs defines two exception types.

exception zs.ZSError

Exception class used for most errors encountered in the ZS package. (Though we do sometimes raise exceptions of the standard Python types like IOError, ValueError, etc.)

exception zs.ZSCorrupt

A subclass of ZSError, used specifically for errors that indicate a malformed or corrupted ZS file.
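
For example, a minimal sketch of distinguishing the two (the file name here is hypothetical):

from zs import ZS, ZSError, ZSCorrupt

try:
    with ZS("damaged.zs") as z:
        for record in z:
            pass
except ZSCorrupt:
    ...  # the file itself is malformed; catch the subclass first
except ZSError:
    ...  # some other ZS-specific problem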

Reading

Reading ZS files is accomplished by instantiating an object of type ZS:

class zs.ZS(path=None, url=None, parallelism='guess', index_block_cache=32)

Object representing a .zs file opened for reading.

Parameters:
  • path – A string giving the path of an on-disk file to open. Exactly one of path or url must be specified.
  • url – An HTTP (or HTTPS) URL pointing to a .zs file, which will be accessed directly from the server. The server must support Range: requests. Exactly one of path or url must be specified.
  • parallelism

    When querying a ZS file, there are always at least 2 threads working in parallel: the main thread, where you iterate over the results and presumably do something with them, and a second thread used for IO. In addition, we can spawn any number of worker processes which will be used internally for decompression and other CPU-intensive tasks. parallelism=1 means to spawn 1 worker process; if you want to perform decompression and other such tasks in serial in your main thread, then use parallelism=0. The default of parallelism="guess" means to spawn one worker process per available CPU.

    Note that if you know that you are going to read just a few records on each search, then parallelism=0 may be slightly faster; this saves the overhead of setting up the worker processes, and they only really help when doing large bulk reads.

  • index_block_cache – The number of index blocks to keep cached in memory. This speeds up repeated queries. Larger values provide better caching, but take more memory. Make sure that this is at least as large as your file’s root_index_level, or else the cache will be useless.
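
For example (the URL below is hypothetical, and parallelism=0 is just one possible choice):

from zs import ZS

# Open a local file, doing decompression serially in the main thread:
z = ZS(path="example/tiny-4grams.zs", parallelism=0)

# Or read directly from a remote server that supports Range: requests:
z = ZS(url="http://example.com/tiny-4grams.zs")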

This object can be used as a context manager, e.g.:

with ZS("./my/favorite.zs") as zs_obj:
    ...

is equivalent to:

zs_obj = ZS("./my/favorite.zs")
try:
    ...
finally:
    zs_obj.close()

Basic searches

class zs.ZS
search(start=None, stop=None, prefix=None)

Iterate over all records matching the given query.

A record is considered to “match” if:

  • start <= record, and
  • record < stop, and
  • record.startswith(prefix)

Any or all of the arguments can be left as None, in which case the corresponding check or checks are not performed.

Note the asymmetry between start and stop – this is analogous to other Python constructs which use half-open [start, stop) ranges, like range().

If no arguments are given, iterates over the entire contents of the .zs file.

Records are always returned in sorted order.

__iter__()

Equivalent to zs_obj.search().

File attributes and metadata

ZS objects provide a number of read-only attributes that give general information about the ZS file:

class zs.ZS
metadata

A .zs file can contain arbitrary metadata in the form of a JSON-encoded dictionary. This attribute contains this metadata in unpacked form.

root_index_offset

The file offset of the root index block, as stored in the header.

root_index_length

The length of the root index block, as stored in the header.

total_file_length

The proper length of this file, as stored in the header.

codec

The compression codec used on this file, as a byte string.

data_sha256

A strong hash of the underlying data records contained in this file. If two files have the same value here, then they are guaranteed to represent exactly the same data (i.e., return the same records to the same queries), though they might be stored using different compression algorithms, have different metadata, etc.

root_index_level

The level of the root index.

Starting from scratch, finding an arbitrary record in a ZS file requires that we fetch the header, fetch the root block, and then fetch this many blocks to traverse the index tree. So that’s a total of root_index_level + 2 fetches. (On local disk, each “fetch” is a disk seek; over HTTP, each “fetch” is a round-trip to the server.) For later queries on the same ZS object, at least the header and root will be cached, and (if you’re lucky) other blocks may be as well.
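
For example, a rough sketch of inspecting these attributes on the Quickstart file (the values shown in the comments are illustrative, not guaranteed):

from zs import ZS

with ZS("example/tiny-4grams.zs") as z:
    print(z.codec)             # e.g. b"lzma"
    print(z.root_index_level)  # a cold lookup costs this + 2 fetches
    print(z.metadata)          # a dict decoded from the stored JSON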

Fast bulk operations

If you want to perform some computation on many records (e.g., all the records in your file), then these functions are the most efficient way to do that.

class zs.ZS
block_map(fn, start=None, stop=None, prefix=None, args=(), kwargs={})

Apply a given function – in parallel – to records matching a given query. This method is lazy – if you don’t iterate over the results, then fn might not be called on all of the matching records.

Using this method (or its friend, block_exec()) is the best way to perform large bulk operations on ZS files.

The way to think about how it works is, first we find all records matching the given query:

matches = zs_obj.search(start=start, stop=stop, prefix=prefix)

and then we divide the resulting list of records up into arbitrarily sized chunks, and for each chunk we call the given function, and yield the result:

import itertools
while True:
    chunk = list(itertools.islice(matches, chunk_size))  # chunk_size is arbitrary
    if not chunk:
        break
    yield fn(chunk, *args, **kwargs)

But, there is a trick: in fact many copies of the function are run in parallel in different worker processes, and then the results are passed back to the main process for you to do whatever you want with. (Think “poor-man’s map-reduce”.)

This means that your fn, args, kwargs, and return values must all be pickleable. In particular, fn probably has to be either a global function in a named module, or else an instance of a globally defined class (in a named module) that has a __call__ method. (Sorry, I didn’t make the rules. Feel free to submit patches to use a more featureful serialization library like ‘dill’, especially if you can demonstrate that they don’t add too much overhead.)

This will be most efficient if fn performs non-trivial work, and especially if it avoids returning large or complicated structures – after all, the whole idea is that the code that’s looping over the results from block_map() should have less work to do than it would if it were just calling search() directly.

If you manage to take this to the extreme where you have nothing to return from block_map() (maybe your fn is writing to a database or something), then you can use block_exec() instead to save a bit of boilerplate.

If you pass parallelism=0 when creating your ZS object, then this method will perform all work within the main process. This makes debugging a lot easier, because it will let you get real backtraces if (when) your fn crashes.
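
As a concrete sketch (count_chunk is a hypothetical helper; it lives at module level so that it can be pickled):

from zs import ZS

def count_chunk(records):
    # Runs in a worker process; returns one small integer per chunk.
    return len(records)

with ZS("example/tiny-4grams.zs") as z:
    total = sum(z.block_map(count_chunk, prefix=b"not done "))
print(total)  # number of records matching the prefix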

block_exec(fn, start=None, stop=None, prefix=None, args=(), kwargs={})

Eager version of block_map().

This is equivalent to calling block_map(), iterating over the results, and throwing them all away.
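
For instance, a hedged sketch where fn produces only side effects (here just a progress message; a real fn might write to a database):

import os
from zs import ZS

def log_chunk(records):
    # Runs in a worker process; its return value is discarded.
    print("pid %d processed %d records" % (os.getpid(), len(records)))

with ZS("example/tiny-4grams.zs") as z:
    z.block_exec(log_chunk, prefix=b"not done ")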

High-level operations

class zs.ZS
dump(out_file, start=None, stop=None, prefix=None, terminator=b'\n', length_prefixed=None)

Decompress a given range of the .zs file to another file. This is performed in the most efficient available way.

Parameters:
  • terminator (byte string) – A terminator appended to the end of each record. Default is a newline. (Ignored if length_prefixed is given.)
  • length_prefixed – If given, records are output in a length-prefixed format, and terminator is ignored. Valid values are the strings "uleb128" or "u64le", or None.

See search() for the definition of start, stop, and prefix.

On Python 3, out_file must be opened in binary mode.

For a convenient command-line interface to this method, see zs dump.
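
For example, a minimal sketch ("out.txt" is an arbitrary destination):

from zs import ZS

with ZS("example/tiny-4grams.zs") as z:
    with open("out.txt", "wb") as out:  # binary mode, as required on Python 3
        z.dump(out, prefix=b"not done extensive ")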

validate()

Validate this .zs file for correctness.

This method does an exhaustive check of the current file, to validate it for self-consistency and compliance with the ZS specification. It should catch all cases of disk corruption (with high probability), and all cases of incorrectly constructed files.

This reads and decompresses the entire file, so it may take some time.

For a convenient command-line interface to this method, see zs validate.
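
For example (assuming – as the Error reporting section above suggests – that problems are reported by raising ZSCorrupt):

from zs import ZS, ZSCorrupt

with ZS("example/tiny-4grams.zs") as z:
    try:
        z.validate()
    except ZSCorrupt:
        print("file is damaged")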

Writing

In case you want a little more control over ZS file writing than you can get with the zs make command-line utility (see zs make), you can also access the underlying ZS-writing code directly from Python by instantiating a ZSWriter object.

class zs.ZSWriter(path, metadata, branching_factor, parallelism='guess', codec='lzma', codec_kwargs={}, show_spinner=True, include_default_metadata=True)
add_data_block(records)

Append the given set of records to the ZS file as a single data block.

(See On-disk layout of ZS files for details on what a data block is.)

Parameters:
  • records – A list of byte strings giving the contents of each record.

add_file_contents(file_handle, approx_block_size, terminator=b'\n', length_prefixed=None)

Split the contents of file_handle into records, and write them to the ZS file.

The arguments determine how the contents of the file are divided into records and blocks.

Parameters:
  • file_handle – A file-like object whose contents are read. This file is always closed.
  • approx_block_size – The approximate size of each data block, in bytes, before compression is applied.
  • terminator – A byte string containing a terminator appended to the end of each record. Default is a newline.
  • length_prefixed – If given, records are output in a length-prefixed format, and terminator is ignored. Valid values are the strings "uleb128" or "u64le", or None.

finish()

Declare this file finished.

This method writes out the root block, updates the header, etc.

Importantly, we do not write out the correct magic number until this method completes, so no ZS reader will be willing to read your file until this is called (see Magic number).

Do not call this method unless you are sure you have added the right records. (In particular, you definitely don’t want to call this from a finally block, or automatically from a with block context manager.)

Calls close().

close()

Close the file and terminate all background processing.

Further operations on this ZSWriter object will raise an error.

If you call this method before calling finish(), then you will not have a working ZS file.

This object can be used as a context manager in a with block, in which case close() will be called automatically, but finish() will not be.

closed

Boolean attribute indicating whether this ZSWriter is closed.
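
Putting the pieces together, here is a minimal writing sketch (the path, metadata, branching_factor, and approx_block_size values are arbitrary choices, and the records are assumed to already be in sorted order):

import io
from zs import ZSWriter

zw = ZSWriter("demo.zs", metadata={"corpus": "demo"}, branching_factor=1024)
records = io.BytesIO(b"not done fast ,\t52\n"
                     b"not done fast enough\t71\n")
zw.add_file_contents(records, approx_block_size=131072)
# Declare the file finished only once we know every record was added;
# finish() writes the root block and the magic number, then calls close().
zw.finish()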