The Python Quants

bcolz – HPC data storage and retrieval with Python

Dr. Yves J. Hilpisch

The Python Quants GmbH

analytics@pythonquants.com

www.pythonquants.com

bcolz is a columnar data store for fast data storage and retrieval with built-in high performance compression. It supports both in-memory and out-of-memory storage and operations. Cf. http://bcolz.blosc.org/.
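
As a quick orientation: bcolz provides two main containers, carray for compressed, chunked arrays with a NumPy-like interface and ctable for columnar tables whose columns are carray objects. A minimal sketch, assuming a standard bcolz installation:

import numpy as np
import bcolz

# carray: compressed, chunked container with a NumPy-like interface
ca = bcolz.carray(np.arange(10))

# ctable: columnar table whose columns are stored as carray objects
ct = bcolz.ctable([np.arange(10), np.arange(10) ** 2], names=['f0', 'f1'])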

In [1]:
import bcolz

ctable Example

The first example is based on the ctable class for data in table format. The example data set is a bit more than 1 GB in size (uncompressed).

In [2]:
N = 100000 * 1000
print N
100000000

In-Memory Storage

We first generate an in-memory object using high compression. Since we work with integers, good compression ratios are to be expected.
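
Compression behavior is controlled via a cparams object; a brief sketch of its main parameters (codecs other than the default 'blosclz', such as 'lz4' or 'zlib', are only available if Blosc was built with them):

# compression level (0-9), byte shuffle filter, and codec name
cp = bcolz.cparams(clevel=9, shuffle=True, cname='blosclz')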

In [3]:
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in xrange(N)),
                    dtype="i4, i8",
                    count=N,
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 25.3 s, sys: 176 ms, total: 25.5 s
Wall time: 24 s

It takes about 24 sec to generate the ctable object from a generator via the fromiter method. The in-memory size is only about 150 MB, which translates into a compression ratio of 7.45.

In [4]:
ct
Out[4]:
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 1.12 GB; cbytes: 153.51 MB; ratio: 7.45
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009)
 (99999998, 9999999600000004) (99999999, 9999999800000001)]
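
The reported ratio can be verified from the nbytes (uncompressed) and cbytes (compressed) attributes:

# should print roughly 7.45
print round(ct.nbytes / float(ct.cbytes), 2)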

You can now implement fast numerical operations on this data object (note that the output is a carray object).

In [5]:
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.57 s, sys: 679 ms, total: 5.24 s
Wall time: 1e+03 ms

Out[5]:
carray((100000000,), float64)
  nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,   1.37491943e+09
   1.57491943e+09   1.77491942e+09]
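
Beyond eval, ctable objects also support fast queries over their columns; a minimal sketch, assuming the where method as documented for bcolz:

# iterate over the first three rows satisfying the condition
for row in ct.where('(f0 > 5) & (f1 < 100)', limit=3):
    print row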

Disk-Based Storage

The same tasks can be implemented with disk-based storage. To this end, you only need to specify the rootdir parameter in addition.

In [6]:
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in xrange(N)),
                    dtype="i4, i8",
                    count=N, rootdir='ct',
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 25.1 s, sys: 299 ms, total: 25.4 s
Wall time: 32.7 s

At about 33 sec, the generation takes a bit longer since the data is written to disk; everything else, especially the object handling, remains the same.

In [7]:
ct
Out[7]:
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 1.12 GB; cbytes: 153.51 MB; ratio: 7.45
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := 'ct'
[(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009)
 (99999998, 9999999600000004) (99999999, 9999999800000001)]
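
Since the data now lives on disk under rootdir, the object can be re-opened in a later session without regenerating it; a minimal sketch, assuming bcolz.open as documented:

# re-attach to the on-disk ctable (mode 'a' allows appending)
ct2 = bcolz.open(rootdir='ct', mode='a')
print len(ct2)  # 100000000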

The numerical operations work in the same fashion and hardly take any longer, thanks to native multi-threading and optimized caching.
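
The number of threads used can be influenced globally; a hedged sketch, assuming bcolz.set_nthreads as documented (the call returns the previous setting):

# use 4 threads for (de)compression; returns the previous value
prev_nthreads = bcolz.set_nthreads(4)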

In [8]:
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.25 s, sys: 611 ms, total: 4.87 s
Wall time: 955 ms

Out[8]:
carray((100000000,), float64)
  nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,   1.37491943e+09
   1.57491943e+09   1.77491942e+09]

Let us finally verify the system disk usage.

In [9]:
!du -hs ct
# system disk usage
158M	ct

In [10]:
!rm -r ct

carray Example

This example is about mid-sized data that, in general, does not fit into memory without compression.

In [11]:
import numpy as np

As a basis, we generate a NumPy ndarray object of 32 MB in size.

In [12]:
n = 2000
a = np.arange(n * n).reshape(n, n) 
a.nbytes
Out[12]:
32000000

In-Memory Storage

Let us first work in-memory again. Our carray object ends up holding 4,001 versions of the ndarray object (the initial one plus 4,000 appended copies).

In [13]:
%%time
it = 4000
ca = bcolz.carray(a, cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 24.4 s, sys: 252 ms, total: 24.6 s
Wall time: 24.6 s

The in-memory generation of the object takes about 25 sec. The carray object stores almost 120 GB worth of data in less than 1 GB of memory, for a compression ratio of more than 130.

In [14]:
ca
Out[14]:
carray((8002000, 2000), int64)
  nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[      0       1       2 ...,    1997    1998    1999]
 [   2000    2001    2002 ...,    3997    3998    3999]
 [   4000    4001    4002 ...,    5997    5998    5999]
 ..., 
 [3994000 3994001 3994002 ..., 3995997 3995998 3995999]
 [3996000 3996001 3996002 ..., 3997997 3997998 3997999]
 [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]
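
Compression operates chunk-wise; the number of rows per compressed chunk can be inspected via the chunklen attribute:

# rows stored per compressed chunk
print ca.chunklen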

Let us implement the evaluation of a numerical expression on this data set. The syntax and handling are the same as with NumPy ndarray objects.

In [15]:
%time ca[:5000] ** 2 + np.sqrt(ca[10000:15000])
CPU times: user 96 ms, sys: 164 ms, total: 260 ms
Wall time: 260 ms

Out[15]:
array([[  0.00000000e+00,   2.00000000e+00,   5.41421356e+00, ...,
          3.98805369e+06,   3.99204870e+06,   3.99604571e+06],
       [  4.00004472e+06,   4.00404573e+06,   4.00804874e+06, ...,
          1.59760722e+07,   1.59840672e+07,   1.59920642e+07],
       [  1.60000632e+07,   1.60080643e+07,   1.60160673e+07, ...,
          3.59640864e+07,   3.59760814e+07,   3.59880785e+07],
       ..., 
       [  3.97603600e+12,   3.97603999e+12,   3.97604398e+12, ...,
          3.98400403e+12,   3.98400802e+12,   3.98401201e+12],
       [  3.98401600e+12,   3.98401999e+12,   3.98402399e+12, ...,
          3.99199201e+12,   3.99199601e+12,   3.99200001e+12],
       [  3.99200400e+12,   3.99200800e+12,   3.99201199e+12, ...,
          3.99998800e+12,   3.99999200e+12,   3.99999600e+12]])

Another approach is to use the eval function of bcolz.

In [16]:
x = ca[:10000]  # 10,000 rows as subset

In [17]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(clevel=9))
CPU times: user 615 ms, sys: 29 ms, total: 644 ms
Wall time: 189 ms

Out[17]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 39.08 MB; ratio: 3.90
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]
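
bcolz.eval accepts further keyword arguments; a hedged sketch, assuming the out_flavor parameter from the bcolz documentation, which returns a plain NumPy ndarray instead of a carray:

# same expression, but with a NumPy ndarray as the result object
res = bcolz.eval('x ** 2 + sqrt(x)', out_flavor='numpy')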

Disk-Based Storage

Now for disk-based storage of multiple versions of the array object. We append the object 4,000 times to a single disk-based carray object.

In [18]:
%%time
it = 4000
ca = bcolz.carray(a, rootdir='ca',
                  cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 28.6 s, sys: 5.09 s, total: 33.7 s
Wall time: 47.6 s

It takes less than a minute to compress and store almost 120 GB worth of data on disk. The compression ratio in this case is again above 130.

In [19]:
ca
Out[19]:
carray((8002000, 2000), int64)
  nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := 'ca'
[[      0       1       2 ...,    1997    1998    1999]
 [   2000    2001    2002 ...,    3997    3998    3999]
 [   4000    4001    4002 ...,    5997    5998    5999]
 ..., 
 [3994000 3994001 3994002 ..., 3995997 3995998 3995999]
 [3996000 3996001 3996002 ..., 3997997 3997998 3997999]
 [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]
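
For persistent objects it can make sense to flush pending data explicitly and to re-attach later; a minimal sketch, assuming the flush method of disk-based carray objects and bcolz.open:

ca.flush()  # make sure all appended chunks are written to disk
ca2 = bcolz.open(rootdir='ca')  # re-attach in a later session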

Simple numerical operations are easy to implement.

In [20]:
%time np.sum(ca[:1000] + ca[4000:5000])
CPU times: user 29 ms, sys: 6 ms, total: 35 ms
Wall time: 243 ms

Out[20]:
3999998000000
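
To aggregate over the full object without materializing it in memory, one can also work in a chunk-wise fashion; a minimal sketch, assuming plain slicing in a loop (the step size is an arbitrary choice):

# chunk-wise sum over the full data set
total = 0
step = 10000  # rows per chunk, about 160 MB decompressed
for start in xrange(0, ca.shape[0], step):
    total += ca[start:start + step].sum()
print total

Note that carray itself also provides a sum method that should accomplish the same without manual chunking.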

Let us try the previous, mathematically more demanding operation, again with a subset of the data.

In [21]:
x = ca[:10000]  # 10,000 rows as subset

First, with an in-memory carray results object.

In [22]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9))
CPU times: user 582 ms, sys: 18 ms, total: 600 ms
Wall time: 262 ms

Out[22]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 39.08 MB; ratio: 3.90
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Second, with an on-disk results object. The time difference is not that large.

In [23]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9), rootdir='out')
CPU times: user 898 ms, sys: 74 ms, total: 972 ms
Wall time: 1.46 s

Out[23]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 39.08 MB; ratio: 3.90
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := 'out'
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Finally, we verify system disk usage.

In [24]:
!du -hs ca
# system disk usage
985M	ca

In [25]:
!du -hs out
40M	out

In [26]:
!rm -r ca
!rm -r out