The Python Quants

TsTables – High Frequency Times Series Data with PyTables

Dr. Yves J. Hilpisch

The Python Quants GmbH

analytics@pythonquants.com

www.pythonquants.com

TsTables is a Python library by Andy Fiedler built on top of the popular PyTables HDF5 database library. It is meant to handle large amounts of high frequency time series data in append once, retrieve many times scenarios (cf. Gihub page). The focus lies on retrieving chunks of data from large data sets as quickly as possible.

Sample Time Series Data

In [1]:
import numpy as np
import pandas as pd
import tables as tb
import tstables as tstb
from datetime import datetime 
%matplotlib inline

Let us generate a decent amount of sample data points.

In [2]:
no = 30000000
co = 3
dt = 1. / (12 * 30 * 24 * 60)
vol = 0.2

We generate one second data.

In [3]:
dr = pd.date_range('2014-1-1', periods=no, freq='1s')
In [4]:
dr
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00, ..., 2014-12-14 05:19:59]
Length: 30000000, Freq: S, Timezone: None

In memory generation is quite quick.

In [5]:
%%time
da = 100 * np.exp(np.cumsum(-0.5 * vol ** 2 * dt +
        vol * np.sqrt(dt) * np.random.standard_normal((no, co)), axis=0))
da[0] = 100
CPU times: user 6.07 s, sys: 1.64 s, total: 7.72 s
Wall time: 7.72 s

In [6]:
df = pd.DataFrame(da, index=dr, columns=['ts1', 'ts2', 'ts3'])
In [7]:
df.count()
Out[7]:
ts1    30000000
ts2    30000000
ts3    30000000
dtype: int64

The starting values of the three time series.

In [8]:
df.head()
Out[8]:
ts1 ts2 ts3
2014-01-01 00:00:00 100.000000 100.000000 100.000000
2014-01-01 00:00:01 99.961414 99.978325 100.013036
2014-01-01 00:00:02 99.974208 99.969026 99.999040
2014-01-01 00:00:03 99.971811 99.941457 99.962136
2014-01-01 00:00:04 99.962411 99.927141 99.974794

And a plot of the time series data (every 100000th point).

In [9]:
df[::100000].plot()
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1081883d0>

Storage and Retrieval with TsTables

To store the time series data in a PyTables table we first define the table structure.

In [10]:
class TS(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0)
    ts1 = tb.Float64Col(pos=1)
    ts2 = tb.Float64Col(pos=2)
    ts3 = tb.Float64Col(pos=3)

Second, open a database file and create the table object.

In [11]:
h5 = tb.open_file('ts.h5','w')

TsTables adds a new function create_ts to PyTables.

In [12]:
ts = h5.create_ts('/','TS', TS)

Third, we append the time series data to the table object.

In [13]:
%time ts.append(df)
CPU times: user 4.64 s, sys: 9.72 s, total: 14.4 s
Wall time: 26.3 s

In [14]:
ls -n *.h5
-rw-r--r--  1 501  20  935432352 Feb 21 16:32 ts.h5

The strength of TsTables lies in retrieving chunks of time series data defined by a start date and an end date (which obviously is a typical case in finance, e.g. in backtesting strategies or risk management).

In [15]:
read_start_dt = datetime(2014, 2, 1, 0, 0)
read_end_dt = datetime(2014, 2, 5, 23, 59)

TsTables tries to make such an operation as fast as possible.

In [16]:
%time rows = ts.read_range(read_start_dt, read_end_dt)
CPU times: user 87.1 ms, sys: 103 ms, total: 190 ms
Wall time: 273 ms

Conveniently, the returned object is a pandas DataFrame.

In [17]:
rows.count()
Out[17]:
ts1    431941
ts2    431941
ts3    431941
dtype: int64
In [18]:
rows.head()
Out[18]:
ts1 ts2 ts3
2014-02-01 00:00:00 73.421261 31.897070 131.119853
2014-02-01 00:00:01 73.442255 31.886632 131.166441
2014-02-01 00:00:02 73.448322 31.878721 131.160483
2014-02-01 00:00:03 73.463187 31.890018 131.112523
2014-02-01 00:00:04 73.448316 31.898887 131.076204
In [19]:
rows[::100].plot()
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1087b9c90>
In [20]:
h5.close()
In [21]:
!rm ts.h5