The Python Quants

Python Big Data Platform

Brief Overview and Introduction

Dr. Yves J. Hilpisch

The Python Quants GmbH

analytics@pythonquants.com

www.pythonquants.com

General Aspects and Trials

The Python Quant Platform is developed and maintained by The Python Quants GmbH. It offers Web-/browser-based data and financial analytics for individuals, teams and organizations. Free registrations are possible under http://trial.quant-platform.com.

PQP Registration

You can freely choose your your_user_name and password. You can then login under http://analytics.quant-platform.com, using trial as company in combination with your_user_name and password.

PQP login

Please note that trial/test accounts are only for illustration purposes and they can be closed at any time (with all data, code, etc. be permanently deleted).

Please read also the Terms & Conditions as well as our Privacy Policy.

If you have questions about the platform or any troubles, you can reach us under platform@pythonquants.com.

Platform Components & Features

At the moment, the Python Quant Platform comprises the following components and features:

  • IPython Notebook: interactive data and financial analytics in the browser with full Python integration and much more (cf. IPython home page).
  • Anaconda Python Distribution: complete Python stack for financial, scientific and data analytics workflows/applications (cf. Anaconda page); you can easily switch between Python 2.7 and 3.4.
  • R Stack: for statistical analyses, integrated via rpy2 and IPython Notebook
  • DX Analytics: our library for advanced financial and derivatives analytics with Python based on Monte Carlo simulation.
  • File Manager: a GUI-based File Manager to upload, download, copy, remove, rename files on the platform.
  • Chat/Forum: there is a simple chat/forum application available via which you can share thoughts, documents and more.
  • Collaboration: the platform features user/group administration as well as file sharing via public folders.
  • Linux Server: the platform is powered by Linux servers to which you have full shell access.
  • Deployment: the platform is easily scalable since it is cloud-based and can also be easily deployed on your own servers (via Docker containers).

PQP Overview

IPython Notebook

In the left panel of the platform, you find the current working path indicated (in black) as well as the current folder and file structure (as links in purple). Note that in this panel only IPython Notebook files are displayed. Here you can navigate the current folder structure by clicking on a link. Clicking on the double points ".." brings you one level up in the structure. Clicking the refresh button right next to the double points updates the folder/file structure. Clicking on a file link opens the IPython Notebook file.

Basic Approach

You find a link to open a new notebook on top of the left panel. With IPython notebooks, like with this one, you can interactively code Python and do data/financial analytics.

In [1]:
print ("Hello Quant World.")
Hello Quant World.

In [2]:
# simple calculations
3 + 4 * 2
Out[2]:
11
In [3]:
# working with NumPy arrays
import numpy as np
rn = np.random.standard_normal(100)
rn[:10]
Out[3]:
array([ 1.02513028,  0.17631207,  1.24145549, -0.91062576, -0.95066727,
       -0.24643119, -1.78595937,  0.46583419, -0.04754132, -0.46144837])
In [4]:
# plotting
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(rn.cumsum())
plt.grid(True)

IPython Notebook as a system shell.

In [5]:
!ls -n
total 44
-rwxr-x--- 1 1141 8 31230 Oct  1 13:00 Python_Big_Data_Platform.ipynb
-rwxr-x--- 1 1141 8  2035 Oct  1 12:40 dx_example.py
-rw-r--r-- 1 1141 8  4808 Sep 26 15:23 perf_tests.ipynb

In [6]:
!mkdir test
In [7]:
!ls
Python_Big_Data_Platform.ipynb	dx_example.py  perf_tests.ipynb  test

In [8]:
!rm -r test

IPython Notebook as a media integrator. Here a talk by Yves about "Interactive Analytics of Large Financial Data Sets" with Python & IPython.

In [9]:
from IPython.display import YouTubeVideo
In [10]:
YouTubeVideo(id="XyqlduIcc2g", width=700, height=400)
Out[10]:

Efficient Financial Analytics

Combining the pandas library with IPython Notebook makes for a powerful financial analytics environment.

In [11]:
import pandas as pd
import pandas.io.data as web
In [12]:
AAPL = web.DataReader('AAPL', data_source='google')
  # reads data from Google Finance
AAPL['42d'] = pd.rolling_mean(AAPL['Close'], 42)
AAPL['252d'] = pd.rolling_mean(AAPL['Close'], 252)
  # 42d and 252d trends
In [13]:
AAPL[['Close', '42d', '252d']].plot(figsize=(10, 5))
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6cdf3417d0>

DX Analytics

DX Analytics is a Python library for advanced financial and derivatives analytics written by The Python Quants. It is particularly suited to model multi-risk derivatives and to do a consistent valuation of portfolios of complex derivatives. It mainly uses Monte Carlo simulation since it is the only numerical method capable of valuing and risk managing complex, multi-risk derivatives books.

An example with an European maximum call option on two underlyings.

In [14]:
import dx
%run dx_example.py
  # sets up market environments
  # and defines derivative instrument
In [15]:
max_call.payoff_func
  # payoff of a maximum call option
  # on two underlyings (European exercise)
Out[15]:
"np.maximum(np.maximum(maturity_value['gbm'], maturity_value['jd']) - 34., 0)"
In [16]:
max_call.vega('jd')
  # numerical Vega with respect
  # to one risk factor
Out[16]:
5.9487999999999985

We are going to generate a Vega surface for one risk factor with respect to the initial values of both risk factors.

In [17]:
asset_1 = np.arange(28., 46.1, 2.)
asset_2 = asset_1
a_1, a_2 = np.meshgrid(asset_1, asset_2)
value = np.zeros_like(a_1)
In [18]:
%%time
vega_gbm = np.zeros_like(a_1)
for i in range(np.shape(vega_gbm)[0]):
    for j in range(np.shape(vega_gbm)[1]):
        max_call.update('gbm', initial_value=a_1[i, j])
        max_call.update('jd', initial_value=a_2[i, j])
        vega_gbm[i, j] = max_call.vega('gbm')
CPU times: user 3.91 s, sys: 2 ms, total: 3.91 s
Wall time: 3.91 s

In [19]:
dx.plot_greeks_3d([a_1, a_2, vega_gbm], ['gbm', 'jd', 'vega gbm'])
  # Vega surface plot

Parallel Processing

Monte Carlo simulation is a computationally demanding task that nowadays in the financial industry is implemented generally on a large scale (eg for Value-at-Risk or Credit-Value-Adjustment calculations).

In [20]:
import math

This function simulates a geometric Brownian motion.

In [21]:
def simulate_geometric_brownian_motion(p):
    M, I = p
      # time steps, paths
    S0 = 100; r = 0.05; sigma = 0.2; T = 1.0
      # model parameters
    dt = T / M
    paths = np.zeros((M + 1, I))
    paths[0] = S0
    for t in range(1, M + 1):
        paths[t] = paths[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt +
                    sigma * math.sqrt(dt) * np.random.standard_normal(I))
    return paths

An example simulation with the function.

In [22]:
%time paths = simulate_geometric_brownian_motion((50, 100000))
  # example simulation
CPU times: user 294 ms, sys: 1 ms, total: 295 ms
Wall time: 295 ms

In [23]:
plt.plot(paths[:, :10]); plt.grid()

Now using the multiprocessing module of Python.

In [24]:
from time import time
import multiprocessing as mp
In [25]:
I = 7500  # number of paths
M = 50  # number of time steps
t = 100 # number of tasks/simulations
In [26]:
# running with a max of 8 cores
times = []
for w in range(1, 9):
    t0 = time()
    pool = mp.Pool(processes=w)
    result = pool.map(simulate_geometric_brownian_motion, t * [(M, I), ])
    times.append(time() - t0)

And the performance results visualized.

In [27]:
plt.plot(range(1, 9), times)
plt.plot(range(1, 9), times, 'ro')
plt.grid(True)
plt.xlabel('number of threads')
plt.ylabel('time in seconds')
plt.title('%d Monte Carlo simulations' % t)
Out[27]:
<matplotlib.text.Text at 0x7f6ccc07e0d0>

PQP Overview

Statistics with R

We analyze the statistical correlation between the EURO STOXX 50 stock index and the VSTOXX volatility index.

First the EURO STOXX 50 data.

In [28]:
import pandas as pd
cols = ['Date', 'SX5P', 'SX5E', 'SXXP', 'SXXE',
        'SXXF', 'SXXA', 'DK5F', 'DKXF', 'DEL']
es_url = 'http://www.stoxx.com/download/historical_values/hbrbcpe.txt'
try:
    es = pd.read_csv(es_url,  # filename
                     header=None,  # ignore column names
                     index_col=0,  # index column (dates)
                     parse_dates=True,  # parse these dates
                     dayfirst=True,  # format of dates
                     skiprows=4,  # ignore these rows
                     sep=';',  # data separator
                     names=cols)  # use these column names

    # deleting the helper column
    del es['DEL']
except:
    # read stored data if there is no Internet connection
    es = pd.HDFStore('data/SX5E.h5', 'r')['SX5E']

Second, the VSTOXX data.

In [29]:
vs_url = 'http://www.stoxx.com/download/historical_values/h_vstoxx.txt'

try:
    vs = pd.read_csv(vs_url,  # filename
                     index_col=0,  # index column (dates)
                     parse_dates=True,  # parse date information
                     dayfirst=True, # day before month
                     header=2)  # header/column names
except:
    # read stored data if there is no Internet connection
    vs = pd.HDFStore('data/V2TX.h5', 'r')['V2TX']

Bridging to R from within IPython Notebook and pushing Python data to the R run-time.

In [30]:
%load_ext rpy2.ipython
In [31]:
import numpy as np
# log returns for the major indices' time series data
datv = pd.DataFrame({'SX5E' : es['SX5E'], 'V2TX': vs['V2TX']}).dropna()
rets = np.log(datv / datv.shift(1)).dropna()
ES = rets['SX5E'].values
VS = rets['V2TX'].values
In [32]:
%Rpush ES VS

Plotting with R in IPython Notebook.

In [33]:
%R plot(ES, VS, pch=19, col='blue'); grid(); title("Log returns ES50 & VSTOXX")

Linear regression with R.

In [34]:
%R c = coef(lm(VS~ES))
Out[34]:
<FloatVector - Python:0x10b3d2b90 / R:0x101be3b50>
[-0.000069, -2.753413]
In [35]:
%R plot(ES, VS, pch=19, col='blue'); grid(); abline(c, col='red', lwd=5)

Pulling data from R to Python

In [36]:
%Rpull c
In [37]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(9, 6))
plt.plot(ES, VS, 'b.')
plt.plot(ES, c[0] + c[1] * ES, 'r', lw=3)
plt.grid(); plt.xlabel('ES'); plt.ylabel('VS')
Out[37]:
<matplotlib.text.Text at 0x10ca23610>

Bayesian Statistics

The example we use is a "classical" pair traiding strategy, namely with gold and stocks of gold mining companies both represented by ETFs with symbols

We use zipline and PyMC3 for the analysis.

In [38]:
import numpy as np
import pymc as pm
import zipline
import pytz
import datetime as dt

First, we load the data from the Web.

In [39]:
try:
    datg = zipline.data.load_from_yahoo(stocks=['GLD', 'GDX'], 
             end=dt.datetime(2014, 3, 15, 0, 0, 0, 0, pytz.utc)).dropna()
except:
    datg = pd.HDFStore('data/gold.h5', 'r')['datg']
%matplotlib inline
datg.plot(figsize=(9, 5))
GLD
GDX

Out[39]:
<matplotlib.axes.AxesSubplot at 0x113bd51d0>

A scatter plot of the value pairs over time and a simple linear regression.

In [40]:
import matplotlib as mpl; import matplotlib.pyplot as plt
mpl_dates = mpl.dates.date2num(data.index)
plt.figure(figsize=(9, 5))
plt.scatter(datg['GDX'], datg['GLD'], c=mpl_dates, marker='o')
reg = np.polyfit(datg['GDX'], datg['GLD'], 1)
plt.plot(datg['GDX'], np.polyval(reg, datg['GDX']), 'r-', lw=3)
plt.grid(True); plt.xlabel('GDX'); plt.ylabel('GLD')
plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
             format=mpl.dates.DateFormatter('%d %b %y'))
Out[40]:
<matplotlib.colorbar.Colorbar instance at 0x7f941e855638>

We implement a Bayesian random walk model (I).

In [41]:
model_randomwalk = pm.Model()
with model_randomwalk:
    sigma_alpha, log_sigma_alpha = \
            model_randomwalk.TransformedVar('sigma_alpha', 
                            pm.Exponential.dist(1. / .02, testval=.1), 
                            pm.logtransform)
    sigma_beta, log_sigma_beta = \
            model_randomwalk.TransformedVar('sigma_beta', 
                            pm.Exponential.dist(1. / .02, testval=.1),
                            pm.logtransform)

We implement a Bayesian random walk model (II).

In [42]:
from pymc.distributions.timeseries import GaussianRandomWalk

# take samples of 50 elements each
subsample_alpha = 50
subsample_beta = 50

with model_randomwalk:
    alpha = GaussianRandomWalk('alpha', sigma_alpha**-2, 
                               shape=len(datg) / subsample_alpha)
    beta = GaussianRandomWalk('beta', sigma_beta**-2, 
                              shape=len(datg) / subsample_beta)

    alpha_r = np.repeat(alpha, subsample_alpha)
    beta_r = np.repeat(beta, subsample_beta)

We implement a Bayesian random walk model (III).

In [43]:
with model_randomwalk:
    # define regression
    regression = alpha_r + beta_r * datg.GDX.values[:1950]
    
    sd = pm.Uniform('sd', 0, 20)
    likelihood = pm.Normal('GLD', mu=regression, 
                sd=sd, observed=datg.GLD.values[:1950])

We implement a Bayesian random walk model (IV).

In [44]:
import warnings; warnings.simplefilter('ignore')
import scipy.optimize as sco
with model_randomwalk:
    # first optimize random walk
    start = pm.find_MAP(vars=[alpha, beta], fmin=sco.fmin_l_bfgs_b)
    
    # sampling
    step = pm.NUTS(scaling=start)
    trace_rw = pm.sample(100, step, start=start, progressbar=False)

The plot of the regression coefficients over time.

In [45]:
part_dates = np.linspace(min(mpl_dates), max(mpl_dates), 39)
fig, ax1 = plt.subplots(figsize=(10, 5))
plt.plot(part_dates, np.mean(trace_rw['alpha'], axis=0),
         'b', lw=2.5, label='alpha')
for i in range(45, 55):
    plt.plot(part_dates, trace_rw['alpha'][i], 'b-.', lw=0.75)
plt.xlabel('date'); plt.ylabel('alpha'); plt.axis('tight')
plt.grid(True); plt.legend(loc=2)
ax1.xaxis.set_major_formatter(mpl.dates.DateFormatter('%d %b %y') )
ax2 = ax1.twinx()
plt.plot(part_dates, np.mean(trace_rw['beta'], axis=0),
         'r', lw=2.5, label='beta')
for i in range(45, 55):
    plt.plot(part_dates, trace_rw['beta'][i], 'r-.', lw=0.75)
plt.ylabel('beta'); plt.legend(loc=4); fig.autofmt_xdate()

The plot of the regression lines over time.

In [46]:
plt.figure(figsize=(10, 5))
plt.scatter(datg['GDX'], datg['GLD'], c=mpl_dates, marker='o')
plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
             format=mpl.dates.DateFormatter('%d %b %y'))
plt.grid(True); plt.xlabel('GDX'); plt.ylabel('GLD')
x = np.linspace(min(datg['GDX']), max(datg['GDX'])) 
for i in range(39):
    alpha_rw = np.mean(trace_rw['alpha'].T[i])
    beta_rw = np.mean(trace_rw['beta'].T[i]) 
    plt.plot(x, alpha_rw + beta_rw * x, color=plt.cm.jet(256 * i / 39))

Vector Auto Regression

Let us apply Multi-Variate Auto Regression to the financial time series data we have:

  • GLD
  • GDX
  • EURO STOXX 50
  • VSTOXX

Let us resample and join the data sets.

In [47]:
datv.index = datv.index.tz_localize(pytz.utc)
datf = datg.join(datv, how='left')
datf = datf.resample('1M', how='last')
  # resampling to monthly data
# datf = datf / datf.ix[0] * 100
  # uncomment for normalized starting values
# datf = np.log(datf / datf.shift(1)).dropna()
  # uncomment for log return based analysis

The starting values of the time series data we use.

In [48]:
datf.head()
Out[48]:
GDX GLD SX5E V2TX
Date
2006-05-31 00:00:00+00:00 36.91 64.23 3637.17 23.0529
2006-06-30 00:00:00+00:00 36.77 61.23 3648.92 18.3282
2006-07-31 00:00:00+00:00 36.82 63.16 3691.87 18.5171
2006-08-31 00:00:00+00:00 38.56 62.29 3808.70 16.1689
2006-09-30 00:00:00+00:00 33.87 59.47 3899.41 16.2455

We use the VAR class of the statsmodels library.

In [49]:
from statsmodels.tsa.api import VAR
model = VAR(datf)
In [50]:
lags = 5
  # number of lags used for fitting
results = model.fit(maxlags=lags, ic='bic')
  # model fitted to data

The summary statistics of the model fit.

In [51]:
results.summary()
Out[51]:
  Summary of Regression Results  
=================================
Model:                        VAR
Method:                       OLS
Date:           Do, 02, Okt, 2014
Time:                    11:11:18
--------------------------------------------------------------------
No. of Equations:         4.00000    BIC:                    18.7332
Nobs:                     94.0000    HQIC:                   18.4106
Log likelihood:          -1368.55    FPE:                7.95948e+07
AIC:                      18.1921    Det(Omega_mle):     6.46927e+07
--------------------------------------------------------------------
Results for equation GDX
==========================================================================
             coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------
const           3.656812         7.741225            0.472           0.638
L1.GDX          0.951813         0.052654           18.077           0.000
L1.GLD         -0.019076         0.025755           -0.741           0.461
L1.SX5E        -0.000279         0.001442           -0.193           0.847
L1.V2TX         0.050521         0.076826            0.658           0.512
==========================================================================

Results for equation GLD
==========================================================================
             coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------
const           4.570986        13.017883            0.351           0.726
L1.GDX          0.030915         0.088544            0.349           0.728
L1.GLD          0.959570         0.043310           22.156           0.000
L1.SX5E        -0.000580         0.002425           -0.239           0.812
L1.V2TX         0.047100         0.129193            0.365           0.716
==========================================================================

Results for equation SX5E
==========================================================================
             coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------
const         673.405496       294.957168            2.283           0.025
L1.GDX          0.716665         2.006218            0.357           0.722
L1.GLD         -1.660332         0.981323           -1.692           0.094
L1.SX5E         0.872341         0.054954           15.874           0.000
L1.V2TX        -5.246971         2.927232           -1.792           0.076
==========================================================================

Results for equation V2TX
==========================================================================
             coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------
const           3.538398         9.677262            0.366           0.716
L1.GDX          0.005024         0.065822            0.076           0.939
L1.GLD          0.002697         0.032196            0.084           0.933
L1.SX5E         0.000171         0.001803            0.095           0.924
L1.V2TX         0.820594         0.096040            8.544           0.000
==========================================================================

Correlation matrix of residuals
             GDX       GLD      SX5E      V2TX
GDX     1.000000  0.813748  0.061648 -0.158729
GLD     0.813748  1.000000 -0.058480 -0.081772
SX5E    0.061648 -0.058480  1.000000 -0.789066
V2TX   -0.158729 -0.081772 -0.789066  1.000000


Historical data and forecasts.

In [52]:
results.plot_forecast(50, figsize=(8, 8), offset='M')
  # historical/input data and
  # forecasts given model fit