Data

class pymc3.data.Data(name, value, *, dims=None, export_index_as_coords=False)

Data container class that wraps the theano SharedVariable class and lets the model be aware of its inputs and outputs.

Parameters

name: str: The name for this variable
value: {List, np.ndarray, pd.Series, pd.DataFrame}: A value to associate with this variable
dims: {str, tuple of str}, optional, default=None: Dimension names of the random variables (as opposed to the shapes of these random variables). Use this when value is a pandas Series or DataFrame. The dims will then be the name of the Series / DataFrame’s columns. See ArviZ documentation for more information about dimensions and coordinates: https://arviz-devs.github.io/arviz/notebooks/Introduction.html
export_index_as_coords: bool, optional, default=False: If True, the Data container will try to infer what the coordinates should be if there is an index in value.

Examples

>>> import pymc3 as pm
>>> import numpy as np
>>> # We generate 10 datasets
>>> true_mu = [np.random.randn() for _ in range(10)]
>>> observed_data = [mu + np.random.randn(20) for mu in true_mu]

>>> with pm.Model() as model:
...     data = pm.Data('data', observed_data[0])
...     mu = pm.Normal('mu', 0, 10)
...     pm.Normal('y', mu=mu, sigma=1, observed=data)

>>> # Generate one trace for each dataset
>>> traces = []
>>> for data_vals in observed_data:
...     with model:
...         # Switch out the observed dataset
...         pm.set_data({'data': data_vals})
...         traces.append(pm.sample())

To set the value of the data container variable, check out pymc3.model.set_data().

For more information, take a look at this example notebook https://docs.pymc.io/notebooks/data_container.html

class pymc3.data.GeneratorAdapter(generator): Helper class that infers the data type of a generator by looking at its first item, while preserving the order of the resulting generator.

class pymc3.data.Minibatch(data, batch_size=128, dtype=None, broadcastable=None, name='Minibatch', random_seed=42, update_shared_f=None, in_memory_size=None)

Multidimensional minibatch that is a pure TensorVariable.

Parameters

data: np.ndarray: initial data
batch_size: int or List[int|tuple(size, random_seed)]: batch size for inference; the random seed is needed for the child random generators
dtype: str: cast data to a specific type
broadcastable: tuple[bool]: change the broadcastable pattern, which defaults to (False, ) * ndim
name: str: name for the tensor, defaults to “Minibatch”
random_seed: int: random seed that is used by default
update_shared_f: callable: returns an ndarray that will be carefully stored in the underlying shared variable; you can use it to change the source of minibatches programmatically
in_memory_size: int or List[int|slice|Ellipsis]: data size for storing in theano.shared

Notes

Below is a common use case of Minibatch with variational inference. Importantly, we need to make PyMC3 “aware” that a minibatch is being used in inference. Otherwise, we will get the wrong logp for the model. To do so, we need to pass the total_size parameter to the observed node, which correctly scales the density of the model logp that is affected by Minibatch. See more in the examples below.
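The effect of this scaling can be illustrated with plain NumPy. The sketch below is not PyMC3's internal code; it only shows the idea under simple assumptions (a Normal likelihood with fixed mu and sigma, and disjoint batches). The helper normal_logpdf and all variable names are made up for this illustration.

```python
import numpy as np


def normal_logpdf(y, mu, sigma):
    """Elementwise log density of Normal(mu, sigma) evaluated at y."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)


rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=100)  # the "full" dataset
mu, sigma = 0.5, 1.5                          # arbitrary parameter values
total_size, batch_size = y.shape[0], 10

# Log-likelihood of the full dataset
full_logp = normal_logpdf(y, mu, sigma).sum()

# Split the data into 10 disjoint batches of 10 points each
batches = y.reshape(-1, batch_size)

# Unscaled batch log-likelihoods are about 10 times too small on average
naive = normal_logpdf(batches, mu, sigma).sum(axis=1)

# Rescaling by total_size / batch_size recovers the full logp on average
scaled = (total_size / batch_size) * naive
```

Averaged over the batches, scaled matches full_logp, while naive underestimates it by the factor total_size / batch_size. This is the correction that total_size asks PyMC3 to apply.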

Examples

Suppose we have the following data:

>>> from pymc3 import Minibatch
>>> data = np.random.rand(100, 100)

If we want a 1d slice of size 10, we do:

>>> x = Minibatch(data, batch_size=10)

Note that your data is cast to floatX if it is not of integer type. You can still pass the dtype kwarg to Minibatch if you need more control.

If we want 10 sampled rows and 10 sampled columns, we can pass [(size, seed), (size, seed)]:

>>> x = Minibatch(data, batch_size=[(10, 42), (10, 42)], dtype='int32')
>>> assert str(x.dtype) == 'int32'

Or, more simply, we can pass [size, size] and use the default random seed of 42:

>>> x = Minibatch(data, batch_size=[10, 10])

In the above, x is a regular TensorVariable that supports any math operations:

>>> assert x.eval().shape == (10, 10)

You can pass the Minibatch x to your desired model:

>>> with pm.Model() as model:
...     mu = pm.Flat('mu')
...     sd = pm.HalfNormal('sd')
...     lik = pm.Normal('lik', mu, sd, observed=x, total_size=(100, 100))

Then you can perform regular variational inference out of the box:

>>> with model:
...     approx = pm.fit()

Important note: Minibatch has shared and minibatch attributes that you can use later:

>>> x.set_value(np.random.laplace(size=(100, 100)))

Minibatches will then be drawn from the new storage, because this call directly affects x.shared. A less convenient, but more explicit, way to achieve the same thing:

>>> x.shared.set_value(pm.floatX(np.random.laplace(size=(100, 100))))

The programmatic way to change the storage is as follows (partial is imported for simplicity):

>>> from functools import partial
>>> datagen = partial(np.random.laplace, size=(100, 100))
>>> x = Minibatch(datagen(), batch_size=10, update_shared_f=datagen)
>>> x.update_shared()
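How update_shared_f, update_shared() and set_value() relate can be pictured with a toy stand-in that uses a NumPy array in place of the theano shared variable. This is a sketch under that assumption, not the PyMC3 implementation; the class name ToyStorage is hypothetical.

```python
from functools import partial

import numpy as np


class ToyStorage:
    """Toy stand-in for the storage side of a minibatch (not PyMC3 code)."""

    def __init__(self, data, update_shared_f=None):
        # A plain ndarray plays the role of the shared variable here
        self.shared = np.asarray(data, dtype="float64")
        self.update_shared_f = update_shared_f

    def set_value(self, value):
        # Replace the stored data, casting to the storage dtype
        self.shared = np.asarray(value, dtype=self.shared.dtype)

    def update_shared(self):
        # Pull fresh data from the user-supplied callable and store it
        if self.update_shared_f is None:
            raise NotImplementedError("no update_shared_f was provided")
        self.set_value(self.update_shared_f())


rng = np.random.default_rng(0)
datagen = partial(rng.laplace, size=(100, 100))

store = ToyStorage(datagen(), update_shared_f=datagen)
before = store.shared.copy()
store.update_shared()  # storage now holds a fresh draw from datagen
```

In this picture, update_shared() is just a convenience that calls the generator and passes the result to set_value().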

To be more concrete about how we create a minibatch, here is a demo:

1. Create a shared variable:

>>> import theano
>>> shared = theano.shared(data)

2. Take a random slice of size 10:

>>> ridx = pm.tt_rng().uniform(size=(10,), low=0, high=data.shape[0]-1e-10).astype('int64')

3. Take the resulting slice:

>>> minibatch = shared[ridx]

That’s all. Now you can use this minibatch somewhere else. Note that the implementation does not require a fixed shape for the shared variable, so you may rely on this flexibility if needed.
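The same index trick can be tried in plain NumPy, with no theano graph involved. This is a sketch of the idea only: NumPy's generator replaces pm.tt_rng(), and the slice is computed eagerly rather than symbolically.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.random((100, 100))

# Draw 10 floats in [0, n) and truncate them to integer row indices.
# The small offset keeps the upper bound strictly below n, so the
# out-of-range index n is never produced.
n = data.shape[0]
ridx = rng.uniform(low=0, high=n - 1e-10, size=10).astype("int64")

# Fancy indexing selects the sampled rows; rows may repeat because
# the indices are drawn with replacement.
minibatch = data[ridx]
```

Here minibatch has shape (10, 100): ten randomly chosen rows, each with all 100 columns.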

Suppose you need to make some replacements in the graph, e.g. change the minibatch to testdata:

>>> node = x ** 2  # arbitrary expressions on minibatch `x`
>>> testdata = pm.floatX(np.random.laplace(size=(1000, 10)))

Then you should create a dict with replacements:

>>> replacements = {x: testdata}
>>> rnode = theano.clone(node, replacements)
>>> assert (testdata ** 2 == rnode.eval()).all()

To replace a minibatch with its shared variable, proceed in the same way. The minibatch variable is accessible through the minibatch attribute. Note that theano.clone returns a new node and leaves the original node unchanged. For example:

>>> replacements = {x.minibatch: x.shared}
>>> rnode = theano.clone(node, replacements)

For more complex slices, a little more code is needed, and it may look less obvious:

>>> moredata = np.random.rand(10, 20, 30, 40, 50)

The default total_size that can be passed to a PyMC3 random node is then (10, 20, 30, 40, 50), but in some cases it can be written less verbosely.

Advanced indexing, total_size = (10, Ellipsis, 50):

>>> x = Minibatch(moredata, [2, Ellipsis, 10])

We take the slice only for the first and last dimensions:

>>> assert x.eval().shape == (2, 20, 30, 40, 10)

Skipping a particular dimension, total_size = (10, None, 30):

>>> x = Minibatch(moredata, [2, None, 20])
>>> assert x.eval().shape == (2, 20, 20, 40, 50)

Mixing both of these together, total_size = (10, None, 30, Ellipsis, 50):

>>> x = Minibatch(moredata, [2, None, 20, Ellipsis, 10])
>>> assert x.eval().shape == (2, 20, 20, 40, 10)
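The shape rules in the three examples above can be summarized in a small pure-Python helper. The function minibatch_shape is hypothetical and is not part of PyMC3; it assumes that an int (or a (size, seed) tuple) resizes a dimension, None keeps a dimension unchanged, and a single Ellipsis stands for all dimensions not otherwise listed.

```python
def minibatch_shape(data_shape, batch_spec):
    """Hypothetical helper: shape of a minibatch drawn from `data_shape`."""
    if isinstance(batch_spec, int):
        batch_spec = [batch_spec]
    spec = list(batch_spec)
    if spec.count(Ellipsis) > 1:
        raise ValueError("at most one Ellipsis is allowed")

    # Split the spec around the Ellipsis, if there is one
    if Ellipsis in spec:
        i = spec.index(Ellipsis)
        head, tail = spec[:i], spec[i + 1:]
    else:
        head, tail = spec, []
    if len(head) + len(tail) > len(data_shape):
        raise ValueError("batch_spec has more entries than data dimensions")

    # Dimensions not mentioned in the spec are left untouched (None)
    middle = [None] * (len(data_shape) - len(head) - len(tail))

    out = []
    for dim, entry in zip(data_shape, head + middle + tail):
        if entry is None:
            out.append(dim)           # keep the full dimension
        elif isinstance(entry, tuple):
            out.append(entry[0])      # (size, seed): only size affects shape
        else:
            out.append(entry)         # plain batch size
    return tuple(out)


full = (10, 20, 30, 40, 50)
shape_ellipsis = minibatch_shape(full, [2, Ellipsis, 10])
shape_skip = minibatch_shape(full, [2, None, 20])
shape_mixed = minibatch_shape(full, [2, None, 20, Ellipsis, 10])
```

The three results reproduce the shapes asserted in the examples above.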

Attributes

shared: shared tensor: Used for storing data
minibatch: minibatch tensor: Used for training

clone()

Return a new Variable like self.

Returns

Variable instance: A new Variable instance (or subclass instance) with no owner or index.

Notes

Tags are copied to the returned instance.

Name is copied to the returned instance.

pymc3.data.get_data(filename)

Returns a BytesIO object for a package data file.

Parameters

filename: str: file to load

Returns

BytesIO of the data