
Liang Bo Wang (亮亮), 2015-06-07

PyCon APAC 2015, 2015-06-07

HDF5 Use Case

By Liang2, under the CC BY 4.0 license


A quick note before my talk

HDF5 is NOT HDFS

If you are expecting ...

Wrong place. The talk on click detection by Vpon or sentiment analysis by Willy would suit you better. Seriously.

Unexpected code fight and online judge

About Me

Many good HDF5 intros out there already

  • The API is clean and easy to learn on your own
  • Details are in the book (reportedly available at Tenlong bookstore)
  • I'm not going to (nor able to) write another one
  • Instead, I'll focus on concepts and my own use cases

HDF5 Intro

HDF = Hierarchical Data Format

Readable by most programming languages

Keep data organized and connected by design

import h5py
import numpy as np

f = h5py.File("weather.hdf5", "a")
f['/taipei_1/humidity'] = np.array(...)
f['/taipei_1/humidity'].attrs['rec_date'] = utc_timestamp
f['/taipei_1/temperature'] = np.array(...)
dataset = f['/taipei_5/humidity']

Group and dataset

'''f['/taipei_1/these_days/humidity']
       -------- ---------- --------
        group      group    dataset
'''
tpe_grp = f.create_group('taipei_1')
these_days = tpe_grp.create_group('these_days')
humidity = these_days.create_dataset('humidity', ...)

Dataset creation

dset = f.create_dataset(
  "demo",
  (10, 10),           # dset.shape, can be multi-dim
  dtype=np.float64,   # dset.dtype, just like NumPy's
  fillvalue=42        # default value for unwritten elements
)
dset[:5, :5] = np.arange(25).reshape((5, -1))
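A quick sanity check of the fill value (a minimal sketch using the dset created above): elements that were never written still return fillvalue.

print(dset[0, 0])  # -> 0.0, written by the slice assignment above
print(dset[9, 9])  # -> 42.0, never written, so fillvalue shows through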

Dataset partial reading

out = dset[:5, :5]

# More efficient: read directly into a preallocated buffer
out = np.empty((100, 50), dtype=np.float32)
dset.read_direct(out, np.s_[:, 0:50])  # source selection on the dataset

Chunks

f.create_dataset(
    'classA_heart_beat',
    (100, 3600),    # 1 hr of heart-rate records, one per second
    # 1 chunk = 8 * 3600 bytes = 28.8 KB
    chunks=(1, 3600), dtype=np.float64)

Compression and filter

f.create_dataset(..., compression="gzip", shuffle=True)
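Putting it together, a chunked, gzip-compressed dataset could be created like this (a minimal sketch; the dataset name is made up, and compression requires chunked storage):

dset = f.create_dataset(
    'compressed_demo',
    (1000, 1000),
    dtype=np.float64,
    compression='gzip',   # other built-in filters: 'lzf', 'szip'
    compression_opts=4,   # gzip level, 0 (fast) to 9 (small)
    shuffle=True)         # byte-shuffle filter often improves the gzip ratio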

Datasets - metadata

dset = f.create_dataset(...)
dset.attrs['title'] = 'PyCon APAC'
dset.attrs['year'] = 2015
    

Tricky things - Strings

>>> dt = h5py.special_dtype(vlen=str)
>>> dt
dtype(('|O4', [(({'type': <type 'str'>}, 'vlen'), '|O4')]))
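Despite the scary repr, the dtype works like any other (a minimal sketch; the dataset name is made up):

dt = h5py.special_dtype(vlen=str)
dset = f.create_dataset('talk_titles', (3,), dtype=dt)
dset[0] = 'HDF5 Use Case'  # variable-length string, no fixed width needed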
    

Good things - Concurrency

How about
X vs HDF5?

Pickle the awesome

  • Want to save something with zero effort? Use pickle
  • Once you're addicted, there's no going back

import pickle

class MyConfig:
    def __init__(self, ...):
        self.x = x
        self.y = y
        # ...

c = MyConfig(...)
with open(pickle_pth, 'wb') as pkl_f:
    pickle.dump(c, pkl_f)
with open(pickle_pth, 'rb') as pkl_f:
    c = pickle.load(pkl_f)

Pickle vs HDF5

Numpy save / load

.npy stores a single array in a simple binary format; .npz is basically a ZIP archive of .npy files.

>>> np.save('/tmp/123', np.array([[1, 2, 3], [4, 5, 6]]))
>>> np.load('/tmp/123.npy')
array([[1, 2, 3],
       [4, 5, 6]])
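For the .npz side, a minimal sketch: several named arrays go into one archive, loaded back by key.

>>> np.savez('/tmp/arrs.npz', a=np.arange(3), b=np.eye(2))
>>> arrs = np.load('/tmp/arrs.npz')
>>> arrs['a']
array([0, 1, 2])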

Did you say saving data? What about a database?

Database vs HDF5

Why not make HDF5 support all that awesomeness?

There are some optimization comparisons on their official site

More on this later. Simply put, it would hurt HDF5's cross-platform portability

HDF5
Use Case

Daily - Pandas X HDF5
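Pandas ships with HDF5 support built on PyTables. A minimal sketch (the file name and key are made up):

import pandas as pd

df = pd.DataFrame({'humidity': [0.8, 0.7], 'temp': [28.5, 30.1]})
df.to_hdf('weather.h5', 'taipei_1')         # write the DataFrame as an HDF5 node
df = pd.read_hdf('weather.h5', 'taipei_1')  # read it back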

Large Numpy array X HDF5

Something like this:

[Figure: the original way vs. handled by HDF5]

    
with h5py.File(hdf5_pth, 'r+') as f:
    for batch in img_batches:
        region_data = f['raw_img_xx'][batch.region]
        # ... process region_data into outcome ...
        f['heatmap_img_xx'][batch.region] = outcome

Deep learning x HDF5

The input is viewed as 32 (width) x 32 (height) x 3 channels (depth);
the output as 1 x 1 x 10. Spatial information is preserved.

Learning process

# evaluate loss and gradient, internally use self.W
loss, grad = self.loss(X_batch, y_batch, reg)
loss_history.append(loss)

# perform parameter update
self.W -= grad * learning_rate

Parameter update method comparison

Animations that may help your intuitions about the learning process dynamics. Left: Contours of a loss surface and time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill. Right: A visualization of a saddle point in the optimization landscape, where the curvature along different dimensions has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed.

Ref: CS231n Notes, Neural Networks Part 3. Image credit: Alec Radford.

How to track the learning progress for all parameters
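One way to answer that with HDF5, sketched under assumptions: params is a hypothetical dict of NumPy parameter arrays, train_one_iter a hypothetical update step, and one dataset per parameter is indexed by iteration.

import h5py

n_iters = 1000
with h5py.File('training_log.hdf5', 'w') as log_f:
    for name, param in params.items():
        # first axis = iteration, remaining axes = the parameter's own shape
        log_f.create_dataset(name, (n_iters,) + param.shape, dtype=param.dtype)
    for i in range(n_iters):
        train_one_iter(params)      # hypothetical: updates params in place
        for name, param in params.items():
            log_f[name][i] = param  # snapshot every parameter at this iteration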

RPi2 X HDF5

Long-term monitoring

  • Continuous storage of sensor data (see the sketch below)
  • Group by date
  • Monitor and alert
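A sketch of what that can look like, under assumptions: read_sensor, send_alert, and ALERT_THRESHOLD are hypothetical, and each day gets its own resizable dataset.

import time
import h5py

with h5py.File('sensors.hdf5', 'a') as f:
    path = time.strftime('%Y-%m-%d') + '/temperature'  # group by date
    if path in f:
        dset = f[path]
    else:
        dset = f.create_dataset(path, shape=(0,), maxshape=(None,),
                                dtype='f8', chunks=(3600,))
    reading = read_sensor()            # hypothetical sensor read
    dset.resize((dset.shape[0] + 1,))  # grow by one record
    dset[-1] = reading
    if reading > ALERT_THRESHOLD:      # hypothetical alert rule
        send_alert(reading)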

Recap

Thank You!

Fork me on GitHub