
Liang Bo Wang (亮亮), 2015-06-07

PyCon APAC 2015, 2015-06-07

HDF5 Use Case

By Liang2, under the CC BY 4.0 license


A quick note before my talk

HDF5 is NOT HDFS

If you are expecting ...

Wrong place. The talk on click detection by Vpon or sentiment analysis by Willy would suit you better. Seriously.

Unexpected code fight and online judge

About Me

Many good HDF5 intros out there already

  • The API is clean and easy to learn on your own
  • Details are in the book (reportedly available at Tenlong bookstore)
  • I'm not going to (nor able to) write another one
  • Instead, I'll focus on concepts and my own use cases

HDF5 Intro

HDF = Hierarchical Data Format

Readable by most programming languages

Keep data organized and connected by design

import h5py
import numpy as np

f = h5py.File("weather.hdf5", "a")
f['/taipei_1/humidity'] = np.array(...)
f['/taipei_1/humidity'].attrs['rec_date'] = utc_timestamp
f['/taipei_1/temperature'] = np.array(...)
dataset = f['/taipei_5/humidity']

Group and dataset

'''f['/taipei_1/these_days/humidity']
       -------- ---------- --------
        group      group    dataset
'''
tpe_grp = f.create_group('taipei_1')
these_days = tpe_grp.create_group('these_days')
humidity = these_days.create_dataset('humidity', ...)

Dataset creation

dset = f.create_dataset(
  "demo",
  (10, 10),           # dset.shape, can be multi-dim
  dtype=np.float64,   # dset.dtype, just like NumPy's
  fillvalue=42        # default value for unwritten elements
)
dset[:5, :5] = np.arange(25).reshape((5, -1))
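A quick sanity check of the fill value (a minimal sketch using the dset created above): elements that were never written still return fillvalue.

print(dset[0, 0])  # -> 0.0, written by the slice assignment above
print(dset[9, 9])  # -> 42.0, never written, so fillvalue shows through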

Dataset partial reading

out = dset[:5, :5]

# More efficient: read directly into a preallocated buffer
out = np.empty((100, 50), dtype=np.float32)
dset.read_direct(out, np.s_[:, 0:50])  # source selection on the dataset

Chunks

f.create_dataset(
    'classA_heart_beat',
    (100, 3600),    # 1 hr of heart-rate records, one per second
    # 1 chunk = 8 * 3600 bytes = 28.8 KB
    chunks=(1, 3600), dtype=np.float64)

Compression and filter

f.create_dataset(..., compression="gzip", shuffle=True)
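Putting it together, a chunked, gzip-compressed dataset could be created like this (a minimal sketch; the dataset name is made up, and compression requires chunked storage):

dset = f.create_dataset(
    'compressed_demo',
    (1000, 1000),
    dtype=np.float64,
    compression='gzip',   # other built-in filters: 'lzf', 'szip'
    compression_opts=4,   # gzip level, 0 (fast) to 9 (small)
    shuffle=True)         # byte-shuffle filter often improves the gzip ratio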

Datasets - metadata

dset = f.create_dataset(...)
dset.attrs['title'] = 'PyCon APAC'
dset.attrs['year'] = 2015
    

Tricky things - Strings

>>> dt = h5py.special_dtype(vlen=str)
>>> dt
dtype(('|O4', [(({'type': <type 'str'>}, 'vlen'), '|O4')]))
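Despite the scary repr, the dtype works like any other (a minimal sketch; the dataset name is made up):

dt = h5py.special_dtype(vlen=str)
dset = f.create_dataset('talk_titles', (3,), dtype=dt)
dset[0] = 'HDF5 Use Case'  # variable-length string, no fixed width needed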
    

Good things - Concurrency

How about
X vs HDF5?

Pickle the awesome

  • Want to save something with zero effort? Use pickle
  • Once you're addicted, there's no going back

import pickle

class MyConfig:
    def __init__(self, ...):
        self.x = x
        self.y = y
        # ...

c = MyConfig(...)
with open(pickle_pth, 'wb') as pkl_f:
    pickle.dump(c, pkl_f)
with open(pickle_pth, 'rb') as pkl_f:
    c = pickle.load(pkl_f)

Pickle vs HDF5

Numpy save / load

.npy stores a single array in a simple binary format; .npz is basically a ZIP archive of .npy files.

>>> np.save('/tmp/123', np.array([[1, 2, 3], [4, 5, 6]]))
>>> np.load('/tmp/123.npy')
array([[1, 2, 3],
       [4, 5, 6]])
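For the .npz side, a minimal sketch: several named arrays go into one archive, loaded back by key.

>>> np.savez('/tmp/arrs.npz', a=np.arange(3), b=np.eye(2))
>>> arrs = np.load('/tmp/arrs.npz')
>>> arrs['a']
array([0, 1, 2])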

Did you say saving data? What about a database?

Database vs HDF5

Why not make HDF5 support all that awesomeness?

There are some optimization comparisons on their official site

More on this later. Simply put, it would hurt HDF5's cross-platform portability

HDF5
Use Case

Daily - Pandas X HDF5
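Pandas ships with HDF5 support built on PyTables. A minimal sketch (the file name and key are made up):

import pandas as pd

df = pd.DataFrame({'humidity': [0.8, 0.7], 'temp': [28.5, 30.1]})
df.to_hdf('weather.h5', 'taipei_1')         # write the DataFrame as an HDF5 node
df = pd.read_hdf('weather.h5', 'taipei_1')  # read it back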

Large Numpy array X HDF5

Something like this:

[Figure: the original way vs. handled by HDF5]

    
with h5py.File(hdf5_pth, 'r+') as f:
    for batch in img_batches:
        region_data = f['raw_img_xx'][batch.region]
        # ... process region_data into outcome ...
        f['heatmap_img_xx'][batch.region] = outcome

Deep learning x HDF5

The input is viewed as 32 (width) x 32 (height) x 3 channels (depth);
the output as 1 x 1 x 10. Spatial information is preserved.

Learning process

# evaluate loss and gradient, internally use self.W
loss, grad = self.loss(X_batch, y_batch, reg)
loss_history.append(loss)

# perform parameter update
self.W -= grad * learning_rate

Parameter update method comparison

Animations that may help your intuitions about the learning process dynamics. Left: Contours of a loss surface and time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill. Right: A visualization of a saddle point in the optimization landscape, where the curvature along different dimensions has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed.

Ref: CS231n Notes, Neural Networks Part 3. Image credit: Alec Radford.

How to track the learning progress for all parameters
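One way to answer that with HDF5, sketched under assumptions: params is a hypothetical dict of NumPy parameter arrays, train_one_iter a hypothetical update step, and one dataset per parameter is indexed by iteration.

import h5py

n_iters = 1000
with h5py.File('training_log.hdf5', 'w') as log_f:
    for name, param in params.items():
        # first axis = iteration, remaining axes = the parameter's own shape
        log_f.create_dataset(name, (n_iters,) + param.shape, dtype=param.dtype)
    for i in range(n_iters):
        train_one_iter(params)      # hypothetical: updates params in place
        for name, param in params.items():
            log_f[name][i] = param  # snapshot every parameter at this iteration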

RPi2 X HDF5

Long-term monitoring

  • Continuous storage of sensor data (see the sketch below)
  • Group by date
  • Monitor and alert
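A sketch of what that can look like, under assumptions: read_sensor, send_alert, and ALERT_THRESHOLD are hypothetical, and each day gets its own resizable dataset.

import time
import h5py

with h5py.File('sensors.hdf5', 'a') as f:
    path = time.strftime('%Y-%m-%d') + '/temperature'  # group by date
    if path in f:
        dset = f[path]
    else:
        dset = f.create_dataset(path, shape=(0,), maxshape=(None,),
                                dtype='f8', chunks=(3600,))
    reading = read_sensor()            # hypothetical sensor read
    dset.resize((dset.shape[0] + 1,))  # grow by one record
    dset[-1] = reading
    if reading > ALERT_THRESHOLD:      # hypothetical alert rule
        send_alert(reading)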

Recap

Thank You!

Fork me on GitHub