2025-02-11T14:07
DataLoader vs Dataset:
I think it's worthwhile to look at built-in datasets to see how they are structured.
A custom Dataset must have a constructor, __len__, and __getitem__. Constructors don't need to call super(). __getitem__ returns a sample/label pair, i.e. X, y.
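A minimal sketch of that contract (the class name and the random arrays are hypothetical stand-ins, not from any particular project):

import numpy as np
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    # Map-style dataset: constructor + __len__ + __getitem__; no super().__init__() needed.
    def __init__(self, features: np.ndarray, labels: np.ndarray):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Return a sample/label pair, i.e. X, y
        X = torch.as_tensor(self.features[idx], dtype=torch.float32)
        y = torch.as_tensor(self.labels[idx], dtype=torch.long)
        return X, y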
A custom DataLoader is like a Dataset that is mini-batch aware. It assists in data reshuffling and parallelizing data access. IMO, much of this should be managed in Xarray via xbatcher, when possible. DataLoaders are iterable and are thus often called with next(iter(train_loader)).
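A sketch of wiring that up, assuming the hypothetical MyDataset above and some random toy data:

import numpy as np
from torch.utils.data import DataLoader

ds = MyDataset(np.random.rand(1000, 8), np.random.randint(0, 2, size=1000))
train_loader = DataLoader(ds, batch_size=64, shuffle=True, num_workers=2)

# Peek at a single mini-batch
X, y = next(iter(train_loader))
print(X.shape, y.shape)  # torch.Size([64, 8]) torch.Size([64])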
DataLoaders should be compatible with Torch's Samplers.
A DataLoader is invoked in train scripts like so:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()

    ### Note the iteration pattern! ###
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
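For that function to run, device, a model, a loss, and an optimizer need to be in scope. A throwaway setup matching the 8-feature, 2-class toy data from the DataLoader sketch above might look like:

import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

train(train_loader, model, loss_fn, optimizer)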
A convention for data loaders is to couple them with transforms: transform and target_transform modify the data and label, respectively. I wonder if xbatcher has specific affordances for these s.t. we can write stuff in Xarray's fluent API before converting the underlying numpy arrays to torch Tensors?
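The torchvision built-ins show the convention: both hooks are plain callables passed to the Dataset constructor. A sketch, with the one-hot target_transform purely for illustration:

import torch
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),  # PIL image -> float tensor scaled to [0, 1]
    target_transform=Lambda(
        lambda y: torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1)
    ),  # integer label -> one-hot vector
)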
There are two Dataset types made available to DataLoaders: map-style (implements __getitem__ and __len__) and IterableDataset. The latter is used when random reads are expensive or improbable, or when the batch size depends on the fetched data. When using IterableDataset, which is more like the Xarray case, replicas of the loader are typically made across multiple worker processes. Thus, the dataset needs to be configured carefully to avoid loading duplicate data. Iterable-style datasets naturally lend themselves to chunking, where a batch is yielded all at once.
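A sketch of the per-worker splitting idiom, adapted from the pattern in the PyTorch docs (RangeStream is a made-up example class):

import math
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class RangeStream(IterableDataset):
    # Streams integers in [start, end); each worker reads a disjoint slice
    # so replicated loaders don't yield duplicate data.
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:  # single-process loading: read everything
            lo, hi = self.start, self.end
        else:             # split the range across num_workers processes
            per_worker = math.ceil((self.end - self.start) / info.num_workers)
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

loader = DataLoader(RangeStream(0, 12), num_workers=2, batch_size=4)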
Instead of using the shuffle flag in map-style data loaders, users can specify custom Samplers. These yield the next index/key to fetch. Samplers can also be used to configure batches via the batch_sampler arg. Samplers cannot be used with iterable-style datasets.
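A sketch of both options, reusing the ds from the DataLoader sketch above:

from torch.utils.data import DataLoader, RandomSampler, BatchSampler

# sampler yields one index at a time (mutually exclusive with shuffle=True)
loader = DataLoader(ds, batch_size=32, sampler=RandomSampler(ds))

# batch_sampler yields a whole list of indices per batch; batch_size, shuffle,
# sampler, and drop_last must then be left at their defaults
loader = DataLoader(ds, batch_sampler=BatchSampler(RandomSampler(ds), batch_size=32, drop_last=False))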
By default, loaded data is collated into mini-batches, with a batch dimension added as the first dimension. This is configurable if you want single samples or want to manage batching yourself. I believe the default collate_fn converts numpy arrays to torch Tensors and, most of the time, adds the batch dimension.
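One place a custom collate_fn earns its keep is variable-length samples, which the default stacking can't handle. A sketch, where seq_ds is a hypothetical dataset yielding (sequence, label) pairs of varying length:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of (X, y) pairs from __getitem__
    Xs, ys = zip(*batch)
    X = pad_sequence(Xs, batch_first=True)  # pad to max length; batch dim goes first
    y = torch.stack(ys)
    return X, y

loader = DataLoader(seq_ds, batch_size=16, collate_fn=pad_collate)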
According to this GH issue, one shouldn't use dicts and lists inside a __getitem__ call, but instead use numpy arrays or similar, in order to avoid memory exploding from copy-on-write/refcounting behavior in Python multiprocessing.
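A sketch of the workaround: keep the samples in one contiguous numpy array, so worker processes share a single buffer instead of touching refcounts on thousands of per-sample Python objects (class name is hypothetical):

import numpy as np
import torch
from torch.utils.data import Dataset

class ArrayBackedDataset(Dataset):
    def __init__(self, samples):
        # One numpy array instead of a list of dicts/lists per sample
        self.samples = np.asarray(samples, dtype=np.float32)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return torch.from_numpy(self.samples[idx])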
CUDA Tensors should not be returned from multiprocessing workers; returning CPU Tensors with automatic memory pinning is faster.
For data loading, passing pin_memory=True to a DataLoader will automatically put the fetched data Tensors in pinned memory, and thus enables faster data transfer to CUDA-enabled GPUs. – memory pinning | data API 1
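A sketch of the combination, pinning in the loader and then issuing asynchronous copies (ds is the toy dataset from the earlier sketch):

import torch
from torch.utils.data import DataLoader

loader = DataLoader(ds, batch_size=64, num_workers=4, pin_memory=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for X, y in loader:
    # pinned (page-locked) host memory lets the copy overlap with compute
    X = X.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)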
In this example, we are transferring many large tensors from the CPU to the GPU. This scenario is ideal for utilizing multithreaded pin_memory(), which can significantly enhance performance. However, if the tensors are small, the overhead associated with multithreading may outweigh the benefits. Similarly, if there are only a few tensors, the advantages of pinning tensors on separate threads become limited.
See references for the full article2; it's a complex topic.