torchgeo.samplers#

Samplers#

Samplers are used to index a dataset, retrieving a single query at a time. For NonGeoDataset, dataset objects can be indexed with integers, and PyTorch’s built-in samplers are sufficient. For GeoDataset, dataset objects require a bounding box for indexing. For this reason, we define our own GeoSampler implementations below. These can be used like so:

from torch.utils.data import DataLoader

from torchgeo.datasets import Landsat
from torchgeo.samplers import RandomGeoSampler

dataset = Landsat(...)
sampler = RandomGeoSampler(dataset, size=256, length=10000)
dataloader = DataLoader(dataset, sampler=sampler)

This data loader will return 256x256 px images and has an epoch length of 10,000 samples.

Random Geo Sampler#

class torchgeo.samplers.RandomGeoSampler(dataset, size, length=None, roi=None, toi=None, units=Units.PIXELS, generator=None)[source]#

Bases: GeoSampler

Samples elements from a region of interest randomly.

This is particularly useful during training when you want to maximize the size of the dataset and return as many random chips as possible. Note that randomly sampled chips may overlap.

This sampler is not recommended for use with tile-based datasets. Use RandomBatchGeoSampler instead.

__init__(dataset, size, length=None, roi=None, toi=None, units=Units.PIXELS, generator=None)[source]#

Initialize a new Sampler instance.

The size argument can either be:

  • a single float - in which case the same value is used for the height and width dimension

  • a tuple of two floats - in which case, the first float is used for the height dimension, and the second float for the width dimension

Changed in version 0.3: Added units parameter, changed default to pixel units

Changed in version 0.4: length parameter is now optional, a reasonable default will be used

Added in version 0.7: The generator parameter.

Added in version 0.8: The toi parameter.

Parameters:
  • dataset (GeoDataset) – dataset to index from

  • size (tuple[float, float] | float) – dimensions of each patch

  • length (int | None) – number of random samples to draw per epoch (defaults to approximately the maximal number of non-overlapping chips of size size that could be sampled from the dataset)

  • roi (Polygon | None) – region of interest to sample from (defaults to the bounds of dataset.index)

  • toi (Interval | None) – time of interest to sample from (defaults to the bounds of dataset.index)

  • units (Units) – defines if size is in pixel or CRS units

  • generator (Generator | None) – pseudo-random number generator (PRNG).

__iter__()[source]#

Return the index of a dataset.

Yields:

[xmin:xmax, ymin:ymax, tmin:tmax] coordinates to index a dataset.

__len__()[source]#

Return the number of samples in a single epoch.

Returns:

length of the epoch

Return type:

int

Grid Geo Sampler#

class torchgeo.samplers.GridGeoSampler(dataset, size, stride=None, roi=None, toi=None, units=Units.PIXELS)[source]#

Bases: GeoSampler

Samples elements in a grid-like fashion.

This is particularly useful during evaluation when you want to make predictions for an entire region of interest. To limit redundant computation, keep the overlap between chips small.

Usually the stride should be slightly smaller than the chip size so that each chip has some small overlap with surrounding chips. This overlap helps prevent stitching artifacts when the prediction patches are combined. The overlap between adjacent chips (size - stride) should be approximately equal to the receptive field of the CNN.
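As a rough illustration of the stride guideline above, the following sketch derives a stride from a chip size and a receptive field. The helper name `stride_for_overlap` and the numbers (256 px chips, a 32 px receptive field) are hypothetical and chosen only for illustration; they are not part of torchgeo.

```python
def stride_for_overlap(chip_size: int, receptive_field: int) -> int:
    """Return a stride that leaves `receptive_field` pixels of overlap.

    Hypothetical helper: overlap = chip_size - stride, so the stride is
    simply the chip size minus the desired overlap.
    """
    if not 0 <= receptive_field < chip_size:
        raise ValueError("receptive_field must be in [0, chip_size)")
    return chip_size - receptive_field


# Example values are assumptions, not recommendations.
stride = stride_for_overlap(chip_size=256, receptive_field=32)
overlap = 256 - stride
```

The resulting value could then be passed as the stride argument, for example `GridGeoSampler(dataset, size=256, stride=stride)`.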

__init__(dataset, size, stride=None, roi=None, toi=None, units=Units.PIXELS)[source]#

Initialize a new Sampler instance.

The size and stride arguments can either be:

  • a single float - in which case the same value is used for the height and width dimension

  • a tuple of two floats - in which case, the first float is used for the height dimension, and the second float for the width dimension

Changed in version 0.3: Added units parameter, changed default to pixel units

Added in version 0.8: The toi parameter.

Parameters:
  • dataset (GeoDataset) – dataset to index from

  • size (tuple[float, float] | float) – dimensions of each patch

  • stride (tuple[float, float] | float | None) – distance to skip between each patch (defaults to size)

  • roi (Polygon | None) – region of interest to sample from (defaults to the bounds of dataset.index)

  • toi (Interval | None) – time of interest to sample from (defaults to the bounds of dataset.index)

  • units (Units) – defines if size and stride are in pixel or CRS units

__iter__()[source]#

Return the index of a dataset.

Yields:

[xmin:xmax, ymin:ymax, tmin:tmax] coordinates to index a dataset.

__len__()[source]#

Return the number of samples over the ROI.

Returns:

number of patches that will be sampled

Return type:

int

Pre-chipped Geo Sampler#

class torchgeo.samplers.PreChippedGeoSampler(dataset, roi=None, toi=None, shuffle=False, generator=None)[source]#

Bases: GeoSampler

Samples entire files at a time.

This is particularly useful for datasets that contain geospatial metadata and subclass GeoDataset but have already been pre-processed into chips.

This sampler should not be used with NonGeoDataset. You may encounter problems when using an ROI that partially intersects with one of the file bounding boxes, when using an IntersectionDataset, or when each file is in a different CRS. These issues can be solved by adding padding.

__init__(dataset, roi=None, toi=None, shuffle=False, generator=None)[source]#

Initialize a new Sampler instance.

Added in version 0.3.

Added in version 0.7: The generator parameter.

Added in version 0.8: The toi parameter.

Parameters:
  • dataset (GeoDataset) – dataset to index from

  • roi (Polygon | None) – region of interest to sample from (defaults to the bounds of dataset.index)

  • toi (Interval | None) – time of interest to sample from (defaults to the bounds of dataset.index)

  • shuffle (bool) – if True, reshuffle data at every epoch

  • generator (Generator | None) – pseudo-random number generator (PRNG) used in combination with shuffle.

__iter__()[source]#

Return the index of a dataset.

Yields:

[xmin:xmax, ymin:ymax, tmin:tmax] coordinates to index a dataset.

__len__()[source]#

Return the number of samples over the ROI.

Returns:

number of patches that will be sampled

Return type:

int

Batch Samplers#

When working with large tile-based datasets, randomly sampling patches from each tile can be extremely time-consuming. It’s much more efficient to choose a tile, load it, warp it to the appropriate coordinate reference system (CRS) and resolution, and then sample random patches from that tile to construct a mini-batch of data. For this reason, we define our own BatchGeoSampler implementations below. These can be used like so:

from torch.utils.data import DataLoader

from torchgeo.datasets import Landsat
from torchgeo.samplers import RandomBatchGeoSampler

dataset = Landsat(...)
sampler = RandomBatchGeoSampler(dataset, size=256, batch_size=128, length=10000)
dataloader = DataLoader(dataset, batch_sampler=sampler)

This data loader will return 256x256 px images and has a batch size of 128 and an epoch length of 10,000 samples.
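The tile-then-patch strategy described in this section can be sketched in plain Python. This is not torchgeo's implementation: the function name `random_batches`, the `(xmin, xmax, ymin, ymax)` tile tuples, and the use of Python's `random` module are all assumptions made so the snippet runs on its own. It also assumes the number of batches per epoch is `length // batch_size`.

```python
import random


def random_batches(tiles, size, batch_size, length, seed=0):
    """Yield batches of (x, y) slice pairs, one tile per batch.

    Hypothetical sketch: `tiles` is a list of (xmin, xmax, ymin, ymax)
    tuples, and every patch in a batch is drawn from the same tile.
    """
    rng = random.Random(seed)
    # Assumption: an epoch of `length` samples gives length // batch_size batches.
    for _ in range(length // batch_size):
        xmin, xmax, ymin, ymax = rng.choice(tiles)  # choose one tile
        batch = []
        for _ in range(batch_size):
            # Pick a lower-left corner so the patch stays inside the tile.
            x0 = rng.uniform(xmin, xmax - size)
            y0 = rng.uniform(ymin, ymax - size)
            batch.append((slice(x0, x0 + size), slice(y0, y0 + size)))
        yield batch


# Two made-up, non-adjacent tiles.
tiles = [(0, 1000, 0, 1000), (2000, 3000, 0, 1000)]
batches = list(random_batches(tiles, size=256, batch_size=4, length=10))
```

Because every patch in a batch comes from a single tile, that tile only needs to be loaded and warped once per batch.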

Random Batch Geo Sampler#

class torchgeo.samplers.RandomBatchGeoSampler(dataset, size, batch_size, length=None, roi=None, toi=None, units=Units.PIXELS, generator=None)[source]#

Bases: BatchGeoSampler

Samples batches of elements from a region of interest randomly.

This is particularly useful during training when you want to maximize the size of the dataset and return as many random chips as possible. Note that randomly sampled chips may overlap.

__init__(dataset, size, batch_size, length=None, roi=None, toi=None, units=Units.PIXELS, generator=None)[source]#

Initialize a new Sampler instance.

The size argument can either be:

  • a single float - in which case the same value is used for the height and width dimension

  • a tuple of two floats - in which case, the first float is used for the height dimension, and the second float for the width dimension

Changed in version 0.3: Added units parameter, changed default to pixel units

Changed in version 0.4: length parameter is now optional, a reasonable default will be used

Added in version 0.7: The generator parameter.

Added in version 0.8: The toi parameter.

Parameters:
  • dataset (GeoDataset) – dataset to index from

  • size (tuple[float, float] | float) – dimensions of each patch

  • batch_size (int) – number of samples per batch

  • length (int | None) – number of samples per epoch (defaults to approximately the maximal number of non-overlapping chips of size size that could be sampled from the dataset)

  • roi (Polygon | None) – region of interest to sample from (defaults to the bounds of dataset.index)

  • toi (Interval | None) – time of interest to sample from (defaults to the bounds of dataset.index)

  • units (Units) – defines if size is in pixel or CRS units

  • generator (Generator | None) – pseudo-random number generator (PRNG).

__iter__()[source]#

Return the indices of a dataset.

Yields:

Batch of [xmin:xmax, ymin:ymax, tmin:tmax] coordinates to index a dataset.

__len__()[source]#

Return the number of batches in a single epoch.

Returns:

number of batches in an epoch

Return type:

int

Base Classes#

If you want to write your own custom sampler, you can extend one of these abstract base classes.
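To give a feel for what a custom sampler has to produce, here is a minimal stand-in that yields indices in the documented [xmin:xmax, ymin:ymax, tmin:tmax] form as a tuple of three slices. The class `GridSketch` is hypothetical and deliberately does not subclass GeoSampler, so the snippet runs without torchgeo installed. A real custom sampler would extend GeoSampler and pass dataset, roi, and toi to its constructor.

```python
import math


class GridSketch:
    """Hypothetical stand-in showing the shape of a sampler's output."""

    def __init__(self, bounds, size, stride, toi):
        # bounds = (xmin, xmax, ymin, ymax); toi = (tmin, tmax). Both are
        # assumptions about how to represent the extent in this sketch.
        self.xmin, self.xmax, self.ymin, self.ymax = bounds
        self.size = size
        self.stride = stride
        self.toi = toi
        # Chips per axis, using o = ceil((i - k) / s) + 1.
        self.rows = math.ceil((self.ymax - self.ymin - size) / stride) + 1
        self.cols = math.ceil((self.xmax - self.xmin - size) / stride) + 1

    def __iter__(self):
        tmin, tmax = self.toi
        for i in range(self.rows):
            y0 = self.ymin + i * self.stride
            for j in range(self.cols):
                x0 = self.xmin + j * self.stride
                yield (
                    slice(x0, x0 + self.size),
                    slice(y0, y0 + self.size),
                    slice(tmin, tmax),
                )

    def __len__(self):
        return self.rows * self.cols


sampler = GridSketch(bounds=(0, 1000, 0, 500), size=256, stride=224, toi=(0, 1))
first = next(iter(sampler))
```

The two methods shown, `__iter__` and `__len__`, mirror the ones documented for the concrete samplers above.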

Geo Sampler#

class torchgeo.samplers.GeoSampler(dataset, roi=None, toi=None)[source]#

Bases: Sampler[slice | tuple[slice] | tuple[slice, slice] | tuple[slice, slice, slice]], ABC

Abstract base class for sampling from GeoDataset.

Unlike PyTorch’s Sampler, GeoSampler returns enough geospatial information to uniquely index any GeoDataset. This includes things like latitude, longitude, height, width, projection, coordinate system, and time.

__init__(dataset, roi=None, toi=None)[source]#

Initialize a new Sampler instance.

Added in version 0.8: The toi parameter.

Parameters:
  • dataset (GeoDataset) – dataset to index from

  • roi (Polygon | None) – region of interest to sample from (defaults to the bounds of dataset.index)

  • toi (Interval | None) – time of interest to sample from (defaults to the bounds of dataset.index)

abstractmethod __iter__()[source]#

Return the index of a dataset.

Yields:

[xmin:xmax, ymin:ymax, tmin:tmax] coordinates to index a dataset.

Batch Geo Sampler#

class torchgeo.samplers.BatchGeoSampler(dataset, roi=None, toi=None)[source]#

Bases: Sampler[list[slice | tuple[slice] | tuple[slice, slice] | tuple[slice, slice, slice]]], ABC

Abstract base class for sampling batches from GeoDataset.

Unlike PyTorch’s BatchSampler, BatchGeoSampler returns enough geospatial information to uniquely index any GeoDataset. This includes things like latitude, longitude, height, width, projection, coordinate system, and time.

__init__(dataset, roi=None, toi=None)[source]#

Initialize a new Sampler instance.

Added in version 0.8: The toi parameter.

Parameters:
  • dataset (GeoDataset) – dataset to index from

  • roi (Polygon | None) – region of interest to sample from (defaults to the bounds of dataset.index)

  • toi (Interval | None) – time of interest to sample from (defaults to the bounds of dataset.index)

abstractmethod __iter__()[source]#

Return a batch of indices of a dataset.

Yields:

Batch of [xmin:xmax, ymin:ymax, tmin:tmax] coordinates to index a dataset.

Utilities#

torchgeo.samplers.get_random_bounding_box(bounds, size, res, generator=None)[source]#

Returns a random bounding box within a given bounding box.

The size argument can either be:

  • a single float - in which case the same value is used for the height and width dimension

  • a tuple of two floats - in which case, the first float is used for the height dimension, and the second float for the width dimension

Added in version 0.7: The generator parameter.

Parameters:
Returns:

randomly sampled bounding box from the extent of the input

Return type:

tuple[slice, slice]
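The idea of drawing a random box inside a larger one can be sketched as follows. This is an illustration only, not the library's code: the name `random_box`, the `(xmin, xmax, ymin, ymax)` layout of `bounds`, and the use of `random.Random` instead of a torch generator are assumptions. The sketch also ignores the `res` parameter entirely.

```python
import random


def random_box(bounds, size, rng):
    """Return (x_slice, y_slice) for a random box inside `bounds`.

    Hypothetical sketch: `bounds` is (xmin, xmax, ymin, ymax) and `size`
    is either one number or a (height, width) pair in the same units.
    """
    if isinstance(size, (int, float)):
        size = (size, size)  # same value for height and width
    height, width = size
    xmin, xmax, ymin, ymax = bounds
    # Choose the lower corner so the whole box fits inside the bounds.
    x0 = xmin + rng.random() * (xmax - xmin - width)
    y0 = ymin + rng.random() * (ymax - ymin - height)
    return slice(x0, x0 + width), slice(y0, y0 + height)


rng = random.Random(0)  # seeded for reproducibility
xs, ys = random_box((0.0, 1000.0, 0.0, 500.0), (128, 256), rng)
```

Passing a seeded generator plays the same role as the generator parameter described above: repeated runs draw the same sequence of boxes.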

torchgeo.samplers.tile_to_chips(bounds, size, stride=None)[source]#

Compute number of chips that can be sampled from a tile.

Let \(i\) be the size of the input tile. Let \(k\) be the requested size of the output patch. Let \(s\) be the requested stride. Let \(o\) be the number of output chips sampled from each tile. \(o\) can then be computed as:

\[o = \left\lceil \frac{i - k}{s} \right\rceil + 1\]

This is almost identical to relationship 5 in https://doi.org/10.48550/arXiv.1603.07285. However, we use ceiling instead of floor because we want to include the final remaining chip in each row/column when bounds is not an integer multiple of stride.
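A small worked example may help show the difference between ceiling and floor. The helper `chips_along_axis` and the numbers below are hypothetical. The sketch evaluates the formula for a single axis and does not handle the case where the tile is smaller than the patch.

```python
import math


def chips_along_axis(i: float, k: float, s: float) -> int:
    """Evaluate o = ceil((i - k) / s) + 1 for one axis (hypothetical helper)."""
    return math.ceil((i - k) / s) + 1


# A 1000-unit tile, 256-unit patches, stride 224: (1000 - 256) / 224 is about 3.32.
with_ceiling = chips_along_axis(i=1000, k=256, s=224)
with_floor = math.floor((1000 - 256) / 224) + 1

# When (i - k) is an exact multiple of s, ceiling and floor agree.
exact = chips_along_axis(i=928, k=256, s=224)
```

With floor, the fourth and last chip would span 672 to 928, leaving the final 72 units of the tile unsampled. With ceiling, a fifth chip starting at 896 covers that remainder.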

Parameters:
Returns:

the number of rows/columns that can be sampled

Return type:

tuple[int, int]

Added in version 0.4.

Units#

By default, the size parameter specifies the size of the image in pixel units. If you would instead like to specify the size in CRS units, you can change the units parameter like so:

from torch.utils.data import DataLoader

from torchgeo.datasets import Landsat
from torchgeo.samplers import RandomGeoSampler, Units

dataset = Landsat(...)
sampler = RandomGeoSampler(dataset, size=256 * 30, length=10000, units=Units.CRS)
dataloader = DataLoader(dataset, sampler=sampler)

Assuming a resolution of 30 m per pixel in the dataset's CRS, this data loader will return 256x256 px images and has an epoch length of 10,000 samples.
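The arithmetic behind `size=256 * 30` can be made explicit with a short sketch. The helper names `pixels_to_crs` and `crs_to_pixels` are hypothetical, and the 30 m resolution is the same assumption used in the example above.

```python
def pixels_to_crs(size_px: float, res: float) -> float:
    """Convert a size in pixels to CRS units, given CRS units per pixel."""
    return size_px * res


def crs_to_pixels(size_crs: float, res: float) -> float:
    """Convert a size in CRS units back to pixels."""
    return size_crs / res


res = 30.0  # assumed resolution: 30 m per pixel
size_crs = pixels_to_crs(256, res)
size_px = crs_to_pixels(size_crs, res)
```

So a 256 px chip corresponds to 7,680 CRS units (metres here) on a side, which is the value passed as size when units=Units.CRS.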

class torchgeo.samplers.Units(*values)[source]#

Bases: Enum

Enumeration defining units of size parameter.

Used by GeoSampler and BatchGeoSampler.

PIXELS = 1#

Units in number of pixels

CRS = 2#

Units of coordinate reference system (CRS)