In Cito, the workflow for preparing complex data for Convolutional Neural Networks (CNNs) and Multimodal Neural Networks (MMNs) is the same. The only difference is that MMNs can process multiple data types simultaneously (see the ‘MMN’ section below).
Before we dive into the details, let’s clarify how we represent different types of complex data in R using multidimensional arrays:
Note: Cito currently supports up to three-dimensional inputs (excluding the sample dimension). Four-dimensional arrays, such as temporal RGB sequences \([3, height, width, time_steps]\), are not yet supported. Please contact us if you need support for 4D inputs.
The cnn() and mmn() functions both expect
their X argument to be a single array, with the first dimension indexing
samples. The subsequent dimensions then correspond to the data structure
of each sample. Specifically:
It is crucial that the order of samples in X matches the order of observations in your response (target) vector y (or matrix, for multiple responses).
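The layout described above can be sketched with simulated data. The sizes and the variable names `n`, `height` and `width` are assumptions made for illustration only; the sketch just shows samples in the first dimension, channels in the second, and one response value per sample.

```r
# Hypothetical sizes, chosen only for illustration
n <- 10
height <- 32
width <- 32

# Simulated RGB images laid out as [samples, channels, height, width]
X <- array(runif(n * 3 * height * width), dim = c(n, 3, height, width))

# One response value per sample, in the same order as the first dimension of X
y <- rbinom(n, size = 1, prob = 0.5)

dim(X)

# The number of samples in X must equal the number of responses
stopifnot(dim(X)[1] == length(y))
```

A check like the final `stopifnot()` catches a mismatch in the number of samples, but it cannot detect a wrong ordering; that still has to be ensured when the data are loaded.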
Ultimately, the only requirement for cito is that it has these multidimensional arrays as inputs. Please note that there are several ways in which we can build them; the following workflow is just one example.
Although images can be saved in different formats, we recommend using formats for which an R function or R package is available to allow you to load the images into R.
Grayscale and RGB images should be saved as .png or
.jpeg. Time series data can technically be interpreted as
grayscale images and should therefore also be saved as .png
or .jpeg.
However, LiDAR point clouds and other remote sensing data have
more ‘channels’ than grayscale or RGB images (strictly speaking, they
have no channels at all) and therefore cannot be saved as .png or
.jpeg. Common formats for saving such data are
.tiff (GeoTIFF) and .nc (netCDF).
We recommend saving each image individually to your hard drive and using a naming strategy that allows the observation ID to be inferred from the image name. Here is an example:
project/
├── data/
│   ├── RGBimages/
│   │   ├── 001-img.jpeg
│   │   ├── 002-img.jpeg
│   │   ├── 003-img.jpeg
│   │   ├── 004-img.jpeg
│   │   └── ...
│   ├── LiDAR/
│   │   ├── 001-LiDAR.tiff
│   │   ├── 002-LiDAR.tiff
│   │   ├── 003-LiDAR.tiff
│   │   ├── 004-LiDAR.tiff
│   │   └── ...
│   └── Response/
│       └── Y.csv
└── code/
    ├── 01-CNN.R
    └── 02-MMN.R
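With a naming scheme like the one in this tree, the observation ID can be recovered from the file name. The following is a minimal sketch: the helper `get_id()`, the example paths and the `ID` column in the response table are all hypothetical and only illustrate the idea.

```r
# Hypothetical file names following the "<ID>-<type>.<extension>" pattern
rgb_files   <- c("data/RGBimages/001-img.jpeg",
                 "data/RGBimages/002-img.jpeg",
                 "data/RGBimages/003-img.jpeg")
lidar_files <- c("data/LiDAR/001-LiDAR.tiff",
                 "data/LiDAR/002-LiDAR.tiff",
                 "data/LiDAR/003-LiDAR.tiff")

# Hypothetical helper: drop the directory, then everything from the first "-" onwards
get_id <- function(paths) sub("-.*$", "", basename(paths))

rgb_ids   <- get_id(rgb_files)
lidar_ids <- get_id(lidar_files)

# Both input types must cover the same observations in the same order
stopifnot(identical(rgb_ids, lidar_ids))

# Hypothetical response table with an ID column, deliberately out of order
response <- data.frame(ID = c("003", "001", "002"), Y = c(1, 0, 1))

# Reorder the responses so that they follow the order of the images
response <- response[match(rgb_ids, response$ID), ]
```

Using `match()` makes the alignment explicit instead of relying on the files and the table happening to be sorted the same way.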
Before we can run/train cito, we must load the images into R, transform them into arrays, and concatenate the individual images of one input type into one array.
.jpeg files: This can be done either using the imager package via as.array(imager::load.image("path-to-img.jpeg")) or using the torchvision package (a dependency of cito) via torchvision::base_loader("path-to-img.jpeg").
.png files: This can be done using the torchvision package (a dependency of cito) via torchvision::base_loader("path-to-img.png").
.tiff files: tiff::readTIFF("path-to-img.tiff")
.tiff (GeoTIFF) files: as.array(raster::brick("path-to-img.tiff"))
.nc (netCDF) files: ncdf4::ncvar_get(ncdf4::nc_open("path-to-img.nc"))
Loop to read images into R:
First data type:
# collect all file paths, then read each image into one list element
RGBimages_files = list.files(path = "RGBimages/", full.names = TRUE)
RGBimages = vector("list", length(RGBimages_files))
for(i in seq_along(RGBimages_files)) {
  RGBimages[[i]] = torchvision::base_loader(RGBimages_files[i])
}
Second data type:
LiDAR_files = list.files(path = "LiDAR/", full.names = TRUE)
LiDAR = vector("list", length(LiDAR_files)) # one list element per LiDAR file
for(i in seq_along(LiDAR_files)) {
  LiDAR[[i]] = torchvision::base_loader(LiDAR_files[i])
}
Change list of arrays into one array:
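One possible way to stack a list of equally sized image arrays into a single array is sketched below in base R. The list `img_list` is a simulated stand-in for the lists filled by the loops, and its image size is an assumption; packages such as abind offer alternatives for the same step.

```r
# Stand-in for the list filled by the loop: 4 images of shape [height, width, channels]
img_list <- lapply(1:4, function(i) array(i, dim = c(8, 6, 3)))

# All images must share the same dimensions before they can be stacked
img_dim <- dim(img_list[[1]])
same_dim <- vapply(img_list, function(a) identical(dim(a), img_dim), logical(1))
stopifnot(all(same_dim))

# unlist() concatenates the images one after another, so the sample index comes last
stacked <- array(unlist(img_list), dim = c(img_dim, length(img_list)))

# Move the sample dimension to the front: [n, height, width, channels]
images <- aperm(stacked, c(4, 1, 2, 3))
dim(images)
```

If the images differ in size, they have to be cropped or resized to a common shape first; the check on `same_dim` stops early in that case.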
Deep neural networks converge better when the inputs are normalized or standardized. For complex data, we can divide the values by their maximum to bring them into the range \([0, 1]\).
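A minimal sketch of this scaling step, using a simulated array of 8-bit pixel values (the array and its size are assumptions). Dividing by the maximum only yields values in \([0, 1]\) if the data contain no negative values.

```r
set.seed(1)

# Stand-in for an image array with 8-bit pixel values: [n, height, width, channels]
images <- array(sample(0:255, 4 * 8 * 6 * 3, replace = TRUE), dim = c(4, 8, 6, 3))

# Divide by the global maximum so that all values fall into [0, 1]
images_scaled <- images / max(images)

range(images_scaled)
```

New data passed to the model later should be divided by the same maximum that was used for the training data.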
In addition, cito expects the channel dimension to be the second dimension. For RGB images this is straightforward; for LiDAR, we have to decide which dimension to treat as the channel dimension. Because CNNs treat the channel dimension slightly differently from the other dimensions, we suggest using the z dimension as the channel dimension. When images are read into R, however, the channel dimension is usually the last dimension, so it has to be moved to the second position. (Reminder: our dimensions for RGB are currently \([n, height, width, 3]\).)
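Moving the channel dimension from the last to the second position can be done with `aperm()` from base R. The array below is simulated and its size is an assumption; only the permutation itself matters.

```r
# Stand-in array in channel-last layout: [n, height, width, channels]
images <- array(seq_len(4 * 8 * 6 * 3), dim = c(4, 8, 6, 3))

# New dimension order: samples, channels, height, width
images_cf <- aperm(images, c(1, 4, 2, 3))
dim(images_cf)

# The same pixel is now addressed with the channel index in second position
stopifnot(images_cf[2, 3, 5, 4] == images[2, 5, 4, 3])
```

Note that `aperm()` reorders the data itself, whereas simply assigning a new `dim` attribute would scramble the pixels.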
Read tabular data into R using the read.csv function.
Predictors (e.g. spatial coordinates, altitude and climatic variables
such as bioclim variables) should be standardised using the
scale function.
Note: there should be no missing values in the data! If you have 1,000 images and 1,000 response values with NAs, Cito/R will drop the NA observations in the response, meaning the number of observations will no longer match up. Of course, there should also be no NAs in the images!
Note: The order of the tabular data (responses and predictors) should match the order of the images.
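The tabular steps above can be sketched as follows. The file contents, the column names `ID`, `Temp` and `Precip`, and the vector `image_ids` are hypothetical; a temporary file stands in for Y.csv so that the example is self-contained.

```r
# Write a small hypothetical table to a temporary file (stand-in for Y.csv)
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(ID     = c("001", "002", "003", "004"),
                     Y      = c(1, 0, 1, 0),
                     Temp   = c(8.1, 9.4, 7.7, 10.2),
                     Precip = c(700, 820, 650, 910)),
          csv_path, row.names = FALSE)

# Read the IDs as character so that leading zeros are preserved
df <- read.csv(csv_path, colClasses = c(ID = "character"))

# There should be no missing values in responses or predictors
stopifnot(!anyNA(df))

# Standardise the predictors (mean 0, standard deviation 1)
df[, c("Temp", "Precip")] <- scale(df[, c("Temp", "Precip")])

# Hypothetical IDs derived from the image file names, in the order the images were read
image_ids <- c("001", "002", "003", "004")
stopifnot(identical(df$ID, image_ids))
```

If the final check fails, the table can be reordered with `match()` before any model is fitted.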
We can set up the architecture of the CNN with the
create_architecture function. CNNs usually consist of
several convolutional layers, each followed by a pooling layer,
and finally fully connected layers:
architecture <- create_architecture(conv(5), # convolutional layer with 5 kernels
                                    maxPool(), # max pooling layer to reduce the dimension of the feature maps
                                    conv(5), # convolutional layer with 5 kernels
                                    maxPool(), # max pooling layer to reduce the dimension of the feature maps
                                    linear(10)) # fully connected layer
The idea is that the convolutional layers learn to extract structures from the images, such as shapes and edges. These structures are then passed to the fully connected layer, which performs the actual classification or regression.
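To illustrate how convolution and pooling layers shrink the feature maps, the sketch below applies the standard output-size formula along one spatial dimension. The kernel sizes, strides and the input size of 100 are assumptions for illustration and are not necessarily cito's defaults; `out_size()` is a hypothetical helper.

```r
# Output size of a convolution or pooling layer along one spatial dimension
out_size <- function(in_size, kernel, stride = 1, padding = 0) {
  floor((in_size + 2 * padding - kernel) / stride) + 1
}

# Assumed settings: 3x3 convolutions with stride 1, 2x2 max pooling with stride 2
size <- 100                                      # input height and width
size <- out_size(size, kernel = 3)               # after first convolution
size <- out_size(size, kernel = 2, stride = 2)   # after first pooling
size <- out_size(size, kernel = 3)               # after second convolution
size <- out_size(size, kernel = 2, stride = 2)   # after second pooling

# With 5 kernels in the last convolutional layer, this many values
# are passed on to the fully connected layer
n_features <- 5 * size * size
n_features
```

Under these assumptions a 100 x 100 input is reduced to 23 x 23 feature maps, which shows why pooling keeps the fully connected part of the network small.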
Finding a good architecture can require a lot of experience and knowledge of CNNs. As an alternative, we recommend using transfer learning, which is also state of the art. Rather than training our own convolutional layers, we use a pre-trained CNN (usually trained on a large dataset with hundreds or thousands of response categories) and only train the final fully connected layer. It has been found that the convolutional layers often learn the same things, so there is no need to retrain them each time. This saves a lot of computational runtime, but more importantly, we don’t need as much training data because we only have to train a small part of our model:
architecture <- create_architecture(transfer("resnet18"), # use the pretrained resnet18 architecture
                                    linear(100)) # our fully connected layer
Also, with that, we don’t have to think about our own architecture!
Finally we can fit our model:
model <- cnn(X = LiDAR, Y, architecture, loss = "binomial",
             epochs = 10, validation = 0.1, lr = 0.05, device = device)
Note:
All of these points are described in the Introduction
to cito vignette and apply to the dnn() and
cnn() functions.
When the model is trained, we can make predictions via the
predict method:
pred = predict(model, LiDAR) # by default, predictions are on the scale of the link (so no probabilities)
pred_proba = predict(model, LiDAR, type = "response") # change type to get probabilities
The model can be visualized via plot(model).
Multi-modal neural networks (MMNs) are useful when:
Each complex input must be passed as its own multidimensional array, together with its own architecture:
architecture_LiDAR <- create_architecture(transfer("resnet18"))
architecture_RGBimages <- create_architecture(transfer("resnet18"))
model =
  mmn(df$Y ~
        cnn(X = LiDAR, architecture = architecture_LiDAR) +
        cnn(X = RGBimages, architecture = architecture_RGBimages) +
        dnn(~Temp+Precip, data = df),
      loss = 'binomial',
      optimizer = "adam")
Important: Tabular data must be within one
data.frame, so here, the response variable Y is in the same
data.frame as Temp and Precip!
For multiple responses:
model =
  mmn(cbind(df$Y1, df$Y2, df$Y3) ~
        cnn(X = LiDAR, architecture = architecture_LiDAR) +
        cnn(X = RGBimages, architecture = architecture_RGBimages) +
        dnn(~Temp+Precip, data = df),
      loss = 'binomial',
      optimizer = "adam")
New data must be passed as a list to the predict function. The datasets must be in the same order as the model components in the mmn:
Multiple different responses with different losses:
custom_joint_loss = function(pred, true) {
# first loss, e.g. binomial -> negative loglikelihood
loss1 = -torch::distr_bernoulli(logits = pred[,1])$log_prob(true[,1])$mean() # pred[,1] is already on the logit scale
# second loss, e.g. mse
loss2 = torch::nnf_mse_loss(pred[,2], true[,2])
# third loss, e.g. poisson
loss3 = -torch::distr_poisson(pred[,3]$exp())$log_prob(true[,3])$mean()
# return joint loss
return(loss1 + loss2 + loss3)
}
model =
  mmn(cbind(df$Y1, df$Y2, df$Y3) ~
        cnn(X = LiDAR, architecture = architecture_LiDAR) +
        cnn(X = RGBimages, architecture = architecture_RGBimages) +
        dnn(~Temp+Precip, data = df),
      loss = custom_joint_loss,
      epochs = 5L,
      optimizer = "adam")
In this case, cito lacks the inverse link functions, so we have to apply them to the predictions ourselves:
pred = predict(model, newdata = list(LiDAR, RGBimages, df))
pred[,1] = plogis(pred[,1]) # inverse logit for the binomial response
pred[,3] = exp(pred[,3]) # inverse log for the Poisson response
Note:
All of these points are described in the Introduction
to cito vignette and apply to the dnn() and
cnn() functions.
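For readers who want to see what the three terms of a joint loss like the one above are meant to compute, here is a base-R sketch without torch. The matrices `pred` and `true` are simulated, and the column layout (binomial, Gaussian, Poisson) is assumed to follow the custom loss example; the last lines apply the inverse link functions in the same way as after `predict()`.

```r
set.seed(42)
n <- 6

# Simulated link-scale predictions, one column per response
pred <- matrix(rnorm(n * 3), ncol = 3)

# Simulated observations: binary, continuous and count response
true <- cbind(rbinom(n, 1, 0.5), rnorm(n), rpois(n, 2))

# Column 1: mean negative Bernoulli log-likelihood (logit link)
loss1 <- -mean(dbinom(true[, 1], size = 1, prob = plogis(pred[, 1]), log = TRUE))

# Column 2: mean squared error (identity link)
loss2 <- mean((pred[, 2] - true[, 2])^2)

# Column 3: mean negative Poisson log-likelihood (log link)
loss3 <- -mean(dpois(true[, 3], lambda = exp(pred[, 3]), log = TRUE))

joint_loss <- loss1 + loss2 + loss3

# Inverse links turn link-scale predictions into response-scale predictions
pred_response <- pred
pred_response[, 1] <- plogis(pred[, 1])  # probabilities in (0, 1)
pred_response[, 3] <- exp(pred[, 3])     # positive expected counts
```

Summing the three terms weights them equally; if one response dominates the loss because of its scale, the terms could be weighted, which is a modelling decision outside this sketch.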
When working with convolutional neural networks (CNNs) and multimodal networks (MMNs), two critical computational factors are memory (RAM) and GPU availability.
CNNs benefit significantly from GPU acceleration:
batch_size = 20).
Currently, all images must be loaded into a single R session:
| Data Type | Dimensions | Memory Usage |
|---|---|---|
| LiDAR volume | (500, 500, 500, 500) | ~500 GB |
| Optical satellite images | (500, 3, 500, 500) | ~3 GB |
| Tabular data | Negligible | < 1 GB (ignored) |
| Model parameters (~2.24 M) | — | < 0.1 GB (negligible) |
| Total | | ~503 GB |
To run this dataset in a single R session you would need ~550
GB of system RAM. By contrast, GPU memory is less constraining
because only one batch is loaded at a time on the GPU. For
batch_size = 20, a 12–14 GB GPU should suffice.
Summary: The primary bottleneck is system RAM on the CPU, not VRAM on the GPU.
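The memory figures in the table can be reproduced with a short calculation. The helper `array_gb()` is hypothetical; the sketch assumes 8 bytes per value for R's double-precision arrays and 4 bytes per value for a batch held as 32-bit floats on the GPU, and it ignores activations and gradients.

```r
# Approximate size of a dense numeric array in GB (10^9 bytes)
array_gb <- function(dims, bytes_per_value = 8) {
  prod(dims) * bytes_per_value / 1e9
}

# System RAM: full arrays stored as doubles in R
lidar_gb <- array_gb(c(500, 500, 500, 500))
rgb_gb   <- array_gb(c(500, 3, 500, 500))
total_gb <- lidar_gb + rgb_gb

# GPU memory: only one batch of 20 samples at a time, assumed to be 32-bit floats
batch_gb <- array_gb(c(20, 500, 500, 500), bytes_per_value = 4) +
  array_gb(c(20, 3, 500, 500), bytes_per_value = 4)

c(lidar = lidar_gb, rgb = rgb_gb, total = total_gb, batch = batch_gb)
```

Under these assumptions the input data for one batch take roughly 10 GB, which is in line with the 12–14 GB GPU mentioned above, while the full dataset needs about 503 GB of system RAM.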