# Experimenting with autoregressive flows in TensorFlow Probability In the first part of this mini-series on autoregressive flow models, we looked at bijectors in TensorFlow Probability (TFP), and saw how to use them for sampling and density estimation. We singled out the affine bijector to demonstrate the mechanics of flow construction: We start from a distribution that is easy to sample from, and that allows for straightforward calculation of its density. Then, we attach some number of invertible transformations, optimizing for data-likelihood under the final transformed distribution. The efficiency of that (log)likelihood calculation is where normalizing flows excel: Loglikelihood under the (unknown) target distribution is obtained as a sum of the density under the base distribution of the inverse-transformed data plus the absolute log determinant of the inverse Jacobian.

Now, an affine flow will seldom be powerful enough to model nonlinear, complex transformations. In constrast, autoregressive models have shown substantive success in density estimation as well as sample generation. Combined with more involved architectures, feature engineering, and extensive compute, the concept of autoregressivity has powered – and is powering – state-of-the-art architectures in areas such as image, speech and video modeling.

This post will be concerned with the building blocks of autoregressive flows in TFP. While we won’t exactly be building state-of-the-art models, we’ll try to understand and play with some major ingredients, hopefully enabling the reader to do her own experiments on her own data.

This post has three parts: First, we’ll look at autoregressivity and its implementation in TFP. Then, we try to (approximately) reproduce one of the experiments in the “MAF paper” (Masked Autoregressive Flows for Distribution Estimation ) – essentially a proof of concept. Finally, for the third time on this blog, we come back to the task of analysing audio data, with mixed results.

In distribution estimation, autoregressivity enters the scene via the chain rule of probability that decomposes a joint density into a product of conditional densities:

$p(\mathbf{x}) = \prod_{i}p(\mathbf{x}_i|\mathbf{x}_{1:i−1})$

In practice, this means that autoregressive models have to impose an order on the variables – an order which might or might not “make sense.” Approaches here include choosing orderings at random and/or using different orderings for each layer.
While in recurrent neural networks, autoregressivity is conserved due to the recurrence relation inherent in state updating, it is not clear a priori how autoregressivity is to be achieved in a densely connected architecture. A computationally efficient solution was proposed in MADE: Masked Autoencoder for Distribution Estimation: Starting from a densely connected layer, mask out all connections that should not be allowed, i.e., all connections from input feature $$i$$ to said layer’s activations $$1 … i-1$$. Or expressed differently, activation $$i$$ may be connected to input features $$1 … i-1$$ only. Then when adding more layers, care must be taken to ensure that all required connections are masked so that at the end, output $$i$$ will only ever have seen inputs $$1 … i-1$$.

Thus masked autoregressive flows are a fusion of two major approaches – autoregressive models (which need not be flows) and flows (which need not be autoregressive). In TFP these are provided by MaskedAutoregressiveFlow, to be used as a bijector in a TransformedDistribution.

While the documentation shows how to use this bijector, the step from theoretical understanding to coding a “black box” may seem wide. If you’re anything like the author, here you might feel the urge to “look under the hood” and verify that things really are the way you’re assuming. So let’s give in to curiosity and allow ourselves a little escapade into the source code.

Peeking ahead, this is how we’ll construct a masked autoregressive flow in TFP (again using the still new-ish R bindings provided by tfprobability):

library(tfprobability)

hidden_layers = list(num_hidden, num_hidden),
activation = tf$nn$tanh)
)

Pulling apart the relevant entities here, tfb_masked_autoregressive_flow is a bijector, with the usual methods tfb_forward(), tfb_inverse(), tfb_forward_log_det_jacobian() and tfb_inverse_log_det_jacobian().
The default shift_and_log_scale_fn, tfb_masked_autoregressive_default_template, constructs a little neural network of its own, with a configurable number of hidden units per layer, a configurable activation function and optionally, other configurable parameters to be passed to the underlying dense layers. It’s these dense layers that have to respect the autoregressive property. Can we take a look at how this is done? Yes we can, provided we’re not afraid of a little Python.

masked_autoregressive_default_template (now leaving out the tfb_ as we’ve entered Python-land) uses masked_dense to do what you’d suppose a thus-named function might be doing: construct a dense layer that has part of the weight matrix masked out. How? We’ll see after a few Python setup statements.

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors
tf.enable_eager_execution()

current form on master), and when possible, simplified for better readability, accommodating just the specifics of the chosen example – a toy matrix of shape 2×3:

# construct some toy input data (this line obviously not from the original code)
inputs = tf.constant(np.arange(1.,7), shape = (2, 3))

# (partly) determine shape of mask from shape of input
input_depth = tf.compat.dimension_value(inputs.shape.with_rank_at_least(1)[-1])
num_blocks = input_depth
num_blocks # 3
mask = np.zeros([units, input_depth], dtype=tf.float32.as_numpy_dtype())
mask
def _gen_slices(num_blocks, n_in, n_out):
slices = []
col = 0
d_in = n_in // num_blocks
d_out = n_out // num_blocks
row = d_out
for _ in range(num_blocks):
row_slice = slice(row, None)
col_slice = slice(col, col + d_in)
slices.append([row_slice, col_slice])
col += d_in
row += d_out
return slices

slices = _gen_slices(num_blocks, input_depth, units)
for [row_slice, col_slice] in slices:

mask
layer = tf.compat.v1.layers.Dense(
units,
kernel_constraint=lambda x: mask * x   # 2
)
kernel_initializer = tf.compat.v1.glorot_normal_initializer()

return mask * kernel_initializer(shape, dtype, partition_info)
kernel_constraint=lambda x: mask * x 

Papamakarios, Pavlakou, and Murray 2017) applied masked autoregressive flows (as well as single-layer-MADE and Real NVP ) to a number of datasets, including MNIST, CIFAR-10 and several datasets from the UCI Machine Learning Repository.

We pick one of the UCI datasets: Gas sensors for home activity monitoring. On this dataset, the MAF authors obtained the best results using a MAF with 10 flows, so this is what we will try.

Collecting information from the paper, we know that

• data was included from the file ethylene_CO.txt only;
• discrete columns were eliminated, as well as all columns with correlations > .98; and
• the remaining 8 columns were standardised (z-transformed).

Regarding the neural network architecture, we gather that

• each of the 10 MAF layers was followed by a batchnorm;
• as to feature order, the first MAF layer used the variable order that came with the dataset; then every consecutive layer reversed it;
• specifically for this dataset and as opposed to all other UCI datasets, tanh was used for activation instead of relu;
• the Adam optimizer was used, with a learning rate of 1e-4;
• there were two hidden layers for each MAF, with 100 units each;
• training went on until no improvement occurred for 30 consecutive epochs on the validation set; and
• the base distribution was a multivariate Gaussian.

This is all useful information for our attempt to estimate this dataset, but the essential bit is this. In case you knew the dataset already, you might have been wondering how the authors would deal with the dimensionality of the data: It is a time series, and the MADE architecture explored above introduces autoregressivity between features, not time steps. So how is the additional temporal autoregressivity to be handled? The answer is: The time dimension is essentially removed. In the authors’ words,

[…] it is a time series but was treated as if each example were an i.i.d. sample from the marginal distribution.

This undoubtedly is useful information for our present modeling attempt, but it also tells us something else: We might have to look beyond MADE layers for actual time series modeling.

Now though let’s look at this example of using MAF for multivariate modeling, with no time or spatial dimension to be taken into account.

Following the hints the authors gave us, this is what we do.

Observations: 4,208,261
Variables: 19
$X1 <dbl> 0.00, 0.01, 0.01, 0.03, 0.04, 0.05, 0.06, 0.07, 0.07, 0.09,...$ X2  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$X3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...$ X4  <dbl> -50.85, -49.40, -40.04, -47.14, -33.58, -48.59, -48.27, -47.14,...
$X5 <dbl> -1.95, -5.53, -16.09, -10.57, -20.79, -11.54, -9.11, -4.56,...$ X6  <dbl> -41.82, -42.78, -27.59, -32.28, -33.25, -36.16, -31.31, -16.57,...
$X7 <dbl> 1.30, 0.49, 0.00, 4.40, 6.03, 6.03, 5.37, 4.40, 23.98, 2.77,...$ X8  <dbl> -4.07, 3.58, -7.16, -11.22, 3.42, 0.33, -7.97, -2.28, -2.12,...
$X9 <dbl> -28.73, -34.55, -42.14, -37.94, -34.22, -29.05, -30.34, -24.35,...$ X10 <dbl> -13.49, -9.59, -12.52, -7.16, -14.46, -16.74, -8.62, -13.17,...
$X11 <dbl> -3.25, 5.37, -5.86, -1.14, 8.31, -1.14, 7.00, -6.34, -0.81,...$ X12 <dbl> 55139.95, 54395.77, 53960.02, 53047.71, 52700.28, 51910.52,...
$X13 <dbl> 50669.50, 50046.91, 49299.30, 48907.00, 48330.96, 47609.00,...$ X14 <dbl> 9626.26, 9433.20, 9324.40, 9170.64, 9073.64, 8982.88, 8860.51,...
$X15 <dbl> 9762.62, 9591.21, 9449.81, 9305.58, 9163.47, 9021.08, 8966.48,...$ X16 <dbl> 24544.02, 24137.13, 23628.90, 23101.66, 22689.54, 22159.12,...
$X17 <dbl> 21420.68, 20930.33, 20504.94, 20101.42, 19694.07, 19332.57,...$ X18 <dbl> 7650.61, 7498.79, 7369.67, 7285.13, 7156.74, 7067.61, 6976.13,...
$X19 <dbl> 6928.42, 6800.66, 6697.47, 6578.52, 6468.32, 6385.31, 6300.97,... # we don't know if we'll end up with the same columns as the authors did, # but we try (at least we do end up with 8 columns) df <- df[,-(1:3)] hc <- findCorrelation(cor(df), cutoff = 0.985) df2 <- df[,-c(hc)] # scale df2 <- scale(df2) df2 # A tibble: 4,208,261 x 8 X4 X5 X8 X9 X13 X16 X17 X18 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 -50.8 -1.95 -4.07 -28.7 50670. 24544. 21421. 7651. 2 -49.4 -5.53 3.58 -34.6 50047. 24137. 20930. 7499. 3 -40.0 -16.1 -7.16 -42.1 49299. 23629. 20505. 7370. 4 -47.1 -10.6 -11.2 -37.9 48907 23102. 20101. 7285. 5 -33.6 -20.8 3.42 -34.2 48331. 22690. 19694. 7157. 6 -48.6 -11.5 0.33 -29.0 47609 22159. 19333. 7068. 7 -48.3 -9.11 -7.97 -30.3 47047. 21932. 19028. 6976. 8 -47.1 -4.56 -2.28 -24.4 46758. 21504. 18780. 6900. 9 -42.3 -2.77 -2.12 -27.6 46197. 21125. 18439. 6827. 10 -44.6 3.58 -0.65 -35.5 45652. 20836. 18209. 6790. # … with 4,208,251 more rows Now set up the data generation process: # train-test split n_rows <- nrow(df2) # 4208261 train_ids <- sample(1:n_rows, 0.5 * n_rows) x_train <- df2[train_ids, ] x_test <- df2[-train_ids, ] # create datasets batch_size <- 100 train_dataset <- tf$cast(x_train, tf$float32) %>% tensor_slices_dataset %>% dataset_batch(batch_size) test_dataset <- tf$cast(x_test, tf$float32) %>% tensor_slices_dataset %>% dataset_batch(nrow(x_test)) To construct the flow, the first thing needed is the base distribution. base_dist <- tfd_multivariate_normal_diag(loc = rep(0, ncol(df2))) Now for the flow, by default constructed with batchnorm and permutation of feature order. num_hidden <- 100 dim <- ncol(df2) use_batchnorm <- TRUE use_permute <- TRUE num_mafs <-10 num_layers <- 3 * num_mafs bijectors <- vector(mode = "list", length = num_layers) for (i in seq(1, num_layers, by = 3)) { maf <- tfb_masked_autoregressive_flow( shift_and_log_scale_fn = tfb_masked_autoregressive_default_template( hidden_layers = list(num_hidden, num_hidden), activation = tf$nn$tanh)) bijectors[[i]] <- maf if (use_batchnorm) bijectors[[i + 1]] <- tfb_batch_normalization() if (use_permute) bijectors[[i + 2]] <- tfb_permute((ncol(df2) - 1):0) } if (use_permute) bijectors <- bijectors[-num_layers] flow <- bijectors %>% discard(is.null) %>% # tfb_chain expects arguments in reverse order of application rev() %>% tfb_chain() target_dist <- tfd_transformed_distribution( distribution = base_dist, bijector = flow ) And configuring the optimizer: optimizer <- tf$train$AdamOptimizer(1e-4) Under that isotropic Gaussian we chose as a base distribution, how likely are the data? base_loglik <- base_dist %>% tfd_log_prob(x_train) %>% tf$reduce_mean()
base_loglik %>% as.numeric()        # -11.33871

base_loglik_test <- base_dist %>%
tfd_log_prob(x_test) %>%
tf$reduce_mean() base_loglik_test %>% as.numeric() # -11.36431 And, just as a quick sanity check: What is the loglikelihood of the data under the transformed distribution before any training? target_loglik_pre <- target_dist %>% tfd_log_prob(x_train) %>% tf$reduce_mean()
target_loglik_pre %>% as.numeric()        # -11.22097

target_loglik_pre_test <-
target_dist %>% tfd_log_prob(x_test) %>% tf$reduce_mean() target_loglik_pre_test %>% as.numeric() # -11.36431 The values match – good. Here now is the training loop. Being impatient, we already keep checking the loglikelihood on the (complete) test set to see if we’re making any progress. n_epochs <- 10 for (i in 1:n_epochs) { agg_loglik <- 0 num_batches <- 0 iter <- make_iterator_one_shot(train_dataset) until_out_of_range({ batch <- iterator_get_next(iter) loss <- function() - tf$reduce_mean(target_dist %>% tfd_log_prob(batch))
optimizer$minimize(loss) loglik <- tf$reduce_mean(target_dist %>% tfd_log_prob(batch))
agg_loglik <- agg_loglik + loglik
num_batches <- num_batches + 1

test_iter <- make_iterator_one_shot(test_dataset)
test_batch <- iterator_get_next(test_iter)
loglik_test_current <- target_dist %>% tfd_log_prob(test_batch) %>% tf$reduce_mean() if (num_batches %% 100 == 1) cat( "Epoch ", i, ": ", "Batch ", num_batches, ": ", (agg_loglik %>% as.numeric()) / num_batches, " --- test: ", loglik_test_current %>% as.numeric(), "\n" ) }) } With both training and test sets amounting to over 2 million records each, we did not have the patience to run this model until no improvement occurred for 30 consecutive epochs on the validation set (like the authors did). However, the picture we get from one complete epoch’s run is pretty clear: The setup seems to work pretty okay. Epoch 1 : Batch 1: -8.212026 --- test: -10.09264 Epoch 1 : Batch 1001: 2.222953 --- test: 1.894102 Epoch 1 : Batch 2001: 2.810996 --- test: 2.147804 Epoch 1 : Batch 3001: 3.136733 --- test: 3.673271 Epoch 1 : Batch 4001: 3.335549 --- test: 4.298822 Epoch 1 : Batch 5001: 3.474280 --- test: 4.502975 Epoch 1 : Batch 6001: 3.606634 --- test: 4.612468 Epoch 1 : Batch 7001: 3.695355 --- test: 4.146113 Epoch 1 : Batch 8001: 3.767195 --- test: 3.770533 Epoch 1 : Batch 9001: 3.837641 --- test: 4.819314 Epoch 1 : Batch 10001: 3.908756 --- test: 4.909763 Epoch 1 : Batch 11001: 3.972645 --- test: 3.234356 Epoch 1 : Batch 12001: 4.020613 --- test: 5.064850 Epoch 1 : Batch 13001: 4.067531 --- test: 4.916662 Epoch 1 : Batch 14001: 4.108388 --- test: 4.857317 Epoch 1 : Batch 15001: 4.147848 --- test: 5.146242 Epoch 1 : Batch 16001: 4.177426 --- test: 4.929565 Epoch 1 : Batch 17001: 4.209732 --- test: 4.840716 Epoch 1 : Batch 18001: 4.239204 --- test: 5.222693 Epoch 1 : Batch 19001: 4.264639 --- test: 5.279918 Epoch 1 : Batch 20001: 4.291542 --- test: 5.29119 Epoch 1 : Batch 21001: 4.314462 --- test: 4.872157 Epoch 2 : Batch 1: 5.212013 --- test: 4.969406  With these training results, we regard the proof of concept as basically successful. However, from our experiments we also have to say that the choice of hyperparameters seems to matter a lot. For example, use of the relu activation function instead of tanh resulted in the network basically learning nothing. (As per the authors, relu worked fine on other datasets that had been z-transformed in just the same way.) Batch normalization here was obligatory – and this might go for flows in general. The permutation bijectors, on the other hand, did not make much of a difference on this dataset. Overall the impression is that for flows, we might either need a “bag of tricks” (like is commonly said about GANs), or more involved architectures (see “Outlook” below). Finally, we wind up with an experiment, coming back to our favorite audio data, already featured in two posts: Simple Audio Classification with Keras and Audio classification with Keras: Looking closer at the non-deep learning parts. ## Analysing audio data with MAF The dataset in question consists of recordings of 30 words, pronounced by a number of different speakers. In those previous posts, a convnet was trained to map spectrograms to those 30 classes. Now instead we want to try something different: Train an MAF on one of the classes – the word “zero,” say – and see if we can use the trained network to mark “non-zero” words as less likely: perform anomaly detection, in a way. Spoiler alert: The results were not too encouraging, and if you are interested in a task like this, you might want to consider a different architecture (again, see “Outlook” below). Nonetheless, we quickly relate what was done, as this task is a nice example of handling data where features vary over more than one axis. Preprocessing starts as in the aforementioned previous posts. Here though, we explicitly use eager execution, and may sometimes hard-code known values to keep the code snippets short. library(tensorflow) library(tfprobability) tfe_enable_eager_execution(device_policy = "silent") library(tfdatasets) library(dplyr) library(readr) library(purrr) library(caret) library(stringr) # make decode_wav() run with the current release 1.13.1 as well as with the current master branch decode_wav <- function() if (reticulate::py_has_attr(tf, "audio")) tf$audio$decode_wav else tf$contrib$framework$python$ops$audio_ops$decode_wav # same for stft() stft <- function() if (reticulate::py_has_attr(tf, "signal")) tf$signal$stft else tf$spectral$stft files <- fs::dir_ls(path = "audio/data_1/speech_commands_v0.01/", # replace by yours recursive = TRUE, glob = "*.wav") files <- files[!str_detect(files, "background_noise")] df <- tibble( fname = files, class = fname %>% str_extract("v0.01/.*/") %>% str_replace_all("v0.01/", "") %>% str_replace_all("/", "") ) Audio classification with Keras: Looking closer at the non-deep learning parts, we’d like to train the network on spectrograms instead of the raw time domain data. Using the same settings for frame_length and frame_step of the Short Term Fourier Transform as in that post, we’d arrive at data shaped number of frames x number of FFT coefficients. To make this work with the masked_dense() employed in tfb_masked_autoregressive_flow(), the data would then have to be flattened, yielding an impressive 25186 features in the joint distribution. With the architecture defined as above in the GAS example, this lead to the network not making much progress. Neither did leaving the data in time domain form, with 16000 features in the joint distribution. Thus, we decided to work with the FFT coefficients computed over the complete window instead, resulting in 257 joint features. batch_size <- 100 sampling_rate <- 16000L data_generator <- function(df, batch_size) { ds <- tensor_slices_dataset(df) ds <- ds %>% dataset_map(function(obs) { wav <- decode_wav()(tf$read_file(tf$reshape(obs$fname, list())))
samples <- wav$audio[ ,1] # some wave files have fewer than 16000 samples padding <- list(list(0L, sampling_rate - tf$shape(samples)))
padded <- tf$pad(samples, padding) stft_out <- stft()(padded, 16000L, 1L, 512L) magnitude_spectrograms <- tf$abs(stft_out) %>% tf$squeeze() }) ds %>% dataset_batch(batch_size) } ds_train <- data_generator(df_train, batch_size) batch <- ds_train %>% make_iterator_one_shot() %>% iterator_get_next() dim(batch) # 100 x 257 Training then proceeded as on the GAS dataset. # define MAF base_dist <- tfd_multivariate_normal_diag(loc = rep(0, dim(batch))) num_hidden <- 512 use_batchnorm <- TRUE use_permute <- TRUE num_mafs <- 10 num_layers <- 3 * num_mafs # store bijectors in a list bijectors <- vector(mode = "list", length = num_layers) # fill list, optionally adding batchnorm and permute bijectors for (i in seq(1, num_layers, by = 3)) { maf <- tfb_masked_autoregressive_flow( shift_and_log_scale_fn = tfb_masked_autoregressive_default_template( hidden_layers = list(num_hidden, num_hidden), activation = tf$nn$tanh, )) bijectors[[i]] <- maf if (use_batchnorm) bijectors[[i + 1]] <- tfb_batch_normalization() if (use_permute) bijectors[[i + 2]] <- tfb_permute((dim(batch) - 1):0) } if (use_permute) bijectors <- bijectors[-num_layers] flow <- bijectors %>% # possibly clean out empty elements (if no batchnorm or no permute) discard(is.null) %>% rev() %>% tfb_chain() target_dist <- tfd_transformed_distribution(distribution = base_dist, bijector = flow) optimizer <- tf$train$AdamOptimizer(1e-3) # train MAF n_epochs <- 100 for (i in 1:n_epochs) { agg_loglik <- 0 num_batches <- 0 iter <- make_iterator_one_shot(ds_train) until_out_of_range({ batch <- iterator_get_next(iter) loss <- function() - tf$reduce_mean(target_dist %>% tfd_log_prob(batch))
optimizer$minimize(loss) loglik <- tf$reduce_mean(target_dist %>% tfd_log_prob(batch))
agg_loglik <- agg_loglik + loglik
num_batches <- num_batches + 1

loglik_test_current <-
target_dist %>% tfd_log_prob(ds_test) %>% tf\$reduce_mean()

if (num_batches %% 20 == 1)
cat(
"Epoch ",
i,
": ",
"Batch ",
num_batches,
": ",
((agg_loglik %>% as.numeric()) / num_batches) %>% round(1),
" --- test: ",
loglik_test_current %>% as.numeric() %>% round(1),
"\n"
)
})
}

During training, we also monitored loglikelihoods on three different classes, cat, bird and wow. Here are the loglikelihoods from the first 10 epochs. “Batch” refers to the current training batch (first batch in the epoch), all other values refer to complete datasets (the complete test set and the three sets selected for comparison).

epoch   |   batch  |   test   |   "cat"  |   "bird"  |   "wow"  |
--------|----------|----------|----------|-----------|----------|
1       |   1443.5 |   1455.2 |   1398.8 |    1434.2 |   1546.0 |
2       |   1935.0 |   2027.0 |   1941.2 |    1952.3 |   2008.1 |
3       |   2004.9 |   2073.1 |   2003.5 |    2000.2 |   2072.1 |
4       |   2063.5 |   2131.7 |   2056.0 |    2061.0 |   2116.4 |
5       |   2120.5 |   2172.6 |   2096.2 |    2085.6 |   2150.1 |
6       |   2151.3 |   2206.4 |   2127.5 |    2110.2 |   2180.6 |
7       |   2174.4 |   2224.8 |   2142.9 |    2163.2 |   2195.8 |
8       |   2203.2 |   2250.8 |   2172.0 |    2061.0 |   2221.8 |
9       |   2224.6 |   2270.2 |   2186.6 |    2193.7 |   2241.8 |
10      |   2236.4 |   2274.3 |   2191.4 |    2199.7 |   2243.8 |        

While this does not look too bad, a complete comparison against all twenty-nine non-target classes had “zero” outperformed by seven other classes, with the remaining twenty-two lower in loglikelihood. We don’t have a model for anomaly detection, as yet.

## Outlook

As already alluded to several times, for data with temporal and/or spatial orderings more evolved architectures may prove useful. The very successful PixelCNN family is based on masked convolutions, with more recent developments bringing further refinements (e.g. Gated PixelCNN , PixelCNN++ . Attention, too, may be masked and thus rendered autoregressive, as employed in the hybrid PixelSNAIL and the – not surprisingly given its name – transformer-based ImageTransformer .

To conclude, – while this post was interested in the intersection of flows and autoregressivity – and last not least the use therein of TFP bijectors – an upcoming one might dive deeper into autoregressive models specifically… and who knows, perhaps come back to the audio data for a fourth time.

Chen, Xi, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. 2017. “PixelSNAIL: An Improved Autoregressive Generative Model.” CoRR abs/1712.09763. http://arxiv.org/abs/1712.09763.
Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. 2016. “Density Estimation Using Real NVP.” CoRR abs/1605.08803. http://arxiv.org/abs/1605.08803.
Germain, Mathieu, Karol Gregor, Iain Murray, and Hugo Larochelle. 2015. “MADE: Masked Autoencoder for Distribution Estimation.” CoRR abs/1502.03509. http://arxiv.org/abs/1502.03509.
Oord, Aaron van den, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. 2016. “Conditional Image Generation with PixelCNN Decoders.” CoRR abs/1606.05328. http://arxiv.org/abs/1606.05328.
Papamakarios, George, Theo Pavlakou, and Iain Murray. 2017. “Masked Autoregressive Flow for Density Estimation.” arXiv e-Prints, May, arXiv:1705.07057. https://arxiv.org/abs/1705.07057.
Parmar, Niki, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, and Alexander Ku. 2018. “Image Transformer.” CoRR abs/1802.05751. http://arxiv.org/abs/1802.05751.
Salimans, Tim, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. 2017. “PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications.” CoRR abs/1701.05517. http://arxiv.org/abs/1701.05517.