TiDE - Forecasting in the lodging world

Note that the development of this article was stopped after I noticed a problem in the official TiDE implementation. After noticing the bug, I realized an issue had already been opened in the repo: LINK. You can still read the article as it provides some key insights into how to formalize the forecasting problem for lodging signals, but I'm not using this model in the end and won't provide any additional implementation details. I'm currently writing an article on how to implement the SOFTS paper in tensorflow for this same use case, which is a pretty cool paper (I made sure that there were no blatant issues this time 👀) and also quite similar to TiDE as it mostly uses simple dense layers in an encoder-decoder style. So stay tuned !

If you take the time to read the code and the paper carefully, you'll realize that the TiDE model is nothing but... a linear regression using the target lags ! All of the deep learning technicalities, the encoder and the decoder, end up unused in the final predictions because of this bug.

Now about the "bug": a LayerNorm is applied to a $(B, 1)$ tensor, where $B$ is the batch size. Because of how a LayerNorm works, applying it to such tensor will return a tensor of zeros of size $(B, 1)$. I never imagined that a paper with a Google stamp could have this kind of caveats, and I blindly went for an implementation. Reading code is never a waste of time though, I learned a nice way of batching time series when you have not one but many time series to forecast within one model. I strongly encourage you to read up to section 4.2 of this article if you want more detail on this. Now I just know that I have to be extra careful with those arxiv pre-prints...

Below is the article nevertheless:

Today we're going to be talking about forecasting time series in the lodging business industry. Signals such as demand and occupancy are key inputs for rolling out successful strategies around bidding, pricing, competitiveness, etc. It's therefore of major interest to be able to forecast them accurately and dynamically (i.e adjust the forecasts as market conditions change), which is what this article is about. We're going to look at one algorithm specifically: TiDE, a deep learning architecture published by Google Research not so long ago (2023) that I find very appropriate for lodging data.

1. Problem formulation

We're going to be looking at two different signals to forecast:

  • Occupancy: The % of rooms that are booked in a given market for a given time period. It's the number of reserved rooms divided by the total number of rooms
  • Demand: The number of searches (or clicks, or booking requests) that we observe on the website in a given market for a given time period
Both signals are spatio-temporal: they vary with the destination (or a market, essentially a geographical zone defined by some boundary) for which hotel rooms are booked or searched, and with the period (think about the last days of the year: occupancy and demand will usually be much higher during this period).

There is actually a third dimension to those signals, which is the booking window: the occupancy in Paris during the 2024 Olympics will not be the same whether you observe it 1 week or 1 year in advance. The booking window is the number of days that separate the observation date and the night that you're actually considering demand and occupancy for. To be as clean as possible, the signals in the lodging industry will usually be 3D: location ($i$), stay period ($t$) and booking window ($w$). We'll be using these notations throughout:
  • Occupancy: $O(i, t, w) = \frac{R(i, t, w)}{R(i, t, w) + A(i, t, w)}$, where $R$ and $A$ are respectively the number of reserved and available rooms
  • Demand: $D(i, t, w)$
When we're referring to either demand or occupancy we will just use $Y$ as a placeholder for either $O$ or $D$ to make the whole thing a little less verbose. 

So let's assume we have collected our data points $Y(i, t, w)$ for a bunch of $i$, $t$ and $w$. What's the problem we're trying to solve ? At a given observation date (let's denote it $t_0$), we're interested in predicting what the final demand and occupancy will be for all markets $\mathcal{I}$ and all stay dates in the future ($t_0 < t \leq t_0 + H$), where $H$ is the "Looking Forward Window" (or horizon) we're working with ($H$ can be as large as 6 months or one year). This means we want to build the estimates $\mathbf{\hat{Y}}(i, t, 0)$ with $t_0 < t \leq t_0 + H\,\, ; i \in \mathcal{I}$.

That's right: the final value of a signal for a given market / stay date is equivalent to considering that signal with 0 booking window ($w=0$), as after that the good has perished (you can't book a room that's in the past) and therefore the occupancy and demand won't change anymore.

2. Notes on data collection for $D$

The problem is well established now, but let's add some detail about what differentiates occupancy and demand. 
First thing of course is the range of the signals: occupancy is bounded in $\left[0;1\right]$ whereas demand is any positive number. 
Second thing is a little less obvious and is related to how you collect the demand signal. Let's assume the demand we're looking at is the number of searches that a market has received. Let's denote the number of searches at a given observation date $t_0$ for a given $(i, t)$ as $d(i, t, w)$ (just to make you practice with the notations, you can realize that $w = t - t_0$). I want to be clear that this is not the signal we want to forecast. Indeed, this signal is just the instantaneous demand that you have observed at a specific point in time, and will usually not be representative of how in-demand an $(i, t)$ couple is (travelers usually search for hotels during weekends, so you will typically find that $d$ is higher when observed on weekends than on weekdays). What we are rather interested in is the cumulated sum of this signal across windows of observation dates. This will give you a much clearer picture of how in-demand a stay date is. Keeping the same notations, we have:

$$D(i, t, w) = \sum_{0 \leq j \leq k}d(i, t, w + j)$$

where $k$ is some hyper-parameter to be determined during data collection. I recommend using a $k$ that is a multiple of 7 in order to smooth out the weekly effect of searches. I also recommend choosing $k$ quite large, ranging from 3 months to 6 months, in order to have a robust estimation of what the total demand really is. The ideal would be to have $k$ equal to the largest observed booking window for the $(i, t)$ couple (i.e the one that corresponds to the time the very first search was made for $(i, t)$), but this is often impractical: first, search data usually consists of very large datasets and aggregating over so much data can quickly become prohibitively costly; second, most travelers won't usually start searching for a place to stay earlier than 6 months in advance, so I didn't find that pushing $k$ above 6 months adds a lot of value. By choosing a larger $k$ you might end up burning computing resources for little to no added value.
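To make this concrete, here is a small pandas sketch (toy data and illustrative column names, not production code) of how $D$ could be built from the instantaneous searches $d$ for a single $(i, t)$ couple:

```python
import numpy as np
import pandas as pd

def forward_rolling_sum(s: pd.Series, k: int) -> pd.Series:
    """D(w) = sum of d over booking windows w, w+1, ..., w+k."""
    arr = s.to_numpy()[::-1]                 # reverse so the sum looks "forward" in booking window
    out = pd.Series(arr).rolling(k + 1, min_periods=1).sum().to_numpy()[::-1]
    return pd.Series(out, index=s.index)

# Toy data: one (market, stay date) couple, instantaneous searches per booking window.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "market": "Florida",
    "stay_date": pd.Timestamp("2024-07-14"),
    "booking_window": np.arange(180),
    "d": rng.poisson(5, size=180),
}).sort_values(["market", "stay_date", "booking_window"])

k = 28                                       # a multiple of 7 to smooth the weekly effect
df["D"] = df.groupby(["market", "stay_date"])["d"].transform(lambda s: forward_rolling_sum(s, k))
print(df.head())
```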

About $k$, it's important to note that taking large windows will hide some very dynamic behavior: if one stay date suddenly stops being searched (take the example of an event being cancelled), then you won't be able to see this if $k$ is too large. If your use case is about taking very dynamic decisions then I would recommend either coupling this signal with some bookings / cancellations data, or using a much shorter aggregation window (such as one week for instance). The choice of $k$ will naturally be a tradeoff between sparsity and dynamicity: if $k$ is too short then the signal is more dynamic but probably less robust, so smaller $k$ are better adapted for larger markets that receive a larger amount of searches.

As a final note on data collection, the occupancy is by definition cumulative: it already accounts for all reservations that happened up to the observation date, so we do not need to have a similar processing logic for this signal. 

3. Some visual illustrations

Now, time to plot some data ! We'll illustrate what we've been discussing so far, using Florida as a placeholder for $i$. In the plots below, we have rescaled the data so as to protect its sensitive nature (the rescaled versions are denoted $\tilde{Y}$). The goal of those plots is really to make you more familiar with the notations and how they relate to the different objects that we will be manipulating in the remaining parts of the article.

The signals we are manipulating are 3D objects, which can essentially be represented as 3D tensors. There are different ways to slice these tensors that are of particular interest:
  • The "Final Slice":  $\mathbf{Y}(i, t, 0)_{t_1 \leq t \leq t_2}$. This represents the final value of the signal for one market and one interval of stay dates. Note that we will also refer to this time series using the "slice notation" to get closer to the paper's notations: $\mathbf{Y}^{(i)}(t_1:t_2, 0)$
  • The "Looking Forward Slice": $\mathbf{Y}(i, t, t - t_1)_{t_1 < t \leq t_2}$. This represents the known signal as of some observation date $t_1$ for all stays before $t_2$. If today is $t_1$, then this slice is giving you tomorrow's known occupnacy as of today, after tomorrows' occupancy as of today etc. If not obvious, the difference in between the two dates (such as $t - t_1$ in the notations above) is the number of days that separates them. We will also refer to this time series using the "slice notation": $\mathbf{Y}^{(i)}(t_1:t_2, 1:H)$ where $H = t_2 - t_1$
This will become clear with the plotting, but you can already realize that both of the slices above are 1D time series.

The math can appear a little verbose, especially if this is the first time you've been introduced to lodging data, but trust me we need it to make things as clear as possible. I still want to show you that both slices are actually very easy to interpret once you have visualized them through example plots.
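Before the plots, here is a tiny numpy illustration (toy data, dates handled as integer indexes) of how the two slices are carved out of the 3D tensor $Y(i, t, w)$:

```python
import numpy as np

# Toy tensor Y[i, t, w]: markets x stay dates x booking windows.
Y = np.random.rand(3, 400, 365)

i, t1, t2 = 0, 100, 200

# "Final Slice": the final (w = 0) value of the signal for stay dates t1..t2.
final_slice = Y[i, t1:t2 + 1, 0]

# "Looking Forward Slice": stay date t observed at t1, i.e. with booking window w = t - t1.
stay_dates = np.arange(t1 + 1, t2 + 1)
forward_slice = Y[i, stay_dates, stay_dates - t1]

print(final_slice.shape, forward_slice.shape)    # both are 1D time series
```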

3.1 Final Slice

The first plot we'll be looking at is the "Final Slice" time series $\tilde{O}(i=\text{Florida}, t, w=0)$ where we choose $t$ to span 2 full years, from 2022-01-01 to 2024-01-01. To make things crystal clear, this plot (that I created using a Tableau workbook) represents the final occupancy in Florida from 2022-01-01 to 2024-01-01.

What we can see from this plot is that it shows both yearly and weekly seasonalities (you would probably find something very similar by looking at $\tilde{D}$ rather than $\tilde{O}$). More generally, the lodging signals are mostly seasonal, meaning that past behaviour could repeat itself in the future. This is a great incentive for us to include past data in our forecasting algorithms as covariates.



Remember from the introduction that this slice is essentially the target for the forecasting algorithm that we will build.

3.2 Looking Forward Slice

We now show a forward looking slice of $\tilde{O}$ for that same market, using October 31st 2023 as a placeholder for some observation date $t_1$. The plot shows the known occupancy as of October 31st 2023 for all the stay dates happening between November 1st 2023 (this is $t_1 + 1$, the leftmost data point in the plot) and some stay date far in the future ($t_2$ is around August 2024 in this example). Note that $t_2 = t_1 + H$, where we recall $H$ is the placeholder for our forecasting horizon.

As one could expect, the signal decreases as the booking window grows: as of October 31st 2023, travelers haven't started booking those stay dates far in the future yet. However, we can already notice some peaks around January 2024 and March 2024, which also correspond to the peaks from the Final Slice plot.


3.3 Putting the two together

You might start to feel how those different slices will interact in our forecasting algorithm, but let's write it down clearly nevertheless.
At a given observation date $t_0$, we wish to build estimates of the future final slices $\mathbf{\hat{Y}^{(i)}}((t_0 + 1):(t_0+ 1+ H), 0)$ given:
  • Some past (or lookback) final slices: $\mathbf{Y}^{(i)}((t_0 - L):t_0, 0)$
  • Some looking forward slices:  $\mathbf{Y}^{(i)}((t_0 + 1):(t_0 + 1+H), 1:H)$ 
We have introduced a new notation: $L$, which represents the "Look Back Window" and has the same function as $H$ but for data for which we know the final status as of the observation date. Note that the $L$ and $H$ variables we are using here are the same as those defined in the paper.

Our algorithm will typically be learning a mapping function $g$ such that:

$$\mathbf{\hat{Y}^{(i)}}((t_0+1):(t_0 + 1+H), 0) = g\left(\mathbf{Y}^{(i)}((t_0 - L):t_0, 0),\, \mathbf{Y}^{(i)}((t_0+1):(t_0 + 1+H), 1:H),\, \mathbf{X}\right)$$

where $\mathbf{X}$ represents any other feature that we consider being of interest.

Graphically, it would look like the illustration below: predict the green box using the blue and purple boxes. In this example we used Oct 31st 2023 as $t_0$ with $H$ roughly equal to 5 months and $L$ roughly equal to one year.


It's important to stress that the blue and green boxes have aligned timestamps. We tried to make this clear by using the light blue dotted lines that connect the two boxes. The blue box is essentially the green box that's still building up to its final state: we know all the data in the blue box when standing at $t_0$, which is why we're "allowed" to use it in our attempt to forecast the green Final Slice.
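To make the mapping fully concrete, here is a small numpy sketch (toy data, integer date indexes, illustrative names) of how one training example, i.e. the three boxes, could be extracted from the 3D tensor:

```python
import numpy as np

def make_training_example(Y, i, t0, L, H):
    """Extract the (purple, blue, green) boxes for one market i and observation date t0."""
    # Purple box: lookback Final Slice, final values of the L stay dates up to t0.
    lookback = Y[i, t0 - L:t0, 0]
    # Blue box: Looking Forward Slice, what is already known as of t0 about the H future stay dates.
    future_t = np.arange(t0 + 1, t0 + 1 + H)
    forward = Y[i, future_t, future_t - t0]
    # Green box: target Final Slice, the final values we want to predict.
    target = Y[i, future_t, 0]
    return lookback, forward, target

Y = np.random.rand(5, 800, 200)                     # toy tensor Y[i, t, w]
lookback, forward, target = make_training_example(Y, i=0, t0=400, L=300, H=150)
print(lookback.shape, forward.shape, target.shape)  # (300,), (150,), (150,)
```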

4. TiDE

Now is time to introduce the forecasting algorithm TiDE. There are different reasons why this algorithm is interesting for us:
  • It has been evaluated on long-term forecasting tasks, which is key in lodging: the earlier you can predict some signal for a given $(i, t)$ couple, the earlier you can take actions
  • It explicitly leverages known future data in its architecture, meaning we can probably make the best use of the Forward Looking Slice in this model
  • Because it uses Dense layers almost exclusively to build up its blocks, it can seamlessly handle univariate and multivariate time series. This is good for us as it means we can decide to predict $A$ and $R$ simultaneously and then compute $O$ from them (which ensures we get a quantity bounded in $\left[0;1\right]$ by applying the formula for $O$), rather than trying to predict $O$ directly
  • It's fast: replacing the SOTA recursive layers or transformers with Dense layer encoders / decoders makes training faster
  • It can handle any values for $L$ and $H$: both need not be the same

4.1 Model architecture

4.1.1 Overview

The architecture is fairly simple, so let's quickly get it out of the way. The authors published a self contained diagram that is reproduced here.


This architecture is essentially a "vanilla" encoder-decoder architecture with the following tweaks:
  • It's using a skip residual connection for the lookback data (our "purple box") (top left)
  • It's also fast-forwarding the forward slice data after a shallow encoding (bottom right to top right)
Some other things to note:
  • To us, $y^{(i)}_{1:L}$ is a Final Slice ($y^{(i)}_. = Y^{(i)}(., 0)$)
  • The dynamic covariates $x_{1: L+H}$ look like a Forward Looking Slice but are not exactly one. This is because our forward looking slice has no lookback (no timestamps for $1:L$ but only for $L+1:L+H$). The other reason why we can't compare the two is that in the TiDE architecture the dynamic covariates are common to all time series, whereas we have one different Forward Looking Slice for every time series. However, in the lodging case, we also have dynamic covariates that are constant across all time series (such as the trend, booking window, holiday flags etc.)
  • Their target $y$ is not multi-dimensional
  • Every dataset they use has more than one time series to predict. Such time series are indexed by $i$, and $y^{(i)}$ refers to the $i$th time series. Every $y^{(i)}$ is one-dimensional as stated above
  • They stress that the feature projection is "per time-step" for the dynamic covariates $\mathbf{x_{1:L+H}}$. What they mean is that the block is applied independently for every timestep: I actually made sure of that by looking at the authors' code, which I talk about in the next section. This means it will project the input data $(T, r)$ into a $(T, n_p)$ matrix (the time dimension stays the same)
  • For $y$ and attributes $a$ (which in the public implementation only consists in an embedding of the time series index $i$), the encoding is not per time-step but per instance (where every instance is one time series). This means it will project the input data of shape $(T, B)$ into a $(n_e, B)$ matrix (the time dimension does not stay the same)

4.1.2 Per Time-Step block forward pass

This section is a small digression on feeding time series data into Dense layers

Usually when working with time series, the first inputs to the model are going to be 3D tensors $z_i$ of shape $(B, T, F)$ where $B, T, F$ are respectively the batch, timestep and features dimensions. Let's see what happens if the first layer of our model is a "per time-step" dense layer.

Let's denote such a layer $f$ and $z_{it}$ the slice of $z_i$ for one time step $t$ ($z_{it}$ is of shape $(B, F)$). "Per time-step" application of $f$ means that the output tensor of the block will be the concatenation of the $f(z_{it})$: we compute $f$ for every timestep slice, each output having dimension $(B, n_p)$ (where $n_p$ is the output dimension of $f$), and then re-concatenate those outputs along the time axis into one final 3D output of size $(B, T, n_p)$.

This sounds pretty complicated implementation-wise but is actually handled out of the box: the dense layers already handle this behaviour (at least with the latest versions of TF and PyTorch) without any other coding required.
However, if the layers were more complicated than Dense layers and were only handling 2D inputs then there still exists a simple way in Tensorflow to apply it "per time-step" using the TimeDistributed wrapper. I built one example in this notebook that shows that standalone Dense has the exact same behaviour as wrapping a Dense with a TimeDistributed.
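For completeness, here is a quick check along the same lines (a toy snippet, not the notebook itself) showing that both give identical outputs when sharing the same weights:

```python
import numpy as np
import tensorflow as tf

B, T, F, n_p = 4, 10, 3, 8
x = tf.random.normal((B, T, F))

dense = tf.keras.layers.Dense(n_p)
out_dense = dense(x)                                  # (B, T, n_p), builds the (F, n_p) kernel
out_td = tf.keras.layers.TimeDistributed(dense)(x)    # same layer, applied explicitly per time step

print(np.allclose(out_dense.numpy(), out_td.numpy())) # True: identical outputs
```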

So in a nutshell, using TimeDistributed over Dense does the exact same thing as Dense when used on top of 3D inputs, but this doesn't mean that we won't be using TimeDistributed in our implementation. Indeed, TimeDistributed still gives us extra functionality such as masking, which can be used to deal with missing data (a specific case of which is sequences with uneven length) and which we don't get out of the box with standalone dense layers.

To conclude on this "per time-step" feature, to be 100% sure this is indeed what the authors meant you can take a look at the publicly shared code on GitHub.

4.1.3 Warning on TiDE implementation

Be aware that despite TiDE being a time-series-oriented architecture, the authors do not feed 3D inputs into their model.
  • For the lookback / lookforward slices $y_{1:L}$ and $y_{L+1:L+H}$, they directly feed tensors either of shape $(B, L)$ or $(B, H)$ into their residual blocks, where $B$ represents which time series in the dataset are being batched (the next section describes the batching process in detail). All of their time series are 1D, so having $(B, L, F=1)$ is the same as $(B, L)$
  • For the dynamic covariates $x_{1:L+H}$, they directly feed tensors of shape $(B=1, L+H, r)$ where $r$ is the notation they use in the paper for the number of dynamic covariates. Here the batch size is $1$ because all of their time series have the same dynamic covariates: they sample $B$ different time series $y^{(i)}$ ($i$ is representing a time series index here) in the dataset, but they do not perform such batching for the dynamic covariates as these are held constant for all $i$. They also simplify $(B=1, L+H, r)$ to $(L+H, r)$
This took me quite some time to fully grasp, hopefully this explanation will save you some !
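To fix ideas, here are the same shapes written out in a toy numpy form (the numbers are purely illustrative):

```python
import numpy as np

B, L, H, r = 8, 300, 150, 6

y_lookback = np.random.rand(B, L)        # y_{1:L}: B sampled 1D series, no feature axis
y_horizon = np.random.rand(B, H)         # y_{L+1:L+H}
x_dynamic = np.random.rand(L + H, r)     # shared dynamic covariates, no batch axis

print(y_lookback.shape, y_horizon.shape, x_dynamic.shape)
```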

4.2 Generating the training samples

4.2.1 How the authors do it

All-in-all, implementing the architecture with TF or PyTorch is pretty straightforward. As usual, most of our work will be to understand how to batch our data in order to generate training examples to feed the model with. 
This section is dedicated to explaining how the authors batch their data, so that we fully understand the process and can then tweak it to match our lodging data. This will be a little tedious because we will have to go through different sections of the codebase, put things together, manually extract a couple of data batches and trace back their origin. The kind of question we want to clarify is: if a dataset has $N$ different time series (which is our case: we have $\left|\mathcal{I}\right|$ time series, each series corresponding to one market), do they consider them as one multi-dimensional time series or as $N$ 1D series?

To that end, we will be focusing on the weather dataset. The repo gives pretty clean instructions on how to download and build this dataset so we won't be covering the data retrieval part and will directly assume you have the weather.csv data at hand.

Below is a screen capture of the data batching generation code (can be found here). It consists of two nested loops:
  • The top level: loops on (permuted) timestamp slices
  • Bottom level: loops on "mini-batches" of time series
The first loop can be summarized in the diagram below: all slices are generated by shifting some timestamp windows one step at a time, building $n$ slices in total. All slices are then permuted randomly so that they're not fed in the same order within each epoch.
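Before looking at the diagram, a minimal sketch of this top-level loop (my own simplification, not the authors' code) could look like this:

```python
import numpy as np

def generate_slices(n_timestamps, L, H, seed=0):
    """All (L + H)-long windows, shifted one step at a time, returned in a random order."""
    starts = np.arange(n_timestamps - (L + H) + 1)
    np.random.default_rng(seed).shuffle(starts)       # permute so slices are fed in random order
    return [(np.arange(s, s + L), np.arange(s + L, s + L + H)) for s in starts]

slices = generate_slices(n_timestamps=1000, L=300, H=150)
print(len(slices))                  # n = 551 slices in total
print(slices[0][0][:3], slices[0][1][:3])
```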




In the diagram above we have colored the slices according to the functional role of each column. T and Tdew are split in 2: the past data and the target (respectively purple and green), and X represents the dynamic covariates. Let's add some notational precision to make things as clear as possible:
  • $t$ is the Timestamp index: it's just there to tell you how the data is ordered
  • $T$ and $Tdew$ are examples of some $y^{(i)}$. For our lodging signals forecasting scenario, it will be some signals $Y^{(i)}(., .)$ (one time series of either demand or occupancy for every market $i$ in $\mathcal{I}$)
  • The purple block is $y_{1:L}$ and the green block is $y_{L+1:L+H}$
  • The blue block is $x_{1:L+H}$. In the github code, $x$ is common across all time series $y$. With the lodging use case we can't really identify $x$ with our Forward Looking Slice because these objects do not have data for indexes $1:L$ but only data for $L+1:L+H$ so we will either have to modify the TiDE architecture or tweak our data so it fits in
  • We can reflect on section 4.1.3: we can see that the purple / green blocks will be of dimension $(B, L)$ or $(B, H)$ whereas the blue one will be of size $(L+H, r)$: because covariates are constant across $i$, and because all $y^{(i)}$ are 1D, the model will be fed 2D inputs only
It's now clear what is meant when the authors say they use "mini-batches" of time series:
  • There is a top level batch which samples one window of time index
  • Within this one window, some time series are sampled and passed through the model. This is the "mini-batching" part
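In code, this inner sampling could look like the following sketch (again a simplification, drawing each series once and only once per slice):

```python
import numpy as np

def minibatches_within_slice(n_series, batch_size, seed=0):
    """Each series index i is drawn once and only once within the slice (no replacement)."""
    order = np.random.default_rng(seed).permutation(n_series)
    for start in range(0, n_series, batch_size):
        yield order[start:start + batch_size]

# e.g. 6 series split into K = 3 mini-batches of size 2 for the current slice
for series_idx in minibatches_within_slice(n_series=6, batch_size=2):
    print(series_idx)
```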
Below is an illustration of K mini-batches within one slice $s$ (for one slice, I believe you could choose to loop through all $i$ or make draws with replacement; in the public implementation they draw each series once and only once within a slice)


4.2.2 Some batch examples

In the notebook, we have generated one batch of data using our toy example with the weather dataset by specifying batch_size = 2 (we are sure that every mini batch will contain both series T and Tdew). Here are examples of the lookback and forward slices that were generated (when running the notebook you will have different plotted data as the slices are randomly permuted)



5. TiDE for lodging signals forecasting

Here we are: time to tweak the overall architecture so that it fits our lodging data.
Here are the differences that we wish to address:
  • We might have multi-dimensional series: to us, every $y^{(i)}$ may not be 1D (we might want to model $O$ as a 2D time series $(R, A)$)
  • We have another type of input that is neither $y$ nor $x$: the Forward Looking Slices. We believe these objects are very good predictors of the future final slices, especially for short booking windows where the final state is almost reached. We therefore need to find a way to fit those within the TiDE architecture

5.1 Batching our data

The data structure used in the publicly available code is essentially a table with rows being the time index and columns consisting of the time series to forecast and the common dynamic covariates. We cannot use such a structure out of the box to account for the differences laid out above. Another reason is that we need to account for scalability: we have more than 500 slices for more than 6K time series to train on, each slice having a different set of looking forward features and consisting of ~600 data points for every time series (we roughly have $L=H=300$), so we probably won't be able to afford loading everything in memory at once (we roughly have 200GB of data) !

The strategy we are going for is instead to split our data into different files in the tfrecords format, every file containing exactly one slice for all our 6K+ time series. Those of you who have read my other posts might have noticed I have a strong bias towards using Tensorflow rather than PyTorch for training models and loading my data, and this time is no exception !

Our workflow will therefore look like this:
  1. Write our data to tfrecords with one file per slice
  2. Load all files sequentially, and for each file batch on the time series index axis (i.e from 1 to 6K)
  3. Pass the batches to the model
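Here is a minimal sketch of step 1 (illustrative feature names and toy data, not a production writer): one TFRecord file per slice, one serialized example per market, with the shared covariates duplicated in every example for now.

```python
import numpy as np
import tensorflow as tf

def float_feature(arr):
    # Flatten any array into a FloatList, matching the flattened storage described below.
    return tf.train.Feature(float_list=tf.train.FloatList(value=np.ravel(arr)))

def write_slice(path, lookbacks, targets, forward_slices, x_past, x_future):
    with tf.io.TFRecordWriter(path) as writer:
        for i in range(len(lookbacks)):          # one example per time series / market
            example = tf.train.Example(features=tf.train.Features(feature={
                "lookback": float_feature(lookbacks[i]),
                "target": float_feature(targets[i]),
                "forward_slice": float_feature(forward_slices[i]),
                "x_past": float_feature(x_past),       # shared across all markets (duplicated)
                "x_future": float_feature(x_future),
            }))
            writer.write(example.SerializeToString())

# Toy call with random data: 10 markets, L = H = 300, F = 1, r = 4 covariates.
L = H = 300
write_slice(
    "slice_000.tfrecord",
    lookbacks=np.random.rand(10, L), targets=np.random.rand(10, H),
    forward_slices=np.random.rand(10, H),
    x_past=np.random.rand(L, 4), x_future=np.random.rand(H, 4),
)
```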

Here is what our initial data looks like (the data is first stored in parquet format, which is a structured format that accepts array types). Our data consists of 5 kinds of columns: the lookback ($y^{(i)}_{1:L}$), the target ($y^{(i)}_{L+1:L+H}$), the future and past covariates ($x_{1:L}$ and $x_{L+1:L+H}$), and then our additional forward slices that are neither $y$ nor $x$.


In our scenario, we can allow $F > 1$. If $F=1$ then we have the same setup as in the paper, i.e the time series to predict are all 1D.

In this raw data, all arrays are flattened for convenience (for instance, the lookback column is a 1D array of shape $L\times F$ rather than a 2D array of shape $(L, F)$), so we will need to reshape the data back to multi-dimensional arrays when loading it in tensorflow, before it's fed into the model.
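A matching parsing sketch (assuming the illustrative feature names used in the writer above) could look like this:

```python
import tensorflow as tf

L = H = 300
F, r = 1, 4

feature_spec = {
    "lookback": tf.io.FixedLenFeature([L * F], tf.float32),
    "target": tf.io.FixedLenFeature([H * F], tf.float32),
    "forward_slice": tf.io.FixedLenFeature([H * F], tf.float32),
    "x_past": tf.io.FixedLenFeature([L * r], tf.float32),
    "x_future": tf.io.FixedLenFeature([H * r], tf.float32),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    # Reshape the flattened arrays back to their multi-dimensional forms.
    return {
        "lookback": tf.reshape(parsed["lookback"], (L, F)),
        "target": tf.reshape(parsed["target"], (H, F)),
        "forward_slice": tf.reshape(parsed["forward_slice"], (H, F)),
        "x_past": tf.reshape(parsed["x_past"], (L, r)),
        "x_future": tf.reshape(parsed["x_future"], (H, r)),
    }

dataset = (
    tf.data.TFRecordDataset(["slice_000.tfrecord"])
    .map(parse_example)
    .batch(64)       # batch on the time series index axis, i.e. the mini-batching of the paper
)
```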

Note that this way of storing the data allows us to rigorously replicate the time series mini-batching scheme laid out in the paper, as long as we are careful about how we build the tensorflow dataset that will load the data. 

There is one drawback to this approach though, which is that the dynamic covariates are all equal within one file, which is bad for efficiency and speed (we will repeatedly load the same arrays). Looking at the mock table above, you can actually see it for yourself: the columns X_past and X_future are constant. There is a solution to this, which is to build a static lookup table that maps the time index to the dynamic covariates, but this is left as an exercise. I have access to big compute which makes me lazy sometimes, but you should really try to make this part efficient. The drawback is not just about loading the same arrays many times, it's also about having to reshape them and broadcast them again and again so that we can concatenate them with the other inputs. If the model trains for many epochs then this can significantly slow down the training process.
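For the curious, here is one possible shape for that lookup-table idea (a sketch with illustrative names): store only a slice id in each record and gather the shared covariates from an in-memory tensor at load time.

```python
import numpy as np
import tensorflow as tf

n_slices, L, H, r = 500, 300, 300, 4
# All dynamic covariates loaded once in memory: one (L + H, r) matrix per slice.
covariates_table = tf.constant(np.random.rand(n_slices, L + H, r), dtype=tf.float32)

def attach_covariates(example):
    """Gather the shared covariates from the in-memory table instead of storing them per example."""
    x = tf.gather(covariates_table, example["slice_id"])    # (L + H, r)
    example["x_past"], example["x_future"] = x[:L], x[L:]
    return example

# Toy usage: an example holding only its slice id and its own series data.
example = attach_covariates({"slice_id": tf.constant(42), "lookback": tf.zeros((L, 1))})
print(example["x_past"].shape, example["x_future"].shape)   # (300, 4) (300, 4)
```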


