Diviner with deep stationary processes
In this section, we introduce our proposed deep learning model, Diviner, which tackles the non-stationarity of long-term time series prediction with deep stationary processes, capturing multi-scale stable features and modeling multi-scale stable regularities to achieve long-term time series prediction.
Smoothing filter attention mechanism as a scale converter
As shown in Fig. 2a, the smoothing filter attention mechanism adjusts the feature scale and enables Diviner to model time series at different scales and access the multi-scale variation features within non-stationary time series. We build this component based on Nadaraya-Watson regression51,52, a classical algorithm for non-parametric regression. Given the sample space \(\Omega=\{(x_{i},y_{i})\mid 1\le i\le n,\ x_{i}\in\mathbb{R},\ y_{i}\in\mathbb{R}\}\), window size h, and kernel function K( ⋅ ), the Nadaraya–Watson regression has the following expression:
$$\hat{y}=\sum_{i=1}^{n}K\left(\frac{x-x_{i}}{h}\right)y_{i}\bigg/\sum_{j=1}^{n}K\left(\frac{x-x_{j}}{h}\right),$$ (1)
where the kernel function K( ⋅ ) is subject to \(\int_{-\infty}^{\infty}K(x)\,dx=1\) and n, x, y denote the sample size, independent variable, and dependent variable, respectively.
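For concreteness, a minimal NumPy sketch of Eq. (1) follows; the Gaussian kernel is an illustrative assumption (any kernel integrating to 1 qualifies), not a choice prescribed by the text.

```python
import numpy as np

def nadaraya_watson(x, xs, ys, h):
    """Nadaraya-Watson regression, Eq. (1): a locally weighted average
    whose weights decay with the distance of each x_i from the query x."""
    # Gaussian kernel K((x - x_i) / h); assumed here for illustration.
    k = np.exp(-0.5 * ((x - xs) / h) ** 2)
    return np.sum(k * ys) / np.sum(k)

# Usage: smooth a noisy sine at one query point.
rng = np.random.default_rng(0)
xs = np.linspace(0.0, 10.0, 200)
ys = np.sin(xs) + 0.1 * rng.standard_normal(xs.size)
print(nadaraya_watson(5.0, xs, ys, h=0.5))  # close to sin(5.0)
```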
Fig. 2: Illustration of the structure of the smoothing filter attention mechanism and the difference attention module. a This panel displays the smoothing filter attention mechanism, which involves computing adaptive weights \(K(\xi_{i},\xi_{j})\) (orange block) and employing a self-masked structure (grey block with dashed lines) to filter out the outliers, where \(\xi_{i}\) denotes the ith embedded time series interval (yellow block). The adaptive weights serve to adjust the feature scale of the input series and obtain the scale-transformed interval embedding \(h_{i}\) (pink block). b This diagram illustrates the difference attention module. The Matrix-Difference Transformation (pale blue block) subtracts adjacent columns of a matrix to obtain the shifted query, key, and value items (ΔQ, ΔK, and ΔV). Then, an autoregressive multi-head self-attention is performed (on the pale blue background) to capture the correlation of the time series across different time steps, yielding \(\widetilde{\mathbf{V}}_{s}^{(i)}\) for the ith attention head. Here, \(\mathbf{Q}_{s}^{(i)}\), \(\mathbf{K}_{s}^{(i)}\), \(\mathbf{V}_{s}^{(i)}\), and \(\widetilde{\mathbf{V}}_{s}^{(i)}\) represent the query, key, value, and result items, respectively. The \(\mathrm{SoftMax}\) is applied to the scaled dot-product between the query and key vectors to obtain attention weights (pale yellow block). The formula for the \(\mathrm{SoftMax}\) function is \(\mathrm{SoftMax}(\mathbf{k}_{i})=e^{\mathbf{k}_{i}}/\sum_{j=1}^{n}e^{\mathbf{k}_{j}}\), where \(\mathbf{k}_{i}\) is the ith element of the input vector and n is the length of the input vector. Finally, the Matrix-CumSum operation (light orange block) accumulates the shifted features using the ConCat operation, and \(\mathbf{W}_{s}\) denotes the learnable aggregation parameters.
The Nadaraya–Watson regression estimates the regression value \(\hat{y}\) using a locally weighted average method, where the weight of a sample \((x_{i},y_{i})\), \(K(\frac{x-x_{i}}{h})/\sum_{j=1}^{n}K(\frac{x-x_{j}}{h})\), decays with the distance of \(x_{i}\) from x. Consequently, the primary sample \((x_{i},y_{i})\) is closer to the samples in its neighborhood. This process implies the basic notion of scale transformation, where adjacent samples get closer on a more significant visual scale. Inspired by this idea, we can reformulate the Nadaraya–Watson regression from the perspective of scale transformation and incorporate it into the attention structure to design a learnable scale adjustment unit. Concretely, we introduce the smoothing filter attention mechanism with a learnable kernel function and a self-masked operation, where the former shrinks (or magnifies) variations for adaptive feature-scale adjustment, and the latter eliminates outliers. To ease understanding, we consider the 1D time series case here; the high-dimensional case can be easily extrapolated (shown mathematically in Section "Methods"). Given the time step \(t_{i}\), we estimate its regression value \(\hat{y}_{i}\) with an adaptive-weighted average of the values \(\{y_{t}\mid t\ne t_{i}\}\), \(\hat{y}_{i}=\sum_{j\ne i}\alpha_{j}y_{j}\), where the adaptive weights α are obtained by a learnable kernel function f. The punctured window \(\{t_{j}\mid t_{j}\ne t_{i}\}\) of size n − 1 denotes our self-masked operation, with \(f(y_{i},y)_{w_{i}}=\exp(w_{i}(y_{i}-y)^{2})\) and \(\alpha_{i}=f(y_{i},y)_{w_{i}}/\sum_{j\ne i}f(y_{j},y)_{w_{i}}\). Our adaptive weights fluctuate with the internal variation \(\{(y_{i}-y)^{2}\mid t_{i}\ne t\}\) (decreased or increased), which adjusts (shrinking or magnifying) the distance of points across each time step and achieves an adaptive feature-scale transformation. Specifically, a minor variation gets further shrunk at a large feature scale and magnified at a small feature scale, and vice versa. Regarding random components, global attention can serve as an average smoothing method to help filter small perturbations. As for outliers, their large margin against regular items leads to minor weights, which eliminates the interference of outliers. Especially when the sample \((t_{i},y_{i})\) happens to be an outlier, this structure brushes itself aside. Thus, the smoothing filter attention mechanism filters out random components and dynamically adjusts feature scales. In this way, we can dynamically transform non-stationary time series according to different scales, which accesses the time series from comprehensive perspectives.
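A small PyTorch sketch of this 1D formulation is given below; the single shared parameter w and the tensor layout are our assumptions for illustration, and the high-dimensional, learnable-kernel form appears in Section "Methods".

```python
import torch

def smoothing_filter_attention(y: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """1D smoothing filter attention: adaptive weights exp(w (y_i - y_j)^2),
    normalized over the punctured (self-masked) window {j : j != i}."""
    # Pairwise squared variations drive the adaptive feature-scale weights.
    diff2 = (y.unsqueeze(1) - y.unsqueeze(0)) ** 2   # (n, n)
    logits = w * diff2                               # w < 0 shrinks distant points
    # Self-mask: each step excludes itself, so an outlier cannot
    # contribute to its own regression value.
    logits.fill_diagonal_(float("-inf"))
    alpha = torch.softmax(logits, dim=1)             # rows sum to 1 over j != i
    return alpha @ y                                 # adaptive weighted average

# Usage: the outlier at index 2 is brushed aside by its own minor weights.
y = torch.tensor([0.10, 0.20, 5.00, 0.30, 0.40])
print(smoothing_filter_attention(y, torch.tensor(-1.0)))
```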
Difference attention module to discover stable regularities
The difference attention module calculates the internal connections among stable shifted features to discover stable regularities within the non-stationary time series and thereby overcomes the interference of uneven distributions. Concretely, as shown in Fig. 2b, this module comprises the difference and CumSum operations at both ends of the self-attention mechanism35, which interconnects the shift across each time step to capture internal connections within non-stationary time series. The difference operation separates the shifts from the long-term trends, where the shift refers to the minor difference in the trends between adjacent time steps. Considering that trends lead the data distribution to change over time, the difference operation makes the time series stable, varying around a fixed mean level with minor distribution shifts. Subsequently, we use a self-attention mechanism to interconnect shifts, which captures the temporal dependencies within the time series variation. Last, we employ a CumSum operation to accumulate the shifted features and generate a non-stationary time series conforming to the discovered regularities.
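The following PyTorch sketch illustrates the difference/attention/CumSum pipeline under stated assumptions: a generic causally masked multi-head self-attention stands in for the module's exact autoregressive attention, and the learnable aggregation \(\mathbf{W}_{s}\) of Fig. 2b is omitted.

```python
import torch
import torch.nn as nn

class DifferenceAttention(nn.Module):
    """Sketch: difference -> causal self-attention -> CumSum."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        # 1) Difference: subtract adjacent time steps to strip long-term
        #    trends, leaving near-stable shifts around a fixed mean level.
        dx = x[:, 1:, :] - x[:, :-1, :]
        # 2) Causal self-attention interconnects shifts across time steps.
        t = dx.shape[1]
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(dx, dx, dx, attn_mask=mask)
        # 3) CumSum re-accumulates the shifted features, restoring a
        #    non-stationary series that follows the discovered regularities.
        return x[:, :1, :] + torch.cumsum(out, dim=1)

# Usage
x = torch.randn(2, 16, 32)
print(DifferenceAttention(32)(x).shape)  # torch.Size([2, 15, 32])
```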
Modeling and generating non-stationary time series in the Diviner framework
The smoothing filter attention mechanism filters out random components and dynamically adjusts the feature scale. Subsequently, the difference attention module calculates internal connections and captures the stable regularity within the time series at the corresponding scale. Cascading these two modules, one Diviner block can discover stable regularities within non-stationary time series at one scale. Then, we stack Diviner blocks in a multilayer structure to achieve multi-scale transformation layers and capture multi-scale stable features from non-stationary time series. Such a multilayer structure is organized in an encoder-decoder architecture with asymmetric input lengths for efficient data utilization. The encoder takes a long historical series to embed trends, and the decoder receives a relatively short time series. With the cross-attention between the encoder and decoder, we can pair the latest time series with pertinent variation patterns from the long historical series and make inferences about future trends, improving calculation efficiency and reducing redundant historical information. The point is that the latest time series is more conducive to anticipating the immediate future than the remote-past time series, where the correlation across time steps usually degrades with the length of the interval53,54,55,56,57. Moreover, we design a generator to obtain prediction results in one step to avoid dynamic cumulative error problems39. The generator is built with a ConvNet sharing parameters across each time step based on the linear projection generator39,58,59, which saves hardware resources. These strategies enable deep learning methods to model non-stationary time series with multi-scale stable features and produce forecasting results in a generative paradigm, which is an attempt to tackle long-term time series prediction problems.
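To make the data flow concrete, a schematic sketch follows; generic Transformer layers stand in for the Diviner blocks above, and only the asymmetric encoder-decoder inputs, the cross-attention pairing, and the one-step generator mirror the description. All layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class DivinerStyleForecaster(nn.Module):
    """Sketch of the asymmetric encoder-decoder with a one-step generator."""

    def __init__(self, d_model=32, n_heads=4, horizon=24):
        super().__init__()
        # Stand-ins for stacked Diviner blocks (encoder and decoder sides).
        self.enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2)
        self.dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2)
        # Generator with parameters shared across time steps: it emits the
        # whole horizon in one step, avoiding cumulative autoregressive error.
        self.gen = nn.Conv1d(d_model, horizon, kernel_size=1)

    def forward(self, long_hist, short_hist):
        mem = self.enc(long_hist)               # long historical series
        dec = self.dec(short_hist, mem)         # short recent series + cross-attn
        pooled = dec.mean(dim=1, keepdim=True)  # (B, 1, d_model)
        return self.gen(pooled.transpose(1, 2)).squeeze(-1)  # (B, horizon)

# Usage: a long encoder input paired with a short decoder input.
x_long, x_short = torch.randn(2, 96, 32), torch.randn(2, 24, 32)
print(DivinerStyleForecaster()(x_long, x_short).shape)  # torch.Size([2, 24])
```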
Performance of the 5G network traffic forecasting
To validate the effectiveness of the proposed methods, we collect extensive NPT datasets from China Unicom. The NPT datasets comprise data recorded every 15 minutes for the whole of 2021 from three groups of real-world metropolitan network traffic ports {NPT-1, NPT-2, NPT-3}, where each sub-dataset contains {18, 5, 5} ports, respectively. We split them chronologically with a 9:1 proportion for training and testing. In addition, we prepare 16 network ports for parameter searching. The main difficulties lie in the explicit shift of the distribution and numerous outliers. This section elaborates on the comprehensive comparison of our model with prediction-based and growth-rate-based models in 5G network traffic forecasting.
Experiment 1
We first compare Diviner to other time series prediction-based methods; we note these baseline models as Baselines-T for clarity. Baselines-T include the traditional models ARIMA19,20 and Prophet26; the classic machine learning model LSTMa60; and the deep learning-based models Transformer35, Informer39, Autoformer42, and NBeats61. These models are required to predict the whole network traffic series {1, 3, 7, 14, 30} days ahead, aligned with the {96, 288, 672, 1344, 2880} prediction spans in Table 1, and inbits is the target feature. In terms of evaluation, although the MAE, MSE, and MASE predictive accuracy generally decreases with prediction intervals, the degradation rate varies between models. Therefore, we introduce an exponential velocity indicator to measure the rate of accuracy degradation. Specifically, given time spans \([t_{1},t_{2}]\) and the corresponding MSE, MAE, and MASE errors, we have the following:
$$\mathrm{dMSE}_{t_{1}}^{t_{2}}=\left(\sqrt[t_{2}-t_{1}]{\mathrm{MSE}_{t_{2}}/\mathrm{MSE}_{t_{1}}}-1\right)\times 100\%,$$ (2)
$$\mathrm{dMAE}_{t_{1}}^{t_{2}}=\left(\sqrt[t_{2}-t_{1}]{\mathrm{MAE}_{t_{2}}/\mathrm{MAE}_{t_{1}}}-1\right)\times 100\%,$$ (3)
$$\mathrm{dMASE}_{t_{1}}^{t_{2}}=\left(\sqrt[t_{2}-t_{1}]{\mathrm{MASE}_{t_{2}}/\mathrm{MASE}_{t_{1}}}-1\right)\times 100\%,$$ (4)
where \(\mathrm{dMSE}_{t_{1}}^{t_{2}},\mathrm{dMAE}_{t_{1}}^{t_{2}},\mathrm{dMASE}_{t_{1}}^{t_{2}}\in\mathbb{R}\). Given the close experimental results among {NPT-1, NPT-2, NPT-3}, we focus primarily on the results of the NPT-1 dataset; the experimental results are summarized in Table 1. Although there exist quantities of outliers and frequent oscillations in the NPT dataset, Diviner achieves a 38.58% average MSE reduction (0.451 → 0.277) and a 20.86% average MAE reduction (0.465 → 0.368) relative to the prior art. In terms of scalability to different prediction spans, Diviner has a much lower \(\mathrm{dMSE}_{1}^{30}\) (4.014% → 0.750%) and \(\mathrm{dMAE}_{1}^{30}\) (2.343% → 0.474%) than the prior art, which reflects a slight performance degradation with a substantial improvement in predictive robustness as the prediction horizon becomes longer. The degradation rates and predictive performance of all baseline approaches are provided in Supplementary Table S1 owing to the space limitation.
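A small helper implementing Eqs. (2)-(4) reads as follows; the error arguments can be MSE, MAE, or MASE values measured at spans t1 and t2.

```python
def degradation_rate(err_t1: float, err_t2: float, t1: float, t2: float) -> float:
    """Exponential accuracy-degradation indicator of Eqs. (2)-(4):
    the (t2 - t1)-th root of the error ratio, minus one, in percent."""
    return ((err_t2 / err_t1) ** (1.0 / (t2 - t1)) - 1.0) * 100.0

# Usage: a hypothetical MSE of 0.20 at 1 day growing to 0.25 at 30 days.
print(degradation_rate(0.20, 0.25, 1, 30))  # ~0.77 (% per day)
```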
Table 1 Time-series forecasting results on the 5G network traffic dataset.
The experiments on NPT-2 and NPT-3 shown in Supplementary Data 1 reproduce the above results, where Diviner can support accurate long-term network traffic prediction and exceed the existing art in both accuracy and robustness by a large margin. In addition, we obtain the following ranking by sorting the overall performances (measured by the average MASE errors) of the baselines built on the Transformer framework: Diviner > Autoformer > Transformer > Informer. This order aligns with the non-stationary factors considered in these models and verifies our proposal that incorporating non-stationarity promotes neural networks' adaptive abilities to model time series, and that modeling multi-scale non-stationarity further breaks through the ceiling of prediction abilities for deep learning models.
Experiment 2
The second experiment compares Diviner with two other industrial methods, which aim to predict the capacity utilization of inbits and outbits with historical growth rates. The experiment shares the same network port traffic data as in Experiment 1, while the split ratio is changed to 3:1 chronologically for a longer prediction horizon. Moreover, we use a long construction cycle of {30, 60, 90} days (aligned with {2880, 5760, 8640} time steps) to ensure the validity of such growth-rate-based methods by the law of large numbers. Here we first define capacity utilization mathematically:
Given a fixed bandwidth \(B\in\mathbb{R}\) and the traffic flow of the kth construction cycle \(\widetilde{\mathbf{X}}(k)=\left[\begin{array}{cccc}\tilde{\mathbf{x}}_{kC+1}&\tilde{\mathbf{x}}_{kC+2}&\ldots&\tilde{\mathbf{x}}_{(k+1)C}\end{array}\right]\), \(\widetilde{\mathbf{X}}(k)\in\mathbb{R}^{T\times C}\), where \(\tilde{\mathbf{x}}_{i}\in\mathbb{R}^{T}\) is a column vector of length T representing the time series per day and C denotes the number of days in one construction cycle. Then the capacity utilization (CU) of the kth construction cycle is defined as follows:
$$\mathrm{CU}(k)=\frac{\|\widetilde{\mathbf{X}}(k)\|_{m1}}{BCT},$$ (5)
where \(\mathrm{CU}(k)\in\mathbb{R}\). As shown in the definition, capacity utilization is directly related to network traffic, so a precise network traffic prediction leads to a high-quality prediction of capacity utilization. We compare the proposed predictive method with two moving average growth rate predictive methods commonly used in the industry: the additive and the multiplicative. For clarity, we note the additive method as Baseline-A and the multiplicative method as Baseline-M. Baseline-A calculates an additive growth rate with the difference of adjacent construction cycles. Given the capacity utilization of the last two construction cycles CU(k − 1), CU(k − 2), we have the following:
$$\widehat{\mathrm{CU}}_{A}(k)=2\,\mathrm{CU}(k-1)-\mathrm{CU}(k-2).$$ (6)
Baseline-M calculates a multiplicative growth rate with the quotient of adjacent construction cycles. Given the capacity utilization of the last two construction cycles CU(k − 1), CU(k − 2), we have the following:
$$\widehat{\mathrm{CU}}_{M}(k)=\frac{\mathrm{CU}(k-1)}{\mathrm{CU}(k-2)}\,\mathrm{CU}(k-1).$$ (7)
Different from the above two baselines, we calculate the capacity utilization of the network with the network traffic forecast. Given the network traffic of the last K construction cycles \(\widetilde{\mathbf{X}}=\left[\begin{array}{ccccccc}\tilde{\mathbf{x}}_{(k-K)C+1}&\ldots&\tilde{\mathbf{x}}_{(k-K+1)C}&\ldots&\tilde{\mathbf{x}}_{(k-1)C}&\ldots&\tilde{\mathbf{x}}_{kC}\end{array}\right]\), we have the following:
$$\widetilde{\mathcal{X}}(k)=\mathrm{Diviner}(\widetilde{\mathbf{X}}),$$ (8)
$$\widehat{\mathrm{CU}}_{D}(k)=\frac{\|\widetilde{\mathcal{X}}(k)\|_{m1}}{BCT}.$$ (9)
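A minimal sketch of Eqs. (5)-(7) is given below; treating the m1 norm as an entrywise absolute sum is our assumption about the notation, and a trained forecaster would replace the two extrapolations via Eqs. (8) and (9).

```python
import numpy as np

def capacity_utilization(traffic: np.ndarray, bandwidth: float) -> float:
    """Eq. (5): CU of one construction cycle. `traffic` is a (T, C) array
    (T steps per day, C days); the m1 norm is taken as the entrywise
    absolute sum (an assumption)."""
    T, C = traffic.shape
    return np.abs(traffic).sum() / (bandwidth * C * T)

def baseline_additive(cu_prev: float, cu_prev2: float) -> float:
    """Baseline-A, Eq. (6): linear extrapolation of adjacent cycles."""
    return 2.0 * cu_prev - cu_prev2

def baseline_multiplicative(cu_prev: float, cu_prev2: float) -> float:
    """Baseline-M, Eq. (7): geometric extrapolation of adjacent cycles."""
    return (cu_prev / cu_prev2) * cu_prev

# Usage: a 30-day cycle sampled every 15 minutes (96 steps per day).
rng = np.random.default_rng(0)
cycle = rng.uniform(0.0, 8.0, size=(96, 30))       # hypothetical traffic volumes
print(capacity_utilization(cycle, bandwidth=10.0))  # ~0.4
print(baseline_additive(0.55, 0.50))                # 0.60
print(baseline_multiplicative(0.55, 0.50))          # 0.605
```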
We summarize the experimental results in Table 2. Given the close experimental results among {NPT-1, NPT-2, and NPT-3}, we focus primarily on the results of the NPT-1 dataset, which has the most network traffic ports. Diviner achieves a substantial reduction of 31.67% MAE (0.846 → 0.578) on inbits and a reduction of 24.25% MAE (0.944 → 0.715) on outbits over Baseline-A. An intuitive explanation is that the growth-rate-based methods extract particular historical features but lack adaptability. We find that Baseline-A has a considerably better performance of 0.045× average inbits-MAE and 0.074× average outbits-MAE over Baseline-M. This result suggests that network traffic tends to increase linearly rather than exponentially. Nevertheless, there remain inherent multi-scale variations in network traffic series, so Diviner still exceeds Baseline-A, suggesting the necessity of applying deep learning models such as Diviner to discover nonlinear latent regularities within network traffic.
Table 2 Long-term (1–3 months) capacity utilization forecasting results on the NPT dataset.
Analyzing the results of these two experiments together, we find that Diviner possesses a relatively low degradation rate for a prediction of 90 days, \(\mathrm{dMASE}_{1}^{90}=1.034\%\). In contrast, the degradation rate of the prior art comes to \(\mathrm{dMASE}_{1}^{30}=2.343\%\) for a three-times shorter prediction horizon of 30 days. Moreover, considering the diverse network traffic patterns in the provided datasets (about 50 ports), the proposed method can deal with a wide range of non-stationary time series, validating its applicability without modification. These experiments witness Diviner's success in providing high-quality long-term network traffic forecasting and extending the effective prediction spans of deep learning models to up to three months.
Application on other real-world datasets
We validate our method on benchmark datasets for weather (WTH), electricity transformer temperature (ETT), electricity (ECL), and exchange (Exchange). We summarize the experimental results in Table 3. We follow the standard protocol and divide them into training, validation, and test sets in chronological order with a proportion of 7:1:2 unless otherwise specified. Owing to the space limitation, the complete experimental results are shown in Supplementary Data 2.
Table 3 Time-series forecasting results on other real-world datasets.
Weather temperature prediction
The WTH dataset42 records 21 meteorological indicators for Jena 2020, including air temperature and humidity, and WetBulbFarenheit is the target. This dataset is finely quantified to the 10-min level, which means that there are 144 steps for one day and 4320 steps for one month, thereby challenging the capacity of models to process long sequences. Among all baselines, NBeats and Informer have the lowest error in terms of MSE and MAE metrics, respectively. Nevertheless, we find a difference between these two models when extending prediction spans. Informer degrades precipitously when the prediction spans increase from 2016 to 4032 (MAE: 0.417 → 0.853), but on the contrary, NBeats gains a performance improvement (MAE: 0.635 → 0.434). We attribute this to a trade-off between pursuing context and texture. Informer has an advantage in texture in the short-term case. Nevertheless, it needs to capture the context dependency of the series, considering that the length of the input history series should extend in pace with prediction spans, and vice versa. As for Diviner, it achieves a remarkable 29.30% average MAE reduction (0.488 → 0.345) and a 41.54% average MSE reduction (0.491 → 0.287) over both Informer and NBeats. Moreover, Diviner attains low degradation rates of \(\mathrm{dMSE}_{1}^{30}=0.439\%\) and \(\mathrm{dMAE}_{1}^{30}=0.167\%\), showing its ability to harness historical information within time series. The predictive performances and degradation rates of all baseline approaches are provided in Supplementary Table S2. Our model can synthesize context and texture to balance both short-term and long-term cases, guaranteeing its accurate and robust long-term prediction.
Electricity transformer temperature prediction
The ETT dataset contains two-year data with six power load features from two counties in China, and oil temperature is our target. Its split ratio of training/validation/test sets is 12/4/4 months39. The ETT dataset is divided into two separate datasets at the 1-h level {ETTh1, ETTh2} and the 15-min level ETTm1. Therefore, we can compare the performance of the models under different granularities, where the prediction steps {96, 288, 672} of ETTm1 align with the prediction steps {24, 48, 168} of ETTh1. Our experiments show that Diviner achieves the best performance in both cases. Although in the hour-level case Diviner outperforms the baselines with MSE and MAE closest to Autoformer's (MSE: 0.110 → 0.082, MAE: 0.247 → 0.216), when the hour-level granularity turns to the minute-level case, Diviner outperforms Autoformer by a large margin (MSE: 0.092 → 0.064, MAE: 0.239 → 0.194). The predictive performances of all baseline approaches, and how they change when the hour-level granularity becomes the minute-level granularity, are provided in Supplementary Table S3. These results demonstrate the capability of Diviner in processing time series of different granularity. Moreover, granularity is also a manifestation of scale, and these results demonstrate that modeling multi-scale features is conducive to coping with time series of various granularity.
Consumer electricity consumption prediction
The ECL dataset records the two-year electricity consumption of 321 clients, which is converted into hour-level consumption owing to the missing data, and MT-320 is the target feature62. We predict different time horizons of {7, 14, 30, 40} days, aligned with {168, 336, 720, 960} prediction steps ahead. Next, we analyze the experimental results according to the prediction spans (≤360 as short-term prediction, ≥360 as long-term prediction). NBeats achieves the best forecasting performance for short-term electricity consumption prediction, whereas Diviner surpasses it in the long-term prediction case. The short-term and long-term performance of all approaches is provided in Supplementary Table S4. Statistically, the proposed method outperforms the best baseline (NBeats) by reducing 17.43% MSE (0.367 → 0.303) and 15.14% MAE (0.482 → 0.409) at 720 steps ahead, and 6.56% MSE (0.457 → 0.427) and 9.44% MAE (0.540 → 0.489) at 960 steps ahead. We attribute this to scalability, where different models converge to perform similarly in the short-term case, but their differences emerge when the prediction span becomes longer.
Gold price prediction
The Exchange dataset contains five years of daily closing prices of a troy ounce of gold in the US, recorded from 2016 to 2021. Owing to the high-frequency fluctuation of the market price, the predictive goal is rather to predict its general trend (https://www.lbma.org.uk). To this end, we perform a long-term prediction of {10, 20, 30, 60} days. The experimental results clearly show apparent performance degradation for most baseline models. Given a history of 90 days, only Autoformer and Diviner can predict with MAE and MSE errors lower than 1 when the prediction span is 60 days. Nevertheless, Diviner still outperforms the other methods with a 38.94% average MSE reduction (0.588 → 0.359) and a 22.73% average MAE reduction (0.607 → 0.469) and achieves the best forecast performance. The predictive performance of all baseline approaches is provided in Supplementary Table S5. These results indicate the adaptability of Diviner to the rapid evolution of financial markets and its reasonable extrapolation, considering that the financial system is generally difficult to predict.
Solar energy production prediction
The Solar dataset contains the 10-minute-level one-year (2006) solar power production data of 137 PV plants in Alabama, and PV-136 is the target feature (http://www.nrel.gov). Given that the amount of solar energy produced each day is generally stable, conducting super-long-term prediction is unnecessary. Therefore, we set the prediction horizon to {1, 2, 5, 6} days, aligned with {144, 288, 720, 864} prediction steps ahead. Moreover, this characteristic of solar energy means that its production series tend to be stationary, and thereby the comparison of predictive performances between different models on this dataset presents their fundamental series modeling abilities. Concretely, considering that the MASE error can be used to assess a model's performance on different series, we calculate and sort each model's average MASE error under different prediction horizon settings to measure time series modeling ability (provided in Supplementary Table S6). The results are as follows: Diviner > NBeats > Transformer > Autoformer > Informer > LSTM, where Diviner surpasses all Transformer-based models among the chosen baselines. Provided that the series data is not that non-stationary, the advantages of Autoformer's modeling of time series non-stationarity are not apparent, while capturing stable long- and short-term dependencies remains effective.
Road occupancy rate prediction
The Traffic dataset contains hourly two-year (2015–2016) road occupancy rates collected from 862 sensors on San Francisco Bay Area freeways by the California Department of Transportation, where sensor-861 is the target feature (http://pems.dot.ca.gov). The prediction horizon is set to {7, 14, 30, 40} days, aligned with {168, 336, 720, 960} prediction steps ahead. Considering that the road occupancy rate tends to have a weekly cycle, we use this dataset to test different networks' ability to model the temporal cycle. In the comparison, we primarily focus on the following two groups of deep learning models: group-1 takes the non-stationary specialization of time series into consideration (Diviner, Autoformer), and group-2 does not employ any time-series-specific components (Transformer, Informer, LSTMa). We find that group-1 gains a significant performance improvement over group-2, which suggests the necessity of modeling non-stationarity. As for the proposed Diviner model, it achieves a 27.64% MAE reduction (0.604 → 0.437) relative to the Transformer model when forecasting 30-day road occupancy rates. Subsequently, we conduct an intra-group comparison for group-1, where Diviner still gains an average 35.37% MAE reduction (0.523 → 0.338) relative to Autoformer. The predictive performance of all approaches is provided in Supplementary Table S7. We attribute this to Diviner's multi-scale modeling of non-stationarity, whereas the trend-seasonal decomposition of Autoformer merely reflects time series variation at particular scales. These experimental results demonstrate that Diviner is competent in predicting time series data with cycles.