
Triplet Sparsity Reduction

Last updated Feb 27, 2024

#work/patientsim

This note was originally created in contrast to the GRU-based autoencoder in the Merkelbach paper. That's right, I had the idea before I found it in the paper. I got a big head about it.

# My Version

# Existing Similar Approaches

Another paper (STraTS) already does this, for use in a transformer-based approach! That's where I get the triplet name from.

It's super similar to the way tokens themselves are treated in a transformer: the time, the feature-type embedding, and the value of that instance of the feature are akin to keys, queries, and values.
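To make that concrete, here is a toy sketch (mine, not from either paper) of what an observation-triplet representation of a sparse clinical time series looks like: only the measurements that were actually taken get stored, so there is no dense time-by-feature matrix full of missing values. Feature names and numbers below are made up.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    """One observed measurement: (time, feature, value)."""
    time: float    # e.g. hours since admission
    feature: str   # e.g. "heart_rate", "lactate"
    value: float   # the recorded measurement

# A sparse stay becomes a flat list of triplets -- no imputation,
# no padding of a (time x feature) matrix with NaNs.
stay = [
    Triplet(0.5, "heart_rate", 92.0),
    Triplet(0.5, "sbp", 118.0),
    Triplet(3.0, "lactate", 2.1),
    Triplet(7.25, "heart_rate", 101.0),
]
```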

# 3.2. Architecture of STraTS

The architecture of STraTS is illustrated in Figure 3. Unlike most existing approaches, which take a time-series matrix as input, STraTS defines its input as a set of observation triplets. Each observation triplet in the input is embedded using the Initial Triplet Embedding module. The initial triplet embeddings are then passed through a Contextual Triplet Embedding module, which utilizes the Transformer architecture to encode the context for each triplet. The Fusion Self-Attention module then combines these contextual embeddings via a self-attention mechanism to generate an embedding for the input time-series, which is concatenated with the demographics embedding and passed through a feed-forward network to make the final prediction. The notations used in the paper are summarized in Table 1.

# 3.2.1. Initial Triplet Embedding

Given an input time-series $\mathcal{T} = \{(t_i, f_i, v_i)\}_{i=1}^{n}$, the initial embedding for the $i$th triplet $\mathbf{e}_i \in \mathbb{R}^d$ is computed by summing the following component embeddings: (i) feature embedding $\mathbf{e}_i^f \in \mathbb{R}^d$, (ii) value embedding $\mathbf{e}_i^v \in \mathbb{R}^d$, and (iii) time embedding $\mathbf{e}_i^t \in \mathbb{R}^d$. In other words, $\mathbf{e}_i = \mathbf{e}_i^f + \mathbf{e}_i^v + \mathbf{e}_i^t \in \mathbb{R}^d$. Feature embeddings $\mathbf{e}^f(\cdot)$ are obtained from a simple lookup table similar to word embeddings. Since feature values and times are continuous, unlike feature names which are categorical objects, we cannot use a lookup table to embed these continuous values unless they are categorized. Some researchers (Vaswani et al., 2017; Yin et al., 2020) have used sinusoidal encodings to embed continuous values. We propose a novel continuous value embedding (CVE) technique using a one-to-many Feed-Forward Network (FFN) with learnable parameters, i.e., $\mathbf{e}_i^v = \mathrm{FFN}^v(v_i)$ and $\mathbf{e}_i^t = \mathrm{FFN}^t(t_i)$.

Both FFNs have one input neuron and $d$ output neurons, and a single hidden layer with $\lfloor\sqrt{d}\rfloor$ neurons and $\tanh(\cdot)$ activation. They are of the form $\mathrm{FFN}(x) = \mathbf{U}\tanh(\mathbf{W}x + \mathbf{b})$, where the dimensions of the weights $\{\mathbf{W}, \mathbf{b}, \mathbf{U}\}$ can be inferred from the sizes of the hidden and output layers of the FFN. Unlike sinusoidal encodings with fixed frequencies, this technique offers more flexibility by allowing end-to-end learning of continuous value and time embeddings without the need to categorize them.
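A minimal PyTorch sketch of how I read the initial triplet embedding: a lookup table for the feature name plus two one-to-many tanh FFNs (the CVE) for value and time, summed into a single $d$-dimensional vector. Class and argument names are mine, not from the STraTS code release.

```python
import math
import torch
import torch.nn as nn

class CVE(nn.Module):
    """Continuous Value Embedding: a scalar -> R^d via U * tanh(W x + b)."""
    def __init__(self, d: int):
        super().__init__()
        hidden = int(math.sqrt(d))  # floor(sqrt(d)) hidden neurons
        self.ffn = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(),
            nn.Linear(hidden, d, bias=False),  # output layer U has no bias
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ffn(x.unsqueeze(-1))  # (batch, n) -> (batch, n, d)

class InitialTripletEmbedding(nn.Module):
    """e_i = e_i^f + e_i^v + e_i^t for every observation triplet."""
    def __init__(self, n_features: int, d: int):
        super().__init__()
        self.feature_emb = nn.Embedding(n_features, d)  # lookup table for f_i
        self.value_emb = CVE(d)                         # FFN^v(v_i)
        self.time_emb = CVE(d)                          # FFN^t(t_i)

    def forward(self, times, feature_ids, values):
        # times, values: float tensors (batch, n); feature_ids: long tensor (batch, n)
        return self.feature_emb(feature_ids) + self.value_emb(values) + self.time_emb(times)
```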

Figure 3. The overall architecture of the proposed STraTS model. The Initial Triplet Embedding module embeds each observation triplet, the Contextual Triplet Embedding module encodes contextual information for the triplets, and the Fusion Self-Attention module computes the time-series embedding, which is concatenated with the demographics embedding and passed through a dense layer to generate predictions for the target and self-supervision (forecasting) tasks.

# 3.2.2. Contextual Triplet Embedding

The initial triplet embeddings $\{\mathbf{e}_1, \dots, \mathbf{e}_n\}$ are then passed through a Transformer architecture (Vaswani et al., 2017) with $M$ blocks, each containing a Multi-Head Attention (MHA) layer with $h$ attention heads and an FFN with one hidden layer. Each block takes the $n$ input embeddings $\mathbf{E} \in \mathbb{R}^{n \times d}$ and outputs the corresponding $n$ output embeddings $\mathbf{C} \in \mathbb{R}^{n \times d}$ that capture the contextual information. MHA layers use multiple attention heads to attend to information contained in different embedding projections in parallel. The computations of the MHA layer are given by

$$\mathbf{H}_j = \operatorname{softmax}\!\left(\frac{(\mathbf{E}\mathbf{W}_j^q)(\mathbf{E}\mathbf{W}_j^k)^T}{\sqrt{d/h}}\right)(\mathbf{E}\mathbf{W}_j^v), \qquad j = 1, \dots, h \tag{1}$$

$$\operatorname{MHA}(\mathbf{E}) = (\mathbf{H}_1 \circ \dots \circ \mathbf{H}_h)\,\mathbf{W}^c \tag{2}$$

Each head projects the input embeddings into query, key, and value subspaces using matrices $\{\mathbf{W}_j^q, \mathbf{W}_j^k, \mathbf{W}_j^v\} \subset \mathbb{R}^{d \times d_h}$. The queries and keys are then used to compute the attention weights, which are used to compute weighted averages of the value vectors (different from the value in an observation triplet). Finally, the outputs of all heads are concatenated and projected back to the original dimension with $\mathbf{W}^c \in \mathbb{R}^{hd_h \times d}$. The FFN layer takes the form

$$\mathbf{F}(\mathbf{X}) = \operatorname{ReLU}(\mathbf{X}\mathbf{W}_1^f + \mathbf{b}_1^f)\,\mathbf{W}_2^f + \mathbf{b}_2^f \tag{3}$$

with weights $\mathbf{W}_1^f \in \mathbb{R}^{d \times 2d}$, $\mathbf{b}_1^f \in \mathbb{R}^{2d}$, $\mathbf{W}_2^f \in \mathbb{R}^{2d \times d}$, $\mathbf{b}_2^f \in \mathbb{R}^{d}$. Dropout, residual connections, and layer normalization are added for every MHA and FFN layer. Also, attention dropout randomly masks out some positions in the attention matrix before the softmax computation during training. The output of each block is fed as input to the succeeding one, and the output of the last block gives the contextual triplet embeddings $\{\mathbf{c}_1, \dots, \mathbf{c}_n\}$.
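A rough PyTorch rendering of one contextual-embedding block as I understand Eqs. (1)-(3): multi-head self-attention over the $n$ triplet embeddings, a $2d$-wide ReLU FFN, and dropout plus residual connections plus layer normalization around both. I lean on `nn.MultiheadAttention` for the per-head projections and the output projection $\mathbf{W}^c$; the exact placement of the norms is my assumption, not something checked against the authors' code.

```python
import torch
import torch.nn as nn

class ContextualBlock(nn.Module):
    """One of the M Transformer blocks: MHA (Eqs. 1-2) + position-wise FFN (Eq. 3)."""
    def __init__(self, d: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        # nn.MultiheadAttention bundles the per-head Q/K/V projections,
        # softmax(QK^T / sqrt(d/h)) V, and the final output projection W^c.
        self.mha = nn.MultiheadAttention(d, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 2 * d), nn.ReLU(), nn.Linear(2 * d, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)

    def forward(self, e: torch.Tensor, pad_mask=None) -> torch.Tensor:
        # e: (batch, n, d) embeddings from the previous block (or the initial ones)
        attn_out, _ = self.mha(e, e, e, key_padding_mask=pad_mask)
        e = self.norm1(e + self.drop(attn_out))     # residual + layer norm around MHA
        e = self.norm2(e + self.drop(self.ffn(e)))  # residual + layer norm around FFN
        return e  # contextual embeddings C

# Stacking M of these gives the contextual triplet embeddings {c_1, ..., c_n}:
# blocks = nn.ModuleList(ContextualBlock(d=64, n_heads=4) for _ in range(M))
```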

# 3.2.3. Fusion Self-attention

After computing contextual embeddings using a Transformer, we fuse them using a self-attention layer to compute the time-series embedding $\mathbf{e}^T \in \mathbb{R}^d$. This layer first computes attention weights $\{\alpha_1, \dots, \alpha_n\}$ by passing each contextual embedding through an FFN and computing a softmax over all the FFN outputs.

$$a_i = \mathbf{u}_a^T \tanh(\mathbf{W}_a \mathbf{c}_i + \mathbf{b}_a) \tag{4}$$

$$\alpha_i = \frac{\exp(a_i)}{\sum_{j=1}^{n} \exp(a_j)} \qquad \forall\, i = 1, \dots, n \tag{5}$$

$\mathbf{W}_a \in \mathbb{R}^{d_a \times d}$, $\mathbf{b}_a \in \mathbb{R}^{d_a}$, and $\mathbf{u}_a \in \mathbb{R}^{d_a}$ are the weights of this attention network, which has $d_a$ neurons in its hidden layer. The time-series embedding is then computed as

$$\mathbf{e}^T = \sum_{i=1}^{n} \alpha_i \mathbf{c}_i \tag{6}$$
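Eqs. (4)-(6) amount to an additive-attention pooling over the $n$ contextual embeddings. A sketch with my own names ($d_a$ is the attention hidden size):

```python
import torch
import torch.nn as nn

class FusionSelfAttention(nn.Module):
    """Pool contextual embeddings c_1..c_n into one time-series embedding e^T."""
    def __init__(self, d: int, d_a: int):
        super().__init__()
        self.proj = nn.Linear(d, d_a)               # W_a, b_a
        self.score = nn.Linear(d_a, 1, bias=False)  # u_a

    def forward(self, c: torch.Tensor, pad_mask=None) -> torch.Tensor:
        # c: (batch, n, d) contextual triplet embeddings
        a = self.score(torch.tanh(self.proj(c))).squeeze(-1)  # Eq. 4: (batch, n)
        if pad_mask is not None:
            a = a.masked_fill(pad_mask, float("-inf"))        # ignore padded triplets
        alpha = torch.softmax(a, dim=-1)                       # Eq. 5
        return (alpha.unsqueeze(-1) * c).sum(dim=1)            # Eq. 6: (batch, d)
```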

# 3.2.4. Demographics Embedding

We realize that demographics can be encoded as triplets with a default value for time. However, we found that the prediction models performed better in our experiments when demographics are processed separately by passing $\mathbf{d}$ through an FFN as shown below. The demographics embedding is thus obtained as

$$\mathbf{e}^d = \tanh\!\left(\mathbf{W}_2^d \tanh(\mathbf{W}_1^d \mathbf{d} + \mathbf{b}_1^d) + \mathbf{b}_2^d\right) \in \mathbb{R}^d \tag{7}$$

where the hidden layer has a dimension of $2d$.
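Eq. (7) is just a two-layer tanh MLP over the raw demographics vector; a sketch, where `d_demo` (the number of demographic inputs) is my placeholder name:

```python
import torch.nn as nn

def demographics_encoder(d_demo: int, d: int) -> nn.Sequential:
    """Eq. 7: e^d = tanh(W_2^d tanh(W_1^d d + b_1^d) + b_2^d), hidden width 2d."""
    return nn.Sequential(
        nn.Linear(d_demo, 2 * d), nn.Tanh(),  # W_1^d, b_1^d
        nn.Linear(2 * d, d), nn.Tanh(),       # W_2^d, b_2^d
    )
```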

# 3.2.5. Prediction Head

The final prediction for the target task is obtained by passing the concatenation of the demographics and time-series embeddings through a dense layer with weights $\mathbf{w}_o \in \mathbb{R}^{2d}$, $b_o \in \mathbb{R}$, and sigmoid activation.

$$\tilde{y} = \operatorname{sigmoid}\!\left(\mathbf{w}_o^T[\mathbf{e}^d \circ \mathbf{e}^T] + b_o\right) \tag{8}$$

The model is trained on the target task using cross-entropy loss.
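Putting the head together, still as a hedged sketch: concatenate the demographics and time-series embeddings, apply a single sigmoid unit (Eq. 8), and train against binary cross-entropy. Shapes assume the $\mathbf{e}^d$ and $\mathbf{e}^T$ sketches above.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Eq. 8: y_hat = sigmoid(w_o^T [e^d ; e^T] + b_o)."""
    def __init__(self, d: int):
        super().__init__()
        self.out = nn.Linear(2 * d, 1)  # w_o in R^{2d}, b_o in R

    def forward(self, e_demo: torch.Tensor, e_ts: torch.Tensor) -> torch.Tensor:
        logits = self.out(torch.cat([e_demo, e_ts], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)  # (batch,)

# Training: loss = nn.functional.binary_cross_entropy(y_hat, y_true)
```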
