A recurrent neural network for rapid detection of delivery errors during real-time portal dosimetry

Highlights
• Real-time portal dosimetry can detect errors in volumetric modulated arc therapy.
• Neural networks avoid false positive errors during intrafraction portal dosimetry.
• Error detection is 30% earlier with an artificial neural network than with thresholds.


Introduction
Portal dosimetry is widely used to ensure the dosimetric accuracy of radiotherapy delivery [1][2][3][4]. In forward-projection, portal images are predicted at the time of treatment planning and the measured images are then compared with them [5][6][7]. In back-projection, measured images are projected onto the CT scan of the patient and converted into a dose distribution, which is then compared with the planned dose distribution [8][9][10][11][12]. Groups of images are selected to represent the segments of volumetric modulated arc therapy (VMAT) [13,14].
Usually, images for completed fractions of treatment are analysed. However, there is growing interest in analysing the measured images as the treatment fraction proceeds. In this way, it is possible to identify errors before significant dosimetric impact occurs for the patient [15][16][17][18][19], particularly for hypofractionated treatments [20], which are becoming increasingly commonplace [21][22][23]. The real-time method is also time-resolved, which allows a more thorough analysis than is possible with integrated images or dose [24,25]. Typically, errors are detected by setting thresholds for a number of image features or measures, and then watching for the measures to exceed the thresholds [26], preferably avoiding false positives, which are disruptive in the real-time context [27].
Use of an accurate prediction model is an important means of providing sensitivity to errors while avoiding false positives. However, another possible means of increasing reliability is to use an artificial neural network. Simple neural networks have been used in the radiotherapy context before, such as for prediction of biological outcomes [28] and for pre-treatment quality assurance [29], and more complex neural networks are increasingly used in radiotherapy for deep learning in structure delineation and treatment planning [30][31][32][33]. However, they have so far not been used in the context of error detection in portal dosimetry.
This study therefore investigated the training of a simple artificial neural network to detect errors based on the supplied image measures at each time point. The study was a proof of principle of a recurrent neural network (RNN) approach, using VMAT treatment of the prostate as an illustration.

Materials and methods
There were several types of neural network that could be used for this application, but the RNN was used in this study because it could not only learn from training data, but also had the ability to learn from, and adapt to, a temporal series of inputs, such as the image measures at each segment of a VMAT arc. The study used the forward-projection method of portal dosimetry and a variety of deliberate errors. The differences between the measured and predicted images were investigated firstly using multiple separate metrics (MSM) and related thresholds and then with the use of an RNN, so as to quantify the timeliness with which each method was able to detect the errors.

Patients and treatment plans
Treatment plans for radiotherapy of the prostate were created using AutoBeam v5.8 [34] for 60 Gy in 20 fractions with the 6 MV beam of a VersaHD linear accelerator (Elekta AB, Stockholm, Sweden) [35,36]. For six patients who gave their consent for their images to be used for research, predicted portal images were retrospectively produced for each segment of the VMAT arcs and input to AutoDose v1.1 software for comparison with real-time images [19] (Fig. 1). AutoBeam was also used to recalculate the plans and predicted images on a water-equivalent phantom of dimensions 300 mm long (G-T direction) × 300 mm wide (A-B direction) × 200 mm high, with the isocentre located at the centre of the phantom.

Measured images
Errors were deliberately introduced into all 180 segments of the treatment plans and both the normal and erroneous plans were then delivered to a Solid Water phantom (Radiation Measurements, Inc., Middleton, WI). The errors consisted of a 2-10% increase in monitor units in 2% steps, a retraction of 2-10 mm in 2 mm steps of all multileaf collimator (MLC) leaves, a shift of 2-10 mm in 2 mm steps of all MLC leaves, and introduction of an air space of 10-50 mm width in 10 mm steps into the phantom to simulate rectal gas [37]. In three patients, all error cases were simulated, and in a further three patients, only the error-free case and 4% increase in monitor units, 4 mm MLC retraction, 4 mm MLC shift and 20 mm air space were simulated. Portal images were recorded using an iViewGT imaging panel (Elekta) and analysed using AutoDose, which allocated the images to control points of the treatment plan [19].

Image metrics and selection of thresholds
At each segment of the VMAT plan, four measures of agreement between predicted and measured images were calculated: central axis signal, mean image value, root-mean-square difference as a percentage of global maximum and root-mean-square difference as a percentage of local prediction. These simple difference measures were used in preference to more complex difference measures, as the intention was to identify differences, however small spatially or temporally, and then to use error detection to work with these. The first 10% of segments were neglected as the images were not stable in this period. The startup of the linear accelerator, estimated to affect the first 1% of segments, may have contributed to this instability. After the first 10% of segments, a running sum of 10 segments was used. For comparison purposes MSM was applied, in which the threshold for each statistic was taken as the median + 2 × range of its maximum value over the cases under consideration, and image metrics exceeding these thresholds signified errors.
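The thresholding scheme described above can be sketched as follows. This is a minimal illustration, not the AutoDose implementation: the function names, the toy traces and the choice of reference cases are assumptions, and the running sum is applied here to a scalar per-segment value for simplicity. Only the window of 10 segments, the 10% exclusion and the median + 2 × range rule are taken from the text.

```python
from statistics import median

WINDOW = 10           # running-sum length in segments (from the text)
SKIP_FRACTION = 0.10  # initial fraction of segments neglected (from the text)


def running_sum(values, window=WINDOW):
    """Sum of each value and up to window - 1 values preceding it."""
    out, total = [], 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]
        out.append(total)
    return out


def msm_threshold(reference_traces, skip):
    """median + 2 * range of the per-case maximum of one metric."""
    maxima = [max(trace[skip:]) for trace in reference_traces]
    return median(maxima) + 2.0 * (max(maxima) - min(maxima))


def first_exceedance(trace, threshold, skip):
    """First segment index at or after `skip` above the threshold, else None."""
    for i in range(skip, len(trace)):
        if trace[i] > threshold:
            return i
    return None


# Hypothetical per-segment values of one metric for three reference cases.
n_segments = 20
skip = int(SKIP_FRACTION * n_segments)
reference = [[1.0] * n_segments, [1.2] * n_segments, [1.1] * n_segments]
threshold = msm_threshold(reference, skip)

# Hypothetical erroneous delivery whose metric rises half-way through.
erroneous = [1.0] * 10 + [2.0] * 10
detected_at = first_exceedance(erroneous, threshold, skip)
```

In this toy case the per-case maxima are 1.0, 1.2 and 1.1, so the threshold is about 1.5 and the erroneous trace is flagged at segment index 10. In practice one threshold of this kind would be needed for each of the four measures.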

Recurrent neural network
The four measures were applied to an RNN [38] consisting of four layers of gated recurrent units (GRUs), with four nodes in the first layer, eight in the second layer, four in the third layer and one in the final layer. The function of the GRU was exactly as defined by Cho et al. [39]. For training and testing, a leave-two-out cross-validation strategy was used [40,41]. Four of the patients were used to train the network, and the remaining two patients were used to test the result. Of the four patients used for training, two were from patients 1-3, for which a full set of error cases was available, and the other two were from patients 4-6, for which only representative errors were available (see the Measured images section). There were therefore nine ways of selecting unique combinations of patients for testing, so the RNN was trained and tested nine times. For example, firstly patients 1 and 4 were retained for testing, so patients 2, 3, 5 and 6 were used for training. Then patients 1 and 5 were retained for testing, so patients 2, 3, 4 and 6 were used for training, etc.
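As an illustration of the architecture, the following sketch implements a forward pass through a 4-8-4-1 stack of GRUs using the update-gate, reset-gate and candidate-state equations of Cho et al. It is a hedged example rather than the study's code: the class names, the random weight initialisation and the omission of bias terms are assumptions made for brevity, and no training is shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(row, vec):
    return sum(a * b for a, b in zip(row, vec))


class GRULayer:
    """One layer of gated recurrent units (biases omitted for brevity)."""

    def __init__(self, n_in, n_out, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]
        # Input weights (W*) and hidden-state weights (U*) for the update
        # gate z, the reset gate r and the candidate state.
        self.Wz, self.Uz = mat(n_out, n_in), mat(n_out, n_out)
        self.Wr, self.Ur = mat(n_out, n_in), mat(n_out, n_out)
        self.Wh, self.Uh = mat(n_out, n_in), mat(n_out, n_out)
        self.h = [0.0] * n_out  # hidden state carried between segments

    def step(self, x):
        h = self.h
        z = [sigmoid(dot(w, x) + dot(u, h)) for w, u in zip(self.Wz, self.Uz)]
        r = [sigmoid(dot(w, x) + dot(u, h)) for w, u in zip(self.Wr, self.Ur)]
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(dot(w, x) + dot(u, rh))
                for w, u in zip(self.Wh, self.Uh)]
        # New state: blend of the previous state and the candidate state.
        self.h = [zi * hi + (1.0 - zi) * ci
                  for zi, hi, ci in zip(z, h, cand)]
        return self.h


class GRUStack:
    """Four image measures in, then layers of 4, 8, 4 and 1 GRU nodes."""

    def __init__(self, sizes=(4, 4, 8, 4, 1), seed=0):
        rng = random.Random(seed)
        self.layers = [GRULayer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

    def step(self, measures):
        x = list(measures)
        for layer in self.layers:
            x = layer.step(x)
        return x[0]  # single output node


net = GRUStack()
# Hypothetical values of the four measures, repeated for five segments.
outputs = [net.step([0.1, 0.2, 0.0, 0.3]) for _ in range(5)]
```

Because each new state is a blend of the previous state and a tanh-bounded candidate, the output of the final node stays strictly between -1 and +1, which matches the output range used in this study.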
Using p to index the P training patients, e to index the E + 1 error types (e = 0 representing no error), s to index segments after exclusion of the first 19 segments and the vector w to represent the W weights of the RNN, the objective function for training was defined in equation (1) in terms of the following factors. The factor f_0(e) was an importance factor to avoid false positives, and f_e(e) was an error-specific factor to ensure that the larger errors were detected. It depended on M_e, the physical ranking of the error, i.e. 1 to 5 according to a monitor unit increase of 2% to 10% etc. The factor f_s(s) was a segment-specific factor, emphasising the importance of early segments in normal cases and late segments in error cases. Finally, f_y(p, e, s, w) provided a quadratic penalty from the "off" state for normal cases and from the "on" state for error cases, based on the output of the network, y(p, e, s, w), where −1 < y < 1, with y > 0 signifying an error and y < 0 signifying normal delivery. The final term in equation (1) was an L2 norm to prevent overfitting to the training data. This was applied to the W primary weights of the network, excluding the hidden state, update and reset weights, using an empirically determined value of 40 for the regularisation parameter, λ. To further avoid false positives, indices of e for which M_e = 1, i.e. 2% increase in monitor units, 2 mm aperture opening etc., were also defined as normal (no-error) cases.
Due to the non-convexity of the objective function, a random search algorithm was used for training. The software was run on a SPARC T4-2 server with 128 hyper-threads (Oracle Corporation) using a separate execution thread for each of the nine combinations of training and testing. To visualise real-time performance, the network trained on patients 2, 3, 5 and 6 was applied to errors for patient 1.
The final validation was to apply the RNN to actual patient images for four patients (A-D) different to those used for the phantom study. All of these treatments were considered to be normal deliveries, but the images for patient D were re-acquired on further occasions (in a non-real-time workflow) and were taken as an example of images that the medical physicist was not satisfied with.
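The general idea of training by random search on a regularised quadratic penalty can be sketched as below. This is a toy illustration under stated assumptions, not the study's training code: a single tanh unit stands in for the RNN, the weighting factors of the objective function are not reproduced, the accept-if-better Gaussian perturbation is only one possible form of random search, and the regularisation value used in the demonstration is chosen for the toy problem rather than the value of 40 used in the study.

```python
import math
import random


def objective(weights, samples, lam):
    """Quadratic penalty from the target state plus an L2 term.

    `samples` holds (inputs, target) pairs, with target -1 for a normal
    delivery and +1 for an error. The single tanh unit is a stand-in
    for the network (assumption).
    """
    penalty = 0.0
    for inputs, target in samples:
        y = math.tanh(sum(w * x for w, x in zip(weights, inputs)))
        penalty += (y - target) ** 2
    return penalty + lam * sum(w * w for w in weights)


def random_search(samples, n_weights, iterations=2000, step=0.1,
                  lam=0.01, seed=0):
    """Perturb the best weights at random and keep any improvement."""
    rng = random.Random(seed)
    best = [0.0] * n_weights
    best_value = objective(best, samples, lam)
    history = [best_value]
    for _ in range(iterations):
        trial = [w + rng.gauss(0.0, step) for w in best]
        value = objective(trial, samples, lam)
        if value < best_value:
            best, best_value = trial, value
        history.append(best_value)
    return best, history


# Hypothetical training samples: two inputs per sample, target +1 or -1.
samples = [
    ([1.0, 0.2], 1.0),
    ([-1.0, 0.1], -1.0),
    ([0.9, -0.3], 1.0),
    ([-0.8, -0.2], -1.0),
]
best, history = random_search(samples, n_weights=2)
```

Since a trial is only accepted when it lowers the objective, the recorded history never increases. The method needs no gradients, which suits a non-convex objective, at the cost of many function evaluations.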

Results

Training the recurrent neural network
Training and testing of the network required around 50 h. Over this time, the training progressed steadily, with the objective function converging to a similar value for the nine data sets (Fig. 2). Benefits were observed in timeliness of error detection with the RNN for monitor unit, aperture shift and air gap errors. Importantly, there were no false positives in any of the error-free cases. For the training cases as a whole, the median segment index at which errors were detected was 105 (range 97-120) for MSM and 68 (range 52-75) for the RNN, with a median relative reduction of 0.57 (range 0.49-0.72). The delivery time was approximately 180 s for the 180 segments of these treatment plans, so in terms of time, each segment equated to approximately 1 s of delivery time. Thus, finding the error at segment 68 meant that approximately 68 s of delivery was completed when the error was detected. There were 186 false negatives, in which the error was not detected at all during the 180 segments, out of 432 errors for MSM, representing a ratio of 0.43. There were 100 false negatives out of 432 errors for the RNN, a ratio of 0.23.
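The conversion from segment index to delivery time and the false negative ratios quoted above can be checked with a few lines of arithmetic. The function names are illustrative only, and the timing relies on the approximate figure of 180 s for 180 segments given in the text.

```python
TOTAL_SEGMENTS = 180
DELIVERY_TIME_S = 180.0  # approximate delivery time quoted in the text


def segment_to_seconds(segment_index):
    """Approximate elapsed delivery time when a given segment is reached."""
    return segment_index * DELIVERY_TIME_S / TOTAL_SEGMENTS


def false_negative_ratio(missed, total):
    """Fraction of introduced errors that were never detected."""
    return missed / total


rnn_time_s = segment_to_seconds(68)    # median detection segment, RNN
msm_time_s = segment_to_seconds(105)   # median detection segment, MSM
msm_ratio = round(false_negative_ratio(186, 432), 2)
rnn_ratio = round(false_negative_ratio(100, 432), 2)
print(rnn_time_s, msm_time_s, msm_ratio, rnn_ratio)
```

With these inputs the median detection times are about 68 s for the RNN and 105 s for MSM, and the ratios evaluate to 0.43 and 0.23.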

Testing the recurrent neural network
Testing showed that the RNN was most beneficial for errors in monitor units, aperture position and path length (Fig. 3). MSM were already effective in detecting errors in aperture opening, so in this case the RNN was less beneficial. The thresholds for central image signal and mean image value were exceeded in several instances for an aperture shift of 2 mm (Fig. 3c) but not for 4 mm, unrelated to the errors being introduced. The slightly worse performance of the RNN for larger aperture opening and aperture shift errors (Fig. 3b and 3c) was due to the L2 norm. This prevented overfitting, but meant that some of the obvious errors were not found until several segments after the MSM method.
Testing results for a specific level of error were found to be broadly similar between patients (Fig. 4), although overall, there was some variation in the nine test samples (Table 1). Again, there were no false positives in any of the test results for error-free cases. There were 77 false negatives out of 216 errors for MSM, representing a ratio of 0.36. There were 52 false negatives out of 216 errors for the RNN, a ratio of 0.24.
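The error totals quoted for training and testing can be cross-checked against the leave-two-out scheme and the error cases described in the methods. The sketch below enumerates the nine splits and counts the error cases in each; the variable names are illustrative, and the counts assume four error types with five magnitudes each for patients 1-3 and one magnitude per type for patients 4-6, as stated in the text.

```python
from itertools import product

FULL_PATIENTS = (1, 2, 3)     # all error magnitudes delivered
PARTIAL_PATIENTS = (4, 5, 6)  # one representative magnitude per error type
ERRORS_FULL = 4 * 5           # four error types, five magnitudes each
ERRORS_PARTIAL = 4            # four error types, one magnitude each


def errors_for(patient):
    """Number of error cases available for a given patient."""
    return ERRORS_FULL if patient in FULL_PATIENTS else ERRORS_PARTIAL


# Each test pair holds one patient from each group; the rest are for training.
splits = []
for test_pair in product(FULL_PATIENTS, PARTIAL_PATIENTS):
    train = tuple(p for p in FULL_PATIENTS + PARTIAL_PATIENTS
                  if p not in test_pair)
    splits.append((train, test_pair))

training_errors = sum(errors_for(p) for train, _ in splits for p in train)
testing_errors = sum(errors_for(p) for _, test in splits for p in test)
print(len(splits), training_errors, testing_errors)
```

Each split contributes 48 error cases to training and 24 to testing, giving 432 and 216 over the nine splits, which is consistent with the totals reported for the false negative ratios.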
In the real-time context, the RNN was found to be most active initially in the treatment delivery for the case of moderate errors (Fig. 5). The network failed to detect a 4% increase in monitor units (Fig. 4a), but successfully detected the other errors rapidly (Fig. 4b-d). After error detection, the signal did not change appreciably.
For the real patient images, deliveries for patients A-C were classified as normal, with a network output of close to −1. Those for patient D were identified very rapidly as abnormal, with the network output quickly moving to approach +1.
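The way a sequence of network outputs is turned into a real-time decision can be sketched as follows. The rule that an output above zero signifies an error and the exclusion of the first 19 segments are taken from the text; the function name and the example traces are hypothetical.

```python
SKIP_SEGMENTS = 19  # initial segments excluded (from the text)


def detect_error(outputs, skip=SKIP_SEGMENTS):
    """Index of the first segment at or after `skip` with output > 0, else None."""
    for index, y in enumerate(outputs):
        if index >= skip and y > 0.0:
            return index
    return None


# Hypothetical output traces for 180 segments.
normal = [-0.95] * 180                    # stays near -1: no error flagged
abnormal = [-0.9] * 30 + [0.8] * 150      # moves towards +1 at segment 30
early_spike = [0.5] * 10 + [-0.9] * 170   # positive only in the ignored region

print(detect_error(normal), detect_error(abnormal), detect_error(early_spike))
```

The third trace shows why the initial segments are disregarded: a positive output during the unstable start-up period does not trigger a detection.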

Discussion
The results show that in the context of forward-projection real-time portal dosimetry for prostate treatment delivery, the RNN is able to improve the timeliness of error detection by around 30%, compared to MSM. There is some variability in effectiveness of the RNN between error types and between patients.
Implicitly, the thresholds of MSM are built into the RNN in the form of the biases, but the more complex connectivity of the RNN is shown to provide a more effective result, similar to dose-volume histogram prediction [42]. The RNN is trained to detect particular types of errors for a particular treatment site, and there is no guarantee that it operates correctly for other errors or treatment sites. In other words, although the L2 norm prevents overfitting within the patients used, the model as a whole may be over-fitted to certain types of error and treatment site. However, by using general image difference measures, the present study gives an indication of what is likely to be achieved in a larger study using treatment plans of similar complexity.
There are relatively few studies focusing on real-time EPID dosimetry for VMAT, but it is possible to make some comparisons with other studies. The method behaves similarly to that of Woodruff et al. [17], except for the use of section images rather than integrated images. Compared to real-time MSM using site-specific control limits [15], which is able to detect monitor unit errors of 5% in static gantry intensity-modulated radiotherapy after about 23% of the delivery, the detection speed in the present study is slower, but the thresholds must be higher with VMAT due to the gantry rotation, which explains this effect. Monitor unit changes and aperture shifts of a similar magnitude to those in the present study can also be detected by back-projection in a non-real-time context [43,44]. In the real-time situation, Spreeuw et al. [18] show that a 20 cGy dosimetric difference in the patient can be detected after around 10% of the delivery time for deliberately introduced serious errors in prostate radiotherapy. This is faster than either MSM or RNN in this study, but is expected to be so because of the magnitude of the errors. The study presented here is in agreement with Schyns et al. [25] that the time-resolved element is valuable in the forward-projection approach but that interpretation of any errors detected in terms of dose to the patient is not straightforward.
As with all studies using deliberate errors, the results must either be based on phantom studies or simulated measurements. For the former, used in this study, the anatomy is somewhat simplified, but the measurements include real variations in quality of panel output and calibration. Other uncertainties are the start-up of the accelerator, the initial instability of the images and the allocation of images to segments of the treatment plan. The method of using a running sum of images for a limited number of treatment plan segments is able to detect errors for parts of the VMAT arc, but this has not been fully demonstrated in this study as the introduced errors are present for the whole arc. However, the method of detecting errors in the whole plan does have the advantage that the timeliness of the detection can be quantified in an analogue manner, such as using segment number at which the error is detected, whereas the introduction of short errors means that the detection is binary, for example detected or not, which is then difficult to analyse in small data sets. It is also more important to detect and act upon persistent errors.
Simulated measurements are easier to obtain, by taking predictions and applying noise, e.g. [45], but it is very difficult to ensure that the noise accurately represents the random and systematic errors that typically occur during operation of a portal dosimetry service [46][47][48]. In addition, the effectiveness of the portal dosimetry method depends on how accurate the prediction method is [43,44]. The study does not address patient positioning errors, for which a method such as cone-beam CT is more suitable, either separately from the portal dosimetry, or included within it [7,44,49]. However, it is likely that anatomical changes can be detected with improved accuracy using the RNN, particularly as this type of change may only impact on the portal images at particular gantry angles [24,25]. Avoidance of false positive results is an important part of this approach, as a false positive error in the real-time context means that the patient's treatment is paused while the error is investigated. False positives also add to the operator workload and encourage a lax attitude towards real errors when they occur. There are some false negative results in the study, mostly for the small error cases where the clinical impact is relatively small, but these are reduced in number by appropriate training of the RNN [50].
Table 1. Mean segment index at which errors are detected for multiple separate metrics with threshold and for a recurrent neural network, during testing.
Fig. 5. Network output for patient 1 for several error cases. Results less than or equal to zero indicate absence of an error and results greater than zero indicate an error. The output in the grey region at the left is disregarded due to instability of the raw signals.
A logical progression of this work is to use a deep learning approach [30,31,51,52] to analyse the predicted and measured images as a whole. Either the pixels of a difference map between the predicted and measured images, or the pixels of both of the images separately, could be applied to the inputs. A convolutional stage could detect specific image features which might be indicative of errors.
The RNN presented in this study, taking as input several measures of difference between predicted and measured images, can be used to provide timely indication of errors during real-time portal dosimetry. In this simulation study of forward-projection portal dosimetry for prostate VMAT, a variety of errors are detected around 30% earlier than when using the image difference measures alone in a threshold-based approach. The leave-two-out strategy used in this feasibility study gives an indication of the benefit likely to be observed in a larger cohort of similarly complex VMAT treatments.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.