In laboratory medicine, mislabeled specimens (MLS) are pre-analytical errors in which blood from one patient is given an identification label from a different patient. These errors are estimated to occur in between 0.03 and 17 specimens per 1,000 specimens collected (1-4). The lower estimates are likely falsely low because of the difficulty of identifying these errors in clinical practice. When MLS go undetected, they can place one or both patients at risk of harm as clinical decisions are carried out based on incorrect data. MLS are estimated to cost 280,000 USD per million specimens collected and to cause 160,000 adverse medical events per year in the United States (5).
Pre-analytical solutions have been relatively successful at reducing the number of MLS. In one study, the implementation of barcode labels with bedside printers reduced the number of MLS by 92% (6). In a multi-institutional survey, MLS occurred significantly less frequently in institutions with ongoing quality monitoring systems for specimen identification and in institutions with 24/7 inpatient phlebotomy service (3). One approach to reducing MLS is to give patients identifying wristbands, but wristband errors can result in downstream MLS errors. To reduce the number of wristband errors, the College of American Pathologists performed a study involving 217 institutions in which phlebotomists were tasked with continuously evaluating patient wristbands for errors. This strategy reduced wristband errors from 7.40% to 3.05% (7).
In cases where pre-analytical strategies fail, post-analytical methods to detect MLS have been developed. Delta checks are one such system and are widely used owing to their low cost of implementation. In this method, a patient’s analytical test results from two different time points are compared. If the change in value exceeds a pre-determined threshold, the results are flagged and either reviewed, repeated, or the specimen is recollected (8). Multiple strategies using this framework have been implemented. Thresholds, for example, can be applied to the absolute change in value (current result minus the previous result) or to a relative change in value (current result divided by the previous result). Change velocities (change in value divided by the difference in collection time) can also be used. No standard acceptable tolerances have been established, although median values have been reported (8). Despite the widespread implementation of delta checks in clinical laboratories, the value of this strategy is questionable. Receiver operating characteristic (ROC) curve analysis has shown that the best performing delta check was for mean corpuscular volume (MCV), which achieved an area under the curve (AUC) of only 0.90 (9). In one analysis, multiple analytes were combined into a weighted cumulative delta check. Although this model achieved promising results with a maximum AUC of 0.98 (10), the same data were used both to generate and to test the model, introducing a significant source of potential bias.
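The threshold strategies described above can be sketched in a few lines. The following Python illustration is not the study's implementation; the limits, the example analyte, and the ratio-style comparison for the relative check are hypothetical assumptions:

```python
from datetime import datetime

def delta_check(prev, curr, t_prev, t_curr,
                abs_limit=None, rel_limit=None, vel_limit=None):
    """Flag a result pair when any configured delta-check threshold is exceeded.

    Limits here are illustrative placeholders, not institutional limits.
    """
    flags = []
    # Absolute change: current result minus previous result.
    if abs_limit is not None and abs(curr - prev) > abs_limit:
        flags.append("absolute")
    # Relative change: current result divided by previous result,
    # thresholded here as a deviation of the ratio from 1 (an assumption).
    if rel_limit is not None and prev != 0 and abs(curr / prev - 1) > rel_limit:
        flags.append("relative")
    # Change velocity: change in value divided by collection-time difference.
    if vel_limit is not None:
        dt_days = (t_curr - t_prev).total_seconds() / 86400
        if dt_days > 0 and abs(curr - prev) / dt_days > vel_limit:
            flags.append("velocity")
    return flags

# Hypothetical example: potassium rising 4.1 -> 6.0 mmol/L over one day,
# with a 1.0 mmol/L absolute limit and a 25% relative limit.
flags = delta_check(4.1, 6.0,
                    datetime(2013, 5, 1, 8), datetime(2013, 5, 2, 8),
                    abs_limit=1.0, rel_limit=0.25)
```

A velocity threshold works the same way via `vel_limit`; in practice each analyte carries its own limits.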
Recently, delta checks were revisited using machine learning techniques. An AUC of 0.97 was achieved with a support vector machine (SVM) method (11), a better performance than that of logistic regression (AUC = 0.92). Despite this achievement, the researchers limited their analysis to a rigid panel of 11 analytes and only examined specimens collected within 36 hours of one another. This restrictive approach likely meant that only a small minority of all the blood specimens collected at their institution could be evaluated by their model. Although they compared their method to a weighted logistic regression model, it was also limited to the same 11 analytes as their SVM method, even if more tests were actually performed.
To expand and improve upon previous work, we devised two novel machine learning methods to identify MLS using neural networks. In one approach, the results from rigid analyte panels were used to identify MLS. In the second approach, neural networks were created that were not limited to a specific panel of analytes, but instead could be given any combination of analytes. For both approaches, different neural networks were created and evaluated for different ranges of time deltas. The performance of each neural network was compared directly to the current delta check strategy used at our institution.
Code was written in MATLAB to automate all data processing and to create and test all neural networks. MATLAB R2018b (MathWorks, Natick, Massachusetts, USA) was used.
All analytical test results performed on the automated core chemistry and immunoassay analyzers at our institution from 4/12/2012 to 1/30/2014 (1.8 years) were collected. All patient medical record numbers were de-identified via assignment of unique numeric codes that were not tied to the original numbers. The complete list of analytical tests is shown in Table S1. The test results were sorted by patient and time of collection, and bundled with all other test results from the same patient that were collected at the same time (Figure 1A). A total of 4,119,977 analytical tests from 122,433 patients over 462,998 unique time points were obtained. Test-bundles with fewer than 5 analytes were discarded from the set.
Combinations of two test-bundles from a single patient were linked to create properly-labeled-specimen-pairs (PLSP) (Figure 1B). Every possible combination was produced, and the time between each linked pair was recorded. The delta time, Δt, is the time difference between the two collection times:

Δt = tB − tA

where tA and tB are the first and second time points. PLSP with a Δt greater than 10 days were discarded from the dataset.
Mislabeled specimen simulation
Mislabeled-specimen-pairs (MLSP) were created by randomly reassigning the analytical test results from the second time point to a different patient (Figure 1C). Test-bundles were always reassigned to different patients that had the same set of analytes performed. In cases where there were insufficient unique patients to reassign the test results to a different patient, both the MLSP and corresponding PLSP were discarded. A total of 481,956 PLSP and MLSP (963,912 pairs in total) were created in this manner, spanning 18,886 patients. 80% of the specimen-pairs (771,128 pairs) were randomly assigned to a master-training set, while 10% (96,392 pairs) each were assigned to a validation set and a test set (Figure 1D). The data sets were then divided into five groups depending on their Δt: <1.5, 1.5–2.5, 2.5–3.5, 3.5–5, and 5–10 days.
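The reassignment procedure can be sketched as follows. The data layout (tuples of patient ID and analyte-value dictionaries) and the tiny example are hypothetical, but the matching-by-analyte-set and discard rules follow the description above:

```python
import random

def simulate_mlsp(plsp, seed=0):
    """Create mislabeled pairs by swapping each pair's second test-bundle with
    one from a different patient that had the same set of analytes performed.

    `plsp` is a list of (patient_id, bundle_first, bundle_second), where each
    bundle is a dict of analyte -> value (hypothetical layout for illustration).
    """
    rng = random.Random(seed)
    # Index candidate second bundles by their analyte set.
    by_panel = {}
    for pid, _, b in plsp:
        by_panel.setdefault(frozenset(b), []).append((pid, b))
    mlsp = []
    for pid, a, b in plsp:
        candidates = [(p, bb) for p, bb in by_panel[frozenset(b)] if p != pid]
        if not candidates:
            # No other patient with this analyte set: discard the pair
            # (in the study, the corresponding PLSP was discarded as well).
            continue
        _, swapped = rng.choice(candidates)
        mlsp.append((pid, a, swapped))
    return mlsp

# Two patients with the same panel: each receives the other's second bundle.
pairs = [("p1", {"Na": 140}, {"Na": 141}),
         ("p2", {"Na": 138}, {"Na": 139})]
mlsp = simulate_mlsp(pairs)
```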
Neural networks were created to predict if specimen-pairs were PLSP or MLSP using different methodologies described below.
Panel neural networks (PNN)
PNN were created to detect MLS when the same panel was ordered at two time points. All analytes that were not in the panels were discarded for this analysis. Specimen pairs that did not have the full panel were likewise discarded. The panels that were evaluated included the basic metabolic panel (BMP), the comprehensive metabolic panel (CMP), the renal function panel (RFP), and the hepatic function panel (HFP). Analytes for each panel are listed in Table 1, while the number of specimen pairs used to train, validate, and test each PNN are shown in Table S2, along with neural network training parameters.
The prototypical neural network architecture is shown in Figure 1E, and the prototypical input layer is shown in Figure 1F. In brief, the input layer was an (N+1)×4 matrix, where N is the number of analytes in the test-panel. The first N columns each correspond to a different analyte. The first row was populated with the analyte values from the first time point, while the second row was populated with the analyte values from the second time point. The third row was populated with the absolute change in value between the two time points (current result minus the previous result), while the fourth row was populated with the relative change in value (current result divided by the previous result). The cell in row 1 of the final column (column N+1) was populated by Δt in days, while the cells in rows 2, 3, and 4 of the final column were left as 0. The PNNs were each trained for between 20 and 100 epochs. Ten PNNs were created for each panel and for each Δt range in order to permit statistical analysis.
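The input construction can be sketched as follows, assuming (per the prose) four feature rows by N+1 columns; the panel ordering and values are hypothetical:

```python
import numpy as np

def pnn_input(prev, curr, dt_days, panel):
    """Build the panel-network input: analyte values at both time points,
    absolute change, and relative change, plus a final delta-time column.

    `panel` fixes the column order; `prev`/`curr` map analyte -> value.
    Illustrative sketch, not the study's MATLAB implementation.
    """
    n = len(panel)
    x = np.zeros((4, n + 1))
    for j, analyte in enumerate(panel):
        a, b = prev[analyte], curr[analyte]
        x[0, j] = a          # value at the first time point
        x[1, j] = b          # value at the second time point
        x[2, j] = b - a      # absolute change
        x[3, j] = b / a      # relative change
    x[0, n] = dt_days        # delta time fills row 1 of the final column
    return x

# Hypothetical three-analyte panel, 1.2 days apart.
panel = ["Na", "K", "Cl"]
x = pnn_input({"Na": 140, "K": 4.0, "Cl": 100},
              {"Na": 137, "K": 4.4, "Cl": 98}, 1.2, panel)
```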
Open-ended neural networks (ONN)
ONN were created to detect MLS regardless of what tests were ordered at either time point. For these networks, the data sets were separated based on the number of analytes ordered at the second time point. The specimen pair data was thus divided into three groups: group 1 (5 to 8 analytes tested at the second time-point), group 2 (9 to 12 analytes), and group 3 (13 or more analytes). The number of analytical tests performed at the first time point did not affect what category the specimen pair was placed in. Likewise, the same analytes did not need to be ordered at both time points in the specimen pair. The number of specimen pairs used to train, validate, and test each ONN are shown in Table S2, along with neural network training parameters.
The structure for the ONNs is similar to that of the PNNs. The difference between the approaches was in the input layer, which consisted of a 131×4 matrix. The first 130 columns either corresponded to different analytes or were left blank (assigned a value of zero). Similar to the PNNs, the first row corresponded to the value at the first time point, the second row to the value at the second time point, the third row to the absolute change in value, and the fourth row to the relative change in value. When an analyte was not tested at a given time point, that column was left as zero. When an analyte was only tested at the first time point and not at the second, row 1 was populated with the test result, and rows 2, 3, and 4 were left as zero. Similarly, when an analyte was only tested at the second time point and not the first, row 2 was populated with the result and rows 1, 3, and 4 were left as zero.
Similar to the PNNs, for the ONNs the cell in the first row of the final column (column 131) was populated by Δt in days, while the cells in rows 2, 3, and 4 of the final column were set to zero. The ONNs were each trained for between 20 and 100 epochs. Ten ONNs were created for each analyte-count group and for each Δt range for statistical analysis.
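The open-ended input can be sketched the same way, with zero-filled columns for untested analytes. The analyte-to-column mapping below is a small hypothetical stand-in for the study's 130 columns:

```python
import numpy as np

def onn_input(prev, curr, dt_days, analyte_index):
    """Build the open-ended input: one column per known analyte (zero when
    a test was not performed at a time point), plus a final delta-time column.

    `analyte_index` maps analyte name -> column; illustrative sketch only.
    """
    n = len(analyte_index)
    x = np.zeros((4, n + 1))
    for analyte, j in analyte_index.items():
        a, b = prev.get(analyte), curr.get(analyte)
        if a is not None:
            x[0, j] = a
        if b is not None:
            x[1, j] = b
        if a is not None and b is not None:  # deltas only when both results exist
            x[2, j] = b - a
            x[3, j] = b / a
    x[0, n] = dt_days
    return x

# Hypothetical case: K tested only at the first point, ALT only at the second.
idx = {"Na": 0, "K": 1, "ALT": 2}
x = onn_input({"Na": 140, "K": 4.0}, {"Na": 138, "ALT": 35}, 0.8, idx)
```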
ROC analysis was performed for each neural network using the PLSP and MLSP categories as the gold standard, and the neural network output score as the analyte. The AUC was calculated for each neural network. The sensitivity and specificity were obtained for the optimal operating point (OOP) on the ROC curve as calculated by the MATLAB perfcurve function, which relies on a previously described cost-function curve analysis (12). The specificity was also calculated for each neural network at the points where the sensitivities reached 50% and 80%. A positive predictive value (PPV) was calculated assuming a mislabeled-specimen frequency of 1 in 200 (0.5%). This was performed for the OOP, as well as at the points where the sensitivities were set to 50% and 80%. Neural network performance metrics are reported as mean ± standard deviation.
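The PPV at an assumed mislabeled-specimen frequency follows directly from Bayes' rule; a minimal sketch (the operating point shown is illustrative):

```python
def ppv(sensitivity, specificity, prevalence=0.005):
    """Positive predictive value at an assumed MLS frequency (0.5% here)."""
    tp = sensitivity * prevalence              # true-positive probability mass
    fp = (1 - specificity) * (1 - prevalence)  # false-positive probability mass
    return tp / (tp + fp)

# Example operating point: 50% sensitivity at 99.4% specificity.
p = ppv(0.50, 0.994)
```

At this operating point the PPV is roughly 29.5%, showing how even a highly specific test yields a modest PPV at 0.5% prevalence.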
Classic delta checks
The data sets used to test the PNNs and ONNs were also evaluated using the classic delta check limits used at our institution. Classic delta check limits were derived from reference change limit calculations and can be viewed in Table S1. In order to properly compare the two methods, the classic delta checks were applied to all analytes that were performed at both time points in the specimen pairs. For the PNNs, analytes that were not in a given panel were still included in the classic delta check analysis. The number of times each test was ordered in our cohort, and the mean and standard deviation change over the course of 1.5 days are described in Table S1. The PPVs for each PNN and ONN test set were calculated using the classic delta checks.
ROC analysis was applied to all neural networks. The mean AUCs for all the neural networks are compiled in Figure 2 and listed in Table S3. The best performing neural network was the CMP PNN for Δt <1.5 days, which achieved an AUC of 0.994±0.001. The CMP neural networks were the best performing of all neural networks, maintaining an AUC above 0.95 even for the 5 to 10 day Δt. The HFP PNNs were the second best performing PNNs, the BMP PNNs were third, and the RFP PNNs were fourth. In general, the ONNs performed similarly to or worse than the PNNs. ONN performance improved as the number of test-analytes increased. In general, both PNNs and ONNs with low Δt performed best, and performance decreased as the Δt increased.
Sensitivity and specificity analysis
The sensitivities, specificities, and PPVs are shown for the PNNs and the ONNs in Tables S4 and S5, respectively. PNN results are shown in Figure 3, while the ONN results are shown in Figure 4. Similar to the AUC values, sensitivity and specificity generally decreased as the Δt increased. The CMP PNN with a Δt of <1.5 days had the highest OOP sensitivity and specificity, which were 98.4% and 96.4%, respectively. When sensitivities were set to 50%, the specificities for the CMP PNNs were all in excess of 99.4% regardless of the Δt. These additionally had PPVs greater than 29% for all time periods when a 0.5% MLSP frequency was assumed. The PPVs for the CMP PNNs with Δt <1.5 days and 1.5 to 2.5 days were both greater than 68%; however, these groups each represent less than 1% of the total specimen pairs. Although the BMP PNN with Δt <1.5 days had a smaller PPV (41.7%), this neural network covers a much larger proportion of all the specimen pairs (19.3%).
For the ONNs, when sensitivity was set to 50%, only the 13-or-more-analyte ONN with a Δt <1.5 days had a specificity that exceeded 99% (99.5%), which resulted in a PPV of 17.6%.
The classic delta checks were highly sensitive in identifying MLSPs, but their specificities were lower than all the neural networks at the OOP. When classic delta checks were evaluated on the panel data sets, the highest PPV achieved was 1.7% for the BMP data set with a Δt <1.5 days. The highest PPV achieved using the classic delta checks in the open data sets was 1.9% for the 5–8 analyte data set with a Δt <1.5 days.
Neural networks were created to identify MLS. Using this method, the best AUC achieved was 0.994±0.001 for a CMP PNN with Δt <1.5 days. This study improves upon previous work by increasing the maximum AUC achieved in detecting mislabeled specimens (11). We additionally created neural networks designed to detect MLS when alternative panels were performed, namely BMPs, RFPs, and HFPs. We compared this strategy to an unrestricted approach, where any analyte could be used at either time point. These ONNs were less accurate than the PNNs at detecting MLS; however, their flexibility may have some niche applications.
Although the BMP PNN only produced a PPV of 41.7% at a sensitivity of 50%, it is worth highlighting the magnitude of the difference in PPV between the PNN and the classic delta checks. The BMP PNN had a 24-fold improvement in PPV compared with the classic delta checks, which had a maximum PPV of 1.7% using the same BMP analytes, though at the sacrifice of sensitivity. The low PPV is in part due to an overly sensitive classic delta check strategy, which increased both the true-positive and the false-positive rates. A high false-positive rate diverts laboratory resources and makes investigating MLS costly: the analytical tests often need to be repeated, additional blood loss can be incurred by the necessity of repeating phlebotomy, and laboratory personnel must spend time reviewing and analyzing the error. The implementation of machine learning-based protocols to detect MLS with fewer false-positive errors may have a dramatic impact on patient care and health care costs, and may require little-to-no monetary investment.
A notable strength, and simultaneously a limitation, of our study was the use of pre-existing clinical data obtained from our middleware system. Using pre-existing data, rather than simulated data, allowed for the direct analysis of realistic scenarios in which the blood-in-tube of the MLS sample could come from any random patient. The limitation of this strategy is that undetected MLS were likely present in the raw data, and these were miscategorized as PLSP in the training and test sets. The effect of undetected MLS pairs in our training and test sets would be expected to have decreased the performance and lowered the AUC of the PNNs and ONNs.
Neural networks and other machine learning strategies have clear advantages over conventional classic delta checks, but they should be implemented with caution due to a number of practical limitations. The algorithms typically constitute a “black box” approach to error detection that must be evaluated empirically. The strategy relies heavily on contemporary clinical data to train the algorithm, and the frequency with which new data must be collected and new neural networks must be trained needs to be established.
Finally, the implementation of machine learning-based MLS-detection protocols requires a dedicated understanding of rapidly evolving artificial intelligence technologies. Given the demonstrated performance improvement of these protocols over the classic delta checks, laboratory information system and middleware vendors should be pressed to develop software that can utilize these tools in real time.
The research project was possible due to the generous donation of an Nvidia Titan XP graphics processing unit as part of the GPU Research Grant (Nvidia, Santa Clara, California). In addition, this work was supported by the Dartmouth Hitchcock Medical Center Department of Pathology and Laboratory Medicine.
Conflicts of Interest: The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The institutional review board (IRB) at our institution determined that the project is not research involving human subjects as defined by our internal and FDA regulations. IRB review and approval by the organization is not required. The outcomes of the study will not affect the future management of the patients involved.
- Dzik WH, Murphy MF, Andreu G, et al. An international study of the performance of sample collection from patients. Vox Sang 2003;85:40-7. [Crossref] [PubMed]
- Ansari S, Szallasi A. “Wrong blood in tube”: Solutions for a persistent problem. Vox Sang 2011;100:298-302. [Crossref] [PubMed]
- Wagar EA, Stankovic AK, Raab S, et al. Specimen labeling errors: A Q-probes analysis of 147 clinical laboratories. Arch Pathol Lab Med 2008;132:1617-22. [PubMed]
- Saathoff AM, MacDonald R, Krenzischek E. Effectiveness of Specimen Collection Technology in the Reduction of Collection Turnaround Time and Mislabeled Specimens in Emergency, Medical-Surgical, Critical Care, and Maternal Child Health Departments. Comput Inform Nurs 2018;36:133-9. [PubMed]
- Valenstein PN, Raab SS, Walsh MK. Identification errors involving clinical laboratories: A College of American Pathologists Q-Probes study of patient and specimen identification errors at 120 institutions. Arch Pathol Lab Med 2006;130:1106-13. [PubMed]
- Brown JE, Smith N, Sherfy BR. Decreasing Mislabeled Laboratory Specimens Using Barcode Technology and Bedside Printers. J Nurs Care Qual 2011;26:13-21. [Crossref] [PubMed]
- Howanitz PJ, Renner SW, Walsh MK. Continuous wristband monitoring over 2 years decreases identification errors: A College of American Pathologists Q-Tracks study. Arch Pathol Lab Med 2002;126:809-15. [PubMed]
- Schifman RB, Talbert M, Souers RJ. Delta check practices and outcomes: a Q-Probes study involving 49 health care facilities and 6541 delta check alerts. Arch Pathol Lab Med 2017;141:813-23. [Crossref] [PubMed]
- Balamurugan S, Rohith V. Receiver operator characteristics (ROC) analyses of complete blood count (CBC) delta. J Clin Diagn Res 2019;13:9-11.
- Yamashita T, Ichihara K, Miyamoto A. A novel weighted cumulative delta-check method for highly sensitive detection of specimen mix-up in the clinical laboratory. Clin Chem Lab Med 2013;51:781-9. [Crossref] [PubMed]
- Rosenbaum MW, Baron JM. Using machine learning-based multianalyte delta checks to detect wrong blood in tube errors. Am J Clin Pathol 2018;150:555-66. [Crossref] [PubMed]
- Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561-77. [Crossref] [PubMed]
Cite this article as: Jackson CR, Cervinski MA. Development and characterization of neural network-based multianalyte delta checks. J Lab Precis Med 2020;5:10.