3.3. Evaluation of NLP models
Table 3 reports the performance of the rule-based and CNN models in terms of precision, recall and F-measure, evaluated on 100 manually annotated notes. The rule-based model showed high precision (0.924) but lower recall (0.871), indicating that it missed some notes mentioning bone scan utilization but was rarely wrong when it did identify a bone scan. The CNN model showed the opposite pattern: precision was lower (0.882), but recall was very high (0.957).
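As a cross-check on the values above, the F-measure can be recomputed from precision and recall. The sketch below assumes the balanced (F1) definition, the harmonic mean of the two, which this section does not state explicitly; the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the CNN model in this section.
cnn_f = f_measure(0.882, 0.957)
print(round(cnn_f, 3))  # 0.918
```

Under that assumption, the recomputed value agrees with the F-measure of 0.918 reported for the CNN model.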
The predictions of the two models are summarized in Fig. 3, where each node represents a note. Black nodes correspond to notes in which bone scan utilization was mentioned, whereas white nodes represent notes stating that the patient had not received a bone scan. The orange and green zones correspond to the predictions of bone scan receipt according to the two models.
The combination of the two methods improved the accuracy of information extraction. Table 3 presents four possible models that balance precision and recall differently to adjust model accuracy. For example, the number of false positives could be minimized by using the intersection of notes with positive annotations by both methods (model 3). This approach increased the precision score (0.968) at the expense of decreasing the recall score (0.857). False negatives could be minimized by selecting the union of positive annotations from the two methods (model 4), producing high recall (0.971) but lower precision (0.85).
The 5500 patients included in our study had 369,764 associated notes. These notes comprised a total of 17,101,187 sentences, including 14,090 sentences containing the term “bone scan”. The CNN model predicted 6701 positive notes from the 369,764 notes and the rule-based model predicted 5636 positive notes. The intersection of model predictions (model 3) was 5326 positive notes, while the union of model predictions (model 4) included 7011 positive notes.
Fig. 2. Flowchart to select the final cohort, to classify the patients and to detect if patients underwent a bone scan.
Table 2B. Demographic data of patients as a function of NCCN guidelines and predictions of the CNN model. Column headers: Patient characteristics; NCCN risk group; No BS.
Table 3. Evaluation of NLP models on 100 manually annotated notes. Model 1: rule-based; Model 2: CNN; Model 3: rule-based and CNN; Model 4: rule-based or CNN.
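Models 3 and 4 amount to a set intersection and a set union over the notes each method flags as positive. The following is a minimal sketch of that idea, not the authors' implementation: the note identifiers and the `combine` helper are hypothetical. The last lines check the reported note counts against the inclusion-exclusion identity.

```python
def combine(rule_based: set, cnn: set) -> dict:
    """Combine two sets of positively predicted note IDs.

    model_3 keeps notes flagged by both methods (favors precision);
    model_4 keeps notes flagged by either method (favors recall).
    """
    return {"model_3": rule_based & cnn, "model_4": rule_based | cnn}


# Hypothetical note IDs, for illustration only.
rule_based_pos = {"n1", "n2", "n3"}
cnn_pos = {"n2", "n3", "n4", "n5"}
combined = combine(rule_based_pos, cnn_pos)
print(sorted(combined["model_3"]))  # ['n2', 'n3']
print(sorted(combined["model_4"]))  # ['n1', 'n2', 'n3', 'n4', 'n5']

# Consistency check on the counts reported in the text:
# |A or B| = |A| + |B| - |A and B|
n_cnn, n_rule, n_both = 6701, 5636, 5326
n_either = n_cnn + n_rule - n_both
print(n_either)  # 7011
```

The computed union size matches the 7011 positive notes reported for model 4.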
3.4. Guideline adherence
To measure guideline adherence, we chose the CNN model because it had a higher F-measure (0.918) than the rule-based model and therefore offered the best compromise between precision and recall. Using structured and semi-structured data alone, we determined that only 813 patients (15%) received a bone scan. However, an additional 1270 patients (23%) were identified when we used the CNN model. Fig. 4 summarizes the use of bone scans according to the NCCN and AUA guidelines, where each bar corresponds to the percentage of patients who received a bone scan. Bone scans were used at modestly high rates in high-risk patients (73%), while only 10% of low-risk patients received a bone scan. When intermediate-risk patients were substratified into unfavorable and favorable risk according to the AUA guidelines, 39% and 23% underwent a bone scan, respectively.
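The percentages above follow from the cohort size of 5500 patients reported in Section 3.3. A short sketch of the arithmetic is given below; the combined total is derived here, is not a figure reported in the text, and assumes that the "additional" patients do not overlap with those found in structured data.

```python
TOTAL_PATIENTS = 5500   # cohort size reported in Section 3.3
structured = 813        # bone scans found in structured/semi-structured data
cnn_additional = 1270   # additional patients identified by the CNN model


def pct(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100 * count / total


print(round(pct(structured, TOTAL_PATIENTS)))      # 15
print(round(pct(cnn_additional, TOTAL_PATIENTS)))  # 23

# Derived value (assumes the two groups are disjoint).
combined = structured + cnn_additional
print(combined, round(pct(combined, TOTAL_PATIENTS), 1))  # 2083 37.9
```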
We developed a pipeline using heterogeneous EHR data to assess guideline adherence (over- and under-use) of radionuclide bone scans for staging prior to treatment in newly diagnosed prostate cancer patients. To measure adherence, we developed electronic phenotypes to classify patients into clinical risk categories according to two different guidelines, because each clinical risk category has a different bone scan recommendation. Assessment of bone scan documentation required the transformation of heterogeneous data into knowledge using NLP technologies, with the CNN model outperforming a rule-based approach. Our work also demonstrates the use of orthogonal NLP methods to adjust model precision for individual use cases, allowing users to titrate for higher precision, to ensure all high-risk patients needing a bone scan are identified, or for higher recall, to measure guideline adherence. For assessment of adherence to the bone scan quality metric, it is critical to avoid false positives (labeling a high-risk patient as ‘bone scan performed’ when he did not receive one); models tuned to the highest precision would therefore minimize this risk. Integrating this information at the point of care will be essential to ensure that both patients and clinics have the evidence necessary to guide bone scan use and treatment pathways.
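The idea of titrating the model to the use case can be illustrated as follows, using the precision and recall values reported in Section 3.3. This is a hedged sketch rather than part of the described pipeline: the dictionary keys and the `pick_model` helper are hypothetical names introduced for illustration.

```python
# Precision and recall as reported in Section 3.3 for the four models.
REPORTED = {
    "model_1_rule_based": {"precision": 0.924, "recall": 0.871},
    "model_2_cnn": {"precision": 0.882, "recall": 0.957},
    "model_3_intersection": {"precision": 0.968, "recall": 0.857},
    "model_4_union": {"precision": 0.85, "recall": 0.971},
}


def pick_model(metric: str) -> str:
    """Return the name of the model with the highest value of `metric`."""
    return max(REPORTED, key=lambda name: REPORTED[name][metric])


# Use case where false positives are costly: favor precision.
print(pick_model("precision"))  # model_3_intersection
# Use case where false negatives are costly: favor recall.
print(pick_model("recall"))     # model_4_union
```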