Precision and recall
Pattern-recognition performance metrics
In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.
Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances. Written as a formula:

$$\text{Precision} = \frac{|\{\text{relevant instances}\} \cap \{\text{retrieved instances}\}|}{|\{\text{retrieved instances}\}|}$$

Recall (also known as sensitivity) is the fraction of relevant instances that were retrieved. Written as a formula:

$$\text{Recall} = \frac{|\{\text{relevant instances}\} \cap \{\text{retrieved instances}\}|}{|\{\text{relevant instances}\}|}$$
Both precision and recall are therefore based on relevance.
Consider a computer program for recognizing dogs (the relevant element) in a digital photograph.
Upon processing a picture which contains ten cats and twelve dogs, the program identifies eight dogs. Of the eight elements identified as dogs, only five actually are dogs (true positives), while the other three are cats (false positives). Seven dogs were missed (false negatives), and seven cats were correctly excluded (true negatives). The program's precision is then 5/8 (true positives / selected elements) while its recall is 5/12 (true positives / relevant elements).
Adopting a hypothesis-testing approach, where in this case the null hypothesis is that a given item is irrelevant (not a dog), absence of type I and type II errors (perfect specificity and sensitivity) corresponds respectively to perfect precision (no false positives) and perfect recall (no false negatives).
More generally, recall is simply the complement of the type II error rate (i.e., one minus the type II error rate). Precision is related to the type I error rate, but in a slightly more complicated way, as it also depends upon the prior distribution of seeing a relevant vs. an irrelevant item.
The above cat and dog example contained 8 − 5 = 3 type I errors (false positives) out of 10 total cats (true negatives), for a type I error rate of 3/10, and 12 − 5 = 7 type II errors (false negatives), for a type II error rate of 7/12.
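The arithmetic of this example can be checked with a minimal Python sketch; the counts below are taken directly from the cat-and-dog scenario above:

```python
# Counts from the cat-and-dog example above.
tp = 5   # dogs correctly identified as dogs (true positives)
fp = 3   # cats wrongly identified as dogs (false positives)
fn = 7   # dogs that were missed (false negatives)
tn = 7   # cats correctly excluded (true negatives)

precision = tp / (tp + fp)           # 5/8 = 0.625
recall = tp / (tp + fn)              # 5/12 ≈ 0.417
type_i_error_rate = fp / (fp + tn)   # 3/10 = 0.3  (out of the 10 cats)
type_ii_error_rate = fn / (fn + tp)  # 7/12 ≈ 0.583 (out of the 12 dogs)

print(precision, recall, type_i_error_rate, type_ii_error_rate)
```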
Precision can be seen as a measure of quality, and recall as a measure of quantity. Higher precision means that an algorithm returns more relevant results than irrelevant ones, and high recall means that an algorithm returns most of the relevant results (whether or not irrelevant ones are also returned).
Introduction
In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labelled as belonging to the positive class) divided by the total number of elements labelled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labelled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labelled as belonging to the positive class but should have been).
Precision and recall are not particularly useful metrics when used in isolation. For instance, it is possible to have perfect recall by simply retrieving every single item. Likewise, it is possible to achieve perfect precision by selecting only a very small number of extremely likely items.
In a classification task, a precision score of 1.0 for a class C means that every item labelled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labelled correctly), whereas a recall of 1.0 means that every item from class C was labelled as belonging to class C (but says nothing about how many items from other classes were incorrectly also labelled as belonging to class C).
Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other, but context may dictate whether one is more valued in a given situation:
A smoke detector is generally designed to commit many Type I errors (to alert in many situations when there is no danger), because the cost of a Type II error (failing to sound an alarm during a major fire) is prohibitively high.
As such, smoke detectors are designed with recall in mind (to catch all real danger), even while giving little weight to the losses in precision (and making many false alarms). In the other direction, Blackstone's ratio, "It is better that ten guilty persons escape than that one innocent suffer," emphasizes the costs of a Type I error (convicting an innocent person).
As such, the criminal justice system is geared toward precision (not convicting innocents), even at the cost of losses in recall (letting more guilty people go free).
A brain surgeon removing a cancerous tumor from a patient's brain illustrates the tradeoffs as well: The surgeon needs to remove all of the tumor cells since any remaining cancer cells will regenerate the tumor.
Conversely, the surgeon must not remove healthy brain cells since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of the brain they remove to ensure they have extracted all the cancer cells. This decision increases recall but reduces precision. On the other hand, the surgeon may be more conservative in the brain cells they remove to ensure they extract only cancer cells.
This decision increases precision but reduces recall.
That is to say, greater recall increases the chances of removing healthy cells (negative outcome) and increases the chances of removing all cancer cells (positive outcome). Greater precision decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome).
Usually, precision and recall scores are not discussed in isolation.
A precision-recall curve plots precision as a function of recall; usually precision will decrease as the recall increases.
Alternatively, values for one measure can be compared for a fixed level of the other measure (e.g. precision at a recall level of 0.75), or both can be combined into a single measure. Examples of measures that are a combination of precision and recall are the F-measure (the weighted harmonic mean of precision and recall), or the Matthews correlation coefficient, which is a geometric mean of the chance-corrected variants: the regression coefficients Informedness (DeltaP') and Markedness (DeltaP).[1][2] Accuracy is a weighted arithmetic mean of Precision and Inverse Precision (weighted by Bias) as well as a weighted arithmetic mean of Recall and Inverse Recall (weighted by Prevalence).[1] Inverse Precision and Inverse Recall are simply the Precision and Recall of the inverse problem, where positive and negative labels are exchanged (for both real classes and prediction labels).
True Positive Rate and False Positive Rate, or equivalently Recall and 1 − Inverse Recall, are frequently plotted against each other as ROC curves and provide a principled mechanism to explore operating point tradeoffs. Outside of Information Retrieval, the application of Recall, Precision and F-measure are argued to be flawed, as they ignore the true negative cell of the contingency table and they are easily manipulated by biasing the predictions.[1] The first problem is 'solved' by using Accuracy and the second problem is 'solved' by discounting the chance component and renormalizing to Cohen's kappa, but this no longer affords the opportunity to explore tradeoffs graphically.
However, Informedness and Markedness are Kappa-like renormalizations of Recall and Precision,[3] and their geometric mean, the Matthews correlation coefficient, thus acts like a debiased F-measure.
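As an illustrative sketch of this relationship, the following Python snippet (with hypothetical confusion-matrix counts) computes Informedness and Markedness and shows that their signed geometric mean equals the Matthews correlation coefficient:

```python
import math

# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fn, fp, tn = 95, 5, 30, 70

tpr = tp / (tp + fn)   # recall / sensitivity
tnr = tn / (tn + fp)   # specificity
ppv = tp / (tp + fp)   # precision
npv = tn / (tn + fn)   # negative predictive value

informedness = tpr + tnr - 1   # bookmaker informedness (DeltaP')
markedness = ppv + npv - 1     # DeltaP

# MCC is the geometric mean of informedness and markedness,
# carrying the (shared) sign of the two quantities.
mcc = math.copysign(math.sqrt(abs(informedness * markedness)), informedness)
print(informedness, markedness, mcc)  # 0.65, 0.693..., 0.671...
```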
Definition
For classification tasks, the terms true positives, true negatives, false positives, and false negatives compare the results of the classifier under test with trusted external judgments.
The terms positive and negative refer to the classifier's prediction (sometimes known as the expectation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the observation).
Let us define an experiment from P positive instances and N negative instances for some condition.
The four outcomes can be formulated in a 2×2 contingency table or confusion matrix, as follows:
Sources: [4][5][6][7][8][9][10][11]
| Total population = P + N | Predicted positive (PP) | Predicted negative (PN) | Informedness, bookmaker informedness (BM) = TPR + TNR − 1 | Prevalence threshold (PT) = (√(TPR × FPR) − FPR) / (TPR − FPR) |
| Positive (P)[a] | True positive (TP), hit[b] | False negative (FN), miss, underestimation | True positive rate (TPR), recall, sensitivity (SEN), probability of detection, hit rate, power = TP/P = 1 − FNR | False negative rate (FNR), miss rate, type II error[c] = FN/P = 1 − TPR |
| Negative (N)[d] | False positive (FP), false alarm, overestimation | True negative (TN), correct rejection[e] | False positive rate (FPR), probability of false alarm, fall-out, type I error[f] = FP/N = 1 − TNR | True negative rate (TNR), specificity (SPC), selectivity = TN/N = 1 − FPR |
| Prevalence = P/(P + N) | Positive predictive value (PPV), precision = TP/PP = 1 − FDR | False omission rate (FOR) = FN/PN = 1 − NPV | Positive likelihood ratio (LR+) = TPR/FPR | Negative likelihood ratio (LR−) = FNR/TNR |
| Accuracy (ACC) = (TP + TN)/(P + N) | False discovery rate (FDR) = FP/PP = 1 − PPV | Negative predictive value (NPV) = TN/PN = 1 − FOR | Markedness (MK), deltaP (Δp) = PPV + NPV − 1 | Diagnostic odds ratio (DOR) = LR+/LR− |
| Balanced accuracy (BA) = (TPR + TNR)/2 | F1 score = 2 PPV × TPR/(PPV + TPR) = 2 TP/(2 TP + FP + FN) | Fowlkes–Mallows index (FM) = √(PPV × TPR) | Matthews correlation coefficient (MCC) = √(TPR × TNR × PPV × NPV) − √(FNR × FPR × FOR × FDR) | Threat score (TS), critical success index (CSI), Jaccard index = TP/(TP + FN + FP) |
- ^the number of real positive cases in the data
- ^A test result that correctly indicates the presence of a condition or characteristic
- ^Type II error: A test result which wrongly indicates that a particular condition or attribute is absent
- ^the number of real negative cases in the data
- ^A test result that correctly indicates the absence of a condition or characteristic
- ^Type I error: A test result which wrongly indicates that a particular condition or attribute is present
Precision and recall are then defined as:[12]

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$
Recall in this context is also referred to as the true positive rate or sensitivity, and precision is also referred to as positive predictive value (PPV); other related measures used in classification include true negative rate and accuracy.[12] True negative rate is also called specificity.
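A minimal Python sketch of these definitions, assuming binary labels encoded as 1 (positive) and 0 (negative); the label lists are hypothetical:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 4 actual positives, 3 predicted positives, 2 of them correct.
y_true = [1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0]
print(precision_recall(y_true, y_pred))  # (0.666..., 0.5)
```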
Precision vs. Recall
Both precision and recall may be useful in cases where there is imbalanced data. However, it may be valuable to prioritize one over the other in cases where the outcome of a false positive or false negative is costly. For example, in medical diagnosis, a false positive test can lead to unnecessary treatment and expenses.
In this situation, it is useful to value precision over recall. In other cases, the cost of a false negative is high. For instance, the cost of a false negative in fraud detection is high, as failing to detect a fraudulent transaction can result in significant financial loss.[13]
Probabilistic Definition
Precision and recall can be interpreted as (estimated) conditional probabilities:[14] precision is given by $P(C = P \mid \hat{C} = P)$, while recall is given by $P(\hat{C} = P \mid C = P)$,[15] where $\hat{C}$ is the predicted class and $C$ is the actual class (i.e. $C = P$ means the actual class is positive). Both quantities are, therefore, connected by Bayes' theorem.
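The connection can be checked numerically: precision times $P(\hat{C} = P)$ and recall times $P(C = P)$ both equal the joint probability $P(\hat{C} = P, C = P)$. A short sketch using the counts from the cat-and-dog example above:

```python
# Joint counts from the cat-and-dog example (22 animals total).
total = 22
tp, fp, fn = 5, 3, 7

p_pred_pos = (tp + fp) / total    # P(predicted positive) = 8/22
p_actual_pos = (tp + fn) / total  # P(actually positive)  = 12/22
p_joint = tp / total              # P(predicted positive AND actually positive)

precision = p_joint / p_pred_pos    # P(C = P | C-hat = P) = 5/8
recall = p_joint / p_actual_pos     # P(C-hat = P | C = P) = 5/12

# Bayes' theorem: both routes recover the same joint probability.
assert abs(precision * p_pred_pos - recall * p_actual_pos) < 1e-12
print(precision, recall)
```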
No-Skill Classifiers
The probabilistic interpretation makes it easy to derive how a no-skill classifier would perform. A no-skill classifier is defined by the property that the joint probability $P(\hat{C} = P, C = P) = P(\hat{C} = P)\,P(C = P)$ is just the product of the unconditional probabilities, since the classification and the presence of the class are independent.
For example, the precision of a no-skill classifier is simply a constant, $\text{Precision}_{\text{no-skill}} = P(C = P)$, i.e. it is determined by the probability/frequency with which the class P occurs.
A similar argument can be made for the recall: $\text{Recall}_{\text{no-skill}} = P(\hat{C} = P)$, which is the probability for a positive classification.
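A small simulation sketch (with arbitrarily chosen frequencies) illustrates this: a classifier that predicts independently of the true class has precision close to $P(C = P)$ and recall close to $P(\hat{C} = P)$:

```python
import random

random.seed(0)
n = 100_000
p_class = 0.2     # P(C = P): prevalence of the positive class
p_predict = 0.35  # P(C-hat = P): how often the classifier says "positive"

tp = fp = fn = 0
for _ in range(n):
    actual = random.random() < p_class
    predicted = random.random() < p_predict  # independent of `actual`
    if predicted and actual:
        tp += 1
    elif predicted:
        fp += 1
    elif actual:
        fn += 1

print(tp / (tp + fp))  # precision: approx. 0.2  (= p_class)
print(tp / (tp + fn))  # recall:    approx. 0.35 (= p_predict)
```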
Imbalanced data
Accuracy can be a misleading metric for imbalanced data sets.
Consider a sample with 95 negative and 5 positive values. Classifying all values as negative in this case gives a 0.95 accuracy score. There are many metrics that don't suffer from this problem. For example, balanced accuracy[16] (bACC) normalizes true positive and true negative predictions by the number of positive and negative samples, respectively, and divides their sum by two:

$$\text{bACC} = \frac{TPR + TNR}{2} = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right)$$

For the previous example (95 negative and 5 positive samples), classifying all as negative gives a 0.5 balanced accuracy score (the maximum bACC score is one), which is equivalent to the expected value of a random guess in a balanced data set.
Balanced accuracy can serve as an overall performance metric for a model, whether or not the true labels are imbalanced in the data, assuming the cost of FN is the same as FP.
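A short sketch reproducing the example above: the always-negative classifier scores 0.95 on accuracy but only 0.5 on balanced accuracy:

```python
# 95 negatives, 5 positives; the classifier predicts "negative" for everything.
p, n = 5, 95
tp, fn = 0, 5   # all positives are missed
tn, fp = 95, 0  # all negatives are correctly rejected

accuracy = (tp + tn) / (p + n)       # 0.95, which looks misleadingly good
tpr = tp / p                         # 0.0
tnr = tn / n                         # 1.0
balanced_accuracy = (tpr + tnr) / 2  # 0.5, no better than a coin flip
print(accuracy, balanced_accuracy)
```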
The TPR and FPR are a property of a given classifier operating at a specific threshold. However, the overall number of TPs, FPs, etc. depends on the class imbalance in the data via the class ratio $r = P/N$.
As the recall (or TPR) depends only on positive cases, it is not affected by $r$, but the precision is. We have that

$$\text{Prec} = \frac{TP}{TP + FP} = \frac{TPR \cdot r}{TPR \cdot r + FPR} = \frac{TPR}{TPR + FPR/r}.$$

Thus the precision has an explicit dependence on $r$.[17] Starting with balanced classes at $r = 1$ and gradually decreasing $r$, the corresponding precision will decrease, because the denominator $TPR + FPR/r$ increases.
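The following sketch holds a classifier's TPR and FPR fixed (hypothetical values) and sweeps the class ratio $r$, showing how precision degrades as positives become rarer:

```python
# A fixed classifier: these rates do not depend on class balance.
tpr, fpr = 0.9, 0.1

for r in [1.0, 0.5, 0.1, 0.01]:  # class ratio r = P/N
    precision = (tpr * r) / (tpr * r + fpr)
    print(f"r = {r:5.2f}  precision = {precision:.3f}")
# r =  1.00  precision = 0.900
# r =  0.50  precision = 0.818
# r =  0.10  precision = 0.474
# r =  0.01  precision = 0.083
```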
Another metric is the predicted positive condition rate (PPCR), which identifies the percentage of the total population that is flagged. For example, for a search engine that returns 30 results (retrieved documents) out of 1,000,000 documents, the PPCR is 0.003%.
According to Saito and Rehmsmeier, precision-recall plots are more informative than ROC plots when evaluating binary classifiers on imbalanced data.
In such scenarios, ROC plots may be visually deceptive with respect to conclusions about the reliability of classification performance.[18]
Different from the above approaches, if an imbalance scaling is applied directly by weighting the confusion matrix elements, the standard metrics definitions still apply even in the case of imbalanced datasets.[19] The weighting procedure relates the confusion matrix elements to the support set of each considered class.
F-measure
Main article: F1 score
A measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score:

$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
This measure is approximately the average of the two when they are close, and is more generally the harmonic mean, which, for the case of two numbers, coincides with the square of the geometric mean divided by the arithmetic mean.
This is also known as the $F_1$ measure, because recall and precision are evenly weighted; there are nevertheless several reasons the F-score can be criticized in particular circumstances, due to its bias as an evaluation metric.[1]
It is a special case of the general $F_\beta$ measure (for non-negative real values of $\beta$):

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

Two other commonly used F measures are the $F_2$ measure, which weights recall higher than precision, and the $F_{0.5}$ measure, which puts more emphasis on precision than recall.
The F-measure was derived by van Rijsbergen (1979) so that $F_\beta$ "measures the effectiveness of retrieval with respect to a user who attaches $\beta$ times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure $E_\alpha = 1 - \left(\frac{\alpha}{P} + \frac{1-\alpha}{R}\right)^{-1}$, the second term being the weighted harmonic mean of precision and recall with weights $(\alpha, 1 - \alpha)$. Their relationship is $F_\beta = 1 - E_\alpha$ where $\alpha = \frac{1}{1 + \beta^2}$.
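A sketch of the general $F_\beta$ formula, with hypothetical precision and recall values showing how $\beta$ shifts the emphasis:

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 0.8
print(f_beta(p, r))       # F1   ≈ 0.615 (balanced)
print(f_beta(p, r, 2.0))  # F2   ≈ 0.714 (recall-weighted, pulled toward 0.8)
print(f_beta(p, r, 0.5))  # F0.5 ≈ 0.541 (precision-weighted, pulled toward 0.5)
```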
Limitations as goals
There are other parameters and strategies for performance metrics of information retrieval systems, such as the area under the ROC curve (AUC)[20] or pseudo-R-squared.
Multi-class evaluation
Precision and recall values can also be calculated for classification problems with more than two classes.[21] To obtain the precision for a given class, we divide the number of true positives by the classifier bias towards this class (the number of times that the classifier has predicted the class).
To calculate the recall for a given class, we divide the number of true positives by the prevalence of this class (the number of times that the class occurs in the data sample).
The class-wise precision and recall values can then be combined into an overall multi-class evaluation score, e.g., using the macro F1 metric, as sketched below.[21]
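A self-contained sketch of this procedure for a three-class problem with hypothetical labels, combining per-class scores into a macro F1:

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Per-class precision/recall from label lists, averaged into macro F1."""
    classes = set(y_true) | set(y_pred)
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)  # classifier bias towards each class
    actual = Counter(y_true)     # prevalence of each class
    f1s = []
    for c in classes:
        prec = tp[c] / predicted[c] if predicted[c] else 0.0
        rec = tp[c] / actual[c] if actual[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]
print(macro_f1(y_true, y_pred))  # ≈ 0.656
```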
References
- ^ a b c d Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation" (PDF). Journal of Machine Learning Technologies. 2 (1): 37–63. Archived from the original (PDF) on 2019-11-14.
- ^ Perruchet, P.; Peereman, R. (2004). "The exploitation of distributional information in syllable processing".
J. Neurolinguistics. 17 (2–3): 97–119. doi:10.1016/s0911-6044(03)00059-9. S2CID 17104364.
- ^ Powers, David M. W. (2012). "The Problem with Kappa". Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012) Joint ROBUS-UNSUP Workshop.
- ^ Fawcett, Tom (2006). "An Introduction to ROC Analysis" (PDF). Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010. S2CID 2027090.
- ^ Provost, Foster; Fawcett, Tom (2013-08-01). Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. O'Reilly Media, Inc.
- ^ Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63.
- ^ Ting, Kai Ming (2011). Sammut, Claude; Webb, Geoffrey I. (eds.). Encyclopedia of Machine Learning. Springer. doi:10.1007/978-0-387-30164-8. ISBN .
- ^ Brooks, Harold; Brown, Barb; Ebert, Beth; Ferro, Chris; Jolliffe, Ian; Koh, Tieh-Yong; Roebber, Paul; Stephenson, David (2015-01-26). "WWRP/WGNE Joint Working Group on Forecast Verification Research". Collaboration for Australian Weather and Climate Research. World Meteorological Organisation. Retrieved 2019-07-17.
- ^Chicco D, Jurman G (January 2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation". BMC Genomics.
21 (1): 6-1–6-13. doi:10.1186/s12864-019-6413-7. PMC 6941312. PMID 31898477.
- ^ Chicco D, Toetsch N, Jurman G (February 2021). "The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation". BioData Mining. 14 (13): 13.
doi:10.1186/s13040-021-00244-z. PMC 7863449. PMID 33541410.
- ^Tharwat A. (August 2018). "Classification assessment methods". Applied Computing and Informatics. 17: 168–192. doi:10.1016/j.aci.2018.08.003.
- ^ a b Olson, David L.; Delen, Dursun (2008); Advanced Data Mining Techniques, Springer, 1st edition (February 1, 2008), page 138, ISBN 3-540-76916-1
- ^"Precision vs. But quite often, and I can attest to this, experts tend to offer half-baked explanations which confuse newcomers even more
Recall: Differences, Use Cases & Evaluation".
- ^ Fatih Cakir, Kun He, Xide Xia, Brian Kulis, Stan Sclaroff, Deep Metric Learning to Rank, In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- ^ Roelleke, Thomas (2022-05-31). Information Retrieval Models: Foundations & Relationships. Springer Nature. ISBN .
- ^Mower, Jeffrey P. (2005-04-12). "PREP-Mt: predictive RNA editor for plant mitochondrial genes". BMC Bioinformatics. 6: 96. doi:10.1186/1471-2105-6-96.
ISSN 1471-2105. PMC 1087475. PMID 15826309.
- ^ Williams, Christopher K. I. (2021-04-01). "The Effect of Class Imbalance on Precision-Recall Curves". Neural Computation. 33 (4): 853–857. arXiv:2007.01905. doi:10.1162/neco_a_01362. hdl:20.500.11820/8a709831-cbfe-4c8e-a65b-aee5429e5b9b. ISSN 0899-7667.
- ^Saito, Takaya; Rehmsmeier, Marc (2015-03-04).
Brock, Guy (ed.). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets". PLOS ONE. 10 (3): e0118432. Bibcode:2015PLoSO..1018432S. doi:10.1371/journal.pone.0118432. ISSN 1932-6203. PMC 4349800. PMID 25738806.
- ^Tripicchio, Paolo; Camacho-Gonzalez, Gerardo; D'Avella, Salvatore (2020).
"Welding defect detection: coping with artifacts in the production line". The International Journal of Advanced Manufacturing Technology. 111 (5): 1659–1669. doi:10.1007/s00170-020-06146-4. S2CID 225136860.
- ^Zygmunt Zając. What you wanted to know about AUC. http://fastml.com/what-you-wanted-to-know-about-auc/
- ^ a b Opitz, Juri (2024). "A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice". Transactions of the Association for Computational Linguistics. 12: 820–836. arXiv:2404.16958. doi:10.1162/tacl_a_00675.
- Baeza-Yates, Ricardo; Ribeiro-Neto, Berthier (1999). Modern Information Retrieval. New York, NY: ACM Press, Addison-Wesley, pp. 75 ff. ISBN 0-201-39829-X
- Hjørland, Birger (2010); The foundation of the concept of relevance, Journal of the American Society for Information Science and Technology, 61(2), 217–237
- Makhoul, John; Kubala, Francis; Schwartz, Richard; Weischedel, Ralph (1999); Performance measures for information extraction, in Proceedings of DARPA Broadcast News Workshop, Herndon, VA, February 1999
- van Rijsbergen, Cornelis Joost "Keith" (1979); Information Retrieval, London, GB; Boston, MA: Butterworth, 2nd Edition, ISBN 0-408-70929-4