The AcroMine 'Common' Corpus covers 50 short forms that occur frequently across the whole of MEDLINE (as of March 2006). This selection criterion focuses the evaluation on system performance for well-known acronyms in the biomedical domain. The corpus includes 3,362 short/long-form pairs extracted manually from the 605,047 sentences that contain the 50 target short forms.
A common misperception is that this corpus does not contain rare acronym definitions, but this is not the case. Although the 50 short forms themselves occur frequently in MEDLINE, each short form has multiple long forms (i.e., definitions). For example, the most frequent long form for the short form CT (32,507 times) is computed tomography (18,512 times), but a great number of less frequent long forms also exist in the corpus, e.g., contract time (40 times), closure time (36 times), cell transplant (10 times), constant time (7 times), cavernous tissue (2 times), complex tone (2 times), and cortical threshold (2 times). In fact, as many as 1,076 short/long-form pairs occur only twice in the corpus.
A bioinformatician colleague extracted the long forms from the contextual sentences. Because this was a time-consuming task, we developed a tool that lets the expert browse the list of contextual sentences efficiently. When the expert chooses a term as a long form, the tool automatically eliminates the sentences containing that long form, reducing the number of unexamined sentences.
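The implementation of the tool is not described here. As a rough illustration of the elimination step only, a minimal Python sketch might look like the following. The function name `dismiss_sentences`, the case-insensitive substring match, and the example sentences are all assumptions made for illustration; none of them is taken from the corpus or from the actual tool.

```python
def dismiss_sentences(unexamined, long_form):
    """Return the sentences that do NOT mention long_form.

    Hypothetical helper: matching is a plain case-insensitive
    substring test, which is an assumption of this sketch.
    """
    needle = long_form.lower()
    return [s for s in unexamined if needle not in s.lower()]


# Made-up contextual sentences for the short form CT.
sentences = [
    "Computed tomography (CT) showed a mass.",
    "The closure time (CT) was prolonged.",
    "CT was performed using computed tomography scanners.",
]

accepted = []
# Each time the expert accepts a long form, the pool of
# unexamined sentences shrinks.
for choice in ["computed tomography"]:
    accepted.append(choice)
    sentences = dismiss_sentences(sentences, choice)

print(accepted)
print(sentences)
```

After the single choice above, only the sentence mentioning closure time remains to be examined, which is the effect that keeps rare long forms visible to the annotator.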
The following criteria were used for including long forms in the evaluation corpus:
- a long form with the minimum necessary elements (words) to produce its acronym is accepted
- a long form with unnecessary elements, e.g., magnetic resonance imaging unit (MRI) or human immunodeficiency virus infection (HIV), is not accepted, in order to keep the inclusion criteria consistent
- a misspelled long form, e.g., hidden markvov model (HMM), is accepted, in order to separate the acronym-recognition task from the spelling-correction task
Expressions satisfying the above criteria were accepted regardless of their popularity or relevance, because it is hard for a human subject to determine which long forms are appropriate for inclusion in a dictionary.
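The criteria were applied manually by the expert, not by a program. Purely to make the "minimum necessary elements" idea concrete, the sketch below uses a deliberately simple assumption: a long form is treated as minimal when the initials of its words spell the short form exactly. The helper names `word_initials` and `looks_minimal` are hypothetical, and the rule covers only plain initialisms.

```python
import re


def word_initials(long_form):
    """Upper-case initials of the words in long_form.

    Words are split on whitespace and hyphens (an assumption of
    this sketch, not a rule stated for the corpus).
    """
    words = re.split(r"[\s\-]+", long_form.strip())
    return "".join(w[0] for w in words if w).upper()


def looks_minimal(short_form, long_form):
    """True if the word initials spell the short form exactly."""
    return word_initials(long_form) == short_form.upper()


examples = [
    ("MRI", "magnetic resonance imaging"),
    ("MRI", "magnetic resonance imaging unit"),
    ("HIV", "human immunodeficiency virus infection"),
    ("HMM", "hidden markvov model"),
]
for short_form, long_form in examples:
    print(short_form, "|", long_form, "|", looks_minimal(short_form, long_form))
```

Under this rule the two long forms with an extra trailing word are rejected, while the misspelled hidden markvov model is still accepted because spelling is never checked. The rule is too strict for short forms that are not plain initialisms: for instance, it would reject enzyme-linked immunosorbent assay for ELISA, which is one reason a human judgment was needed.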
The following table shows the complete list of short forms in this corpus, with the number of distinct long forms and contextual sentences for each.
Table 1. List of the short forms and their statistics
Rank | Short form | # distinct long forms | # contextual sentences |
---|---|---|---|
Total | | 3362 | 605047 |
1 | CT | 257 | 32507 |
2 | PCR | 48 | 26486 |
3 | HIV | 12 | 19032 |
4 | LPS | 52 | 18750 |
5 | MRI | 10 | 18396 |
6 | ELISA | 27 | 16502 |
7 | SD | 189 | 16362 |
8 | BP | 137 | 15251 |
9 | CNS | 36 | 14427 |
10 | CSF | 29 | 14403 |
11 | IL | 55 | 14079 |
12 | PKC | 11 | 13389 |
13 | RT-PCR | 35 | 13122 |
14 | DA | 148 | 12893 |
15 | TNF-ALPHA | 13 | 12806 |
16 | ER | 113 | 12801 |
17 | HPLC | 19 | 12615 |
18 | TNF | 11 | 12461 |
19 | CI | 191 | 11982 |
20 | LDL | 23 | 11966 |
21 | 5-HT | 19 | 11577 |
22 | HR | 99 | 11350 |
23 | LV | 56 | 11001 |
24 | MHC | 29 | 10942 |
25 | MAP | 77 | 10781 |
26 | HCV | 13 | 10457 |
27 | EGF | 9 | 10421 |
28 | HIV-1 | 11 | 10241 |
29 | NE | 102 | 10234 |
30 | GH | 42 | 10220 |
31 | NK | 17 | 10142 |
32 | IFN-GAMMA | 3 | 10038 |
33 | CD | 261 | 9965 |
34 | BMI | 13 | 9909 |
35 | MAB | 16 | 9776 |
36 | PC | 392 | 9539 |
37 | SLE | 29 | 9506 |
38 | NMDA | 11 | 9352 |
39 | RT | 129 | 9224 |
40 | LH | 65 | 9199 |
41 | ACE | 65 | 9122 |
42 | TPA | 86 | 8971 |
43 | IL-2 | 5 | 8791 |
44 | AIDS | 15 | 8725 |
45 | PG | 142 | 8608 |
46 | PMA | 76 | 8429 |
47 | GABA | 20 | 8284 |
48 | EBV | 11 | 8249 |
49 | CAT | 86 | 5961 |
50 | AMI | 47 | 5803 |