AcroMine 'Common' Corpus

Overview of this corpus

The AcroMine 'Common' Corpus covers 50 short forms that occur frequently across the whole of MEDLINE (as of March 2006). This selection criterion focuses the evaluation on system performance for well-known acronyms in the biomedical domain. The corpus includes 3,362 short/long-form pairs extracted manually from the 605,047 sentences that contain the 50 target short forms.

A common misperception is that this corpus does not contain rare acronym definitions; this is not true. Although the 50 short forms themselves occur frequently in MEDLINE, each short form has multiple long forms (i.e., definitions). For example, the short form CT occurs 32,507 times and its most frequent long form is computed tomography (18,512 times), but the corpus also contains a great number of less frequent long forms, e.g., contract time (40 times), closure time (36 times), cell transplant (10 times), constant time (7 times), cavernous tissue (2 times), complex tone (2 times), and cortical threshold (2 times). In fact, as many as 1,076 short/long-form pairs occur only twice in the corpus.

Making of this corpus

A bioinformatician colleague extracted long forms from the contextual sentences. As this was a time-consuming task, we developed a tool with which the expert can browse the list of contextual sentences efficiently. When the expert chooses a term as a long form, the tool automatically removes the sentences containing that long form, reducing the number of unexamined sentences.
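The filtering step of such a tool can be sketched as follows. This is a minimal illustration and not the tool itself: the function name `remaining_sentences`, the use of case-insensitive substring matching, and the three example sentences are all assumptions made for this sketch.

```python
def remaining_sentences(sentences, accepted_long_forms):
    """Return the sentences that mention none of the accepted long forms.

    Assumption: a sentence "contains" a long form if the long form
    appears in it as a case-insensitive substring.
    """
    lowered = [long_form.lower() for long_form in accepted_long_forms]
    return [
        sentence
        for sentence in sentences
        if not any(long_form in sentence.lower() for long_form in lowered)
    ]


# Made-up contextual sentences for the short form CT (illustration only).
sentences = [
    "Computed tomography (CT) showed a small lesion.",
    "The closure time (CT) was measured in all patients.",
    "CT scans were repeated after six weeks.",
]

# After the expert accepts "computed tomography", only the sentences that
# do not contain that long form remain to be examined.
unexamined = remaining_sentences(sentences, ["computed tomography"])
```

Each accepted long form shrinks the list, so the expert spends the remaining effort on the rarer definitions.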

The following criteria for including long forms in the evaluation corpus were established:

  • a long form with minimum necessary elements (words) to produce its acronym is accepted
  • a long form with unnecessary elements, e.g., magnetic resonance imaging unit (MRI) or human immunodeficiency virus infection (HIV), is not accepted, to keep the inclusion criteria consistent
  • a misspelled long form, e.g., hidden markvov model (HMM), is accepted, to separate the acronym-recognition task from the spelling-correction task

Expressions satisfying the above criteria were accepted regardless of their popularity or relevance, because it is hard for a human subject to determine which long forms are appropriate for inclusion in a dictionary.

Download

To be released.

List of acronyms

The following table lists all 50 short forms in this corpus, with the number of distinct long forms and the number of contextual sentences for each.

Table 1. List of the short forms and their statistics

Rank  Short form  # distinct long forms  # contextual sentences
Total 3362 605047
1 CT 257 32507
2 PCR 48 26486
3 HIV 12 19032
4 LPS 52 18750
5 MRI 10 18396
6 ELISA 27 16502
7 SD 189 16362
8 BP 137 15251
9 CNS 36 14427
10 CSF 29 14403
11 IL 55 14079
12 PKC 11 13389
13 RT-PCR 35 13122
14 DA 148 12893
15 TNF-ALPHA 13 12806
16 ER 113 12801
17 HPLC 19 12615
18 TNF 11 12461
19 CI 191 11982
20 LDL 23 11966
21 5-HT 19 11577
22 HR 99 11350
23 LV 56 11001
24 MHC 29 10942
25 MAP 77 10781
26 HCV 13 10457
27 EGF 9 10421
28 HIV-1 11 10241
29 NE 102 10234
30 GH 42 10220
31 NK 17 10142
32 IFN-GAMMA 3 10038
33 CD 261 9965
34 BMI 13 9909
35 MAB 16 9776
36 PC 392 9539
37 SLE 29 9506
38 NMDA 11 9352
39 RT 129 9224
40 LH 65 9199
41 ACE 65 9122
42 TPA 86 8971
43 IL-2 5 8791
44 AIDS 15 8725
45 PG 142 8608
46 PMA 76 8429
47 GABA 20 8284
48 EBV 11 8249
49 CAT 86 5961
50 AMI 47 5803
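
Since the corpus itself is not yet released, the rows of Table 1 are the only machine-readable data on this page. A minimal sketch for loading them, assuming each data row consists of four whitespace-separated fields as shown above (the `Row` type and `parse_row` function are hypothetical names introduced here):

```python
from typing import NamedTuple


class Row(NamedTuple):
    rank: int
    short_form: str
    distinct_long_forms: int
    contextual_sentences: int


def parse_row(line: str) -> Row:
    """Parse one data row of Table 1 (not the header or the Total row)."""
    rank, short_form, n_long_forms, n_sentences = line.split()
    return Row(int(rank), short_form, int(n_long_forms), int(n_sentences))


# Three rows copied from Table 1; the full table has 50 such rows.
sample = ["1 CT 257 32507", "21 5-HT 19 11577", "32 IFN-GAMMA 3 10038"]
rows = [parse_row(line) for line in sample]

# Short forms with at least 100 distinct long forms (threshold chosen
# arbitrarily for illustration) are the most ambiguous ones.
highly_ambiguous = [r.short_form for r in rows if r.distinct_long_forms >= 100]
```

Short forms such as 5-HT and IFN-GAMMA contain hyphens but no spaces, so splitting on whitespace is sufficient for these rows.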