AcroMine 'Common' Corpus

Overview of this corpus

The AcroMine 'Common' Corpus covers 50 short forms that occur frequently in the whole of MEDLINE (as of March 2006). This selection criterion focuses the evaluation on system performance for well-known acronyms in the biomedical domain. The corpus includes 3,362 short/long-form pairs extracted manually from the 605,047 sentences that contain the 50 target short forms.

A common misperception is that this corpus does not contain rare acronym definitions, but this is not true. Although the 50 short forms themselves occur frequently in MEDLINE, each short form has multiple long forms (i.e., definitions). For example, although the most frequent long form for the short form CT (32,507 times) is computed tomography (18,512 times), a great number of less frequent long forms also exist in the corpus, e.g., contract time (40 times), closure time (36 times), cell transplant (10 times), constant time (7 times), cavernous tissue (2 times), complex tone (2 times), cortical threshold (2 times). In fact, as many as 1,076 short/long-form pairs occur only twice in the corpus.
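The CT figures quoted above can be put side by side in a few lines of Python. This is only an illustrative sketch: the counts are copied from the paragraph, the variable names are made up for the example, and the cut-off of 10 occurrences for "rare" is an assumption rather than a definition used by the corpus.

```python
# Counts copied from the paragraph above; only the long forms named there are listed.
ct_sentences = 32507  # occurrences of the short form CT
ct_long_forms = {
    "computed tomography": 18512,
    "contract time": 40,
    "closure time": 36,
    "cell transplant": 10,
    "constant time": 7,
    "cavernous tissue": 2,
    "complex tone": 2,
    "cortical threshold": 2,
}

# Most frequent long form and its share of all CT occurrences.
dominant, dominant_count = max(ct_long_forms.items(), key=lambda kv: kv[1])
share = dominant_count / ct_sentences

# Long forms at or below an assumed "rare" threshold of 10 occurrences.
RARE_THRESHOLD = 10  # assumption for illustration only
rare = sorted(name for name, n in ct_long_forms.items() if n <= RARE_THRESHOLD)

print(f"{dominant}: {share:.1%} of CT occurrences")
print("rare long forms:", ", ".join(rare))
```

Even within this small excerpt, most of the listed long forms are rare, although the short form itself is very frequent.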

Making of this corpus

A bioinformatician colleague extracted long forms from the contextual sentences. As this was a time-consuming task, we developed a tool with which the expert can browse the list of contextual sentences efficiently. When the expert chooses a term as a long form, the tool automatically eliminates the sentences containing that long form, reducing the number of unexamined sentences.
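The elimination step can be sketched as follows. This is not the actual tool, whose implementation is not described here: the function name, the case-insensitive substring match, and the example sentences are all assumptions made for illustration.

```python
def remove_covered(sentences, long_form):
    """Return the sentences that do NOT contain the chosen long form.

    Matching is a simple case-insensitive substring test (an assumption;
    the real tool may match differently).
    """
    needle = long_form.lower()
    return [s for s in sentences if needle not in s.lower()]


# Made-up example sentences, not taken from the corpus.
unexamined = [
    "Computed tomography (CT) showed a small lesion.",
    "The closure time (CT) was prolonged in both groups.",
    "Patients underwent computed tomography (CT) before surgery.",
]

# The expert accepts "computed tomography" as a long form of CT.
unexamined = remove_covered(unexamined, "computed tomography")
print(len(unexamined), "sentence(s) left to examine")
```

Each accepted long form shrinks the list, so frequent long forms are dealt with once and the expert's remaining effort goes to the less frequent ones.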

The following criteria were established for including long forms in the evaluation corpus:

a long form with minimum necessary elements (words) to produce its acronym is accepted
a long form with unnecessary elements, e.g., magnetic resonance imaging unit (MRI) or human immunodeficiency virus infection (HIV), is not accepted, in order to keep the criteria for inclusion consistent
a misspelled long form, e.g., hidden markvov model (HMM), is accepted, in order to separate the acronym-recognition task from a spelling-correction task
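The examples given in these criteria can be reproduced with a deliberately naive check. It assumes that every word of the long form contributes exactly its initial letter to the short form. That assumption holds for the examples above but not in general (ELISA, for instance, is not built from word initials alone), and the corpus itself was built by manual judgement, not by this rule. The function name is hypothetical.

```python
def uses_only_necessary_words(long_form: str, short_form: str) -> bool:
    """Naive check: the word initials of the long form spell the short form.

    Assumption for illustration only; it covers initial-letter acronyms
    and nothing else.
    """
    initials = "".join(word[0] for word in long_form.split())
    return initials.upper() == short_form.upper()


examples = [
    ("magnetic resonance imaging", "MRI"),
    ("magnetic resonance imaging unit", "MRI"),         # extra word "unit"
    ("human immunodeficiency virus infection", "HIV"),  # extra word "infection"
    ("hidden markvov model", "HMM"),                    # misspelled, still accepted
]
for long_form, short_form in examples:
    verdict = "accept" if uses_only_necessary_words(long_form, short_form) else "reject"
    print(f"{long_form} ({short_form}): {verdict}")
```

The misspelled example passes because the check looks only at how the short form is produced, not at whether each word is spelled correctly, which mirrors the separation of acronym recognition from spelling correction.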

Expressions satisfying the above criteria were accepted regardless of their popularity or relevance, because it is hard for a human subject to determine which long forms are appropriate for inclusion in a dictionary.

Download

To be released.

List of acronyms

The following table lists all 50 short forms in this corpus, with the number of distinct long forms and the number of contextual sentences for each.

Table 1. List of the short forms and their statistics

Rank	Short form	# distinct long forms	# contextual sentences
Total		3362	605047
1	CT	257	32507
2	PCR	48	26486
3	HIV	12	19032
4	LPS	52	18750
5	MRI	10	18396
6	ELISA	27	16502
7	SD	189	16362
8	BP	137	15251
9	CNS	36	14427
10	CSF	29	14403
11	IL	55	14079
12	PKC	11	13389
13	RT-PCR	35	13122
14	DA	148	12893
15	TNF-ALPHA	13	12806
16	ER	113	12801
17	HPLC	19	12615
18	TNF	11	12461
19	CI	191	11982
20	LDL	23	11966
21	5-HT	19	11577
22	HR	99	11350
23	LV	56	11001
24	MHC	29	10942
25	MAP	77	10781
26	HCV	13	10457
27	EGF	9	10421
28	HIV-1	11	10241
29	NE	102	10234
30	GH	42	10220
31	NK	17	10142
32	IFN-GAMMA	3	10038
33	CD	261	9965
34	BMI	13	9909
35	MAB	16	9776
36	PC	392	9539
37	SLE	29	9506
38	NMDA	11	9352
39	RT	129	9224
40	LH	65	9199
41	ACE	65	9122
42	TPA	86	8971
43	IL-2	5	8791
44	AIDS	15	8725
45	PG	142	8608
46	PMA	76	8429
47	GABA	20	8284
48	EBV	11	8249
49	CAT	86	5961
50	AMI	47	5803
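Because the table is tab-separated, it can be read with Python's standard csv module. The sketch below parses a short excerpt of Table 1 (the header, the Total row, and six of the 50 rows) and is only one possible way to load it. The function and field names are made up for the example.

```python
import csv
import io

# Excerpt of Table 1: header, Total row, and six of the 50 short forms.
TABLE_EXCERPT = "\n".join([
    "Rank\tShort form\t# distinct long forms\t# contextual sentences",
    "Total\t\t3362\t605047",
    "1\tCT\t257\t32507",
    "2\tPCR\t48\t26486",
    "3\tHIV\t12\t19032",
    "32\tIFN-GAMMA\t3\t10038",
    "33\tCD\t261\t9965",
    "36\tPC\t392\t9539",
])


def parse_rows(text):
    """Parse the tab-separated table, skipping the header and the Total row."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header line
    rows = []
    for rank, short_form, n_long, n_sent in reader:
        if rank == "Total":
            continue
        rows.append({
            "rank": int(rank),
            "short_form": short_form,
            "long_forms": int(n_long),
            "sentences": int(n_sent),
        })
    return rows


rows = parse_rows(TABLE_EXCERPT)
most_ambiguous = max(rows, key=lambda r: r["long_forms"])
print(most_ambiguous["short_form"], most_ambiguous["long_forms"])
```

Applied to the full table, the same function allows the column sums to be compared with the Total row (3362 distinct long forms, 605047 contextual sentences).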