Figure 1 shows the number of contextual sentences (i.e., the number of occurrences) for each short form occurring eight times or more. The x-axis lists the short forms in descending order of occurrence. The most frequent short form, CT, appears at the leftmost position, followed by other frequent acronyms that occupy a small region on the left side of the graph.
We chose 248 short forms, one every 300 entries from left to right in the graph, i.e., the 1st, 301st, 601st, ..., 74101st most frequent short forms. In other words, we sampled 1/300 of the short forms appearing eight times or more in the whole of MEDLINE. We retrieved 32,910 contextual sentences for these short forms and collected 657 long forms in a similar manner to the other corpora.
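The sampling interval described above can be checked with a short sketch. It assumes the short forms are already ranked by descending frequency; the function name `sample_ranks` is hypothetical and is not part of the original tooling.

```python
def sample_ranks(first_rank, last_rank, step):
    """Return the 1-based ranks first_rank, first_rank + step, ... up to last_rank."""
    return list(range(first_rank, last_rank + 1, step))


# Ranks taken from the text: every 300th entry from the 1st to the 74101st.
ranks = sample_ranks(1, 74101, 300)

print(len(ranks))             # 248
print(ranks[:3], ranks[-1])   # [1, 301, 601] 74101
```

The count of 248 matches the number of short forms listed in Table 1.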
A bio-informatician colleague extracted long forms from the contextual sentences. As this was a time-consuming task, we developed a tool that lets the expert browse the list of contextual sentences efficiently. When the expert chooses a term as a long form, the tool automatically removes the sentences containing that long form, reducing the number of unexamined sentences.
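The elimination step of such a tool might look like the following minimal sketch. It is not the actual tool: the function name `remove_covered`, the case-insensitive substring match, and the example sentences are all assumptions made for illustration.

```python
def remove_covered(sentences, long_form):
    """Drop every sentence that contains the chosen long form.

    Matching is a case-insensitive substring test, which is an assumption;
    the real tool may match differently.
    """
    needle = long_form.lower()
    return [s for s in sentences if needle not in s.lower()]


# Invented example sentences (not taken from the corpus).
sentences = [
    "Computed tomography (CT) was performed.",
    "Calcitonin (CT) levels were measured.",
    "A computed tomography (CT) scan was obtained.",
]

# The expert accepts "computed tomography" as a long form of CT.
remaining = remove_covered(sentences, "computed tomography")
print(remaining)  # ['Calcitonin (CT) levels were measured.']
```

Each accepted long form shrinks the list the expert still has to read, which is what makes frequent short forms tractable to annotate.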
We established the following criteria for including long forms in the evaluation corpus:
- a long form with minimum necessary elements (words) to produce its acronym is accepted
- a long form with unnecessary elements, e.g., magnetic resonance imaging unit (MRI) or human immunodeficiency virus infection (HIV), is not accepted, so that the inclusion criteria stay consistent
- a misspelled long form, e.g., hidden markvov model (HMM), is accepted, so that the acronym-recognition task stays separate from the spelling-correction task
Expressions satisfying the above criteria were accepted regardless of their popularity or relevance, because it is hard for a human subject to determine which long forms are appropriate for inclusion in a dictionary.
The following table shows the complete list of short and long forms in this corpus. A short form in this corpus may have no distinct long form for one of two reasons: every long form for the short form occurs fewer than twice, or no long form should be extracted because the short form is not a valid acronym.
Table 1. List of the short forms and their statistics
Rank | Short form | # distinct long forms | # contextual sentences |
---|---|---|---|
Total | | 657 | 32910 |
1 | CT | 257 | 32507 |
2 | PCP | 61 | 3606 |
3 | CFTR | 11 | 2079 |
4 | USA | 3 | 1441 |
5 | C3 | 22 | 1085 |
6 | ACD | 42 | 893 |
7 | CHX | 5 | 734 |
8 | QC | 14 | 632 |
9 | MAR | 31 | 539 |
10 | TUR | 9 | 471 |
11 | PSF | 24 | 417 |
12 | EPP | 21 | 372 |
13 | SEVERE | 0 | 337 |
14 | NSP | 21 | 309 |
15 | ARB | 21 | 283 |
16 | FNH | 2 | 261 |
17 | XI | 6 | 240 |
18 | MX | 19 | 223 |
19 | PRD | 23 | 207 |
20 | AOAA | 4 | 193 |
21 | DENA | 4 | 181 |
22 | 16 H | 0 | 171 |
23 | ASPS | 12 | 162 |
24 | SUS | 12 | 152 |
25 | P70S6K | 6 | 144 |
26 | HACAT | 1 | 138 |
27 | BSH | 5 | 132 |
28 | 1S,3R-ACPD | 10 | 126 |
29 | S+ | 0 | 120 |
30 | 69 PERCENT | 0 | 116 |
31 | EC 3.1.1.3 | 0 | 111 |
32 | PBGD | 3 | 106 |
33 | CYP7A1 | 1 | 102 |
34 | 5-10 MG/KG | 0 | 98 |
35 | D- | 0 | 94 |
36 | AZF | 4 | 91 |
37 | MSI-L | 3 | 87 |
38 | STL | 14 | 84 |
39 | UDC | 2 | 81 |
40 | DTMP | 3 | 79 |
41 | VICIA FABA | 0 | 76 |
42 | N20 | 0 | 74 |
43 | G-A | 0 | 72 |
44 | GTE | 2 | 70 |
45 | GABA-IR | 3 | 68 |
46 | DDDS | 3 | 66 |
47 | EXCEPT ONE | 0 | 64 |
48 | FSU | 5 | 62 |
49 | VIL-10 | 2 | 60 |
50 | HKC | 3 | 59 |
51 | 7 MIN | 0 | 58 |
52 | PMCT | 4 | 56 |
53 | EG2 | 1 | 55 |
54 | 14L:10D | 0 | 54 |
55 | PYRUVATE | 0 | 52 |
56 | MNPCES | 2 | 51 |
57 | DOWNSTREAM | 0 | 50 |
58 | CA2 | 1 | 49 |
59 | B-CELLS | 0 | 48 |
60 | AP3A | 3 | 47 |
61 | CARCINOMA | 0 | 46 |
62 | D GROUP | 0 | 45 |
63 | CT-SCAN | 1 | 44 |
64 | E-I | 4 | 43 |
65 | HR/HR | 0 | 42 |
66 | MAST CELLS | 0 | 41 |
67 | Q(A | 0 | 40 |
68 | SFV' | 0 | 39 |
69 | 7:3, V/V | 0 | 39 |
70 | GENOTYPE | 0 | 38 |
71 | 150 PPM | 0 | 37 |
72 | OR PLASMA | 0 | 37 |
73 | FASR | 2 | 36 |
74 | A CA(2+ | 0 | 35 |
75 | PRO-UPA | 1 | 35 |
76 | K(OW | 0 | 34 |
77 | SST1-5 | 1 | 33 |
78 | CODE | 1 | 33 |
79 | ALLOGRAFT | 0 | 32 |
80 | PA1 | 2 | 32 |
81 | AQUAPORINS | 0 | 31 |
82 | NIPAAM | 1 | 31 |
83 | N=203 | 0 | 30 |
84 | ARK | 3 | 30 |
85 | PROMM | 1 | 29 |
86 | CT/MRI | 3 | 29 |
87 | 30 MS | 0 | 28 |
88 | IHN | 4 | 28 |
89 | SP-I | 4 | 28 |
90 | OPB | 4 | 27 |
91 | CFG | 4 | 27 |
92 | IEPS | 4 | 26 |
93 | SEASONAL | 0 | 26 |
94 | AIK | 2 | 26 |
95 | AB- | 2 | 25 |
96 | IACT | 4 | 25 |
97 | RAT LIVER | 0 | 25 |
98 | RMCP II | 2 | 24 |
99 | AIJ | 1 | 24 |
100 | HPK | 3 | 24 |
101 | SJA | 2 | 23 |
102 | D(MAX | 2 | 23 |
103 | MAGNEVIST | 0 | 23 |
104 | 1,000 PPM | 0 | 23 |
105 | HDTMA | 2 | 22 |
106 | ANGI | 1 | 22 |
107 | PERSANTINE | 0 | 22 |
108 | T14 | 1 | 21 |
109 | GDIS | 2 | 21 |
110 | APND | 2 | 21 |
111 | NSILA | 1 | 21 |
112 | TFMS | 2 | 20 |
113 | C4-C6 | 0 | 20 |
114 | HOLES | 0 | 20 |
115 | P<0.013 | 0 | 20 |
116 | MG/KG/D | 0 | 19 |
117 | 10 MA | 0 | 19 |
118 | ZPT | 3 | 19 |
119 | RHODAMINE | 0 | 19 |
120 | F8C | 0 | 19 |
121 | B-FABP | 5 | 19 |
122 | CHIRAL | 0 | 18 |
123 | NA+/H+ | 0 | 18 |
124 | HEMOLYSIS | 0 | 18 |
125 | RTES | 1 | 18 |
126 | 3-HB | 1 | 18 |
127 | MERIT-HF | 3 | 17 |
128 | CK 19 | 2 | 17 |
129 | TCNE | 1 | 17 |
130 | PITPS | 1 | 17 |
131 | GP30 | 0 | 17 |
132 | 50-60 GY | 0 | 17 |
133 | NARGHI | 0 | 16 |
134 | EBE | 2 | 16 |
135 | IL-1A | 2 | 16 |
136 | QMC | 2 | 16 |
137 | UFN | 2 | 16 |
138 | AUTM | 2 | 16 |
139 | 22Q11 | 0 | 16 |
140 | U87MG | 0 | 15 |
141 | ABF1 | 1 | 15 |
142 | FR 30 | 1 | 15 |
143 | 10 BP | 0 | 15 |
144 | OR 2 | 0 | 15 |
145 | CLASS 0 | 0 | 15 |
146 | KNOBS | 0 | 15 |
147 | RMPS | 1 | 15 |
148 | R1-6 | 0 | 14 |
149 | NWL | 1 | 14 |
150 | BIND | 1 | 14 |
151 | DLL1 | 1 | 14 |
152 | TAU OFF | 0 | 14 |
153 | H2-AGONIST | 0 | 14 |
154 | 6% EACH | 0 | 14 |
155 | 0.005 MM | 0 | 14 |
156 | LEFT LOBE | 0 | 14 |
157 | HS- | 2 | 13 |
158 | CHV-1 | 3 | 13 |
159 | EMBRYOS | 0 | 13 |
160 | M22 | 0 | 13 |
161 | R=0.29 | 0 | 13 |
162 | SW-13 | 0 | 13 |
163 | OHDA | 0 | 13 |
164 | 3 MUMOL | 0 | 13 |
165 | ALPHA-AT | 2 | 13 |
166 | STEEL | 0 | 12 |
167 | CANA | 2 | 12 |
168 | GABOB | 2 | 12 |
169 | ALDB | 1 | 12 |
170 | DMNT | 1 | 12 |
171 | OSATS | 1 | 12 |
172 | R/G | 1 | 12 |
173 | 3-CB | 2 | 12 |
174 | VRNP | 1 | 12 |
175 | IONSPRAY | 0 | 12 |
176 | MFNS | 2 | 12 |
177 | MDC/CCL22 | 1 | 11 |
178 | T11TS | 1 | 11 |
179 | GLUD | 1 | 11 |
180 | PGASE | 1 | 11 |
181 | RPTKS | 1 | 11 |
182 | EC 2.8.1.2 | 0 | 11 |
183 | NOISY | 0 | 11 |
184 | 8.8 MG/KG | 0 | 11 |
185 | CMBA | 2 | 11 |
186 | X-GLUC | 0 | 11 |
187 | 20-25 G | 0 | 11 |
188 | IODIXANOL | 0 | 11 |
189 | BA6 | 1 | 11 |
190 | THETA MAX | 0 | 10 |
191 | DRB1*0101 | 0 | 10 |
192 | HEP 2 | 1 | 10 |
193 | AEU | 3 | 10 |
194 | 4-12 WEEKS | 0 | 10 |
195 | CLQ-BA | 1 | 10 |
196 | BETA1-4 | 0 | 10 |
197 | SEM 0.10 | 0 | 10 |
198 | 14 HR | 0 | 10 |
199 | FREE DRUG | 0 | 10 |
200 | IWQOL-LITE | 1 | 10 |
201 | NIGERICIN | 0 | 10 |
202 | PCB 118 | 0 | 10 |
203 | MDNCF | 1 | 10 |
204 | XOX | 1 | 10 |
205 | R2* | 0 | 10 |
206 | OPTIBOND | 0 | 9 |
207 | 0.1-3 NMOL | 0 | 9 |
208 | SERVO NULL | 0 | 9 |
209 | M 1 | 0 | 9 |
210 | PGOE | 1 | 9 |
211 | 44 MICROM | 0 | 9 |
212 | TFPI1-161 | 0 | 9 |
213 | AUTOPHAGY | 0 | 9 |
214 | WERNICKE | 0 | 9 |
215 | HERCEPTEST | 0 | 9 |
216 | 2-3 KG | 0 | 9 |
217 | AAQ | 0 | 9 |
218 | E/G | 2 | 9 |
219 | CV/VC | 0 | 9 |
220 | FRES. | 0 | 9 |
221 | MTRPS | 1 | 9 |
222 | ISVP | 3 | 9 |
223 | CASP4 | 0 | 9 |
224 | RC2 | 0 | 9 |
225 | HUEC | 1 | 8 |
226 | 6-13 YEARS | 0 | 8 |
227 | 3.3 DAYS | 0 | 8 |
228 | JDP2 | 1 | 8 |
229 | 18:1T | 0 | 8 |
230 | GROUP PE | 2 | 8 |
231 | FIOCRUZ | 1 | 8 |
232 | DAPM | 0 | 8 |
233 | 0.5 MUG | 0 | 8 |
234 | ASP-->GLU | 0 | 8 |
235 | LYS 4 | 0 | 8 |
236 | V I | 0 | 8 |
237 | OTS/FOETS | 0 | 8 |
238 | PXAS | 1 | 8 |
239 | PGLUAP | 1 | 8 |
240 | NEK | 0 | 8 |
241 | BURNS | 0 | 8 |
242 | RPSO | 0 | 8 |
243 | MK-CSF | 2 | 8 |
244 | ADH-C2 | 0 | 8 |
245 | TEF1 | 0 | 8 |
246 | CIMR | 1 | 8 |
247 | SJS/TEN | 2 | 8 |
248 | EARSS | 1 | 8 |