Acronyms result from a highly productive type of term variation that substitutes fully expanded terms (e.g., retinoic acid receptor alpha) with shortened term forms (e.g., RARA). No generic rules or exact patterns have been established for acronym creation, and acronyms often appear in documents without the expanded form explicitly stated. An acronym dictionary that associates acronyms with their expanded forms is therefore necessary for advanced text-mining tasks.
Acromine is a system for building a high-quality acronym dictionary from running text. It treats a word sequence that frequently co-occurs with a parenthetical expression as a potential expanded form, and identifies acronym definitions in a manner similar to statistical term recognition. Applied to the whole of MEDLINE (7,811,582 abstracts) as of March 2006, Acromine extracted 920,425 acronym candidates and recognized 157,803 expanded forms in a reasonable time (about 12 hours on a personal computer). The system achieves 99% precision and 82–95% recall on our evaluation corpus, which roughly emulates the whole of MEDLINE. Please refer to the following paper for more detail.
- [Okazaki06] “Building an abbreviation dictionary using a term recognition approach”. Bioinformatics. 2006. Oxford University Press.
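The core idea, that a word sequence frequently preceding a parenthetical expression is a potential expanded form, can be illustrated with a minimal sketch. This is not the Acromine implementation: the function name `candidate_long_forms`, the regular expression, and the `max_words` limit are assumptions made for illustration, and the sketch only counts raw co-occurrence frequencies rather than applying the term recognition scoring described in the paper.

```python
import re
from collections import Counter, defaultdict

# Assumed pattern: a parenthetical holding one short alphanumeric token, e.g. "(RARA)".
PAREN = re.compile(r"\(([A-Za-z0-9][A-Za-z0-9-]{1,9})\)")


def candidate_long_forms(sentences, max_words=5):
    """Count word sequences that immediately precede each parenthetical short form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for match in PAREN.finditer(sentence):
            short_form = match.group(1)
            # Lowercased words to the left of the opening parenthesis.
            words = re.findall(r"[A-Za-z0-9-]+", sentence[:match.start()].lower())
            # Every trailing word sequence of up to max_words words is a candidate.
            for n in range(1, min(max_words, len(words)) + 1):
                counts[short_form][" ".join(words[-n:])] += 1
    return counts


# Toy sentences written for this example; they are not taken from MEDLINE.
sentences = [
    "Mutations in retinoic acid receptor alpha (RARA) were observed.",
    "We studied the retinoic acid receptor alpha (RARA) gene.",
    "Expression of retinoic acid receptor alpha (RARA) was reduced.",
]
counts = candidate_long_forms(sentences)
for phrase, freq in counts["RARA"].most_common(4):
    print(freq, phrase)
```

In this toy input, "retinoic acid receptor alpha" and its shorter trailing sequences such as "alpha" all co-occur with RARA equally often, while longer sequences such as "in retinoic acid receptor alpha" occur only once. Raw frequency therefore separates the expanded form from its longer contexts but not from its nested substrings, which is why a scoring step beyond simple counting is needed.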
You can try the Acromine Acronym Dictionary at our demonstration site.
- Acromine Acronym Dictionary demonstration
  - Search for the expanded forms of an acronym, or for the acronyms of an expanded form, in the dictionary generated from the whole of MEDLINE.
- Acromine software
  - This page distributes the implementation of the Acromine system.
- Acromine 'Common' Corpus
  - An evaluation corpus for acronym recognition. It covers 50 short forms that occur frequently in the whole of MEDLINE (as of March 2006).
- Acromine 'Paper' Corpus
  - An evaluation corpus covering 50 short forms chosen from those discussed in papers on acronym recognition.
- Acromine 'Random' Corpus
  - An evaluation corpus of 657 short/long-form pairs, sampled at random from those appearing more than 8 times in the whole of MEDLINE.