Best of both worlds – let’s cherish the features of parametric AND explicit knowledge retrieval
Contributor: Barbara Strasser
Mentors: JB Poline, Arman Jahanpour, Sebastian Urchs, Alyssa Dai, bcmcpher
Annotating research data is essential to making it findable and reusable. High-quality annotations require domain expertise, so that the annotations are relevant, but also specific technical skills, such as knowing how to handle JSON or XML files. In addition, people are often reluctant to change their workflows, and the technical demands of data annotation intensify this reluctance. As a result, researchers tend to stick to their “data handling traditions” as soon as data operations become too complex. Unfortunately, this often means that even though projects like Neurobagel are working hard to make life easier for researchers, their tools are not widely adopted.
My idea for contributing to the Neurobagel project is to combine a user-friendly interface with a Large Language Model (LLM) approach to make the annotation of tabular data even more effortless for researchers. For the end user, the process should be simple: provide a tabular file and receive a first-pass annotation for review, without any intermediate steps.
From a technical perspective, the LLM categorizes the columns of the file, for example as participant ID or age. To improve the LLM's predictions, already annotated data will be linked to explicit knowledge from existing ontologies, such as SNOMED CT or the Cognitive Atlas, and supplied to the LLM as context.
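The column-categorization step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the category labels, the example annotations, the prompt wording, and the function names (`read_columns`, `build_prompt`, `annotate`) are all hypothetical, and the LLM is represented by any callable that takes a prompt string and returns a string, so no particular model or API is implied. Ontology terms could be attached to the examples in the same way, but are left out here.

```python
import csv
import io
from typing import Callable

# Hypothetical category labels; a real vocabulary would come from the annotation tool.
CATEGORIES = ["participant_id", "age", "sex", "unknown"]

# Hypothetical, previously annotated columns used as context for the model.
ANNOTATED_EXAMPLES = [
    {"header": "sub_id", "values": ["sub-01", "sub-02"], "category": "participant_id"},
    {"header": "age_years", "values": ["23", "45"], "category": "age"},
]


def read_columns(tsv_text: str, max_values: int = 3) -> dict[str, list[str]]:
    """Return each column header together with its first few values."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    headers = reader.fieldnames or []
    return {h: [row[h] for row in rows[:max_values]] for h in headers}


def build_prompt(header: str, values: list[str], examples=ANNOTATED_EXAMPLES) -> str:
    """Build a few-shot prompt: annotated examples first, then the new column."""
    lines = ["Assign each column to one of: " + ", ".join(CATEGORIES) + "."]
    for ex in examples:
        lines.append(
            f"Column '{ex['header']}' with values {ex['values']} -> {ex['category']}"
        )
    lines.append(f"Column '{header}' with values {values} ->")
    return "\n".join(lines)


def annotate(tsv_text: str, llm: Callable[[str], str]) -> dict[str, str]:
    """Ask the model for a category per column; fall back to 'unknown'."""
    result = {}
    for header, values in read_columns(tsv_text).items():
        answer = llm(build_prompt(header, values)).strip()
        # Anything outside the known labels is flagged for human review.
        result[header] = answer if answer in CATEGORIES else "unknown"
    return result
```

Passing the model in as a plain callable keeps the sketch testable without network access, and mapping unexpected answers to `unknown` reflects the idea that the output is a first-pass annotation a researcher still reviews.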
- Combine a user-friendly interface with a Large Language Model (LLM) approach to make the annotation of tabular data effortless for researchers.