In the summer of 2024, my team and I developed an AI application for the non-profit Mind. Over 10 weeks we built an application that uses Retrieval-Augmented Generation (RAG) to summarize and explain research in a way that non-experts can easily understand, while linking each generated answer to the research papers it is based on.
Apart from building an application for Mind to support their work in suicide prevention and mental health, it was really insightful to meet some of Sweden’s best researchers in the field of mental health and hear them enthusiastically say that the application would also support their everyday research process. An insight that appears again and again is that as soon as you develop something, it usually has use cases far beyond your initial expectations 💡✨
I’m really grateful for the opportunity to not only develop this application for social good, but also to work with team members and friends Finn Vaughankraska and Markus Rupp. It’s been a great learning experience and a lot of fun. A big thank you to AI Sweden and Mind.
KnowledgeSeeker is a tool built at the request of the non-profit Mind to effortlessly fetch and retrieve new research, with the aim of spreading it to Mind's internal team and external partners. The tool was built by three students from Uppsala University (Elise, Finn and Markus), enrolled in the talent program AI for Impact, jointly held by AI Sweden and Google.org. After 10 weeks of work during the summer of 2024, we can now present our final product! 🎉
Objectives of the platform
The platform has two objectives:
The first objective is to answer user questions in a manner that is easily understandable for non-experts, and to provide the direct sources that underpin the generated answer.
The second objective is to keep users continuously well-informed in their chosen research areas. Users set up a rule, which we call a “Lookout”, that scans research archives for the latest research and delivers it directly to their inbox.
Primary features
Based on these two objectives, the platform has two primary features in the user interface.
The first is to ask a research question and get a generated answer along with its references.
The user can select which sources to use in answering their question. Sources can be either their own uploaded research or research archives such as PubMed, ClinicalTrials.gov or arXiv.
The other major feature is the Lookouts. Here the user can specify their research area of interest and the time range the Lookout should cover when searching for and ranking relevant research. The user can also specify the locations where the research was conducted. When a Lookout is created, relevant sources from ClinicalTrials.gov are fetched and ranked, and the top results, up to a number specified by the user, are shown in the interface. The user can read these results and their AI-generated summaries, and explore the original research source.
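To make the Lookout rule more concrete, here is a rough Python sketch of how a time window, a location filter and a "top N" limit could be combined. It is an illustration only, not the actual KnowledgeSeeker code: the `Study` and `Lookout` classes, their field names and the `relevance` score are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Study:
    title: str
    published: date
    country: str
    relevance: float  # hypothetical relevance score, higher is better


@dataclass
class Lookout:
    topic: str
    window_days: int     # how far back in time the Lookout should search
    countries: set[str]  # an empty set means "any location"
    top_n: int           # number of results to show in the interface


def apply_lookout(lookout: Lookout, studies: list[Study], today: date) -> list[Study]:
    """Keep studies inside the time window and location filter, best first."""
    cutoff = today - timedelta(days=lookout.window_days)
    matches = [
        s for s in studies
        if s.published >= cutoff
        and (not lookout.countries or s.country in lookout.countries)
    ]
    # Rank by the (assumed) relevance score and keep only the top results.
    matches.sort(key=lambda s: s.relevance, reverse=True)
    return matches[: lookout.top_n]
```

In this sketch, a Lookout with `window_days=30` and `countries={"Sweden"}` would only return Swedish studies published within the last 30 days, ordered by relevance.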
Our first version allowed users to upload multiple PDFs and query them via a search bar. The system generates responses using Retrieval-Augmented Generation (RAG), leveraging various open-source large language models and embedding models. To find relevant articles, we used similarity search within a Postgres vector database that stores all the embeddings. We built the application using the Streamlit framework.
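The retrieval step can be pictured with a small in-memory example. The sketch below ranks text chunks by cosine similarity to a query embedding. It is a simplified stand-in for the similarity search that ran inside Postgres, and the function names and toy two-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: list[float],
    chunks: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[str]:
    """Return the texts of the k chunks most similar to the query embedding."""
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy example with hypothetical 2-dimensional "embeddings".
chunks = [
    ("sleep", [1.0, 0.0]),
    ("diet", [0.0, 1.0]),
    ("sleep and diet", [1.0, 1.0]),
]
best = top_k_chunks([1.0, 0.1], chunks, k=2)
```

The chunks returned by such a search are what gets passed to the language model as context for generating the answer.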
Our prototype successfully answered questions on topics explicitly mentioned in the documents. Additionally, we experimented with different embedding models to compare their performance. Since our queries were in Swedish while the papers were in English, we noticed that language often influenced the results more than context. For instance, text chunks containing Swedish words were automatically prioritized. Moreover, when generating responses, the answers were frequently returned in English.
To simplify integration with various LLMs, we decided to use the Command R large language model and its corresponding embedding model from Cohere. For the new functionality, we rebuilt the application with React and Next.js for the frontend and FastAPI for the backend, and kept the Postgres database. We introduced the ability to automatically retrieve new articles from research archives like arXiv, PubMed, and ClinicalTrials.gov, eliminating the need for users to manually find and upload research articles beforehand. Through the archives' APIs, we extract abstracts and summaries, which are then used as input for the LLM instead of processing entire papers.
To prevent irrelevant text snippets, such as references or side notes, from being selected, we developed a custom classifier using Unstructured and spaCy to filter out unnecessary information. This classifier achieved over 99% accuracy on test data. To further enhance the quality of the selected text chunks, we used LLMs to extract meaningful content and re-rank the chunks based on their relevance to the query. To address language-related issues, we decided to work exclusively with English texts for this functionality.
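As an illustration of what querying a research archive through its API can look like, the sketch below builds a search request for PubMed using NCBI's public E-utilities `esearch` endpoint and parses the list of article IDs from its JSON response. This is an assumption about one possible approach, not a description of the exact calls KnowledgeSeeker makes; the helper names are hypothetical, and fetching the abstracts themselves would be a separate request.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint for PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(
    term: str, max_results: int = 20, last_days: Optional[int] = None
) -> str:
    """Build an esearch URL; optionally limit to articles from the last N days."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": max_results}
    if last_days is not None:
        params["datetype"] = "pdat"  # filter on publication date
        params["reldate"] = last_days
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pubmed_ids(response_text: str) -> list[str]:
    """Extract the list of PubMed IDs from an esearch JSON response."""
    payload = json.loads(response_text)
    return payload.get("esearchresult", {}).get("idlist", [])


url = build_pubmed_search_url("suicide prevention", max_results=5, last_days=30)
```

The resulting URL can be requested with any HTTP client, and the returned IDs can then be used to look up each article's abstract.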
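To give a feel for the filtering idea, here is a deliberately simple, rule-based sketch that flags chunks that look like bibliography entries. It is a rough stand-in using regular expressions, not the Unstructured and spaCy classifier described above, and the patterns and the two-signal threshold are assumptions chosen for illustration.

```python
import re

# Signals that often appear in reference-list entries (illustrative, not exhaustive).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")
NUMBERED_REF_PATTERN = re.compile(r"^\s*(\[\d+\]|\d+\.)\s")
YEAR_PATTERN = re.compile(r"\((19|20)\d{2}[a-z]?\)")


def looks_like_reference(chunk: str) -> bool:
    """Return True when at least two reference-like signals are present."""
    signals = 0
    if DOI_PATTERN.search(chunk):
        signals += 1
    if NUMBERED_REF_PATTERN.match(chunk):
        signals += 1
    if YEAR_PATTERN.search(chunk):
        signals += 1
    return signals >= 2


def filter_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks that look like bibliography entries."""
    return [c for c in chunks if not looks_like_reference(c)]


chunks = [
    "Participants reported improved sleep after eight weeks of therapy.",
    "[12] Smith J. (2019). Sleep and mood. doi: 10.1000/xyz123",
    "Earlier work (2019) found similar effects.",
]
kept = filter_chunks(chunks)
```

Requiring two signals keeps ordinary sentences that merely mention a year, while still removing typical numbered reference entries.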
Researchers provided highly positive feedback on our application, noting its accuracy, relevance, and potential utility.
In the final version of our application, we introduced a feature called "Lookout."
This functionality allows users to create and maintain lists of relevant research articles on specific topics, which are automatically updated whenever new studies are published. Integrated with Zapier and Google Cloud Scheduler, "Lookout" can send email updates to subscribers, including summaries of the latest research. Users can also filter these updates by specific time windows, such as the last month or the last three months, and by regions like Sweden or Europe.
The Lookout feature further strengthens the application’s utility, providing researchers and professionals with timely, targeted updates on their areas of interest.
Further improvements could include enhancing the design, integrating additional archives like Google Scholar, and incorporating sources such as Wikipedia. We could also add multilingual support, including Nordic languages, and extend the service to support more organizations.
As we’re ending our project within the talent program AI for Impact, we’re really proud of the application we’ve built. We’re happy that what started as a brief for an application that Mind and other non-profits could use to find the latest research also resulted in an additional use case: helping the broader research community find research and identify research gaps.
See more from our project description on MyAI 📄: https://lnkd.in/dNCYRzRr