We scraped 1177.se using the Python package beautifulsoup4 and then converted the content from HTML to PDF using pdfkit. The resulting corpus comprised 509 documents, each corresponding to one webpage on 1177.se.
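The scrape-and-convert step could look roughly like the following minimal sketch. It is not the project's actual script: the function names (clean_html, pdf_name, page_to_pdf), the list of tags removed, the file-naming scheme, the example URL path in the docstring, and the use of urllib for fetching are all assumptions made for illustration. Only the use of beautifulsoup4 and pdfkit comes from the description above.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup


def clean_html(html):
    """Remove scripts, styles and page chrome so mostly readable content remains."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed list of tags to drop; the project's actual cleaning rules are not stated.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    return str(soup)


def pdf_name(url):
    """Derive a file name from the URL path (hypothetical scheme),
    e.g. /sjukdomar/feber/ -> sjukdomar_feber.pdf."""
    path = urlparse(url).path.strip("/")
    return (path.replace("/", "_") or "index") + ".pdf"


def page_to_pdf(url, out_dir="pdfs"):
    """Fetch one page, clean it, and write it to out_dir as a PDF."""
    # Imported here because pdfkit also needs the wkhtmltopdf binary installed.
    import pdfkit

    html = urlopen(url).read().decode("utf-8")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    target = Path(out_dir) / pdf_name(url)
    # Explicit UTF-8 so that Swedish characters (å, ä, ö) render correctly.
    pdfkit.from_string(clean_html(html), str(target), options={"encoding": "UTF-8"})
    return target
```

Calling page_to_pdf once per scraped URL would produce one PDF per webpage, matching the one-document-per-page correspondence described above.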
The chatbot architecture consists of a large language model (LLM) that processes the user input, a vector database that stores the PDF documents, and a user interface built with Streamlit.
For the LLM, we chose Gemini-1.5-flash by Google, a multimodal and multilingual model.
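As a rough illustration of how a question might be sent to this model, the sketch below combines a user question with text passages and calls gemini-1.5-flash through the google-generativeai Python package. The prompt wording, the function names (build_prompt, ask_gemini), the GOOGLE_API_KEY environment variable, and the idea of passing passages as numbered context are assumptions for illustration. The prompts actually used in the project are not described here.

```python
import os


def build_prompt(question, passages):
    """Combine context passages and the user question into one prompt string."""
    context = "\n\n".join(
        f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1)
    )
    # Hypothetical instruction text, not the project's actual prompt.
    return (
        "Answer the question using only the numbered passages below.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question.strip()}"
    )


def ask_gemini(question, passages, model_name="gemini-1.5-flash"):
    """Send the assembled prompt to Gemini and return the text of the reply."""
    # Imported here because it requires the google-generativeai package.
    import google.generativeai as genai

    # Assumes the API key is provided via an environment variable.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(build_prompt(question, passages))
    return response.text
```

Keeping the prompt assembly in its own function makes it possible to inspect or test the exact text sent to the model without making an API call.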
We used ChromaDB as the vector database, as it is optimized for handling unstructured data. In addition to storing the PDFs, ChromaDB stores metadata for each document, which proved useful in this project.
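To make the role of the metadata concrete, the following sketch stores document texts in a ChromaDB collection together with per-document metadata and then queries the collection. It is an illustration under stated assumptions: the metadata fields (source_url, n_chars), the collection name, the id format, and the function names are hypothetical, and the sketch relies on ChromaDB's default embedding function, whereas the embedding model used in the project is not specified.

```python
def build_records(pdf_texts):
    """Turn {source_url: extracted_text} into ids, documents and metadata lists."""
    ids, documents, metadatas = [], [], []
    for i, (url, text) in enumerate(sorted(pdf_texts.items())):
        ids.append(f"doc-{i:04d}")
        documents.append(text)
        # Hypothetical metadata fields; the fields used in the project are not stated.
        metadatas.append({"source_url": url, "n_chars": len(text)})
    return ids, documents, metadatas


def index_and_query(pdf_texts, question, n_results=3):
    """Index the texts in an in-memory collection and return the closest matches."""
    # Imported here because it requires the chromadb package.
    import chromadb

    client = chromadb.Client()  # in-memory client, nothing is persisted
    collection = client.get_or_create_collection(name="pages_1177")
    ids, documents, metadatas = build_records(pdf_texts)
    # Without explicit embeddings, ChromaDB embeds the documents with its
    # default embedding function (the model is downloaded on first use).
    collection.add(ids=ids, documents=documents, metadatas=metadatas)
    result = collection.query(
        query_texts=[question], n_results=min(n_results, len(ids))
    )
    # Pair each returned document with its metadata, e.g. to cite the source page.
    return list(zip(result["documents"][0], result["metadatas"][0]))
```

Because each hit comes back with its metadata, a chatbot built this way could show the originating 1177.se page next to an answer.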