Building Reasoning on CPU with LangChain, Chroma, GPT4All, and LocalAI (Part 1)
HIAAS is designed to provide domain-specific insights by leveraging advanced language models and data retrieval systems. But can we democratize AI-driven development for any and all OSS contributors?
Background
The Project XCoV19 Healthcare Platform is designed to provide domain-specific insights by leveraging advanced language models and data retrieval systems.
Enabling AI of sorts has long been overdue for this project. If you have tried our pilot WhatsApp program or seen a preview (in the video inset at the given link), it leverages a WhatsApp bot designer that connects to ChatGPT, Maps, and a healthcare database in the backend. That was then, before we pivoted to becoming a more encompassing healthcare discovery infrastructure.
But there is a need to fetch relevant diagnostic answers, ones that won't end up costing the patient's life, in a manner that is ‘privacy-first’ and HIPAA compliant. The diagnostics for the pilot were just raw text passed on to GPT. There was no way to control the nature of that text or what private information it might leak.
Today, we are adding to our roadmap the objective of integrating what is already part of the design architecture: an inference and reasoning service that is pseudo-air-gapped, meaning:
Patients will speak out their health concerns as queries, over which we exert limited control by providing cues and selective multiple-choice questions.
Our inference and reasoning service will sanitize and filter out sensitive information (a rough sketch follows this list).
Additionally, all LLMs will be privately hosted so that inputs are inferred and computed internally and there is no data leak whatsoever.
Only anonymized queries will be stored for further training. Again, this is already part of the open-source action plan, so I don't need to lobby for it.
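To make the sanitization step concrete, here is a minimal sketch of what a redaction pass could look like. The regex rules and placeholders are illustrative only; the real service would need far broader PII coverage (names, addresses, identifiers) plus auditing.

import re

# Illustrative redaction rules standing in for the real sanitization layer.
REDACTION_RULES = {
    r"\b\d{10}\b": "<PHONE>",                   # bare 10-digit phone numbers
    r"\b[\w.+-]+@[\w-]+\.[\w.]+\b": "<EMAIL>",  # email addresses
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b": "<DATE>",   # simple date formats
}

def sanitize_query(text: str) -> str:
    """Strip obvious identifiers from a patient query before inference."""
    for pattern, placeholder in REDACTION_RULES.items():
        text = re.sub(pattern, placeholder, text)
    return text

print(sanitize_query("Call me on 9876543210, fever since 01/02/2024"))
# -> 'Call me on <PHONE>, fever since <DATE>'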
We achieve this by building upon this super-inspiring article by Ettore. As a software engineer, this is non-trivial because you need to understand the fundamentals that data/ML scientists have used to build these systems. As a crash course on those fundamentals, Rohit Patel has done a great service to humanity by breaking down this mysterious domain through middle-school math (come back to this article once you are done reading his):
https://towardsdatascience.com/understanding-llms-from-scratch-using-middle-school-math-e602d27ec876
I wanted to start a prototype, but there were several constraints:
Memory is OK, but storage is low on my machine. As a result, LocalAI is rendered useless.
There is no Nvidia GPU! I use the default GPU built into my machine's motherboard. I don't want to invest in five-figure infrastructure just to experiment. I also wanted to avoid spending time on cloud-specific solutions and VM setup before starting my experiment.
I wanted a quick, lean understanding of the steps needed to get from point A to point B.
I wanted an easy and quick ramp-up using a Jupyter notebook to test my approach.
I was not sure of the models to use and needed to time-box my research.
To reproduce this PoC, I should not rely much on an internet connection, on HuggingFace or its transformers, or on any one specific vendor, and should stay as neutral as possible.
While the initial implementation using GPT4All, LangChain, and ChromaDB has proven functional, it faces performance challenges, particularly with CPU-bound inference.
Identified Bottlenecks
• Inference Latency: The use of large models like GPT4All Falcon on CPU leads to query times exceeding 10 minutes, hindering real-time applicability.
• Resource Utilization: High memory consumption during embedding generation and inference limits scalability and increases operational costs.
• Scalability Constraints: The current setup struggles with handling concurrent queries efficiently, especially on limited hardware resources.
Proof of Concept for a HIAAS Platform Using Local LLM Stacks
The proposed solution demonstrates the ability to generate healthcare insights, powered by locally hosted models and retrieval-augmented generation (RAG). We discuss the architectural design, challenges faced during implementation, and the potential for optimization and scaling using LocalAI in the next part of this article.
Motivation
Healthcare systems are inherently fragmented, making it challenging for patients and providers to access and share actionable insights. To get a brief understanding, check out my previous article:
Introducing H.I.A.A.S
This PoC addresses patient-centric healthcare queries by building a platform that integrates local AI capabilities to query medical knowledge efficiently without relying on cloud-based solutions.
Scope of the PoC
Enable efficient healthcare queries using a localized, privacy-first AI service that is auditable.
Understand the architecture requirements that support retrieval-augmented generation.
Understand open-source tooling to ensure accessibility and compliance with local data regulations.
Most importantly, can we develop out in the open using ONLY a CPU, thereby democratizing our development for any and all OSS contributors? That would be a huge win for us!
System Architecture
Core Components and Logic
Language Model (GPT4All Falcon): A local large language model (LLM) fine-tuned for reasoning and natural language understanding.
Vector Database (ChromaDB): Stores embeddings for medical data, enabling fast retrieval of relevant documents.
LangChain: Orchestrates workflows, integrates retrieval with LLM, and manages input-output flows.
Data Preprocessing Pipeline (Pandas, in short): Processes raw medical data into chunks and generates embeddings for storage (a rough sketch follows this list).
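To make the preprocessing component concrete, here is a minimal sketch of that pipeline. The CSV file name and column names are hypothetical, and the chunk sizes are illustrative; the resulting documents list is what gets embedded and stored in ChromaDB later.

import pandas as pd
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Hypothetical input file and columns; the real dataset will differ.
df = pd.read_csv("medical_dialogues.csv")

# Split each record into overlapping chunks so retrieval stays granular.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)

documents = []
for _, row in df.iterrows():
    for chunk in splitter.split_text(row["dialogue"]):
        documents.append(Document(page_content=chunk, metadata={"topic": row["topic"]}))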
Workflow Overview
Input Query: The user submits a healthcare-related query.
Document Retrieval: ChromaDB retrieves relevant documents based on semantic similarity. The documents and the embedding model have already been supplied; in our case, embeddings are computed offline and stored (see the sketch after this list).
RAG Pipeline: Retrieved documents are combined with the query and passed to GPT4All for reasoning.
Response Generation: The LLM generates a succinct response based on the query and context.
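Before the integration code below, the embedding model and the Chroma client need to exist. A minimal sketch, assuming GPT4All's bundled embedder and a local persistent store (both choices and the path are assumptions, not the only options):

import chromadb
from langchain.embeddings import GPT4AllEmbeddings

# Embeddings are computed once, offline, and persisted locally.
embedding_instance = GPT4AllEmbeddings()

# Persistent client so the vector store survives notebook restarts (path is illustrative).
chroma_client = chromadb.PersistentClient(path="./chroma_store")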
Key Concepts for Implementation
Ultimately, all the preparation boils down to this part:
# Vector Database Integration
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document

# Initialize ChromaDB and store document embeddings
vector_db = Chroma.from_documents(
    embedding=embedding_instance,
    documents=documents,
    collection_name="health_llm",
    client=chroma_client
)

# RAG Pipeline
from langchain.llms import GPT4All
from langchain.chains import VectorDBQA

# Load the locally stored model; no downloads at inference time
llm = GPT4All(
    model=f"{config['reasoning_model_path']}/{config['reasoning_model']}",
    device=config["device"],
    verbose=True,
    allow_download=False
)

# Combine retrieval and the local LLM into a question-answering chain
qa_chain = VectorDBQA.from_chain_type(llm=llm, vectorstore=vector_db)
Which gives:
qa_chain.invoke("What are the symptoms of diabetes?")
{'query': 'What are the symptoms of diabetes?',
'result': ' The symptoms of diabetes include increased thirst and urination, fatigue, blurred vision, slow healing of cuts and bruises, frequent infections, and tingling or numbness in the hands and feet.'}
As an OSS developer exploring a project, this is all I want to test so that I can understand the flow quickly and get on with life developing stuff on top of it. If it works, it's good to go. We can iterate and improve.
Description of the Sequence Flow
Query Input: A user inputs a query (e.g., “What are the symptoms of diabetes?”) via the application interface.
Retrieval Process:
The query is processed and sent to the ChromaDB vector store.
ChromaDB retrieves relevant documents based on embeddings.
Document Processing: Retrieved documents are formatted as context for the LLM.
Reasoning:
The GPT4All Falcon model processes the query along with the retrieved context.
The LLM generates a response based on the reasoning process.
Response Delivery: The system sends the generated response back to the user.
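Unrolled into code, the flow above looks roughly like this. It is a manual approximation of what the LangChain chain does internally, not the library's exact prompt, and it reuses the vector_db and llm objects defined earlier:

# Step 1: the user's query
query = "What are the symptoms of diabetes?"

# Step 2: similarity search against the stored embeddings
docs = vector_db.similarity_search(query, k=4)

# Step 3: stitch the retrieved chunks into a single context block
context = "\n\n".join(doc.page_content for doc in docs)

# Step 4: let the local LLM reason over the query plus context
prompt = (
    "Use the context below to answer the question.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)
answer = llm.invoke(prompt)

# Step 5: return `answer` to the application layer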
Explanation of the Stack
Google Colab: See my rationale above for why I could not utilize LocalAI here.
ChromaDB:
Stores precomputed embeddings of medical dialogues/documents.
Efficiently retrieves documents relevant to the user’s query using similarity search.
LangChain:
Orchestrates the interaction between ChromaDB and GPT4All Falcon.
Facilitates the Retrieval-Augmented Generation (RAG) process by combining retrieved context with user input.
GPT4All Falcon:
A locally hosted large language model for processing and reasoning.
Generates responses based on the context and user input.
Application Layer: The user-facing interface that handles query input and displays the generated response.
Challenges Encountered
1. Latency and Model Selection
As mentioned earlier, the inference time for GPT4All Falcon (falcon-Q4_0) on the CPU exceeded 10 minutes per query. This was a major bottleneck for real-time use cases. The GPT4All Falcon model provided good accuracy but was not optimized for performance on local hardware.
Mitigation Strategy:
Experiment with other quantized variants, such as falcon-Q4_1, to reduce computational requirements.
Explore and evaluate smaller LLMs like LLaMA-2-7B and WizardLM-7B for faster inference, or go with the newer DeepSeek v3 model, which looks like it would be better suited to my restrictive CPU needs. (A sketch of the model swap follows this list.)
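Since the model is driven by config in this setup, trying a smaller or differently quantized model is mostly a one-line change. A sketch with illustrative GGUF file names (the exact files on disk and the config keys are assumptions):

# Candidate model files; names are illustrative and must match what is on disk.
candidate_models = [
    "gpt4all-falcon-q4_0.gguf",    # current baseline
    "llama-2-7b-chat.Q4_0.gguf",   # smaller model to evaluate for faster CPU inference
]

config["reasoning_model"] = candidate_models[1]

llm = GPT4All(
    model=f"{config['reasoning_model_path']}/{config['reasoning_model']}",
    device=config["device"],
    allow_download=False,
)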
2. Resource Constraints
High memory and CPU utilization during embedding generation and inference limited the scalability of the platform.
Mitigation Strategy:
Use a smaller chunk size for embeddings to balance memory and performance, at the eventual cost of some accuracy.
Profile bottlenecks to optimize embedding generation. I admittedly did this prematurely by computing embeddings with multiprocessing, because generating and saving embeddings in batches was otherwise taking too long (a sketch follows).
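For the embedding side, a rough sketch of batching plus multiprocessing is below. Batch size and worker count are illustrative, and each worker builds its own embedder so nothing heavyweight has to be pickled.

from concurrent.futures import ProcessPoolExecutor
from langchain.embeddings import GPT4AllEmbeddings

BATCH_SIZE = 64  # number of text chunks embedded per call

def embed_batch(texts):
    # Each worker process creates its own embedder; only text batches cross process boundaries.
    embedder = GPT4AllEmbeddings()
    return embedder.embed_documents(texts)

def embed_all(texts, workers=4):
    batches = [texts[i:i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)
    # Flatten the per-batch results back into one list of vectors.
    return [vec for batch in results for vec in batch]

Re-creating the embedder per batch is wasteful; a per-worker initializer would be the next refinement, but this was enough to unblock batch generation.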
Roadmap Ahead
Watch out for my upcoming posts, where I will cover our entire product roadmap. Enabling AI for a healthcare-commons platform is just one of the steps; the value to patients lies in narrowing down an accurate diagnosis and fetching recommendations based on that diagnosis, as per our tagline. We do leverage an internally developed, traditional ML model to deliver on that tagline, but that is a story for another article.
Concluding Thoughts
These are wonderful times to be alive: what was once the exclusive club of GPU-equipped developers who could build such in-house implementations is now open to all developers, without relying too much on third-party vendors.
The first trial of the entire code flow is described here, along with the code that follows it. Happy reading, and let me know your thoughts!
https://colab.research.google.com/drive/1-gNuygzplXEzx1NHiTu6PQ8QLDCFdXYA?usp=sharing