{"id":89692,"date":"2025-06-26T13:47:47","date_gmt":"2025-06-26T10:47:47","guid":{"rendered":"https:\/\/intellias.com\/?post_type=blog&p=89692"},"modified":"2025-06-26T13:56:06","modified_gmt":"2025-06-26T10:56:06","slug":"how-to-run-local-llms","status":"publish","type":"blog","link":"https:\/\/intellias.com\/how-to-run-local-llms\/","title":{"rendered":"How to Run Local LLMs: A Guide for Enterprises Exploring Secure AI Solutions"},"content":{"rendered":"
But for many enterprises, the big question isn\u2019t whether to use generative AI \u2014 it\u2019s how to use it without giving up control.<\/p>\n
If your team handles sensitive financials, proprietary customer data, or competitive intel, sending prompts to a public model isn\u2019t ideal. That\u2019s where running a local LLM comes in. It\u2019s one way organizations are using GenAI on their own terms, with more privacy, faster performance, and tighter integration.<\/p>\n
In this guide, we\u2019ll show you how to run LLMs locally, walk through real enterprise use cases, and break down the tools and trade-offs of deploying local AI models for businesses. Whether you\u2019re just exploring or planning a full rollout, you\u2019ll get a clear view of how enterprise local LLMs can (or can\u2019t) fit into your stack.<\/p>\n
Every time you prompt a cloud-based model like ChatGPT, your data leaves the building. The more detailed the prompt, the better the output, but you\u2019re also sharing more information with a third party.<\/p>\n
For teams working with sensitive information, that\u2019s a non-starter. That\u2019s why some enterprises are exploring local LLMs: keeping models on their own infrastructure so data stays private, secure, and under their control.<\/p>\n
But privacy isn\u2019t the only reason local LLMs are getting attention:<\/p>\n
Anywhere you\u2019ve got unstructured information \u2014 long docs, scattered tickets, overflowing inboxes \u2014 there\u2019s a good chance GenAI can help. And for some teams, running a local LLM makes that help a lot more practical.<\/p>\n
Here are a few places where enterprise local LLMs are already making a real impact.<\/p>\n
Capgemini found that 63% of retailers use<\/a> GenAI in their customer support chatbots<\/a>. And it\u2019s not just retailers. Salesforce uses its Einstein models to cut response times in half.<\/p>\n But offloading customer data to a public LLM? That\u2019s a risky move. That\u2019s why some companies are training local LLMs on internal knowledge bases, ticket histories, and FAQs, and running them directly inside their support tools. You get faster answers, reduced agent load, and complete control over sensitive data.<\/p>\n Generative AI is helping developers make massive leaps in productivity. Research shows LLM-assisted devs are up to 55% more productive<\/a>, especially when writing boilerplate code, debugging, or generating tests.<\/p>\n We\u2019ve seen this firsthand. We\u2019ve built local chatbots for GitHub and VS Code to surface documentation, explain legacy code, and suggest improvements without sending a single line outside our firewall. It’s fast, accurate, and tailored to our codebase.<\/p>\n Let\u2019s face it \u2014 HR teams spend too much time answering the same questions. A local AI model for businesses can field the basics (leave balances, benefits, policy lookups) without involving a human.<\/p>\n But what gets interesting is personalization. A fine-tuned model can explain why someone’s payroll deduction changed or why they didn\u2019t qualify for a claim in clear, conversational language.<\/p>\n Local LLMs also speed up hiring. Instead of basic keyword matching, they can scan resumes for skill fit, experience depth, and certifications. L’Or\u00e9al\u2019s AI recruiting assistant, Mya, screened over 12,000 internship applicants, collected data like visa status and availability, and helped the team hire 80 interns, saving over 200 hours<\/a> of recruiting time in the process.<\/p>\n Confluence docs. Jira tickets. Meeting notes. 
LLMs eat this stuff for breakfast.<\/p>\n A local model running on your internal content can summarize long pages, answer questions, or generate progress reports instantly. No more digging and no more toggling tabs.<\/p>\n Intellias built an LLM-powered platform<\/a> for just this purpose for one of our customers. It became the single entry point for data search and management for all company employees.<\/p>\n According to Gartner, AI will automate 80% of project management tasks within a decade. But if you\u2019re using tools like Zoom AI, you\u2019ve probably seen it start already (think meeting summaries auto-generated and delivered as soon as the call ends).<\/p>\n One retailer<\/a> we work with is taking it further: they\u2019re training a local large language model to negotiate with vendors. It\u2019s been trained on contracts, pricing history, and supplier behavior \u2014 so it can compare offers, counter with alternatives, and suggest fair terms in real time.<\/p>\n Here comes the hands-on part (aka, the best part). This section covers the tools you need to run an LLM locally. If you\u2019re looking for an easy setup with decent customization, start with Ollama. For more flexibility and low-level control, jump to the llama.cpp section.<\/p>\n Running LLMs locally often seems complex. We\u2019re so accustomed to cloud solutions that setting up on-prem infrastructure can seem overwhelming. 
But that convenience comes at a cost: privacy \u2014 something enterprises can\u2019t ignore.<\/p>\n That’s why we suggest Ollama for anyone who wants to get started with enterprise local LLMs, especially if they don’t want to deal with the technical complexities of model deployment.<\/p>\n Ollama may be a bit heavier on system resources than lighter frameworks like llama.cpp, but that’s a trade-off for ease of use.<\/p>\n Here is a simplified breakdown of the Ollama workflow:<\/p>\n Install Ollama<\/strong><\/p>\n Step 1<\/strong>: Visit the official Ollama website<\/a> and download the application. I\u2019m using the Mac version for this tutorial.<\/p>\n Step 2<\/strong>: Open the downloaded application and click \u201cInstall\u201d.<\/p>\n That\u2019s it! Your machine now has Ollama installed on it. To verify the installation, open your terminal and run: `ollama --version`.<\/p>\n Run models in Ollama<\/strong><\/p>\n Ollama supports a variety of powerful LLMs<\/a>. Which one you choose depends on your use case and available resources. In this example, we\u2019re using Llama from Meta. It\u2019s lightweight, efficient, and a great starting point if you\u2019re working with limited hardware.<\/p>\n Not sure which model to choose? There\u2019s a section in this guide that breaks down some popular models and what they\u2019re best at.<\/p>\n Once you\u2019ve chosen the model, run the following command to load it from the Ollama library.<\/p>\n Talk with the LLM<\/strong><\/p>\n That’s it. All the groundwork is complete, and you’re ready to start asking questions right from the terminal.<\/p>\n You can customize LLMs on the Ollama command line. However, the tool also offers a web UI. It\u2019s the easiest way to interact with and customize your models.<\/p>\n First, you\u2019ll need Docker Desktop installed to set up the Ollama web UI. Installing Docker is pretty straightforward; just visit the Docker website<\/a>, download the app, and run it. 
Once Docker Desktop is up and running, follow the instructions below<\/a> to get started with Ollama.<\/p>\n Step 1<\/strong>: Open your terminal and run the following command to pull the latest web UI Docker image from GitHub:<\/p>\n Step 2<\/strong>: Execute the `docker run` command. This will allocate the necessary system resources and environment configurations to start the container.<\/p>\n Now open up Docker Desktop. Go to the Containers tab, and you’ll see a link under Port(s)<\/strong> \u2014 go ahead and click it.<\/p>\n Here\u2019s the UI that opens.<\/p>\n Ollama excels at performance and user-friendliness. But what if you have very limited hardware and need lightweight software? That\u2019s where llama.cpp<\/strong> comes in.<\/p>\n It\u2019s a C\/C++ framework designed to execute LLMs with lightning speed, making it perfect for applications that demand real-time responses.<\/p>\n Llama.cpp offers two methods for running LLMs on your local machine:<\/p>\n Step 1<\/strong>: Clone the llama.cpp repository to your local machine using the git clone command.<\/p>\n Step 2<\/strong>: Follow these commands to build the project using CMake.<\/p>\n Step 3<\/strong>: Download the desired GGUF-formatted model from the Hugging Face library<\/a>. Once done, save it to a directory on your local machine. 
You\u2019ll need this path while running the model in the next step.<\/p>\n Step 4<\/strong>: Run the model using the following command:<\/p>\n Replace `\/path\/to\/your\/model.gguf` with the actual path to your downloaded model file.<\/p>\n Quantize models:<\/strong><\/p>\n You can quantize models in llama.cpp with this syntax on the terminal: `.\/llama-quantize <input-model.gguf> <output-model.gguf> <quantization-type>`<\/p>\n Example command: `.\/llama-quantize models\/llama-2-7b.gguf models\/llama-2-7b-q4_0.gguf Q4_0`<\/p>\n To run the quantized model: `.\/llama-cli -m models\/llama-2-7b-q4_0.gguf -p "Tell me a fun fact about space"`.<\/p>\n Step 1<\/strong>: Open your terminal and run the following command to install llama.cpp on your Mac.<\/p>\n Step 2<\/strong>: Hugging Face hosts a vast collection of open-source models<\/a>. You can use its repo ID<\/strong> and model file name<\/strong> to serve a model directly in a CLI.<\/p>\n Syntax:<\/p>\n Find repo IDs and file names here<\/a>. Sample command that starts the Microsoft Phi model:<\/p>\n You can now interact with the model through the web UI or curl commands.<\/p>\n Web UI<\/strong><\/p>\n The model’s web UI is accessible on localhost: http:\/\/127.0.0.1:8080\/<\/a>.<\/p>\n Click the settings icon in the top right corner to customize the LLM.<\/p>\n Curl commands<\/strong><\/p>\n You can run this curl command directly in your terminal and get the results right there.<\/p>\n GPT4All provides both a desktop app and command-line options to run LLMs locally. The interface is clean, and the setup is pretty straightforward. 
This tool also provides access to both local and remote models.<\/p>\n You can interact with the model by going to \u201cChats<\/strong>\u201d on the left menu bar.<\/p>\n LM Studio provides a beautiful desktop app to run and chat with GGUF-based models, backed by llama.cpp.<\/p>\n In this section, we outline key features of some popular large language models to help you choose the right model.<\/p>\n Llama 3<\/strong>: Llama 3 handles complex NLP tasks. The model deeply understands context and excels at response generation. The core models are text-only, though the Llama 3.2 releases add image understanding, and that depth and breadth make the family worthwhile for research and market analysis. Llama 3 is also ideal for conversational AI<\/a> like chatbots or customer support personalization.<\/p>\n Mistral models<\/strong> are built for low-latency tasks where every millisecond counts. They conduct high-speed text processing, making them perfect for real-time chatbots. For instance, Ministral 3B and Ministral 8B are designed to process data faster on limited hardware, ideal for IoT<\/a> and mobile applications. The Mistral family also includes Codestral and Codestral Mamba (7B), which excel at programming tasks.<\/p>\n Phi<\/strong>: Phi is a family of transformer-based models. Designed for compact devices, these models (ranging from 3.8B to 14B parameters) punch above their weight. They are especially sharp at reasoning-focused applications like solving logic problems, doing math, and following detailed instructions.<\/p>\n CodeGen<\/strong>: A Salesforce-developed model, CodeGen lets developers describe what they want in plain English and turns it into usable code.<\/p>\n BERT<\/strong>: BERT is a resource-efficient, encoder-only model. It understands text well but doesn\u2019t generate it. 
It shines instead at sentiment analysis, text classification, and research applications.<\/p>\n This section is more about you: how to keep LLM costs in check and integrate models into your business applications at scale.<\/p>\n Local LLMs aren\u2019t as expensive as you think. Here\u2019s how to make the most of them without going overboard on spending.<\/p>\n Use open source, don’t train<\/strong>: LLMs are costly when you train them from scratch. But your aim isn\u2019t to build the next ChatGPT or DeepSeek. You just want to use GenAI to enhance your business operations. So, use open-source models. These models are already trained on billions of data points; you just need to tune them for your use case.<\/p>\n Use smaller models<\/strong>: Most enterprise use cases aren\u2019t about solving general AI or building AGI (artificial general intelligence) that does everything. Typically, you need a focused, local LLM fine-tuned for a specific task, and smaller models pull their weight just fine for most of these, without eating up your compute too quickly.<\/p>\n RAG<\/strong> (Retrieval-Augmented Generation) allows your model to search instead of memorize. That means you don\u2019t need to encode all data into your model. Instead, you can store that data in a much cheaper and more scalable system and let the model pull in only the information it needs to answer the prompts. Here\u2019s how you can build RAG-based chatbots<\/a>.<\/p>\n Quantization<\/strong>: We finally got some space to talk about quantization. It stores model parameters (weights and biases) in lower-precision data formats, such as INT8 instead of float32. That single change can cut memory usage by up to 4x: a 7B-parameter model that needs roughly 28 GB in float32 fits in about 7 GB in INT8.<\/p>\n The real goal of running a local LLM is integrating it into your business applications for everyday tasks.<\/p>\n First, fine-tune an open-source model on your company\u2019s specific data and deploy it on the cloud or your own data center<\/a>. 
Cloud solutions like AWS or Google Cloud are great for scalability. For greater privacy, if you have sophisticated data centers, you can host your models there, too.<\/p>\n Then, use a Python script or FastAPI to expose your model as an API endpoint and integrate that endpoint into your business applications.<\/p>\nDeveloper productivity<\/h3>\n
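As a minimal sketch of exposing a local model as an API endpoint, here is a stdlib-only Python version. `run_model` is a hypothetical stub standing in for a call to your locally hosted model; in production you would reach for FastAPI plus authentication and logging:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_model(prompt):
    """Stub: replace with a call to your locally hosted, fine-tuned model."""
    return "echo: " + prompt

class CompletionHandler(BaseHTTPRequestHandler):
    """Minimal JSON endpoint that internal apps can call."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"answer": run_model(payload.get("prompt", ""))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve for real:
# HTTPServer(("127.0.0.1", 8000), CompletionHandler).serve_forever()
```

Internal applications then POST `{"prompt": ...}` to this service and never talk to the model host directly.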
HR and talent ops<\/h3>\n
Document processing<\/h3>\n
Smarter business ops<\/h3>\n
<\/p>\nStep-by-step: How to set up and start running a local LLM<\/h2>\n
Ollama<\/h3>\n
\n
Set up Ollama on the command line<\/h4>\n
<\/p>\n
<\/p>\n
<\/p>\nollama run llama2<\/code><\/p>\n
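Beyond the interactive prompt, Ollama also serves a local REST API (on port 11434 by default), which is how you would wire the model into other tools. A minimal Python sketch, assuming the `llama2` model pulled above is available; the helper names are ours, not Ollama's:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt, model="llama2"):
    """Assemble a POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def ask(prompt, opener=urllib.request.urlopen):
    """Send a prompt to the local model and return the reply text."""
    with opener(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With the Ollama app running, `ask("Why run LLMs locally?")` returns the model's full reply, and nothing leaves your machine.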
<\/p>\nSet up Ollama web UI<\/h4>\n
docker pull ghcr.io\/open-webui\/open-webui:main<\/code><\/p>\ndocker run -d -p 3000:8080 -e WEBUI_AUTH=False -v open-webui:\/app\/backend\/data --name open-webui ghcr.io\/open-webui\/open-webui:main<\/code><\/p>\n\n
<\/p>\n
<\/p>\nLlama.cpp<\/h3>\n
\n
Clone llama.cpp<\/h4>\n
git clone https:\/\/github.com\/ggerganov\/llama.cpp<\/code><\/p>\n
<\/p>\nmkdir build<\/code>
\ncd build<\/code>
\ncmake ..<\/code>
\ncmake --build . --config Release<\/code><\/p>\n
<\/p>\n
<\/p>\n.\/llama-cli -m \/path\/to\/your\/model.gguf<\/code><\/p>\nllama-server<\/h4>\n
brew install llama.cpp<\/code><\/p>\n
<\/p>\nllama-server --hf-repo <hugging-face-repo-id> --hf-file <gguf-model-name><\/code><\/p>\nllama-server --hf-repo microsoft\/Phi-3-mini-4k-instruct-gguf --hf-file Phi-3-mini-4k-instruct-q4.gguf<\/code><\/p>\n
<\/p>\n
<\/p>\n
<\/p>\ncurl --request POST \\<\/code>
\n--url http:\/\/localhost:8080\/completion \\<\/code>
\n--header \"Content-Type: application\/json\" \\<\/code>
\n--data '{<\/code>
\n\"prompt\": \"Tell me a fun and detailed fact about Earth.\",<\/code>
\n\"n_predict\": 100,<\/code>
\n\"temperature\": 0.9,<\/code>
\n\"top_p\": 0.95,<\/code>
\n\"top_k\": 40<\/code>
\n}'<\/code><\/p>\n
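The `temperature`, `top_p`, and `top_k` fields in that request body control how the next token is sampled. As a toy illustration (this is not llama.cpp's actual implementation, and the token scores are made up), here is what top-k filtering plus temperature scaling do to a next-token distribution:

```python
import math

def sample_dist(logits, temperature=0.9, top_k=2):
    """Keep the top_k highest-scoring tokens, divide scores by temperature, softmax."""
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    scaled = [(tok, score / temperature) for tok, score in kept]
    z = sum(math.exp(s) for _, s in scaled)
    return {tok: math.exp(s) / z for tok, s in scaled}

# Toy next-token scores: top_k discards unlikely tokens,
# lower temperature sharpens the remaining distribution.
probs = sample_dist({"Earth": 3.0, "Mars": 2.0, "banana": -1.0})
```

Raising `temperature` flattens the distribution (more creative, less predictable output), while a small `top_k` keeps generation on the rails; the values in the curl call above are a common middle ground.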
<\/p>\nMore ways to run an LLM locally:<\/h3>\n
GPT4All<\/h4>\n
\n
<\/p>\nLM Studio<\/h4>\n
\n
<\/li>\nWhich open-source language model should you choose?<\/h2>\n
How to reduce costs and integrate enterprise local LLMs<\/h2>\n
Reduce costs<\/h3>\n
Integrate and scale<\/h3>\n