Deploying local large models (DeepSeek/Qwen) with LM Studio on Mac + Translation

Thanks to the Mac's unified memory (the CPU and GPU share the same memory) and its high memory bandwidth, running large models locally on a MacBook has become practical. Riding the recent wave of DeepSeek's popularity, I set out to build a local AI translation setup. This article explains how to configure it correctly on a Mac. Once it is set up, you will be able to:

  • Use large language models for conversations on your Mac for free
  • Improve efficiency without waiting for server responses
  • Quickly translate any documents, screenshots, web pages, etc.

This article uses a MacBook as its example. In principle, a Windows PC can achieve the same result, so treat these steps as a reference.

The download tools mentioned in this article may run into network issues in mainland China; please work around them on your own.

Model Management Tools#

A model management tool is one that runs and manages large models locally and exposes them through a local server, which saves a lot of unnecessary hassle. Model management tools also provide a local chat interface, so you can use the currently popular DeepSeek even with a poor network connection.

Popular model management tools include Ollama and LM Studio. Of the two, LM Studio has a GUI, which makes downloading models easier and is friendlier for beginners, so this article uses LM Studio.

Installing Models#

Finding the right model is the key to everything that follows. Popular open-source large models currently include DeepSeek-R1, Qwen, Llama, and others; choose whichever suits your needs. I need Chinese-English translation, so I chose the more Chinese-friendly DeepSeek and Qwen (Qianwen).

Next, choose an appropriate model size for your Mac's hardware. Based on your current configuration, LM Studio disables models it cannot handle. Of course, even among the models you can use, there is a noticeable difference in how comfortable they are to run.

My machine is a MacBook Pro with an M3 Max and 36GB of memory. In testing, the 32B version of DeepSeek R1 runs, but fairly slowly: simple conversations are fine, but translation scenarios, especially larger PDFs, can be quite frustrating. On top of that, DeepSeek R1 produces a long reasoning process, which makes the 32B model feel even slower. Of course, if you have a better-specced machine, especially one with more memory, a larger model generally gives better results; that part comes down to personal preference.
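
As a rough rule of thumb (an estimate only, not an exact requirement), a 4-bit quantized model needs about half a byte per parameter for its weights, plus a few gigabytes of overhead for the KV cache and the rest of the system. That is why a 32B model sits close to the ceiling on 36GB of unified memory, while a 7B or 14B model leaves comfortable headroom:

# Rough back-of-the-envelope memory estimate for quantized models.
# The overhead term is a guess; treat the output as an approximation only.
def approx_memory_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    weights_gb = params_billion * bits_per_weight / 8   # 32B at 4-bit ~ 16 GB of weights
    overhead_gb = 2 + 0.15 * weights_gb                 # KV cache, buffers, runtime
    return weights_gb + overhead_gb

for size in (7, 14, 32):
    print(f"{size}B ~ {approx_memory_gb(size):.0f} GB")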

One more note on DeepSeek R1: it outputs a long thinking chain. That is great in itself, but in translation scenarios the thinking chain is unnecessary and even a burden, since it slows translation down. The Qwen models are a better choice for this scenario, and a workaround for the thinking chain is provided later in this article.

In summary, there are many models, each with its pros and cons; choose one that matches your needs and hardware. I used qwen2.5-7b-instruct-1m for translation (the 14B version should work as well).

You can refer to the image below for installation and download.

image

Starting the Service#

Next, load the model and start the service, following the images below.

image

image

Once loaded successfully, you can use this large model. The top item in the left sidebar is the chat function, where you can converse with the large model you loaded.

image

You can also copy the example request that LM Studio provides into the command line to check that the model is running and the service has started correctly. At this point, the large-model side of the setup is complete. Congratulations: you now have a large model that runs locally, does not depend on a remote server, and responds quickly.
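
If you prefer a small script to the copied command, here is a minimal sketch in Python (using the requests library) that performs the same check. It assumes LM Studio's server is running at its default address, http://localhost:1234, and that you replace the model name with whichever model you loaded:

import requests

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default local server address

# 1. List the models the server currently exposes.
models = requests.get(f"{BASE_URL}/models").json()
print([m["id"] for m in models["data"]])

# 2. Send a simple chat completion to confirm the loaded model responds.
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "qwen2.5-7b-instruct-1m",  # replace with the model you loaded
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])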

If you expose this service on your local network, or even the public internet, you can access your model from other devices, but that is a story for another time.

image

image

Translation - Easydict#

This article uses Easydict, an open-source local translation tool. Of course, if you find another tool you prefer, you can use that instead; Easydict is just the example here.

Installation#

You can install it using either of the two methods below.

The latest version of Easydict requires macOS 13.0+; if your system is on macOS 11.0+, please use version 2.7.2.

1. Manual Download and Installation#

Download the latest version of Easydict.

2. Homebrew Installation#

brew install --cask easydict

Configuration#

After installation, click the Easydict icon and open the settings to reach the configuration page, then go to the services section to configure your own server address. In theory, either the Ollama translation service or the custom OpenAI translation service will work here.

Just fill in your server address, port, and model name, all of which can be found on the LM Studio page.

image

image
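
Before typing the values into Easydict, you can verify them yourself with a quick request. This is only a sketch, assuming LM Studio's default address and the model name used in this article; the actual prompts Easydict sends will differ, this simply confirms that the address, port, and model name respond:

import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",  # same address and port entered in Easydict
    json={
        "model": "qwen2.5-7b-instruct-1m",        # same model name entered in Easydict
        "messages": [
            {"role": "system", "content": "You are a translation engine."},
            {"role": "user", "content": "Translate to English: 你好，世界"},
        ],
    },
)
print(resp.json()["choices"][0]["message"]["content"])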

Usage#

Once configured, you can use it right away. For the details of how to use Easydict, please refer to its official documentation.

image

Additionally, note that Easydict also supports other APIs and has built-in translation services, so it is useful even without the local large-model setup described above; it is a very handy application in its own right.

Translation - Immersive Translation#

Immersive Translation has become one of the most popular browser translation extensions since the rise of OpenAI. It supports custom API endpoints for translation and handles both web page and PDF translation, with excellent rendering of the results.

A Pro membership gets you all of this out of the box, with no fiddling required. If you want to keep experimenting with local large models, read on.

Here is the official website link.

image

After installing the extension, open its configuration page and, as before, enter the API address and model name to use the local large model for immersive translation.

image

image

Server Forwarding#

At this point it almost works, but you will run into the following issues:

  1. DeepSeek R1's translation output carries the thinking chain, which hurts the translation experience.
  2. Immersive Translation's API format does not connect seamlessly with LM Studio's, so even when the API is reachable, Immersive Translation cannot display the translated results.

So I ended up writing the simplest possible local server with Python's web.py to forward requests and make a few small adjustments along the way, which solves the two small problems above and improves the translation experience. The code is below for reference. web.py's default port is 8080, which you can change if needed.

Remember to install Python first and run pip install web.py requests (the script below uses both); I won't go into further detail here.

import web
import json
import requests
import re
import time

# Configure URL routes
urls = (
    '/v1/chat/completions', 'ChatCompletions',
    '/v1/models', 'Models'
)

def add_cors_headers():
    # Add CORS-related response headers
    web.header('Access-Control-Allow-Origin', '*')
    web.header('Access-Control-Allow-Credentials', 'true')
    web.header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
    web.header('Access-Control-Allow-Headers', 'Content-Type, Authorization')

def remove_think_tags(text):
    # Remove <think> tags and their content
    return re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)

class ChatCompletions:
    def OPTIONS(self):
        # Handle preflight requests
        add_cors_headers()
        return ''
        
    def POST(self):
        web.header('Content-Type', 'application/json')
        add_cors_headers()
        
        try:
            data = json.loads(web.data())
            lm_studio_url = "http://localhost:1234/v1/chat/completions"
            
            # Check if it's a streaming request
            is_stream = data.get('stream', False)
            
            # Forward request to LM Studio
            response = requests.post(
                lm_studio_url,
                json=data,
                headers={'Content-Type': 'application/json'},
                stream=is_stream  # Set stream parameter
            )
            
            if is_stream:
                # For streaming requests, collect full content first
                full_content = ""
                current_id = None
                
                def generate_stream():
                    nonlocal full_content, current_id
                    
                    for line in response.iter_lines():
                        if line:
                            line = line.decode('utf-8')
                            if line.startswith('data: '):
                                line = line[6:]
                            if line == '[DONE]':
                                # Process full content and send the last chunk
                                cleaned_content = remove_think_tags(full_content)
                                # Send cleaned full content
                                final_chunk = {
                                    "id": current_id,
                                    "object": "chat.completion.chunk",
                                    "created": int(time.time()),
                                    "model": "local-model",
                                    "choices": [{
                                        "index": 0,
                                        "delta": {
                                            "content": cleaned_content
                                        },
                                        "finish_reason": "stop"
                                    }]
                                }
                                yield f'data: {json.dumps(final_chunk)}\n\n'
                                yield 'data: [DONE]\n\n'
                                continue
                                
                            try:
                                chunk_data = json.loads(line)
                                current_id = chunk_data.get('id', current_id)
                                
                                if 'choices' in chunk_data:
                                    for choice in chunk_data['choices']:
                                        if 'delta' in choice:
                                            if 'content' in choice['delta']:
                                                # Accumulate content instead of sending directly
                                                full_content += choice['delta']['content']
                                
                                # Send empty progress update
                                progress_chunk = {
                                    "id": current_id,
                                    "object": "chat.completion.chunk",
                                    "created": int(time.time()),
                                    "model": "local-model",
                                    "choices": [{
                                        "index": 0,
                                        "delta": {},
                                        "finish_reason": None
                                    }]
                                }
                                yield f'data: {json.dumps(progress_chunk)}\n\n'
                                
                            except json.JSONDecodeError:
                                continue
                
                web.header('Content-Type', 'text/event-stream', unique=True)  # override the JSON content type set earlier
                web.header('Cache-Control', 'no-cache')
                web.header('Connection', 'keep-alive')
                return generate_stream()

            else:
                # Handle non-streaming requests
                response_data = json.loads(response.text)
            
                if 'choices' in response_data:
                    for choice in response_data['choices']:
                        if 'message' in choice and 'content' in choice['message']:
                            choice['message']['content'] = remove_think_tags(
                                choice['message']['content']
                            )
                return json.dumps(response_data)
            
        except Exception as e:
            print(e)
            return json.dumps({
                "error": {
                    "message": str(e),
                    "type": "proxy_error"
                }
            })

class Models:
    def OPTIONS(self):
        # Handle preflight requests
        add_cors_headers()
        return ''
        
    def GET(self):
        web.header('Content-Type', 'application/json')
        add_cors_headers()
        # Return a simulated model list
        return json.dumps({
            "data": [
                {
                    "id": "local-model",
                    "object": "model",
                    "owned_by": "local"
                }
            ]
        })

if __name__ == "__main__":
    app = web.application(urls, globals())
    app.run() 
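
To use the proxy, save the script (for example as proxy.py, a name chosen here just for illustration), start it with Python, and then point Immersive Translation's API address at port 8080 instead of 1234. Below is a quick non-streaming test of the proxy, assuming both LM Studio (port 1234) and the forwarding server (port 8080) are running:

import requests

# Send a request through the forwarding server instead of LM Studio directly.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # the web.py proxy, not port 1234
    json={
        "model": "qwen2.5-7b-instruct-1m",        # replace with your loaded model
        "messages": [{"role": "user", "content": "Translate to English: 你好，世界"}],
        "stream": False,
    },
)
# Any <think>...</think> block has already been stripped by the proxy.
print(resp.json()["choices"][0]["message"]["content"])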

At this point, you have a complete setup: a large model deployed and running locally on a Mac, plus tools for translating web pages, PDF documents, and plain text. There are plenty of alternatives along the way: model deployment can also use Ollama, the model itself can be Phi, Llama, or others, the forwarding server can be replaced with another solution (or perhaps avoided entirely, which is worth exploring), and the translation tool can be swapped for Bob. In short, there are many technical choices, and this article is only one reference. Your time is still better spent on more important things; now that you have the right tools, go make good use of them.
