🎯 Introduction
Youtu-Embedding is a state-of-the-art, general-purpose text embedding model developed by Tencent Youtu Lab. It delivers strong performance across a wide range of natural language processing tasks, including information retrieval (IR), semantic textual similarity (STS), clustering, reranking, and classification.
- Top-tier performance: As of September 2025, it ranked first on the authoritative CMTEB (Chinese Massive Text Embedding Benchmark) leaderboard with a score of 77.46, demonstrating strong and robust text representation capabilities.
- Innovative training framework: It adopts a collaborative discriminative fine-tuning framework that combines a unified data format, task-differentiated loss functions, and a dynamic single-task sampling mechanism to effectively mitigate the "negative transfer" problem in multi-task learning.
Note: You can easily fine-tune the model on your own data to adapt it to domain-specific tasks; see the training code for implementation details.
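To make the dynamic single-task sampling idea above concrete, here is a rough, hypothetical sketch (our own illustration, not the actual training code): every batch is drawn from exactly one task, so the task-differentiated loss for that batch never mixes gradients across tasks.

```python
import random

def single_task_batches(task_datasets, batch_size, seed=0):
    """Yield (task, batch) pairs where each batch comes from a single task.

    Hypothetical sketch: tasks are sampled in proportion to their remaining
    data, one way to keep per-task losses separated within each batch.
    """
    rng = random.Random(seed)
    pools = {task: list(examples) for task, examples in task_datasets.items()}
    for pool in pools.values():
        rng.shuffle(pool)
    while any(pools.values()):
        # Pick a task weighted by how much of its data is left.
        tasks = [t for t, p in pools.items() if p]
        weights = [len(pools[t]) for t in tasks]
        task = rng.choices(tasks, weights=weights, k=1)[0]
        batch, pools[task] = pools[task][:batch_size], pools[task][batch_size:]
        yield task, batch  # downstream code applies the loss chosen for `task`
```

Downstream training code would then dispatch on `task` to select the appropriate loss (e.g. an IR contrastive loss vs. an STS regression-style loss).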
🤗 Model Download
| Model Name | Parameters | Dimensions | Sequence Length | Download |
|---|---|---|---|---|
| Youtu-Embedding | 2B | 2048 | 8K | Model |
🚀 Usage
1. Using transformers
📦 Installation
```bash
pip install transformers==4.51.3 liger_kernel==0.5.4
```
⚙️ Usage
```python
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer


class LLMEmbeddingModel():

    def __init__(self,
                 model_name_or_path,
                 batch_size=128,
                 max_length=1024,
                 gpu_id=0):
        self.model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="right")

        self.device = torch.device(f"cuda:{gpu_id}")
        self.model.to(self.device).eval()

        self.max_length = max_length
        self.batch_size = batch_size

        query_instruction = "Given a search query, retrieve passages that answer the question"
        if query_instruction:
            self.query_instruction = f"Instruction: {query_instruction} \nQuery:"
        else:
            self.query_instruction = "Query:"

        self.doc_instruction = ""
        print(f"query instruction: {[self.query_instruction]}\ndoc instruction: {[self.doc_instruction]}")

    def mean_pooling(self, hidden_state, attention_mask):
        # Average the token embeddings, ignoring masked (padding/instruction) positions.
        s = torch.sum(hidden_state * attention_mask.unsqueeze(-1).float(), dim=1)
        d = attention_mask.sum(dim=1, keepdim=True).float()
        return s / d

    @torch.no_grad()
    def encode(self, sentences_batch, instruction):
        inputs = self.tokenizer(
            sentences_batch,
            padding=True,
            truncation=True,
            return_tensors="pt",
            max_length=self.max_length,
            add_special_tokens=True,
        ).to(self.device)

        outputs = self.model(**inputs)
        last_hidden_state = outputs[0]

        # Zero out the instruction tokens in the attention mask so they are
        # excluded from mean pooling.
        instruction_tokens = self.tokenizer(
            instruction,
            padding=False,
            truncation=True,
            max_length=self.max_length,
            add_special_tokens=True,
        )["input_ids"]
        if len(np.shape(np.array(instruction_tokens))) == 1:
            # Single shared instruction for the whole batch.
            inputs["attention_mask"][:, :len(instruction_tokens)] = 0
        else:
            # One instruction per sentence.
            instruction_length = [len(item) for item in instruction_tokens]
            assert len(instruction) == len(sentences_batch)
            for idx in range(len(instruction_length)):
                inputs["attention_mask"][idx, :instruction_length[idx]] = 0

        embeddings = self.mean_pooling(last_hidden_state, inputs["attention_mask"])
        embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
        return embeddings

    def encode_queries(self, queries):
        queries = queries if isinstance(queries, list) else [queries]
        queries = [f"{self.query_instruction}{query}" for query in queries]
        return self.encode(queries, self.query_instruction)

    def encode_passages(self, passages):
        passages = passages if isinstance(passages, list) else [passages]
        passages = [f"{self.doc_instruction}{passage}" for passage in passages]
        return self.encode(passages, self.doc_instruction)

    def compute_similarity_for_vectors(self, q_reps, p_reps):
        if len(p_reps.size()) == 2:
            return torch.matmul(q_reps, p_reps.transpose(0, 1))
        return torch.matmul(q_reps, p_reps.transpose(-2, -1))

    def compute_similarity(self, queries, passages):
        q_reps = self.encode_queries(queries)
        p_reps = self.encode_passages(passages)
        scores = self.compute_similarity_for_vectors(q_reps, p_reps)
        return scores.detach().cpu().tolist()


queries = ["What's the weather like?"]
passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]

model_name_or_path = "tencent/Youtu-Embedding"
model = LLMEmbeddingModel(model_name_or_path)
scores = model.compute_similarity(queries, passages)
print(f"scores: {scores}")
```
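Because `encode` L2-normalizes its outputs, the inner product computed in `compute_similarity_for_vectors` is exactly cosine similarity. A tiny dependency-free check of that identity:

```python
import math

def l2_normalize(v):
    # Scale a vector to unit Euclidean norm.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = l2_normalize([3.0, 4.0])
p = l2_normalize([4.0, 3.0])
# On unit vectors the dot product equals cosine similarity.
print(round(dot(q, p), 4))  # 0.96
```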
2. Using sentence-transformers
📦 Installation
```bash
pip install sentence-transformers==5.1.0
```
⚙️ Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tencent/Youtu-Embedding", trust_remote_code=True)
queries = ["What's the weather like?"]
passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]
queries_embeddings = model.encode_query(queries)
passages_embeddings = model.encode_document(passages)

similarities = model.similarity(queries_embeddings, passages_embeddings)
print(similarities)
```
3. Using LangChain 🦜
Easily integrate the model into your LangChain applications, such as RAG pipelines.
📦 Installation
```bash
pip install langchain==0.3.27 langchain-community==0.3.29 langchain-huggingface==0.3.1 sentence-transformers==5.1.0 faiss-cpu==1.11.0
```
⚙️ Usage
```python
import torch
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

model_name_or_path = "tencent/Youtu-Embedding"
device = "cuda" if torch.cuda.is_available() else "cpu"

model_kwargs = {
    'trust_remote_code': True,
    'device': device
}

embedder = HuggingFaceEmbeddings(
    model_name=model_name_or_path,
    model_kwargs=model_kwargs,
)

query_instruction = "Instruction: Given a search query, retrieve passages that answer the question \nQuery:"
doc_instruction = ""

data = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

documents = [Document(page_content=text, metadata={"id": i}) for i, text in enumerate(data)]
vector_store = FAISS.from_documents(documents, embedder, distance_strategy="MAX_INNER_PRODUCT")

query = "Which planet is known as the Red Planet?"
instructed_query = query_instruction + query
results = vector_store.similarity_search_with_score(instructed_query, k=3)

print(f"Original Query: {query}\n")
print("Results:")
for doc, score in results:
    print(f"- Text: {doc.page_content} (Score: {score:.4f})")
```
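In the snippet above, the instruction must be prepended to every query by hand. If you prefer that to happen automatically, a small wrapper can intercept `embed_query`. This is a hypothetical helper of our own, not part of LangChain; it assumes only the standard `embed_query`/`embed_documents` embedding interface:

```python
class InstructedEmbeddings:
    """Hypothetical wrapper: prepends an instruction to queries only.

    `base` is any object exposing LangChain's embedding interface
    (`embed_query` / `embed_documents`); documents pass through unchanged,
    matching the empty doc_instruction used above.
    """

    def __init__(self, base, query_instruction=""):
        self.base = base
        self.query_instruction = query_instruction

    def embed_query(self, text):
        # Queries get the retrieval instruction prefix.
        return self.base.embed_query(self.query_instruction + text)

    def embed_documents(self, texts):
        # Passages are embedded as-is.
        return self.base.embed_documents(texts)
```

Building the store with `FAISS.from_documents(documents, InstructedEmbeddings(embedder, query_instruction), ...)` would then let you pass raw queries to `similarity_search_with_score` without manual prefixing.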
4. Using LlamaIndex 🦙
This is ideal for integrating the model into your LlamaIndex search and retrieval systems.
📦 Installation
```bash
pip install llama-index==0.14.2 llama-index-embeddings-huggingface==0.6.1 sentence-transformers==5.1.0 llama-index-vector-stores-faiss==0.5.1
```
⚙️ Usage
```python
import faiss
import torch
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import VectorStoreQuery
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model_name_or_path = "tencent/Youtu-Embedding"
device = "cuda" if torch.cuda.is_available() else "cpu"

embeddings = HuggingFaceEmbedding(
    model_name=model_name_or_path,
    trust_remote_code=True,
    device=device,
    query_instruction="Instruction: Given a search query, retrieve passages that answer the question \nQuery:",
    text_instruction=""
)

data = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

nodes = [TextNode(id_=str(i), text=text) for i, text in enumerate(data)]

for node in nodes:
    node.embedding = embeddings.get_text_embedding(node.get_content())

embed_dim = len(nodes[0].embedding)
store = FaissVectorStore(faiss_index=faiss.IndexFlatIP(embed_dim))
store.add(nodes)

query = "Which planet is known as the Red Planet?"
query_embedding = embeddings.get_query_embedding(query)

results = store.query(
    VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=3)
)

print(f"Query: {query}\n")
print("Results:")
for idx, score in zip(results.ids, results.similarities):
    print(f"- Text: {data[int(idx)]} (Score: {score:.4f})")
```
📊 CMTEB
| Model | Params | Mean (Task) | Mean (Type) | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS |
|---|---|---|---|---|---|---|---|---|---|
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 79.30 | 68.28 | 73.73 | 55.19 |
| ritrieve_zh_v1 | 326M | 72.71 | 73.85 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |
| Conan-embedding-v2 | 1.4B | 74.24 | 75.99 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 |
| Seed1.6-embedding | - | 75.63 | 76.68 | 77.98 | 73.11 | 88.71 | 71.65 | 79.69 | 68.94 |
| QZhou-Embedding | 7B | 76.99 | 78.58 | 79.99 | 70.91 | 95.07 | 74.85 | 78.80 | 71.89 |
| Youtu-Embedding | 2B | 77.58 | 78.86 | 78.65 | 84.27 | 86.12 | 75.10 | 80.21 | 68.82 |
Note: Comparison scores are taken from the MTEB leaderboard, recorded on September 28, 2025.
🎉 Citation
```bibtex
@misc{zhang2025codiemb,
  title={CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity},
  author={Zhang, Bowen and Song, Zixin and Chen, Chunquan and Zhang, Qian-Wen and Yin, Di and Sun, Xing},
  year={2025},
  eprint={2508.11442},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2508.11442},
}
```
