English News | Guideline on AI-Empowered Chinese Language Database Construction
CD Voice - A podcast by China Daily

China is accelerating the digitalization of ancient texts and boosting access to oracle bone script data, aiming to integrate cultural heritage preservation with the development of digital Chinese, officials said on Monday.

The Ministry of Education, the National Language Commission and the Cyberspace Administration of China issued a guideline to promote the digitalization of the Chinese language and characters. The focus is on developing national language resources and large-scale Chinese language models to support artificial intelligence.

The guideline aims to establish a national corpus and a strategic language resources information database by 2027. By 2035, the country hopes to have significantly expanded the presence of the Chinese language in global digital and generative AI scenarios.

Liu Peijun, head of the Department of Language Information Management at the Ministry of Education, said the guideline calls for the digitalization of linguistic and cultural heritage, while promoting the construction of a national digital language and script museum.

It emphasizes advancing key technologies for ancient text digitalization, enhancing the accessibility of oracle bone script data and launching a multilingual digital education program to facilitate Chinese language learning globally, Liu said at a news conference.

A key aspect of this initiative is the development of large-scale linguistic data resources.
The guideline outlines a plan to build a national corpus with extensive Chinese language datasets to support AI applications.

Among the pilot projects, Beijing Normal University has launched a large-scale Classical Chinese language model, an AI-driven initiative that sets a new benchmark in the field, Liu said.

Kang Zhen, vice-president of BNU, said the university has developed a range of digital language databases, including a comprehensive holographic Chinese character database, a digital resource of the ancient Chinese dictionary Shuowen Jiezi, and repositories for ancient inscriptions and handwritten texts.

These resources have played a crucial role in linguistic research and cultural preservation, Kang added.

The university's AI Taiyan, a Classical Chinese large language model with 1.8 billion parameters, has been designed for high-accuracy interpretation of ancient texts, supporting tasks such as word and phrase explanations, as well as classical-to-modern Chinese translation.

China is also spearheading the construction of a new national corpus to strengthen linguistic infrastructure in the AI era, said Wang Hui, deputy head of the Ministry of Education's Department of Language Application and Administration.

"Currently, most linguistic datasets remain limited to single-text formats and specific academic domains, lacking the scale and diversity required for AI applications," Wang said.

The department has begun planning for the corpus this year, seeking to launch two flagship databases, the Chinese civilization corpus for AI-assisted teaching and research, and the Chinese grand reading system corpus, Wang said.

oracle bone script: 甲骨文
national corpus: 国家语料库
the National Language Commission: 国家语言文字工作委员会
strategic language resources information database: 战略语言资源信息库
cultural heritage: 文化遗产
ancient text digitalization: 古籍数字化
benchmark (n.): 标杆
spearhead (v.): 带头;先锋