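Article abstract
SUN Xiao, HUANG Degen. Chinese integrative lexical analysis based on maximum matching and second-maximum matching segmentation [J]. , 2010, (6): 1028-1034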
Chinese integrative lexical analysis based on maximum matching and second-maximum matching segmentation
  
DOI:10.7511/dllgxb201006034
Keywords: Chinese lexical analysis; integrative model; maximum matching and second-maximum matching; unknown word; segmentation directed graph
Funding: Supported by the Fundamental Research Funds for the Central Universities (DUT10RW202).
Authors and affiliations
SUN Xiao, HUANG Degen
Abstract views: 1327
Full-text downloads: 1371
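Abstract (Chinese):
      To address the shortcomings of the "pipeline" processing used by most current lexical analysis systems, an integrated, synchronous lexical analysis mechanism is proposed. Building on maximum matching and second-maximum matching segmentation, part-of-speech information and candidate unknown-word nodes are added to the segmentation directed graph, and the hidden Markov model is extended, so that word segmentation, ambiguity resolution, unknown word recognition and part-of-speech tagging are completed synchronously within the graph. This achieves the integration of segmentation with part-of-speech tagging, of unknown word recognition with segmentation, and of the handling of unknown words whose part of speech is uncertain. The integrated mechanism lets all steps of lexical analysis be completed truly synchronously, makes full use of contextual lexical information to improve overall accuracy while keeping the system efficient, and avoids conflicts between the steps. Open tests show that the system's overall F-score is 98.03%.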
Abstract (English):
      An integrative lexical analysis mechanism is proposed to address the limitations of most existing lexical analysis systems, which rely on a "pipeline" mechanism. Based on the maximum matching and second-maximum matching (MMSM) model, candidate words, part-of-speech (POS) tags and all candidate unknown words are added to the directed graph built by the MMSM model, and the hidden Markov model (HMM) is extended, so that Chinese word segmentation, ambiguity resolution, unknown word recognition and POS tagging are solved synchronously. Three integrations are realized: word segmentation with POS tagging, unknown word recognition with known-word segmentation, and the handling of unknown words with uncertain POS. Because all lexical analysis tasks are accomplished synchronously, conflicts between the tasks are avoided and high precision can be gained. The open test indicates that the F-score of the system is 98.03%.
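The idea of decoding segmentation and POS tags together over one directed graph can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the details of their extended HMM, so the code below uses a plain tag-bigram HMM with Viterbi search, and the lexicon `LEXICON`, the probabilities in `TRANS`, the fallback tag `x` and the function names `candidates` and `analyze` are all hypothetical values chosen for illustration. Unknown-word candidates are reduced here to a single-character fallback.

```python
import math

# Hypothetical toy lexicon: word -> {tag: P(word | tag)}. Not from the paper.
LEXICON = {
    "研究": {"v": 0.02, "n": 0.01},
    "研究生": {"n": 0.005},
    "生命": {"n": 0.01},
    "命": {"n": 0.001},
    "的": {"u": 0.5},
    "起源": {"n": 0.005},
}
# Hypothetical tag-bigram probabilities P(tag | previous tag); "^" is sentence start.
TRANS = {
    ("^", "v"): 0.3, ("^", "n"): 0.4, ("v", "n"): 0.5,
    ("n", "n"): 0.2, ("n", "u"): 0.3, ("u", "n"): 0.6,
}
DEFAULT_TRANS = 1e-4   # assumed smoothing value for unseen tag bigrams
UNKNOWN_EMIT = 1e-6    # assumed emission probability for out-of-lexicon characters


def candidates(sentence, i, max_len=4):
    """Longest and second-longest lexicon words starting at offset i."""
    found = [sentence[i:j]
             for j in range(min(len(sentence), i + max_len), i, -1)
             if sentence[i:j] in LEXICON]
    return found[:2] or [sentence[i]]  # fall back to one character


def analyze(sentence):
    """Jointly pick a segmentation and POS tags with Viterbi over the word graph."""
    n = len(sentence)
    # best[i][tag] = (log-probability, backpointer) of the best path that
    # ends at character offset i with `tag` on its last word.
    best = [dict() for _ in range(n + 1)]
    best[0]["^"] = (0.0, None)
    for i in range(n):
        if not best[i]:          # offset not reachable by any candidate word
            continue
        for word in candidates(sentence, i):
            j = i + len(word)
            for tag, emit in LEXICON.get(word, {"x": UNKNOWN_EMIT}).items():
                for prev_tag, (score, _) in best[i].items():
                    s = (score
                         + math.log(TRANS.get((prev_tag, tag), DEFAULT_TRANS))
                         + math.log(emit))
                    if tag not in best[j] or s > best[j][tag][0]:
                        best[j][tag] = (s, (i, prev_tag, word))
    # Follow backpointers from the best final state.
    tag = max(best[n], key=lambda t: best[n][t][0])
    pos, out = n, []
    while pos > 0:
        _, (i, prev_tag, word) = best[pos][tag]
        out.append((word, tag))
        pos, tag = i, prev_tag
    return out[::-1]


print(analyze("研究生命的起源"))
```

With these toy numbers, greedy longest matching alone would start with 研究生, whereas the joint search keeps the second-longest candidate 研究 in the graph and lets the tag sequence decide between the two paths.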