Article abstract
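Yao Zhenjun, Huang Degen, Ji Xiangyu. Application of regular expressions to extraction of Chinese cultural terms with their English translations [J]., 2010, (2): 291-295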
Application of regular expressions to extraction of Chinese cultural terms with their English translations
  
DOI:10.7511/dllgxb201002024
Keywords: regular expression  meta-character  generating engine  Chinese cultural terms
Funding:
Authors and affiliations:
Yao Zhenjun, Huang Degen, Ji Xiangyu
Abstract:
      Regular-expression string matching is used to extract Chinese cultural terms and their English translations from a specialized database. Under the general rule set there are 11 special characters, so a single character in a regular expression may need up to 11 checks before it can be decided whether it matches the corresponding character in the text being searched. To speed up extraction, the regular expressions are tailored to the actual form of the text being searched and use only 3 meta-characters. At the same accuracy, extraction speed roughly doubles. In addition, a regular expression generator is designed to address the poor readability of regular expressions and the difficulty users have in applying them.
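The idea of a deliberately small pattern assembled from readable pieces can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which 3 meta-characters were chosen or how the database entries are laid out. Here each entry is assumed to be one line holding a Chinese term, a space, and an English gloss. The pattern is limited to character classes, the `+` quantifier and capturing groups. The names `CJK`, `LATIN`, `build_pair_pattern` and `extract_pairs` are hypothetical.

```python
import re

# Hypothetical building blocks; the entry format and these names are assumptions.
CJK = "[\u4e00-\u9fff]"    # one Chinese character (CJK Unified Ideographs block)
LATIN = "[A-Za-z' -]"      # one character allowed inside an English gloss

def build_pair_pattern(left=CJK, right=LATIN, sep=" "):
    """Assemble the pattern from named pieces instead of writing it by hand."""
    return re.compile("(" + left + "+)" + sep + "(" + right + "+)")

def extract_pairs(text, pattern=None):
    """Return (Chinese term, English gloss) tuples for lines that match fully."""
    pattern = pattern or build_pair_pattern()
    pairs = []
    for line in text.splitlines():
        match = pattern.fullmatch(line)
        if match:
            pairs.append((match.group(1), match.group(2).strip()))
    return pairs

# Made-up sample entries, for illustration only.
SAMPLE = """京剧 Peking Opera
春节 Spring Festival
not a term line
太极拳 Tai Chi
"""

if __name__ == "__main__":
    print(extract_pairs(SAMPLE))
```

On the sample above, `extract_pairs(SAMPLE)` returns the three Chinese-English pairs and skips the line that contains no Chinese term. Building the pattern from named pieces is one simple way to approach the readability concern the abstract raises, since a user edits `CJK` or `LATIN` rather than the full expression.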