Named Entity Recognition (NER) Exploration (1)

Named-entity recognition (NER), also known as entity identification, entity chunking, or entity extraction, is a subtask of information extraction that aims to locate the named entities mentioned in unstructured text and classify them into predefined categories such as person names, organizations, place names, medical terms, time expressions, quantities, monetary values, and percentages.
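
As a minimal, hypothetical illustration of the character-level BIO scheme used later in this post (the sentence and labels below are made up for demonstration): the first character of an entity is tagged B-&lt;category&gt;, the remaining characters I-&lt;category&gt;, and everything else O.

sentence = "患者确诊2型糖尿病"            # "the patient is diagnosed with type 2 diabetes"
labels   = ["O", "O", "O", "O",           # 患 者 确 诊
            "B-Disease", "I-Disease",     # 2 型
            "I-Disease", "I-Disease",     # 糖 尿
            "I-Disease"]                  # 病
assert len(sentence) == len(labels)       # one label per character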

Deploying a TensorFlow 1.x virtual environment

Create the virtual environment

E:\>python -m venv 2020_vms_tensorflow_1

Activate the virtual environment

E:\>cd E:\2020_vms_tensorflow_1\Scripts

E:\2020_vms_tensorflow_1\Scripts>activate.bat
(2020_vms_tensorflow_1) E:\2020_vms_tensorflow_1\Scripts>

Install TensorFlow 1.x
tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
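
The wheel is tagged cp36/win_amd64, so it can only be installed into a virtual environment created from a 64-bit CPython 3.6 interpreter. A quick sanity check inside the activated environment (the interpreter used throughout this post reports 3.6.4, as also seen in the Python session at the end of this section):

(2020_vms_tensorflow_1) E:\2020_vms_tensorflow_1\Scripts>python --version
Python 3.6.4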

(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>pip install tensorflow-1.15.0-cp36-cp36m-win_amd64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
Processing d:\2020_vir_tensorflow1\install_whl\tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
Collecting wheel>=0.26 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a7/00/3df031b3ecd5444d572141321537080b40c1c25e1caa3d86cdd12e5e919c/wheel-0.35.1-py2.py3-none-any.whl
Collecting tensorflow-estimator==1.15.1 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/de/62/2ee9cd74c9fa2fa450877847ba560b260f5d0fb70ee0595203082dafcc9d/tensorflow_estimator-1.15.1-py2.py3-none-any.whl
Collecting keras-applications>=1.0.8 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
    100% |████████████████████████████████| 51kB 276kB/s
Collecting absl-py>=0.7.0 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b9/07/f69dd3367368ad69f174bfe426a973651412ec11d48ec05c000f19fe0561/absl_py-0.10.0-py3-none-any.whl (127kB)
    100% |████████████████████████████████| 133kB 488kB/s
Collecting google-pasta>=0.1.6 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a3/de/c648ef6835192e6e2cc03f40b19eeda4382c49b5bafb43d88b931c4c74ac/google_pasta-0.2.0-py3-none-any.whl
Collecting keras-preprocessing>=1.0.5 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/79/4c/7c3275a01e12ef9368a892926ab932b33bb13d55794881e3573482b378a7/Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42kB)
    100% |████████████████████████████████| 51kB 2.1MB/s
Collecting grpcio>=1.8.6 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/15/3f/f311f382bb658387fe78a30e1ed55193fe94c5e78b37abd134c34bd256eb/grpcio-1.31.0-cp36-cp36m-win_amd64.whl
Collecting gast==0.2.2 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/4e/35/11749bf99b2d4e3cceb4d55ca22590b0d7c2c62b9de38ac4a4a7f4687421/gast-0.2.2.tar.gz
Collecting protobuf>=3.6.1 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/f6/fe/9d8e70a86add02cb1ef35540ec03fd5b210d76323fe4645d7121b13ae33e/protobuf-3.13.0-cp36-cp36m-win_amd64.whl (1.1MB)
    100% |████████████████████████████████| 1.1MB 99kB/s
Collecting astor>=0.6.0 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/c3/88/97eef84f48fa04fbd6750e62dcceafba6c63c81b7ac1420856c8dcc0a3f9/astor-0.8.1-py2.py3-none-any.whl
Collecting numpy<2.0,>=1.16.0 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/05/1d/d7b100264346a8722325987f10061b66d3c560bfb292f2c0254736e7531e/numpy-1.19.1-cp36-cp36m-win_amd64.whl (12.9MB)
    100% |████████████████████████████████| 12.9MB 42kB/s
Collecting termcolor>=1.1.0 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Collecting opt-einsum>=2.3.2 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/bc/19/404708a7e54ad2798907210462fd950c3442ea51acc8790f3da48d2bee8b/opt_einsum-3.3.0-py3-none-any.whl (65kB)
    100% |████████████████████████████████| 71kB 157kB/s
Collecting six>=1.10.0 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ee/ff/48bde5c0f013094d729fe4b0316ba2a24774b3ff1c52d924a8a4cb04078a/six-1.15.0-py2.py3-none-any.whl
Collecting wrapt>=1.11.1 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/82/f7/e43cefbe88c5fd371f4cf0cf5eb3feccd07515af9fd6cf7dbf1d1793a797/wrapt-1.12.1.tar.gz
Collecting tensorboard<1.16.0,>=1.15.0 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl (3.8MB)
    100% |████████████████████████████████| 3.8MB 90kB/s
Collecting h5py (from keras-applications>=1.0.8->tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/0b/fa/bee65d2dbdbd3611702aafd128139c53c90a1285f169ba5467aab252e27a/h5py-2.10.0-cp36-cp36m-win_amd64.whl (2.4MB)
    100% |████████████████████████████████| 2.4MB 89kB/s
Requirement already satisfied: setuptools in e:\2020_vms_tensorflow_1\lib\site-packages (from protobuf>=3.6.1->tensorflow==1.15.0)
Collecting markdown>=2.6.8 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a4/63/eaec2bd025ab48c754b55e8819af0f6a69e2b1e187611dd40cbbe101ee7f/Markdown-3.2.2-py3-none-any.whl (88kB)
    100% |████████████████████████████████| 92kB 138kB/s
Collecting werkzeug>=0.11.15 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/cc/94/5f7079a0e00bd6863ef8f1da638721e9da21e5bacee597595b318f71d62e/Werkzeug-1.0.1-py2.py3-none-any.whl (298kB)
    100% |████████████████████████████████| 307kB 109kB/s
Collecting importlib-metadata; python_version < "3.8" (from markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Collecting zipp>=0.5 (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/b2/34/bfcb43cc0ba81f527bc4f40ef41ba2ff4080e047acb0586b56b3d017ace4/zipp-3.1.0-py3-none-any.whl

The installation then fails while building the wrapt wheel:

Building wheels for collected packages: wrapt
  Running setup.py bdist_wheel for wrapt ... error
  Failed building wheel for wrapt
  Running setup.py clean for wrapt
Failed to build wrapt
Installing collected packages: wrapt, werkzeug, zipp, importlib-metadata, markdown, tensorboard, tensorflow
  Running setup.py install for wrapt ... error
Exception:
Traceback (most recent call last):
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\compat\__init__.py", line 73, in console_to_str
    return s.decode(sys.__stdout__.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 44: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\basecommand.py", line 215, in main
    status = self.run(options, args)
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\commands\install.py", line 342, in run
    prefix=options.prefix_path,
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\req\req_set.py", line 784, in install
    **kwargs
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\req\req_install.py", line 878, in install
    spinner=spinner,
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\utils\__init__.py", line 676, in call_subprocess
    line = console_to_str(proc.stdout.readline())
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\compat\__init__.py", line 75, in console_to_str
    return s.decode('utf_8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 44: invalid start byte
You are using pip version 9.0.1, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

The fix is to modify line 73 of e:\2020_vms_tensorflow_1\lib\site-packages\pip\compat\__init__.py (the console_to_str function shown in the traceback), which originally reads:

if sys.version_info >= (3,):
    def console_to_str(s):
        try:
            return s.decode(sys.__stdout__.encoding)
        except UnicodeDecodeError:
            return s.decode('utf_8')

Change it to:

if sys.version_info >= (3,):
    def console_to_str(s):
        try:
            #return s.decode(sys.__stdout__.encoding)
            return s.decode('cp936')
        except UnicodeDecodeError:
            return s.decode('utf_8')
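
Editing pip's own source is only a workaround: the traceback comes from pip 9.0.1 failing to decode GBK (cp936) console output produced while building wrapt. An alternative, untested here but suggested by pip's own message above, is to first upgrade pip inside the virtual environment and then rerun the wheel installation:

(2020_vms_tensorflow_1) E:\2020_vms_tensorflow_1\Scripts>python -m pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple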
 

TensorFlow 1.x now installs successfully:

(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>pip install tensorflow-1.15.0-cp36-cp36m-win_amd64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
Processing d:\2020_vir_tensorflow1\install_whl\tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
Requirement already satisfied: google-pasta>=0.1.6 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting tensorboard<1.16.0,>=1.15.0 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl
Requirement already satisfied: protobuf>=3.6.1 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: wheel>=0.26 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: opt-einsum>=2.3.2 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: six>=1.10.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: astor>=0.6.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: keras-applications>=1.0.8 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting wrapt>=1.11.1 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/82/f7/e43cefbe88c5fd371f4cf0cf5eb3feccd07515af9fd6cf7dbf1d1793a797/wrapt-1.12.1.tar.gz
Requirement already satisfied: grpcio>=1.8.6 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: numpy<2.0,>=1.16.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: absl-py>=0.7.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: keras-preprocessing>=1.0.5 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: gast==0.2.2 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: tensorflow-estimator==1.15.1 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: termcolor>=1.1.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting markdown>=2.6.8 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a4/63/eaec2bd025ab48c754b55e8819af0f6a69e2b1e187611dd40cbbe101ee7f/Markdown-3.2.2-py3-none-any.whl
Collecting werkzeug>=0.11.15 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/cc/94/5f7079a0e00bd6863ef8f1da638721e9da21e5bacee597595b318f71d62e/Werkzeug-1.0.1-py2.py3-none-any.whl
Collecting setuptools>=41.0.0 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b0/8b/379494d7dbd3854aa7b85b216cb0af54edcb7fce7d086ba3e35522a713cf/setuptools-50.0.0-py3-none-any.whl (783kB)
    100% |████████████████████████████████| 788kB 121kB/s
Requirement already satisfied: h5py in e:\2020_vms_tensorflow_1\lib\site-packages (from keras-applications>=1.0.8->tensorflow==1.15.0)
Collecting importlib-metadata; python_version < "3.8" (from markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8e/58/cdea07eb51fc2b906db0968a94700866fc46249bdc75cac23f9d13168929/importlib_metadata-1.7.0-py2.py3-none-any.whl
Collecting zipp>=0.5 (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/b2/34/bfcb43cc0ba81f527bc4f40ef41ba2ff4080e047acb0586b56b3d017ace4/zipp-3.1.0-py3-none-any.whl
Building wheels for collected packages: wrapt
  Running setup.py bdist_wheel for wrapt ... done
  Stored in directory: C:\Users\lenovo\AppData\Local\pip\Cache\wheels\68\e3\d7\4b6eee6f5d547bdfd97ba406128db66c5654dfb831fda163a2
Successfully built wrapt
Installing collected packages: zipp, importlib-metadata, markdown, werkzeug, setuptools, tensorboard, wrapt, tensorflow
  Found existing installation: setuptools 28.8.0
    Uninstalling setuptools-28.8.0:
      Successfully uninstalled setuptools-28.8.0
Successfully installed importlib-metadata-1.7.0 markdown-3.2.2 setuptools-50.0.0 tensorboard-1.15.0 tensorflow-1.15.0 werkzeug-1.0.1 wrapt-1.12.1 zipp-3.1.0
You are using pip version 9.0.1, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>
(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
...
>>>
>>> print(tf.__version__)
1.15.0
>>>

Data collection and cleaning

This post uses an electronic medical record (EMR) analysis case from the healthcare domain; the data and code come from publicly available internet resources. NLP research on EMR text focuses on processing the clinical narrative itself, including sentence-boundary detection, part-of-speech tagging, and syntactic parsing. Information extraction builds on this NLP groundwork and focuses on recognizing the named entities and medical concepts that express clinical knowledge in the records, and on extracting the relations between them.

  • The manually annotated entity file 0.ann: the first column is the annotation ID; the second is the entity category; the third and fourth are the start and end character offsets of the entity in the corresponding 0.txt; the fifth is the entity text itself. This file is produced by human annotators; a short parsing sketch follows the sample below.
......
T1	Disease 1845 1850	1型糖尿病
T2	Disease 1983 1988	1型糖尿病
T4	Disease 30 35	2型糖尿病
T5	Disease 1822 1827	2型糖尿病
T6	Disease 2055 2060	2型糖尿病
T7	Disease 2324 2329	2型糖尿病
T8	Disease 4325 4330	2型糖尿病
T9	Disease 5223 5228	2型糖尿病 
.......
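
A minimal sketch of how one of these lines can be parsed in Python (the field layout assumed here is the tab- and space-separated layout visible in the sample above):

# Hypothetical helper: parse one line of the .ann file.
def parse_ann_line(line):
    ann_id, info, mention = line.rstrip('\n').split('\t')  # e.g. 'T4', 'Disease 30 35', '2型糖尿病'
    category, start, end = info.split(' ')
    return ann_id, category, int(start), int(end), mention

print(parse_ann_line('T4\tDisease 30 35\t2型糖尿病'))
# ('T4', 'Disease', 30, 35, '2型糖尿病')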

A physician's diagnostic workflow for a patient can be summarized as follows: based on the patient's self-reported complaints (symptoms) and examination results (test items), the physician identifies manifestations of disease (symptoms), reaches a diagnostic conclusion (disease), and, based on that conclusion, prescribes treatment measures (treatment plan). The information involved therefore covers symptoms, diseases, tests, and treatments.

  • The raw text file 0.txt that the offsets in 0.ann refer to:
......
1.一般将HBA1C  。控制于<6.5%,若降糖治疗无低
血糖或体重增加等不良反应者可使HBA1C  。<6%[10-15]。
目前,新诊断糖尿病患者的人数逐年增加,越来越年轻
化,尤其是3039岁比例增加引起了临床医生的关
注。这些患者绝大多数除糖尿病外并无并发症和其他
疾病,考虑对患者预期寿命和生活质量的影响,应该严
格控制血糖,目标值HBA1C  。<6.5%,我们也同意并推荐
IDF的建议,对于年轻、病程较短、治疗后无低血糖或
体重增加等不良反应发生的患者,血糖控制目标值应
该尽量使HBA1C  c<6%¨⋯。
研究表明,随着HBA1C  。水平的升高,发生CVD的
危险性相对增加,而且糖尿病病史越长的患者治疗后
CVD危险性减低越小¨⋯;流行病学调查结果的meta
分析显示,HBA1C  。每增加1%,2型糖尿病心血管事件
的相对危险性为1.18,1型糖尿病为1.15171;
ACCORD亚组分析也提示,既往无CVD且HBA1C  。<8%
的患者强化治疗后CVD的发生及其死亡率明显下降,
这与总体患者(平均病程10,35%发生过CVD)的
研究结果恰恰相反MJ。另一方面,DCCT.EDIC经过后
续8年的随访证实1型糖尿病患者早期强化治疗对微
血管有持久的保护作用,并延缓动脉粥样硬化的发
展.4o;UKPDS-80经过后续10年的长期随访发现对新
....

Automatic annotation: converting the text into a deep-learning-ready format

The prepare_data.py script loads the 0.ann and 0.txt files and converts the raw data into the format below, which is then fed to the deep-learning model. Each row describes one character with the following columns: the character itself, the entity label, the word-boundary feature, the part-of-speech feature, the radical feature, and the pinyin feature. The script also builds the data dictionary dict.pkl. A sketch of how a single row can be derived follows the sample.

word,label,bound,flag,radical,pinyin
随,O,B,p,,suí
着,O,E,p,,zhuó
生,O,B,vn,,shēng
活,O,E,vn,,huó
方,O,B,n,,fāng
式,O,E,n,,shì
的,O,S,uj,,dí
改,O,B,v,,gǎi
变,O,E,v,,biàn
及,O,S,c,,jí
人,O,B,n,,rén
口,O,M,n,,kǒu
老,O,M,n,,lǎo
龄,O,M,n,齿,líng
化,O,E,n,,huà
的,O,S,uj,,dí
加,O,B,v,,jiā
速,O,E,v,,sù
",",O,S,x,UNK,UNK
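
A minimal sketch of how a single row of this table can be produced with the same libraries used in the script below (jieba.posseg for the part-of-speech tag, cnradical for the radical and pinyin; the label column comes from the .ann annotations, as shown in process_text). Exact outputs may vary slightly across library versions:

import jieba.posseg as psg
from cnradical import Radical, RunOption

radical = Radical(RunOption.Radical)   # radical extractor
pinyin = Radical(RunOption.Pinyin)     # pinyin extractor

for word, flag in psg.cut("糖尿病"):   # segmentation + POS tagging
    for i, ch in enumerate(word):
        # word-boundary feature: S = single, B = begin, M = middle, E = end
        if len(word) == 1:
            bound = 'S'
        elif i == 0:
            bound = 'B'
        elif i == len(word) - 1:
            bound = 'E'
        else:
            bound = 'M'
        print(ch, bound, flag,
              radical.trans_ch(ch) or 'UNK',
              pinyin.trans_ch(ch) or 'UNK')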

The full prepare_data.py code is as follows:

import os
import pandas as pd
import pickle
from collections import Counter
from data_process import split_text
from tqdm import tqdm
import jieba.posseg as psg
from cnradical import Radical,RunOption
import shutil
from random import shuffle
train_dir='ruijin_round1_train2_20181022'
def process_text(idx,split_method=None,split_name='train'):
    """
    Read the text file, split it into sentences, attach entity labels, and
    extract word-boundary, POS, radical and pinyin features.
    :param idx: file name, without extension
    :param split_method: function used to split the text into sentences
    :param split_name: name of the folder the result is saved to ('train' or 'test')
    :return:
    """
    data={}

    #------------------------------ get sentences -----------------------------------
    if split_method is None:
        with open(f'datas/{train_dir}/{idx}.txt','r',encoding='utf-8') as f:
            texts=f.readlines()
    else:
        with open(f'datas/{train_dir}/{idx}.txt', 'r', encoding='utf-8') as f:
            texts=f.read()
            texts=split_method(texts)
    # data['word']=texts

    #--------------------------------- get labels ----------------------------------
    tag_list=['O' for s in texts for x in s]
    tag =pd.read_csv(f'datas/{train_dir}/{idx}.ann',header=None,sep='\t')
    for i in range(tag.shape[0]):
        tag_item=tag.iloc[i][1].split(' ')#entity category plus start/end offsets
        cls,start,end=tag_item[0],int(tag_item[1]),int(tag_item[-1])#convert to the proper types
        tag_list[start]='B-'+cls#the start position gets B-<category>
        for j in range(start+1,end):#the remaining positions get I-<category>
            tag_list[j]='I-'+cls
    assert len([x for s in texts for x in s])==len(tag_list)#both sequences must have the same length

    text_list = ''
    for t in texts:
        text_list+=t
    textes = []
    tags = []
    start = 0
    end = 0
    max=len(tag_list)
    for s in texts:
        l = len(s)
        end += l
        if  end>=max or tag_list[end][0] != 'I':
            textes.append(text_list[start:end])
            tags.append(tag_list[start:end])
            start=end
    data['word']=textes
    data['label']=tags
    assert len([x for s in textes for x in s]) == len(tag_list)



    #----------------------------- extract POS and word-boundary features ----------------------------------
    word_bounds=['M' for item in tag_list]#initialise every character with the 'M' (middle) boundary tag
    word_flags=[]#POS feature of every character
    for text in textes:
        for word,flag in psg.cut(text):
            if len(word)==1:#single-character word
                start=len(word_flags)#index of this character
                word_bounds[start]='S'#mark it as S (single)
                word_flags.append(flag)#append its POS tag
            else:
                start=len(word_flags)#start index of the word
                word_bounds[start]='B'#the first character gets B
                word_flags+=[flag]*len(word)#every character of the word shares the POS tag
                end=len(word_flags)-1#index of the last character of the word
                word_bounds[end]='E'#the last character gets E


    #-------------------------------------- cut the features at the same sentence boundaries ---------------------------------------
    bounds = []
    flags=[]
    start = 0
    end = 0
    for s in textes:
        l = len(s)
        end += l
        bounds.append(word_bounds[start:end])
        flags.append(word_flags[start:end])
        start += l
    data['bound'] = bounds
    data['flag']=flags


    #---------------------------------------- extract radical and pinyin features -------------------------------------
    radical=Radical(RunOption.Radical)#radical extractor
    pinyin = Radical(RunOption.Pinyin)#pinyin extractor
    #radical feature; characters without a radical are marked UNK
    data['radical']=[[radical.trans_ch(x) if radical.trans_ch(x) is not None else 'UNK' for x in s] for s in textes]
    # pinyin feature; characters without a pinyin are marked UNK
    data['pinyin'] = [[pinyin.trans_ch(x) if pinyin.trans_ch(x) is not None else 'UNK' for x in s] for s in textes]

    #------------------------------------------ save the data ------------------------------------------------
    num_samples=len(textes)#number of sentences, i.e. number of samples
    num_col=len(data.keys())#number of feature columns

    dataset=[]
    for i in range(num_samples):
        records=list(zip(*[list(v[i]) for v in data.values()]))#transpose: one record per character
        dataset+=records+[['sep']*num_col]#a 'sep' row separates consecutive sentences
    dataset=dataset[:-1]#drop the trailing 'sep' row
    dataset=pd.DataFrame(dataset,columns=data.keys())#convert to a DataFrame
    save_path=f'data/prepare/{split_name}/{idx}.csv'

    def clean_word(w):
        if w=='\n':
            return 'LB'
        if w in [' ','\t','\u2003']:
            return 'SPACE'
        if w.isdigit():#map every digit to a single token
            return 'num'
        return w
    dataset['word']=dataset['word'].apply(clean_word)
    dataset.to_csv(save_path,index=False,encoding='utf-8')


def multi_process(split_method=None,train_ratio=0.8):
    if os.path.exists('data/prepare/'):
        shutil.rmtree('data/prepare/')
    if not os.path.exists('data/prepare/train/'):
        os.makedirs('data/prepare/train')
        os.makedirs('data/prepare/test')
    idxs=list(set([ file.split('.')[0] for file in os.listdir('datas/'+train_dir)]))#all file names, without extension
    shuffle(idxs)#shuffle the order
    index=int(len(idxs)*train_ratio)#cut-off index of the training set
    train_ids=idxs[:index]#training-set file names
    test_ids=idxs[index:]#test-set file names

    import multiprocessing as mp
    num_cpus=mp.cpu_count()#number of CPU cores
    pool=mp.Pool(num_cpus)
    results=[]
    for idx in train_ids:
        result=pool.apply_async(process_text,args=(idx,split_method,'train'))
        results.append(result)
    for idx in test_ids:
        result=pool.apply_async(process_text,args=(idx,split_method,'test'))
        results.append(result)
    pool.close()
    pool.join()
    [r.get() for r in results]

def mapping(data,threshold=10,is_word=False,sep='sep',is_label=False):
    count=Counter(data)
    if sep is not None:
        count.pop(sep)
    if is_word:
        count['PAD']=100000001
        count['UNK']=100000000
        data = sorted(count.items(), key=lambda x: x[1], reverse=True)
        data=[ x[0]  for x in data if x[1]>=threshold]#drop items whose frequency is below threshold (out-of-vocabulary words)
        id2item=data
        item2id={id2item[i]:i for i in range(len(id2item))}
    elif is_label:
        data = sorted(count.items(), key=lambda x: x[1], reverse=True)
        data = [x[0] for x in data]
        id2item = data
        item2id = {id2item[i]: i for i in range(len(id2item))}
    else:
        count['PAD'] = 100000001
        data = sorted(count.items(), key=lambda x: x[1], reverse=True)
        data = [x[0] for x in data]
        id2item = data
        item2id = {id2item[i]: i for i in range(len(id2item))}
    return id2item,item2id



def get_dict():
    map_dict={}
    from glob import glob
    all_w,all_bound,all_flag,all_label,all_radical,all_pinyin=[],[],[],[],[],[]
    for file in glob('data/prepare/train/*.csv')+glob('data/prepare/test/*.csv'):
        df=pd.read_csv(file,sep=',')
        all_w+=df['word'].tolist()
        all_bound += df['bound'].tolist()
        all_flag += df['flag'].tolist()
        all_label += df['label'].tolist()
        all_radical += df['radical'].tolist()
        all_pinyin += df['pinyin'].tolist()
    map_dict['word']=mapping(all_w,threshold=20,is_word=True)
    map_dict['bound']=mapping(all_bound)
    map_dict['flag']=mapping(all_flag)
    map_dict['label']=mapping(all_label,is_label=True)
    map_dict['radical']=mapping(all_radical)
    map_dict['pinyin']=mapping(all_pinyin)

    with open(f'data/prepare/dict.pkl','wb') as f:
        pickle.dump(map_dict,f)

if __name__ == '__main__':
    # print(process_text('0',split_method=split_text,split_name='train'))
    # multi_process()
    # print(set([ file.split('.')[0] for file in os.listdir('datas/'+train_dir)]))
    multi_process(split_text)
    get_dict()
    # with open(f'data/prepare/dict.pkl','rb') as f:
    #     data=pickle.load(f)
    # print(data['bound'])
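
Once get_dict() has run, dict.pkl stores an (id2item, item2id) pair for every feature column. A minimal sketch (assuming the paths used above) of loading it and mapping a few characters to ids:

import pickle

with open('data/prepare/dict.pkl', 'rb') as f:
    map_dict = pickle.load(f)

id2word, word2id = map_dict['word']      # each entry is an (id2item, item2id) pair
id2label, label2id = map_dict['label']

sample = ['糖', '尿', '病']
ids = [word2id.get(ch, word2id['UNK']) for ch in sample]  # unseen characters fall back to UNK
print(ids)
print(len(id2label), 'label types')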

data_process.py

import os
import re
train_dir='datas/train'
def get_entities(dir):
    """
    Return a dictionary counting each entity category.
    :param dir: data directory
    :return:
    """
    entities={}#counts per entity category
    files=os.listdir(dir)
    files=list(set([file.split('.')[0] for file in files]))
    for file in files:
        path=os.path.join(dir,file+'.ann')
        with open(path,'r',encoding='utf8') as f:
            for line in f.readlines():
                name=line.split('\t')[1].split(' ')[0]
                if name in entities:
                    entities[name]+=1
                else:
                    entities[name]=1
    return entities

def get_labelencoder(entities):
    """
    Build the mapping between labels and indices.
    :param entities:
    :return:
    """
    entities = sorted(entities.items(), key=lambda x: x[1], reverse=True)
    entities = [x[0] for x in entities]
    id2label=[]
    id2label.append('O')
    for entity in entities:
        id2label.append('B-'+entity)
        id2label.append('I-'+entity)
    label2id={id2label[i]:i for i in range(len(id2label))}
    return id2label,label2id



def ischinese(char):
    if '\u4e00'<= char <='\u9fff':
        return True
    return False



def split_text(text):
    split_index=[]

    pattern1 = '。|,|,|;|;|\.|\?'

    for m in re.finditer(pattern1,text):
        idx=m.span()[0]
        if text[idx-1]=='\n':
            continue
        if text[idx-1].isdigit() and text[idx+1].isdigit():#digit before and after
            continue
        if text[idx-1].isdigit() and text[idx+1].isspace() and text[idx+2].isdigit():#digit, then space, then digit
            continue
        if text[idx-1].islower() and text[idx+1].islower():#lowercase letter before and after
            continue
        if text[idx-1].islower() and text[idx+1].isdigit():#lowercase letter before, digit after
            continue
        if text[idx-1].isupper() and text[idx+1].isdigit():#uppercase letter before, digit after
            continue
        if text[idx - 1].isdigit() and text[idx + 1].islower():#digit before, lowercase letter after
            continue
        if text[idx - 1].isdigit() and text[idx + 1].isupper():#digit before, uppercase letter after
            continue
        if text[idx+1] in set('.。;;,,'):#another punctuation mark follows
            continue
        if text[idx-1].isspace() and text[idx-2].isspace() and text[idx-3]=='C':#the 'HBA1C  。' pattern in the corpus
            continue
        if text[idx-1].isspace() and text[idx-2]=='C':
            continue
        if text[idx-1].isupper() and text[idx+1].isupper() :#uppercase letter before and after
            continue
        if text[idx]=='.' and text[idx+1:idx+4]=='com':#domain name
            continue
        split_index.append(idx+1)
    pattern2='\([一二三四五六七八九零十]\)|[一二三四五六七八九零十]、|'
    pattern2+='注:|附录 |表 \d|Tab \d+|\[摘要\]|\[提要\]|表\d[^。,,;]+?\n|图 \d|Fig \d|'
    pattern2+='\[Abstract\]|\[Summary\]|前  言|【摘要】|【关键词】|结    果|讨    论|'
    pattern2+='and |or |with |by |because of |as well as '
    for m in re.finditer(pattern2,text):
        idx=m.span()[0]
        if (text[idx:idx+2] in ['or','by'] or text[idx:idx+3]=='and' or text[idx:idx+4]=='with')\
            and (text[idx-1].islower() or text[idx-1].isupper()):
            continue
        split_index.append(idx)

    pattern3='\n\d\.'#matches list markers like '1.'  '2.'
    for m in re.finditer(pattern3, text):
        idx = m.span()[0]
        if ischinese(text[idx + 3]):
            split_index.append(idx+1)

    for m in re.finditer('\n\(\d\)',text):#matches list markers like '(1)' '(2)'
        idx = m.span()[0]
        split_index.append(idx+1)
    split_index = list(sorted(set([0, len(text)] + split_index)))

    other_index=[]
    for i in range(len(split_index)-1):
        begin=split_index[i]
        end=split_index[i+1]
        if text[begin] in '一二三四五六七八九零十' or \
                (text[begin]=='(' and text[begin+1] in '一二三四五六七八九零十'):#headings such as '一、' or '(一)'
            for j in range(begin,end):
                if text[j]=='\n':
                    other_index.append(j+1)
    split_index+=other_index
    split_index = list(sorted(set([0, len(text)] + split_index)))

    other_index=[]
    for i in range(len(split_index)-1):#split overly long sentences
        b=split_index[i]
        e=split_index[i+1]
        other_index.append(b)
        if e-b>150:
            for j in range(b,e):
                if (j+1-other_index[-1])>15:#keep each piece at least 15 characters long
                    if text[j]=='\n':
                        other_index.append(j+1)
                    if text[j]==' ' and text[j-1].isnumeric() and text[j+1].isnumeric():
                        other_index.append(j+1)
    split_index += other_index
    split_index = list(sorted(set([0, len(text)] + split_index)))

    for i in range(1,len(split_index)-1):# 10   20  drop sentences that consist only of whitespace
        idx=split_index[i]
        while idx>split_index[i-1]-1 and text[idx-1].isspace():
            idx-=1
        split_index[i]=idx
    split_index = list(sorted(set([0, len(text)] + split_index)))


    #handle short sentences
    temp_idx=[]
    i=0
    while i<len(split_index)-1:# e.g. split_index = [0, 10, 20, 30, 45]
        b=split_index[i]
        e=split_index[i+1]

        num_ch=0
        num_en=0
        if e-b<15:
            for ch in text[b:e]:
                if ischinese(ch):
                    num_ch+=1
                elif ch.islower() or ch.isupper():
                    num_en+=1
                if num_ch+0.5*num_en>5:#more than 5 Chinese/English characters: keep it as its own sentence
                    temp_idx.append(b)
                    i+=1
                    break
            if num_ch+0.5*num_en<=5:#5 or fewer Chinese/English characters: merge it with the next sentence
                temp_idx.append(b)
                i+=2
        else:
            temp_idx.append(b)
            i+=1
    split_index=list(sorted(set([0, len(text)] + temp_idx)))
    result=[]
    for i in range(len(split_index)-1):
        result.append(text[split_index[i]:split_index[i+1]])

    #sanity check: no characters lost
    s=''
    for r in result:
        s+=r
    assert  len(s)==len(text)
    return result

    # lens=[split_index[i+1]-split_index[i] for i in range(len(split_index)-1)][:-1]
    # print(max(lens),min(lens))
    # for i in range(len(split_index)-1):
    #     print(i,'||||',text[split_index[i]:split_index[i+1]])
 
if __name__ == '__main__':
    # entities=get_entities(train_dir)
    # label=get_labelencoder(entities)
    # print(label)
    # pattern='。|,|,|;|;|\.'
    # with open('datas/ruijin_round1_train2_20181022/0.txt','r',encoding='utf8') as f:
    #     text=f.read()
    #     for m in re.finditer(pattern,text):
    #         # print(m)
    #         start=m.span()[0]-5
    #         end=m.span()[1]+5
    #         print('****',text[start:end],'*****')
    #         print(text[start+5])
    files=os.listdir(train_dir)
    files=list(set([file.split('.')[0] for file in files]))
    # pattern2 = '\([一二三四五六七八九零十]\)|[一二三四五六七八九零十]、|'
    # pattern2 += '注:|附录 |表 \d|Tab \d+|\[摘要\]|\[提要\]|表\d[^。,,;]+?\n|图 \d|Fig \d|'
    # pattern2 += '\[Abstract\]|\[Summary\]|前  言|【摘要】|【关键词】|结    果|讨    论|'
    # pattern2 += 'and |or |with |by |because of |as well as '
    # pattern2 = '\n\(\d\)'
    # l=[]
    # for file in files:
    #     path = os.path.join(train_dir, file + '.txt')
    #     with open(path, 'r', encoding='utf8') as f:
    #         text=f.read()
    #         l.append(split_text(text)[-1])
    # print(l)
    path = os.path.join(train_dir, files[1] + '.txt')
    with open(path, 'r', encoding='utf8') as f:
        text = f.read()
        print(split_text(text))
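
A minimal usage sketch of split_text on an arbitrary string (not taken from the dataset), showing that the returned chunks always concatenate back to the original text:

from data_process import split_text

text = "目前,新诊断糖尿病患者的人数逐年增加。这些患者绝大多数除糖尿病外并无并发症和其他疾病。\n"
chunks = split_text(text)
print(chunks)                      # a list of sentence-level chunks
assert ''.join(chunks) == text     # splitting never drops or alters characters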


