Named Entity Recognition (NER) Exploration (3): The Bi-LSTM+CRF Model

Articles in this series

Named Entity Recognition (NER) Exploration (1): https://duanzhihua.blog.csdn.net/article/details/108338970
Named Entity Recognition (NER) Exploration (2): https://duanzhihua.blog.csdn.net/article/details/108391645
Viterbi Algorithm in Practice (weather changes, part-of-speech prediction): https://duanzhihua.blog.csdn.net/article/details/104992597


Preface

The first two articles in this NER series covered data cleaning, conversion, and automatic annotation. This article implements the Bi-LSTM+CRF model.

1. Overview of the Bi-LSTM+CRF Model

Hidden Markov Model (HMM)

A hidden Markov model describes a process in which a hidden Markov chain randomly generates an unobservable sequence of states, and each state in turn emits an observation, producing the observed sequence. An HMM is fully determined by its initial state distribution, its state transition probability matrix, and its observation (emission) probability matrix. Named entity recognition is a sequence labeling problem: the observed sequence is the sequence of characters, and the unobserved state sequence is the tag of each character.
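For reference, this is the standard HMM factorization (not specific to the code in this article): with initial distribution π, transition matrix A, and emission matrix B, the joint probability of a tag sequence y₁…y_T and a character sequence x₁…x_T is

P(x_{1:T}, y_{1:T}) = \pi_{y_1}\, B_{y_1}(x_1) \prod_{t=2}^{T} A_{y_{t-1}, y_t}\, B_{y_t}(x_t)

In NER terms, y_t is the hidden tag of the t-th character and x_t is the character itself.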

Conditional Random Field (CRF)

The HMM makes two assumptions: the observations are conditionally independent of each other, and the current state depends only on the previous state. By introducing user-defined feature functions, a conditional random field can express dependencies between observations, as well as dependencies between the current observation and several surrounding states.
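For comparison, a linear-chain CRF scores an entire tag sequence at once. In the notation used later in the code (emission scores from the network plus a transition matrix A), the standard formulation is

\mathrm{score}(x, y) = \sum_{t=1}^{T} s(x, t, y_t) + \sum_{t=2}^{T} A_{y_{t-1}, y_t}, \qquad P(y \mid x) = \frac{\exp\big(\mathrm{score}(x, y)\big)}{\sum_{y'} \exp\big(\mathrm{score}(x, y')\big)}

where s(x, t, y_t) is the emission score of tag y_t at position t (in the Bi-LSTM+CRF model, the BiLSTM output) and A is the learned transition matrix. This is the quantity that tensorflow.contrib.crf.crf_log_likelihood maximizes in the code below.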

Bi-LSTM

LSTM (Long Short-Term Memory) is a kind of RNN (Recurrent Neural Network) that is well suited to modeling sequential text data. BiLSTM is a bidirectional LSTM, combining a forward LSTM with a backward LSTM.

LSTM model schematic.
Processing the text at the character level, w0, w1, ... in the figure denote the characters of a sentence. After passing through the BiLSTM, the model outputs a score for every tag at every character, and the highest-scoring tag at each position is taken as that character's predicted label.
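A minimal sketch of this per-character argmax, with made-up scores and a made-up tag set (independent of the full model code below):

import numpy as np

# hypothetical BiLSTM output: one score per tag for each of 4 characters
tags = ['O', 'B-Person', 'I-Person']
emission_scores = np.array([
    [0.1, 1.2, 0.3],   # character w0
    [0.2, 0.1, 1.5],   # character w1
    [1.1, 0.4, 0.2],   # character w2
    [0.9, 0.3, 0.1],   # character w3
])

# without a CRF, each character simply takes its highest-scoring tag
pred = [tags[i] for i in emission_scores.argmax(axis=-1)]
print(pred)   # ['B-Person', 'I-Person', 'O', 'O']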

A BiLSTM alone can already predict entity tags, so why add a CRF layer?

The Bi-LSTM+CRF Model

Because the BiLSTM only models the relationship between the input sequence and the tags; it does not model the relationships among the tags themselves. Those tag-to-tag relationships are captured by the CRF transition matrix. For example, "B" marks the beginning of an entity, so "B" cannot be immediately followed by another "B" (only one character can start an entity). Likewise, in the figure above, I-Person marks a character inside a person name, so the preceding character cannot carry an organization tag (I-Organization). Without these transition constraints, the LSTM can easily output an invalid tag sequence. We therefore add a CRF layer: the text sequence is first encoded by the BiLSTM, its output is fed into the CRF layer, and the CRF produces a globally consistent prediction for the whole sequence.
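As a toy illustration of how transition scores rule out invalid tag sequences (the tags and numbers here are made up; in the code below the real matrix is the learned variable trans):

import numpy as np

# hypothetical transition scores between 3 tags; rows = previous tag, columns = next tag
tags = ['O', 'B-Person', 'I-Person']
trans = np.array([
    [ 0.5,  0.8, -10.0],   # from O:        O -> I-Person is heavily penalized
    [ 0.2, -10.0,  1.0],   # from B-Person: B -> B is heavily penalized
    [ 0.6,  0.3,   0.7],   # from I-Person
])

def transition_score(path):
    """Sum the transition scores along a tag path (emission scores omitted)."""
    ids = [tags.index(t) for t in path]
    return sum(trans[a, b] for a, b in zip(ids[:-1], ids[1:]))

print(transition_score(['B-Person', 'I-Person', 'O']))   # plausible path, high score: 1.6
print(transition_score(['O', 'I-Person', 'B-Person']))   # invalid path, low score: -9.7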
Bi-LSTM+CRF model schematic.

2. Bi-LSTM+CRF Model Code Implementation

Unlike a plain neural network, this code also implements a CRF layer, built with tensorflow.contrib.crf. The code constructs a transition matrix, and start_logits adds one extra dimension on top of the tag score matrix so that the start state is included.
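The shape bookkeeping in the CRF loss is easier to see in plain numpy; the following sketch (with made-up sizes) mirrors what the loss function below does to the BiLSTM output before calling crf_log_likelihood:

import numpy as np

batch_size, num_steps, num_tags = 2, 5, 7   # made-up sizes
output = np.random.randn(batch_size, num_steps, num_tags)
small = -1000.0

# an extra tag column for the artificial "start" tag: real positions can never take it
pad_logits = small * np.ones((batch_size, num_steps, 1))
logits = np.concatenate([output, pad_logits], axis=-1)            # [batch, num_steps, num_tags+1]

# an extra first time step whose only allowed tag is the "start" tag
start_logits = np.concatenate(
    [small * np.ones((batch_size, 1, num_tags)), np.zeros((batch_size, 1, 1))], axis=-1)
logits = np.concatenate([start_logits, logits], axis=1)           # [batch, num_steps+1, num_tags+1]

print(logits.shape)   # (2, 6, 8): one extra time step and one extra tag for the start state
# the transition matrix is therefore built with shape [num_tags+1, num_tags+1]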

model.py code (example):

 # encoding = utf8
import numpy as np
import tensorflow as tf
from tensorflow.contrib.crf import crf_log_likelihood
from tensorflow.contrib.crf import viterbi_decode
from tensorflow.contrib.layers.python.layers import initializers
from tensorflow.contrib import rnn
from utils import result_to_json
from data_utils import create_input, iobes_iob,iob_iobes

 
def network(inputs,shapes,num_tags,lstm_dim=100,initializer = tf.truncated_normal_initializer()):
    '''
    Take one batch of feature data and compute the network output.
    :param inputs: dict of 2-D int tensors [None,None], one entry per feature (char, bound, flag, radical, pinyin)
    :param shapes: dict mapping each feature name to its embedding-table shape
    :param num_tags: number of target tags
    :param lstm_dim: number of LSTM units per direction
    :param initializer: variable initializer
    :return: (logits, lengths)
    '''

    # ----------------------------------- feature embeddings -------------------------------------
    # map every feature id to a fixed-length vector
    embedding=[]
    keys = list(shapes.keys())
    for key in keys:
        with tf.variable_scope(key+'_embedding'):
            char_lookup = tf.get_variable(
                name=key+'_embedding',
                shape=shapes[key],
                initializer=initializer
            )
            # look up the row of char_lookup for each id, i.e. the vector of that character
            embedding.append(tf.nn.embedding_lookup(char_lookup, inputs[key]))  # embed the feature
    embed = tf.concat(embedding,axis=-1)  # shape [None, None, char_dim+bound_dim+flag_dim+radical_dim+pinyin_dim]

    # take the character ids of the input: positive ids become 1, padding (0) stays 0
    sign = tf.sign(tf.abs(inputs[keys[0]]))
    # real length of each sentence
    lengths = tf.reduce_sum(sign,reduction_indices = 1)
    # padded sequence length (number of time steps)
    num_time = tf.shape(inputs[keys[0]])[1]



    # -------------------------------- BiLSTM encoder --------------------------------
    with tf.variable_scope('BiLSTM_layer1'):
        lstm_cell = {}
        for name in ['forward1','backward1']:
            with tf.variable_scope(name):
                lstm_cell[name] = rnn.BasicLSTMCell(
                    # number of units, specified by the caller
                    lstm_dim
                )
        # bidirectional dynamic rnn: lstm_dim units each way, concatenated to 2*lstm_dim
        outputs1,final_states1 = tf.nn.bidirectional_dynamic_rnn(
            lstm_cell['forward1'],
            lstm_cell['backward1'],
            embed,
            dtype = tf.float32,
            # pass the real sequence lengths
            sequence_length = lengths
        )
    outputs1 = tf.concat(outputs1,axis = -1) #b,L,2*lstm_dim

    with tf.variable_scope('BiLSTM_layer2'):
        lstm_cell = {}
        for name in ['forward','backward']:
            with tf.variable_scope(name):
                lstm_cell[name] = rnn.BasicLSTMCell(
                    # number of units, specified by the caller
                    lstm_dim
                )
        # bidirectional dynamic rnn: lstm_dim units each way, concatenated to 2*lstm_dim
        outputs,final_states2 = tf.nn.bidirectional_dynamic_rnn(
            lstm_cell['forward'],
            lstm_cell['backward'],
            outputs1,
            dtype = tf.float32,
            # pass the real sequence lengths
            sequence_length = lengths
        )
    output = tf.concat(outputs,axis = -1) #b,L,2*lstm_dim



    # -------------------------------- output projection --------------------------------
    # tf.matmul only handles 2-D matrices,
    # so reshape to [batch_size*max_length, 2*lstm_dim]
    output = tf.reshape(output,[-1,2*lstm_dim])
    with tf.variable_scope('project_layer1'):
        w = tf.get_variable(
            name = 'w',
            shape = [2*lstm_dim,lstm_dim],
            initializer = initializer
        )
        b = tf.get_variable(
            name = 'b',
            shape = [lstm_dim],
            initializer = tf.zeros_initializer()
        )
        output  =tf.nn.relu(tf.matmul(output,w)+b)
    with tf.variable_scope('project_layer2'):
        w = tf.get_variable(
            name = 'w',
            shape = [lstm_dim,num_tags],
            initializer = initializer
        )
        b = tf.get_variable(
            name = 'b',
            shape = [num_tags],
            initializer = tf.zeros_initializer()
        )
        output  =tf.matmul(output,w)+b
    output = tf.reshape(output,[-1,num_time,num_tags])
    #batch_size,max_length,num_tags
    return output,lengths



class Model(object):
    def __init__(self, dict,lr = 0.0001):
        # -------------------------------- model hyper-parameters --------------------------------

        # vocabulary sizes can be read from the mapping dict (or given directly as numbers)
        self.num_char = len(dict['word'][0])
        self.num_bound = len(dict['bound'][0])
        self.num_flag = len(dict['flag'][0])
        self.num_radical = len(dict['radical'][0])
        self.num_pinyin = len(dict['pinyin'][0])
        self.num_tags = len(dict['label'][0])
        # embedding size for each feature
        self.char_dim = 100
        self.bound_dim = 20
        self.flag_dim = 50
        self.radical_dim = 50
        self.pinyin_dim = 50
        self.lstm_dim = 100
        self.lr = lr
        self.map = dict

        # ----------------------- placeholders for the input data ----------------------------
        self.char_inputs = tf.placeholder(dtype = tf.int32,
                                          shape = [None,None],
                                          name = 'char_inputs')
        self.bound_inputs = tf.placeholder(dtype=tf.int32,
                                           shape=[None, None],
                                           name='bound_inputs')
        self.flag_inputs = tf.placeholder(dtype=tf.int32,
                                          shape=[None, None],
                                          name='flag_inputs')
        self.radical_inputs = tf.placeholder(dtype=tf.int32,
                                             shape=[None, None],
                                             name='radical_inputs')
        self.pinyin_inputs = tf.placeholder(dtype=tf.int32,
                                            shape=[None, None],
                                            name='pinyin_inputs')
        self.targets = tf.placeholder(dtype=tf.int32,
                                      shape=[None, None],
                                      name='targets')
        self.global_step = tf.Variable(0,trainable = False)  # not trainable, only counts training steps
        self.batch_size = tf.shape(self.char_inputs)[0]
        self.num_steps = tf.shape(self.char_inputs)[-1]

        # ------------------------------ compute the model outputs -------------------------------
        self.logits,self.lengths = self.get_logits(self.char_inputs,
                                                   self.bound_inputs,
                                                   self.flag_inputs,
                                                   self.radical_inputs,
                                                   self.pinyin_inputs
                                                   )

        # ------------------------------ compute the loss -------------------------------
        self.cost = self.loss(self.logits,self.targets,self.lengths)

        # ---------------------------- optimization -------------------------------
        # use gradient clipping
        with tf.variable_scope('optimizer'):
            opt = tf.train.AdamOptimizer(self.lr)
            grad_vars = opt.compute_gradients(self.cost)  # gradients of all parameters
            clip_grad_vars = [[tf.clip_by_value(g,-5,5),v] for g,v in grad_vars]  # clipped gradients
            self.train_op  =opt.apply_gradients(clip_grad_vars,self.global_step)  # update parameters with the clipped gradients

        self.saver = tf.train.Saver(tf.global_variables(),max_to_keep = 5)


    def get_logits(self,char,bound,flag,radical,pinyin):
        '''
        Take one batch of feature data and compute the network output.
        :param char: type of int, a tensor of shape 2-D [None,None]
        :param bound: a tensor of shape 2-D [None,None] with type of int
        :param flag: a tensor of shape 2-D [None,None] with type of int
        :param radical: a tensor of shape 2-D [None,None] with type of int
        :param pinyin: a tensor of shape 2-D [None,None] with type of int
        :return: 3-d tensor  [batch_size,max_length,num_tags]
        '''
        shapes = {}
        # vocabulary size * embedding dimension for each feature
        shapes['char']=[self.num_char,self.char_dim]
        shapes['bound']=[self.num_bound,self.bound_dim]
        shapes['flag']=[self.num_flag,self.flag_dim]
        shapes['radical']=[self.num_radical,self.radical_dim]
        shapes['pinyin']=[self.num_pinyin,self.pinyin_dim]
        inputs= {}
        inputs['char'] = char
        inputs['bound'] = bound
        inputs['flag'] = flag
        inputs['radical'] =radical
        inputs['pinyin'] = pinyin

        return network(inputs,shapes,num_tags=self.num_tags,lstm_dim=self.lstm_dim,initializer = tf.truncated_normal_initializer())


    def loss(self, output, targets, lengths, initializer=None):
        '''
        Compute the CRF loss.
        :param output: network logits [batch_size, max_length, num_tags]
        :param targets: gold tag ids [batch_size, max_length]
        :param lengths: real length of each sequence [batch_size]
        :param initializer: unused
        :return: scalar loss
        '''
        b = tf.shape(lengths)[0]
        num_steps = tf.shape(output)[1]
        with tf.variable_scope('crf_loss'):
            small = -1000.0
            start_logits = tf.concat(
                [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])],
                axis=-1


            )
            pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)

            logits = tf.concat([output,pad_logits],axis = -1)
            logits = tf.concat([start_logits,logits],axis = 1)
            targets = tf.concat(
                [tf.cast(self.num_tags * tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1
            )

            self.trans = tf.get_variable(
                name = 'trans',
                shape = [self.num_tags+1,self.num_tags+1],
                initializer = tf.truncated_normal_initializer()
            )
            log_likelihood,self.trans = tf.contrib.crf.crf_log_likelihood(
                inputs = logits,
                tag_indices = targets,
                transition_params = self.trans,
                sequence_lengths = lengths + 1  # +1 for the prepended start step
            )
            return tf.reduce_mean(-log_likelihood)

    def run_step(self,sess,batch,istrain = True,istest=False):
        '''
        Feed one batch to the model: run a training step, or just the forward pass.
        :param sess: tf.Session
        :param batch: list of padded feature arrays
        :param istrain: whether this is a training batch
        :param istest: whether this is a test batch (labels available)
        :return: loss when training, otherwise (logits, lengths)
        '''
        if istrain:
            feed_dict = {
                self.char_inputs:batch[0],
                self.targets: batch[1],
                self.bound_inputs:batch[2],
                self.flag_inputs:batch[3],
                self.radical_inputs:batch[4],
                self.pinyin_inputs:batch[5]
             }
            _, loss = sess.run([self.train_op, self.cost],feed_dict = feed_dict)
            return loss
        elif istest:
            feed_dict = {
                self.char_inputs:batch[0],
                self.bound_inputs:batch[2],
                self.flag_inputs:batch[3],
                self.radical_inputs:batch[4],
                self.pinyin_inputs:batch[5],
             }
            logits,lengths = sess.run([self.logits,self.lengths],feed_dict = feed_dict)
            return logits,lengths
        else:
            feed_dict = {
                self.char_inputs: batch[0],
                self.bound_inputs: batch[1],
                self.flag_inputs: batch[2],
                self.radical_inputs: batch[3],
                self.pinyin_inputs: batch[4],
            }
            logits, lengths = sess.run([self.logits, self.lengths], feed_dict=feed_dict)
            return logits, lengths



    def decode(self, logits, lengths, matrix):
        '''
        Viterbi-decode the predictions for a batch of sequences.
        :param logits: [batch_size, num_steps, num_tags]
        :param lengths: real length of each sequence
        :param matrix: transition matrix
        :return: decoded tag ids
        '''
        paths = []
        small = -1000.0
        start = np.asarray([[small] * self.num_tags + [0]])
        for score, length in zip(logits, lengths):
            # keep only the valid (non-padding) positions
            score = score[:length]
            pad = small * np.ones([length, 1])
            logits = np.concatenate([score, pad], axis=-1)
            logits = np.concatenate([start, logits], axis=0)
            path, _ = viterbi_decode(logits,matrix)

            paths.append(path[1:])
        return paths

    def result_to_json(self,string, tags):
        item = {"string": string, "entities": []}
        entity_name = ""
        entity_start = 0
        idx = 0
        for char, tag in zip(string, tags):
            if tag[0] == "S":
                item["entities"].append({"word": char, "start": idx, "end": idx+1, "type":tag[2:]})
            elif tag[0] == "B":
                entity_name += char
                entity_start = idx
            elif tag[0] == "I":
                entity_name += char
            elif tag[0] == "E":
                entity_name += char
                item["entities"].append({"word": entity_name, "start": entity_start, "end": idx + 1, "type": tag[2:]})
                entity_name = ""
            else:
                entity_name = ""
                entity_start = idx
            idx += 1
        return item

    def predict(self,sess,batch,istrain=False,istest=True):
        '''
        Run prediction and pair every character with its predicted tag.
        :param sess: tf.Session
        :param batch: list of padded feature arrays
        :return: (results, precision) on a test batch, otherwise (results, json items)
        '''
        results = []
        items = []
        matrix = self.trans.eval()
        logits,lengths = self.run_step(sess,batch,istrain,istest)
        paths = self.decode(logits,lengths,matrix)
        chars = batch[0]
        judge = 0
        total_length = 0
        if istest:
            for i in range(len(paths)):
                # real length of sentence i
                length = lengths[i]
                string = [self.map['word'][0][index] for index in chars[i][:length]]
                tags = [self.map['label'][0][index] for index in paths[i]]
                result = [k for k in zip(string,tags)]
                results.append(result)
                # compute token-level accuracy
                labels = batch[1]
                # print('path[{}]:{}'.format(i,paths[i]))
                # print('label[{}]:{}'.format(i,labels[i]))
                judge += sum(np.array([paths[i][index]==labels[i][index] for index in range(length)]).astype(int))
                total_length += length
            precision = judge/total_length*100
            return results,precision
        else:
            for i in range(len(paths)):
                # real length of sentence i
                length = lengths[i]
                string = [self.map['word'][0][index] for index in chars[i][:length]]
                tags = [self.map['label'][0][index] for index in paths[i]]
                result = [k for k in zip(string, tags)]
                results.append(result)
                print(result)
                items = self.result_to_json(string, tags)
            return results,items
data_utils.py code (example):

# encoding = utf8
import re
import math
import codecs
import random
import os
import numpy as np
import pandas as pd
import jieba
import pickle
from tqdm import tqdm

jieba.initialize()

def get_data(name = 'train'):
    '''
    Read every prepared file for the given split, convert the contents to id
    sequences, and augment the data by concatenating adjacent sentences.
    :param name: name of the data split (sub-directory under data/prepare/)
    :return:
    '''
    with open(f'data/Prepare/dict.pkl','rb') as f:
        map_dict = pickle.load(f)


    def item2id(data,w2i):
        '''
        Convert items to ids.
        :param data: items to convert
        :param w2i: item-to-id mapping
        :return: the item's id if it is known, otherwise the id of UNK
        '''
        return [w2i[x] if x in w2i else w2i['UNK'] for x in data]

    results = []
    root = os.path.join('data/prepare/',name)
    files = list(os.listdir(root))
    fileindex=-1
    file_index = []


    for file in tqdm(files):
    #for file in files:
        result=[]

        path = os.path.join(root,file)

        try:
            samples = pd.read_csv(path, sep=',')
        except UnicodeDecodeError:
            # fall back to gbk-encoded files
            samples = pd.read_csv(path, sep=',', encoding='gbk')
        except Exception as e:
            print(e)
            continue

        num_samples = len(samples)
        fileindex += num_samples
        file_index.append(fileindex)
        # record the index at which every sentence starts
        sep_index = [-1]+samples[samples['word']=='sep'].index.tolist()+[num_samples]  # e.g. -1,20,40,50

        # ----------------------------- convert each sentence into id sequences ----------------------------
        for i in range(len(sep_index)-1):
            start = sep_index[i]+1
            end = sep_index[i+1]
            data = []
            for feature in samples.columns:
                #print(list(samples[feature])[start:end],map_dict[feature][1])
                try:
                    data.append(item2id(list(samples[feature])[start:end],map_dict[feature][1]))
                except:
                    print(item2id(list(samples[feature])[start:end],map_dict[feature][1]))
                #print(data)
            result.append(data)
        # concatenate adjacent sentences (none, pairs, triples) so the model can learn from longer contexts

        # ---------------------------------------- data augmentation -------------------------------------
        if name == 'task':
            results.extend(result)
        else:
            two=[]
            for i in range(len(result)-1):
                first = result[i]
                second = result[i+1]
                two.append([first[k]+second[k] for k in range(len(first))])

            three = []
            for i in range(len(result) - 2):
                first = result[i]
                second = result[i + 1]
                third = result[i + 2]
                three.append([first[k] + second[k]+third[k] for k in range(len(first))])
            # use extend rather than append here
            results.extend(result+two+three)

    with open(f'data/prepare/'+name+'.pkl','wb') as f:
        pickle.dump(results,f)

def create_dico(item_list):
    """
    Create a dictionary of items from a list of list of items.
    """
    assert type(item_list) is list
    dico = {}
    for items in item_list:
        for item in items:
            if item not in dico:
                dico[item] = 1
            else:
                dico[item] += 1
    return dico


def create_mapping(dico):
    """
    Create a mapping (item to ID / ID to item) from a dictionary.
    Items are ordered by decreasing frequency.
    """
    sorted_items = sorted(dico.items(), key=lambda x: (-x[1], x[0]))
    id_to_item = {i: v[0] for i, v in enumerate(sorted_items)}
    item_to_id = {v: k for k, v in id_to_item.items()}
    return item_to_id, id_to_item


def zero_digits(s):
    """
    Replace every digit in a string by a zero.
    """
    return re.sub('\d', '0', s)


def iob2(tags):
    """
    Check that tags have a valid IOB format.
    Tags in IOB1 format are converted to IOB2.
    """
    for i, tag in enumerate(tags):
        if tag == 'O':
            continue
        split = tag.split('-')
        if len(split) != 2 or split[0] not in ['I', 'B']:
            return False
        if split[0] == 'B':
            continue
        elif i == 0 or tags[i - 1] == 'O':  # conversion IOB1 to IOB2
            tags[i] = 'B' + tag[1:]
        elif tags[i - 1][1:] == tag[1:]:
            continue
        else:  # conversion IOB1 to IOB2
            tags[i] = 'B' + tag[1:]
    return True


def iob_iobes(tags):
    """
    IOB -> IOBES
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'B':
            if i + 1 != len(tags) and \
               tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('B-', 'S-'))
        elif tag.split('-')[0] == 'I':
            if i + 1 < len(tags) and \
                    tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('I-', 'E-'))
        else:
            raise Exception('Invalid IOB format!')
    return new_tags


def iobes_iob(tags):
    """
    IOBES -> IOB
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag.split('-')[0] == 'B':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'I':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'S':
            new_tags.append(tag.replace('S-', 'B-'))
        elif tag.split('-')[0] == 'E':
            new_tags.append(tag.replace('E-', 'I-'))
        elif tag.split('-')[0] == 'O':
            new_tags.append(tag)
        else:
            raise Exception('Invalid format!')
    return new_tags


def insert_singletons(words, singletons, p=0.5):
    """
    Replace singletons by the unknown word with a probability p.
    """
    new_words = []
    for word in words:
        if word in singletons and np.random.uniform() < p:
            new_words.append(0)
        else:
            new_words.append(word)
    return new_words


def get_seg_features(string):
    """
    Segment text with jieba.
    Features are represented in BIES format:
    0 denotes a single-character word, 1/2/3 denote begin/inside/end of a word.
    """
    #def features(self,string):
        #def _w2f(word):
            #lenth=len(word)
            #if lenth==1:
                #r=[0]
            #if lenth>1:
                #r=[2]*lenth
                #r[0]=1
                #r[-1]=3
            #return r
        #return list(chain.from_iterable([_w2f(word) for word in jieba.cut(string) if len(word.strip())>0]))    
    
    seg_feature = []

    for word in jieba.cut(string):
        if len(word) == 1:
            seg_feature.append(0)
        else:
            tmp = [2] * len(word)
            tmp[0] = 1
            tmp[-1] = 3
            seg_feature.extend(tmp)
    
    return seg_feature
    #return [i for word in jieba.cut(string) for i in range(1,len(word)+1) ]

def create_input(data):
    """
    Take sentence data and return an input for
    the training or the evaluation function.
    """
    inputs = list()
    inputs.append(data['chars'])
    inputs.append(data["segs"])
    inputs.append(data['tags'])
    return inputs


def load_word2vec(emb_path, id_to_word, word_dim, old_weights):
    """
    Load word embedding from pre-trained file
    embedding size must match
    """
    new_weights = old_weights
    print('Loading pretrained embeddings from {}...'.format(emb_path))
    pre_trained = {}
    emb_invalid = 0
    for i, line in enumerate(codecs.open(emb_path, 'r', 'utf-8')):
        line = line.rstrip().split()
        if len(line) == word_dim + 1:
            pre_trained[line[0]] = np.array(
                [float(x) for x in line[1:]]
            ).astype(np.float32)
        else:
            emb_invalid += 1
    if emb_invalid > 0:
        print('WARNING: %i invalid lines' % emb_invalid)
    c_found = 0
    c_lower = 0
    c_zeros = 0
    n_words = len(id_to_word)
    # Lookup table initialization
    for i in range(n_words):
        word = id_to_word[i]
        if word in pre_trained:
            new_weights[i] = pre_trained[word]
            c_found += 1
        elif word.lower() in pre_trained:
            new_weights[i] = pre_trained[word.lower()]
            c_lower += 1
        elif re.sub('\d', '0', word.lower()) in pre_trained:
            new_weights[i] = pre_trained[
                re.sub('\d', '0', word.lower())
            ]
            c_zeros += 1
    print('Loaded %i pretrained embeddings.' % len(pre_trained))
    print('%i / %i (%.4f%%) words have been initialized with '
          'pretrained embeddings.' % (
        c_found + c_lower + c_zeros, n_words,
        100. * (c_found + c_lower + c_zeros) / n_words)
    )
    print('%i found directly, %i after lowercasing, '
          '%i after lowercasing + zero.' % (
        c_found, c_lower, c_zeros
    ))
    return new_weights


def full_to_half(s):
    """
    Convert full-width character to half-width one 
    """
    n = []
    for char in s:
        num = ord(char)
        if num == 0x3000:
            num = 32
        elif 0xFF01 <= num <= 0xFF5E:
            num -= 0xfee0
        char = chr(num)
        n.append(char)
    return ''.join(n)


def cut_to_sentence(text):
    """
    Cut text to sentences 
    """
    sentence = []
    sentences = []
    len_p = len(text)
    pre_cut = False
    for idx, word in enumerate(text):
        sentence.append(word)
        cut = False
        if pre_cut:
            cut=True
            pre_cut=False
        if word in u"!?\n":
            cut = True
            if len_p > idx+1:
                if text[idx+1] in ".\"\'?!":
                    cut = False
                    pre_cut=True

        if cut:
            sentences.append(sentence)
            sentence = []
    if sentence:
        sentences.append("".join(list(sentence)))
    return sentences


def replace_html(s):
    s = s.replace('&quot;','"')
    s = s.replace('&amp;','&')
    s = s.replace('&lt;','<')
    s = s.replace('&gt;','>')
    s = s.replace('&nbsp;',' ')
    s = s.replace("&ldquo;", "")
    s = s.replace("&rdquo;", "")
    s = s.replace("&mdash;","")
    s = s.replace("\xa0", " ")
    return(s)

def get_dict(path):
   with open(path,'rb') as f:
       dict = pickle.load(f)
   return dict

def input_from_line(line, char_to_id):
    """
    Take sentence data and return an input for
    the training or the evaluation function.
    """
    line = full_to_half(line)
    line = replace_html(line)
    inputs = list()
    inputs.append([line])
    line = line.replace(" ", "$")
    inputs.append([[char_to_id[char] if char in char_to_id else char_to_id["<UNK>"]
                   for char in line]])
    inputs.append([get_seg_features(line)])
    inputs.append([[]])
    return inputs


class BatchManager(object):
    '''
    def __init__(self, data,  batch_size):
        self.batch_data = self.sort_and_pad(data, batch_size)
        self.len_data = len(self.batch_data)
    '''
    def __init__(self,batch_size,name='train'):
        with open(f'data/prepare/' + name + '.pkl', 'rb') as f:
            data = pickle.load(f)
        self.batch_data = self.sort_and_pad(data,batch_size,name)
        self.len_data = len(self.batch_data)

    def sort_and_pad(self, data, batch_size, name):
        # total number of batches
        num_batch = int(math.ceil(len(data) / batch_size))
        # print(len(data[0][0]))
        # sort sentences by length so each batch contains sentences of similar length
        sorted_data = sorted(data, key=lambda x: len(x[0]))
        batch_data = list()
        for i in range(num_batch):
            batch_data.append(self.pad_data(sorted_data[i * int(batch_size):(i + 1) * int(batch_size)], name))
        return batch_data

    @staticmethod
    def pad_data(data, name):
        if name != 'task':
            chars = []
            targets = []
            bounds = []
            flags = []
            radicals = []
            pinyins = []

            max_length = max([len(sentence[0]) for sentence in data])  # len(data[-1][0])
            for line in data:
                char, target, bound, flag, radical, pinyin = line
                padding = [0] * (max_length - len(char))
                chars.append(char + padding)
                targets.append(target + padding)
                bounds.append(bound + padding)
                flags.append(flag + padding)
                radicals.append(radical + padding)
                pinyins.append(pinyin + padding)
            return [chars, targets, bounds, flags, radicals, pinyins]
        else:
            chars = []
            bounds = []
            flags = []
            radicals = []
            pinyins = []

            max_length = max([len(sentence[0]) for sentence in data])  # len(data[-1][0])
            for line in data:
                char, bound, flag, radical, pinyin = line
                padding = [0] * (max_length - len(char))
                chars.append(char + padding)
                bounds.append(bound + padding)
                flags.append(flag + padding)
                radicals.append(radical + padding)
                pinyins.append(pinyin + padding)
            return [chars, bounds, flags, radicals, pinyins]

    def iter_batch(self, shuffle=False):
        if shuffle:
            random.shuffle(self.batch_data)
        for idx in range(self.len_data):
            yield self.batch_data[idx]

if __name__ == '__main__':
    get_data('train')
    get_data('test')

3. Model Training Code

import tensorflow as tf
from data_utils import BatchManager,get_dict
from model import Model
import time
batch_size=20
dict_file='data/prepare/dict.pkl'



def train():
    #------- prepare the data ---------
    train_manager=BatchManager(batch_size=20,name='train')
    test_manager = BatchManager(batch_size=100, name='test')

    #-------- load the mapping dict --------
    mapping_dict=get_dict(dict_file)
    #-------- build the model -----
    model=Model(mapping_dict)

    init=tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for i in range(10):
            j=1
            for batch in train_manager.iter_batch(shuffle=True):
                start=time.time()
                loss=model.run_step(sess,batch)
                end=time.time()
                if j%10==0:
                    print('epoch:{},step:{}/{},loss:{},elapse:{},estimate:{}'.format(i+1,
                                                                                     j,
                                                                                     train_manager.len_data,
                                                                                     loss,
                                                                                     end-start,
                                                                                     (end - start)*(train_manager.len_data-j)))
                j+=1
            for batch in test_manager.iter_batch(shuffle=True):
                print(model.predict(sess,batch))

if __name__ == '__main__':
    train()

4. Model Run Results

......
('静', 'O'), ('脉', 'O'), ('血', 'O'), ('SPACE', 'O'), ('num', 'B-Test_Value'), ('.', 'I-Test_Value'), ('SPACE', 'I-Test_Value'), ('num', 'I-Test_Value'), ('SPACE', 'I-Test_Value'), ('m', 'I-Test_Value'), ('l', 'I-Test_Value'), (',', 'O'), ('静', 'O'), ('置', 'O'), ('SPACE', 'O'), ('num', 'O'), ('num', 'O'), ('SPACE', 'O'), ('m', 'O'), ('i', 'O'), ('n', 'O'), ('SPACE', 'O'), ('后', 'O'), ('SPACE', 'O'), ('num', 'O'), ('SPACE', 'O'), ('num', 'O'), ('num', 'O'), ('num', 'O'), ('SPACE', 'O'), ('转', 'O'), ('SPACE', 'O'), ('/', 'O'), ('SPACE', 'O'), ('m', 'O'), ('i', 'O'), ('n', 'O'), ('SPACE', 'O'), ('离', 'O'), ('心', 'O'), ('SPACE', 'O'), ('num', 'O'), ('SPACE', 'O'), ('m', 'O'), ('i', 'O'), ('n', 'O'), (',', 'O'), ('LB', 'O'), ('分', 'O'), ('离', 'O'), ('血', 'O'), ('清', 'O'), (',', 'O'), ('收', 'O'), ('集', 'O'), ('于', 'O'), ('无', 'O'), ('菌', 'O'), ('SPACE', 'O'), ('num', 'O'), ('.', 'O'), ('SPACE', 'O'), ('num', 'O'), ('SPACE', 'O'), ('m', 'O'), ('l', 'O'), ('SPACE', 'O'), ('E', 'O'), ('p', 'O'), ('SPACE', 'O'), ('管', 'O'), ('中', 'O'), (',', 'O')], [('LB', 'O'), ('num', 'O'), ('.', 'O'), ('SPACE', 'O'), ('横', 'O'), ('纹', 'O'), ('肌', 'O'), ('溶', 'O'), ('解', 'O'), ('症', 'O'), ('(', 'O'), ('SPACE', 'O'), ('r', 'O'), ('h', 'O'), ('a', 'O'), ('b', 'O'), ('d', 'O'), ('o', 'O'), ('m', 'O'), ('y', 'O'), ('o', 'O'), ('l', 'O'), ('y', 'O'), ('s', 'O'), ('i', 'O'), ('s', 'O'), (',', 'O'), ('SPACE', 'O'), ('R', 'O'), ('M', 'O'), (')', 'O'), ('是', 'O'), ('强', 'O'), ('体', 'O'), ('力', 'O'), ('活', 'O'), ('动', 'O'), ('、', 'O'), ('感', 'O'), ('LB', 'O'), ('染', 'O'), ('、', 'O'), ('肌', 'B-Symptom'), ('肉', 'I-Symptom'), ('UNK', 'I-Symptom'), ('压', 'I-Symptom'), ('、', 'O'), ('药', 'O'), ('物', 'O'), ('、', 'O'), ('电', 'O'), ('解', 'O'), ('质', 'I-Symptom'), ('紊', 'I-Symptom'), ('乱', 'I-Symptom'), ('、', 'O'), ('内', 'B-Reason'), ('分', 'I-Reason'), ('泌', 'I-Reason'), ('失', 'I-Reason'), ('衡', 'I-Reason'), ('等', 'O'), ('各', 'O'), ('种', 'O'), ('病', 'O'), ('因', 'O'), ('导', 'O'), ('致', 'O'), ('LB', 'O'), ('横', 'O'), ('纹', 'I-Symptom'), ('肌', 'I-Disease'), ('损', 'I-Disease'), ('伤', 'I-Disease'), (',', 'O'), ('肌', 'B-Anatomy'), ('细', 'I-Anatomy'), ('胞', 'I-Anatomy'), ('内', 'O'), ('成', 'O'), ('分', 'O'), ('进', 'O'), ('入', 'O'), ('血', 'O'), ('液', 'O'), ('循', 'O'), ('环', 'O'), ('引', 'O'), ('起', 'O'), ('一', 'O'), ('系', 'O'), ('列', 'O'), ('生', 'O'), ('化', 'O'), ('紊', 'O'), ('乱', 'O'), ('LB', 'O'), ('及', 'O'), ('组', 'O'), ('织', 'O'), ('器', 'O'), ('官', 'O'), ('损', 'O'), ('害', 'O'), ('的', 'O'), ('临', 'O'), ('床', 'O'), ('综', 'O'), ('合', 'O'), ('征', 'O'), ('。', 'O')]], 87.88071065989847)
epoch:2,step:10/12830,loss:11.601938247680664,elapse:0.7420175075531006,estimate:9512.66444683075
......

5. Simplified Version: Single-Feature-Input Model Training and Results

Simplified version:

# encoding = utf-8
import numpy as np
import tensorflow as tf
from tensorflow.contrib.crf import crf_log_likelihood
from tensorflow.contrib.crf import viterbi_decode
from tensorflow.contrib.layers.python.layers import initializers

from utils import result_to_json
from data_utils import create_input, iobes_iob,iob_iobes


class Model(object):
    # initialize the model parameters from the config
    def __init__(self, config):

        self.config = config
        
        self.lr = config["lr"]
        self.char_dim = config["char_dim"]
        self.lstm_dim = config["lstm_dim"]
        self.seg_dim = config["seg_dim"]

        self.num_tags = config["num_tags"]
        self.num_chars = config["num_chars"]  # vocabulary size (total number of distinct characters)
        self.num_segs = 4

        self.global_step = tf.Variable(0, trainable=False)
        self.best_dev_f1 = tf.Variable(0.0, trainable=False)
        self.best_test_f1 = tf.Variable(0.0, trainable=False)
        self.initializer = initializers.xavier_initializer()
        
        

        # add placeholders for the model

        self.char_inputs = tf.placeholder(dtype=tf.int32,
                                          shape=[None, None],
                                          name="ChatInputs")
        self.seg_inputs = tf.placeholder(dtype=tf.int32,
                                         shape=[None, None],
                                         name="SegInputs")

        self.targets = tf.placeholder(dtype=tf.int32,
                                      shape=[None, None],
                                      name="Targets")
        # dropout keep prob
        self.dropout = tf.placeholder(dtype=tf.float32,
                                      name="Dropout")

        used = tf.sign(tf.abs(self.char_inputs))
        length = tf.reduce_sum(used, reduction_indices=1)
        self.lengths = tf.cast(length, tf.int32)
        self.batch_size = tf.shape(self.char_inputs)[0]
        self.num_steps = tf.shape(self.char_inputs)[-1]
        
        
        #Add model type by crownpku bilstm or idcnn
        self.model_type = config['model_type']
        #parameters for idcnn
        self.layers = [
            {
                'dilation': 1
            },
            {
                'dilation': 1
            },
            {
                'dilation': 2
            },
        ]
        self.filter_width = 3
        self.num_filter = self.lstm_dim 
        self.embedding_dim = self.char_dim + self.seg_dim
        self.repeat_times = 4
        self.cnn_output_width = 0
        
        # embeddings for chinese character and segmentation representation
        embedding = self.embedding_layer(self.char_inputs, self.seg_inputs, config)

        if self.model_type == 'bilstm':
            # apply dropout before feed to lstm layer
            model_inputs = tf.nn.dropout(embedding, self.dropout)

            # bi-directional lstm layer
            model_outputs = self.biLSTM_layer(model_inputs, self.lstm_dim, self.lengths)

            # logits for tags
            self.logits = self.project_layer_bilstm(model_outputs)
        
        elif self.model_type == 'idcnn':
            # apply dropout before feed to idcnn layer
            model_inputs = tf.nn.dropout(embedding, self.dropout)

            # idcnn layer
            model_outputs = self.IDCNN_layer(model_inputs)

            # logits for tags
            self.logits = self.project_layer_idcnn(model_outputs)
        
        else:
            raise KeyError

        # loss of the model
        self.loss = self.loss_layer(self.logits, self.lengths)

        with tf.variable_scope("optimizer"):
            optimizer = self.config["optimizer"]
            if optimizer == "sgd":
                self.opt = tf.train.GradientDescentOptimizer(self.lr)
            elif optimizer == "adam":
                self.opt = tf.train.AdamOptimizer(self.lr)
            elif optimizer == "adgrad":
                self.opt = tf.train.AdagradOptimizer(self.lr)
            else:
                raise KeyError

            # apply grad clip to avoid gradient explosion
            grads_vars = self.opt.compute_gradients(self.loss)
            capped_grads_vars = [[tf.clip_by_value(g, -self.config["clip"], self.config["clip"]), v]
                                 for g, v in grads_vars]
            self.train_op = self.opt.apply_gradients(capped_grads_vars, self.global_step)

        # saver of the model
        self.saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)

    def embedding_layer(self, char_inputs, seg_inputs, config, name=None):
        """
        :param char_inputs: character ids of the sentence
        :param seg_inputs: segmentation feature
        :param config: whether to use the segmentation feature
        :return: [1, num_steps, embedding size]
        """
        # e.g. 高:3 血:22 糖:23 和:24 高:3 血:22 压:25  ->  char_inputs=[3,22,23,24,3,22,25]
        # jieba segments the text into 高血糖 / 和 / 高血压, so seg_inputs=[1,2,3]+[0]+[1,2,3]=[1,2,3,0,1,2,3]
        embedding = []
        self.char_inputs_test=char_inputs
        self.seg_inputs_test=seg_inputs
        with tf.variable_scope("char_embedding" if not name else name), tf.device('/gpu:0'):
            self.char_lookup = tf.get_variable(
                    name="char_embedding",
                    shape=[self.num_chars, self.char_dim],
                    initializer=self.initializer)
            # char_lookup is a [num_chars, char_dim] table; each id in char_inputs selects one row,
            # e.g. the character '常' with dictionary index 8 selects row 8
            embedding.append(tf.nn.embedding_lookup(self.char_lookup, char_inputs))
            #self.embedding1.append(tf.nn.embedding_lookup(self.char_lookup, char_inputs))
            if config["seg_dim"]:
                with tf.variable_scope("seg_embedding"), tf.device('/gpu:0'):
                    self.seg_lookup = tf.get_variable(
                        name="seg_embedding",
                        # shape = [num_segs, seg_dim], e.g. [4, 20]
                        shape=[self.num_segs, self.seg_dim],
                        initializer=self.initializer)
                    embedding.append(tf.nn.embedding_lookup(self.seg_lookup, seg_inputs))
            embed = tf.concat(embedding, axis=-1)
        self.embed_test=embed
        self.embedding_test=embedding
        return embed

    
    #IDCNN layer 
    def IDCNN_layer(self, model_inputs, 
                    name=None):
        """
        :param idcnn_inputs: [batch_size, num_steps, emb_size] 
        :return: [batch_size, num_steps, cnn_output_width]
        """
        # tf.expand_dims inserts a dimension of size 1 at the given axis (0-based),
        # turning [batch, num_steps, emb_size] into [batch, 1, num_steps, emb_size] for conv2d
        model_inputs = tf.expand_dims(model_inputs, 1)
        self.model_inputs_test=model_inputs
        reuse = False
        if self.dropout == 1.0:
            reuse = True
        with tf.variable_scope("idcnn" if not name else name):
            # filter shape = [1, filter_width, embedding_dim, num_filter], e.g. [1, 3, 120, 100]
            shape=[1, self.filter_width, self.embedding_dim,
                       self.num_filter]
            print(shape)
            filter_weights = tf.get_variable(
                "idcnn_filter",
                shape=[1, self.filter_width, self.embedding_dim,
                       self.num_filter],
                initializer=self.initializer)
            
            """
            shape of input = [batch, in_height, in_width, in_channels]
            shape of filter = [filter_height, filter_width, in_channels, out_channels]
            """
            layerInput = tf.nn.conv2d(model_inputs,
                                      filter_weights,
                                      strides=[1, 1, 1, 1],
                                      padding="SAME",
                                      name="init_layer",use_cudnn_on_gpu=True)
            self.layerInput_test=layerInput
            finalOutFromLayers = []
            
            totalWidthForLastDim = 0
            for j in range(self.repeat_times):
                for i in range(len(self.layers)):
                    # dilation rates cycle through 1, 1, 2
                    dilation = self.layers[i]['dilation']
                    isLast = True if i == (len(self.layers) - 1) else False
                    with tf.variable_scope("atrous-conv-layer-%d" % i,
                                           reuse=True
                                           if (reuse or j > 0) else False):
                        # w: [filter_height, filter_width, in_channels, out_channels]
                        w = tf.get_variable(
                            "filterW",
                            shape=[1, self.filter_width, self.num_filter,
                                   self.num_filter],
                            initializer=tf.contrib.layers.xavier_initializer())
                        if j==1 and i==1:
                            self.w_test_1=w
                        if j==2 and i==1:
                            self.w_test_2=w                            
                        b = tf.get_variable("filterB", shape=[self.num_filter])
                        # tf.nn.atrous_conv2d(value, filters, rate, padding, name=None)
                        #   value:   4-D input [batch, height, width, channels]
                        #   filters: 4-D filter [filter_height, filter_width, channels, out_channels];
                        #            its third dimension must match the input's channel dimension
                        #   rate:    positive int; dilated convolution has no stride argument
                        #            (stride is fixed at 1). Instead, rate-1 zeros are inserted between
                        #            the filter taps, enlarging the sampling interval on the input;
                        #            rate=1 reduces to an ordinary convolution
                        #   padding: "SAME" or "VALID", which determines the edge padding
                        # With "SAME" padding the output is [batch, height, width, out_channels]; with
                        # "VALID" it is [batch, height-2*(filter_width-1), width-2*(filter_height-1), out_channels]
                        conv = tf.nn.atrous_conv2d(layerInput,
                                                   w,
                                                   rate=dilation,
                                                   padding="SAME")
                        self.conv_test=conv 
                        conv = tf.nn.bias_add(conv, b)
                        conv = tf.nn.relu(conv)
                        if isLast:
                            finalOutFromLayers.append(conv)
                            totalWidthForLastDim += self.num_filter
                        layerInput = conv
            finalOut = tf.concat(axis=3, values=finalOutFromLayers)
            keepProb = 1.0 if reuse else 0.5
            finalOut = tf.nn.dropout(finalOut, keepProb)
            # tf.squeeze removes dimensions of size 1 from the shape of a tensor;
            # passing [1] removes only the height-1 dimension that was added for conv2d,
            # giving [batch, num_steps, totalWidthForLastDim]
            finalOut = tf.squeeze(finalOut, [1])
            finalOut = tf.reshape(finalOut, [-1, totalWidthForLastDim])
            self.cnn_output_width = totalWidthForLastDim
            return finalOut

    def project_layer_bilstm(self, lstm_outputs, name=None):
        """
        hidden layer between lstm layer and logits
        :param lstm_outputs: [batch_size, num_steps, emb_size] 
        :return: [batch_size, num_steps, num_tags]
        """
        with tf.variable_scope("project"  if not name else name):
            with tf.variable_scope("hidden"):
                W = tf.get_variable("W", shape=[self.lstm_dim*2, self.lstm_dim],
                                    dtype=tf.float32, initializer=self.initializer)

                b = tf.get_variable("b", shape=[self.lstm_dim], dtype=tf.float32,
                                    initializer=tf.zeros_initializer())
                output = tf.reshape(lstm_outputs, shape=[-1, self.lstm_dim*2])
                hidden = tf.tanh(tf.nn.xw_plus_b(output, W, b))

            # project to score of tags
            with tf.variable_scope("logits"):
                W = tf.get_variable("W", shape=[self.lstm_dim, self.num_tags],
                                    dtype=tf.float32, initializer=self.initializer)

                b = tf.get_variable("b", shape=[self.num_tags], dtype=tf.float32,
                                    initializer=tf.zeros_initializer())

                pred = tf.nn.xw_plus_b(hidden, W, b)

            return tf.reshape(pred, [-1, self.num_steps, self.num_tags])
    
    #Project layer for idcnn by crownpku
    #Delete the hidden layer, and change bias initializer
    def project_layer_idcnn(self, idcnn_outputs, name=None):
        """
        :param lstm_outputs: [batch_size, num_steps, emb_size] 
        :return: [batch_size, num_steps, num_tags]
        """
        with tf.variable_scope("project"  if not name else name):
            
            # project to score of tags
            with tf.variable_scope("logits"):
                W = tf.get_variable("W", shape=[self.cnn_output_width, self.num_tags],
                                    dtype=tf.float32, initializer=self.initializer)

                b = tf.get_variable("b",  initializer=tf.constant(0.001, shape=[self.num_tags]))

                pred = tf.nn.xw_plus_b(idcnn_outputs, W, b)

            return tf.reshape(pred, [-1, self.num_steps, self.num_tags])

    def loss_layer(self, project_logits, lengths, name=None):
        """
        calculate crf loss
        :param project_logits: [1, num_steps, num_tags]
        :return: scalar loss
        """
        with tf.variable_scope("crf_loss"  if not name else name):
            small = -1000.0
            # pad logits for crf loss
            start_logits = tf.concat(
                [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1)
            pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
            logits = tf.concat([project_logits, pad_logits], axis=-1)
            logits = tf.concat([start_logits, logits], axis=1)
            targets = tf.concat(
                [tf.cast(self.num_tags*tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1)

            self.trans = tf.get_variable(
                "transitions",
                shape=[self.num_tags + 1, self.num_tags + 1],
                initializer=self.initializer)
            # crf_log_likelihood computes the log-likelihood of tag sequences under a linear-chain CRF
            #   inputs:            [batch_size, max_seq_len, num_tags] unary scores, usually the
            #                      (reshaped) BiLSTM output fed into the CRF layer
            #   tag_indices:       [batch_size, max_seq_len] gold tag ids
            #   sequence_lengths:  [batch_size] real length of each sequence
            #   transition_params: [num_tags, num_tags] transition matrix
            # It returns the scalar log_likelihood per sequence and the transition_params
            log_likelihood, self.trans = crf_log_likelihood(
                inputs=logits,
                tag_indices=targets,
                transition_params=self.trans,
                sequence_lengths=lengths+1)
            return tf.reduce_mean(-log_likelihood)

    def create_feed_dict(self, is_train, batch):
        """
        :param is_train: Flag, True for train batch
        :param batch: list train/evaluate data 
        :return: structured data to feed
        """
        _, chars, segs, tags = batch
        feed_dict = {
            self.char_inputs: np.asarray(chars),
            self.seg_inputs: np.asarray(segs),
            self.dropout: 1.0,
        }
        if is_train:
            feed_dict[self.targets] = np.asarray(tags)
            feed_dict[self.dropout] = self.config["dropout_keep"]
        return feed_dict

    def run_step(self, sess, is_train, batch):
        """
        :param sess: session to run the batch
        :param is_train: a flag indicate if it is a train batch
        :param batch: a dict containing batch data
        :return: batch result, loss of the batch or logits
        """
        feed_dict = self.create_feed_dict(is_train, batch)
        if is_train:
            global_step, loss,_,char_lookup_out,seg_lookup_out,char_inputs_test,seg_inputs_test,embed_test,embedding_test,\
                model_inputs_test,layerInput_test,conv_test,w_test_1,w_test_2,char_inputs_test= sess.run(
                [self.global_step, self.loss, self.train_op,self.char_lookup,self.seg_lookup,self.char_inputs_test,self.seg_inputs_test,\
                 self.embed_test,self.embedding_test,self.model_inputs_test,self.layerInput_test,self.conv_test,self.w_test_1,self.w_test_2,self.char_inputs],
                feed_dict)
            return global_step, loss
        else:
            lengths, logits = sess.run([self.lengths, self.logits], feed_dict)
            return lengths, logits

    def decode(self, logits, lengths, matrix):
        """
        :param logits: [batch_size, num_steps, num_tags]float32, logits
        :param lengths: [batch_size]int32, real length of each sequence
        :param matrix: transition matrix for inference
        :return:
        """
        # infer the final labels with the Viterbi algorithm
        paths = []
        small = -1000.0
        start = np.asarray([[small]*self.num_tags +[0]])
        for score, length in zip(logits, lengths):
            score = score[:length]
            pad = small * np.ones([length, 1])
            logits = np.concatenate([score, pad], axis=1)
            logits = np.concatenate([start, logits], axis=0)
            path, _ = viterbi_decode(logits, matrix)

            paths.append(path[1:])
        return paths

    def evaluate(self, sess, data_manager, id_to_tag):
        """
        :param sess: session  to run the model 
        :param data: list of data
        :param id_to_tag: index to tag name
        :return: evaluate result
        """
        results = []
        trans = self.trans.eval()
        for batch in data_manager.iter_batch():
            strings = batch[0]
            tags = batch[-1]
            lengths, scores = self.run_step(sess, False, batch)
            batch_paths = self.decode(scores, lengths, trans)
            for i in range(len(strings)):
                result = []
                string = strings[i][:lengths[i]]
                gold = iobes_iob([id_to_tag[int(x)] for x in tags[i][:lengths[i]]])
                pred = iobes_iob([id_to_tag[int(x)] for x in batch_paths[i][:lengths[i]]])
                #gold = iob_iobes([id_to_tag[int(x)] for x in tags[i][:lengths[i]]])
                #pred = iob_iobes([id_to_tag[int(x)] for x in batch_paths[i][:lengths[i]]])                
                for char, gold, pred in zip(string, gold, pred):
                    result.append(" ".join([char, gold, pred]))
                results.append(result)
        return results

    def evaluate_line(self, sess, inputs, id_to_tag):
        trans = self.trans.eval(session=sess)
        lengths, scores = self.run_step(sess, False, inputs)
        batch_paths = self.decode(scores, lengths, trans)
        tags = [id_to_tag[idx] for idx in batch_paths[0]]
        return result_to_json(inputs[0][0], tags)

# encoding = utf8
import re
import math
import codecs
import random

import numpy as np
import jieba
jieba.initialize()


def create_dico(item_list):
    """
    Create a dictionary of items from a list of list of items.
    """
    assert type(item_list) is list
    dico = {}
    for items in item_list:
        for item in items:
            if item not in dico:
                dico[item] = 1
            else:
                dico[item] += 1
    return dico


def create_mapping(dico):
    """
    Create a mapping (item to ID / ID to item) from a dictionary.
    Items are ordered by decreasing frequency.
    """
    sorted_items = sorted(dico.items(), key=lambda x: (-x[1], x[0]))
    id_to_item = {i: v[0] for i, v in enumerate(sorted_items)}
    item_to_id = {v: k for k, v in id_to_item.items()}
    return item_to_id, id_to_item


def zero_digits(s):
    """
    Replace every digit in a string by a zero.
    """
    return re.sub('\d', '0', s)


def iob2(tags):
    """
    Check that tags have a valid IOB format.
    Tags in IOB1 format are converted to IOB2.
    """
    for i, tag in enumerate(tags):
        if tag == 'O':
            continue
        split = tag.split('-')
        if len(split) != 2 or split[0] not in ['I', 'B']:
            return False
        if split[0] == 'B':
            continue
        elif i == 0 or tags[i - 1] == 'O':  # conversion IOB1 to IOB2
            tags[i] = 'B' + tag[1:]
        elif tags[i - 1][1:] == tag[1:]:
            continue
        else:  # conversion IOB1 to IOB2
            tags[i] = 'B' + tag[1:]
    return True


def iob_iobes(tags):
    """
    IOB -> IOBES
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'B':
            if i + 1 != len(tags) and \
               tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('B-', 'S-'))
        elif tag.split('-')[0] == 'I':
            if i + 1 < len(tags) and \
                    tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('I-', 'E-'))
        else:
            raise Exception('Invalid IOB format!')
    return new_tags


def iobes_iob(tags):
    """
    IOBES -> IOB
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag.split('-')[0] == 'B':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'I':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'S':
            new_tags.append(tag.replace('S-', 'B-'))
        elif tag.split('-')[0] == 'E':
            new_tags.append(tag.replace('E-', 'I-'))
        elif tag.split('-')[0] == 'O':
            new_tags.append(tag)
        else:
            raise Exception('Invalid format!')
    return new_tags


def insert_singletons(words, singletons, p=0.5):
    """
    Replace singletons by the unknown word with a probability p.
    """
    new_words = []
    for word in words:
        if word in singletons and np.random.uniform() < p:
            new_words.append(0)
        else:
            new_words.append(word)
    return new_words


def get_seg_features(string):
    """
    Segment text with jieba;
    features are represented in the BIES format,
    where S denotes a single-character word
    """
    seg_feature = []

    for word in jieba.cut(string):
        if len(word) == 1:
            seg_feature.append(0)
        else:
            tmp = [2] * len(word)
            tmp[0] = 1
            tmp[-1] = 3
            seg_feature.extend(tmp)
    return seg_feature


def create_input(data):
    """
    Take sentence data and return an input for
    the training or the evaluation function.
    """
    inputs = list()
    inputs.append(data['chars'])
    inputs.append(data["segs"])
    inputs.append(data['tags'])
    return inputs


def load_word2vec(emb_path, id_to_word, word_dim, old_weights):
    """
    Load word embedding from pre-trained file
    embedding size must match
    """
    # Convert every character in the vocabulary into a vector: if the character exists in the
    # pre-trained embedding file, initialize it with the pre-trained values; otherwise keep the
    # existing (randomly initialized) weights.
    new_weights = old_weights
    print('Loading pretrained embeddings from {}...'.format(emb_path))
    pre_trained = {}
    emb_invalid = 0
    for i, line in enumerate(codecs.open(emb_path, 'r', 'utf-8')):
        line = line.rstrip().split()
        if len(line) == word_dim + 1:
            pre_trained[line[0]] = np.array(
                [float(x) for x in line[1:]]
            ).astype(np.float32) 
        else:
            emb_invalid += 1
    if emb_invalid > 0:
        print('WARNING: %i invalid lines' % emb_invalid)
    c_found = 0
    c_lower = 0
    c_zeros = 0
    n_words = len(id_to_word)
    # Lookup table initialization
    for i in range(n_words):
        word = id_to_word[i]
        if word in pre_trained:
            new_weights[i] = pre_trained[word]
            c_found += 1
        elif word.lower() in pre_trained:
            new_weights[i] = pre_trained[word.lower()]
            c_lower += 1
        elif re.sub(r'\d', '0', word.lower()) in pre_trained:
            new_weights[i] = pre_trained[
                re.sub(r'\d', '0', word.lower())
            ]
            c_zeros += 1
    print('Loaded %i pretrained embeddings.' % len(pre_trained))
    print('%i / %i (%.4f%%) words have been initialized with '
          'pretrained embeddings.' % (
        c_found + c_lower + c_zeros, n_words,
        100. * (c_found + c_lower + c_zeros) / n_words)
    )
    print('%i found directly, %i after lowercasing, '
          '%i after lowercasing + zero.' % (
        c_found, c_lower, c_zeros
    ))
    return new_weights


def full_to_half(s):
    """
    Convert full-width character to half-width one 
    """
    n = []
    for char in s:
        num = ord(char)
        if num == 0x3000:
            num = 32
        elif 0xFF01 <= num <= 0xFF5E:
            num -= 0xfee0
        char = chr(num)
        n.append(char)
    return ''.join(n)


def cut_to_sentence(text):
    """
    Cut text to sentences 
    """
    sentence = []
    sentences = []
    len_p = len(text)
    pre_cut = False
    for idx, word in enumerate(text):
        sentence.append(word)
        cut = False
        if pre_cut:
            cut=True
            pre_cut=False
        if word in u"!?\n":
            cut = True
            if len_p > idx+1:
                if text[idx+1] in ".\"\'?!":
                    cut = False
                    pre_cut=True

        if cut:
            # keep each cut as a string so the return type is consistent
            sentences.append("".join(sentence))
            sentence = []
    if sentence:
        sentences.append("".join(sentence))
    return sentences


def replace_html(s):
    s = s.replace('&quot;','"')
    s = s.replace('&amp;','&')
    s = s.replace('&lt;','<')
    s = s.replace('&gt;','>')
    s = s.replace('&nbsp;',' ')
    s = s.replace("&ldquo;", "")
    s = s.replace("&rdquo;", "")
    s = s.replace("&mdash;","")
    s = s.replace("\xa0", " ")
    return(s)


def input_from_line(line, char_to_id):
    """
    Take sentence data and return an input for
    the training or the evaluation function.
    """
    line = full_to_half(line)
    line = replace_html(line)
    inputs = list()
    inputs.append([line])
    line = line.replace(" ", "$")  # replace spaces with "$" so they map to a known character
    inputs.append([[char_to_id[char] if char in char_to_id else char_to_id["<UNK>"]
                   for char in line]])
    inputs.append([get_seg_features(line)])
    inputs.append([[]])
    return inputs


class BatchManager(object):

    def __init__(self, data,  batch_size):
        self.batch_data = self.sort_and_pad(data, batch_size)
        self.len_data = len(self.batch_data)

    def sort_and_pad(self, data, batch_size):
        num_batch = int(math.ceil(len(data) /batch_size))
        sorted_data = sorted(data, key=lambda x: len(x[0]))
        batch_data = list()
        for i in range(num_batch):
            batch_data.append(self.pad_data(sorted_data[i*int(batch_size) : (i+1)*int(batch_size)]))
        return batch_data

    @staticmethod
    def pad_data(data):
        strings = []
        chars = []
        segs = []
        targets = []
        max_length = max([len(sentence[0]) for sentence in data])
        for line in data:
            string, char, seg, target = line
            padding = [0] * (max_length - len(string))
            strings.append(string + padding)
            chars.append(char + padding)
            segs.append(seg + padding)
            targets.append(target + padding)
        return [strings, chars, segs, targets]

    def iter_batch(self, shuffle=False):
        if shuffle:
            random.shuffle(self.batch_data)
        for idx in range(self.len_data):
            yield self.batch_data[idx]
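
Before moving on, a quick, hedged sanity check of the tag-scheme and vocabulary helpers above; it assumes the code above is saved as data_utils.py (with jieba installed), and the tag sequence itself is made up:

from data_utils import create_dico, create_mapping, iob_iobes, iobes_iob

# a made-up IOB tag sequence covering a two-character entity and a one-character entity
tags = ['B-PER', 'I-PER', 'O', 'B-LOC']
print(iob_iobes(tags))             # ['B-PER', 'E-PER', 'O', 'S-LOC']
print(iobes_iob(iob_iobes(tags)))  # round-trips back to ['B-PER', 'I-PER', 'O', 'B-LOC']

# character frequency dictionary and the id mappings built from it
dico = create_dico([['手', '术', '手'], ['术']])   # {'手': 2, '术': 2}
char_to_id, id_to_char = create_mapping(dico)      # ids ordered by decreasing frequency
print(char_to_id, id_to_char)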

The utils.py code is as follows (example):

import os
import json
import shutil
import logging

import tensorflow as tf
from conlleval import return_report

models_path = "./models"
eval_path = "./evaluation"
eval_temp = os.path.join(eval_path, "temp")
eval_script = os.path.join(eval_path, "conlleval")


def get_logger(log_file):
    logger = logging.getLogger(log_file)
    logger.setLevel(logging.DEBUG)
    fh = logging.FileHandler(log_file)
    fh.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    ch.setFormatter(formatter)
    fh.setFormatter(formatter)
    logger.addHandler(ch)
    logger.addHandler(fh)
    return logger


# def test_ner(results, path):
#     """
#     Run perl script to evaluate model
#     """
#     script_file = "conlleval"
#     output_file = os.path.join(path, "ner_predict.utf8")
#     result_file = os.path.join(path, "ner_result.utf8")
#     with open(output_file, "w") as f:
#         to_write = []
#         for block in results:
#             for line in block:
#                 to_write.append(line + "\n")
#             to_write.append("\n")
#
#         f.writelines(to_write)
#     os.system("perl {} < {} > {}".format(script_file, output_file, result_file))
#     eval_lines = []
#     with open(result_file) as f:
#         for line in f:
#             eval_lines.append(line.strip())
#     return eval_lines


def test_ner(results, path):
    """
    Run perl script to evaluate model
    """
    output_file = os.path.join(path, "ner_predict.utf8")
    with open(output_file, "w",encoding='utf8') as f:
        to_write = []
        for block in results:
            for line in block:
                to_write.append(line + "\n")
            to_write.append("\n")

        f.writelines(to_write)
    eval_lines = return_report(output_file)
    return eval_lines


def print_config(config, logger):
    """
    Print configuration of the model
    """
    for k, v in config.items():
        logger.info("{}:\t{}".format(k.ljust(15), v))


def make_path(params):
    """
    Make folders for training and evaluation
    """
    if not os.path.isdir(params.result_path):
        os.makedirs(params.result_path)
    if not os.path.isdir(params.ckpt_path):
        os.makedirs(params.ckpt_path)
    if not os.path.isdir("log"):
        os.makedirs("log")


def clean(params):
    """
    Clean current folder
    remove saved model and training log
    """
    if os.path.isfile(params.vocab_file):
        os.remove(params.vocab_file)

    if os.path.isfile(params.map_file):
        os.remove(params.map_file)

    if os.path.isdir(params.ckpt_path):
        shutil.rmtree(params.ckpt_path)

    if os.path.isdir(params.summary_path):
        shutil.rmtree(params.summary_path)

    if os.path.isdir(params.result_path):
        shutil.rmtree(params.result_path)

    if os.path.isdir("log"):
        shutil.rmtree("log")

    if os.path.isdir("__pycache__"):
        shutil.rmtree("__pycache__")

    if os.path.isfile(params.config_file):
        os.remove(params.config_file)

    if os.path.isfile(params.vocab_file):
        os.remove(params.vocab_file)


def save_config(config, config_file):
    """
    Save configuration of the model
    parameters are stored in json format
    """
    with open(config_file, "w", encoding="utf8") as f:
        json.dump(config, f, ensure_ascii=False, indent=4)


def load_config(config_file):
    """
    Load configuration of the model
    parameters are stored in json format
    """
    with open(config_file, encoding="utf8") as f:
        return json.load(f)


def convert_to_text(line):
    """
    Convert conll data to text
    """
    to_print = []
    for item in line:

        try:
            if item[0] == " ":
                to_print.append(" ")
                continue
            word, gold, tag = item.split(" ")
            if tag[0] in "SB":
                to_print.append("[")
            to_print.append(word)
            if tag[0] in "SE":
                to_print.append("@" + tag.split("-")[-1])
                to_print.append("]")
        except:
            print(list(item))
    return "".join(to_print)


def save_model(sess, model, path, logger):
    checkpoint_path = os.path.join(path, "ner.ckpt")
    model.saver.save(sess, checkpoint_path)
    logger.info("model saved")


def create_model(session, Model_class, path, load_vec, config, id_to_char, logger):
    # create model, reuse parameters if exists
    model = Model_class(config)

    ckpt = tf.train.get_checkpoint_state(path)
    if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
        logger.info("Reading model parameters from %s" % ckpt.model_checkpoint_path)
        model.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        logger.info("Created model with fresh parameters.")
        session.run(tf.global_variables_initializer())
        if config["pre_emb"]:
            emb_weights = session.run(model.char_lookup.read_value())
            emb_weights = load_vec(config["emb_file"],id_to_char, config["char_dim"], emb_weights)
            session.run(model.char_lookup.assign(emb_weights))
            logger.info("Load pre-trained embedding.")
    return model


def result_to_json(string, tags):
    item = {"string": string, "entities": []}
    entity_name = ""
    entity_start = 0
    idx = 0
    for char, tag in zip(string, tags):
        if tag[0] == "S":
            item["entities"].append({"word": char, "start": idx, "end": idx+1, "type":tag[2:]})
        elif tag[0] == "B":
            entity_name += char
            entity_start = idx
        elif tag[0] == "I":
            entity_name += char
        elif tag[0] == "E":
            entity_name += char
            item["entities"].append({"word": entity_name, "start": entity_start, "end": idx + 1, "type": tag[2:]})
            entity_name = ""
        else:
            entity_name = ""
            entity_start = idx
        idx += 1
    return item
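
As an illustration of the JSON structure that evaluate_line() ultimately returns, here is a minimal sketch of result_to_json with a made-up string and IOBES tag sequence (assuming the code above is saved as utils.py and its dependencies, e.g. conlleval, are available):

from utils import result_to_json

# hypothetical input: four characters forming a single ORG entity in IOBES tagging
print(result_to_json("北京大学", ["B-ORG", "I-ORG", "I-ORG", "E-ORG"]))
# {'string': '北京大学',
#  'entities': [{'word': '北京大学', 'start': 0, 'end': 4, 'type': 'ORG'}]}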


The main2.py code is as follows (example):

# encoding=utf8

import codecs
import pickle
import itertools
from collections import OrderedDict
import os
import tensorflow as tf
import numpy as np
from model import Model
from loader import load_sentences, update_tag_scheme
from loader import char_mapping, tag_mapping
from loader import augment_with_pretrained, prepare_dataset
from utils import get_logger, make_path, clean, create_model, save_model
from utils import print_config, save_config, load_config, test_ner
from data_utils import load_word2vec, create_input, input_from_line, BatchManager
root_path=os.getcwd()+os.sep
flags = tf.app.flags
flags.DEFINE_boolean("clean",       True,      "clean train folder")
flags.DEFINE_boolean("train",       False,      "Whether train the model")
# configurations for the model
flags.DEFINE_integer("seg_dim",     20,         "Embedding size for segmentation, 0 if not used")
flags.DEFINE_integer("char_dim",    100,        "Embedding size for characters")
flags.DEFINE_integer("lstm_dim",    100,        "Num of hidden units in LSTM, or num of filters in IDCNN")
flags.DEFINE_string("tag_schema",   "iobes",    "tagging schema iobes or iob")

# configurations for training
flags.DEFINE_float("clip",          5,          "Gradient clip")
flags.DEFINE_float("dropout",       0.5,        "Dropout rate")
flags.DEFINE_float("batch_size",    60,         "batch size")
flags.DEFINE_float("lr",            0.001,      "Initial learning rate")
flags.DEFINE_string("optimizer",    "adam",     "Optimizer for training")
flags.DEFINE_boolean("pre_emb",     True,       "Wither use pre-trained embedding")
flags.DEFINE_boolean("zeros",       True,      "Wither replace digits with zero")
flags.DEFINE_boolean("lower",       False,       "Wither lower case")

flags.DEFINE_integer("max_epoch",   100,        "maximum training epochs")
flags.DEFINE_integer("steps_check", 100,        "steps per checkpoint")
flags.DEFINE_string("ckpt_path",    "ckpt",      "Path to save model")
flags.DEFINE_string("summary_path", "summary",      "Path to store summaries")
flags.DEFINE_string("log_file",     "train.log",    "File for log")
flags.DEFINE_string("map_file",     "maps.pkl",     "file for maps")
flags.DEFINE_string("vocab_file",   "vocab.json",   "File for vocab")
flags.DEFINE_string("config_file",  "config_file",  "File for config")
flags.DEFINE_string("script",       "conlleval",    "evaluation script")
flags.DEFINE_string("result_path",  "result",       "Path for results")
flags.DEFINE_string("emb_file",     os.path.join(root_path+"data", "vec.txt"),  "Path for pre_trained embedding")
flags.DEFINE_string("train_file",   os.path.join(root_path+"data", "example.train"),  "Path for train data")
flags.DEFINE_string("dev_file",     os.path.join(root_path+"data", "example.dev"),    "Path for dev data")
flags.DEFINE_string("test_file",    os.path.join(root_path+"data", "example.test"),   "Path for test data")

flags.DEFINE_string("model_type", "idcnn", "Model type, can be idcnn or bilstm")
#flags.DEFINE_string("model_type", "bilstm", "Model type, can be idcnn or bilstm")

FLAGS = tf.app.flags.FLAGS
assert FLAGS.clip < 5.1, "gradient clip shouldn't be too large"
assert 0 <= FLAGS.dropout < 1, "dropout rate must be between 0 and 1"
assert FLAGS.lr > 0, "learning rate must be larger than zero"
assert FLAGS.optimizer in ["adam", "sgd", "adagrad"]


# config for the model
def config_model(char_to_id, tag_to_id):
    config = OrderedDict()
    config["model_type"] = FLAGS.model_type
    config["num_chars"] = len(char_to_id)
    config["char_dim"] = FLAGS.char_dim
    config["num_tags"] = len(tag_to_id)
    config["seg_dim"] = FLAGS.seg_dim
    config["lstm_dim"] = FLAGS.lstm_dim
    config["batch_size"] = FLAGS.batch_size

    config["emb_file"] = FLAGS.emb_file
    config["clip"] = FLAGS.clip
    config["dropout_keep"] = 1.0 - FLAGS.dropout
    config["optimizer"] = FLAGS.optimizer
    config["lr"] = FLAGS.lr
    config["tag_schema"] = FLAGS.tag_schema
    config["pre_emb"] = FLAGS.pre_emb
    config["zeros"] = FLAGS.zeros
    config["lower"] = FLAGS.lower
    return config


def evaluate(sess, model, name, data, id_to_tag, logger):
    logger.info("evaluate:{}".format(name))
    ner_results = model.evaluate(sess, data, id_to_tag)
    eval_lines = test_ner(ner_results, FLAGS.result_path)
    for line in eval_lines:
        logger.info(line)
    f1 = float(eval_lines[1].strip().split()[-1])

    if name == "dev":
        best_test_f1 = model.best_dev_f1.eval()
        if f1 > best_test_f1:
            tf.assign(model.best_dev_f1, f1).eval()
            logger.info("new best dev f1 score:{:>.3f}".format(f1))
        return f1 > best_test_f1
    elif name == "test":
        best_test_f1 = model.best_test_f1.eval()
        if f1 > best_test_f1:
            tf.assign(model.best_test_f1, f1).eval()
            logger.info("new best test f1 score:{:>.3f}".format(f1))
        return f1 > best_test_f1


def train():
    # load data sets
    train_sentences = load_sentences(FLAGS.train_file, FLAGS.lower, FLAGS.zeros)
    dev_sentences = load_sentences(FLAGS.dev_file, FLAGS.lower, FLAGS.zeros)
    test_sentences = load_sentences(FLAGS.test_file, FLAGS.lower, FLAGS.zeros)

    # Use selected tagging scheme (IOB / IOBES)
    update_tag_scheme(train_sentences, FLAGS.tag_schema)
    update_tag_scheme(test_sentences, FLAGS.tag_schema)
    update_tag_scheme(dev_sentences, FLAGS.tag_schema)
    # create maps if not exist
    if not os.path.isfile(FLAGS.map_file):
        # create dictionary for word
        if FLAGS.pre_emb:
            dico_chars_train = char_mapping(train_sentences, FLAGS.lower)[0]
            dico_chars, char_to_id, id_to_char = augment_with_pretrained(
                dico_chars_train.copy(),
                FLAGS.emb_file,
                list(itertools.chain.from_iterable(
                    [[w[0] for w in s] for s in test_sentences])
                )
            )
        else:
            _c, char_to_id, id_to_char = char_mapping(train_sentences, FLAGS.lower)

        # Create a dictionary and a mapping for tags
        _t, tag_to_id, id_to_tag = tag_mapping(train_sentences)
        #with open('maps.txt','w',encoding='utf8') as f1:
            #f1.writelines(str(char_to_id)+" "+id_to_char+" "+str(tag_to_id)+" "+id_to_tag+'\n')
        with open(FLAGS.map_file, "wb") as f:
            pickle.dump([char_to_id, id_to_char, tag_to_id, id_to_tag], f)
    else:
        with open(FLAGS.map_file, "rb") as f:
            char_to_id, id_to_char, tag_to_id, id_to_tag = pickle.load(f)

    # prepare data, get a collection of list containing index
    train_data = prepare_dataset(
        train_sentences, char_to_id, tag_to_id, FLAGS.lower
    )
    dev_data = prepare_dataset(
        dev_sentences, char_to_id, tag_to_id, FLAGS.lower
    )
    test_data = prepare_dataset(
        test_sentences, char_to_id, tag_to_id, FLAGS.lower
    )
    print("%i / %i / %i sentences in train / dev / test." % (
        len(train_data), 0, len(test_data)))

    train_manager = BatchManager(train_data, FLAGS.batch_size)
    dev_manager = BatchManager(dev_data, 100)
    test_manager = BatchManager(test_data, 100)
    # make path for store log and model if not exist
    make_path(FLAGS)
    if os.path.isfile(FLAGS.config_file):
        config = load_config(FLAGS.config_file)
    else:
        config = config_model(char_to_id, tag_to_id)
        save_config(config, FLAGS.config_file)
    make_path(FLAGS)

    log_path = os.path.join("log", FLAGS.log_file)
    logger = get_logger(log_path)
    print_config(config, logger)

    # limit GPU memory
    #tf_config = tf.ConfigProto()
    tf_config = tf.ConfigProto(allow_soft_placement = True)
    tf_config.gpu_options.allow_growth = True
    steps_per_epoch = train_manager.len_data
    with tf.Session(config=tf_config) as sess:
        model = create_model(sess, Model, FLAGS.ckpt_path, load_word2vec, config, id_to_char, logger)
        logger.info("start training")
        loss = []
        with tf.device("/gpu:0"):
            for i in range(FLAGS.max_epoch):
                for batch in train_manager.iter_batch(shuffle=True):
                    step, batch_loss = model.run_step(sess, True, batch)
                    loss.append(batch_loss)
                    if step % FLAGS.steps_check == 0:
                        iteration = step // steps_per_epoch + 1
                        logger.info("iteration:{} step:{}/{}, "
                                    "NER loss:{:>9.6f}".format(
                            iteration, step % steps_per_epoch, steps_per_epoch, np.mean(loss)))
                        loss = []

                # best = evaluate(sess, model, "dev", dev_manager, id_to_tag, logger)
                if i % 7 == 0:
                    save_model(sess, model, FLAGS.ckpt_path, logger)
            # evaluate(sess, model, "test", test_manager, id_to_tag, logger)


def evaluate_line():
    config = load_config(FLAGS.config_file)
    logger = get_logger(FLAGS.log_file)
    # limit GPU memory
    #tf_config = tf.ConfigProto()
    tf_config = tf.ConfigProto(allow_soft_placement=True)
    tf_config.gpu_options.allow_growth = True
    with open(FLAGS.map_file, "rb") as f:
        char_to_id, id_to_char, tag_to_id, id_to_tag = pickle.load(f)
    with tf.Session(config=tf_config) as sess:
        model = create_model(sess, Model, FLAGS.ckpt_path, load_word2vec, config, id_to_char, logger)
        while True:
            line = input("请输入测试句子:")
            result = model.evaluate_line(sess, input_from_line(line, char_to_id), id_to_tag)
            print(result)


def main(_):

    if FLAGS.train:
        if FLAGS.clean:
            clean(FLAGS)
        train()
    else:
        evaluate_line()


if __name__ == "__main__":
    tf.app.run(main)
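
A short note on running the script: the switches above are registered with tf.app.flags, so any of them can be overridden on the command line. The flag names below come from the definitions above; the concrete values are only an illustration, and training additionally requires the data files and the pre-trained vec.txt configured by the flags.

# with the defaults above (train=False), this starts the interactive evaluate_line() loop
python main2.py

# illustrative training run: clean old artifacts first, then train
python main2.py --clean=True --train=True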




Running main2.py produces the following output:

  .......
  optimizer/Adam/update_char_embedding/seg_embedding/seg_embedding/AssignSub (AssignSub) /device:GPU:0
  optimizer/Adam/update_char_embedding/seg_embedding/seg_embedding/group_deps (NoOp) /device:GPU:0
  save/Assign_4 (Assign) /device:GPU:0
  save/Assign_17 (Assign) /device:GPU:0
  save/Assign_18 (Assign) /device:GPU:0

请输入测试句子:现患者一般情况可,双肺呼吸音清晰,未闻及啰音,律齐,各瓣膜听诊区未闻及病理性杂音,腹平坦,软,全腹无压痛、反跳痛及肌紧张,全腹未触及异常包块。右腕及右膝部压痛,表面轻度红肿,活动稍受限。


{'string': '现患者一般情况可,双肺呼吸音清晰,未闻及啰音,律齐,各瓣膜听诊区未闻及病理性杂音,腹平坦,软,全腹无压痛、反跳痛及肌紧张,全腹未触及异常包块。右腕及右膝部压痛,表面轻度红肿,活动稍受限。', 'entities': [{'word': '情况', 'start': 5, 'end': 7, 'type': 'DRU'}, {'word': '双肺呼吸音', 'start': 9, 'end': 14, 'type': 'SYM'}, {'word': '啰音', 'start': 20, 'end': 22, 'type': 'SGN'}, {'word': '瓣膜听诊', 'start': 27, 'end': 31, 'type': 'TES'}, {'word': '病理性杂音', 'start': 35, 'end': 40, 'type': 'SGN'}, {'word': '平坦', 'start': 42, 'end': 44, 'type': 'DRU'}, {'word': '全腹', 'start': 47, 'end': 49, 'type': 'REG'}, {'word': '压痛', 'start': 50, 'end': 52, 'type': 'SGN'}, {'word': '反跳痛', 'start': 53, 'end': 56, 'type': 'SGN'}, {'word': '肌紧张', 'start': 57, 'end': 60, 'type': 'SGN'}, {'word': '全腹', 'start': 61, 'end': 63, 'type': 'REG'}, {'word': '异常包块', 'start': 66, 'end': 70, 'type': 'SGN'}, {'word': '膝部', 'start': 75, 'end': 77, 'type': 'REG'}, {'word': '压痛', 'start': 77, 'end': 79, 'type': 'SGN'}, {'word': '表面', 'start': 80, 'end': 82, 'type': 'ORG'}, {'word': '轻度', 'start': 82, 'end': 84, 'type': 'DEG'}, {'word': '红肿', 'start': 84, 'end': 86, 'type': 'SYM'}, {'word': '活动稍受限', 'start': 87, 'end': 92, 'type': 'SYM'}]}
 
请输入测试句子:

六、Summary

This article briefly introduced the Bi-LSTM+CRF model and walked through its code implementation.