Introduction


In this blog, we will learn how to train a neural network for Named Entity Recognition (NER) using spaCy.



What is Named Entity Recognition?


Named Entity Recognition is the task of identifying entities present in text and classifying them into predefined classes such as person, organization, or location. This task is also known as entity extraction and is a very common problem in the field of Natural Language Processing.
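To make the task concrete, here is a toy illustration: a hypothetical dictionary-based tagger that maps words to entity labels. A real NER model learns these decisions from context rather than from a fixed lookup table.

```python
# Toy dictionary-based entity "tagger" -- purely illustrative;
# a trained NER model infers labels from context, not a lookup table.
KNOWN_ENTITIES = {"London": "geo", "Iraq": "geo", "British": "gpe"}

def toy_ner(text):
    """Return (word, label) pairs for words found in the lookup table."""
    return [(word, KNOWN_ENTITIES[word])
            for word in text.split()
            if word in KNOWN_ENTITIES]

print(toy_ner("Thousands of demonstrators have marched through London"))
# [('London', 'geo')]
```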


What is spaCy?


spaCy is a popular open-source library known for its industrial-strength NLP. It is implemented in Cython, which keeps its processing pipelines fast.

Dataset


For this blog, we will be using this dataset from kaggle.com.
The ner.csv file contains several columns, such as prev-word, sentence_idx, word, and tag. Each row represents one word of a sentence (identified by sentence_idx) together with its entity tag. The tags follow the IOB scheme: B- marks the beginning of an entity, I- marks a token inside it, and O marks a non-entity token. The possible entity classes are the following:


  • geo = Geographical Entity
  • org = Organization
  • per = Person
  • gpe = Geopolitical Entity
  • tim = Time indicator
  • art = Artifact
  • eve = Event
  • nat = Natural Phenomenon
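The B-/I-/O prefixes tell us how token-level tags combine into entity spans. The following sketch (a hypothetical helper, not part of spaCy) shows how adjacent B- and I- tags collapse into multi-token entities:

```python
def iob_to_entities(words, tags):
    """Collapse IOB-tagged tokens into (entity_text, label) spans."""
    entities, current_words, current_label = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):  # a new entity starts here
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [word], tag[2:]
        elif tag.startswith("I-") and current_words:  # entity continues
            current_words.append(word)
        else:  # 'O' tag: close any open entity
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities

words = ["The", "United", "States", "wants", "peace"]
tags = ["O", "B-geo", "I-geo", "O", "O"]
print(iob_to_entities(words, tags))  # [('United States', 'geo')]
```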

How to train an NER model using spaCy


Let us understand step by step how we teach machines to identify entities in sentences. We'll be using the Kaggle CLI, a command-line utility for downloading datasets and performing other operations related to Kaggle.


  • Upload your kaggle.json file, which can be downloaded from your Kaggle account page on kaggle.com. It is used to authenticate you when using the Kaggle CLI.

        from google.colab import files
        files.upload()
        

  • We'll copy the uploaded JSON file to the ~/.kaggle directory (on Linux) so that the Kaggle CLI can use it to verify our credentials.

        # ensure kaggle.json is present
        !ls -lha kaggle.json

        # create ~/.kaggle if needed and copy kaggle.json into it
        !mkdir -p ~/.kaggle
        !cp kaggle.json ~/.kaggle

        # restrict permissions; the Kaggle CLI warns if credentials are world-readable
        !chmod 600 ~/.kaggle/kaggle.json

        # ensure kaggle.json is present in ~/.kaggle
        !ls -al ~/.kaggle
        

  • Download the dataset using the Kaggle CLI.

        !kaggle datasets download -d abhinavwalia95/entity-annotated-corpus
        

  • We'll extract the ner.csv file from the downloaded ZIP archive using unzip.

        !unzip entity-annotated-corpus.zip
        

  • We'll read the ner.csv file into a pandas DataFrame; pandas is a popular Python library for in-memory data processing. We'll read the file using the 'ISO-8859-1' encoding.

        import pandas as pd

        # error_bad_lines=False skips malformed rows
        # (renamed to on_bad_lines in newer pandas versions)
        ner_df = pd.read_csv("ner.csv", encoding="ISO-8859-1", error_bad_lines=False)
        ner_df.head()
        

    Output [ ]
    Unnamed: 0 lemma next-lemma next-next-lemma next-next-pos next-next-shape next-next-word next-pos next-shape next-word pos prev-iob prev-lemma prev-pos prev-prev-iob prev-prev-lemma prev-prev-pos prev-prev-shape prev-prev-word prev-shape prev-word sentence_idx shape word tag
    0 thousand of demonstr NNS lowercase demonstrators IN lowercase of NNS __START1__ __start1__ __START1__ __START2__ __start2__ __START2__ wildcard __START2__ wildcard __START1__ 1.0 capitalized Thousands O
    1 of demonstr have VBP lowercase have NNS lowercase demonstrators IN O thousand NNS __START1__ __start1__ __START1__ wildcard __START1__ capitalized Thousands 1.0 lowercase of O
    2 demonstr have march VBN lowercase marched VBP lowercase have NNS O of IN O thousand NNS capitalized Thousands lowercase of 1.0 lowercase demonstrators O
    3 have march through IN lowercase through VBN lowercase marched VBP O demonstr NNS O of IN lowercase of lowercase demonstrators 1.0 lowercase have O
    4 march through london NNP capitalized London IN lowercase through VBN O have VBP O demonstr NNS lowercase demonstrators lowercase have 1.0 lowercase marched O


  • The ner.csv file contains many columns, but we only need sentence_idx, word, and tag, so we'll drop the other columns and retain only these three.

        ner_df_cleaned = ner_df[['sentence_idx','word','tag']]
        ner_df_cleaned.head()
        

    Output [ ]
    sentence_idx word tag
    1.0 Thousands O
    1.0 of O
    1.0 demonstrators O
    1.0 have O
    1.0 marched O



  • Group rows by sentence_idx so that each group contains the words and tags of one sentence.

        # all data is duplicated after index 281835, so we retain only the rows before it
        # grouping by sentence_idx reassembles the sentences
        sentence_groups = ner_df_cleaned[:281835].groupby(by='sentence_idx')

        # visualize a single group
        sentence_groups.get_group(1.0)
        

    Output [ ]
    sentence_idx word tag
    1.0 Thousands O
    1.0 of O
    1.0 demonstrators O
    1.0 have O
    1.0 marched O
    1.0 through O
    1.0 London B-geo
    1.0 to O
    1.0 protest O
    1.0 the O
    1.0 war O
    1.0 in O
    1.0 Iraq B-geo
    1.0 and O
    1.0 demand O
    1.0 the O
    1.0 withdrawal O
    1.0 of O
    1.0 British B-gpe
    1.0 troops O
    1.0 from O
    1.0 that O
    1.0 country O
    1.0 . O



  • We'll combine the words and tags from each group to form a list of all the sentences and a list of their tags.

        def generate_data():
            '''generate sentences and word_tags from the sentence groups'''
            sentences = []
            word_tags = []
            for _, sentence_df in sentence_groups:
                sentences.append(sentence_df['word'].values)
                word_tags.append(sentence_df['tag'].values)
            return sentences, word_tags
            
        

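Conceptually, the groupby above just buckets rows by their sentence_idx. A stdlib sketch of the same idea, using a few hypothetical sample rows:

```python
from collections import defaultdict

# (sentence_idx, word, tag) rows, as they appear in ner.csv (sample values)
rows = [
    (1.0, "Thousands", "O"),
    (1.0, "London", "B-geo"),
    (2.0, "The", "O"),
    (2.0, "USDA", "B-org"),
]

# bucket rows by sentence index, preserving word order within each sentence
groups = defaultdict(list)
for sentence_idx, word, tag in rows:
    groups[sentence_idx].append((word, tag))

sentences = [[w for w, _ in pairs] for pairs in groups.values()]
word_tags = [[t for _, t in pairs] for pairs in groups.values()]
print(sentences)  # [['Thousands', 'London'], ['The', 'USDA']]
```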
  • Transform the sentences and associated entities into the format expected by spaCy for training: a tuple of the sentence text and a dict of character-offset entity spans. We'll skip the 'O' tag since it signifies that the associated word is not an entity.

        def generate_sentence_offsets_tuple(sentence, tags):
            '''generate character-offset tag spans for the words of a sentence'''
            offsets = []
            index = 0
            sentence_str = " ".join(sentence)
            for word, biluo_tag in zip(sentence, tags):
                if biluo_tag != 'O':
                    offsets.append((index, index + len(word), biluo_tag))
                index = index + len(word) + 1  # +1 for the space separator
            return (sentence_str, {'entities': offsets})
        

    Sample training example looks like: ('Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .', {'entities': [(48, 54, 'B-geo'), (77, 81, 'B-geo'), (111,118, 'B-gpe')]})
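We can sanity-check these character offsets by slicing the sentence string directly:

```python
sentence = ("Thousands of demonstrators have marched through London to protest "
            "the war in Iraq and demand the withdrawal of British troops from "
            "that country .")
entities = [(48, 54, "B-geo"), (77, 81, "B-geo"), (111, 118, "B-gpe")]

# each (start, end) pair should slice out exactly the tagged word
for start, end, tag in entities:
    print(sentence[start:end], tag)
# London B-geo
# Iraq B-geo
# British B-gpe
```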


  • When training a machine learning model, we split the dataset into training and test sets so that we can evaluate the model's performance on unseen data. We'll use 80% of the dataset for training and 20% for testing.

        from sklearn.model_selection import train_test_split

        # training_data is the list of (sentence, annotations) tuples built by
        # applying generate_sentence_offsets_tuple to every sentence
        sentences, word_tags = generate_data()
        training_data = [generate_sentence_offsets_tuple(s, t)
                         for s, t in zip(sentences, word_tags)]

        train_data, test_data = train_test_split(training_data, test_size=0.2)
        print(len(train_data))  # 10324
        print(len(test_data))   # 2582
        

  • We'll train a neural network using spaCy, starting from a blank English model, which is equivalent to training a network from scratch. We'll use minibatch gradient descent for 100 epochs, with batch sizes that compound from 4 up to 32.

        import random

        import spacy
        from spacy.util import minibatch, compounding

        def train(model, training_data, n_epochs=100):
            if model is not None:
                nlp = spacy.load(model)  # load an existing spaCy model
                print("Loaded model '%s'" % model)
            else:
                nlp = spacy.blank("en")  # create a blank Language class
                print("Created blank 'en' model")

            # create the built-in pipeline component and add it to the pipeline;
            # nlp.create_pipe works for built-ins that are registered with spaCy
            if "ner" not in nlp.pipe_names:
                ner = nlp.create_pipe("ner")
                nlp.add_pipe(ner, last=True)
            # otherwise, get it so we can add labels
            else:
                ner = nlp.get_pipe("ner")

            # add labels
            for _, annotations in training_data:
                for ent in annotations.get("entities"):
                    ner.add_label(ent[2])

            # get names of the other pipes to disable them during training
            other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
            with nlp.disable_pipes(*other_pipes):  # only train NER
                # reset and initialize the weights randomly - but only if we're
                # training a new model
                if model is None:
                    nlp.begin_training()
                for itn in range(n_epochs):
                    random.shuffle(training_data)
                    losses = {}
                    # batch up the examples using spaCy's minibatch
                    batches = minibatch(training_data, size=compounding(4.0, 32.0, 1.001))
                    for batch in batches:
                        texts, annotations = zip(*batch)
                        nlp.update(
                            texts,         # batch of texts
                            annotations,   # batch of annotations
                            drop=0.5,      # dropout - make it harder to memorise the data
                            losses=losses,
                        )
                    print("epoch: {} Losses: {}".format(itn, losses))
            return nlp
        

    Model training in progress

        1 trained_model = train(None, train_data)
        2
        3 #output
        4
        5 Created blank 'en' model
        6 epoch: 0 Losses: {'ner': 30446.27592967073}
        7 epoch: 1 Losses: {'ner': 20947.85654213172}
        8 epoch: 2 Losses: {'ner': 18412.71664152751}
        9 epoch: 3 Losses: {'ner': 16969.81513162673}
        10 epoch: 4 Losses: {'ner': 16026.488122032146}
        11 epoch: 5 Losses: {'ner': 15296.464615362096}
        12 epoch: 6 Losses: {'ner': 14495.416703844461}
        13 epoch: 7 Losses: {'ner': 14148.917832007817}
        14 epoch: 8 Losses: {'ner': 13590.593759803203}
        15 epoch: 9 Losses: {'ner': 13251.893195170831}
        16 epoch: 10 Losses: {'ner': 13000.065728428637}
        17 epoch: 11 Losses: {'ner': 12603.155538117413}
        18 epoch: 12 Losses: {'ner': 12232.10885540372}
        19 epoch: 13 Losses: {'ner': 11984.518902786567}
        20 epoch: 14 Losses: {'ner': 11817.013570609211}
        21 epoch: 15 Losses: {'ner': 11610.024949308161}
        22 epoch: 16 Losses: {'ner': 11431.544553471413}
        23 epoch: 17 Losses: {'ner': 11252.836709610057}
        24 epoch: 18 Losses: {'ner': 11130.41010567033}
        25 epoch: 19 Losses: {'ner': 10969.373726896125}
        26 epoch: 20 Losses: {'ner': 10748.728134025527}
        27 epoch: 21 Losses: {'ner': 10652.114904867602}
        28 epoch: 22 Losses: {'ner': 10504.075748611576}
        29 epoch: 23 Losses: {'ner': 10339.76256633853}
        30 epoch: 24 Losses: {'ner': 10161.508989424752}
        31 epoch: 25 Losses: {'ner': 10099.111647699492}
        32 epoch: 26 Losses: {'ner': 10121.724719470036}
        33 epoch: 27 Losses: {'ner': 9903.8388610793}
        34 epoch: 28 Losses: {'ner': 9862.204669761943}
        35 epoch: 29 Losses: {'ner': 9845.34579209974}
        36 epoch: 30 Losses: {'ner': 9640.179330477367}
        37 epoch: 31 Losses: {'ner': 9601.439875621109}
        38 epoch: 32 Losses: {'ner': 9453.592612513428}
        39 epoch: 33 Losses: {'ner': 9405.77650552677}
        40 epoch: 34 Losses: {'ner': 9448.270602199977}
        41 epoch: 35 Losses: {'ner': 9412.674316488057}
        42 epoch: 36 Losses: {'ner': 9237.340063827713}
        43 epoch: 37 Losses: {'ner': 9246.711826243118}
        44 epoch: 38 Losses: {'ner': 9032.161621644267}
        45 epoch: 39 Losses: {'ner': 9091.07784351118}
        46 epoch: 40 Losses: {'ner': 9092.80138208692}
        47 epoch: 41 Losses: {'ner': 8858.207376862454}
        48 epoch: 42 Losses: {'ner': 8770.933757544191}
        49 epoch: 43 Losses: {'ner': 8771.057086230298}
        50 epoch: 44 Losses: {'ner': 8733.567732685347}
        51 epoch: 45 Losses: {'ner': 8748.407910649174}
        52 epoch: 46 Losses: {'ner': 8626.405233896752}
        53 epoch: 47 Losses: {'ner': 8583.270576869752}
        54 epoch: 48 Losses: {'ner': 8549.231337915131}
        55 epoch: 49 Losses: {'ner': 8597.107577511417}
        56 epoch: 50 Losses: {'ner': 8403.992361298118}
        57 epoch: 51 Losses: {'ner': 8598.280064643119}
        58 epoch: 52 Losses: {'ner': 8387.013841765835}
        59 epoch: 53 Losses: {'ner': 8355.211445250077}
        60 epoch: 54 Losses: {'ner': 8341.128360502365}
        61 epoch: 55 Losses: {'ner': 8448.177048899855}
        62 epoch: 56 Losses: {'ner': 8385.22883401729}
        63 epoch: 57 Losses: {'ner': 8170.639822315583}
        64 epoch: 58 Losses: {'ner': 8326.154393525414}
        65 epoch: 59 Losses: {'ner': 8154.260715844838}
        66 epoch: 60 Losses: {'ner': 8223.534263023816}
        67 epoch: 61 Losses: {'ner': 8117.09298854536}
        68 epoch: 62 Losses: {'ner': 8223.052133213543}
        69 epoch: 63 Losses: {'ner': 7950.622636359937}
        70 epoch: 64 Losses: {'ner': 7973.79760990279}
        71 epoch: 65 Losses: {'ner': 7939.64661235293}
        72 epoch: 66 Losses: {'ner': 7831.876079153071}
        73 epoch: 67 Losses: {'ner': 7776.884191541234}
        74 epoch: 68 Losses: {'ner': 7935.922633586549}
        75 epoch: 69 Losses: {'ner': 7880.616704411476}
        76 epoch: 70 Losses: {'ner': 7851.195972533759}
        77 epoch: 71 Losses: {'ner': 7756.105965185443}
        78 epoch: 72 Losses: {'ner': 7867.308930305662}
        79 epoch: 73 Losses: {'ner': 7659.757575999508}
        80 epoch: 74 Losses: {'ner': 7769.828882991424}
        81 epoch: 75 Losses: {'ner': 7693.0472415891145}
        82 epoch: 76 Losses: {'ner': 7741.959972308407}
        83 epoch: 77 Losses: {'ner': 7630.463335377948}
        84 epoch: 78 Losses: {'ner': 7549.603027146961}
        85 epoch: 79 Losses: {'ner': 7572.446941241869}
        86 epoch: 80 Losses: {'ner': 7650.523531033635}
        87 epoch: 81 Losses: {'ner': 7619.652739454669}
        88 epoch: 82 Losses: {'ner': 7417.0719545611255}
        89 epoch: 83 Losses: {'ner': 7470.896579577343}
        90 epoch: 84 Losses: {'ner': 7496.3789243864485}
        91 epoch: 85 Losses: {'ner': 7459.097479669072}
        92 epoch: 86 Losses: {'ner': 7639.487902472665}
        93 epoch: 87 Losses: {'ner': 7717.349279418356}
        94 epoch: 88 Losses: {'ner': 7485.849232150642}
        95 epoch: 89 Losses: {'ner': 7558.215015125212}
        96 epoch: 90 Losses: {'ner': 7440.613367810749}
        97 epoch: 91 Losses: {'ner': 7376.921568825308}
        98 epoch: 92 Losses: {'ner': 7412.506807567684}
        99 epoch: 93 Losses: {'ner': 7476.959613479633}
        100 epoch: 94 Losses: {'ner': 7386.341574123635}
        101 epoch: 95 Losses: {'ner': 7281.48060595249}
        102 epoch: 96 Losses: {'ner': 7355.238130780924}
        103 epoch: 97 Losses: {'ner': 7225.663035039537}
        104 epoch: 98 Losses: {'ner': 7092.297096430188}
        105 epoch: 99 Losses: {'ner': 7235.541923270879}
        

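A note on the batch sizes: compounding(4.0, 32.0, 1.001) does not produce a fixed batch size. It yields sizes that start at 4 and grow by a factor of 1.001 per batch until they are capped at 32. A rough stdlib sketch of this schedule (spaCy's actual helper lives in spacy.util):

```python
def compounding(start, stop, compound):
    """Yield values starting at `start`, multiplied by `compound`
    each step and capped at `stop` (sketch of spaCy's helper)."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

sizes = compounding(4.0, 32.0, 1.001)
first_batches = [round(next(sizes), 3) for _ in range(3)]
print(first_batches)  # [4.0, 4.004, 4.008]
```

Growing the batch size this way gives noisy, small-batch updates early in training and smoother, larger-batch updates later on.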
  • Once the model training is finished, we'll save the model to disk.

        trained_model.to_disk("./ner_model")
        

  • Let's evaluate the model on the test set to check how well it performs on unseen sentences. As we can see below, the trained model is largely able to identify the expected entities in each sentence.

        def evaluate_model(pretrained_model, test_data):
            '''print the correct entities and the entities predicted by the
            spaCy model for a sample of 6 test examples'''
            for test_instance in test_data[4:10]:
                test_sentence, entities = test_instance
                test_doc = pretrained_model(test_sentence)
                predicted_entities = [(ent.text, ent.label_) for ent in test_doc.ents]
                original_entities = [(test_sentence[int(start): int(end)], label)
                                     for start, end, label in entities['entities']]
                print("\n--->" + test_sentence)
                print('predicted entities', predicted_entities)
                print('original entities', original_entities)
        

    Following are some predictions on the test set
    1 --->The United States wants to install 10 land-based interceptor missile silos in Poland and associated radar bases in the Czech Republic .
    2 predicted entities [('United', 'B-geo'), ('States', 'I-geo'), ('Poland', 'B-geo'), ('Czech', 'B-gpe'), ('Republic', 'B-geo')]
    3 original entities [('United', 'B-geo'), ('States', 'I-geo'), ('Poland', 'B-geo'), ('Czech', 'B-gpe'), ('Republic', 'B-geo')]
    4
    5 --->The USDA said it is extremely unlikely any of the sick animals at the plant had Mad Cow disease .
    6 predicted entities [('USDA', 'B-org'), ('Cow', 'B-org')]
    7 original entities [('USDA', 'B-org'), ('Cow', 'B-org')]
    8
    9 --->A 2005 report from the Centers for Disease Control and Prevention estimated that secondhand smoke causes some 3,000 deaths each year from lung cancer , and 46,000 deaths from heart disease .
    10 predicted entities [('2005', 'B-tim'), ('Centers', 'B-org'), ('for', 'I-org'), ('Disease', 'I-org'), ('Control', 'I-org'), ('and', 'I-org'), ('Prevention', 'I-org')]
    11 original entities [('2005', 'B-tim'), ('Disease', 'B-org'), ('Control', 'I-org'), ('and', 'I-org'), ('Prevention', 'I-org')]
    12
    13 --->The delegation members told reporters they were met at the airport by Zimbabwean officials who turned the group back .
    14 predicted entities [('Zimbabwean', 'B-gpe')]
    15 original entities [('Zimbabwean', 'B-gpe')]
    16
    17 --->Two years ago , Henin-Hardenne won this tournament and went on to take the season 's first major at the Australian Open two weeks later .
    18 predicted entities [('Two', 'B-tim'), ('Henin-Hardenne', 'B-per'), ('Australian', 'B-gpe'), ('Open', 'I-eve'), ('two', 'B-tim'), ('later', 'B-tim')]
    19 original entities [('Two', 'B-tim'), ('Henin-Hardenne', 'B-per'), ('Australian', 'B-gpe'), ('Open', 'B-org'), ('later', 'B-tim')]
    20
    21 --->President Bush has telephoned his Afghan counterpart Hamid Karzai to express his support for Sunday 's legislative elections .
    22 predicted entities [('President', 'B-per'), ('Bush', 'I-per'), ('Afghan', 'B-gpe'), ('Hamid', 'B-per'), ('Karzai', 'I-per'), ('Sunday', 'B-tim')]
    23 original entities [('President', 'B-per'), ('Bush', 'I-per'), ('Afghan', 'B-gpe'), ('Hamid', 'B-per'), ('Karzai', 'I-per'), ('Sunday', 'B-tim')]
    

  • Let's compute the evaluation metrics of the model on the test set. We'll use seqeval, a package for evaluating sequence-labelling tasks, to compute metrics such as accuracy, precision, recall, and F1-score. We'll also look at the per-class classification report to see how the model performed on each class.

        1 get_eval_metrics(pretrained_nlp_model, test_data)
        2    
        3 #output
        4 accuracy score: 0.9697575273114841
        5 f1 score: 0.9615673040869891
        6 precision: 0.9583333333333334
        7 recall: 0.9648231753197893
        8 classification report per class
        9           precision    recall  f1-score   support
        10
        11      geo       0.96      0.97      0.96      1795
        12      gpe       0.96      0.97      0.97       813
        13      tim       0.99      0.99      0.99       945
        14      org       0.94      0.93      0.94       918
        15      per       0.93      0.96      0.95       794
        16      art       0.83      0.96      0.89        26
        17      nat       1.00      1.00      1.00        15
        18      eve       0.82      0.90      0.86        10
        19
        20 micro avg       0.96      0.96      0.96      5316
        21 macro avg       0.96      0.96      0.96      5316
        

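To make the numbers above concrete: entity-level precision is the fraction of predicted entities that are correct, and recall is the fraction of gold entities that were found. A pure-Python sketch with hypothetical predicted/gold spans (seqeval computes the same quantities over the whole test set, per class):

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 over (text, label) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# hypothetical gold vs predicted entity spans
gold = [("London", "geo"), ("Iraq", "geo"), ("USDA", "org")]
pred = [("London", "geo"), ("USDA", "org"), ("Cow", "org")]
print(entity_prf(gold, pred))  # precision = recall = f1 = 2/3
```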
  • If required, we can reload the saved model from the filesystem and retrain it with more examples, so that the model is updated with the new data.

Real-world use cases of NER


  • NER can be used to identify relevant words in customer-support queries. For instance, if a customer complains about a product, an NER model can extract the product name, product version, and other details from the query.
  • NER can be trained to identify relevant entities in insurance documents, reducing the manual effort spent searching through them.
  • NER is used alongside other NLP models to build chatbots, which learn patterns from user utterances and identify relevant keywords in the input query.

Summary


We have trained an NER model from scratch using spaCy's neural network, reaching an F1-score of 0.961 on the test set.



Source code for this blog is available here