NLP Basics and Practice Must-Read: The 10 Most Common Natural Language Processing Techniques in One Article

电脑数码精通
2020-06-07 12:30:20


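Common NLP Tasks and Their Resources

Natural language processing (NLP) combines technique and craft to extract useful information from text data. With NLP, we can turn text into a form that computer algorithms can work with. From machine translation and text classification to sentiment analysis, NLP has become an essential skill for data scientists.

Why write this article?

While studying and working on NLP problems, I had to consult a large amount of material, including research papers, blog posts, and competition write-ups. To make it easier for myself and others to follow the progress and challenges of the field, I decided to gather these resources into one guide covering common NLP tasks and the resources related to each.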

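1. Stemming

What is stemming? Stemming reduces a word to its root form, usually by stripping affixes, so that related words are normalized to the same stem. The stem does not have to be a real word. For example, "beautiful" and "beautifully" both stem to "beauti".

Related resources

Paper: Martin Porter's paper on the Porter stemming algorithm
Algorithm: the Porter2 stemming algorithm
Implementation: the stemming library for Python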

```python
from stemming.porter2 import stem

# Stem a single word with the Porter2 algorithm
print(stem("casually"))
```
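To see the idea of suffix stripping without installing anything, here is a minimal sketch in plain Python. It is not the Porter or Porter2 algorithm: the rule list `SUFFIXES`, the function `toy_stem`, and the `min_stem_len` threshold are all made up for illustration, and real stemmers apply far more careful rules.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration of the idea.
SUFFIXES = ("ly", "ful", "ing", "ed", "s")  # hypothetical rule list


def toy_stem(word, min_stem_len=3):
    """Repeatedly strip a known suffix while the remaining stem stays long enough."""
    word = word.lower()
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: -len(suffix)]
                stripped = True
                break  # restart from the first rule on the shortened word
    return word


for w in ("beautiful", "beautifully"):
    print(w, "->", toy_stem(w))
```

Both words end up with the same stem, which is the normalization effect described above. A rule list this small will mangle many words, which is why a tested library is the better choice in practice.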

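2. Lemmatization

What is lemmatization? Lemmatization maps a word to its dictionary form (its lemma), usually taking into account the word's part of speech and its context in the sentence. Unlike a stem, a lemma is always a real word. For example, "beautiful" and "beautifully" are lemmatized to "beautiful" and "beautifully" respectively, whereas an irregular form such as "better" can be mapped to "good".

Related resources

Paper: traditional approaches to lemmatization
Paper: applying deep learning to lemmatization
Dataset: the Treebank-3 dataset
Implementation: the spaCy library for Python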

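```python
import spacy

# spaCy 3+ no longer accepts the "en" shortcut; load a named model instead
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = "good better best"

# Print each token next to its lemma
for token in nlp(doc):
    print(token, token.lemma_)
```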
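The role that part of speech plays in lemmatization can be sketched with a plain dictionary lookup. This is only a toy under stated assumptions: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, the table holds a handful of hand-picked entries, and real lemmatizers rely on large lexicons or trained models.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("best", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("killed", "VERB"): "kill",
    ("mice", "NOUN"): "mouse",
}


def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word itself."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


for word, pos in [("better", "ADJ"), ("better", "ADV"), ("beautifully", "ADV")]:
    print(word, pos, "->", toy_lemmatize(word, pos))
```

The same surface form can have different lemmas depending on its part of speech, which is why lemmatizers usually run after, or together with, a part-of-speech tagger.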

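3. Word Embeddings

What are word embeddings? Word embedding is a technique that converts words into vectors of real numbers so that computers can process natural language, with the goal that words of similar meaning end up close together in the vector space. For example, the word "man" could be represented by a five-dimensional vector.

Related resources

Article: the basic concepts of word embeddings
Paper: a detailed introduction to word embedding techniques
Tool: a word vector visualization tool
Pre-trained word vectors: Facebook's FastText pre-trained word vectors
Pre-trained word vectors: the Google News pre-trained word vectors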

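```python
from gensim.models.keyedvectors import KeyedVectors

# Load the pre-trained Google News vectors (the .bin file must be downloaded separately)
word_vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True
)

# Print the 300-dimensional vector for "human"
print(word_vectors['human'])
```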
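Once words are vectors, closeness in meaning is commonly measured with cosine similarity. The sketch below uses plain Python and three five-dimensional vectors whose values are invented purely for illustration; they do not come from any trained model, and `VECTORS` and `cosine` are hypothetical names.

```python
import math

# Hypothetical 5-dimensional vectors, invented for illustration only.
VECTORS = {
    "man": [0.9, 0.1, 0.4, 0.0, 0.2],
    "woman": [0.8, 0.2, 0.5, 0.1, 0.2],
    "apple": [0.0, 0.9, 0.0, 0.7, 0.1],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print("man vs woman:", round(cosine(VECTORS["man"], VECTORS["woman"]), 3))
print("man vs apple:", round(cosine(VECTORS["man"], VECTORS["apple"]), 3))
```

With these made-up values, "man" scores much closer to "woman" than to "apple". Trained embeddings aim to produce that kind of geometry from data rather than by hand.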

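4. Part-of-Speech Tagging

What is part-of-speech tagging? Part-of-speech (POS) tagging assigns a grammatical category, such as noun or verb, to each word in a sentence. For example, for the sentence "Ashok killed the snake with a stick", POS tagging identifies Ashok as a proper noun and killed as a verb.

Related resources

Paper: recent methods for part-of-speech tagging
Paper: part-of-speech tagging with hidden Markov models
Implementation: the spaCy library for Python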

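```python
import spacy

# spaCy 3+ requires a named model such as "en_core_web_sm" instead of "en"
nlp = spacy.load("en_core_web_sm")
sentence = "Ashok killed the snake with a stick"

# Print each token next to its coarse part-of-speech tag
for token in nlp(sentence):
    print(token, token.pos_)
```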
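Since one of the resources above covers hidden Markov models, here is a minimal sketch of Viterbi decoding, the standard way to find the most probable tag sequence under an HMM. Every probability below is invented for illustration, not estimated from a corpus, and the names `TAGS`, `START_P`, `TRANS_P`, `EMIT_P`, and `viterbi` are assumptions of this sketch. The word "stick" is deliberately ambiguous between NOUN and VERB so that the transition probabilities have to settle it.

```python
import math

TAGS = ["PROPN", "VERB", "DET", "NOUN", "ADP"]
# Hypothetical model parameters, hand-written for this example.
START_P = {"PROPN": 0.5, "DET": 0.5}
TRANS_P = {
    "PROPN": {"VERB": 0.9, "NOUN": 0.1},
    "VERB": {"DET": 0.8, "ADP": 0.2},
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"ADP": 0.7, "VERB": 0.3},
    "ADP": {"DET": 0.9, "PROPN": 0.1},
}
EMIT_P = {
    "PROPN": {"Ashok": 1.0},
    "VERB": {"killed": 0.6, "stick": 0.4},
    "DET": {"the": 0.5, "a": 0.5},
    "NOUN": {"snake": 0.5, "stick": 0.5},
    "ADP": {"with": 1.0},
}
TINY = 1e-12  # stand-in probability for events the toy model never lists


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{}]
    for tag in TAGS:
        score = math.log(START_P.get(tag, TINY)) + math.log(EMIT_P[tag].get(words[0], TINY))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMIT_P[tag].get(words[i], TINY))
            prev, score = max(
                (
                    (p, best[i - 1][p][0] + math.log(TRANS_P[p].get(tag, TINY)) + emit)
                    for p in TAGS
                ),
                key=lambda pair: pair[1],
            )
            best[i][tag] = (score, prev)
    # Backtrack from the best final tag to recover the whole path.
    path = [max(TAGS, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


sentence = "Ashok killed the snake with a stick".split()
print(list(zip(sentence, viterbi(sentence))))
```

Working in log-probabilities avoids numerical underflow on long sentences. Under these toy numbers, "stick" comes out as a noun because a determiner is far more likely to be followed by a noun than by a verb.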

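5. Named Entity Disambiguation

What is named entity disambiguation? Named entity disambiguation determines which real-world entity a mention in a sentence refers to, for example inferring that "Apple" means the company Apple rather than the fruit.

Related resources

Paper: named entity disambiguation based on deep neural networks and a knowledge base
Paper: an approach based on a local neural attention model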
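The intuition behind disambiguation can be sketched as choosing the candidate whose typical context best matches the sentence. The mini knowledge base, its keywords, and the function `disambiguate` below are hypothetical and written only for this example. The papers listed above use learned representations rather than keyword overlap.

```python
import re

# Hypothetical mini knowledge base: each candidate entity lists a few context keywords.
KNOWLEDGE_BASE = {
    "Apple": {
        "Apple Inc. (company)": {"revenue", "iphone", "company", "shares", "ceo"},
        "apple (fruit)": {"eat", "tree", "juice", "pie", "fruit"},
    }
}


def disambiguate(mention, sentence):
    """Pick the candidate whose keywords overlap most with the words of the sentence."""
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    candidates = KNOWLEDGE_BASE[mention]
    # On a tie, max() keeps the first candidate listed in the knowledge base.
    return max(candidates, key=lambda name: len(candidates[name] & context))


print(disambiguate("Apple", "Apple reported record revenue from iPhone sales"))
print(disambiguate("Apple", "She baked an apple pie with fruit from the tree"))
```

Keyword overlap breaks down as soon as the context uses words outside the list, which is one motivation for the embedding-based and attention-based methods in the resources.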

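6. Named Entity Recognition

What is named entity recognition? Named entity recognition (NER) finds the entities in a sentence and classifies them into categories such as person, organization, and date. For example, in "Ram of Apple Inc. travelled to Sydney on 5th October 2017", an NER system would ideally tag Ram as a person, Apple Inc. as an organization (ORG), Sydney as a geopolitical entity (GPE), and 5th October 2017 as a date.

Related resources

Paper: recent research results on named entity recognition
Implementation: the spaCy library for Python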

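```python
import spacy

# spaCy 3+ requires a named model such as "en_core_web_sm" instead of "en"
nlp = spacy.load("en_core_web_sm")
sentence = "Ram of Apple Inc. travelled to Sydney on 5th October 2017"

# Print each token with its entity type (empty string if the token is not in an entity)
for token in nlp(sentence):
    print(token, token.ent_type_)
```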
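For contrast with a statistical model, the sketch below shows a rule-based baseline: a small gazetteer (a list of known names) plus a regular expression for dates. The gazetteer entries, `DATE_RE`, and `toy_ner` are assumptions made for this example. Such rules only recognize what they were written for, which is why trained models are preferred in practice.

```python
import re

# Hypothetical gazetteer of known entity names and their labels.
GAZETTEER = {"Ram": "PERSON", "Apple Inc.": "ORG", "Sydney": "GPE"}
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches dates written like "5th October 2017".
DATE_RE = re.compile(rf"\b\d{{1,2}}(?:st|nd|rd|th)? (?:{MONTHS}) \d{{4}}\b")


def toy_ner(sentence):
    """Return (text, label) pairs in order of appearance in the sentence."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), sentence):
            found.append((match.start(), name, label))
    for match in DATE_RE.finditer(sentence):
        found.append((match.start(), match.group(), "DATE"))
    return [(text, label) for _, text, label in sorted(found)]


print(toy_ner("Ram of Apple Inc. travelled to Sydney on 5th October 2017"))
```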

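7. Sentiment Analysis

What is sentiment analysis? Sentiment analysis uses natural language processing to determine the sentiment expressed in a text, such as positive, negative, or neutral.

Related resources

Article: an article on sentiment analysis
Paper: a range of methods for sentiment analysis
Repository: research and implementations related to sentiment analysis
Datasets: the Multi-Domain Sentiment Dataset and a Twitter sentiment analysis dataset
Competition: a competition on sentiment analysis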
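The simplest form of sentiment analysis counts words from positive and negative word lists. The sketch below is such a lexicon-based baseline. The two word lists and the function `toy_sentiment` are made up for illustration, and the approach ignores negation, sarcasm, and context, which the methods in the resources above try to handle.

```python
import re

# Hypothetical sentiment lexicons; real ones contain thousands of scored entries.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}


def toy_sentiment(text):
    """Classify text by counting positive words minus negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in ("I love this great phone", "The battery is terrible", "It arrived on Monday"):
    print(review, "->", toy_sentiment(review))
```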

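8. Semantic Text Similarity

What is semantic text similarity? Semantic text similarity measures how close two pieces of text are in meaning. Similarity is not the same as relatedness: for example, a car and a bus are similar, whereas a car and fuel are related.

Related resources

Paper: different methods for measuring text similarity
Paper: text similarity analysis using convolutional neural networks (CNNs)
Paper: text similarity analysis using Tree-LSTMs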
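A common starting point before any neural model is plain word overlap. The sketch below computes Jaccard similarity between the word sets of two texts. Note that this is a lexical baseline rather than a semantic measure (it cannot tell that "car" and "bus" are similar), and `jaccard_similarity` is a name chosen for this example.

```python
import re


def jaccard_similarity(text_a, text_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    words_a = set(re.findall(r"[a-z0-9]+", text_a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", text_b.lower()))
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


print(jaccard_similarity("the car is red", "the bus is red"))
print(jaccard_similarity("the car is red", "fuel prices rose again"))
```

The first pair scores well only because the sentences share function words and "red". Capturing that "car" and "bus" themselves are close in meaning requires learned representations such as those in the papers above.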

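9. Language Identification

What is language identification? Language identification determines which language a text is written in, relying mainly on the statistical and syntactic properties of each language.

Related resources

Article: an article on language identification
Paper: a study of several language identification methods
Paper: applying deep neural networks to language identification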
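One simple statistical property is how often a language's most common function words appear. The sketch below counts hits against small stopword lists for three languages. The lists and the function `identify_language` are assumptions of this example; practical systems more often use character n-gram statistics over many languages.

```python
import re

# Hypothetical, very small stopword lists for three languages.
STOPWORDS = {
    "en": {"the", "is", "on", "and", "of", "a", "to", "in"},
    "fr": {"le", "la", "est", "sur", "et", "de", "un", "les"},
    "de": {"der", "die", "das", "ist", "auf", "und", "ein", "nicht"},
}


def identify_language(text):
    """Return the language whose stopwords occur most often, or "unknown"."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, Unicode-aware
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


for sample in ("the cat is on the table", "le chat est sur la table", "die Katze ist auf dem Tisch"):
    print(sample, "->", identify_language(sample))
```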

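10. Text Summarization

What is text summarization? Text summarization shortens a text by identifying its key points and producing a summary from them.

Related resources

Paper: text summarization with a neural attention model
Paper: text summarization with sequence-to-sequence models
Repository: the Google Brain team's text summarization code
Application: Reddit's autotldr bot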

```python
# Note: gensim.summarization was removed in gensim 4.0; this needs gensim < 4.0
from gensim.summarization import summarize

text = (
    "Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. "
    "Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. "
    "Automatic data summarization is part of machine learning and data mining. "
    "The main idea of summarization is to find a subset of data which contains the information of the entire set. "
    "Such techniques are widely used in industry today. "
    "Search engines are an example; others include summarization of documents, image collections and videos. "
    "Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. "
    "For surveillance videos, one might want to extract the important events from the uneventful context. "
    "There are two general approaches to automatic summarization: extraction and abstraction. "
    "Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. "
    "In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might express. "
    "Such a summary might include verbal innovations. "
    "Research to date has focused primarily on extractive methods, which are appropriate for image collection summarization and video summarization."
)

# Print an extractive summary of the text
print(summarize(text))
```
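The extractive approach can also be sketched without any library: score each sentence by how frequent its words are in the whole text, then keep the top-scoring sentences in their original order. This is a simple frequency heuristic, not the algorithm gensim uses, and `STOPWORDS` and `summarize_extractive` are names invented for this sketch.

```python
import re
from collections import Counter

# Hypothetical stopword list: these words are ignored when scoring.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "that", "are", "on", "for", "with"}


def content_words(text):
    """Lowercased words of the text with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize_extractive(text, n_sentences=1):
    """Keep the n sentences whose words are, on average, most frequent in the text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


sample = "Cats sleep a lot. Cats and dogs are popular pets. The weather is nice."
print(summarize_extractive(sample, n_sentences=2))
```

Dividing by the number of content words keeps long sentences from winning on length alone. Abstractive methods, by contrast, generate new sentences instead of selecting existing ones.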