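Research on Charge Prediction Methods for Criminal Cases Based on Judgment Documents (assignment brief, proposal report, thesis of 16,000 words)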
Abstract
Advances in artificial intelligence theory and technology have carried the "X+AI" concept into nearly every industry and have profoundly influenced the reform and development of traditional fields. Legal artificial intelligence is one such emerging interdisciplinary area. The traditional legal services industry struggles to meet growing demand because of staff shortages and complex workloads, and legal AI, with its low cost and high efficiency, offers an attractive solution with good research value and broad application prospects. One important research branch is predicting the charges in a case. Real-world case data, however, are markedly class-imbalanced because of differences in the nature of offenses and in sentencing considerations, which makes it difficult to predict charges accurately and to identify rare charge labels.
To address this problem, this thesis studies charge prediction methods based on judgment documents and augments the original data during preprocessing, at the data level, to reduce the degree of imbalance. It adopts synonym replacement in data space as the augmentation method. Specifically, it implements and integrates algorithms that oversample minority-class samples using a synonym dictionary, locally trained word vectors, and pre-trained word vectors. These are compared with SMOTE, a representative data augmentation algorithm in feature space.
For classification, the FastText and SVM models are each used to validate the augmented datasets. The combinations of each augmentation method with the two classifiers are compared in three respects: the running time of the augmentation algorithm, the classifier score, and the model's training and prediction time. The experimental results show that synonym replacement based on locally trained word vectors, combined with the FastText model, achieves the best overall performance in running efficiency and classification quality.
Keywords: charge prediction; data augmentation; SMOTE; FastText; SVM
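As a rough illustration of the data-space augmentation idea summarized in the abstract, the sketch below oversamples a minority class by generating synonym-replaced copies of its samples. It is a minimal sketch under stated assumptions, not the thesis's implementation: the `SYNONYMS` table, the `augment` and `oversample` helpers, the replacement probability, and the English toy sentences are all hypothetical. The thesis works on segmented Chinese text and draws synonyms from a dictionary or from word-vector similarity.

```python
import random
from collections import Counter

# Hypothetical toy synonym table. A real system would load a thesaurus or
# look up nearest neighbours in a word-embedding space instead.
SYNONYMS = {
    "stole": ["took", "pilfered"],
    "car": ["vehicle", "automobile"],
    "money": ["cash", "funds"],
    "forged": ["falsified"],
    "contract": ["agreement"],
}

def augment(tokens, rng, replace_prob=0.5):
    """Return a copy of tokens with some words swapped for a synonym."""
    out = []
    for tok in tokens:
        candidates = SYNONYMS.get(tok)
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out

def oversample(samples, seed=0):
    """Oversample each minority class up to the size of the largest class.

    samples is a list of (tokens, label) pairs. New samples are
    synonym-replaced variants of randomly chosen samples with the same label.
    """
    rng = random.Random(seed)
    by_label = {}
    for tokens, label in samples:
        by_label.setdefault(label, []).append(tokens)
    target = max(len(docs) for docs in by_label.values())
    result = list(samples)
    for label, docs in by_label.items():
        for _ in range(target - len(docs)):
            result.append((augment(rng.choice(docs), rng), label))
    return result

data = [
    ("the defendant stole a car".split(), "theft"),
    ("he stole money from the shop".split(), "theft"),
    ("she stole a phone".split(), "theft"),
    ("the defendant forged a contract".split(), "fraud"),
]
balanced = oversample(data)
print(Counter(label for _, label in balanced))
```

Because replacement happens on the tokens themselves, the augmented samples remain readable text and can be fed to any downstream classifier. SMOTE, by contrast, interpolates between samples after they have been converted to feature vectors.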
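Contents
Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
1.1 Background and Significance
1.2 Related Work in China and Abroad
1.3 Research Objectives and Content
1.4 Thesis Organization
Chapter 2 The Judgment Document Dataset and Preprocessing
2.1 Dataset Overview
2.2 Data Preprocessing
2.2.1 Data Cleaning
2.2.2 Chinese Word Segmentation and Stop-Word Removal
2.2.3 Data Analysis
2.3 Evaluation Metrics
2.4 Chapter Summary
Chapter 3 Augmentation Methods for Imbalanced Data
3.1 The Data Imbalance Problem
3.2 Data Augmentation in Data Space
3.2.1 Replacement Based on a Synonym Dictionary
3.2.2 Replacement Based on Word-Vector Similarity
3.3 Data Augmentation in Feature Space
3.3.1 Vector Representation of Text
3.3.2 The SMOTE Algorithm
3.4 Statistical Analysis of the Augmented Datasets
3.5 Chapter Summary
Chapter 4 Comparative Experiments on Charge Prediction from Judgment Documents
4.1 Basic Process of Chinese Text Classification
4.2 Experimental Models
4.2.1 SVM Model
4.2.2 FastText Model
4.3 Experimental Procedure and Analysis of Classification Results
4.3.1 Experimental Environment
4.3.2 Experimental Design
4.3.3 Analysis of Experimental Results
4.4 Chapter Summary
Chapter 5 Conclusion and Future Work
5.1 Summary of Work
5.2 Future Work
References
Acknowledgments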