
Counting word frequencies in text review data

This code counts how often each word appears in a movie review dataset and prints the 10 most frequent words.

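word_dic = {}

# The with block closes the file automatically when the loop finishes.
with open("ratings_train.txt", "r", encoding="UTF8") as txtdata:
    for review in txtdata:
        # Each row is tab-separated: id, document (review text), label.
        A = review.strip().split("\t")
        # Skip the header row so "document" is not counted as a word.
        if A[-1] == "label":
            continue
        # split() with no argument ignores repeated spaces and empty reviews.
        lst_word = A[1].split()
        for word in lst_word:
            if word not in word_dic:
                word_dic[word] = 0
            word_dic[word] += 1

# Sort the (word, count) pairs by count, highest first.
list_word_freq = sorted(word_dic.items(), key=lambda x: x[1], reverse=True)

print(list_word_freq[0:10])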
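The same count can also be sketched with `collections.Counter` from the standard library, which handles the "start at zero, then add one" bookkeeping itself. The function name `count_words` and the sample rows below are made up for illustration; they only imitate the id, document, label layout of the file.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated words in the review column of tab-separated rows."""
    counter = Counter()
    for line in lines:
        fields = line.strip().split("\t")
        # Skip the header row and any row that does not have three columns.
        if len(fields) != 3 or fields[-1] == "label":
            continue
        counter.update(fields[1].split())
    return counter


# Hypothetical sample rows in the same id<TAB>document<TAB>label layout.
sample = [
    "id\tdocument\tlabel",
    "1\t정말 재미있는 영화\t1",
    "2\t정말 지루한 영화\t0",
]

# most_common(10) returns up to 10 (word, count) pairs, highest count first.
top_words = count_words(sample).most_common(10)
print(top_words)
```

To run it on the real file, pass the open file object instead of `sample`, for example `count_words(open("ratings_train.txt", encoding="UTF8"))`.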

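The label column marks each review as positive (1) or negative (0). To print the most frequent words from positive reviews only, add the following code

    # Skip the header row.
    if A[-1] == "label":
        continue
    # Skip every review that is not labeled positive (1).
    if int(A[-1]) != 1:
        continue

inside the for loop, right after the line is split into A and before the words are counted.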
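Putting the filter and the count together, a minimal sketch might look like the following. It assumes the label is the last tab-separated column and that "1" means positive; the function name `top_words_for_label` and the sample rows are hypothetical, not part of the dataset.

```python
def top_words_for_label(lines, wanted_label, n=10):
    """Return the n most frequent words among reviews whose label equals wanted_label."""
    word_dic = {}
    for review in lines:
        A = review.strip().split("\t")
        # Skip the header row and rows that do not have three columns.
        if len(A) != 3 or A[-1] == "label":
            continue
        # Keep only reviews with the requested label ("1" is assumed to mean positive).
        if A[-1] != wanted_label:
            continue
        for word in A[1].split():
            word_dic[word] = word_dic.get(word, 0) + 1
    # Sort (word, count) pairs by count, highest first, and keep the first n.
    return sorted(word_dic.items(), key=lambda x: x[1], reverse=True)[:n]


# Hypothetical sample rows in the same id<TAB>document<TAB>label layout.
sample_rows = [
    "id\tdocument\tlabel",
    "1\t정말 재미있는 영화\t1",
    "2\t정말 지루한 영화\t0",
    "3\t정말 최고\t1",
]

positive_top = top_words_for_label(sample_rows, "1")
print(positive_top)
```

Passing `"0"` instead of `"1"` gives the same ranking for negative reviews, which makes it easy to compare the two lists side by side.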

The review data file (ratings_train.txt) can be downloaded from the link below.
https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt
