
[Data Processing] Writing a Data Augmentation Script Using Back Translation

by A Lim Han 2023. 7. 12.

💖 Data augmentation script using the back translation technique

1. Import the Python libraries for data processing and deep learning

import pandas as pd
from glob import glob
import os
import numpy as np
from tqdm import tqdm, tqdm_notebook

import random
import torch
import torch.nn.functional as F

 

Library   Description & purpose
pandas    Data manipulation
glob      File search
os        Interaction with the operating system
numpy     Numerical computation
tqdm      Progress visualization
random    Random number generation
torch     PyTorch deep learning framework

2. KOR -> EN translation with Papago, A) Import the required libraries

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
import time
from bs4 import BeautifulSoup
import os, shutil

 

Selenium libraries

Library         Purpose
webdriver       Drives a browser through WebDriver
ActionChains    Performs mouse and keyboard actions on a web page
Keys            Provides the special keyboard keys
By              Provides the strategies for locating web elements
Options         Customizes the browser
WebDriverWait   Waits until a given condition is met

 

Other libraries

Library         Purpose
BeautifulSoup   Parses and searches HTML and XML documents
requests        Sends HTTP requests to web pages
time            Adds time delays
os, shutil      Manipulates files and directories

3. KOR -> EN translation with Papago, B) Write the code that controls the Chrome browser through WebDriver

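from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('chromedriver.exe'))  # Selenium 4 takes the driver path through a Service object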

4. KOR -> EN translation with Papago, C) Translate Korean text into English and print the result

# Open the URL in the automated browser
translate_url = 'https://papago.naver.com/?sk=ko&tk=en' # sk=ko : Korean & tk=en : English
driver.get(translate_url)

time.sleep(2)

# Korean sample text to translate
text_kor = '건강하게 살기 위해서는 운동이 중요한 요소이다.'

# Send the text to translate to the element whose id is txtSource
driver.find_element(By.ID,'txtSource').send_keys(text_kor)

time.sleep(2)

# Read the translated text from the element whose id is txtTarget
translated_EN = driver.find_element(By.ID,'txtTarget').text
print(translated_EN)

<< Translation result >>
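The fixed time.sleep(2) above assumes Papago always finishes within two seconds. A minimal sketch of a polling alternative follows. The helper name wait_for_text and its timeout and interval defaults are illustrative assumptions, not part of the original script, and Selenium's own WebDriverWait can serve the same purpose. The helper only relies on driver.find_element and the element's text attribute, and the string "id" is the locator value that By.ID stands for.

```python
import time


def wait_for_text(driver, element_id, timeout=10.0, interval=0.5):
    """Poll an element until its text is non-empty or the timeout expires.

    wait_for_text is a hypothetical helper used for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "id" is the locator strategy string behind Selenium's By.ID
        text = driver.find_element("id", element_id).text
        if text.strip():
            return text
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no text in #{element_id} after {timeout} s")
        time.sleep(interval)


# Possible use in place of the second time.sleep(2):
# translated_EN = wait_for_text(driver, 'txtTarget')
```

Note that this returns as soon as any text appears, so a page that shows partial output while still translating may need an extra check.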

5. Translate the English text from step 4 into Japanese with Papago

# Open the URL in the automated browser
translate_url = 'https://papago.naver.com/?sk=en&tk=ja' # sk=en : English & tk=ja : Japanese
driver.get(translate_url)

time.sleep(2)

# Send the text to translate to the element whose id is txtSource
driver.find_element(By.ID,'txtSource').send_keys(translated_EN)

time.sleep(2)

# Read the translated text from the element whose id is txtTarget
translated_JA = driver.find_element(By.ID,'txtTarget').text
print(translated_JA)

<< Translation result >>

6. Translate the Japanese text from step 5 into Korean with Papago

# Open the URL in the automated browser
translate_url = 'https://papago.naver.com/?sk=ja&tk=ko' # sk=ja : Japanese & tk=ko : Korean
driver.get(translate_url)

time.sleep(2)

# Send the text to translate to the element whose id is txtSource
driver.find_element(By.ID,'txtSource').send_keys(translated_JA)

time.sleep(2)

# Read the translated text from the element whose id is txtTarget
translated_final = driver.find_element(By.ID,'txtTarget').text
print(translated_final)

<< Translation result >>
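Steps 4 to 6 differ only in the sk (source) and tk (target) query parameters of the Papago URL. The sketch below builds that URL from two language codes. The helper name papago_url and the KNOWN_CODES set are assumptions made for illustration; the set lists only the three codes this post uses, not everything Papago supports.

```python
from urllib.parse import urlencode

PAPAGO_BASE = "https://papago.naver.com/"
# Only the codes used in this post (an assumption for illustration)
KNOWN_CODES = {"ko", "en", "ja"}


def papago_url(source, target):
    """Build the Papago URL for one translation direction."""
    for code in (source, target):
        if code not in KNOWN_CODES:
            raise ValueError(f"unexpected language code: {code!r}")
    if source == target:
        raise ValueError("source and target must differ")
    # sk = source language, tk = target language
    return PAPAGO_BASE + "?" + urlencode({"sk": source, "tk": target})


print(papago_url("ko", "en"))  # https://papago.naver.com/?sk=ko&tk=en
```

Rejecting unknown codes up front catches a typo such as 'jp' before a browser is ever opened.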

7. Implement a Back Translation function to automate the steps above

# Function implementation
from selenium.webdriver.chrome.service import Service

def back_translation(input_text, input_type, trans_type):

    # Selenium 4 takes the driver path through a Service object
    driver = webdriver.Chrome(service=Service('chromedriver.exe'))
    translate_url = 'https://papago.naver.com/?sk={0}&tk={1}'.format(input_type, trans_type)

    # Open the URL in the automated browser
    driver.get(translate_url)
    time.sleep(1)

    # Send the text to translate to the element whose id is txtSource
    driver.find_element(By.ID, 'txtSource').send_keys(input_text)
    time.sleep(2)

    # Read the translated text from the element whose id is txtTarget
    translated_contents = driver.find_element(By.ID, 'txtTarget').text
    time.sleep(2)

    driver.quit()

    return translated_contents

 

Function parameters

Parameter    Description
input_text   String holding the text to translate
input_type   String holding the language code of the text to translate
trans_type   String holding the language code to translate into
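back_translation() starts and quits a Chrome instance on every call, which becomes slow once it is applied to many rows. One possible variation, sketched below, takes a browser that is already open so a single instance can serve every call. The name translate_with and the pause parameter are hypothetical. The function uses only get, find_element, send_keys, and text, and the string "id" is the locator value behind By.ID.

```python
import time

PAPAGO_URL = "https://papago.naver.com/?sk={0}&tk={1}"


def translate_with(driver, input_text, input_type, trans_type, pause=2.0):
    """Translate input_text on Papago using a browser that is already open.

    translate_with is a hypothetical variant of back_translation().
    """
    driver.get(PAPAGO_URL.format(input_type, trans_type))
    time.sleep(pause)  # give the page time to load
    # "id" is the locator strategy string behind Selenium's By.ID
    driver.find_element("id", "txtSource").send_keys(input_text)
    time.sleep(pause)  # give Papago time to translate
    return driver.find_element("id", "txtTarget").text


# Possible usage, with the driver created once as in step 3:
# try:
#     text_en = translate_with(driver, '안녕', 'ko', 'en')
#     text_ja = translate_with(driver, text_en, 'en', 'ja')
# finally:
#     driver.quit()
```

Because each call loads a fresh URL, the source box starts empty and does not need to be cleared between calls.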

8. Implement back_trans_final(), which translates the input text in the order "Korean → English → Japanese → Korean"

def back_trans_final(input_text): # text to translate
    
    #kor>en
    text_en=back_translation(input_text,'ko','en')
    
    time.sleep(1)
    #en>ja
    text_ja=back_translation(text_en,'en','ja')
    
    time.sleep(1)
    #ja>kor
    text_final=back_translation(text_ja,'ja','ko')
    
    return text_final
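back_trans_final() hard-codes the route ko → en → ja → ko. As a rough sketch of the same idea, the route can be passed in as a list of language codes and walked pair by pair. The names round_trip and fake_translate are made up for this example. fake_translate only tags the text with the target code so the chain can be followed without a browser; in practice back_translation would take its place.

```python
def round_trip(text, langs, translate):
    """Translate text through each consecutive pair of codes in langs.

    translate is any callable taking (text, source_code, target_code),
    for example back_translation from step 7.
    """
    if len(langs) < 2:
        raise ValueError("need at least a source and a target language")
    for src, dst in zip(langs, langs[1:]):
        text = translate(text, src, dst)
    return text


def fake_translate(text, src, dst):
    # Stand-in for a real translator: appends the target code
    return f"{text}>{dst}"


print(round_trip("hi", ["ko", "en", "ja", "ko"], fake_translate))  # hi>en>ja>ko
```

With the real function, round_trip(text, ['ko', 'en', 'ja', 'ko'], back_translation) would follow the same route as back_trans_final(text).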

9. Create a new DataFrame with the pandas library

# Test whether the function works
dfdf=pd.DataFrame(['안녕','건강하게 살기 위해서는 운동이 중요한 요소이다.'])
dfdf[0]

10. Apply back_trans_final() to the values in the first column of the DataFrame and print the result

dfdf[0]=dfdf[0].apply(back_trans_final)
dfdf[0]
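The assignment above overwrites the original sentences with their back-translated versions. For augmentation it is common to keep both, and to skip results that come back empty or unchanged. The sketch below shows one way to do that with plain lists. The function name augment, the row keys, and the stub translator are illustrative assumptions.

```python
def augment(texts, back_translate):
    """Return the original texts plus the back-translated variants that differ."""
    rows = []
    for text in texts:
        rows.append({"text": text, "augmented": False})
        new_text = back_translate(text).strip()
        # Keep a variant only if it is non-empty and actually changed
        if new_text and new_text != text.strip():
            rows.append({"text": new_text, "augmented": True})
    return rows


# Stub translator for demonstration; in practice pass back_trans_final
stub = {"안녕": "안녕", "운동은 중요하다.": "운동은 중요합니다."}
rows = augment(list(stub), stub.get)
print(len(rows))  # 3
```

The resulting list of dicts can be handed to pd.DataFrame(rows) to get a frame with a text column and an augmented flag.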

 


++ Try augmenting real voice phishing data with the script written here!

-->  https://alim11.tistory.com/395

 
