
[Data Processing] Writing a Data Augmentation Script Using Sentence Rearrangement

by A Lim Han 2023. 7. 13.

🌨️ Data augmentation script using the sentence rearrangement technique

1️⃣ Import the random module + implement the sentence rearrangement function sentence_rearrangement

import random

def sentence_rearrangement(sentence):
    words = sentence.split()  # split the sentence into words on whitespace
    random.shuffle(words)  # shuffle the word order randomly (in place)
    new_sentence = ' '.join(words)  # join the words back into a sentence
    return new_sentence
※ What is the random module?

The random module is part of Python's standard library and provides pseudo-random number generation.
With it you can generate random numbers, pick random elements, shuffle sequences, and handle other kinds of randomness.
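As a small illustration of the features just listed, the sketch below exercises a few functions from the random module. The seed value 42 and the items list are arbitrary choices made for this example and are not part of the original script.

```python
import random

random.seed(42)  # arbitrary seed (assumption) so repeated runs give the same results

items = ["a", "b", "c", "d", "e"]  # sample data for illustration only

number = random.randint(1, 10)     # random integer N with 1 <= N <= 10
fraction = random.random()         # random float in [0.0, 1.0)
picked = random.choice(items)      # one random element of the sequence
subset = random.sample(items, 3)   # 3 distinct elements; items itself is untouched

shuffled = items.copy()
random.shuffle(shuffled)           # shuffles the list in place and returns None

print(number, fraction, picked, subset, shuffled)
```

Note that random.shuffle() modifies its argument and returns None, which is why the script shuffles the words list first and joins it afterwards.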

 

The sentence_rearrangement() function
1. split() separates the input sentence into words on whitespace  -->  the result is stored in the words variable
2. random.shuffle() randomly reorders the words in the words list, in place
3. join() combines the reordered words back into a single sentence  -->  the combined sentence is stored in the new_sentence variable
4. The rearranged sentence stored in new_sentence is returned
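The four steps above can be traced on a concrete input. The sketch below is a variant of sentence_rearrangement that accepts an optional random.Random instance. The name sentence_rearrangement_seeded and the rng parameter are assumptions added here so that a run can be reproduced; they are not part of the original script.

```python
import random

def sentence_rearrangement_seeded(sentence, rng=None):
    """Hypothetical variant: shuffle the words with an optional dedicated RNG."""
    if rng is None:
        rng = random.Random()        # fresh generator, seeded by the system
    words = sentence.split()         # step 1: split on whitespace
    rng.shuffle(words)               # step 2: shuffle in place (returns None)
    new_sentence = ' '.join(words)   # step 3: join with single spaces
    return new_sentence              # step 4: return the rearranged sentence

sample = "저희 검찰로 오셔서 간단한 조사를 받으셔야 하는데"
first = sentence_rearrangement_seeded(sample, random.Random(0))
second = sentence_rearrangement_seeded(sample, random.Random(0))

print(first)
print(first == second)  # generators built from the same seed give the same order
```

Passing a generator in, rather than calling random.seed() globally, keeps the augmentation reproducible without affecting other code that uses the random module.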

2️⃣ Set the text data to augment, then run the augmentation

# Sentence to augment (Korean; roughly: "You need to come to our prosecutors' office
# for a brief investigation; how long would it take you to get here?")
original_sentence = "저희 검찰로 오셔서 간단한 조사를 받으셔야 하는데 혹시 오시는데 시간이 얼마나 걸리시죠?"

# Data augmentation through sentence rearrangement
augmented_sentences = []
for _ in range(5):  # generate 5 rearranged sentences
    augmented_sentence = sentence_rearrangement(original_sentence)
    augmented_sentences.append(augmented_sentence)

 

original_sentence variable: stores the original text data (sentence) to augment
augmented_sentences list: an empty list that will hold the augmented sentences

 

for loop
1. Call the sentence_rearrangement() function
2. Rearrange original_sentence
3. Append the augmented sentence to the augmented_sentences list
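Each call to random.shuffle() is independent, so the loop can in principle produce the same ordering twice, or even the original ordering. The following is a minimal sketch of one way to guard against that, assuming duplicates are unwanted in the augmented set. The helper augment_unique and its max_tries parameter are hypothetical names introduced for this example.

```python
import random

def sentence_rearrangement(sentence):
    words = sentence.split()
    random.shuffle(words)
    return ' '.join(words)

def augment_unique(sentence, n, max_tries=100):
    """Hypothetical helper: collect up to n distinct rearrangements that differ from the input."""
    normalized = ' '.join(sentence.split())  # same spacing as the shuffled output
    seen = set()
    results = []
    for _ in range(max_tries):  # bounded, so very short sentences cannot loop forever
        if len(results) >= n:
            break
        candidate = sentence_rearrangement(sentence)
        if candidate != normalized and candidate not in seen:
            seen.add(candidate)
            results.append(candidate)
    return results

original_sentence = "저희 검찰로 오셔서 간단한 조사를 받으셔야 하는데 혹시 오시는데 시간이 얼마나 걸리시죠?"
augmented_sentences = augment_unique(original_sentence, 5)
for sentence in augmented_sentences:
    print(sentence)
```

For a one-word input the only possible ordering is the original, so this helper returns an empty list instead of retrying indefinitely.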

3️⃣ Print the augmentation results

# Print the results
print("Original sentence:", original_sentence)
print("Augmented sentences:")
for sentence in augmented_sentences:
    print(sentence)

 


 

🌨️ Full code of the data augmentation script

import random

def sentence_rearrangement(sentence):
    words = sentence.split()  # split the sentence into words on whitespace
    random.shuffle(words)  # shuffle the word order randomly (in place)
    new_sentence = ' '.join(words)  # join the words back into a sentence
    return new_sentence

# Sentence to augment (Korean; roughly: "You need to come to our prosecutors' office
# for a brief investigation; how long would it take you to get here?")
original_sentence = "저희 검찰로 오셔서 간단한 조사를 받으셔야 하는데 혹시 오시는데 시간이 얼마나 걸리시죠?"

# Data augmentation through sentence rearrangement
augmented_sentences = []
for _ in range(5):  # generate 5 rearranged sentences
    augmented_sentence = sentence_rearrangement(original_sentence)
    augmented_sentences.append(augmented_sentence)

# Print the results
print("Original sentence:", original_sentence)
print("Augmented sentences:")
for sentence in augmented_sentences:
    print(sentence)

 

Original sentence: 저희 검찰로 오셔서 간단한 조사를 받으셔야 하는데 혹시 오시는데 시간이 얼마나 걸리시죠?
Augmented sentences:
혹시 저희 시간이 조사를 얼마나 간단한 걸리시죠? 오시는데 받으셔야 오셔서 검찰로 하는데
받으셔야 검찰로 혹시 저희 오셔서 조사를 오시는데 간단한 시간이 하는데 얼마나 걸리시죠?
받으셔야 검찰로 간단한 혹시 조사를 오시는데 시간이 걸리시죠? 얼마나 오셔서 하는데 저희
걸리시죠? 오시는데 시간이 혹시 검찰로 간단한 얼마나 받으셔야 저희 오셔서 조사를 하는데
하는데 혹시 간단한 검찰로 시간이 얼마나 오시는데 받으셔야 오셔서 저희 조사를 걸리시죠?