✒️ Kibwa Voice Phishing Prev Project/Data Processing

[Data Processing] Writing an Automation Script for Downloading STT Data

by A Lim Han 2023. 6. 23.

🍀 Writing an automation script for downloading STT data

 

① Install the boto3 package with pip and import the boto3 module

!pip install boto3

import boto3

 

※ What is boto3?

The Python SDK used to interact with AWS services.
--> The imported boto3 module can be used to create an AWS S3 client and download files.

 

 

② Set the AWS account credentials and the AWS Region

# Set the AWS account credentials and the AWS Region
aws_access_key_id = "your account's access key ID"
aws_secret_access_key = "your account's secret access key"
aws_region = "your account's AWS Region"

 

※ What is an AWS Region?

: A geographic area in which the AWS cloud is offered.
Each Region consists of an independent set of data centers and is divided into multiple Availability Zones.

 

 

※ What are AWS account credentials?

: Information that grants permission to access AWS resources. It generally consists of an access key ID and a secret access key; temporary credentials also carry a session token.

 

AWS account credentials
Access key | Secret access key
Identifier used for programmatic access | Password used for programmatic access
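Keys typed directly into a notebook are easy to leak. As a minimal sketch, the same three values can be read from environment variables instead. The helper name `load_aws_settings` is hypothetical, and the fallback Region `ap-northeast-2` (Seoul) is an assumption; the variable names `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` are the standard ones that boto3 also looks up on its own.

```python
import os


def load_aws_settings(env=None, default_region="ap-northeast-2"):
    """Hypothetical helper: read AWS credentials and Region from environment variables."""
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # The keys of this dict match the keyword arguments of boto3.client().
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        "region_name": env.get("AWS_DEFAULT_REGION", default_region),
    }
```

The returned dictionary can be unpacked into the client call, for example `boto3.client("s3", **load_aws_settings())`.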

 

 

③ Implement the download_json_file() function that downloads the JSON files

def download_json_file(job_numbers, output_path):
    # Set the AWS account credentials and the AWS Region
    aws_access_key_id = "your account's access key ID"
    aws_secret_access_key = "your account's secret access key"
    aws_region = "your account's AWS Region"

    # Create the AWS S3 client
    s3 = boto3.client("s3", aws_access_key_id=aws_access_key_id,
                            aws_secret_access_key=aws_secret_access_key,
                            region_name=aws_region)

    # Loop over the files to download
    for job_number in job_numbers:

        # Build the paths for the file to download (job_url is informational; it is not used below)
        job_url = "URL where the file is located"
        file_name = f"수사기관 사칭형_{job_number}.json"  # "investigative-agency impersonation type"
        file_path = f"{output_path}/{file_name}"

        # Download the file: Key is the S3 object key (assumed to equal file_name), Filename is the local path
        s3.download_file(Bucket="bucket_name", Key=file_name, Filename=file_path)

 

※ Line 2 ~ 5

Enter the account information set in step ②.

 

※ Line 7 ~ 10

Create the S3 client from the account information set in step ②.

 

※ Line 12 ~ 13

Implement the file download loop inside the download_json_file function.
-->  It iterates over each element of the job_numbers list and downloads the corresponding file.

 

※ Line 15 ~ 18

Build the file paths using the job_url, file_name, and file_path variables.

 

๋ณ€์ˆ˜๋ช…
job_url file_name file_path
AWS Transcribe ์ž‘์—…์˜ URL ์ƒ์„ฑ ๋‹ค์šด๋กœ๋“œํ•  ํŒŒ์ผ์˜ ์ด๋ฆ„ ์ง€์ • ๋‹ค์šด๋กœ๋“œํ•  ํŒŒ์ผ์ด ์ €์žฅ๋  ๊ฒฝ๋กœ ์ง€์ •
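The S3 object key and the local save path are two different things, and they are easy to swap. The sketch below builds both for a single job number. `build_paths` and its `key_prefix` parameter are hypothetical, and it assumes the object key is simply the file name under an optional prefix, which should be checked against the actual bucket layout.

```python
import posixpath
from pathlib import Path


def build_paths(job_number, output_path, key_prefix=""):
    """Hypothetical helper: return (S3 object key, local file path) for one job number."""
    file_name = f"수사기관 사칭형_{job_number}.json"
    # S3 keys always use forward slashes, regardless of the local operating system.
    s3_key = posixpath.join(key_prefix, file_name) if key_prefix else file_name
    # The local path follows the conventions of the machine running the script.
    local_path = Path(output_path) / file_name
    return s3_key, local_path
```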

 

※ Line 20 ~ 21

Download the file using the s3.download_file function.
-->  It downloads the file from AWS S3 and saves it to the path specified by file_path.
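The download step can also be sketched so that one missing file does not stop the whole run. This variant takes the S3 client as an argument, which makes it easy to try out without real credentials. The function name `download_json_files`, the `bucket` parameter, and the assumption that each object key equals the file name are all illustrative rather than taken from the original script.

```python
from pathlib import Path


def download_json_files(s3_client, bucket, job_numbers, output_path):
    """Sketch: download one JSON file per job number; return (saved paths, failures)."""
    output_dir = Path(output_path)
    # The target directory must exist before download_file writes into it.
    output_dir.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for job_number in job_numbers:
        file_name = f"수사기관 사칭형_{job_number}.json"
        local_path = output_dir / file_name
        try:
            # Key is the object's name in the bucket; Filename is where it is written locally.
            s3_client.download_file(Bucket=bucket, Key=file_name, Filename=str(local_path))
            saved.append(local_path)
        except Exception as error:  # e.g. a missing key or denied access
            failed.append((job_number, error))
    return saved, failed
```

With a real client this would be called as `download_json_files(boto3.client("s3"), "bucket_name", job_numbers, output_path)`.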

 

 

④ Set the numbers of the files to download and the download path

# Set the file numbers to download and the path to save them to
job_numbers = [8, 9, 10, 11, 12, 13, 14, 15, 16]  # List of file numbers to download
output_path = "path to download into"
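Because the job numbers here are consecutive, the list can also be generated instead of typed out. A small sketch: the stop value of `range` is exclusive, so 17 is needed to include 16.

```python
# Equivalent to writing [8, 9, 10, 11, 12, 13, 14, 15, 16] by hand
job_numbers = list(range(8, 17))
print(job_numbers)  # [8, 9, 10, 11, 12, 13, 14, 15, 16]
```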

 

 

⑤ Call the download_json_file function to download the JSON files

# Run the JSON file download
download_json_file(job_numbers, output_path)
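Once the call returns, a quick check that each file parses as JSON can catch incomplete downloads. The sketch below assumes the files follow the standard Amazon Transcribe output layout, where the full text sits under `results.transcripts[0].transcript`. `read_transcript` is a hypothetical helper, and the layout should be confirmed against the actual files.

```python
import json
from pathlib import Path


def read_transcript(json_path):
    """Hypothetical helper: return the transcript text from one Transcribe-style JSON file."""
    with Path(json_path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumed layout: {"results": {"transcripts": [{"transcript": "..."}]}}
    transcripts = data.get("results", {}).get("transcripts", [])
    return transcripts[0]["transcript"] if transcripts else ""
```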