Naeil Baeum Camp, main camp day 51

TIL

Naeil Baeum Camp, main camp day 51 - Web crawling

수현조 2025. 2. 11. 22:59

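At today's team meeting we talked about what to build for our LLM project and settled on an optimal prompt generator.

My part is data collection.

Copying and pasting every entry by hand sounded miserable, so I looked up how to use Selenium.

First, all of this needs to be installed:

Install Selenium: pip install selenium
Install ChromeDriver: brew install --cask chromedriver
Install WebDriver Manager: pip3 install webdriver_manager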

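1️⃣ Basic concepts of web crawling with Selenium

Selenium drives a real browser, so it can navigate a website automatically and extract data from it.
React-based sites load their data with JavaScript, so fetching the HTML and parsing it with BeautifulSoup alone often returns almost nothing.
Workaround: use WebDriverWait, execute_script(), Keys.END (scrolling), and similar tools to read the data after the JavaScript has run.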
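
To make the point about JavaScript-loaded pages concrete, here is a small standard-library sketch. Both HTML strings are made-up samples (not copied from any real site): one imitates the empty shell a server sends for a client-rendered app, the other imitates the DOM after the browser has run the scripts.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a static parser would see in `html`."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical markup: the shell sent before any JavaScript runs.
STATIC_SHELL = """
<html><body>
  <div id="root"></div>
  <script src="/static/js/main.js"></script>
</body></html>
"""

# Hypothetical markup: the same page after the scripts have rendered it.
RENDERED_DOM = """
<html><body>
  <div id="root"><h6>Sample prompt title</h6><p>Sample prompt body</p></div>
  <script src="/static/js/main.js"></script>
</body></html>
"""

print(repr(visible_text(STATIC_SHELL)))   # ''
print(repr(visible_text(RENDERED_DOM)))   # 'Sample prompt title Sample prompt body'
```

With Selenium, the second kind of markup is roughly what driver.page_source returns once the page has finished rendering, which is why waiting matters so much.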

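2️⃣ Crawling https://careerhackeralex.com/prompt_explorer

✅ Problems

The site is React-based, so the initial HTML contains none of the content.
The list shows only the title and part of the body; getting the full text requires clicking into each post.
overflow: hidden causes only part of each post to be displayed.
Some posts show their full text only after a '더보기' (Show more) button is clicked.

✅ Solutions

Scroll down automatically so that all the data loads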

for _ in range(15):  
    driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)
    time.sleep(3)
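
A fixed 15 rounds can be too many or too few. As an alternative, here is a minimal sketch that keeps scrolling until the page height stops growing. The function name and its parameters are my own, and it assumes the page scrolls the window itself rather than an inner container (I have not verified that for this site).

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=30):
    """Scroll to the bottom until the document height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded items time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was loaded, so stop here
        last_height = new_height
    return max_rounds  # safety cap reached
```

Calling scroll_until_stable(driver) would replace the fixed loop; max_rounds is only a safety cap in case the page never stops growing.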

Wait for the JavaScript to run, until a title element (h6) is present

WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.XPATH, "//h6"))
)
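
One caveat: presence_of_element_located returns as soon as the first h6 exists, which can be before the rest of the list has rendered. WebDriverWait.until() accepts any callable that takes the driver, so a custom condition is possible. The helper below is a hypothetical sketch of my own, not part of Selenium.

```python
def at_least_n_elements(by, value, n):
    """Build a wait condition that is truthy once `n` or more elements match."""

    def condition(driver):
        found = driver.find_elements(by, value)
        # WebDriverWait keeps polling while the condition returns a falsy value.
        return found if len(found) >= n else False

    return condition


# Usage with Selenium (not executed here; the threshold 5 is an arbitrary example):
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support.ui import WebDriverWait
#   titles = WebDriverWait(driver, 20).until(at_least_n_elements(By.XPATH, "//h6", 5))
```

On success, until() returns whatever the condition returned, so titles would already be the list of matching elements.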

Click any '더보기' (Show more) buttons automatically

more_buttons = driver.find_elements(By.XPATH, "//button[contains(text(), '더보기')]")
for btn in more_buttons:
    driver.execute_script("arguments[0].click();", btn)
    time.sleep(3)

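Open each post to get the full content

# articles: the title elements found on the list page
article_links = [article.find_element(By.XPATH, "./ancestor::a").get_attribute("href") for article in articles]
for link in article_links:
    driver.get(link)  # navigate to the post's own page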
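
It helps to turn the links into plain strings before navigating anywhere, because element references from the list page go stale once the browser loads another page. A small sketch, assuming the anchors are already located (the helper name is mine):

```python
def collect_hrefs(anchor_elements):
    """Read href strings from anchor elements, dropping blanks and duplicates.

    Order is preserved, so posts are visited in the order they appear.
    """
    hrefs = (a.get_attribute("href") for a in anchor_elements)
    return list(dict.fromkeys(h for h in hrefs if h))


# Usage with Selenium (not executed here):
#   from selenium.webdriver.common.by import By
#   anchors = driver.find_elements(By.XPATH, "//a[.//h6]")  # links that wrap a title
#   article_links = collect_hrefs(anchors)
```

De-duplicating matters if a card contains more than one element that leads to the same link.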

Override the CSS (remove overflow: hidden)

driver.execute_script("""
    var elements = document.querySelectorAll('*');
    for (var i = 0; i < elements.length; i++) {
        elements[i].style.overflow = 'visible';
        elements[i].style.maxHeight = 'none';
    }
""")
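
Another option that might avoid touching the CSS at all: element.text returns only the text Selenium considers visible, while the DOM's textContent property holds the element's full text. A hedged sketch (the helper name is mine); it only helps if the full text is actually in the DOM. If the list page only ever receives an excerpt, no DOM trick will recover the rest.

```python
def full_text(element):
    """Return an element's text, preferring the DOM's textContent."""
    raw = element.get_attribute("textContent")
    if raw and raw.strip():
        # Collapse the whitespace that textContent preserves from the markup.
        return " ".join(raw.split())
    return element.text.strip()  # fall back to the rendered text
```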

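3️⃣ Saving the data (CSV & TXT)

✅ Save as a CSV file (can be opened in Excel)

with open("careerhackeralex_prompts.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Full content", "Tags"])  # CSV header row

    for i in range(len(titles)):
        # Guard the shorter lists so a missing body or tag does not raise IndexError
        content = contents[i].text.strip() if i < len(contents) else "N/A"
        tag = tags[i].text.strip() if i < len(tags) else "N/A"
        writer.writerow([titles[i].text.strip(), content, tag])

📂 File location: saved as careerhackeralex_prompts.csv in the current working directory.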

✅ Save as a TXT file

with open("careerhackeralex_prompts.txt", "w", encoding="utf-8") as file:
    for i in range(len(titles)):
        file.write(f"📌 Title: {titles[i].text.strip()}\n")
        file.write(f"📝 Content: {contents[i].text.strip() if i < len(contents) else 'N/A'}\n")
        file.write(f"🏷 Tags: {tags[i].text.strip() if i < len(tags) else 'N/A'}\n")
        file.write("=" * 50 + "\n")

📂 File location: saved as careerhackeralex_prompts.txt.
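
Both savers have to cope with the three lists having different lengths. One way to handle that in a single place is itertools.zip_longest, sketched below with plain strings (the text already pulled out of the elements). The function name and the sample data are made up for illustration.

```python
import csv
import io
from itertools import zip_longest


def write_rows(file, titles, contents, tags, missing="N/A"):
    """Write a header plus one row per post, padding short lists with `missing`.

    Returns the number of data rows written.
    """
    writer = csv.writer(file)
    writer.writerow(["Title", "Full content", "Tags"])
    count = 0
    for row in zip_longest(titles, contents, tags, fillvalue=missing):
        writer.writerow(row)
        count += 1
    return count


# Demo with an in-memory buffer instead of a real file.
buffer = io.StringIO()
written = write_rows(buffer, ["A", "B"], ["body A"], [])
print(written)  # 2
```

Unlike the loops above, this pads every list, so a missing title also becomes "N/A" instead of the row being skipped. Passing a real file opened with newline="" works the same way as the buffer.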

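4️⃣ But it failed

It didn't work well.

It's a bit infuriating, but it'll work out somehow.

The scraped content came out cut off partway through.

How to fix it

✅ 1️⃣ Open each post and get the full content

Use Selenium to open the posts one at a time and crawl the full content.
Use driver.back() to return to the list.

✅ 2️⃣ Allow more time for the JavaScript to run

Instead of time.sleep(15), use WebDriverWait to wait until the post has fully loaded.
If there are many posts, scroll all the way to the bottom before clicking.

I also decided to save the results to a CSV file.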

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import csv

# ✅ Set the ChromeDriver path
service = Service("/opt/homebrew/bin/chromedriver")

# ✅ Chrome launch options
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run without a browser window (for testing)
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')

# ✅ Start the Selenium WebDriver
driver = webdriver.Chrome(service=service, options=options)

# ✅ Open the web page
url = "https://careerhackeralex.com/prompt_explorer"
driver.get(url)

# ✅ Wait for the JavaScript-rendered data (up to 60 seconds)
try:
    WebDriverWait(driver, 60).until(
        EC.presence_of_element_located((By.XPATH, "//h6"))  # wait for a title element
    )
    print("🎯 Elements loaded!")
except Exception:
    print("⚠ Could not find the elements!")

# ✅ Scroll to the bottom to load all content
scroll_pause_time = 3  # raised from 2 to 3 seconds
for _ in range(15):  # raised from 10 to 15 rounds (loads more data)
    driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)
    time.sleep(scroll_pause_time)

# ✅ Get the list of post titles (the elements to click)
article_elements = driver.find_elements(By.XPATH, "//h6")  # title elements
article_links = []

# ✅ Build the list of clickable links
for article in article_elements:
    try:
        link = article.find_element(By.XPATH, "./ancestor::a").get_attribute("href")  # link that wraps the title
        if link:
            article_links.append(link)
    except Exception:
        continue  # skip titles that have no link

print(f"📌 Found {len(article_links)} posts!")

# ✅ Prepare the CSV file
with open("careerhackeralex_prompts.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Full content", "Tags"])  # CSV header row

    # ✅ Open each post and get the full content
    for i, link in enumerate(article_links):
        driver.get(link)  # go to the post's own page
        time.sleep(5)  # wait for the page to load

        try:
            # ✅ Get the title
            title_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, "//h1"))  # wait until the post title is present
            )
            title = title_element.text.strip()

            # ✅ Get the full content (body)
            content_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'MuiBox-root')]"))  # first matching body container
            )
            content = content_element.text.strip()

            # ✅ Get the tags
            tags_elements = driver.find_elements(By.XPATH, "//span[contains(@class, 'MuiChip-label')]")
            tags = ", ".join([tag.text.strip() for tag in tags_elements])

            # ✅ Write the row to the CSV
            writer.writerow([title, content, tags])

            print(f"✅ [{i+1}/{len(article_links)}] Saved '{title}'!")

        except Exception as e:
            print(f"⚠ Error: {e}")

        # ✅ Go back to the list
        driver.back()
        time.sleep(3)  # wait for the list page to load

print("✅ All data has been saved to 'careerhackeralex_prompts.csv'!")

# ✅ Quit Selenium
driver.quit()

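What changed in the code

✅ 1️⃣ Open each post and get the full content

Before, only the partial content shown in the list was collected → now each post is opened and its full content is read.
An article_links list stores the link behind each title.
driver.get(link) navigates to the post, and the full content is read there.

✅ 2️⃣ Longer waits

WebDriverWait (up to 60 seconds on the list page, 10 seconds per post) → waits for the JavaScript-rendered elements to appear.
time.sleep(5) added → gives each post time to load fully.

✅ 3️⃣ Using 'back' (driver.back())

After a post is crawled, go back to the list and open the next one.
Faster than reloading the page, as the earlier approach did!

✅ 4️⃣ Saving to CSV

Data is stored as ["Title", "Full content", "Tags"].
The file can be opened directly in Excel.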

But this failed too lol

My guess is that it failed because I called driver.back() when going back wasn't needed at all.
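
If that guess is right, the loop would not need to return to the list at all, because every link is an absolute URL that driver.get() can open directly. A rough, untested-on-the-real-site sketch of that shape; scrape_articles and the extract callback are hypothetical names of my own.

```python
def scrape_articles(driver, links, extract):
    """Visit each link directly and collect one row per post.

    `extract` is a caller-supplied function (hypothetical here) that reads
    the fields from the loaded page and returns a row.
    Returns (rows, failed), where failed holds (link, error message) pairs.
    """
    rows, failed = [], []
    for link in links:
        try:
            driver.get(link)  # absolute URL, so no driver.back() is needed
            rows.append(extract(driver))
        except Exception as exc:  # keep going; one bad page should not stop the run
            failed.append((link, str(exc)))
    return rows, failed
```

Here extract would hold the WebDriverWait lookups for the title, body, and tags, and the rows can be written to the CSV after the loop finishes.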

After failing this many times I'm a bit worn out, so I need a break.

I'll pick it up tomorrow.