Crawling/Naver blog posts

by 스쳐가는인연 2023. 2. 22.
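Some blogs restrict copying of their posts, and Naver blog seems to be one of them.
As Python practice, the first step is to pull a post's text so it can be brought over to my own blog.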
 
from bs4 import BeautifulSoup
import urllib.request as req

blog_url = "https://m.blog.naver.com/myholywish/223024003010"

def get_text(blog_url):
    try:
        
        res = req.urlopen(blog_url)
        soup = BeautifulSoup(res, 'html.parser')

        # Blog title (extracted only; uncomment the print to display it)
        title_raw = soup.find_all("h3", {"class": "tit_h3"})

        for a in title_raw:
            title = a.get_text()
            # print(title)

        # Blog text
        temp = soup.find_all("div", {"id": "viewTypeSelector"})

        for a in temp:
            text = a.get_text()

            cleaned = text.replace('\n', '')       # remove unnecessary line breaks
            cleaned = cleaned.replace('\xa0', '')  # remove non-breaking spaces
            cleaned = cleaned.replace('(', '\n(')  # line break before an opening parenthesis
            cleaned = cleaned.replace(')', ')\n')  # line break after a closing parenthesis
            cleaned = cleaned.replace('  ', '\n')  # two consecutive spaces become a line break
            print(cleaned)

    except Exception as e:
        # Report the cause instead of silently hiding it
        print("failed crawling:", e)

get_text(blog_url)
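
The text cleanup inside get_text can be tried without touching the network. Below is a minimal sketch that moves the same five replacements, in the same order, into a standalone function. The name clean_text and the sample string are made up for illustration and are not part of the original script.

```python
def clean_text(text: str) -> str:
    """Apply the same replacement steps as get_text, in the same order."""
    text = text.replace("\n", "")    # drop existing line breaks
    text = text.replace("\xa0", "")  # drop non-breaking spaces
    text = text.replace("(", "\n(")  # line break before "("
    text = text.replace(")", ")\n")  # line break after ")"
    text = text.replace("  ", "\n")  # two consecutive spaces become a line break
    return text


if __name__ == "__main__":
    # Hypothetical sample standing in for the text of a crawled post
    sample = "First line\n(note)\xa0second  third"
    print(clean_text(sample))
```

Keeping the cleanup separate from the download makes it easier to adjust the replacement rules against a saved sample before running the crawler again.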

References:
Original target post: https://m.blog.naver.com/antmint/30084068899
Python code source: https://blog.naver.com/zilu11/221256857392





 
