Scraping Taobao review data with Python
Published on 2020-06-18 22:56
Using a Lenovo Legion (联想拯救者) listing on Taobao as the example, the product page looks like this.
The goal is to scrape the review data shown below it.
The code is as follows. First, import the required libraries.
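Based on what the code later in this post actually uses, the imports come down to these (requests for the HTTP calls, re for the regular expressions, time and random for the pauses, pandas for the final table):

import re            # regular expressions, used to pull fields out of the response text
import time          # time.sleep between page requests
import random        # randomise the length of each pause
import requests      # send the HTTP requests
import pandas as pd  # collect the extracted fields into a DataFrame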
Next, right-click the page and choose Inspect, open the Network tab, click the search icon, and paste a snippet of one of the reviews into the search box, as shown below.
Then, under General, find the request URL and copy it (this is the real address, i.e. the one that actually carries the review data).
Paste that address and assign it to url, like this:
url = 'https://rate.tmall.com/list_detail_rate.htm?itemId=616836000618&spuId=1628872455&sellerId=126446588&order=3&currentPage=1&append=0&content=1&tagId=&posi=&picture=&groupId=&ua=098%23E1hvi9vavNUvUvCkvvvvvjiPnLqW1j1nn2SpzjthPmP91jt8nLdhtjDvP2s91jiPdphvhIovd8ivvvCxcmZNLXcXrb8qKOhCvv147tgvUn147DdYEY%2FrvpvBCvheU0pvvnvQEBYb3Oy3%2B2eCvpvW7D6e9Wsw7Di4YVjNdphvhUWC8AYLvvCHZbhSpaVxnsItvpvhvvCvpvwCvva47rMNzHlZiQhvCvvvpZoEvpvVvpCmp%2F2WuphvmvvvpLP0vIi8Kphv8vvvphvvvvvvvvCVB9vvvxhvvhXVvvmCWvvvByOvvUhwvvCVB9vv9BQEvpCWvrqITC0xdBKKdox%2Ftj7KHd8rakS6D40OV8tK2O71n3oAdcZIibmAdXuKNxYrSBh7rEgDNrBl5tu4V5xPAWv4VBOqb64B9Cka%2Bfvsx9hCvvOv9hCvvvvPvpvhvv2MMsyCvvpvvhCv3QhvCvmvphmrvpvBCUV45uhvvv7YEBYb3Oy3%2B2ervpvEvvjigLZvvW31dphvmpvCTNynvv28Q46Cvvyv9OVZi9vvL29tvpvhvvCvp86Cvvyv9EkaJvvv6ZptvpvhvvCvp86Cvvyv9E8ZmQvv6TArvpvo3vHufTwvvnOQEBYnDae6%2BdKt9phvHHifDp2vzHi473L5tMsd7ux40nYERphvCvvvphmCvpvZ7D11v8jw7Di48Lf5MEi49lusz6kCvpvW7D%2B0vvbw7Di4bEdN&needFold=0&_ksTS=1592317241348_1901&callback=jsonp1902'
Next, collect the browser information so the request mimics a real browser: click through as shown, and under Headers copy the referer, user-agent and cookie values.
Put them into a headers dictionary like this (note that each value needs quotation marks):
headers = {
    'referer': 'https://detail.tmall.com/item.htm?spm=a220m.1000858.1000725.1.5549375alBtq95&id=616836000618&areaId=330300&standard=1&user_id=126446588&cat_id=2&is_b=1&rn=83c67105:103646;20122:15515349',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',  # copy your own browser's user-agent string here
    'cookie': 'sm4=330300; cna=KJpgF1FQkiwCAXAOZ/s8g0Y1; dnk=%5Cu9ED1%5Cu8840zzy; hng=HK%7Czh-TW%7CHKD%7C344; uc1=pas=0&cookie21=VT5L2FSpccLuJBreK%2BBd&cookie14=UoTV7gLdFWuX4g%3D%3D&existShop=false&cookie16=W5iHLLyFl=eBgLgff7QYN',  # copy the cookie from your own logged-in session
}
With the headers set up like this, you can start scraping properly, with a much lower risk of being blocked.
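Before looping over pages, it can help to fire one test request. This is a small sketch I've added (not in the original post); it assumes the url and headers defined above, and that the response is JSONP wrapped in the jsonp1902(...) callback named in the URL:

import json

resp = requests.get(url, headers=headers)   # single test request with the copied headers
print(resp.status_code)                     # expect 200
print('"rateContent"' in resp.text)         # True means the reviews really are in this response
# the body is JSONP, i.e. jsonp1902({...}); strip the wrapper if you prefer parsed JSON
raw = resp.text.strip()
parsed = json.loads(raw[raw.find('(') + 1:raw.rfind(')')])
print(list(parsed.keys()))                  # inspect the top-level structure of the payload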
type_one = []    # review text (rateContent)
type_two = []    # review date (rateDate)
type_three = []  # purchased SKU (auctionSku)

for i in range(1, 3):  # scrapes pages 1 and 2; widen the range for more pages
    url2 = 'https://rate.tmall.com/list_detail_rate.htm?itemId=616836000618&spuId=1628872455&sellerId=126446588&order=3&currentPage=' + str(i) + '&append=0&content=1&tagId=&posi=&picture=&groupId=&ua=098%23E1hvi9vavNUvUvCkvvvvvjiPnLqW1j1nn2SpzjthPmP91jt8nLdhtjDvP2s91jiPdphvhIovd8ivvvCxcmZNLXcXrb8qKOhCvv147tgvUn147DdYEY%2FrvpvBCvheU0pvvnvQEBYb3Oy3%2B2eCvpvW7D6e9Wsw7Di4YVjNdphvhUWC8AYLvvCHZbhSpaVxnsItvpvhvvCvpvwCvva47rMNzHlZiQhvCvvvpZoEvpvVvpCmp%2F2WuphvmvvvpLP0vIi8Kphv8vvvphvvvvvvvvCVB9vvvxhvvhXVvvmCWvvvByOvvUhwvvCVB9vv9BQEvpCWvrqITC0xdBKKdox%2Ftj7KHd8rakS6D40OV8tK2O71n3oAdcZIibmAdXuKNxYrSBh7rEgDNrBl5tu4V5xPAWv4VBOqb64B9Cka%2Bfvsx9hCvvOv9hCvvvvPvpvhvv2MMsyCvvpvvhCv3QhvCvmvphmrvpvBCUV45uhvvv7YEBYb3Oy3%2B2ervpvEvvjigLZvvW31dphvmpvCTNynvv28Q46Cvvyv9OVZi9vvL29tvpvhvvCvp86Cvvyv9EkaJvvv6ZptvpvhvvCvp86Cvvyv9E8ZmQvv6TArvpvo3vHufTwvvnOQEBYnDae6%2BdKt9phvHHifDp2vzHi473L5tMsd7ux40nYERphvCvvvphmCvpvZ7D11v8jw7Di48Lf5MEi49lusz6kCvpvW7D%2B0vvbw7Di4bEdN&needFold=0&_ksTS=1592317241348_1901&callback=jsonp1902'
    time.sleep(random.randint(3, 9))  # pause 3-9 seconds so the requests look human-paced
    data = requests.get(url2, headers=headers).text
    pat = re.compile('"rateContent":"(.*?)","fromMall"')   # review text
    pata = re.compile('"rateDate":"(.*?)","rateContent"')  # review date
    patb = re.compile('"auctionSku":"(.*?)","anony"')      # purchased SKU
    type_one.extend(pat.findall(data))
    type_two.extend(pata.findall(data))
    type_three.extend(patb.findall(data))
    print('Page ' + str(i) + ' scraped')

# column names are only illustrative; the original post does not show the dict definition
result = {'rateDate': type_two, 'auctionSku': type_three, 'rateContent': type_one}
new_frame = pd.DataFrame(result)
The code above scrapes 2 pages in total; change the value in range to scrape more, e.g. range(1, 10) scrapes 9 pages. Note what time.sleep is doing here: it simulates the pace of a human reading the page. A crawler can move very fast, finishing one page and flipping to the next in an instant, which makes it easy for the site to flag you as a bot; your IP gets remembered and the site temporarily stops serving you. That is why I sleep for a random 3-9 seconds between pages. You can set it lower, but the safety margin shrinks, so to be on the safe side I kept it fairly high.
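However many pages you set in range, the scraped reviews only live in memory once the loop finishes. As a small addition that is not in the original post, new_frame can be written out with pandas so the data survives for later analysis (the file name is just an example):

new_frame.to_csv('tmall_reviews.csv', index=False, encoding='utf-8-sig')  # utf-8-sig keeps Chinese text readable when opened in Excel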
Above is the scraped result. If you have any questions, feel free to leave a comment.
Original article: https://blog.csdn.net/z463544804/article/details/106797987