This article walks through how to parse a custom web font, render its glyphs as bitmap images, extract the characters with image recognition, and then, after proofreading, build the correct character mapping so the real data can be recovered. After reading it, you should be able to handle the common forms of font-based anti-scraping with confidence.
First, we need to be clear about how a custom font differs from a normal font. A custom font maps special (typically private-use) Unicode code points to its own glyph bitmaps, while a normal font only defines how standard code points are displayed. As a result, text rendered with a normal font can be copied out correctly, whereas copying text rendered with a custom font only yields the obfuscated code points. The browser uses the custom font's mapping to draw the corresponding glyph for each code point, so the page looks right even though the underlying text is scrambled.
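You can see this mapping for yourself by opening a downloaded custom font with fontTools (which we will also use later in this article); most of its glyph names are private-use code points rather than ordinary characters. The file name below is only a placeholder:

```python
from fontTools.ttLib import TTFont

# "custom.woff" is a placeholder for any custom font file taken from the target page.
tfont = TTFont("custom.woff")
# Glyph names in anti-scraping fonts typically look like "uniE8A3", i.e. private-use
# Unicode code points that only this particular font knows how to draw.
print(tfont.getGlyphOrder()[:10])
```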
Let's take a group-buying website as an example to demonstrate how to parse a custom font. The site uses custom fonts to hide key information such as shop names.
To fetch the page content, we use Python's requests library to send HTTP requests and the BeautifulSoup library to parse the HTML document. The basic steps are as follows:
```python
import requests
from bs4 import BeautifulSoup

headers = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9"
}
session = requests.Session()
session.headers = headers

res = session.get("http://www.dianping.com/shenzhen/ch30")
soup = BeautifulSoup(res.text, 'html5lib')
```
With BeautifulSoup we can easily pull the elements we need out of the HTML document. For example, we can extract the category list and the district list:
```python
type_list = []
for a_tag in soup.select("div#classfy > a"):
    type_list.append((a_tag.span.text, a_tag['href']))

area_list = []
for a_tag in soup.select("div#region-nav > a"):
    area_list.append((a_tag.span.text, a_tag['href']))
```
The custom font definitions usually live in a CSS file whose URL contains a specific keyword (here, svgtextcss). We can locate that stylesheet by searching for the keyword:
```python
def get_css_url(soup):
    for link in soup.select("head > link[rel=stylesheet]"):
        href = link['href']
        if "svgtextcss" in href:
            return href

css_url = get_css_url(soup)
```
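On this site the stylesheet href is typically protocol-relative (it starts with //), which requests cannot fetch as-is. A small guard, assuming that URL form:

```python
# Prepend a scheme if the extracted href is protocol-relative, e.g. "//.../xxx.css".
if css_url and css_url.startswith("//"):
    css_url = "https:" + css_url
```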
Inside the CSS file we can find the font file URL that corresponds to each font name, and which CSS class uses which font. We use regular expressions to extract this information:
```python
import re

def parseCssFontUrl(css_url):
    res = session.get(css_url)
    rule = {}
    font_face = {}
    # Split the stylesheet into "selector { body }" pairs.
    for name, value in re.findall(r"([^{}]+){([^{}]+)}", res.text):
        name = name.strip()
        for row in value.split(";"):
            if ":" not in row:
                continue
            k, v = row.split(":", 1)
            k, v = k.strip(), v.strip(' "\'')
            if name == "@font-face":
                if k == "font-family":
                    font_name = v
                elif k == "src":
                    font_face.setdefault(font_name, []).extend(
                        re.findall(r'url\("([^()]+)"\)', v))
            else:
                rule[name] = v
    font_urls = {}
    for class_name, tag_name in rule.items():
        font_urls[class_name] = font_face[tag_name][0]
    return font_urls

font_urls = parseCssFontUrl(css_url)
```
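If the CSS parsed as expected, font_urls maps each selector (class) name in the stylesheet to the URL of the font file it references. A quick sanity check:

```python
# Print the selector-name -> font-file-URL pairs extracted from the CSS.
for class_name, url in font_urls.items():
    print(class_name, "->", url)
```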
Once we have the font URLs, we can download the font files for further processing:
```python
import requests

def download_font(url, file_name):
    with open(file_name, "wb") as f:
        f.write(requests.get(url).content)

for class_name, url in font_urls.items():
    download_font(url, f"{class_name}.woff")
```
To recognize the characters in a custom font, we load the font file into memory, render its glyphs as bitmap images, and then match those bitmaps to real characters. We can do this with the fontTools and PIL libraries:
```python
from fontTools.ttLib import TTFont
from PIL import ImageFont, Image, ImageDraw

def getCustomFontGroupImgs(font_file, uni_list=None, group_num=25):
    # Render the font's glyphs in groups of `group_num` characters per image.
    if uni_list is None:
        tfont = TTFont(font_file)
        uni_list = tfont.getGlyphOrder()[2:]   # skip the first two non-character glyphs
    imgs = []
    font = ImageFont.truetype(font_file, 20)
    for i in range(0, len(uni_list), group_num):
        im = Image.new(mode='RGB', size=(20 * group_num + 10, 22), color="white")
        draw = ImageDraw.Draw(im=im)
        # Glyph names look like "uniE8A3"; turn them into the actual characters.
        unknown_chars = "".join(uni_list[i:i + group_num]).replace("uni", "\\u")
        unknown_chars = unknown_chars.encode().decode("unicode_escape")
        draw.text(xy=(5, -4), text=unknown_chars, fill=0, font=font)
        imgs.append(im)
    return imgs
```
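It is worth saving these group images to disk so the OCR results can be proofread against them later. A short usage sketch (the .woff file name is just a placeholder for one of the files downloaded above):

```python
# Render the glyphs of one downloaded font and save each group image for manual checking.
imgs = getCustomFontGroupImgs("shopNum.woff")   # placeholder file name
for idx, im in enumerate(imgs):
    im.save(f"font_group_{idx}.png")
```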
To recognize the characters in these bitmap images, we can use OCR. Here we use the ddddocr library:
```python
from io import BytesIO

import ddddocr

def get_img_bytes(img):
    # ddddocr expects raw image bytes, so serialize the PIL image to JPEG in memory.
    img_byte = BytesIO()
    img.save(img_byte, format='JPEG')
    return img_byte.getvalue()

def batch_recognize(imgs):
    ocr = ddddocr.DdddOcr()
    result = []
    for im in imgs:
        img_byte = get_img_bytes(im)
        text = ocr.classification(img_byte)
        result.append(text)
    return result
```
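With the OCR results in hand, we can pair the recognized characters with the glyph names and build the glyph-to-character map. OCR output is rarely perfect, which is why the proofreading step mentioned at the start matters. The helper below is not part of the original listing; it is a minimal sketch under the assumption that each glyph is recognized as exactly one character, and build_font_map plus its corrections argument are names introduced here for illustration:

```python
from fontTools.ttLib import TTFont

def build_font_map(font_file, corrections=None):
    """Hypothetical helper: map hex code points such as "e8a3" to real characters."""
    tfont = TTFont(font_file)
    uni_list = tfont.getGlyphOrder()[2:]            # glyph names like "uniE8A3"
    imgs = getCustomFontGroupImgs(font_file, uni_list)
    recognized = "".join(batch_recognize(imgs))     # assumes one character per glyph
    font_map = {}
    for glyph_name, char in zip(uni_list, recognized):
        # Key format matches the lookup done later: lowercase hex without the "uni" prefix.
        font_map[glyph_name.replace("uni", "").lower()] = char
    # Patch any OCR mistakes found while proofreading against the saved images.
    if corrections:
        font_map.update(corrections)
    return font_map
```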
Finally, we extract all the data and save it to an Excel file:
```python
import random
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

def parse_data(soup):
    result = []
    for li_tag in soup.select("div#shop-all-list div.txt"):
        title = li_tag.select_one("div.tit>a>h4").text
        url = li_tag.select_one("div.tit>a")["href"]
        star_class = li_tag.select_one("div.comment>div.nebula_star>div.star_icon>span")["class"]
        star = int(re.findall(r"\d+", " ".join(star_class))[0]) // 10
        comment_tag = li_tag.select_one("div.comment>a.review-num>b")
        comment_num = comment_tag.text if comment_tag else None
        mean_price_tag = li_tag.select_one("div.comment>a.mean-price>b")
        mean_price = mean_price_tag.text if mean_price_tag else None
        fun_type = li_tag.select_one("div.tag-addr>a:nth-of-type(1)>span.tag").text
        area = li_tag.select_one("div.tag-addr>a:nth-of-type(2)>span.tag").text
        result.append((title, star, comment_num, mean_price, fun_type, area, url))
    return result

def fix_text(soup):
    # Replace every <svgmtsi> placeholder with the real character it stands for.
    css_url = get_css_url(soup)
    for svgmtsi in soup.find_all('svgmtsi'):
        class_name = svgmtsi['class'][0]
        font_map = getFontMapFromClassName(class_name, css_url)
        chars = []
        for c in svgmtsi.text:
            char = c.encode("unicode_escape").decode()[2:]
            chars.append(font_map[char])
        svgmtsi.replace_with("".join(chars))

headers = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9"
}
session = requests.Session()
session.headers = headers
base_url = "http://www.dianping.com/shenzhen/ch30"
res = session.get(base_url)

soup = BeautifulSoup(res.text, 'html5lib')
type_list = []
for a_tag in soup.select("div#classfy > a"):
    type_list.append((a_tag.span.text, a_tag['href'] + 'r91172'))

result = []
for type_name, url in type_list:
    res = session.get(url)
    soup = BeautifulSoup(res.text, 'html5lib')
    fix_text(soup)
    result.extend(parse_data(soup))
    time.sleep(random.randint(2, 4))

df = pd.DataFrame(result, columns=["标题", "星级", "评论数", "均价", "娱乐类型", "区域", "链接"])
df.评论数 = df.评论数.apply(lambda x: int(x) if x else pd.NA)
df.均价 = df.均价.str[1:].apply(lambda x: int(x) if isinstance(x, str) and x else pd.NA)
df.drop_duplicates(inplace=True)
df.to_excel("华南城娱乐.xlsx", index=False)
```
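One function the listing calls but never defines is getFontMapFromClassName. The article does not show its implementation; assuming it simply glues together the pieces built earlier (parseCssFontUrl, download_font, and the hypothetical build_font_map sketched above), a cached version might look roughly like this:

```python
_font_map_cache = {}

def getFontMapFromClassName(class_name, css_url):
    # Hedged sketch: resolve the font file used by a CSS class, then cache the
    # glyph-to-character map so each font is downloaded and OCR'd only once.
    if class_name not in _font_map_cache:
        font_urls = parseCssFontUrl(css_url)
        # Note: the keys of font_urls are the selector names as stored by
        # parseCssFontUrl; they may need normalizing to match class_name.
        url = font_urls[class_name]
        if url.startswith("//"):                 # CSS often uses protocol-relative URLs
            url = "https:" + url
        font_file = f"{class_name}.woff"
        download_font(url, font_file)
        _font_map_cache[class_name] = build_font_map(font_file)
    return _font_map_cache[class_name]
```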
With the steps above, we can successfully parse the custom font and recover the characters it hides. I hope this article helps you deal with the various forms of font-based anti-scraping you may encounter.