This article walks through how to parse a custom web font, render its glyphs as bitmap images, extract the characters with image recognition, and then, after proofreading, build the correct character mapping so the real data can be recovered. After reading it, you should be able to handle the common forms of font-based anti-scraping with confidence.
First, we need to be clear about how a custom font differs from a normal font. A custom font maps special (typically private-use) Unicode code points to its own glyph bitmaps, while a normal font only defines how standard code points are displayed. As a result, text rendered with a normal font can be copied out correctly, whereas copying text rendered with a custom font only yields the obfuscated code points. The browser uses the custom font's mapping to draw the corresponding glyph for each code point, so the page looks right even though the underlying text is scrambled.
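You can see this mapping for yourself by opening a downloaded custom font with fontTools (which we will also use later in this article); most of its glyph names are private-use code points rather than ordinary characters. The file name below is only a placeholder:

```python
from fontTools.ttLib import TTFont

# "custom.woff" is a placeholder for any custom font file taken from the target page.
tfont = TTFont("custom.woff")
# Glyph names in anti-scraping fonts typically look like "uniE8A3", i.e. private-use
# Unicode code points that only this particular font knows how to draw.
print(tfont.getGlyphOrder()[:10])
```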
Let's take a group-buying website as an example to demonstrate how to parse a custom font. The site uses custom fonts to hide key information such as shop names.
To fetch the page content, we use Python's requests library to send HTTP requests and the BeautifulSoup library to parse the HTML document. The basic steps are as follows:
```python
import requests
from bs4 import BeautifulSoup

headers = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9"
}
session = requests.Session()
session.headers = headers

res = session.get("http://www.dianping.com/shenzhen/ch30")
soup = BeautifulSoup(res.text, 'html5lib')
```
With BeautifulSoup we can easily pull the elements we need out of the HTML document. For example, we can extract the category list and the district list:
```python
type_list = []
for a_tag in soup.select("div#classfy > a"):
    type_list.append((a_tag.span.text, a_tag['href']))

area_list = []
for a_tag in soup.select("div#region-nav > a"):
    area_list.append((a_tag.span.text, a_tag['href']))
```
The custom font definitions usually live in a CSS file whose URL contains a specific keyword (here, svgtextcss). We can locate that stylesheet by searching for the keyword:
```python
def get_css_url(soup):
    for link in soup.select("head > link[rel=stylesheet]"):
        href = link['href']
        if "svgtextcss" in href:
            return href

css_url = get_css_url(soup)
```
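On this site the stylesheet href is typically protocol-relative (it starts with //), which requests cannot fetch as-is. A small guard, assuming that URL form:

```python
# Prepend a scheme if the extracted href is protocol-relative, e.g. "//.../xxx.css".
if css_url and css_url.startswith("//"):
    css_url = "https:" + css_url
```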
Inside the CSS file we can find the font file URL that corresponds to each font name, and which CSS class uses which font. We use regular expressions to extract this information:
```python
import re

def parseCssFontUrl(css_url):
    res = session.get(css_url)
    rule = {}
    font_face = {}
    # Split the stylesheet into "selector { body }" pairs.
    for name, value in re.findall(r"([^{}]+){([^{}]+)}", res.text):
        name = name.strip()
        for row in value.split(";"):
            if ":" not in row:
                continue
            k, v = row.split(":", 1)
            k, v = k.strip(), v.strip(' "\'')
            if name == "@font-face":
                if k == "font-family":
                    font_name = v
                elif k == "src":
                    font_face.setdefault(font_name, []).extend(
                        re.findall(r'url\("([^()]+)"\)', v))
            else:
                rule[name] = v
    font_urls = {}
    for class_name, tag_name in rule.items():
        font_urls[class_name] = font_face[tag_name][0]
    return font_urls

font_urls = parseCssFontUrl(css_url)
```
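If the CSS parsed as expected, font_urls maps each selector (class) name in the stylesheet to the URL of the font file it references. A quick sanity check:

```python
# Print the selector-name -> font-file-URL pairs extracted from the CSS.
for class_name, url in font_urls.items():
    print(class_name, "->", url)
```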
Once we have the font URLs, we can download the font files for further processing:
```python
import requests

def download_font(url, file_name):
    with open(file_name, "wb") as f:
        f.write(requests.get(url).content)

for class_name, url in font_urls.items():
    download_font(url, f"{class_name}.woff")
```
To recognize the characters in a custom font, we load the font file into memory, render its glyphs as bitmap images, and then match those bitmaps to real characters. We can do this with the fontTools and PIL libraries:
```python
from fontTools.ttLib import TTFont
from PIL import ImageFont, Image, ImageDraw

def getCustomFontGroupImgs(font_file, uni_list=None, group_num=25):
    # Render the font's glyphs in groups of `group_num` characters per image.
    if uni_list is None:
        tfont = TTFont(font_file)
        uni_list = tfont.getGlyphOrder()[2:]   # skip the first two non-character glyphs
    imgs = []
    font = ImageFont.truetype(font_file, 20)
    for i in range(0, len(uni_list), group_num):
        im = Image.new(mode='RGB', size=(20 * group_num + 10, 22), color="white")
        draw = ImageDraw.Draw(im=im)
        # Glyph names look like "uniE8A3"; turn them into the actual characters.
        unknown_chars = "".join(uni_list[i:i + group_num]).replace("uni", "\\u")
        unknown_chars = unknown_chars.encode().decode("unicode_escape")
        draw.text(xy=(5, -4), text=unknown_chars, fill=0, font=font)
        imgs.append(im)
    return imgs
```
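It is worth saving these group images to disk so the OCR results can be proofread against them later. A short usage sketch (the .woff file name is just a placeholder for one of the files downloaded above):

```python
# Render the glyphs of one downloaded font and save each group image for manual checking.
imgs = getCustomFontGroupImgs("shopNum.woff")   # placeholder file name
for idx, im in enumerate(imgs):
    im.save(f"font_group_{idx}.png")
```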
To recognize the characters in these bitmap images, we can use OCR. Here we use the ddddocr library:
```python
from io import BytesIO

import ddddocr

def get_img_bytes(img):
    # ddddocr expects raw image bytes, so serialize the PIL image to JPEG in memory.
    img_byte = BytesIO()
    img.save(img_byte, format='JPEG')
    return img_byte.getvalue()

def batch_recognize(imgs):
    ocr = ddddocr.DdddOcr()
    result = []
    for im in imgs:
        img_byte = get_img_bytes(im)
        text = ocr.classification(img_byte)
        result.append(text)
    return result
```
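With the OCR results in hand, we can pair the recognized characters with the glyph names and build the glyph-to-character map. OCR output is rarely perfect, which is why the proofreading step mentioned at the start matters. The helper below is not part of the original listing; it is a minimal sketch under the assumption that each glyph is recognized as exactly one character, and build_font_map plus its corrections argument are names introduced here for illustration:

```python
from fontTools.ttLib import TTFont

def build_font_map(font_file, corrections=None):
    """Hypothetical helper: map hex code points such as "e8a3" to real characters."""
    tfont = TTFont(font_file)
    uni_list = tfont.getGlyphOrder()[2:]            # glyph names like "uniE8A3"
    imgs = getCustomFontGroupImgs(font_file, uni_list)
    recognized = "".join(batch_recognize(imgs))     # assumes one character per glyph
    font_map = {}
    for glyph_name, char in zip(uni_list, recognized):
        # Key format matches the lookup done later: lowercase hex without the "uni" prefix.
        font_map[glyph_name.replace("uni", "").lower()] = char
    # Patch any OCR mistakes found while proofreading against the saved images.
    if corrections:
        font_map.update(corrections)
    return font_map
```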
Finally, we extract all the data and save it to an Excel file:
```python
import random
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

def parse_data(soup):
    result = []
    for li_tag in soup.select("div#shop-all-list div.txt"):
        title = li_tag.select_one("div.tit>a>h4").text
        url = li_tag.select_one("div.tit>a")["href"]
        star_class = li_tag.select_one("div.comment>div.nebula_star>div.star_icon>span")["class"]
        star = int(re.findall(r"\d+", " ".join(star_class))[0]) // 10
        comment_tag = li_tag.select_one("div.comment>a.review-num>b")
        comment_num = comment_tag.text if comment_tag else None
        mean_price_tag = li_tag.select_one("div.comment>a.mean-price>b")
        mean_price = mean_price_tag.text if mean_price_tag else None
        fun_type = li_tag.select_one("div.tag-addr>a:nth-of-type(1)>span.tag").text
        area = li_tag.select_one("div.tag-addr>a:nth-of-type(2)>span.tag").text
        result.append((title, star, comment_num, mean_price, fun_type, area, url))
    return result

def fix_text(soup):
    # Replace every <svgmtsi> placeholder with the real character it stands for.
    css_url = get_css_url(soup)
    for svgmtsi in soup.find_all('svgmtsi'):
        class_name = svgmtsi['class'][0]
        font_map = getFontMapFromClassName(class_name, css_url)
        chars = []
        for c in svgmtsi.text:
            char = c.encode("unicode_escape").decode()[2:]
            chars.append(font_map[char])
        svgmtsi.replace_with("".join(chars))

headers = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9"
}
session = requests.Session()
session.headers = headers
base_url = "http://www.dianping.com/shenzhen/ch30"
res = session.get(base_url)

soup = BeautifulSoup(res.text, 'html5lib')
type_list = []
for a_tag in soup.select("div#classfy > a"):
    type_list.append((a_tag.span.text, a_tag['href'] + 'r91172'))

result = []
for type_name, url in type_list:
    res = session.get(url)
    soup = BeautifulSoup(res.text, 'html5lib')
    fix_text(soup)
    result.extend(parse_data(soup))
    time.sleep(random.randint(2, 4))

df = pd.DataFrame(result, columns=["标题", "星级", "评论数", "均价", "娱乐类型", "区域", "链接"])
df.评论数 = df.评论数.apply(lambda x: int(x) if x else pd.NA)
df.均价 = df.均价.str[1:].apply(lambda x: int(x) if isinstance(x, str) and x else pd.NA)
df.drop_duplicates(inplace=True)
df.to_excel("华南城娱乐.xlsx", index=False)
```
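One function the listing calls but never defines is getFontMapFromClassName. The article does not show its implementation; assuming it simply glues together the pieces built earlier (parseCssFontUrl, download_font, and the hypothetical build_font_map sketched above), a cached version might look roughly like this:

```python
_font_map_cache = {}

def getFontMapFromClassName(class_name, css_url):
    # Hedged sketch: resolve the font file used by a CSS class, then cache the
    # glyph-to-character map so each font is downloaded and OCR'd only once.
    if class_name not in _font_map_cache:
        font_urls = parseCssFontUrl(css_url)
        # Note: the keys of font_urls are the selector names as stored by
        # parseCssFontUrl; they may need normalizing to match class_name.
        url = font_urls[class_name]
        if url.startswith("//"):                 # CSS often uses protocol-relative URLs
            url = "https:" + url
        font_file = f"{class_name}.woff"
        download_font(url, font_file)
        _font_map_cache[class_name] = build_font_map(font_file)
    return _font_map_cache[class_name]
```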
With the steps above, we can successfully parse the custom font and recover the characters it hides. I hope this article helps you deal with the various forms of font-based anti-scraping you may encounter.