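A 20,000-Character Deep Dive into Parsing Custom Web Fonts (CSS Stylesheet Parsing, Rendering Glyph Bitmaps, Local Image Recognition, and More)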
  • 信烁良金
  • 2021-12-08 16:49:27

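This article explains in detail how to parse a custom font, render its glyphs as bitmap images, extract the characters with image recognition, and, after proofreading the result, build a correct character mapping that yields the data you need. After reading it, you should be able to handle the various forms of font-based anti-scraping.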

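A Deep Dive into Parsing Custom Fonts

Introduction to Custom Fonts

First, we need to be clear about how a custom font differs from an ordinary one. A custom font carries glyph bitmap data for a set of special Unicode code points, whereas an ordinary font only defines how standard code points are displayed. As a result, text shown in an ordinary font can be copied out correctly, while text shown in a custom font copies out only as the underlying Unicode code points. The browser uses the custom font's mapping to render the matching glyph for each code point.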
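To make the distinction concrete, here is a minimal sketch of what copied text looks like and why a mapping is needed. The code points and the mapping are invented for illustration and do not come from any real font file.

```python
# Hypothetical example: these code points and this mapping are made up.
copied_text = "\ue123\ue456"  # what you get when copying obfuscated text

# Each character is only a code point with no readable meaning on its own.
codes = [f"{ord(ch):04x}" for ch in copied_text]
print(codes)  # ['e123', 'e456']

# Only a mapping recovered from the font turns them back into real characters.
font_map = {"e123": "8", "e456": "6"}
decoded = "".join(font_map[c] for c in codes)
print(decoded)  # 86
```

The rest of the article is about recovering a mapping like `font_map` from the font file itself.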

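We use a group-buying site as the example. The site uses custom fonts to hide key information such as shop names.

Loading the Page with Python

To fetch the page, we send HTTP requests with Python's requests library and parse the HTML document with BeautifulSoup. The basic steps are: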

```python
import requests
from bs4 import BeautifulSoup

# Browser-like request headers
headers = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9"
}
session = requests.Session()
session.headers = headers

res = session.get("http://www.dianping.com/shenzhen/ch30")
soup = BeautifulSoup(res.text, 'html5lib')
```

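Parsing the DOM Tree with BeautifulSoup

With BeautifulSoup, the elements we need are easy to pull out of the HTML document. For example, we can extract the category list and the area list: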

```python
# (category name, link) pairs
typelist = []
for a_tag in soup.select("div#classfy > a"):
    typelist.append((a_tag.span.text, a_tag['href']))

# (area name, link) pairs
arealist = []
for a_tag in soup.select("div#region-nav > a"):
    arealist.append((a_tag.span.text, a_tag['href']))
```

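Finding the Download URL of the Font CSS

The custom fonts are usually defined in a CSS file whose URL contains a specific keyword. We can locate the CSS file by searching for that keyword: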

```python
def getcssurl(soup):
    # Return the first stylesheet link whose URL contains the keyword
    for link in soup.select("head > link[rel=stylesheet]"):
        href = link['href']
        if "svgtextcss" in href:
            return href

cssurl = getcssurl(soup)
```

Parsing the CSS File to Get the Font URLs

The CSS file gives the font file URL for each font name. We extract this information with regular expressions:

```python
import re

def parseCssFontUrl(cssurl):
    res = session.get(cssurl)
    rule = {}      # CSS selector -> font-family name
    fontface = {}  # font-family name -> font file URLs
    for name, value in re.findall(r"([^{}]+)\{([^{}]+)\}", res.text):
        name = name.strip()
        for row in value.split(";"):
            if ":" not in row:
                continue
            # Split on the first colon only, so values containing ":" stay intact
            k, v = row.split(":", 1)
            k, v = k.strip(), v.strip(' "\'')
            if name == "@font-face":
                if k == "font-family":
                    fontname = v
                elif k == "src":
                    fontface.setdefault(fontname, []).extend(re.findall(r'url\("([^()]+)"\)', v))
            elif k == "font-family":
                rule[name] = v
    # Map each selector to the first URL of the font it uses
    fonturls = {}
    for classname, tagname in rule.items():
        fonturls[classname] = fontface[tagname][0]
    return fonturls

fonturls = parseCssFontUrl(cssurl)
```
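The same parsing idea can be tried offline on a small stylesheet string. The sketch below is a standalone variant, not the function above: the stylesheet, the font name `demo-font`, the URL, and the helper name `parse_font_rules` are all made up for illustration.

```python
import re

# Hypothetical stylesheet: font name, selector and URL are invented.
sample_css = (
    '@font-face{font-family: "demo-font";'
    'src:url("//example.com/demo.woff");}'
    ".shopNum{font-family: 'demo-font';}"
)

def parse_font_rules(css_text):
    """Return {selector: first font URL} for selectors using an @font-face font."""
    font_urls = {}   # font-family name -> URLs declared in @font-face
    selectors = {}   # selector -> font-family name
    for name, body in re.findall(r"([^{}]+)\{([^{}]+)\}", css_text):
        name = name.strip()
        props = {}
        for row in body.split(";"):
            if ":" not in row:
                continue
            key, value = row.split(":", 1)
            props.setdefault(key.strip(), []).append(value.strip())
        families = [v.strip(" \"'") for v in props.get("font-family", [])]
        if not families:
            continue
        if name == "@font-face":
            for src in props.get("src", []):
                font_urls.setdefault(families[0], []).extend(
                    re.findall(r'url\("([^()]+)"\)', src))
        else:
            selectors[name] = families[0]
    return {s: font_urls[f][0] for s, f in selectors.items() if f in font_urls}

print(parse_font_rules(sample_css))
# {'.shopNum': '//example.com/demo.woff'}
```

Note that the selector keeps its leading dot, so it has to be stripped before comparing it with a class name taken from the HTML.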

Downloading the Font Files

Once we have the font URLs, we can download the font files for further processing:

```python
import requests

def download_font(url, filename):
    # Fetch the font first, then write the raw bytes to disk
    content = requests.get(url).content
    with open(filename, "wb") as f:
        f.write(content)

for classname, url in fonturls.items():
    download_font(url, f"{classname}.woff")
```
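One practical detail: links in HTML and CSS are often written without a scheme (for example starting with `//`), and requests cannot fetch such a URL directly. Whether this site does so is an assumption here; if it does, the link can be resolved against the page URL first. In this minimal sketch, `example.com` stands in for the real hosts and `absolute_url` is a hypothetical helper.

```python
from urllib.parse import urljoin

# Hypothetical values: example.com stands in for the real page and font hosts.
page_url = "http://www.example.com/shenzhen/ch30"

def absolute_url(base, link):
    """Resolve a link found in HTML or CSS against the page it came from."""
    return urljoin(base, link)

print(absolute_url(page_url, "//static.example.com/fonts/a.woff"))
# http://static.example.com/fonts/a.woff
print(absolute_url(page_url, "https://cdn.example.com/fonts/b.woff"))
# https://cdn.example.com/fonts/b.woff
```

Links that are already absolute pass through unchanged, so the helper can be applied to every URL before downloading.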

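Building the Custom Font Mapping

To recognize the characters in a custom font, we load the font file into memory and render its glyphs as images, in groups, so that they can be matched to characters. The fontTools and PIL libraries can do this: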

```python
from fontTools.ttLib import TTFont
from PIL import ImageFont, Image, ImageDraw

def getCustomFontGroupImgs(fontfile, unilist=None, groupnum=25):
    if unilist is None:
        tfont = TTFont(fontfile)
        # Skip the first two entries of the glyph order
        unilist = tfont.getGlyphOrder()[2:]
    imgs = []
    font = ImageFont.truetype(fontfile, 20)
    # Draw groupnum glyphs per image, one row of text each
    for i in range(0, len(unilist), groupnum):
        im = Image.new(mode='RGB', size=(20*groupnum+10, 22), color="white")
        draw = ImageDraw.Draw(im=im)
        # Turn glyph names such as "uniE123" into the characters they encode
        unknown_chars = "".join(unilist[i:i + groupnum]).replace("uni", "\\u")
        unknown_chars = unknown_chars.encode().decode("unicode_escape")
        draw.text(xy=(5, -4), text=unknown_chars, fill=0, font=font)
        imgs.append(im)
    return imgs
```
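The name-to-character conversion and the grouping can be checked without a font file. The sketch below assumes glyph names follow the `uniXXXX` convention; the names listed and the helpers `glyph_names_to_text` and `chunks` are hypothetical.

```python
def glyph_names_to_text(names):
    """Convert glyph names such as 'uniE123' into the characters they encode."""
    return "".join(chr(int(n[3:], 16)) for n in names if n.startswith("uni"))

def chunks(seq, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# Hypothetical glyph names, as a font's glyph order might list them.
names = ["uniE123", "uniE456", "uniF789"]

text = glyph_names_to_text(names)
print([hex(ord(c)) for c in text])  # ['0xe123', '0xe456', '0xf789']
print(chunks(names, 2))  # [['uniE123', 'uniE456'], ['uniF789']]
```

Using `int(..., 16)` with `chr` gives the same characters as the string replacement followed by `unicode_escape` decoding for four-digit names, and it fails loudly on a malformed name instead of producing stray text.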

Image Recognition and Character Mapping

To recognize the characters in the rendered images, we can use OCR. Here we use the ddddocr library:

```python
from io import BytesIO

import ddddocr

def get_img_bytes(img):
    # Serialize a PIL image to JPEG bytes in memory
    img_byte = BytesIO()
    img.save(img_byte, format='JPEG')
    return img_byte.getvalue()

def batch_recognize(imgs):
    # Run OCR on every image and collect the recognized text
    ocr = ddddocr.DdddOcr()
    result = []
    for im in imgs:
        img_byte = get_img_bytes(im)
        text = ocr.classification(img_byte)
        result.append(text)
    return result
```
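The recognized text still has to be paired with the glyph names and proofread before it can serve as a mapping. The following sketch shows one way this could be done, assuming each OCR result corresponds to one group of glyphs in order. `build_font_map` and the sample data are hypothetical and not the author's code.

```python
def build_font_map(glyph_names, recognized_rows, group_size=25):
    """Pair glyph names with recognized characters, group by group.

    Returns (font_map, problems). Groups whose recognized text has the wrong
    length are skipped and reported so they can be corrected by hand.
    """
    font_map = {}
    problems = []
    for row_index, text in enumerate(recognized_rows):
        start = row_index * group_size
        names = glyph_names[start:start + group_size]
        if len(text) != len(names):
            problems.append((row_index, len(names), len(text)))
            continue
        for name, char in zip(names, text):
            font_map[name[3:].lower()] = char  # 'uniE123' -> 'e123'
    return font_map, problems

# Hypothetical data: three glyphs rendered two per image.
names = ["uniE123", "uniE456", "uniE789"]
good_map, good_problems = build_font_map(names, ["12", "3"], group_size=2)
print(good_map)       # {'e123': '1', 'e456': '2', 'e789': '3'}
print(good_problems)  # []

# If OCR drops a character, the group is reported instead of being mis-mapped.
bad_map, bad_problems = build_font_map(names, ["1", "3"], group_size=2)
print(bad_problems)   # [(0, 2, 1)]
```

A length check only catches dropped or extra characters. A character that OCR misreads still needs a manual look, which is why the mapping should be proofread before it is used.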

Batch Download and Data Extraction

Finally, we extract all the data and save it to an Excel file:

```python
import random
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_data(soup):
    """Extract one tuple per shop from a listing page."""
    result = []
    for li_tag in soup.select("div#shop-all-list div.txt"):
        title = li_tag.select_one("div.tit>a>h4").text
        url = li_tag.select_one("div.tit>a")["href"]
        # The rating is encoded in a class name (classes assumed to be
        # nebula_star and star_icon); take its digits and divide by 10
        star_class = li_tag.select_one("div.comment>div.nebula_star>div.star_icon>span")["class"]
        star = int(re.findall(r"\d+", " ".join(star_class))[0]) // 10

        comment_tag = li_tag.select_one("div.comment>a.review-num>b")
        comment_num = comment_tag.text if comment_tag else None

        mean_price_tag = li_tag.select_one("div.comment>a.mean-price>b")
        mean_price = mean_price_tag.text if mean_price_tag else None

        fun_type = li_tag.select_one("div.tag-addr>a:nth-of-type(1)>span.tag").text
        area = li_tag.select_one("div.tag-addr>a:nth-of-type(2)>span.tag").text
        result.append((title, star, comment_num, mean_price, fun_type, area, url))
    return result


def fix_text(soup):
    """Replace every obfuscated <svgmtsi> tag with its decoded text."""
    cssurl = getcssurl(soup)
    for svgmtsi in soup.find_all('svgmtsi'):
        classname = svgmtsi['class'][0]
        # getFontMapFromClassName is not shown in this article; it is expected
        # to return a dict from a hex code point to the real character
        font_map = getFontMapFromClassName(classname, cssurl)
        chars = []
        for c in svgmtsi.text:
            char = c.encode("unicode_escape").decode()[2:]
            chars.append(font_map[char])
        svgmtsi.replace_with("".join(chars))


headers = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9"
}
session = requests.Session()
session.headers = headers
baseurl = "http://www.dianping.com/shenzhen/ch30"
res = session.get(baseurl)

soup = BeautifulSoup(res.text, 'html5lib')
typelist = []
for a_tag in soup.select("div#classfy > a"):
    typelist.append((a_tag.span.text, a_tag['href'] + 'r91172'))

result = []
for typename, url in typelist:
    res = session.get(url)
    soup = BeautifulSoup(res.text, 'html5lib')
    fix_text(soup)
    result.extend(parse_data(soup))
    # Pause between requests
    time.sleep(random.randint(2, 4))


def to_int(x):
    # Missing or empty values become pd.NA instead of raising
    return int(x) if pd.notna(x) and x else pd.NA


# Columns: title, stars, review count, average price, category, area, link
df = pd.DataFrame(result, columns=["标题", "星级", "评论数", "均价", "娱乐类型", "区域", "链接"])
df.评论数 = df.评论数.apply(to_int)
# Drop the leading currency sign before converting the price
df.均价 = df.均价.str[1:].apply(to_int)
df.drop_duplicates(inplace=True)
df.to_excel("华南城娱乐.xlsx", index=False)
```
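The decoding and cleaning steps can be exercised on plain strings, without the network or a parsed page. This is a minimal sketch: the font map, the obfuscated price, and the class list are invented, and `decode_text` and `star_from_classes` are hypothetical helpers that mirror the logic above.

```python
import re

def decode_text(text, font_map):
    """Replace characters found in font_map, leaving all others untouched."""
    return "".join(font_map.get(f"{ord(ch):04x}", ch) for ch in text)

def star_from_classes(classes):
    """Take the first number in the class names and integer-divide by 10."""
    return int(re.findall(r"\d+", " ".join(classes))[0]) // 10

# Hypothetical mapping and obfuscated price.
font_map = {"e123": "1", "e456": "8"}
price_text = decode_text("￥\ue123\ue456", font_map)
print(price_text)            # ￥18
print(int(price_text[1:]))   # 18

# Hypothetical class list for a rating element.
print(star_from_classes(["star", "star_45", "star_sml"]))  # 4
```

Because of the integer division, a value of 45 is reported as 4 stars rather than 4.5. Dividing with `/ 10` instead would keep the half star.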

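Summary

With the steps above, we can parse a custom font and recover the character information it hides. I hope this article helps you deal with the various forms of font-based anti-scraping.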

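    Source: 图灵汇
Editor: 信烁良金
Notice: This article is an original work of 图灵汇, which holds the copyright. It may not be reproduced without authorization. Media that have been authorized by agreement must credit "Source: 图灵汇" when downloading and using it; violators will be held legally liable.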