
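Why use Pydantic?

Today, Pydantic is downloaded many times a month and is used by some of the largest and best-known organisations in the world.

It's hard to know why so many people have adopted Pydantic since its inception six years ago, but here are a few guesses.

Type hints powering schema validation

The schema that Pydantic validates against is generally defined by Python type hints.

Type hints are great for this since, if you're writing modern Python, you already know how to use them. Using type hints also means that Pydantic integrates well with static typing tools like mypy and pyright and with IDEs like PyCharm and VS Code.

Example - just type hints

(This example requires Python 3.9+)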

from typing import Annotated, Dict, List, Literal, Tuple

from annotated_types import Gt

from pydantic import BaseModel


class Fruit(BaseModel):
    name: str  # (1)!
    color: Literal['red', 'green']  # (2)!
    weight: Annotated[float, Gt(0)]  # (3)!
    bazam: Dict[str, List[Tuple[int, bool, float]]]  # (4)!


print(
    Fruit(
        name='Apple',
        color='red',
        weight=4.2,
        bazam={'foobar': [(1, True, 0.1)]},
    )
)
#> name='Apple' color='red' weight=4.2 bazam={'foobar': [(1, True, 0.1)]}

  1. The name field is simply annotated with str - any string is allowed.
  2. The Literal type is used to enforce that color is either 'red' or 'green'.
  3. Even when we want to apply constraints not encapsulated in python types, we can use Annotated and annotated-types to enforce constraints without breaking type hints.
  4. I'm not claiming "bazam" is really an attribute of fruit, but rather to show that arbitrarily complex types can easily be validated.
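To see what happens when input does not satisfy these hints, here is a minimal sketch using a trimmed-down copy of the Fruit model above. The invalid values ('blue' and a negative weight) are chosen purely for illustration.

```python
from typing import Annotated, Literal

from annotated_types import Gt
from pydantic import BaseModel, ValidationError


class Fruit(BaseModel):
    name: str
    color: Literal['red', 'green']
    weight: Annotated[float, Gt(0)]


try:
    # 'blue' is not an allowed literal and the weight is not greater than 0
    Fruit(name='Apple', color='blue', weight=-1.0)
except ValidationError as exc:
    # Each error records the location (field name) that failed
    failed_fields = sorted(err['loc'][0] for err in exc.errors())
    print(failed_fields)
    #> ['color', 'weight']
```

Both problems are reported in a single ValidationError rather than stopping at the first failure.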

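Learn more

See the documentation on supported types.

Performance

Pydantic's core validation logic is implemented in a separate package, pydantic-core, where validation for most types is implemented in Rust.

As a result, Pydantic is among the fastest data validation libraries for Python.

Performance Example - Pydantic vs. dedicated code

In general, dedicated code should be much faster than a general-purpose validator, but in this example Pydantic is more than 3x faster than dedicated code when parsing JSON and validating URLs.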

Performance Example
import json
import timeit
from urllib.parse import urlparse

import requests

from pydantic import HttpUrl, TypeAdapter

reps = 7
number = 100
r = requests.get('https://api.github.com/emojis')
r.raise_for_status()
emojis_json = r.content


def emojis_pure_python(raw_data):
    data = json.loads(raw_data)
    output = {}
    for key, value in data.items():
        assert isinstance(key, str)
        url = urlparse(value)
        assert url.scheme in ('https', 'http')
        output[key] = url
    return output


emojis_pure_python_times = timeit.repeat(
    'emojis_pure_python(emojis_json)',
    globals={
        'emojis_pure_python': emojis_pure_python,
        'emojis_json': emojis_json,
    },
    repeat=reps,
    number=number,
)
print(f'pure python: {min(emojis_pure_python_times) / number * 1000:0.2f}ms')
#> pure python: 5.32ms

type_adapter = TypeAdapter(dict[str, HttpUrl])
emojis_pydantic_times = timeit.repeat(
    'type_adapter.validate_json(emojis_json)',
    globals={
        'type_adapter': type_adapter,
        'HttpUrl': HttpUrl,
        'emojis_json': emojis_json,
    },
    repeat=reps,
    number=number,
)
print(f'pydantic: {min(emojis_pydantic_times) / number * 1000:0.2f}ms')
#> pydantic: 1.54ms

print(
    f'Pydantic {min(emojis_pure_python_times) / min(emojis_pydantic_times):0.2f}x faster'
)
#> Pydantic 3.45x faster

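Unlike other performance-centric libraries written in compiled languages, Pydantic also has excellent support for customising validation via functional validators.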
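As a rough illustration of a functional validator, the sketch below attaches a plain Python function to a type with AfterValidator. The names must_be_even, EvenInt, and Batch are hypothetical and exist only for this example.

```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel, ValidationError


def must_be_even(value: int) -> int:
    # Runs after Pydantic's own int validation, so value is already an int
    if value % 2 != 0:
        raise ValueError(f'{value} is not even')
    return value


EvenInt = Annotated[int, AfterValidator(must_be_even)]


class Batch(BaseModel):
    size: EvenInt


batch = Batch(size='4')  # the numeric string is coerced to int first
print(batch.size)
#> 4

try:
    Batch(size=3)
except ValidationError as exc:
    error_type = exc.errors()[0]['type']
    print(error_type)
    #> value_error
```

The built-in int validation still runs in Rust; only the extra check is ordinary Python.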

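Learn more

Samuel Colvin's talk at PyCon 2023 explains how pydantic-core works and how it integrates with Pydantic.

Serialization

Pydantic provides functionality to serialize models in three ways:

  1. To a Python dict made up of the associated Python objects

  2. To a Python dict made up only of "jsonable" types

  3. To a JSON string

In all three modes, the output can be customised by excluding specific fields, excluding unset fields, excluding default values, and excluding None values.

Example - Serialization 3 ways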
from datetime import datetime

from pydantic import BaseModel


class Meeting(BaseModel):
    when: datetime
    where: bytes
    why: str = 'No idea'


m = Meeting(when='2020-01-01T12:00', where='home')
print(m.model_dump(exclude_unset=True))
#> {'when': datetime.datetime(2020, 1, 1, 12, 0), 'where': b'home'}
print(m.model_dump(exclude={'where'}, mode='json'))
#> {'when': '2020-01-01T12:00:00', 'why': 'No idea'}
print(m.model_dump_json(exclude_defaults=True))
#> {"when":"2020-01-01T12:00:00","where":"home"}
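The example above does not cover excluding None values. A minimal sketch of that option follows; the Profile model and its fields are hypothetical and used only for illustration.

```python
from typing import Optional

from pydantic import BaseModel


class Profile(BaseModel):
    name: str
    nickname: Optional[str] = None
    age: Optional[int] = None


p = Profile(name='Ada', age=36)

# nickname is None, so it is dropped from both outputs
without_none = p.model_dump(exclude_none=True)
print(without_none)
#> {'name': 'Ada', 'age': 36}

without_none_json = p.model_dump_json(exclude_none=True)
print(without_none_json)
#> {"name":"Ada","age":36}
```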

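Learn more

See the documentation on serialization.

JSON Schema

JSON Schema can be generated for any Pydantic schema, allowing self-documenting APIs and integration with a wide variety of tools that support JSON Schema.

Example - JSON Schema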
from datetime import datetime

from pydantic import BaseModel


class Address(BaseModel):
    street: str
    city: str
    zipcode: str


class Meeting(BaseModel):
    when: datetime
    where: Address
    why: str = 'No idea'


print(Meeting.model_json_schema())
"""
{
    '$defs': {
        'Address': {
            'properties': {
                'street': {'title': 'Street', 'type': 'string'},
                'city': {'title': 'City', 'type': 'string'},
                'zipcode': {'title': 'Zipcode', 'type': 'string'},
            },
            'required': ['street', 'city', 'zipcode'],
            'title': 'Address',
            'type': 'object',
        }
    },
    'properties': {
        'when': {'format': 'date-time', 'title': 'When', 'type': 'string'},
        'where': {'$ref': '#/$defs/Address'},
        'why': {'default': 'No idea', 'title': 'Why', 'type': 'string'},
    },
    'required': ['when', 'where'],
    'title': 'Meeting',
    'type': 'object',
}
"""

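Pydantic generates JSON Schema version 2020-12, the latest version of the standard, which is compatible with OpenAPI 3.1.

Learn more

See the documentation on JSON Schema.

Strict mode and data coercion

By default, Pydantic is tolerant of common incorrect types and coerces data to the right type. For example, a numeric string passed to an int field will be parsed as an int.

Pydantic also has a strict=True mode, also known as "strict mode", where types are not coerced and a validation error is raised unless the input data exactly matches the schema or type hint.

But strict mode would be pretty useless when validating JSON data, since JSON has no types matching many common Python types such as datetime, UUID, or bytes.

To solve this, Pydantic can parse and validate JSON in one step. This allows sensible data conversions such as RFC3339 (aka ISO8601) strings to datetime objects. Since the JSON parsing is implemented in Rust, it's also very performant.

Example - Strict mode that's actually useful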
from datetime import datetime

from pydantic import BaseModel, ValidationError


class Meeting(BaseModel):
    when: datetime
    where: bytes


m = Meeting.model_validate({'when': '2020-01-01T12:00', 'where': 'home'})
print(m)
#> when=datetime.datetime(2020, 1, 1, 12, 0) where=b'home'
try:
    m = Meeting.model_validate(
        {'when': '2020-01-01T12:00', 'where': 'home'}, strict=True
    )
except ValidationError as e:
    print(e)
    """
    2 validation errors for Meeting
    when
      Input should be a valid datetime [type=datetime_type, input_value='2020-01-01T12:00', input_type=str]
    where
      Input should be a valid bytes [type=bytes_type, input_value='home', input_type=str]
    """

m_json = Meeting.model_validate_json(
    '{"when": "2020-01-01T12:00", "where": "home"}'
)
print(m_json)
#> when=datetime.datetime(2020, 1, 1, 12, 0) where=b'home'
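The last call above validates JSON in the default (lax) mode. As a minimal sketch of strict mode applied to JSON input, the following passes strict=True to model_validate_json. The Event model and its fields are hypothetical.

```python
from datetime import datetime

from pydantic import BaseModel, ValidationError


class Event(BaseModel):
    when: datetime
    attendees: int


# In strict mode an ISO 8601 string is still accepted for a datetime field
# when the input is JSON, because JSON has no native datetime type
event = Event.model_validate_json(
    '{"when": "2020-01-01T12:00", "attendees": 3}', strict=True
)
print(event.when)
#> 2020-01-01 12:00:00

try:
    # JSON does have a number type, so a string is rejected for an int field
    Event.model_validate_json(
        '{"when": "2020-01-01T12:00", "attendees": "3"}', strict=True
    )
except ValidationError as exc:
    strict_error_types = [err['type'] for err in exc.errors()]
    print(strict_error_types)
    #> ['int_type']
```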

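Learn more

See the documentation on strict mode.

Dataclasses, TypedDicts, and more

Pydantic provides four ways to create schemas and perform validation and serialization:

  1. BaseModel: Pydantic's own superclass, with many common utilities available via instance methods.

  2. pydantic.dataclasses.dataclass: a wrapper around standard dataclasses that performs validation when a dataclass is initialised.

  3. TypeAdapter: a general way to adapt any type for validation and serialization. This allows types like TypedDict and NamedTuple to be validated, as well as simple scalar values like int or timedelta; all supported types can be used with TypeAdapter.

  4. validate_call: a decorator that performs validation when a function is called.

Example - schema based on TypedDict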
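The dataclass wrapper, the validate_call decorator, and TypeAdapter on a scalar type can be sketched briefly as follows. Point and repeat are hypothetical names used only for this illustration.

```python
from datetime import timedelta

from pydantic import TypeAdapter, ValidationError, validate_call
from pydantic.dataclasses import dataclass


@dataclass
class Point:
    x: int
    y: int


point = Point(x='1', y=2)  # validated (and coerced) on initialisation
print(point)
#> Point(x=1, y=2)


@validate_call
def repeat(text: str, times: int) -> str:
    return text * times


repeated = repeat('ab', '2')  # arguments are validated on every call
print(repeated)
#> abab

try:
    repeat('ab', 'many')
except ValidationError as exc:
    call_error_type = exc.errors()[0]['type']
    print(call_error_type)
    #> int_parsing

# TypeAdapter also handles plain scalar types; an int is read as seconds
delta = TypeAdapter(timedelta).validate_python(90)
print(delta)
#> 0:01:30
```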
from datetime import datetime

from typing_extensions import NotRequired, TypedDict

from pydantic import TypeAdapter


class Meeting(TypedDict):
    when: datetime
    where: bytes
    why: NotRequired[str]


meeting_adapter = TypeAdapter(Meeting)
m = meeting_adapter.validate_python(  # (1)!
    {'when': '2020-01-01T12:00', 'where': 'home'}
)
print(m)
#> {'when': datetime.datetime(2020, 1, 1, 12, 0), 'where': b'home'}
meeting_adapter.dump_python(m, exclude={'where'})  # (2)!

print(meeting_adapter.json_schema())  # (3)!
"""
{
    'properties': {
        'when': {'format': 'date-time', 'title': 'When', 'type': 'string'},
        'where': {'format': 'binary', 'title': 'Where', 'type': 'string'},
        'why': {'title': 'Why', 'type': 'string'},
    },
    'required': ['when', 'where'],
    'title': 'Meeting',
    'type': 'object',
}
"""
  1. TypeAdapter for a TypedDict performing validation; it can also validate JSON data directly with validate_json
  2. dump_python to serialise a TypedDict to a Python object; it can also serialise to JSON with dump_json
  3. TypeAdapter can also generate JSON Schema

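Customisation

Functional validators and serializers, as well as a powerful protocol for custom types, mean that the way Pydantic operates can be customised on a per-field or per-type basis.

Customisation Example - wrap validators

"Wrap validators" are new in Pydantic V2 and are one of the most powerful ways to customise Pydantic validation.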

from datetime import datetime, timezone

from pydantic import BaseModel, field_validator


class Meeting(BaseModel):
    when: datetime

    @field_validator('when', mode='wrap')
    def when_now(cls, input_value, handler):
        if input_value == 'now':
            return datetime.now()
        when = handler(input_value)
        # in this specific application we know tz naive datetimes are in UTC
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        return when


print(Meeting(when='2020-01-01T12:00+01:00'))
#> when=datetime.datetime(2020, 1, 1, 12, 0, tzinfo=TzInfo(+01:00))
print(Meeting(when='now'))
#> when=datetime.datetime(2032, 1, 2, 3, 4, 5, 6)
print(Meeting(when='2020-01-01T12:00'))
#> when=datetime.datetime(2020, 1, 1, 12, 0, tzinfo=datetime.timezone.utc)
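Serialization can be customised per field in a similar way. The sketch below uses field_serializer to emit a Unix timestamp instead of the default ISO 8601 string; the choice of a timestamp is just an assumption made for the example.

```python
from datetime import datetime, timezone

from pydantic import BaseModel, field_serializer


class Meeting(BaseModel):
    when: datetime

    @field_serializer('when')
    def serialize_when(self, value: datetime) -> int:
        # Emit whole seconds since the Unix epoch
        return int(value.timestamp())


m = Meeting(when=datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc))

dumped = m.model_dump()
print(dumped)
#> {'when': 1577880000}

dumped_json = m.model_dump_json()
print(dumped_json)
#> {"when":1577880000}
```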

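Learn more

See the documentation on validators, custom serializers, and custom types.

Ecosystem

At the time of writing there are 214,100 repositories on GitHub and 8,119 packages on PyPI that depend on Pydantic.

Some notable libraries that depend on Pydantic:

More libraries using Pydantic can be found at Kludex/awesome-pydantic.

Who is using Pydantic

Some notable companies and organisations using Pydantic, together with comments on why/how we know they're using Pydantic.

The organisations below are included because they match one or more of the following criteria:

  • Using pydantic as a dependency in a public repository

  • Referring traffic to the pydantic documentation site from an organisation-internal domain; specific referrers are not included since they're generally not in the public domain

  • Direct communication between the Pydantic team and engineers employed by the organisation about usage of Pydantic within the organisation

We've included some extra detail where appropriate and already in the public domain.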

Adobe

adobe/dy-sql uses Pydantic.

Amazon and AWS

Anthropic

anthropics/anthropic-sdk-python uses Pydantic.

Apple

(Based on the criteria described above)

ASML

(Based on the criteria described above)

AstraZeneca

Multiple repos in the AstraZeneca GitHub org depend on Pydantic.

Cisco Systems

Comcast

(Based on the criteria described above)

Datadog

  • Extensive use of Pydantic in DataDog/integrations-core and other repos
  • Communication with engineers from Datadog about how they use Pydantic.

Facebook

Multiple repos in the facebookresearch GitHub org depend on Pydantic.

GitHub

GitHub sponsored Pydantic $750 in 2022

Google

Extensive use of Pydantic in google/turbinia and other repos.

HSBC

(Based on the criteria described above)

IBM

Multiple repos in the IBM GitHub org depend on Pydantic.

Intel

(Based on the criteria described above)

Intuit

(Based on the criteria described above)

Intergovernmental Panel on Climate Change

Tweet explaining how the IPCC use Pydantic.

JPMorgan

(Based on the criteria described above)

Jupyter

  • The developers of the Jupyter notebook are using Pydantic for subprojects
  • Through the FastAPI-based Jupyter server Jupyverse
  • FPS's configuration management.

Microsoft

  • DeepSpeed deep learning optimisation library uses Pydantic extensively
  • Multiple repos in the microsoft GitHub org depend on Pydantic, in particular their
  • Pydantic is also used in the Azure GitHub org
  • Comments on GitHub show Microsoft engineers using Pydantic as part of Windows and Office

Molecular Science Software Institute

Multiple repos in the MolSSI GitHub org depend on Pydantic.

NASA

Multiple repos in the NASA GitHub org depend on Pydantic.

NASA are also using Pydantic via FastAPI in their JWST project to process images from the James Webb Space Telescope, see this tweet.

Netflix

Multiple repos in the Netflix GitHub org depend on Pydantic.

NSA

The nsacyber/WALKOFF repo depends on Pydantic.

NVIDIA

Multiple repos in the NVIDIA GitHub org depend on Pydantic.

Their "Omniverse Services" depends on Pydantic according to their documentation.

OpenAI

OpenAI use Pydantic for their ChatCompletions API, as per this discussion on GitHub.

Anecdotally, OpenAI use Pydantic extensively for their internal services.

Oracle

(Based on the criteria described above)

Palantir

(Based on the criteria described above)

Qualcomm

(Based on the criteria described above)

Red Hat

(Based on the criteria described above)

Revolut

Anecdotally, all internal services at Revolut are built with FastAPI and therefore Pydantic.

Robusta

The robusta-dev/robusta repo depends on Pydantic.

Salesforce

Salesforce sponsored Samuel Colvin $10,000 to work on Pydantic in 2022.

Starbucks

(Based on the criteria described above)

Texas Instruments

(Based on the criteria described above)

Twilio

(Based on the criteria described above)

Twitter

Twitter's the-algorithm repo where they open sourced their recommendation engine uses Pydantic.

UK Home Office

(Based on the criteria described above)

