Define extraction strategy schema typings
Currently, the extraction strategy schemas are typed as Dict[str, Any], which forces devs to read the extraction strategy's source code to see which keys are expected and then work out what each one does. The documentation on this is still lacking; for example, it does not mention everything that JsonCssExtractionStrategy can do.
I've written the following types to help with my own use of JsonCssExtractionStrategy, and I would like to see types used to convey the expected input.
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Dict, Generic, List, Optional, Pattern, TypeVar, Union

class SelectorType(str, Enum):
    TEXT = "text"
    LIST = "list"
    NESTED = "nested"
    NESTED_LIST = "nested_list"
    ATTRIBUTE = "attribute"
    HTML = "html"
    REGEX = "regex"
    COMPUTED = "computed"

class Transform(str, Enum):
    LOWERCASE = "lowercase"
    UPPERCASE = "uppercase"
    STRIP = "strip"

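# Note: kw_only=True requires Python 3.10 or newer.
@dataclass(kw_only=True)
class BaseField:
    name: str
    # Not a constructor argument; each subclass sets it in __post_init__.
    type: SelectorType = field(init=False)
    default: Optional[Any] = None
    selector: Optional[str] = None
    transform: Optional[Transform] = None
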
@dataclass(kw_only=True)
class TextField(BaseField):
    def __post_init__(self):
        self.type = SelectorType.TEXT

@dataclass(kw_only=True)
class HtmlField(BaseField):
    def __post_init__(self):
        self.type = SelectorType.HTML

@dataclass(kw_only=True)
class AttributeField(BaseField):
    attribute: str

    def __post_init__(self):
        self.type = SelectorType.ATTRIBUTE

@dataclass(kw_only=True)
class RegexField(BaseField):
    pattern: Union[str, Pattern]

    def __post_init__(self):
        self.type = SelectorType.REGEX

@dataclass(kw_only=True)
class ComputedField(BaseField):
    expression: Optional[str] = None
    function: Optional[Callable[[Dict[str, Any]], Any]] = None

    def __post_init__(self):
        self.type = SelectorType.COMPUTED
        if not (self.expression or self.function):
            raise ValueError("ComputedField must have either expression or function")

T = TypeVar("T", bound=BaseField)

@dataclass(kw_only=True)
class ListField(BaseField, Generic[T]):
    fields: List[T]

    def __post_init__(self):
        self.type = SelectorType.LIST

@dataclass(kw_only=True)
class NestedField(BaseField, Generic[T]):
    fields: List[T]

    def __post_init__(self):
        self.type = SelectorType.NESTED

@dataclass(kw_only=True)
class NestedListField(BaseField, Generic[T]):
    fields: List[T]

    def __post_init__(self):
        self.type = SelectorType.NESTED_LIST

@dataclass(kw_only=True)
class Schema:
    name: str
    baseSelector: str
    fields: List[
        Union[
            TextField,
            HtmlField,
            AttributeField,
            RegexField,
            ComputedField,
            ListField[Any],
            NestedField[Any],
            NestedListField[Any],
        ]
    ]

    def to_dict(self) -> Dict[str, Any]:
        """Convert Schema to dictionary format for JsonCssExtractionStrategy"""

        # The parameter is named `fld` so it does not shadow dataclasses.field.
        def field_to_dict(fld: BaseField) -> Dict[str, Any]:
            result: Dict[str, Any] = {"name": fld.name, "type": fld.type.value}
            # Optional keys are only emitted when they are set.
            if fld.selector:
                result["selector"] = fld.selector
            if fld.transform:
                result["transform"] = fld.transform.value
            if fld.default is not None:
                result["default"] = fld.default
            if isinstance(fld, AttributeField):
                result["attribute"] = fld.attribute
            if isinstance(fld, RegexField):
                result["pattern"] = fld.pattern
            if isinstance(fld, ComputedField):
                if fld.expression:
                    result["expression"] = fld.expression
                if fld.function:
                    result["function"] = fld.function
            # Container fields serialize their children recursively.
            if isinstance(fld, (ListField, NestedField, NestedListField)):
                result["fields"] = [field_to_dict(f) for f in fld.fields]
            return result

        return {
            "name": self.name,
            "baseSelector": self.baseSelector,
            "fields": [field_to_dict(f) for f in self.fields],
        }

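One thing the typed version gives over a raw dict is that mistakes surface when the schema is built rather than when it is used. The sketch below uses a trimmed-down stand-in (`ComputedFieldSketch`, a hypothetical name, written with a plain `@dataclass` so it also runs on Python versions before 3.10) to show the `__post_init__` check from `ComputedField` rejecting a field that has neither `expression` nor `function`. It is only an illustration of the idea, not the class proposed above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class ComputedFieldSketch:
    """Hypothetical stand-in for ComputedField, reduced to the validation logic."""

    name: str
    expression: Optional[str] = None
    function: Optional[Callable[[Dict[str, Any]], Any]] = None
    # Fixed discriminator, not accepted as a constructor argument.
    type: str = field(init=False, default="computed")

    def __post_init__(self):
        # Same rule as ComputedField: at least one of the two must be given.
        if not (self.expression or self.function):
            raise ValueError("ComputedField must have either expression or function")


# A valid field: the callable receives the extracted item as a dict.
full_name = ComputedFieldSketch(
    name="full_name",
    function=lambda item: f"{item['first']} {item['last']}",
)

# An invalid field fails immediately at construction time.
try:
    ComputedFieldSketch(name="broken")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With a plain dict, the same omission would only show up once the strategy tried to evaluate the field.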
Example usage:
schema = Schema(
    name="Example",
    baseSelector="#detail",
    fields=[
        NestedField(
            name="contact",
            selector="#detail-right",
            fields=[
                TextField(name="agency", selector="h3:first-of-type"),
                TextField(name="agent", selector="h3:last-of-type"),
                TextField(name="phone", selector="ul li:first-child"),
                TextField(name="city", selector="ul li:nth-child(2)"),
                TextField(name="address", selector=".address"),
            ],
        ),
        ListField(
            name="images",
            selector=".detail-ad-info-photos ol li a",
            fields=[AttributeField(name="url", attribute="href")],
        ),
    ],
)
extraction_strategy = JsonCssExtractionStrategy(schema.to_dict())
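For reference, the example above should serialize to a plain dict like the one below. This is worked out by hand from `to_dict` as written above, not captured from a run, and the variable name `expected_schema_dict` is only illustrative.

```python
# Hand-derived result of schema.to_dict() for the example schema.
# Unset optional keys (default, transform, and selector on the "url" field)
# are omitted, because to_dict only emits them when they are set.
expected_schema_dict = {
    "name": "Example",
    "baseSelector": "#detail",
    "fields": [
        {
            "name": "contact",
            "type": "nested",
            "selector": "#detail-right",
            "fields": [
                {"name": "agency", "type": "text", "selector": "h3:first-of-type"},
                {"name": "agent", "type": "text", "selector": "h3:last-of-type"},
                {"name": "phone", "type": "text", "selector": "ul li:first-child"},
                {"name": "city", "type": "text", "selector": "ul li:nth-child(2)"},
                {"name": "address", "type": "text", "selector": ".address"},
            ],
        },
        {
            "name": "images",
            "type": "list",
            "selector": ".detail-ad-info-photos ol li a",
            "fields": [{"name": "url", "type": "attribute", "attribute": "href"}],
        },
    ],
}
```

This is the same shape the strategy already accepts as a Dict[str, Any], so the typed classes only change how the schema is written, not what the strategy receives.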
@arnm To be honest, this is very beautiful. I agree with you; even I have to check the code to remember what is supported. Would you like to create a pull request against the current version so we can continue the discussion there? I really like what you've done here. Also, feel free to join our Discord channels if you'd like: just send me your email address and we can continue the conversation, test this, and potentially add it to the next release.
I actually have two things planned for this extraction strategy. One is to automate schema generation using a language model that analyzes web pages based on what you're looking for (e.g. "I want all agencies' contact details"). This would make the process automated and painless! The other item on the roadmap is a Chrome extension that lets users pick what they want and get the schema on the fly. Perhaps that is one of the reasons why I didn't invest more time in the current version.
If you're interested, you can join and help handle this. Let me know.