crawl4ai

Define extraction strategy schema typings

Open · arnm opened this issue 1 year ago • 1 comment

Currently, the extraction strategy schemas are typed as Dict[str, Any], which requires devs to read the source code of the extraction strategy to see which values are expected and then work out what they do. The documentation on this is still lacking and does not mention everything that JsonCssExtractionStrategy can do, for example.
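To illustrate the problem, here is a minimal sketch of an untyped schema. The keys follow the dict format used elsewhere in this issue; the misspelled "txt" value and the ALLOWED_TYPES set are made up for illustration and are not part of crawl4ai.

```python
from typing import Any, Dict

# A schema written as a plain dict: nothing tells the author which keys
# or values are valid.
schema: Dict[str, Any] = {
    "name": "Example",
    "baseSelector": "#detail",
    "fields": [
        # Typo: "txt" instead of "text". A static type checker accepts
        # this because the value type is Any.
        {"name": "agency", "type": "txt", "selector": "h3"},
    ],
}

# Hypothetical allow-list, based on the type names proposed in this issue.
ALLOWED_TYPES = {
    "text", "list", "nested", "nested_list",
    "attribute", "html", "regex", "computed",
}

# Without typed inputs, a mistake like this only shows up through a manual check.
bad_fields = [f["name"] for f in schema["fields"] if f["type"] not in ALLOWED_TYPES]
print(bad_fields)  # prints ['agency']
```

With an Enum or dataclass for the field type, the same typo would be flagged by the editor or type checker before the code runs.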

I've generated the following types to help with my use of JsonCssExtractionStrategy and I would like to see types be used to convey expected input.

from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Dict, Generic, List, Optional, Pattern, TypeVar, Union


class SelectorType(str, Enum):
    TEXT = "text"
    LIST = "list"
    NESTED = "nested"
    NESTED_LIST = "nested_list"
    ATTRIBUTE = "attribute"
    HTML = "html"
    REGEX = "regex"
    COMPUTED = "computed"


class Transform(str, Enum):
    LOWERCASE = "lowercase"
    UPPERCASE = "uppercase"
    STRIP = "strip"


@dataclass(kw_only=True)
class BaseField:
    name: str
    type: SelectorType = field(init=False)

    default: Optional[Any] = None
    selector: Optional[str] = None
    transform: Optional[Transform] = None


@dataclass(kw_only=True)
class TextField(BaseField):
    def __post_init__(self):
        self.type = SelectorType.TEXT


@dataclass(kw_only=True)
class HtmlField(BaseField):
    def __post_init__(self):
        self.type = SelectorType.HTML


@dataclass(kw_only=True)
class AttributeField(BaseField):
    attribute: str

    def __post_init__(self):
        self.type = SelectorType.ATTRIBUTE


@dataclass(kw_only=True)
class RegexField(BaseField):
    pattern: Union[str, Pattern]

    def __post_init__(self):
        self.type = SelectorType.REGEX


@dataclass(kw_only=True)
class ComputedField(BaseField):
    expression: Optional[str] = None
    function: Optional[Callable[[Dict[str, Any]], Any]] = None

    def __post_init__(self):
        self.type = SelectorType.COMPUTED
        if not (self.expression or self.function):
            raise ValueError("ComputedField must have either expression or function")


T = TypeVar("T", bound=BaseField)


@dataclass(kw_only=True)
class ListField(BaseField, Generic[T]):
    fields: List[T]

    def __post_init__(self):
        self.type = SelectorType.LIST


@dataclass(kw_only=True)
class NestedField(BaseField, Generic[T]):
    fields: List[T]

    def __post_init__(self):
        self.type = SelectorType.NESTED


@dataclass(kw_only=True)
class NestedListField(BaseField, Generic[T]):
    fields: List[T]

    def __post_init__(self):
        self.type = SelectorType.NESTED_LIST


@dataclass(kw_only=True)
class Schema:
    name: str
    baseSelector: str
    fields: List[
        Union[
            TextField,
            HtmlField,
            AttributeField,
            RegexField,
            ComputedField,
            ListField[Any],
            NestedField[Any],
            NestedListField[Any],
        ]
    ]

    def to_dict(self) -> Dict[str, Any]:
        """Convert Schema to dictionary format for JsonCssExtractionStrategy"""

        def field_to_dict(field: BaseField) -> Dict[str, Any]:
            result: Dict[str, Any] = {"name": field.name, "type": field.type.value}

            if field.selector:
                result["selector"] = field.selector

            if field.transform:
                result["transform"] = field.transform.value

            if field.default is not None:
                result["default"] = field.default

            if isinstance(field, AttributeField):
                result["attribute"] = field.attribute

            if isinstance(field, RegexField):
                result["pattern"] = field.pattern

            if isinstance(field, ComputedField):
                if field.expression:
                    result["expression"] = field.expression
                if field.function:
                    result["function"] = field.function

            if isinstance(field, (ListField, NestedField, NestedListField)):
                result["fields"] = [field_to_dict(f) for f in field.fields]

            return result

        return {
            "name": self.name,
            "baseSelector": self.baseSelector,
            "fields": [field_to_dict(f) for f in self.fields],
        }

Example usage:

schema = Schema(
    name="Example",
    baseSelector="#detail",
    fields=[
        NestedField(
            name="contact",
            selector="#detail-right",
            fields=[
                TextField(name="agency", selector="h3:first-of-type"),
                TextField(name="agent", selector="h3:last-of-type"),
                TextField(name="phone", selector="ul li:first-child"),
                TextField(name="city", selector="ul li:nth-child(2)"),
                TextField(name="address", selector=".address"),
            ],
        ),
        ListField(
            name="images",
            selector=".detail-ad-info-photos ol li a",
            fields=[AttributeField(name="url", attribute="href")],
        ),
    ],
)

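# Requires crawl4ai to be installed.
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

extraction_strategy = JsonCssExtractionStrategy(schema.to_dict())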
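For reference, working through to_dict() by hand for the schema above should give roughly the following plain dict. This is a sketch derived from the code in this issue, not output captured from crawl4ai; the variable name `expected` is only for illustration.

```python
import json

# Hand-derived equivalent of schema.to_dict() for the example schema.
# Optional keys (selector, transform, default) appear only when they are set.
expected = {
    "name": "Example",
    "baseSelector": "#detail",
    "fields": [
        {
            "name": "contact",
            "type": "nested",
            "selector": "#detail-right",
            "fields": [
                {"name": "agency", "type": "text", "selector": "h3:first-of-type"},
                {"name": "agent", "type": "text", "selector": "h3:last-of-type"},
                {"name": "phone", "type": "text", "selector": "ul li:first-child"},
                {"name": "city", "type": "text", "selector": "ul li:nth-child(2)"},
                {"name": "address", "type": "text", "selector": ".address"},
            ],
        },
        {
            "name": "images",
            "type": "list",
            "selector": ".detail-ad-info-photos ol li a",
            # AttributeField was created without a selector, so only
            # "attribute" is emitted for it.
            "fields": [{"name": "url", "type": "attribute", "attribute": "href"}],
        },
    ],
}

print(json.dumps(expected, indent=2))
```

Comparing this against the typed version shows what the dataclasses buy: the "type" strings and per-type keys such as "attribute" are filled in by the classes rather than typed out by hand.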

arnm, Nov 04 '24 21:11

@arnm To be honest, this is very beautiful. I agree with you; even I have to check the code to remember what is expected. Would you like to create a pull request against the current version so we can continue the discussion there? I really like what you've done here. Also, feel free to join our Discord channels if you'd like: just send me your email address and we can continue the conversation, test this, and potentially add it to the next release.

I actually have two plans for this extraction strategy. One is to automate schema generation using a language model that analyzes web pages based on what you're looking for (e.g. "I want all agencies' contact details"). This would make the process automated and painless! The other item on the roadmap is a Chrome extension that lets users choose what they want and get the schema on the fly. Perhaps that is one of the reasons why I didn't invest more time in the current version.

If you're interested, you can join and help handle this. Let me know.

unclecode, Nov 06 '24 06:11