dataclass-csv Override constructor with transformation function

This generalises #17.

Currently any class not special-cased by the library must have a constructor that takes a single string.

Even for built-in types this leads to workarounds like the following.

import datetime

class Time(datetime.time):
    def __new__(cls, value: str):
        # Parse "HH:MM" because the library calls the constructor with a single string.
        hour, minute = value.split(":")
        return super().__new__(cls, int(hour), int(minute))

I think supporting transformation functions would add much-needed flexibility.

One possible syntax might be this.

def strptime(value: str):
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))

reader.map('time').using(strptime)  # With or without .to()
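
To make the idea concrete without relying on any dataclass-csv internals, here is a minimal standard-library sketch of per-column transformation functions. The read_rows helper, its converters parameter, and the Event class are hypothetical names used only for illustration; none of them is part of the library.

```python
import csv
import datetime
import io
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    time: datetime.time


def parse_time(value: str) -> datetime.time:
    # Same "HH:MM" parsing as the function above.
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))


def read_rows(f, cls, converters):
    """Yield cls instances, passing each column through its converter if one is given."""
    for row in csv.DictReader(f):
        # Columns without a converter are kept as plain strings.
        yield cls(**{key: converters.get(key, str)(value) for key, value in row.items()})


csv_text = "name,time\nstandup,09:30\n"
events = list(read_rows(io.StringIO(csv_text), Event, {"time": parse_time}))
```

Because the converters are passed per reader rather than attached to the class, Event itself stays free of any CSV-specific constructor.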

Jul 23 '20 14:07 tewe

I have actually been working on adding similar functionality, but only for datetime values, where I could see the most use cases. I'm implementing it through a new decorator, (temporarily) called dtfunc, with which you can specify a function that will be used to parse every datetime value.

The reason I have implemented this only for datetime and not for every type is that similar functionality can be achieved by overriding __post_init__ in the dataclass and modifying the value there.

For instance, if I had a CSV file with a column firstname that I wanted to map to the field name in a dataclass User, converting every value to uppercase, I could do:

CSV:

firstname
daniel

Code:

from dataclass_csv import DataclassReader
from dataclasses import dataclass

@dataclass
class User:
    name: str

    def __post_init__(self):
        self.name = self.name.upper()


def main():
    with open("users.csv") as f:
        reader = DataclassReader(f, User)
        reader.map("firstname").to("name")
        data = list(reader)
        print(data)


if __name__ == "__main__":
    main()

Output:

[User(name='DANIEL')]

That way the usual way of working with dataclasses barely changes. The __post_init__ method is called for every row that is processed, so it doesn't slow down the creation of the dataclass instances.

Jul 27 '20 18:07 dfurtado

Your upper example does not change the type of the attribute. My use-case is classes that cannot be instantiated with a string.

It is possible to use __post_init__ for this. But every such method needs to branch to support both strings and the actual type of the attribute. Otherwise the dataclass becomes unusable by code not related to CSV parsing. I feel like that defeats the purpose of the library.
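
A minimal sketch of the branching described here, using only the standard library. The Appointment class is a made-up example, and the loose Union annotation is an assumption for illustration; how the field would need to be annotated for dataclass-csv to pass the raw string through is not covered.

```python
import datetime
from dataclasses import dataclass
from typing import Union


@dataclass
class Appointment:
    # Loosely annotated only because a CSV reader would hand over a string.
    time: Union[str, datetime.time]

    def __post_init__(self):
        # Branch: convert only when the value arrived as raw CSV text.
        if isinstance(self.time, str):
            hour, minute = self.time.split(":")
            self.time = datetime.time(int(hour), int(minute))


from_csv = Appointment("14:07")
from_code = Appointment(datetime.time(14, 7))
```

Every field with a non-string type needs its own isinstance check like this, which is the overhead a transformation function would avoid.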

Jul 28 '20 00:07 tewe

@dfurtado Thank you for creating this library. It is great and helped me quickly convert my dataclass to a CSV file.

However I ran into a situation while writing the CSV file and I think @tewe's implementation suggestion could also work while writing. Consider this example:

import sys
from typing import List

from dataclasses import dataclass, field
from dataclass_csv import DataclassWriter


@dataclass
class Score:
    subject: str
    grade: str

    def __str__(self):
        return "{} - {}".format(self.subject, self.grade)


@dataclass
class Student:
    name: str
    scores: List[Score] = field(default_factory=list)

    def __str__(self):
        return self.name


s = Student(
    name="Student 1",
    scores=[Score(subject="Science", grade="A"), Score(subject="Math", grade="A")],
)

with sys.stdout as csv_file:
    writer = DataclassWriter(csv_file, [s], Student)
    writer.write()

The output for this is:

name,scores
Student 1,"[('Science', 'A'), ('Math', 'A')]"

Following @tewe's example, to achieve an output like this:

name,scores
Student 1,Science-A|Math-A

We could do something like this:

def format_scores(value) -> str:
    return "|".join(["{}-{}".format(item.subject, item.grade) for item in value])

with sys.stdout as csv_file:
    writer = DataclassWriter(csv_file, [s], Student)
    writer.map("scores").using(format_scores)
    writer.write()

Or is there a better way to achieve this?

Feb 28 '22 02:02 karthicraghupathi

Hello @karthicraghupathi, thanks a lot for the kind words about my project. Really appreciate it.

Yes, I like this solution. I am actually working on something along those lines, trying out different solutions. I want to do something that will not feel unfamiliar when it comes to dataclasses usage and also the usage of the dataclass-csv package.

I'll ping you in this issue when I have something done.

Mar 02 '22 18:03 dfurtado

I got this error when using .using.

Traceback (most recent call last):
  File "D:\code\python\cnki_crawler_playwright\main.py", line 202, in <module>
    main()
  File "D:\code\python\cnki_crawler_playwright\main.py", line 188, in main
    w.map("paper_name").using(format_link)
AttributeError: 'HeaderMapper' object has no attribute 'using'

And from the code, it seems using does not exist.

https://github.com/dfurtado/dataclass-csv/blob/2dc71be81cb253eb10aba5ba70c6cebe42ab0301/dataclass_csv/header_mapper.py#L4-L19

https://github.com/dfurtado/dataclass-csv/blob/2dc71be81cb253eb10aba5ba70c6cebe42ab0301/dataclass_csv/field_mapper.py#L4-L18

Nov 08 '22 17:11 liudonghua123

@liudonghua123 You are right. It does not exist yet. This thread is to discuss @tewe's proposal and other ways of achieving that. We'll need to wait on @dfurtado to see which direction they take.

Nov 08 '22 18:11 karthicraghupathi

Hi @karthicraghupathi and @liudonghua123, thanks for contributing to this thread and for using the lib.

Yes, I see the use case for this for sure. I will try to put something together and create a PR.

I think the first suggestion seems great; however, it might make the API a bit complicated. The .map function is used when a column in the CSV file is named differently from the dataclass field. E.g.:

Let's say we have a column First Name in the CSV and the dataclass field is defined as firstname:

reader.map("First Name").to("firstname")

In a case where the names don't differ, it would make the API inconsistent, since the argument to map is the name of the column in the CSV, e.g.:

reader.map("firstname").using(fn)

So it would be difficult to use .map in these cases. Perhaps we would need a second function like reader.transform("field").using(fn), or a decorator, e.g.:


from dataclasses import dataclass

from dataclass_csv import transform  # proposed; does not exist yet


def fn(value):
    ...

@dataclass
@transform("firstname", fn)
class User:
    firstname: str
    lastname: str

Please, share your thoughts about these solutions.

Nov 09 '22 07:11 dfurtado

To me "map using" isn't any less intuitive than "map to using", so I'd avoid introducing another name like transform.

The decorator way breaks down when you have two kinds of CSV you want to map to the same class.

Nov 09 '22 11:11 tewe

@dfurtado thanks for continuing to work on this. I agree with @tewe. It just feels intuitive and pythonic when I see map.using() or map.to().using().

Nov 09 '22 16:11 karthicraghupathi

Hello 👋🏼 ,

As I have explained above, having something like reader.map("name").using(fn), with name being the name of the dataclass property, would be a breaking change, since the argument of .map is the name of the column in the CSV file. I really don't want to change that because I know there is a lot of code out there that would break.

It could work to do something like reader.map("name").to("name").using(fn); however, it would look strange, especially when the dataclass property name matches the name of the column in the CSV file. In this particular case it would require users to add code that is unnecessary and repetitive.

It would be fine for reader.map("First name").to("name").using(fn), but when the names match it seems wrong to write that explicitly when the lib does all the mapping automatically.

I have to put more thought into this one to find a good solution that will look nice without breaking the current functionality. 🤔

Nov 09 '22 19:11 dfurtado

Sorry, I didn't catch that distinction the first time. But what's wrong with reader.map("csv_column_that_matches_a_dataclass_attribute").using(f)?

Nov 10 '22 16:11 tewe

I have another question: is there any way to split the properties of a complex object into different columns?

Say I have the following classes to serialize to CSV.

@dataclass
class Link:
    title: str
    url: str
    
@dataclass
class SearchResult:
    paper_name: Link
    authors: list[Link]
    publication: Link

I would expect paper_name to be split into paper_name.title and paper_name.url columns.

Nov 15 '22 07:11 liudonghua123

@liudonghua123 I think that is a separate issue.

I haven't tried whether mapping a field twice already works.

writer.map("paper_name").to("title")
writer.map("paper_name").to("url")

But you'd additionally need something like the proposed API.

writer.map("paper_name").to("url").using(lambda n: f"https://doi/{n}")
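
Until something like that exists, one workaround outside the library is to flatten the nested objects yourself and write the rows with the standard csv module. This is only a sketch: the flatten helper and the dotted column names are assumptions for illustration, and the authors list is left out to keep it short.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Link:
    title: str
    url: str


@dataclass
class SearchResult:
    paper_name: Link
    publication: Link


def flatten(result: SearchResult) -> dict:
    """Turn each Link field into two flat columns, e.g. paper_name.title and paper_name.url."""
    row = {}
    for field_name in ("paper_name", "publication"):
        link = getattr(result, field_name)
        row[f"{field_name}.title"] = link.title
        row[f"{field_name}.url"] = link.url
    return row


results = [
    SearchResult(
        paper_name=Link("A paper", "https://example.org/a"),
        publication=Link("A journal", "https://example.org/j"),
    )
]
rows = [flatten(r) for r in results]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
output = buffer.getvalue()
```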

Nov 16 '22 02:11 tewe

@tewe Thanks, I will open a new issue to track. 😄

Nov 16 '22 02:11 liudonghua123

Hey, I took a look at this, and it's currently possible to do it with an ordinary function by overriding the type_hints attribute on the reader.

test.csv:

name,values
A,1;2;3
B,8;9
C,3

then run:

from dataclass_csv import DataclassReader
import dataclasses

@dataclasses.dataclass
class Variable:
    name: str
    values: list[int]


fh = open("test.csv")
reader = DataclassReader(fh, Variable)

# define our conversion function
read_vals = lambda s: [int(x) for x in s.split(";")]

# monkey patch reader
reader.type_hints["values"] = read_vals

for var in reader:
    print(var)

Of course you can package them up in a nice method if you want :)
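
One way that packaging might look, as a rough sketch. The use_converter helper is a hypothetical name, and it assumes the reader exposes a mutable type_hints mapping as in the snippet above; a stand-in object is used here so the example runs without the library installed.

```python
from types import SimpleNamespace


def use_converter(reader, field_name, fn):
    """Register fn as the converter for field_name and return the reader for chaining."""
    # Assumption: the reader has a mutable `type_hints` mapping keyed by field name.
    reader.type_hints[field_name] = fn
    return reader


def split_ints(s):
    # Convert "1;2;3" into [1, 2, 3].
    return [int(x) for x in s.split(";")]


# Stand-in for a DataclassReader instance, only for demonstration.
fake_reader = SimpleNamespace(type_hints={"name": str, "values": list})
use_converter(fake_reader, "values", split_ints)
converted = fake_reader.type_hints["values"]("1;2;3")
```

Since this relies on an attribute that is not documented as public API, it may stop working in a later release.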

Mar 23 '23 20:03 mgperry