Override constructor with transformation function
This generalises #17.
Currently any class not special-cased by the library must have a constructor that takes a single string.
Even for built-in types this leads to workarounds like the following.
import datetime

class Time(datetime.time):
    def __new__(cls, value: str):
        hour, minute = value.split(":")
        return super().__new__(cls, int(hour), int(minute))
I think supporting transformation functions would add much-needed flexibility.
One possible syntax might be this.
def strptime(value: str):
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))

reader.map('time').using(strptime)  # With or without .to()
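To make the proposal concrete, here is a minimal standalone sketch of how a chained map(...).to(...).using(...) call could behave. SketchReader, ColumnMapper, and Event are hypothetical names invented for illustration; none of this is dataclass-csv code, and the real library may well implement the idea differently.

```python
import csv
import dataclasses
import datetime
import io


class ColumnMapper:
    """Hypothetical object returned by SketchReader.map()."""

    def __init__(self, reader, column):
        self._reader = reader
        self._column = column

    def to(self, field_name):
        # Remember that this CSV column feeds a differently named field.
        self._reader._fields[self._column] = field_name
        return self

    def using(self, fn):
        # Remember the function that converts the raw string.
        self._reader._converters[self._column] = fn
        return self


class SketchReader:
    """Hypothetical reader; a stand-in, not the library's DataclassReader."""

    def __init__(self, f, cls):
        self._rows = csv.DictReader(f)
        self._cls = cls
        self._types = {fld.name: fld.type for fld in dataclasses.fields(cls)}
        self._fields = {}      # CSV column -> dataclass field
        self._converters = {}  # CSV column -> transformation function

    def map(self, column):
        return ColumnMapper(self, column)

    def __iter__(self):
        for row in self._rows:
            kwargs = {}
            for column, raw in row.items():
                field_name = self._fields.get(column, column)
                # Without a converter, call the annotated type on the string.
                convert = self._converters.get(column) or self._types[field_name]
                kwargs[field_name] = convert(raw)
            yield self._cls(**kwargs)


def strptime(value: str) -> datetime.time:
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))


@dataclasses.dataclass
class Event:
    name: str
    start: datetime.time


reader = SketchReader(io.StringIO("name,time\nstandup,09:30\n"), Event)
reader.map("time").to("start").using(strptime)
events = list(reader)
print(events)
```

In this sketch the argument to map() is always the CSV column name, and .to() is optional when the column and the field share a name.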
I have actually been working on adding similar functionality, but only for datetime values, where I see the most use cases. I'm implementing it through a new decorator, temporarily called dtfunc, which lets you specify a function that will be used to parse every datetime value.
The reason I have implemented this only for datetime and not for every type is that similar functionality can be achieved by overriding __post_init__ in the dataclass and modifying the value there.
For instance, if I had a CSV file with a column firstname that I wanted to map to the field name in a dataclass User, converting every value to uppercase, I could do:
CSV:
firstname
daniel
Code:
from dataclass_csv import DataclassReader
from dataclasses import dataclass

@dataclass
class User:
    name: str

    def __post_init__(self):
        self.name = self.name.upper()

def main():
    with open("users.csv") as f:
        reader = DataclassReader(f, User)
        reader.map("firstname").to("name")
        data = list(reader)
        print(data)

if __name__ == "__main__":
    main()
Output:
[User(name='DANIEL')]
That way it doesn't change the way of working with dataclasses very much. The __post_init__ method is called once for every row that is processed, so it doesn't slow down the creation of the dataclass instances.
Your upper example does not change the type of the attribute. My use-case is classes that cannot be instantiated with a string.
It is possible to use __post_init__ for this. But every such method needs to branch to support both strings and the actual type of the attribute. Otherwise the dataclass becomes unusable by code not related to CSV parsing. I feel like that defeats the purpose of the library.
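A minimal sketch of the branching described above, assuming a hypothetical Shift dataclass with a single datetime.time field (both names are made up for illustration):

```python
import dataclasses
import datetime


@dataclasses.dataclass
class Shift:
    start: datetime.time

    def __post_init__(self):
        # CSV parsing hands over a string; ordinary code passes a datetime.time.
        # The method has to accept both, or the class only works for CSV input.
        if isinstance(self.start, str):
            hour, minute = self.start.split(":")
            self.start = datetime.time(int(hour), int(minute))


from_csv = Shift("09:30")
from_code = Shift(datetime.time(9, 30))
print(from_csv == from_code)
```

Every field whose type cannot be built from a string would need its own branch like this one, which is the overhead a transformation function would avoid.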
@dfurtado Thank you for creating this library. It is great and helped me quickly convert my dataclass to a CSV file.
However I ran into a situation while writing the CSV file and I think @tewe's implementation suggestion could also work while writing. Consider this example:
import sys
from typing import List
from dataclasses import dataclass, field
from dataclass_csv import DataclassWriter

@dataclass
class Score:
    subject: str
    grade: str

    def __str__(self):
        return "{} - {}".format(self.subject, self.grade)

@dataclass
class Student:
    name: str
    scores: List[Score] = field(default_factory=list)

    def __str__(self):
        return self.name

s = Student(
    name="Student 1",
    scores=[Score(subject="Science", grade="A"), Score(subject="Math", grade="A")],
)

with sys.stdout as csv_file:
    writer = DataclassWriter(csv_file, [s], Student)
    writer.write()
The output for this is:
name,scores
Student 1,"[('Science', 'A'), ('Math', 'A')]"
Following @tewe's example, to achieve an output like this:
name,scores
Student 1,Science-A|Math-A
We could do something like this:
def format_scores(value) -> str:
    return "|".join(["{}-{}".format(item.subject, item.grade) for item in value])

with sys.stdout as csv_file:
    writer = DataclassWriter(csv_file, [s], Student)
    writer.map("scores").using(format_scores)
    writer.write()
Or is there a better way to achieve this?
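Until something like .using() exists on the writer, one possible workaround is to skip DataclassWriter for such fields and write the rows with the standard csv module. The sketch below assumes a hypothetical helper write_rows that takes a dict of per-field formatters; it is not part of dataclass-csv.

```python
import csv
import dataclasses
import io
from typing import List


@dataclasses.dataclass
class Score:
    subject: str
    grade: str


@dataclasses.dataclass
class Student:
    name: str
    scores: List[Score] = dataclasses.field(default_factory=list)


def format_scores(value) -> str:
    return "|".join("{}-{}".format(item.subject, item.grade) for item in value)


def write_rows(f, items, cls, formatters):
    """Hypothetical helper: write dataclass instances, one formatter per field."""
    names = [fld.name for fld in dataclasses.fields(cls)]
    writer = csv.writer(f)
    writer.writerow(names)
    for item in items:
        # Fields without a formatter fall back to str().
        writer.writerow(
            [formatters.get(name, str)(getattr(item, name)) for name in names]
        )


s = Student(name="Student 1", scores=[Score("Science", "A"), Score("Math", "A")])
buffer = io.StringIO()
write_rows(buffer, [s], Student, {"scores": format_scores})
print(buffer.getvalue())
```

The trade-off is that this bypasses whatever the library's writer does for other field types, so it only suits simple cases.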
Hello @karthicraghupathi, thanks a lot for the kind words about my project. I really appreciate it.
Yes, I like this solution. I am actually working on something along those lines, trying out different solutions. I want to do something that will not feel unfamiliar when it comes to dataclasses usage and also the usage of the dataclass-csv package.
I'll ping you in this issue when I have something done.
I got this error when using .using.
Traceback (most recent call last):
File "D:\code\python\cnki_crawler_playwright\main.py", line 202, in <module>
main()
File "D:\code\python\cnki_crawler_playwright\main.py", line 188, in main
w.map("paper_name").using(format_link)
AttributeError: 'HeaderMapper' object has no attribute 'using'
And from the code, it seems that using does not exist.
https://github.com/dfurtado/dataclass-csv/blob/2dc71be81cb253eb10aba5ba70c6cebe42ab0301/dataclass_csv/header_mapper.py#L4-L19
https://github.com/dfurtado/dataclass-csv/blob/2dc71be81cb253eb10aba5ba70c6cebe42ab0301/dataclass_csv/field_mapper.py#L4-L18
@liudonghua123 You are right. It does not exist yet. This thread is to discuss @tewe's proposal and other ways of achieving that. We'll need to wait on @dfurtado to see which direction they take.
Hi @karthicraghupathi and @liudonghua123 thanks for the contribution to this thread and using the lib.
Yes, I see the use case for this for sure. I will try to put something together and create a PR.
I think the first suggestion seems great; however, it might make the API a bit complicated. The .map function is used when a column in the CSV file is named differently from the dataclass field. E.g.:
Let's say we have a column First Name in the CSV and the dataclass field is defined as firstname:
reader.map("First Name").to("firstname")
In a case where the names don't differ, it would make the API inconsistent, since the argument to map is the name of the column in the CSV, e.g.:
reader.map("firstname").using(fn)
So it would be difficult to use .map in these cases. Perhaps we would need a second function like reader.transform("field").using(fn), or a decorator, e.g.:
from dataclasses import dataclass
from dataclass_csv import transform  # proposed; does not exist yet

def fn(value):
    ...

@dataclass
@transform("firstname", fn)
class User:
    firstname: str
    lastname: str
Please, share your thoughts about these solutions.
To me "map using" isn't any less intuitive than "map to using", so I'd avoid introducing another name like transform.
The decorator way breaks down when you have two kinds of CSV you want to map to the same class.
@dfurtado thanks for continuing to work on this. I agree with @tewe. It just feels intuitive and pythonic when I see map.using() or map.to().using().
Hello 👋🏼 ,
As I have explained above, having something like reader.map("name").using(fn), with name being the name of the dataclass property, would be a breaking change, since the argument of .map is the name of the column in the CSV file. I really don't want to change that because I know there is a lot of code out there that would break.
It could work to do something like reader.map("name").to("name").using(fn); however, it would look strange, especially when the dataclass property name matches the name of the column in the CSV file. In that case it would require users to add code that is unnecessary and repetitive.
It would be fine for reader.map("First name").to("name").using(fn), but when the names match it seems wrong to write that explicitly when the lib does all the mapping automatically.
I have to put more thought into this one to find a good solution that will look nice without breaking the current functionality. 🤔
Sorry, I didn't catch that distinction the first time. But what's wrong with reader.map("csv_column_that_matches_a_dataclass_attribute").using(f)?
I have another question: is there any way to split the properties of a complex object into different columns?
Say I have the following classes to serialize to CSV.
@dataclass
class Link:
    title: str
    url: str

@dataclass
class SearchResult:
    paper_name: Link
    authors: list[Link]
    publication: Link
I would expect paper_name to be split into paper_name.title and paper_name.url columns.
@liudonghua123 I think that is a separate issue.
I haven't tested whether mapping a field twice already works.
writer.map("paper_name").to("title")
writer.map("paper_name").to("url")
But you'd additionally need something like the proposed API.
writer.map("paper_name").to("url").using(lambda n: f"https://doi/{n}")
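As a point of comparison, the splitting can be done today outside the library by flattening each object before writing it with the standard csv module. This is only a sketch: flatten is a hypothetical helper, the example.org URLs are placeholder data, and the list-valued authors field is left out because it needs a separate decision about how a list maps to columns.

```python
import csv
import dataclasses
import io


@dataclasses.dataclass
class Link:
    title: str
    url: str


@dataclasses.dataclass
class SearchResult:
    paper_name: Link
    publication: Link


def flatten(item):
    """Hypothetical helper: one dotted key per leaf value of a dataclass."""
    row = {}
    for fld in dataclasses.fields(item):
        value = getattr(item, fld.name)
        if dataclasses.is_dataclass(value):
            # Nested dataclass: prefix its keys with the parent field name.
            for key, leaf in flatten(value).items():
                row[f"{fld.name}.{key}"] = leaf
        else:
            row[fld.name] = value
    return row


result = SearchResult(
    paper_name=Link("A Paper", "https://example.org/paper"),
    publication=Link("A Journal", "https://example.org/journal"),
)
row = flatten(result)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```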
@tewe Thanks, I will open a new issue to track. 😄
Hey, I took a look at this, and it's currently possible to do this with an ordinary function by overriding the type_hints attribute on the reader object.
test.csv:
name,values
A,1;2;3
B,8;9
C,3
then run:
from dataclass_csv import DataclassReader
import dataclasses

@dataclasses.dataclass
class Variable:
    name: str
    values: list[int]

fh = open("test.csv")
reader = DataclassReader(fh, Variable)

# define our conversion function
read_vals = lambda s: [int(x) for x in s.split(";")]

# monkey patch reader
reader.type_hints["values"] = read_vals

for var in reader:
    print(var)
Of course you can package them up in a nice method if you want :)
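One way that packaging might look is sketched below. use_converter is a hypothetical helper, and it relies on type_hints being a plain dict on the reader, as in the snippet above; that attribute is not a documented API and could change. A stand-in object is used so the sketch runs without the library installed.

```python
import types


def use_converter(reader, field_name, fn):
    """Hypothetical helper: have `reader` call `fn` for `field_name`."""
    if field_name not in reader.type_hints:
        raise KeyError(f"unknown dataclass field: {field_name!r}")
    reader.type_hints[field_name] = fn
    return reader


def split_ints(s):
    return [int(x) for x in s.split(";")]


# Stand-in for a reader; with dataclass-csv you would pass the
# DataclassReader instance instead (assumption, see the note above).
fake_reader = types.SimpleNamespace(type_hints={"name": str, "values": list})
use_converter(fake_reader, "values", split_ints)
print(fake_reader.type_hints["values"]("1;2;3"))
```

Checking the field name up front turns a typo into an immediate KeyError instead of a silently ignored converter.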