Independently developed external pluggable PDF/CSV Extractors

Open sebnapi opened this issue 2 years ago • 8 comments

Is your feature request related to a problem? Please describe. While researching how to build my own extractor, I noticed that the architecture is a bit rigid. If people could develop their own extractors as plugins, I think your development and maintenance would become much easier, and external people could focus on their own extractor, keep it up to date, and maintain it. As a plugin maintainer, I would not have to rebuild the whole of portfolio-performance; I could just link the plugin to the current version and see whether it works. Ideally, portfolio-performance would not care which language the plugin is written in.

One thing to remember is that there is a multitude of broker output formats, yet every extractor must pass through your hands, so the process is centralized. A decentralized approach, on the other hand, could enable someone in Thailand or Nigeria, whose broker is totally unknown in Germany, to build an extractor with a little knowledge of Python and use portfolio-performance. They could share it on GitHub, and others with the same broker and a bit of knowledge could pick up the work.

Describe the solution you'd like. I'd like to plug my own extractor into portfolio-performance through a clear interface. It should give me the extracted text plus the raw data or a file pointer, and it should define a clear interface for all the actions you have come across so far.

sebnapi avatar Apr 23 '22 13:04 sebnapi

There was an attempt a while ago to implement a JSON plug-in, but @Nirus2000 and @buchen decided not to implement external plug-ins (e.g., via JSON) because some PDFs are too complicated to parse outside the Java universe. So @Nirus2000 removed the corresponding code from PP.

Morpheus1w3 avatar Apr 23 '22 17:04 Morpheus1w3

This is not entirely correct. It is still possible to create a JSON importer. We are currently working on simplifying and standardizing the PDF importers to make them very easy to customize. The goal is to provide ready-made sections so that only the relevant lines need to be parsed, while the actual work takes place in PDFExtractorUtils.java. This already works well for date, time, fees and taxes. A pull request for foreign currency is currently in progress. In your issue #2823, I demonstrated with two examples how easy the extension is.

Alex

Nirus2000 avatar Apr 24 '22 06:04 Nirus2000

@Nirus2000 It seems that you misunderstood the request. The community is looking for a way to create PDF importers without having to use Java/Eclipse, for instance.

The previous attempt at a JSON-driven PDF rule importer could read JSON PDF parsing rules from a local user directory. It was an attempt to move part of the Java environment into pluggable configuration files that can be stored on the user's machine. But that chance has passed.

Morpheus1w3 avatar Apr 24 '22 10:04 Morpheus1w3

No, I guess he didn't. I'm not looking for a config file where I can define the regex. I guess that gets out of hand quickly, as you mentioned.

I would like to link a program to Portfolio Performance (PP), telling PP to call it (my_little_cpp_extr -f 'abc.pdf' --extracted 'PDFBox...'), and it will return the models PP understands (possibly as JSON).

I'm not familiar with your project, but under the hood you must have some database models for each kind of transaction that can be tracked. Just accept those in an easy format. Just pass the file or a file pointer, plus the extracted text that you already create, and let the plugin developer do whatever they want with it.
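
To make the proposal concrete, here is a minimal sketch in Python of what such an external extractor could look like. Everything in it is an assumption for illustration: PP defines no such plugin protocol, the `-f` and `--extracted` flags simply mirror the command above, and the statement line format and the JSON field names (`source`, `transactions`, `type`, and so on) are made up.

```python
import argparse
import json
import re

# Hypothetical statement line, e.g. "2022-04-23 BUY 10 DE0007164600 1234.56 EUR".
# A real broker document will look different.
BUY_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) BUY (?P<shares>\d+(?:\.\d+)?) "
    r"(?P<isin>[A-Z]{2}[A-Z0-9]{10}) (?P<amount>\d+\.\d{2}) (?P<currency>[A-Z]{3})$"
)


def extract_transactions(text):
    """Turn text already extracted by the host into a list of plain dicts."""
    transactions = []
    for line in text.splitlines():
        match = BUY_LINE.match(line.strip())
        if match is None:
            continue  # skip headers, footers and anything we do not understand
        tx = match.groupdict()
        tx["type"] = "BUY"
        tx["shares"] = float(tx["shares"])
        tx["amount"] = float(tx["amount"])
        transactions.append(tx)
    return transactions


def main(argv):
    """Parse the (assumed) command line and return the result as a JSON string."""
    parser = argparse.ArgumentParser(description="Example external extractor")
    parser.add_argument("-f", "--file", required=True, help="path to the source document")
    parser.add_argument("--extracted", required=True, help="text already extracted by the host")
    args = parser.parse_args(argv)
    result = {"source": args.file, "transactions": extract_transactions(args.extracted)}
    return json.dumps(result)
```

The host would only need to start the process, pass the two arguments, and read the JSON back; printing the return value of `main(sys.argv[1:])` would turn the sketch into a command-line tool in any language-agnostic setup.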

sebnapi avatar Apr 24 '22 14:04 sebnapi

@sebnapi As a workaround, create a Python PDF parser for your broker and store the extracted results in the PP CSV format, which can be easily imported.
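
A rough sketch of the last step of that workaround, assuming the parser has already produced a list of transaction dicts. The column names and the `;` delimiter are assumptions chosen for illustration and would need to be adjusted to whatever the PP CSV import expects for your data.

```python
import csv
import io

# Assumed column headers; adapt them to the columns you map in the PP CSV import.
COLUMNS = ["Date", "Type", "Shares", "ISIN", "Value", "Currency"]


def transactions_to_csv(transactions):
    """Serialize parsed transactions into CSV text, one row per transaction."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter=";", lineterminator="\n")
    writer.writeheader()
    for tx in transactions:
        writer.writerow(
            {
                "Date": tx["date"],
                "Type": tx["type"],
                "Shares": tx["shares"],
                "ISIN": tx["isin"],
                "Value": tx["amount"],
                "Currency": tx["currency"],
            }
        )
    return buffer.getvalue()
```

Writing the returned text to a file with `open(path, "w", newline="")` gives a CSV that can then be loaded through the regular import dialog.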

kavatari avatar May 12 '22 19:05 kavatari

This would be a killer feature. At the moment I manually modify my bank downloaded CSV according to Portfolio Performance items. This sometimes takes a long time. An API interface to Portfolio Performance could enable users to build parsers for every bank account and share them with the community.

cla-azzarello avatar May 22 '22 12:05 cla-azzarello

At the moment I manually modify my bank downloaded CSV according to Portfolio Performance items.

@cla-azzarello Can you share what modifications to the CSV you are doing manually?

Today, one can already save a specific import configuration, export that configuration as JSON, and import it into another PP installation. What is not possible is manipulating entries.

buchen avatar Jun 19 '22 13:06 buchen

Reading through the project issues, it's a common situation that people want to do things beyond PP's capabilities. And somehow they always proceed to demand that PortfolioPerformance do that for them, or at least enable it for them through some special mechanism: plugin frameworks, REST APIs (#2085), custom scripts (#2374), or whatever. In all cases, PP is treated as the center of the universe, capable of granting or denying wishes. That's not a scalable or sustainable approach, given the cost of developing and maintaining all those features.

Instead, the only thing PP should do is stop holding the user data captive in its proprietary XML, Protocol Buffers, etc. formats. The user data, in a structured and accessible format, should be the center of the universe, with a collection of tools around it. PP would be the leading such tool, because for what it does, it's astonishingly, unbelievably great. But for what it doesn't do, it's of little help, due to the mentioned high complexity of developing software in Java.

So users should be able to work directly with the data, bypassing PP completely. That idea is not new, with https://github.com/portfolio-performance/portfolio/issues/2216 being older than this ticket. As is typical, no progress was made on that part, due to the same complexity of dealing with Java development. But a scalable path to achieve that was proposed: https://github.com/portfolio-performance/portfolio/issues/2216#issuecomment-1597780143 , and I have finally released some independent code that imports/exports PP XML data to/from a database, with good round-tripping capabilities. And of course, between import and export, you can perform any operations on the database, with any tooling that you like.

pfalcon avatar Sep 02 '23 13:09 pfalcon