Support exporting a lot of content
Same as with the import (solved in https://github.com/collective/collective.exportimport/pull/4), exporting a lot of data would eat up all the memory on the machine. The python dict that holds the data during export, before it is written to a file as json, can be quite large if you choose to include base64-encoded binary data.
I have the use-case to export 60GB of content in files.
Options:
- Export the blob-path and load each blob from the filesystem. That could be quite efficient.
- Use https://pypi.org/project/jsonlines as the format and write one object at a time to the filesystem. This would require changes in the import since jsonlines is not readable by json or ijson.
- Fake using jsonlines by writing one object at a time into a file, but add a comma at the end of each line and wrap the whole thing in []. This would create a valid json file that the import could read (see the sketch below).
I am hoping for the first one, so that you could simply copy the blobstorage. Otherwise the third option seems best.
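A minimal sketch of that third option, assuming the export produces an iterable of json-compatible dicts (the function name and arguments here are made up for illustration):

```python
import json


def write_json_array(items, path):
    # Write an iterable of dicts to a valid .json file one item at a time,
    # so the complete list never has to be held in memory.
    with open(path, "w") as f:
        f.write("[")
        for index, item in enumerate(items):
            if index:
                f.write(",\n")
            json.dump(item, f)
        f.write("]")
```

Only one item is serialized at a time, and the result is still a normal json list that the existing import can read.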
collective.jsonify solved this by dumping each item in its own json file, and putting 1000 files together in one directory.
Maybe export 1000 items per json file, or stop at 100 MB or at 1 GB, and create Documents_000_999.json, Documents_1000_1999.json, etc. Those files cannot really be downloaded, unless you zip them at the end, which might again trip a memory limit.
For saving on the server this would be fine though.
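A rough sketch of that chunking idea, again with invented names (exported_items stands for whatever iterable of serialized items the export yields, and the file-name pattern is only illustrative):

```python
import json


def export_in_chunks(exported_items, chunk_size=1000):
    # Write at most chunk_size items per file, e.g.
    # Documents_0000_0999.json, Documents_1000_1999.json, ...
    batch, start = [], 0
    for item in exported_items:
        batch.append(item)
        if len(batch) == chunk_size:
            _write_chunk(batch, start)
            start += len(batch)
            batch = []
    if batch:
        _write_chunk(batch, start)


def _write_chunk(batch, start):
    end = start + len(batch) - 1
    with open("Documents_%04d_%04d.json" % (start, end), "w") as f:
        json.dump(batch, f)
```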
I think at some point, next to the more granular, split-up approach we have now, we will want one button that exports everything (all types and all extras like relations and members) and saves these files to the server, and one button that imports all those files (import/*.json) in the proper order (first members, then folders, then other content, then the other extras).
Export the blob-path and load each blob from the filesystem. That could be quite efficient.
This would be a "post-export" step that replaces, in the json already on disk, the blob reference with the base64-encoded binary data itself?
Use https://pypi.org/project/jsonlines as format and write one object at a time to the filesystem.
You'll need to pin this in setup.py to 1.2.0 since it's the last version that supports Python 2.
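For illustration, that pin might look like this in a policy package's setup.py (the package name, version and other arguments are placeholders):

```python
from setuptools import setup

setup(
    name="my.exportpolicy",  # placeholder package name
    version="1.0",
    install_requires=[
        "collective.exportimport",
        "jsonlines==1.2.0",  # last jsonlines release that still supports Python 2
    ],
)
```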
collective.jsonify solved this by dumping each item in its own json file, and putting 1000 files together in one directory.
How hard would it be to adapt the import side if this were done? This has the advantage of using the same "API" that collective.jsonify is using (the exported JSON is a collective.transmogrifier-friendly format).
An "ugly" workaround about the memory problem is using a custom exporter, and a "counter" (or even a date object containing when the export began) somewhere in Plone (like a portal_property) that is read when running the custom exporter . You make a query that orders by modified date, from oldest do newest, and the exporter that you call multiple times with a custom range (like 10k of objects) will export only those after the counter/date object.
... or to avoid using the counter to see what can be exported or not, you can use an idea like obj.modification_date_migrated, but obj.exported and a subscriber that sets it to False when it's modified (if an object was modified, it needs to be exported again).
Both approaches have the disadvantage of needing to call export view multiple times (the import too) but seems doable.
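A sketch of the counter/date variant, assuming the marker is stored somewhere like a portal property and using plone.api for brevity (the function name and the batch size are invented):

```python
from plone import api


def query_next_batch(last_export_date, batch_size=10000):
    # Query objects modified on or after the stored marker, oldest first,
    # and return at most batch_size brains for this export run.
    catalog = api.portal.get_tool("portal_catalog")
    brains = catalog(
        sort_on="modified",
        sort_order="ascending",
        modified={"query": last_export_date, "range": "min"},
    )
    return brains[:batch_size]
```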
I ran into this issue today where writing a resulting 7.2Gb File.json to the server directory (var) took almost 16Gb memory. 8Gb was not enough.
With some help from @mauritsvanrees I was able to turn the export_content method into a generator and json.dumps/write each content item to a file stream. Memory usage stayed <10% on the 16Gb provisioned server. I'll create a branch/pull request in a few days.
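Roughly the shape of that change, as a hedged sketch rather than the actual patch (brains and serialize stand in for whatever the real view uses):

```python
import json


def export_content(brains, serialize):
    # Yield one json-compatible dict per content item instead of
    # collecting everything in one huge in-memory list.
    for brain in brains:
        yield serialize(brain.getObject())


def stream_to_file(items, filepath):
    # json.dump each item into the open file as soon as it is produced,
    # so memory use stays flat regardless of the export size.
    with open(filepath, "w") as f:
        f.write("[")
        for index, item in enumerate(items):
            if index:
                f.write(",")
            json.dump(item, f)
        f.write("]")
```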
Doing the same for downloading through the browser might be a bit trickier; I don't know if/how a normal browser view supports any streaming. Maybe we need to take inspiration from how plone.namedfile supports streaming blobs. The view currently does:
return response.write(safe_bytes(data))
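One possible direction, as an untested sketch of such a browser view: write the json to the response piece by piece instead of building one big string first (export_items is a placeholder for a generator like the one sketched above):

```python
import json

from Products.Five import BrowserView


class StreamingExportView(BrowserView):
    """Untested sketch: stream the export to the browser chunk by chunk."""

    def __call__(self):
        response = self.request.response
        response.setHeader("Content-Type", "application/json")
        response.setHeader(
            "Content-Disposition", "attachment; filename=export.json"
        )
        response.write(b"[")
        for index, item in enumerate(self.export_items()):
            if index:
                response.write(b",")
            response.write(json.dumps(item).encode("utf-8"))
        response.write(b"]")

    def export_items(self):
        # Placeholder: yield json-compatible dicts, e.g. from a generator
        # like the export_content sketch earlier in this thread.
        raise NotImplementedError
```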
I have run into one other issue with supporting a lot of content: on import 'something' is generating temp files in /tmp for every single File that is imported. And they are not cleaned up, at least not until the end of the transaction. So with 10Gb of File data, you need 10+ Gb of space in /tmp, or you'll see a pretty 'No space left on device'. And most servers don't have an unlimited /tmp directory and often have a separate mount for it out of security/DoS concerns.
I have now rerouted TMPDIR to a mounted volume with lots more space on the virtual machine I'm running the import on, but this 'should not happen'(tm).
- It is not the (re)indexer that is doing this; I have disabled indexing with collective.noindexing. I suspect the deserializer of plone.restapi.
- There is support for 'Resumable uploads' in plone.restapi, which has its own TUS_TMP_FILE_DIR environment variable. I also configured this one, but to a different directory, and that directory doesn't get filled either...
Here are my custom serializers to export the blob-path instead of base64-encoded bytes:
Import is next.
# -*- coding: UTF-8 -*-
from customexport.interfaces import ICustomexportLayer
from plone.app.blob.interfaces import IBlobField
from plone.app.blob.interfaces import IBlobImageField
from plone.dexterity.interfaces import IDexterityContent
from plone.namedfile.interfaces import INamedFileField
from plone.namedfile.interfaces import INamedImageField
from plone.restapi.interfaces import IFieldSerializer
from plone.restapi.serializer.atfields import DefaultFieldSerializer as ATDefaultFieldSerializer
from plone.restapi.serializer.converters import json_compatible
from plone.restapi.serializer.dxfields import DefaultFieldSerializer
from Products.Archetypes.interfaces import IBaseObject
from zope.component import adapter
from zope.interface import implementer


def get_at_blob_path(obj):
    oid = obj.getBlob()._p_oid
    tid = obj._p_serial
    db = obj._p_jar.db()
    fshelper = db._storage.fshelper
    blobfilepath = fshelper.layout.getBlobFilePath(oid, tid)
    return blobfilepath


def get_dx_blob_path(obj):
    oid = obj._blob._p_oid
    tid = obj._p_serial
    db = obj._p_jar.db()
    fshelper = db._storage.fshelper
    blobfilepath = fshelper.layout.getBlobFilePath(oid, tid)
    return blobfilepath


@adapter(IBlobImageField, IBaseObject, ICustomexportLayer)
@implementer(IFieldSerializer)
class ATImageFieldSerializerWithBlobs(ATDefaultFieldSerializer):
    def __call__(self):
        file_obj = self.field.get(self.context)
        if not file_obj:
            return None
        blobfilepath = get_at_blob_path(file_obj)
        result = {
            "filename": self.field.getFilename(self.context),
            "content-type": self.field.getContentType(self.context),
            "blob_path": blobfilepath,
        }
        return json_compatible(result)


@adapter(IBlobField, IBaseObject, ICustomexportLayer)
@implementer(IFieldSerializer)
class ATFileFieldSerializerWithBlobs(ATDefaultFieldSerializer):
    def __call__(self):
        file_obj = self.field.get(self.context)
        if not file_obj:
            return None
        blobfilepath = get_at_blob_path(file_obj)
        result = {
            "filename": self.field.getFilename(self.context),
            "content-type": self.field.getContentType(self.context),
            "blob_path": blobfilepath,
        }
        return json_compatible(result)


@adapter(INamedFileField, IDexterityContent, ICustomexportLayer)
@implementer(IFieldSerializer)
class DXFileFieldSerializer(DefaultFieldSerializer):
    def __call__(self):
        namedfile = self.field.get(self.context)
        if namedfile is None:
            return None
        blobfilepath = get_dx_blob_path(namedfile)
        result = {
            "filename": namedfile.filename,
            "content-type": namedfile.contentType,
            "size": namedfile.getSize(),
            "blob_path": blobfilepath,
        }
        return json_compatible(result)


@adapter(INamedImageField, IDexterityContent, ICustomexportLayer)
@implementer(IFieldSerializer)
class DXImageFieldSerializer(DefaultFieldSerializer):
    def __call__(self):
        image = self.field.get(self.context)
        if image is None:
            return None
        blobfilepath = get_dx_blob_path(image)
        width, height = image.getImageSize()
        result = {
            "filename": image.filename,
            "content-type": image.contentType,
            "size": image.getSize(),
            "width": width,
            "height": height,
            "blob_path": blobfilepath,
        }
        return json_compatible(result)
Register the serializers in configure.zcml:
<!-- Custom AT Serializers -->
<adapter zcml:condition="installed plone.app.blob"
         factory=".serializer.ATFileFieldSerializerWithBlobs" />
<adapter zcml:condition="installed Products.Archetypes"
         factory=".serializer.ATImageFieldSerializerWithBlobs" />

<!-- Custom DX Serializers -->
<adapter zcml:condition="installed plone.dexterity"
         factory=".serializer.DXFileFieldSerializer" />
<adapter zcml:condition="installed plone.dexterity"
         factory=".serializer.DXImageFieldSerializer" />
Still have to create a PR with my streaming export fixes.... :-$
I updated the export (see above) since getting the right oid was not as straightforward as I had thought.
And finally here is the rather simple import and it works like a charm. I check the blob-path in the dict_hook to prevent creating files without blobs.
from collective.exportimport.import_content import ImportContent
from logging import getLogger
from pathlib import Path
from plone.namedfile.file import NamedBlobFile
from plone.namedfile.file import NamedBlobImage

logger = getLogger(__name__)

BLOB_HOME = "/Users/pbauer/workspace/projectx/plone4/var/blobstorage"


class CustomImportContent(ImportContent):

    def dict_hook_file(self, item):
        blob_path = item["file"]["blob_path"]
        abs_blob_path = Path(BLOB_HOME) / blob_path
        if not abs_blob_path.exists():
            logger.info(f"Blob path for {item['@id']} does not exist!")
            return
        return item

    def dict_hook_image(self, item):
        blob_path = item["image"]["blob_path"]
        abs_blob_path = Path(BLOB_HOME) / blob_path
        if not abs_blob_path.exists():
            logger.info(f"Blob path for {item['@id']} does not exist!")
            return
        return item

    def obj_hook_file(self, obj, item):
        blob_path = item["file"]["blob_path"]
        abs_blob_path = Path(BLOB_HOME) / blob_path
        filename = item["file"]["filename"]
        content_type = item["file"]["content-type"]
        obj.file = NamedBlobFile(
            data=abs_blob_path.read_bytes(),
            contentType=content_type,
            filename=filename,
        )

    def obj_hook_image(self, obj, item):
        blob_path = item["image"]["blob_path"]
        abs_blob_path = Path(BLOB_HOME) / blob_path
        filename = item["image"]["filename"]
        content_type = item["image"]["content-type"]
        obj.image = NamedBlobImage(
            data=abs_blob_path.read_bytes(),
            contentType=content_type,
            filename=filename,
        )
@pbauer See https://github.com/collective/collective.exportimport/tree/yield_export with the small modifications needed to make the export content use a generator. This avoids collecting all json in a python string, which ballooned the python process to the size of the export json file.
I merged https://github.com/collective/collective.exportimport/pull/41 so now we have working solutions to export and import very large json-files. I still strongly suggest using my approach with blob_path when dealing with a lot of binary data.
Yesterday I exported a complete site with 200000 content items (10GB Data.fs and 90GB blobs) to a 600MB json file.
For these blob sizes inline blobs are out of the question anyway. As we discussed previously: I noticed exponential time growth when importing larger Files. My import with a 7Gb Files.json froze for 14-20 minutes on one 340Mb file. That is probably also avoided completely by restoring blobstorage paths instead.
[edit] inline storage is nice for small/medium sites with a few Gb of blobstorage and for archival purposes.
From @fredvd:
I have run into one other issue with supporting a lot of content: On import 'something' is generating temp files in /tmp for every single File that is imported.
Good news, there's a workaround. Make your import script or Plone instance script use this environment variable:
TMPDIR=~/tmp ./optplone/deployments/601a/bin/client1 import mysite.com ~/srvplone/601a/client1/import
I had to nail this because on my system /tmp is a RAM file system that cannot be overridden and cannot be shadowed by a bind mount atop it.
EDIT: this should be added to the documentation!
@Rudd-O This is indeed the workaround, but I think I also mentioned TMPDIR in the comment you are referencing.
However, this was in June, and I think 1-2 months later Philip finished another feature to support referencing the blob files in the exported json instead of including them as base64. See his comment here from September. Then on import you configure a pointer to the existing source blobstorage; the import step reads the reference from the json and copies/recreates the blob in the target blobstorage from the source blobstorage. This speeds up the import process tremendously and also skips the TMPDIR 'pollution'.
Storing blobs inline as base64 is only useful for staging/copying/archiving partial content structures where you want to keep all information in the .json, but IMHO not for full migrations.
Yes, tutorial-style 'documentation' that demonstrates a full run on a typical site is lacking. But we also consider c.exportimport not to be feature-complete yet, although it's a catch-22 for getting more attention for the tool and hopefully also more contributors to code and documentation.
If we add some remarks about TMPDIR to the main README I think we can close this issue as the biggest problems identified are solved.
https://github.com/collective/collective.exportimport/#notes-on-speed-and-large-migrations has a note on TMPDIR
Note I had to use the TMPDIR trick even when using blob references in the exported file. It still would fill up my /tmp.