JMSSerializerBundle

JMS Serializer performance issues with more than 10000 entries

Open malcomhamelin opened this issue 5 years ago • 6 comments

Currently I'm building a PHP command that can update my ElasticSearch indices.

But, a big thing I've noticed is that serializing entities when my array holds more than 10000 of them takes way too much time. I thought the cost would scale linearly, but serializing either 6k or 9k entities takes about a minute (not much difference between the two), while going past 10k slows things down to the point of taking up to 10 minutes.

...
    // we iterate over the documents previously fetched from the SQL database
    foreach ($entities as $index_name => $entity_array) {
        $underscoreClassName = $this->toUnderscore($index_name); // Elasticsearch expects underscored names
        $camelcaseClassName = $this->toCamelCase($index_name);   // SQL expects camel-cased names

        // we get the serialization groups for each index from the config file
        $groups = $indexesInfos[$underscoreClassName]['types'][$underscoreClassName]['serializer']['groups'];

        foreach ($entity_array as $entity) {
            // each entity is serialized to a JSON string
            $data = $this->serializer->serialize($entity, 'json', SerializationContext::create()->setGroups($groups));
            // each serialized entity is wrapped in an Elastica document
            $documents[$index_name][] = new \Elastica\Document($entityToFind[$index_name][$entity->getId()], $data);
        }
    }
...

There's a whole class around this, but that loop is what takes most of the time.

I get that serializing is a heavy operation and that it takes time, but why is there next to no difference between 6, 7, 8 or 9k entities, while above 10k it just takes so much longer?

PS: for reference, I've asked the same thing on Stack Overflow.

malcomhamelin avatar Jul 17 '19 12:07 malcomhamelin

The code you have posted mentions "entities"; are you using Doctrine?

If you are using Doctrine, I do not see any memory cleanup in the loops. This means memory grows on each iteration, since Doctrine has to instantiate and keep track of every visited object, so slowdowns are inevitable.

goetas avatar Jul 22 '19 14:07 goetas

Here is some info on how to clean up memory in Doctrine when dealing with big datasets: https://www.doctrine-project.org/projects/doctrine-orm/en/2.6/reference/batch-processing.html
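
For illustration, a rough sketch of that pattern applied to the loop above, assuming Doctrine ORM 2.8+ where Query::toIterable() is available (the 2.6 docs linked use iterate() instead). App\Entity\Product, $this->em and the batch size are placeholders, not anything from your code:

    // stream the entities instead of loading them all up front, and clear the
    // identity map every $batchSize rows so memory stays roughly flat
    // (assumes `use JMS\Serializer\SerializationContext;` as in your snippet)
    $batchSize = 500;
    $i = 0;

    $query = $this->em->createQuery('SELECT p FROM App\Entity\Product p');

    foreach ($query->toIterable() as $entity) {
        $data = $this->serializer->serialize(
            $entity,
            'json',
            SerializationContext::create()->setGroups($groups)
        );
        $documents[$index_name][] = new \Elastica\Document($entityToFind[$index_name][$entity->getId()], $data);

        if ((++$i % $batchSize) === 0) {
            $this->em->clear(); // detach everything hydrated so far
        }
    }

Note that clearing the entity manager detaches everything, so anything you still need afterwards (like the $entityToFind lookup) has to be plain data collected before the clear().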

goetas avatar Jul 22 '19 14:07 goetas

@goetas We're building a simple DTO for serialization to XML, which has an array with about 4k objects in it, so no Doctrine involved. Serialization takes 2 minutes, and it doesn't seem to scale linearly. I have also attached a flamegraph produced by Xdebug (this was for 2k or 4k products, not sure anymore):

[attached flamegraph image: out3]

Products    Serialize (s)    Serialize (MiB)
520         0.8              68
976         2.5              76
1964        13               94
3961        115              131

edit: The structure of the DTO also affects performance, of course: the number of properties and the nesting depth. I'm not sure how much I'm allowed to share here, but the array items have a handful of properties each, and the depth (which you can derive from the flamegraph) is about 4 or 5 levels.
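
For reference, a minimal standalone script along the lines of these measurements; the Product class here is a made-up stand-in, not our real DTO, and the numbers above came from the real model rather than from this sketch:

    <?php
    // benchmark sketch: serialize arrays of increasing size and print the
    // elapsed time, to see whether the cost grows linearly or faster
    require 'vendor/autoload.php';

    use JMS\Serializer\SerializerBuilder;

    final class Product
    {
        public function __construct(
            public string $name,
            public float $price,
        ) {
        }
    }

    $serializer = SerializerBuilder::create()->build();

    foreach ([500, 1000, 2000, 4000] as $count) {
        $items = [];
        for ($i = 0; $i < $count; $i++) {
            $items[] = new Product("product-$i", $i * 1.5);
        }

        $start = microtime(true);
        $serializer->serialize($items, 'xml');
        printf("%d items: %.2fs\n", $count, microtime(true) - $start);
    }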

afraca avatar Apr 19 '24 08:04 afraca

@afraca - in general that seems like a pretty big data set 😅 If you find any possible improvements, feel free to create an MR for them. From my side: please check whether you can disable some features - for example remove event listeners you are not using, exclusion strategies, etc. Registering handlers to serialise some of the classes yourself might also improve performance.
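
For example, a rough sketch of registering such a handler with the standalone builder (with the bundle you would register a subscribing handler service instead); ProductDto and its fields are placeholders, and I've used the json format just to keep the example short:

    <?php
    use JMS\Serializer\GraphNavigatorInterface;
    use JMS\Serializer\Handler\HandlerRegistry;
    use JMS\Serializer\SerializerBuilder;
    use JMS\Serializer\Visitor\SerializationVisitorInterface;

    // hypothetical hot DTO standing in for whatever class dominates the profile
    final class ProductDto
    {
        public function __construct(public string $name, public float $price)
        {
        }
    }

    $serializer = SerializerBuilder::create()
        ->addDefaultHandlers()
        ->configureHandlers(function (HandlerRegistry $registry) {
            // emit the class's data directly instead of letting the serializer
            // walk its metadata property by property
            $registry->registerHandler(
                GraphNavigatorInterface::DIRECTION_SERIALIZATION,
                ProductDto::class,
                'json',
                static function (SerializationVisitorInterface $visitor, ProductDto $dto, array $type) {
                    return ['name' => $dto->name, 'price' => $dto->price];
                }
            );
        })
        ->build();

    echo $serializer->serialize([new ProductDto('example', 9.99)], 'json');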

scyzoryck avatar Apr 19 '24 13:04 scyzoryck

Hey @scyzoryck, thanks for replying! I only now realized this is the bundle repository, not the actual serializer GitHub repo. Sorry about that! If you want, I can open a bug there, or we can continue here.

I find the quadratic behaviour the most interesting part. If it grew linearly it would be fine: if 100 products take 1 second and 1k products take 10 seconds, that's fine with me, since I can schedule a job in a queue and anything under 10 minutes is acceptable. But it grows too quickly (for all our products it would be more than a day...), which implies something is being scanned too often somewhere.

I have tried commenting out all kinds of low-hanging fruit in \JMS\Serializer\GraphNavigator\SerializationGraphNavigator::accept:

  • if ($this->dispatcher->hasListeners('serializer.pre_serialize...
  • if (null !== $handler = $this->handlerRegistry->getHandle...
  • if ($metadata->usingExpression && null === $this->expressionExclusionStrategy) {
  • if (null !== $this->exclusionStrategy && $this->exclusionStrategy->shouldSkipClas...
  • if (null !== $this->expressionExclusionStrategy && $this->expressionExclusionStrategy...
  • foreach ($metadata->preSerializeMethods as $method) {
  • $this->afterVisitingObject($metadata, $data, $type);

Unfortunately, no real results. I know exception handling can sometimes slow things down quite a bit, and I saw the serializer library also uses exceptions to communicate "normal" control flow, but that did not get me anywhere either.

One thing currently on my mind is the foreach in \JMS\Serializer\XmlSerializationVisitor::visitArray. Maybe the repeated appending of child nodes is what slows things down so much.

afraca avatar Apr 19 '24 13:04 afraca

I suspect the 'JMS\Serializer\XmlSerializationVisitor' class. The serializer package has some performance tests - looking at them, XML is 50% slower than JSON for the same data set.

Please also make sure that you use the latest serializer package. Last year I merged a few improvements to memory usage and performance.

If you are going to work with big data sets, I'm not sure the serializer is the best choice. I would check out the flow-php library, which offers the ETL pattern.

scyzoryck avatar Apr 19 '24 14:04 scyzoryck