selimelawwa
First implementation of the correct_mistakes and correct_spacing methods. Added unit tests for both. Implemented using [symspellpy](https://github.com/mammothb/symspellpy). Closes #14.
**Overview** We need a method that takes in a pd.Series of text data and can summarize it and identify its topics, important entities, and figures. **Approach** Research deep...
Initial commit on the blog for handling PDF files. Please check and let me know your comments. Remaining: - Example with: [pdfminer](https://github.com/pdfminer/pdfminer.six) - Example with: [Apache Tika](https://github.com/chrismattmann/tika-python) - Conclusion - Maybe...
PDFs, PowerPoint presentations, and other unstructured text contain very valuable data that can be used for analysis. There are many tools providing these features. It would be nice if we...
Temp files created by GoogleHadoopSyncableOutputStream are not deleted after output stream is closed
I am using the Hadoop GCS connector to read/write files through the Hadoop filesystem, and there seems to be an issue with GoogleHadoopSyncableOutputStream: temp files are not deleted. Is...
In my Java application I have a file-system implementation in which my file class is a wrapper around Hadoop filesystem methods. I am upgrading from [hadoop3-1.9.17](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop3-1.9.17) to [hadoop3-2.2.8](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop3-2.2.8)...
When upgrading from hadoop3-1.9.17 to hadoop3-2.2.8 (using the shaded jar of the new version), I saw a performance degradation that almost doubled the run time of my tests. I also created this...