TorCrawl.py File name mismatch when using the Extract option

Open Alessi0X opened this issue 1 year ago • 0 comments

Describe the bug

When using the extract option (i.e., -e), there is a file name mismatch: the software expects to read from a file called links.txt, but the crawler writes a file named <date>_links.txt.

To Reproduce

Run one of the examples from the homepage, slightly modified:

python3 torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 0 -e -w

The output is:

## Your IP: A.B.C.D.
## URL: http://www.github.com/
## Folder created: www.github.com
## Crawler started from http://www.github.com/ with 2 depth crawl, and 0 second(s) delay.
## Step 1 completed with: 40 result(s)
## Step 2 completed with: 857 result(s)
## File created on /Users/user/TorCrawl.py/www.github.com/links.txt
Error: [Errno 2] No such file or directory: 'www.github.com/links.txt'
## Can't open: www.github.com/links.txt
Traceback (most recent call last):
  File "/Users/user/TorCrawl.py/torcrawl.py", line 210, in <module>
    main()
  File "/Users/user/TorCrawl.py/torcrawl.py", line 199, in main
    extractor(
  File "/Users/user/TorCrawl.py/modules/extractor.py", line 206, in extractor
    cinex(input_file, out_path, selection_yara)
  File "/Users/user/TorCrawl.py/modules/extractor.py", line 72, in cinex
    for line in file:
TypeError: 'type' object is not iterable

Indeed, the newly created www.github.com folder contains a file called 20240626_links.txt rather than links.txt.

Expected behavior

The extractor should read the <date>_links.txt file that the crawler wrote, and the TypeError should not appear.
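
Independently of the file name, the log above shows the run continuing after the failed open and only crashing later with an unrelated TypeError. A minimal sketch of a reader that fails early with a clear message instead is given here. Note that read_links is a hypothetical stand-in written for this issue, not the actual cinex function in modules/extractor.py, whose internals are not shown here.

```python
import os


def read_links(input_file):
    """Return the non-empty, stripped lines of input_file.

    Hypothetical helper (not part of TorCrawl.py). Raises
    FileNotFoundError right away if the file is missing, so the caller
    never iterates over something that is not an open file.
    """
    if not os.path.isfile(input_file):
        raise FileNotFoundError(f"Links file not found: {input_file}")
    with open(input_file, "r", encoding="utf-8") as handle:
        # Skip blank lines and drop trailing newlines.
        return [line.strip() for line in handle if line.strip()]
```

With a reader like this, a wrong path would surface as a single FileNotFoundError naming the missing file, which would have pointed directly at the links.txt mismatch.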

Desktop (please complete the following information):

OS: macOS 14.5
Python Version: 3.12.4

Fix

The fix is straightforward. In torcrawl.py, the block

        if args.extract:
            input_file = out_path + "/links.txt"
            extractor(
                website, args.crawl, output_file, input_file, out_path, selection_yara
            )

should be replaced with

        if args.extract:
            input_file = out_path + "/" + now + "_links.txt"
            extractor(
                website, args.crawl, output_file, input_file, out_path, selection_yara
            )
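
As a sketch of the same idea in isolation: the helper below builds the dated path in one place, so the code that writes the file and the code that reads it cannot drift apart. It assumes that now is the crawl date formatted as YYYYMMDD, which matches the 20240626_links.txt file above but is an assumption about how torcrawl.py defines it. The name links_file_path is hypothetical and not part of TorCrawl.py.

```python
import os
from datetime import datetime


def links_file_path(out_path, now=None):
    """Return the path of the dated links file inside out_path.

    Hypothetical helper (not part of TorCrawl.py). Assumes the crawler
    names its output <YYYYMMDD>_links.txt.
    """
    if now is None:
        # Assumed date format, inferred from 20240626_links.txt.
        now = datetime.now().strftime("%Y%m%d")
    return os.path.join(out_path, now + "_links.txt")


# Example: the path the extractor should read for the run in this issue.
print(links_file_path("www.github.com", "20240626"))
```

Passing the same now value to both the crawler and the extractor also avoids a mismatch if a run happens to cross midnight.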

Jun 26 '24 13:06 Alessi0X
