zero-epwing When run on kenkyuusha certain headers are incomplete

Open rtega opened this issue 5 years ago • 8 comments

The heading of たしなむ is "heading": "嗜む" while it should be "たしなむ【嗜む】"

Jul 20 '18 09:07 rtega

I added the following lines at line 143 of book.c:

    if(strstr(result,"嗜む")) {
        printf("boef: %s %i %i\n",result,position->page,position->offset);
    }

which yields the following result:

    boef: tashinamu ＜たしなむ【嗜む】＞ 30827 984
    boef: たしなむ【嗜む】 <..> 138094 1506

    boef: たしなむ【嗜む】 33548 130
    boef: たしなむ【嗜む】 <..> 138094 1506

    boef: 嗜む 38028 1326
    boef: たしなむ【嗜む】 <..> 138094 1506

Basically, what's happening is that there are three headers in the dictionary which all refer to the same article. Only the last header is exported.

Jul 20 '18 18:07 rtega

Basically, things go wrong in `book_undupe(book)`. We need to be smarter about what we are removing.

Jul 20 '18 18:07 rtega

I would propose keeping the heading with the largest content when removing duplicates in `book_undupe(book)`. I don't understand your code at first glance. Could you have a look at it?

Jul 20 '18 19:07 rtega

I changed the undupe code with this quicksort and removeduplicates. The resulting file is a bit smaller but it seems to work as it should.

```c
void swap(Book_Entry* a, Book_Entry* b)
{
    Book_Entry t = *a;
    *a = *b;
    *b = t;
}

int partition_entries(Book_Entry arr[], int low, int high)
{
    Book_Entry * pivot = &arr[high]; // pivot
    int i = (low - 1); // Index of smaller element

    for (int j = low; j <= high- 1; j++)
    {
        // If current element is smaller than or
        // equal to pivot
        if (arr[j].text.page < pivot->text.page)
        {
            i++;    // increment index of smaller element
            swap(&arr[i], &arr[j]);
        }
        if(arr[j].text.page == pivot->text.page)
        {
            if(arr[j].text.offset < pivot->text.offset)
            {
                i++;
                swap(&arr[i],&arr[j]);
                if(arr[j].text.offset == pivot->text.offset)
                {
                    if(strlen(arr[j].heading.text) <= strlen(pivot->heading.text))
                    {
                        i++;
                        swap(&arr[i],&arr[j]);
                    }
                }
            }
        }
    }
    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}

/* The main function that implements QuickSort
   arr[] --> Array to be sorted,
   low --> Starting index,
   high --> Ending index */
void quickSort_entries(Book_Entry arr[], int low, int high)
{
    if (low < high)
    {
        /* pi is partitioning index, arr[p] is now
           at right place */
        int pi = partition_entries(arr, low, high);

        // Separately sort elements before
        // partition and after partition
        quickSort_entries(arr, low, pi - 1);
        quickSort_entries(arr, pi + 1, high);
    }
}

int removeDuplicates_subbook(Book_Subbook* subbook)
{
    int n = subbook->entry_count;
    Book_Entry * arr = subbook->entries;

    // Return, if array is empty
    // or contains a single element
    if (n==0 || n==1)
        return n;

    Book_Entry * temp = malloc(n*sizeof(Book_Entry));

    // Start traversing elements
    int j = 0;
    for (int i=0; i<n-1; i++)

        // If current element is not equal
        // to next element then store that
        // current element
        if ((arr[i].text.page != arr[i+1].text.page) || (arr[i].text.offset != arr[i+1].text.offset))
            temp[j++] = arr[i];

    // Store the last element as whether
    // it is unique or repeated, it hasn't
    // stored previously
    temp[j++] = arr[n-1];

    // Modify original array
    for (int i=0; i<j; i++)
        arr[i] = temp[i];

    subbook->entry_count = j;
    free(temp);
    return j;
}

static void subbook_undupe(Book_Subbook* subbook)
{
    quickSort_entries(subbook->entries,0,subbook->entry_count -1);
    removeDuplicates_subbook(subbook);
}
```

Jul 21 '18 06:07 rtega

It crashes on gakken though.

Jul 21 '18 07:07 rtega

And doesn't work as it should. Working on an updated version.

Jul 21 '18 18:07 rtega

I think the easiest fix is just to check lengths when looking for dupes. If there is a dupe with a longer header length, swap it with the current entry and delete the dupe. You shouldn't have to sort anything.
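The length check described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual zero-epwing code: `Sketch_Entry` and `sketch_undupe` are hypothetical names, the struct is a simplified stand-in for the real entry type, and any freeing of a discarded heading's memory is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the real entry type. */
typedef struct {
    const char* heading; /* heading text */
    int         page;    /* page of the article the heading refers to */
    int         offset;  /* offset of the article within that page */
} Sketch_Entry;

/* Removes entries that refer to the same (page, offset), keeping the one
 * with the longest heading. Returns the new entry count. Nothing is
 * sorted; the cost is O(n^2) comparisons and entry order is not preserved. */
size_t sketch_undupe(Sketch_Entry* entries, size_t count) {
    assert(entries != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        size_t j = i + 1;
        while (j < count) {
            if (entries[i].page != entries[j].page ||
                entries[i].offset != entries[j].offset) {
                ++j;
                continue;
            }

            /* Same article: make sure slot i holds the longer heading. */
            if (strlen(entries[j].heading) > strlen(entries[i].heading)) {
                entries[i] = entries[j];
            }

            /* Delete the dupe by overwriting it with the last entry. */
            entries[j] = entries[count - 1];
            --count;
            /* j is not advanced: the entry moved into slot j is unchecked. */
        }
    }

    return count;
}
```

Note that `strlen` counts bytes rather than characters, so for UTF-8 headings "longest" here means longest in bytes.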

That being said, I'm not sure you actually want to use headers for anything. All of that information can be found in the entry text, and you are going to have to parse all of that stuff out with regex anyway. Honestly, if anything, this made me wonder if I should even be exporting the headers out of zero-epwing as AFAIK they are just some weird artifact of the EPWING format.

Jul 22 '18 00:07 FooSoft

For reference articles you don't have a header in the entry text itself:

    "heading": "¶両三日＜りょう２【両】＞",
    "text": "・両三日　two or three days; a couple of days\n"

I guess you really want to keep the info in the heading in that case. Take the example of 普通高等学校:

    "heading": "¶普通高等学校＜こうとうがっこう【高等学校】＞",
    "text": "普通高等学校　a general [an ordinary, an academic] high school.\nこうとうかん【高等官】 {{w_46695}}(k{{n_41528}}t{{n_41528}}kan)\n"

The heading is referring to 高等学校 while the text is referring to 高等官. You want to keep the info in the heading I think.

Looking at your code to remove dupes, I don't see how you can get at the entry you are comparing from a Page pointer alone.

Jul 22 '18 19:07 rtega
