
/jp/ - Otaku Culture

>> No.17549462

I just added OCR output repair logic to Spark Reader.

I don't really want to post about this because I'm probably going to get called a shill and hurt Spark Reader's reputation (if it has one), but I figure it might help someone more than it hurts. Correcting OCR output manually is really fucking annoying. I see people complain all the time that OCR tools like capture2text are bad, but if OCR worked better it would be easier for beginners to read manga that has no furigana.

This takes advantage of the fact that SR has a sort-of kind-of intelligent parser (better than chiitrans and kanji.moe, at least) to look for characters that are problematic for OCR and try to replace them with similar-looking characters that give the sentence a better-seeming parse. The logic is really stupid but it works well for the worst cases. This makes capture2text usable, basically, and it's probably one of the few uses of parsing that aren't bad for learning.
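Rough idea of the approach described above, as a minimal Python sketch. This is not Spark Reader's actual code (SR is written in Java and its parser is far more involved). The `CONFUSABLE` table, the `TOY_WORDS` word list, and the `parse_cost` scoring are all made up for illustration: a real version would ask the actual parser how good each candidate line looks.

```python
from itertools import product

# Hypothetical confusion table: small kana that OCR tends to emit
# in place of their full-size lookalikes.
CONFUSABLE = {"っ": "つ", "ゃ": "や", "ゅ": "ゆ", "ょ": "よ"}

# Toy stand-in for a dictionary-backed parser (assumption, not SR's data).
TOY_WORDS = {"つれて", "って", "やろ", "手", "を", "つかんで", "て", "から", "よ"}
UNKNOWN_COST = 3  # penalty for a character no known word covers


def parse_cost(text, words=TOY_WORDS):
    """Cheapest segmentation of text: known word = 1, unknown char = 3."""
    longest = max(len(w) for w in words)
    best = [0] + [float("inf")] * len(text)
    for end in range(1, len(text) + 1):
        # Fallback: treat the last character as an unknown token.
        best[end] = best[end - 1] + UNKNOWN_COST
        for start in range(max(0, end - longest), end):
            if text[start:end] in words:
                best[end] = min(best[end], best[start] + 1)
    return best[-1]


def candidates(line):
    """Yield every variant of line with confusable characters swapped or kept."""
    options = [(ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,) for ch in line]
    for combo in product(*options):
        yield "".join(combo)


def repair_line(line):
    """Pick the variant with the lowest parse cost; ties keep the original."""
    return min(candidates(line), key=parse_cost)


if __name__ == "__main__":
    print(repair_line("っれてってゃろ"))
```

The number of candidates doubles with every confusable character in the line, so this brute-force version only makes sense for short lines like manga speech bubbles. Note that a legitimate small っ (as in って) survives because swapping it makes the parse worse, not better.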

Before:

>元の場所まで
>っれてってゃろ

After:

>元の場所まで
>つれてってやろ

Before:

>大丈夫`おれが
>手をっかんでて
>ゃろからょ!

After:

>大丈夫`おれが
>手をつかんでて
>やろからよ!

(It doesn't delete stray nonsense characters like `)

It messes up sometimes, so it runs on a per-line basis, not on everything SR has loaded on screen.

Before:

>ゅめゆめ
>藁悟しておく
>ように、 ニ ヤ

After:

>ゆめゆめ
>藁悟レておく
>ように、ニャ

Here you would run it on the first and third lines and not the second one. If you use capture2text, change capture2text's settings so that it doesn't remove newlines, otherwise there are no separate lines to pick from.
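Why per-line matters, as a small Python sketch. Both function names are hypothetical, and `toy_repair` is hard-coded to mimic the output shown above (including the し to レ misfire), so it stands in for the real repair logic rather than reproducing it.

```python
def toy_repair(line):
    # Stand-in repair (assumption): fixes small ゅ but also misfires on し,
    # like the second line of the example above.
    return line.translate(str.maketrans({"ゅ": "ゆ", "し": "レ"}))


def repair_selected(text, selected, repair=toy_repair):
    """Apply repair only to lines whose 0-based index is in selected."""
    lines = text.split("\n")  # relies on the OCR tool keeping newlines
    return "\n".join(
        repair(line) if index in selected else line
        for index, line in enumerate(lines)
    )


if __name__ == "__main__":
    ocr_output = "ゅめゆめ\n藁悟しておく"
    # Repair only the first line; the second is left as the OCR produced it.
    print(repair_selected(ocr_output, {0}))
```

If the newlines were stripped, the whole capture would be one line and the misfire on し could not be avoided without also skipping the good fixes.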

It's in the right click menu for lines of text, right under "I know this word". https://github.com/wareya/Spark-Reader/releases/tag/rollingtestrelease

The examples are from the first few pages of 夢喰いメリー.
