Algorithm for fixing spaceless text

#1

I’m looking for code that will take a string like

theevidencewascreatedandinlightofwhichitshouldbeevaluated

and parse it into words (with the expectation that it won’t be perfect, but still faster than doing it manually). I’ve had little luck finding existing code to handle this, mostly because it’s hard to find search terms that accurately describe the problem.

Does anyone have code or a source they’d recommend? JS preferred since I will probably be implementing it in Copied as a text transformation.

#2

There’s a Python-based solution, ‘wordninja’, that does this sort of thing.

You could potentially port that algorithm to JavaScript if you have enough knowledge of both languages.
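For anyone considering a port, here is a minimal sketch of the general technique: dynamic programming over a frequency-ranked word list, where common words are cheap and rare words are expensive. This is not wordninja's actual code. The tiny vocabulary, the names `WORDS`, `COST`, `UNKNOWN_COST` and `split_words`, and the penalty value are all assumptions made for illustration; a real word list would hold tens of thousands of entries.

```python
import math

# Hypothetical toy vocabulary, ordered from most to least frequent.
WORDS = ["the", "of", "and", "in", "it", "be", "was", "which", "should",
         "light", "created", "evidence", "evaluated"]

# Zipf-style cost: the word at rank r costs log((r + 1) * log(N)),
# so frequent words are cheap and rare words are expensive.
COST = {w: math.log((i + 1) * math.log(len(WORDS)))
        for i, w in enumerate(WORDS)}
MAX_LEN = max(len(w) for w in WORDS)
UNKNOWN_COST = 1e9  # assumed heavy penalty for a character no word covers


def split_words(text):
    """Return the lowest-cost segmentation of `text` as a list of words."""
    text = text.lower()
    # best[i] = (total cost, length of last word) for the first i characters
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for k in range(1, min(i, MAX_LEN) + 1):
            piece = text[i - k:i]
            if piece in COST:
                cost = COST[piece]
            elif k == 1:
                cost = UNKNOWN_COST  # fall back to a single unknown character
            else:
                continue
            candidates.append((best[i - k][0] + cost, k))
        best.append(min(candidates))

    # Walk backwards through the table to recover the chosen words.
    words = []
    i = len(text)
    while i > 0:
        k = best[i][1]
        words.append(text[i - k:i])
        i -= k
    return list(reversed(words))


print(" ".join(split_words(
    "theevidencewascreatedandinlightofwhichitshouldbeevaluated")))
# → the evidence was created and in light of which it should be evaluated
```

The logic is only two loops and a lookup table, so it should translate to JavaScript fairly directly. The quality of the result depends almost entirely on the word list.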

#3

Thanks, will take a look. I may just stick with Python and see if I can add it to an Alfred workflow easily, as I don’t urgently need a cross-platform solution.

#4

I’m curious: what causes the text to have no spaces? I’d want to go to the source and try to fix that.

#5

What causes the text to have no spaces?

A journal publisher that produces badly crafted PDFs.

#6

I ended up using wordninja (which has some refinements over the original code posted on Stack Overflow) and added some of the code here

to better handle punctuation.
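As a rough illustration of what punctuation handling can look like (a sketch under my own assumptions, not the linked code): split the input into runs of letters and runs of everything else, segment only the letter runs, and put a space back after sentence punctuation. The function name `restore_spaces` and its `split_words` parameter are hypothetical; `split_words` stands for any callable that turns a spaceless string into a list of words.

```python
import re


def restore_spaces(text, split_words):
    """Segment letter runs with `split_words`, leaving punctuation in place."""
    # Alternate between runs of ASCII letters and runs of anything else.
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]+", text)
    out = []
    for tok in tokens:
        if tok[0].isalpha():
            # Letter run: hand it to the word segmenter.
            out.append(" ".join(split_words(tok)))
        else:
            # Non-letter run: keep it as-is, adding a space after
            # sentence punctuation so the next word does not run on.
            out.append(tok)
            if tok[-1] in ",.;:!?":
                out.append(" ")
    return "".join(out).rstrip()
```

This is deliberately crude: it treats digits, brackets and apostrophes as opaque runs, so it needs tuning for whatever the PDFs actually contain.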

#7

According to the Cheeseman/Soghoian book, AppleScript can also end up confecting text without spaces.

#8

At this point I’ve built what I need in Python which I can run in Alfred on the Mac or Pythonista on iOS…but I’d love to see the AppleScript solution if it’s available.