Need advice parsing, parsing, parsing

iPersuade · January 7, 2020, 9:14pm

My mother spent the better part of the past few years collecting and documenting all our family recipes and put them in a Microsoft Word document that she’s been printing and binding for the family. I wanted to take those recipes and convert them into YAML and import them into Paprika for her. I want to automate the process, but I’m not sure where to begin. All the recipes list the ingredients first and then the instructions (nothing surprising about this). There is no demarcation (e.g., a heading) between ingredients and instructions. So, how I program something to recognize the dividing line between ingredient and instruction is my main mystery. Same with the dividing line between the last recipe’s instructions and the next recipe’s title, although that might be a little easier to solve. I’d like to do it with as little manual labor (e.g., prep work in the Word document) as possible, even though that may mean more work on the “solution.”

This seems like it could be a good project for Awk or Sed or maybe a Perl program. (I’m tool agnostic, and if I could do it on Shortcuts in iOS that would be fun.) If anybody has advice on how I might accomplish this, I’d be most appreciative. I don’t need anybody to do it for me; I’m looking at this as an opportunity to develop a skill. But if this wheel has already been invented, I’m happy to use what’s already out there.

sylumer · January 7, 2020, 9:28pm

If there’s no reliable separator or delimiter, how do you know where one section/recipe starts, and where one ends? Is it something you could translate programmatically, or is this more a machine learning sort of question?

It sounds like the first step might be to get things out of Word and into some sort of text format, which should be as easy as copy and paste. After that though it’s simply hard to guess how the data could be processed.

If it is just plain text, then pretty much anything you can read a file into, pattern match on, and write a file out would be a reasonable starting point. The choice is probably more down to personal preference given that it is text processing. The only thing that sounds particularly challenging is that there may be no way to distinguish blocks of data … in which case, the best approach would probably be to add some in manually.

roosterboy · January 7, 2020, 10:31pm

If these are .docx Word files, maybe you could use pandoc to convert them into a different format that would be easier to convert to YAML.

If they are .doc files, I don’t think pandoc will work.
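For anyone who wants to script that step, here is a minimal sketch, assuming pandoc is installed and on the PATH. The file names and the two helper functions are made up for illustration. `--wrap=none` asks pandoc not to re-wrap paragraphs, which matters if a later step looks at line length.

```python
import subprocess


def build_pandoc_command(src, dst):
    """Return the pandoc argument list for a .docx -> plain text conversion.

    -t plain      write plain text
    --wrap=none   keep each paragraph on one line instead of re-wrapping it
    -o dst        write the result to dst
    """
    return ["pandoc", src, "-t", "plain", "--wrap=none", "-o", dst]


def convert_docx_to_text(src, dst):
    """Run pandoc; raises CalledProcessError if pandoc reports a failure."""
    subprocess.run(build_pandoc_command(src, dst), check=True)


# Hypothetical usage (file names are placeholders):
# convert_docx_to_text("recipes.docx", "recipes.txt")
```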

iPersuade · January 7, 2020, 10:51pm

I think you are right: absent some kind of machine learning algorithm, I probably cannot programmatically distinguish between ingredients and instructions. On the other hand, the big distinguishing factor between ingredient and instruction is that instructions run longer than a line. Maybe I could treat any line longer than XX characters as an instruction. That might do the trick.
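A rough sketch of that line-length idea in Python, under a few assumptions: each paragraph of a recipe is already on its own line of plain text, the 60-character cutoff is a placeholder to tune against the real document, and `classify_lines` is just a name invented here. Since ingredients always come first, the sketch treats everything after the first long line as instructions, so a short step like "Serve warm." is not mistaken for an ingredient.

```python
def classify_lines(lines, max_ingredient_len=60):
    """Split one recipe's body lines into (ingredients, instructions).

    Heuristic only: a line longer than max_ingredient_len is taken as the
    start of the instructions, and every non-blank line after it is kept
    as an instruction too.
    """
    ingredients, instructions = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if instructions or len(line) > max_ingredient_len:
            instructions.append(line)
        else:
            ingredients.append(line)
    return ingredients, instructions


# Made-up sample recipe body, for illustration only
body = [
    "2 cups flour",
    "1 tsp salt",
    "",
    "Mix the flour and salt together in a large bowl until they are "
    "thoroughly combined and no lumps remain.",
    "Serve warm.",
]
ingredients, instructions = classify_lines(body)
```

This will misfire on a recipe with an unusually long ingredient line or with only short instruction lines, so the output would still need a quick read-through.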

iPersuade · January 7, 2020, 10:52pm

Oooh, very good call.

sylumer · January 7, 2020, 11:07pm

Except that it is just one Word document according to the original post. Save as text or copy and paste is probably fine for one file. Automating the conversion would be overkill for a one shot and done.

cpac · January 9, 2020, 6:23pm

Have you tried just copying and pasting the text into Paprika and seeing how well Paprika is able to parse it? (Paprika does a great job at parsing recipes on random web pages, so it might well be able to handle the text from the .doc files.)

iPersuade · January 9, 2020, 6:59pm

I considered that, but there are too many to go one-by-one. Unless you know of a way to do it as a batch.

dfay · January 9, 2020, 8:08pm

I’d drag the Word doc into Scrivener then manually Split Document at Selection… then export. It’s quick enough that you could probably get through 100 recipes in an hour or so (and way faster than copying and pasting into new documents), and it could give you all of them in separate documents in (within reason) your choice of format.

iPersuade · January 9, 2020, 9:11pm

That’s a fantastic idea, because I also planned to put the book into Scrivener anyway. That way, I can help her do other things with it as a book.