I would like to know if there is a grammar rule, or set of rules, that defines whether a word is grammatically legal or not. I understand that words are given meaning by humans and that anyone can give meaning to anything, so I realize it is probably impossible to create a set of laws that absolutely defines the legality of a string of letters. Barring that extreme case, is there a practical, general set of such rules?
For example, I remember my grade 2 teacher saying that if a word does not contain at least one vowel, then it is not a legal word. Based on that principle, I might claim that 'lkjsdlf' is not a legal word.
Is there a generally accepted set of grammatical parameters that define whether a word is legal or not (apart from looking it up in a dictionary)?
The reason I'm asking this is to determine if it's possible to programmatically validate a word (rather than using a list of 100,000+ words from a dictionary). The goal is to categorize 'lkjsdlf' and 'apple' as 'invalid' and 'valid' respectively.
Answer
This is not so much a grammar rule, but people have analysed the frequency of letter combinations of various lengths in samples of English text, and then used those frequencies to randomly generate a kind of pseudo-English.
I'm not sure where I originally saw this (I think it was a little more scholarly), but here's an example of someone's generated pseudo-English: http://ibbly.com/Pseudo-words.html
and here's someone else's attempt: http://www.fourteenminutes.com/fun/words/
But you could use the same frequency data to quantify how typically "English" a word is, i.e. how probable it is as a word in English.
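To make that idea concrete, here is a minimal sketch of such a score, not a definitive implementation. It assumes a tiny hand-picked word list in place of real frequency data, uses letter pairs (bigrams) only, and applies add-one smoothing so unseen pairs get a small non-zero probability. The names `SAMPLE_WORDS`, `bigrams`, `train` and `score` are made up for illustration, and a usable model would be trained on a large sample of English text.

```python
import math
from collections import Counter

# Toy training sample (an assumption for illustration only);
# a real model would count letter pairs over a large English corpus.
SAMPLE_WORDS = [
    "apple", "people", "simple", "letter", "sample", "table",
    "paper", "little", "please", "place", "example", "purple",
]


def bigrams(word):
    """Return the letter pairs of a word, with ^ and $ marking start and end."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(words):
    """Count each letter pair, and how often each symbol starts a pair."""
    pair_counts = Counter()
    first_counts = Counter()
    for word in words:
        for pair in bigrams(word):
            pair_counts[pair] += 1
            first_counts[pair[0]] += 1
    return pair_counts, first_counts


def score(word, model, next_symbols=27):
    """Average log-probability per letter pair; higher means more 'English-like'.

    next_symbols is 26 letters plus the end marker, used for add-one smoothing.
    """
    pair_counts, first_counts = model
    pairs = bigrams(word)
    total = 0.0
    for pair in pairs:
        # Smoothed estimate of P(second symbol | first symbol).
        p = (pair_counts[pair] + 1) / (first_counts[pair[0]] + next_symbols)
        total += math.log(p)
    return total / len(pairs)


model = train(SAMPLE_WORDS)
for candidate in ("apple", "lkjsdlf"):
    print(candidate, round(score(candidate, model), 2))
```

With this toy sample, 'apple' scores higher than 'lkjsdlf'. To turn the score into a 'valid'/'invalid' label you would still need to pick a cut-off, which depends on the training text and is a judgment call rather than a rule.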
Of course, there's more to words than just an unstructured letter sequence, as @curiousdannii has pointed out, so further considerations are possible in this kind of analysis.