Proper regex to tokenize a sentence with a leading dash -
Here is the regex I'm using for my tokenizer: [^a-zA-Z\'-]+
However, if I want to apply it to a sentence like this: -this is a test. -yes, it's a test for self-consciousness
The result is ['-this', 'is', 'a', 'test', '-yes', "it's", 'a', 'test', 'for', 'self-consciousness']
There is a leading - ahead of "this" and "yes". Is there a way to eliminate the leading -? Maybe with some modification to the regex I'm using?
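You'd need to qualify the dash in the middle.
Since you are using a negated class to split the text up, you have to allow the wrong dashes (those not sitting between word characters) to be matched as separators.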
(?:[^a-zA-Z'-]|(?<![a-zA-Z'])-|-(?![a-zA-Z']))+
https://regex101.com/r/ql7lwq/1
Expanded:

 (?:
      [^a-zA-Z'-]      # Not one of these
   |                   # or,
      (?<!             # allow a dash if it is not preceded by one of the others
           [a-zA-Z']
      )
      -
   |                   # or,
      -                # allow a dash if it is not followed by one of the others
      (?!
           [a-zA-Z']
      )
 )+
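To see the difference between the two patterns, here is a minimal sketch in Python. It assumes the tokenizer is applied with `re.split`, which the question does not state explicitly; the `tokenize` helper and the constant names are hypothetical and only there for illustration.

```python
import re

# Pattern from the question: split on runs of anything that is not
# a letter, an apostrophe, or a dash.
ORIGINAL = re.compile(r"[^a-zA-Z'-]+")

# Pattern from the answer: also split on a dash that is not preceded,
# or not followed, by a letter or apostrophe.
FIXED = re.compile(r"(?:[^a-zA-Z'-]|(?<![a-zA-Z'])-|-(?![a-zA-Z']))+")


def tokenize(pattern, text):
    """Hypothetical helper: split on the pattern and drop empty strings."""
    return [tok for tok in pattern.split(text) if tok]


sentence = "-this is a test. -yes, it's a test for self-consciousness"

print(tokenize(ORIGINAL, sentence))
# ['-this', 'is', 'a', 'test', '-yes', "it's", 'a', 'test', 'for', 'self-consciousness']

print(tokenize(FIXED, sentence))
# ['this', 'is', 'a', 'test', 'yes', "it's", 'a', 'test', 'for', 'self-consciousness']
```

The empty-string filter matters with the second pattern: the leading dash is now a separator at position 0, so `re.split` returns an empty first element that would otherwise show up as a token.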