java - Encoding problems with Stanford NER. Which encoding should I use?
I'm having a hard time finding the right encoding for Portuguese so that the results appear properly. I used this command to tag a small sample with the model:
PS > java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile tweets.txt
As an example, one of the phrases is as follows:
meus pais na época em que eram casados e minha mãe era viva !. em praia de copacabana - posto 5 .
Note the special characters in the example, "época" and "mãe". This is how they turn out in the final result:
meus/o pais/o na/o época/o em/o que/o eram/o casados/o e/o minha/o mãe/o era/o viva/o !/o ./o em/o praia/b-location de/i-location copacabana/i-location -/o posto/b-location 5/i-location ./o
época = ├⌐poca
mãe = m├úe
This is not the result I expected.
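For reference, the garbled forms above are exactly what the UTF-8 bytes of "é" and "ã" look like when decoded as OEM code page 437. A minimal sketch that reproduces this, assuming the JDK in use ships the `IBM437` charset (the class and method names are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with a different charset.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // Assumption: the console interprets bytes as code page 437 ("IBM437").
        Charset oem = Charset.forName("IBM437");
        // Unicode escapes keep this source file independent of its own encoding.
        String epoca = misdecode("\u00e9poca", oem); // "época"
        String mae = misdecode("m\u00e3e", oem);     // "mãe"
        // Compare against the garbled forms seen in the tagger output.
        System.out.println(epoca.equals("\u251c\u2310poca")); // U+251C U+2310 + "poca"
        System.out.println(mae.equals("m\u251c\u00fae"));     // "m" + U+251C U+00FA + "e"
    }
}
```

If both comparisons hold, the bytes being written are valid UTF-8 and only the display step uses a different code page.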
The same thing could happen when I train the model, which is a serious concern. I tried using the -encoding flag with various options:
- utf-8
- iso-8859-15
- iso-8859-1
I have had no success with utf-8 or iso-8859-1; every time I get the same result as the one shown above. I made sure the file is encoded in UTF-8, and I also tried exporting the file again with UTF-8 encoding, with the same result.
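One way to double-check that the input really is UTF-8 is to decode it strictly and see whether any byte sequence is rejected. This is only a sketch: `Utf8Check` and `isValidUtf8` are hypothetical names, and `tweets.txt` is the file from the command above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the input file used with the tagger; pass another path to override.
        String path = args.length > 0 ? args[0] : "tweets.txt";
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        System.out.println(path + " is valid UTF-8: " + isValidUtf8(bytes));
    }
}
```

Pure ASCII text also passes this check, so it only tells something useful for files that contain accented characters, as this one does.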
I don't know if it has any influence, but I'm using PowerShell to run the commands.
What should I do to solve this problem?
Thanks in advance.